Thanks for highlighting this possible source of confusion. The ' - ' is intended as a literal minus sign, but it could be read as a spectrum with Longtermism tending left and Neartermism tending right, which seems to be how you (and possibly others) read it. I've adapted the labelling of the graph, which I hope clarifies this.
Good summary, and the table presenting the differing views succinctly is especially helpful. I'm looking forward to watching this debate; I wasn't aware it was out. It's a real shame that they had a glitch that might affect how much we can rely on the final scores to represent the audience's actual views.
We can get a better sense of the magnitude of the effect here with some further calculations. If we take all the people who have pre- and post-FTX satisfaction responses (n = 951), we see that 4% of them have a satisfaction score that went up, 53% stayed the same, and 43% went down. That's quite a striking negative impact. For those people whose scores went down, 67% had a reduction of only 1 point, 22% of 2 points, and then 7%, 3%, and 1% for -3, -4, and -5 points respectively.
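As a minimal sketch of this breakdown (the paired survey responses themselves aren't reproduced here, so the arrays below are illustrative):

```python
# A sketch of the change-score breakdown described above. `pre` and
# `post` stand in for the n = 951 paired satisfaction responses,
# which are not reproduced here.
from collections import Counter

def change_breakdown(pre, post):
    """Tally direction and magnitude of post-minus-pre changes."""
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    up = sum(d > 0 for d in diffs) / n
    same = sum(d == 0 for d in diffs) / n
    down = sum(d < 0 for d in diffs) / n
    # Among decreases, how big were the drops?
    drops = Counter(-d for d in diffs if d < 0)
    return up, same, down, drops

# Example with toy ratings:
up, same, down, drops = change_breakdown([7, 8, 6, 9], [7, 7, 6, 5])
print(up, same, down)  # 0.0 0.5 0.5
print(drops)           # Counter({1: 1, 4: 1})
```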
We can also try to translate this effect into some more commonly used effect size metrics. Firstly, we can utilise a nice summary effect size metric for these ratings known as probability of superiority (PSup), which makes relatively few assumptions about the data - mainly that higher ratings are higher and lower ratings are lower within the same respondent. This metric summarises the difference over time by taking the proportion of cases in which a score was higher pre-FTX (42.7%), assigning a 50% weight to cases in which the score was the same pre- and post-FTX (.5 * 53.2% = 26.6%), and adding these quantities together (69.3%). It can be read as an approximation of the proportion of people who would report being more satisfied before than after, in a forced choice between the two. If everyone was more satisfied before, PSup would be 100%; if everyone was more satisfied after, PSup would be 0%; and if people were just as likely to be more satisfied before as after, PSup would be 50%. In this case, we get a PSup of 69.3%, which corresponds to an effect size in standard deviation units (like Cohen's d) of approximately .7.
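For anyone wanting to reproduce the numbers, here is a hedged sketch; the conversion step assumes the common-language effect size relation PSup = Φ(d/√2), which matches the ~.7 figure above:

```python
# A sketch of the PSup calculation and its conversion to Cohen's d,
# assuming the common-language effect size relation PSup = Phi(d / sqrt(2)).
from statistics import NormalDist

def prob_superiority(pre, post):
    """Share of paired cases higher pre, with ties given half weight."""
    higher = sum(a > b for a, b in zip(pre, post))
    ties = sum(a == b for a, b in zip(pre, post))
    return (higher + 0.5 * ties) / len(pre)

def psup_to_d(psup):
    """Invert PSup = Phi(d / sqrt(2)) to recover an approximate d."""
    return NormalDist().inv_cdf(psup) * 2 ** 0.5

# Plugging in the reported proportions directly:
psup = 0.427 + 0.5 * 0.532        # = 0.693
print(round(psup_to_d(psup), 2))  # ~0.71, i.e. approximately .7
```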
We would encourage people not to just look up whether these are small or large effects in a table that says, e.g. on Wikipedia, that .7 falls in the 'medium' effect size bin. Think about how you would respond to this kind of question, what a difference of 1 or more points would mean to you, and what the proportions of people giving different responses might substantively mean to them. How best to interpret effect sizes varies greatly with context.
Thanks for the response and the links to these graphs. This is just a quick look and so could be wrong, but looking into some files from the World Values Survey, I find information which, if correct, would make me give this survey result no weight at all - not even 1% - in my consideration of whether we should be concerned about a country being annexed. The population of China is ~1.4 billion; the population of Taiwan is ~24 million. The sample size for the Chinese data seems to be about 2,300 people, and for Taiwan about 1,200. I tried to upload a screenshot, which I can't work out how to do, but the numbers are in the doc "WV6 Results By Country v20180912" on this page: https://www.worldvaluessurvey.org/WVSDocumentationWV6.jsp
I do not think we can have any faith that a sample of 2,300 people comes close to representing all the variation in factors relevant to happiness or satisfaction across the population of China. The ratio of population to respondents is over 600,000 to 1 - that ratio alone is larger than some estimates of the population of Oslo, Glasgow, or Rotterdam (https://worldpopulationreview.com/continents/europe/cities).
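The ratio arithmetic, for anyone checking (population figures are approximate):

```python
# Rough population-to-respondent ratios for the WVS samples cited above.
china_pop, china_n = 1_400_000_000, 2_300
taiwan_pop, taiwan_n = 24_000_000, 1_200
print(f"China: 1 respondent per {china_pop // china_n:,} people")    # ~608,695
print(f"Taiwan: 1 respondent per {taiwan_pop // taiwan_n:,} people") # 20,000
```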
I may be missing something or making some basic error there, but if it is roughly correct, then I would indeed call it silly to factor this survey result into deciding what our response should be to the annexation of Taiwan. I do not think such a question is in principle about life satisfaction/happiness, but even if it were, I would not use this information.
> How bad is authoritarianism anyways? China and Taiwan's life satisfaction isn't that different.
I'm not sure why this is deeply confusing. I don't think we should assess whether authoritarian regimes are bad based on measures of life satisfaction, and if one does want to do that, certainly not via a 1v1 comparison of just two countries.
Is the claim that they are not that different on this metric even true? Where is the source for this, and how many alternative sources or similar metrics are there? If it is true, are all the things that feed into people's responses to a life satisfaction survey the same in these different places (how confident are respondents that they can give their true opinions, and how low have their aspirations or capacity to contemplate a flourishing life become)? And are the measures representative of the actual population experience within those countries (what about the satisfaction of people in encampments in China that help sustain the regime and quash dissent)?
Even granting that the ratings really reflect the same processes in each country and that they are representative, Taiwan lives under threat of occupation and invasion, and there are many other differences between the two countries. The case is then just a confounded comparison of one country vs. one other, which is not an especially good test of whether the one variable chosen to define those countries makes a difference.
Given the differences in the questions, it doesn't seem correct to compare the raw probabilities across these - also, our question was specifically about extinction rather than just a catastrophe. That said, there may be some truth to this implying a difference between the public estimates and what the Metaculus estimates imply if we rank them: AI risk comes out top in the Metaculus ratings and bottom in the public, and climate change also shows a sizeable rank difference.
One wrinkle in taking the rankings like this is that people were only allowed to pick one item in our questions, so the rankings could differ if people actually rated each risk and we then ranked their ratings. This would be the case if, e.g., every other risk is more likely than AI to be someone's absolute top risk, but many people have AI as their second risk - which would imply a very high ordinal ranking that we can't see from the distribution of top picks alone. A toy example of this is sketched below.
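Here the rankings are invented purely for illustration, not taken from any survey:

```python
# A toy illustration of the wrinkle above: a risk can have the best
# average rank while never being anyone's single top pick.
risks = ["AI", "nuclear", "climate", "pandemic"]

# Each respondent's full ranking, best to worst; AI is always second.
respondents = [
    ["nuclear", "AI", "climate", "pandemic"],
    ["climate", "AI", "pandemic", "nuclear"],
    ["pandemic", "AI", "nuclear", "climate"],
]

top_picks = [r[0] for r in respondents]
print(top_picks.count("AI"))  # 0 -- AI never shows up in a top-pick tally

mean_rank = {risk: sum(r.index(risk) + 1 for r in respondents) / len(respondents)
             for risk in risks}
print(mean_rank)  # AI averages rank 2.0; every other risk averages ~2.67
```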
Great to hear, Jacob!