Disclaimer: I work for Metaculus.
You can now forecast on how much AI benchmark progress will continue to be underestimated by the Metaculus Community Prediction (CP) on this Metaculus question! Thanks @Javier Prieto for prompting us to think more about this and inspiring this question!
Predict a distribution with a mean of 0.5 if you expect the CP to be unbiased on these questions.
Here is a Colab Notebook to get you started with some simulations.
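As a toy version of what such a simulation might look like (my own sketch, much simpler than the notebook, and it assumes the meta-question resolves as something like the average CP quantile at which the underlying questions resolve; under a perfectly calibrated CP each such quantile would be uniform on [0, 1]):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: the meta-question resolves as the average CP quantile
# at which N underlying AI benchmark questions resolve. Under a perfectly
# calibrated (unbiased) CP, each quantile is Uniform(0, 1).
N = 30          # hypothetical number of underlying questions
n_sims = 100_000

means = rng.uniform(0.0, 1.0, size=(n_sims, N)).mean(axis=1)

print(means.mean())                       # ~0.5 under the calibration null
print(np.quantile(means, [0.05, 0.95]))   # spread expected from noise alone
```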
And don't forget to update your forecasts on the underlying AI benchmark progress questions if the CP on this one has a mean far away from 0.5!
Disclaimer: I work for Metaculus.
Thanks for carefully looking into this @Javier Prieto, this looks very interesting! I'm particularly intrigued by the different biases you identify for different categories, and I wonder how much weight you'd put on this being a statistical artefact vs a real, persistent bias that you would continue to worry about. Concretely, if we waited until, say, a comparable number of new AI benchmark progress questions had resolved, what would your P(Metaculus is underconfident on AI benchmark progress again) be, looking only at those new questions?
Some minor comments:
> About 70% of the predictions at question close had a positive log score, i.e. they were better than predicting a maximally uncertain uniform distribution over the relevant range (chance level).
I think the author knows what's going on here, but it may invite misunderstanding. By this notion of "being better than predicting a […] uniform distribution", even a perfect forecast of the sum of two independent dice is "better than predicting a uniform distribution" only 2 out of 3 times, i.e. less than 70% of the time! (The probabilities of the sums 2, 3, 4, 10, 11, and 12 are all smaller than 1/11, the uniform probability over the 11 possible outcomes, and together these sums occur a third of the time.)
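For concreteness, here is a quick check of that dice claim (my own sketch, not from the post):

```python
from fractions import Fraction
from math import log

# True distribution of the sum of two fair dice (11 possible outcomes, 2..12).
probs = {s: Fraction(0) for s in range(2, 13)}
for d1 in range(1, 7):
    for d2 in range(1, 7):
        probs[d1 + d2] += Fraction(1, 36)

uniform = Fraction(1, 11)  # the maximally uncertain forecast

# The perfect forecast's log score beats the uniform's exactly when the
# realised outcome has true probability > 1/11.
p_beat = sum(p for p in probs.values() if p > uniform)
print(p_beat, float(p_beat))  # 2/3 ~ 0.667, i.e. less than 70%

# Relative log score (perfect vs uniform) for the unlucky outcomes:
for s in (2, 3, 4, 10, 11, 12):
    print(s, round(log(probs[s] / uniform), 3))  # negative for all of these
```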
> The average log score at question close was 0.701 (Median: 0.868, IQR: [-0.165, 1.502][7]) compared to an average of 2.17 for all resolved continuous questions on Metaculus.
Given that quite a lot of these AI questions closed over a year before resolution, which is rather atypical for Metaculus, comparing log scores at question close seems a bit unfair. I think time-averaged scores would be more informative. (I reckon they'd produce a quantitatively different, albeit qualitatively similar picture.)
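Just to make "time-averaged" concrete, here is a minimal sketch of what I mean (my own toy version, not Metaculus's actual scoring rule):

```python
import numpy as np

def time_averaged_log_score(change_times, log_scores, t_close):
    """Time-average a step-wise constant log score over a question's open period.

    change_times -- times at which the CP changed, starting at the open time
                    (sorted, all < t_close)
    log_scores   -- log score of the CP that held from change_times[i]
                    until change_times[i + 1] (or until t_close for the last one)
    """
    edges = np.append(change_times, t_close)
    durations = np.diff(edges)
    return float(np.sum(durations * log_scores) / (t_close - change_times[0]))

# Hypothetical example: a CP that improves over a 100-day open period still
# gets most of its weight from the long early stretch.
print(time_averaged_log_score(
    change_times=np.array([0.0, 40.0, 80.0]),
    log_scores=np.array([-0.2, 0.6, 1.4]),
    t_close=100.0,
))
# -> (40*(-0.2) + 40*0.6 + 20*1.4) / 100 = 0.44, vs 1.4 at question close
```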
This also goes back to "Metaculus narrowly beats chance": we tried to argue here why we believe this isn't as narrow as others made it out to be (for reasonable definitions of "narrow").
You may want to have a look at our API!
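For example, something along these lines should get you the raw data (the endpoint and parameter names here are from memory, so treat them as assumptions and check the API docs for the current schema):

```python
import requests

# NB: endpoint path and query parameters may have changed; this is only
# meant to show the general idea of pulling questions programmatically.
resp = requests.get(
    "https://www.metaculus.com/api2/questions/",
    params={"search": "benchmark", "status": "resolved", "limit": 20},
)
resp.raise_for_status()
for q in resp.json()["results"]:
    print(q["id"], q["title"])
```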
As for the code: I wrote it in Julia as part of a much bigger, ongoing project, so it's a bit of a mess, i.e. there's lots of code that isn't relevant to this particular analysis. If you're interested, I could either send it to you directly or make it more public after cleaning it up a little.
[Disclaimer: I'm working for FutureSearch]
To add another perspective: reasoning helps with aggregating forecasts. Consider one of the motivating examples for extremising, where, IIRC, some US president is handed several (well-calibrated, say) estimates of around 70% for P(head of some terrorist organisation is in location X). If these estimates came from different, independent sources, the aggregate ought to be bigger than 70%, whereas if they are all based on the same few sources, 70% may be one's best guess.
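A toy version of that argument (my own sketch, using standard log-odds pooling under an independence assumption, not any particular extremising scheme or anything FutureSearch-specific):

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + np.exp(-x))

estimates = np.array([0.7, 0.7, 0.7])  # three well-calibrated ~70% estimates
prior = 0.5                            # assumed common prior

# If the estimates rest on *independent* evidence, their log-odds updates
# relative to the prior add up, pushing the aggregate above 70%:
independent = inv_logit(logit(prior) + np.sum(logit(estimates) - logit(prior)))

# If they all rest on the same few sources, there is nothing extra to add:
shared = estimates.mean()

print(round(float(independent), 2))  # ~0.93, well above 0.7
print(round(float(shared), 2))       # 0.70
```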
This is also something that a lot of forecasters may just do subconsciously when considering different points of view (which may be something as simple as different base rates or something as complicated as different AGI arrival models).
So from an engineering perspective there is a lot of value in providing rationales, even if they don't show up in the final forecasts.