You may want to have a look at our API!
As for the code, I wrote it in Julia as part of a much bigger, ongoing project, so it's a bit of a mess, i.e. there's lots of code that isn't relevant to this particular analysis. If you're interested, I could either send it to you directly or make it more public after cleaning it up a little.
Disclaimer: I work for Metaculus.
Thanks for carefully looking into this, @Javier Prieto; this looks very interesting! I'm particularly intrigued by the different biases you identify for different categories and wonder how much weight you'd put on this being a statistical artefact vs. a real, persistent bias that you would continue to worry about. Concretely, if we waited until, say, a comparable number of new AI benchmark progress questions had resolved, what would your P(Metaculus is underconfident on AI benchmark progress again) be? (Looking only at the new questions.)
Some minor comments:
I think the author knows what's going on here, but it may invite misunderstanding. This notion of "being better than predicting a […] uniform distribution" implies that a perfect forecast on the sum of two independent fair dice is "better than predicting a uniform distribution" only 2 out of 3 times, i.e. less than 70% of the time! (The probabilities of D_1 + D_2 = 2, 3, 4, 10, 11, or 12 are all smaller than 1/#{possible outcomes} = 1/11, and those sums come up a third of the time, so the perfect forecast loses to the uniform one on a third of rolls.)
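For concreteness, here's a quick numerical check of the dice example (a standalone sketch, not taken from the analysis code):

```julia
# Probability mass of the sum of two fair dice, P(D_1 + D_2 = s) for s = 2..12
p = [count(==(s), [i + j for i in 1:6, j in 1:6]) / 36 for s in 2:11+1]

uniform = 1 / length(p)  # 1/11 ≈ 0.091, the uniform forecast over the 11 outcomes
# The perfect forecast gets a better log score than the uniform one exactly when the
# realised sum has true probability above 1/11; how often does that happen?
sum(p[p .> uniform])     # 2/3 ≈ 0.667
```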
Given that quite a lot of these AI questions closed over a year before resolution, which is rather atypical for Metaculus, comparing log scores at question close seems a bit unfair. I think time-averaged scores would be more informative. (I reckon they'd produce a quantitatively different, albeit qualitatively similar picture.)
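To illustrate what I mean by time-averaged scores, here's a minimal sketch (hypothetical data layout and function, not Metaculus's actual scoring code): weight each forecast's log score by how long it was the standing forecast between question open and close.

```julia
# Minimal sketch: time-averaged log score for a binary question.
# `times` are the timestamps (e.g. days since open) of the forecasts `probs`,
# `close` is the close time, and `outcome` is the eventual resolution (true/false).
function time_averaged_log_score(times, probs, close, outcome)
    scores = [log(outcome ? p : 1 - p) for p in probs]  # log score of each forecast
    durations = diff(vcat(times, close))                # how long each forecast stood
    return sum(scores .* durations) / (close - first(times))
end

# Toy example: a question open for 100 days with three forecast updates.
time_averaged_log_score([0.0, 30.0, 80.0], [0.4, 0.6, 0.8], 100.0, true)
# ≈ -0.57, vs. log(0.8) ≈ -0.22 if we only scored the final forecast at close
```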
This also goes back to "Metaculus narrowly beats chance": we argued why we don't think it's as narrow as others have made it out to be (for reasonable definitions of "narrow") here.