Can't think of anything better than a t-test, but I'm open to suggestions.
If a forecaster is consistently off by something like 10 percentage points, I think that is a difference that matters. But even in that extreme scenario, where the (simulated) difference between two forecasters is in fact quite large, we have a hard time picking it up with standard significance tests.
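To make that concrete, here is a minimal sketch of the kind of simulation I mean (not the exact setup from the post): one forecaster predicts the true probability, the other is off by 10 percentage points on every question, and we run a paired t-test on the per-question Brier scores. The number of questions, the Brier scoring, and the noise structure are all assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_questions = 100  # assumed number of resolved binary questions

# True probabilities and resolved outcomes
true_p = rng.uniform(0.1, 0.9, n_questions)
outcomes = rng.binomial(1, true_p)

# Forecaster A predicts the true probability;
# forecaster B is consistently off by 10 percentage points
pred_a = true_p
pred_b = np.clip(true_p + rng.choice([-0.1, 0.1], n_questions), 0.01, 0.99)

brier_a = (pred_a - outcomes) ** 2
brier_b = (pred_b - outcomes) ** 2

# Paired t-test on per-question Brier scores
t_stat, p_value = stats.ttest_rel(brier_a, brier_b)
print(f"mean Brier A={brier_a.mean():.3f}, B={brier_b.mean():.3f}, p={p_value:.3f}")
```

In runs like this the expected gap in Brier scores (about 0.01) is small relative to the per-question noise, which is exactly the power problem described above.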
In principle, yes. In practice also usually yes, but the specifics depend on whether the average user who predicted on a question receives a positive number of points. So if you predicted very late and your points are close to zero, but the mean number of points that forecasters on that question received is positive, then you will end up with a negative update to your reputation score.
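In other words, the update is relative to the field on that question. A simplified sketch of that logic (not the exact scoring rule):

```python
# Simplified sketch of the relative update, not the exact scoring rule:
# your points on a question are compared to the average points earned
# by everyone who forecast that question.
def reputation_update(your_points: float, all_points: list[float]) -> float:
    mean_points = sum(all_points) / len(all_points)
    return your_points - mean_points

# A late forecaster with near-zero points on a question where the
# average forecaster earned positive points gets a negative update.
print(reputation_update(0.5, [0.5, 20.0, 35.0, 15.0]))  # -17.125
```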
Completely agree that a lot hinges on that reputation score. It seems to work decently for the Metaculus Prediction, but it would be good to see what the results look like for a different metric of past performance.
I slightly tend towards yes, but that's mere intuition. As someone on Twitter put it, "Metaculus has a more hardcore user base, because it's less fun". I find it plausible that the Metaculus user base and the Manifold user base differ. But I think higher trading volume would have helped.
For this particular analysis I'm not sure correcting for the number of forecasters would really be possible in a sound way. It would be great to get the MetaculusBot more active again to collect more data.
Good comment, thank you!