This is a special post for quick takes by Will Howard. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
Most deaths in war aren’t from gunshots
This is an edited version of a memo I shared within the online team at CEA. It’s about the forum, but you could also make it about other stuff. (Note: this is just my personal opinion.)
There's this stylised fact about war that almost none of the deaths are caused by gunshots, which is surprising given that, for the average soldier, war consists of walking around with a gun and occasionally pointing it at people. Whether or not this is actually true, the lesson people quoting this fact are trying to teach is that the possibility of something happening can have a big impact on the course of events, even if it very rarely actually happens.
[warning: analogy abuse incoming]
I think a similar thing can happen on the forum, and trying to understand what’s going on in a very data-driven way will tend to lead us astray in cases like this.
A concrete example of this is people being apprehensive about posting on the forum, and saying this is because they are afraid of criticism. But if you go and look through all the comments, there aren’t actually that many examples of well-intentioned posts being torn apart. At this point, if you’re being very data-minded, you would say “well I guess people are wrong, posts don’t actually get torn apart in the comments; so we should just encourage people to overcome their fear of posting (or something)”.
I think this is probably wrong, because something like this happens: users correctly identify that people would tear their post apart if it were bad, so they either don’t write the post at all, or they put a lot of effort into making it good. The result is that the amount of realised harsh criticism on the forum is low, and the quality of posts is generally high (compared to other forums, Facebook, etc.).
I would guess that criticising actually-bad posts even more harshly would in fact lower the total amount of criticism, for the same reason that hanging people for stealing bread probably lowered the theft rate among Victorian street urchins (it would probably also be bad for the same reason).
A complaint about using average Brier scores
Comparing average Brier scores between people only makes sense if they have made predictions on exactly the same questions, because making predictions on more certain questions (such as "will there be a 9.0 earthquake in the next year?") will tend to give you a much better Brier score than making predictions on more uncertain questions (such as "will this coin come up heads or tails?"). This is one of those things that lots of people know, but everyone (including me) keeps using them anyway because they're a nice simple number to look at.
To explain:
The Brier score for a binary prediction is the squared difference between the predicted probability and the actual outcome: $(o - p)^2$, where $o \in \{0, 1\}$ is the outcome and $p$ is the predicted probability. For a given forecast, predicting the correct probability will give you the minimum possible expected Brier score (which is what you want). But this minimum possible score varies depending on the true probability of the event happening.
For the coin flip the true probability is 0.5, so if you make a perfect prediction your expected Brier score is $0.25$ ($= 0.5 \cdot (1 - 0.5)^2 + 0.5 \cdot (0 - 0.5)^2$). For the earthquake question maybe the correct probability is 0.1, so the best expected Brier score you can get is $0.09$ ($= 0.1 \cdot (1 - 0.1)^2 + 0.9 \cdot (0 - 0.1)^2$), and it's only if you are really badly wrong (you think $p > 0.5$) that you can get a score higher than the best score you can get for the coin flip.
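To make this concrete, here's a minimal Python sketch (my own illustration; `expected_brier` is a made-up helper, not from any forecasting library) of the best expected Brier score a perfectly calibrated forecaster can get at different true probabilities:

```python
def expected_brier(p_true: float, p_forecast: float) -> float:
    """Expected Brier score for a binary event with true probability
    p_true, when you forecast p_forecast."""
    # With probability p_true the event happens (outcome = 1),
    # otherwise it doesn't (outcome = 0).
    return p_true * (1 - p_forecast) ** 2 + (1 - p_true) * (0 - p_forecast) ** 2

# A perfectly calibrated forecaster predicts p_forecast == p_true:
for p in [0.5, 0.3, 0.1, 0.01]:
    print(f"true p = {p:>4}: best expected Brier = {expected_brier(p, p):.4f}")

# true p =  0.5: best expected Brier = 0.2500
# true p =  0.3: best expected Brier = 0.2100
# true p =  0.1: best expected Brier = 0.0900
# true p = 0.01: best expected Brier = 0.0099
```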
So if forecasters have a choice of questions to make predictions on, someone who mainly goes for things that are pretty certain will end up with a (much!) better average Brier score than someone who predicts things that are genuinely more 50/50. This also acts as a disincentive for predicting more uncertain things, which seems bad.
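And a quick simulation of that selection effect, under the assumption that both forecasters are perfectly calibrated and differ only in which questions they choose to predict on:

```python
import random

random.seed(0)

def realized_brier(p_true: float, p_forecast: float) -> float:
    # Sample the outcome, then score the forecast against it.
    outcome = 1 if random.random() < p_true else 0
    return (outcome - p_forecast) ** 2

# Both forecasters forecast the true probability; they differ only
# in which questions they pick.
easy_qs = [random.choice([0.05, 0.95]) for _ in range(10_000)]  # near-certain events
hard_qs = [0.5] * 10_000                                        # genuine toss-ups

avg_easy = sum(realized_brier(p, p) for p in easy_qs) / len(easy_qs)
avg_hard = sum(realized_brier(p, p) for p in hard_qs) / len(hard_qs)

print(f"picks near-certain questions: {avg_easy:.3f}")  # ~0.048
print(f"picks 50/50 questions:        {avg_hard:.3f}")  # ~0.250
```

The easy-question picker ends up with an average Brier score several times better, despite both forecasters being equally (perfectly) calibrated.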
We've just added Fatebook (which is great!) to our Slack, and I've noticed this putting me off making forecasts for things that are highly uncertain. I'm interested in whether there is some lore around dealing with this among people who use Metaculus or other platforms where Brier scores are an important metric. I only really use prediction markets, which don't suffer from this problem.
Note: this also applies to log scores, etc.
Yeah, I'm starting to believe that a severe limitation of Brier scores is this inability to use them in a forward-looking way. Brier scores reflect the performance of specific people on specific questions, and using them as evidence of future prediction performance seems really fraught... but it's the best we have as far as I can tell.