Project Idea: Profiles Aggregating Forecasting Performance Metrics

Damien Laird

This is a linkpost for https://damienlaird.substack.com/p/application-profiles-aggregating

Summary

In this post I describe a hypothetical single site that aggregates a given forecaster’s performance across multiple public platforms into a single place. I detail what I expect the technical hurdles to be in implementing such a site. I justify these costs by arguing that this would help foster an efficient culture of learning within the forecasting community that would accelerate the rate of improvement of forecasters and their techniques. I also detail the current performance metrics displayed on four popular platforms and find them to be vastly different.

What I’m Proposing

Imagine a minimum viable "LinkedIn" for forecasting. At a glance, a profile there could answer questions like…

What has an individual or team forecasted?
How accurate have their forecasts been? How about relative to the crowd?
What platforms do they forecast on?
What topics do they tend to forecast?
Have they won any prizes or ranked highly in any leaderboards or tournaments?

There are currently multiple popular forecasting platforms, and they all surface answers to some subset of the above questions in their own way. A user may not even have the same username across multiple platforms, so I believe the current state of the art is for a prolific forecaster to either link to all of their independent profiles from their own personal site/blog or to summarize their results themselves and manually update them.

A single site could allow a user to link their various accounts and concisely display their performance across platforms such that it can be continually updated, either via API access or web scraping.

Technical Challenges

I expect there to be some barriers to implementing this that I’m not qualified to fully assess:

Account verification. How do we know that the accounts to be aggregated actually belong to the same user? Can they post a particular verification code in their profile on those platforms, or link back to this specific overview profile, and this then gets scraped? Is easier verification possible, for example on sites that support Google credentials?
How feasible is the API access / web scraping? I believe Metaculus and Manifold have APIs, but do they surface all of the profile metrics? I don’t believe Good Judgement Open or INFER have APIs, so how fragile would a web scraping application be? That is, if either platform changes the way it displays metrics, the link to this new site could break. Could partnerships be made with these sites to mitigate these risks?
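As a rough illustration of the "verification code in the profile" idea above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the "fp-verify-" prefix, and the idea that the profile text has already been fetched by some API call or scraper that is not shown. It is not how any of these platforms currently handle verification.

```python
import hashlib
import hmac


def make_verification_code(secret: bytes, aggregator_user: str, platform: str) -> str:
    """Derive a short code the user is asked to paste into their platform profile.

    `secret` is a hypothetical server-side key, so codes cannot be guessed
    by someone who only knows the usernames involved.
    """
    message = f"{aggregator_user}:{platform}".encode("utf-8")
    digest = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return f"fp-verify-{digest[:16]}"


def profile_contains_code(profile_text: str, expected_code: str) -> bool:
    """Return True if the fetched profile text contains the expected code."""
    return expected_code in profile_text
```

The same check works whether the profile text comes from an API or from scraping, so only the fetching step would need to change if a platform alters its page layout.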

Why Do This?

Obviously such a site would cost resources to create and maintain. These metrics are already being collected, so why should we care about aggregating them?

I view rigorous judgmental forecasting as a recently developed art. We know, first from academic research and now from these online platforms, that usefully impressive human performance is possible. We even know some predictors of it and some common attributes of accurate forecasts and forecasters.

But how much better can forecasters get? What are the limits of human judgmental forecasting performance and what techniques get us there? Which domains or question types are most amenable to which techniques? These questions can be answered rigorously by academia, but at what feels to me like a glacial pace. Athletes and coaches don’t wait for double-blind studies to confirm which strategies or equipment they should use. Instead, they live in a culture of continual experimentation where they’re surrounded by evidence to evaluate and infer causes from. Critically, they can evaluate which teams and players are top performing thanks to clear scoring and statistics.

I believe that lowering the barrier to accessing a forecaster’s performance metrics (along with other interventions that I will continue to describe on this blog) can help foster a similar culture within the forecasting world, where individuals and teams can better learn from each other. When someone shares a resource on a topic or advice on how to structure a forecast, being able to evaluate their track record at a glance minimizes the friction of weighing the value of that information as you consume it. Making performance more public also increases the incentive to perform well and to explore the limits of current techniques.

With the advent of open, online forecasting platforms, I strongly believe that the most powerful lever for advancing the art of forecasting is to foster communities of open experimentation and collaboration, and making performance as clear and accessible as possible seems fundamental to this. Aggregating these metrics in one place also opens the door to a new API that makes surfacing the most salient information in other locations (like Discord servers or forums) much easier.

The Current State of Platform Metrics

All of this depends on what metrics are being surfaced by open forecasting platforms in the first place. In writing this post I was surprised by how much the current state varies from platform to platform. I expect this to be a subject of continual improvement for these platforms, but I wanted to capture the current state here for posterity. Even now, I believe this kind of aggregation would have strong benefits, and the diversity between the scoring metrics may even make the case stronger.

I only list the metrics/information relevant to forecasting directly, not question writing. I’ve also omitted visual badges/achievements as they represent things already captured in the other scores, but these are typical across the platforms. This list is accurate as of April 1st, 2023.

Metaculus

Level (I believe this is just a function of accumulated points)
Number of predictions, across how many questions, and how many of those are resolved
Number of comments across how many questions
List of tournaments and projects
Users also appear on an overall leaderboard of points (more forecasts and more correct forecasts than the crowd = more points), and individual tournament leaderboards
Notably, I don’t see anything like a Brier score or objective scoring anywhere.

Manifold Markets

Trading profits, balance, and portfolio value in Mana (Manifold’s play currency)
Calibration plot with grade ("C+") and "score" (numerical, but not Brier)

Good Judgement Open

Overall Brier score
Number of questions forecasted, and how many of those have been scored
Upvotes
Calendar of forecasting activity

INFER

Relative Brier score
This can also be displayed over time as a plot
This can also be filtered by question, topics, or year
Number of questions forecasted, and how many have been scored
Number of forecasts
Number of upvotes
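To make the aggregation concrete, here is a minimal Python sketch of a common record that could hold the differing metrics listed above. The class and field names are my own assumptions for illustration, not any platform's schema. Metrics a platform does not publish are left as None rather than estimated.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PlatformMetrics:
    """Metrics for one linked account (hypothetical schema)."""
    platform: str
    username: str
    questions_forecasted: Optional[int] = None
    questions_scored: Optional[int] = None
    brier_score: Optional[float] = None           # e.g. Good Judgement Open
    relative_brier_score: Optional[float] = None  # e.g. INFER
    extra: Dict[str, object] = field(default_factory=dict)  # platform-specific items


@dataclass
class ForecasterProfile:
    """One aggregated profile linking several platform accounts."""
    display_name: str
    accounts: List[PlatformMetrics] = field(default_factory=list)

    def total_questions(self) -> int:
        # Treat a missing count as zero so the sum is always defined.
        return sum(a.questions_forecasted or 0 for a in self.accounts)

    def platforms_with(self, metric: str) -> List[str]:
        # Names of the platforms that actually report the given metric.
        return [a.platform for a in self.accounts if getattr(a, metric) is not None]
```

Platform-specific items such as a Metaculus level or a Manifold calibration grade would go in the extra dictionary, since they have no equivalent elsewhere.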

niplav (Apr 18, 2023)

I like this idea :-)

I think that there are some tricky questions about comparing across different forecasters and their predictions. If you simply take Brier score, this can be Goodharted: people can choose the "easiest" questions and get way better scores than the ones taking on difficult questions.

I can think of some attempts to go at this:

Ranking forecasters:
- For two forecasters, they get ranked according to their Brier scores on questions they have both forecasted on. I fear that this will lead to cyclical rankings, which could be dealt with using the Smith set or Hodge decomposition.
- Forecasters are ranked according to their performance relative to all other forecasters on each question (making easier questions less impactful on a forecaster's score).
I'd like to look into credibility theory to see whether it has some insights into ranking with different sample sizes since IMDb uses it for ranking movies.
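A minimal Python sketch of the first idea above, comparing two forecasters only on questions both have forecasted. It assumes binary questions, one probability per forecaster per question, and the usual mean squared error form of the Brier score. The function and type names are made up for illustration.

```python
from typing import Dict, List, Optional, Tuple

Forecasts = Dict[str, float]  # question id -> probability assigned to "yes"
Outcomes = Dict[str, int]     # question id -> 1 if resolved yes, 0 if resolved no


def brier_score(forecasts: Forecasts, outcomes: Outcomes, question_ids: List[str]) -> float:
    """Mean squared error between forecast and outcome (lower is better)."""
    return sum((forecasts[q] - outcomes[q]) ** 2 for q in question_ids) / len(question_ids)


def compare_on_shared(
    a: Forecasts, b: Forecasts, outcomes: Outcomes
) -> Optional[Tuple[float, float, int]]:
    """Return (Brier of a, Brier of b, number of shared questions).

    Only questions that both forecasters answered and that have resolved
    are counted. Returns None when there is no overlap.
    """
    shared = sorted(a.keys() & b.keys() & outcomes.keys())
    if not shared:
        return None
    return brier_score(a, outcomes, shared), brier_score(b, outcomes, shared), len(shared)
```

Returning the number of shared questions alongside the scores matters here, since a comparison based on two questions deserves much less weight than one based on two hundred.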

Damien Laird (Apr 18, 2023)

I agree with your concerns on using a pure Brier score with open platforms. I expect that currently it makes the most sense within "tournaments" where participants are answering every question. Technically, I think some sort of objective, proper scoring rule is a prerequisite to a more advanced scoring system that conveys more useful information in open contexts.

I've seen some sort of "relative Brier score" referenced frequently in associated research (definitely in the Good Judgement Project papers, at a minimum) that scored forecasters based on the difficulty of each question, as determined by the performance of others who forecasted it. This seems promising, and I expect there are a lot of options in that direction.
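To illustrate the general shape of such a score, here is a minimal Python sketch. It is one possible construction under my own assumptions (binary questions, one forecast per person per question, and the median as the crowd benchmark), and not necessarily the formula used in that research or on any platform.

```python
from statistics import median
from typing import Dict, Optional

AllForecasts = Dict[str, Dict[str, float]]  # forecaster -> question id -> probability
Outcomes = Dict[str, int]                   # question id -> 1 (yes) or 0 (no)


def relative_brier(user: str, all_forecasts: AllForecasts, outcomes: Outcomes) -> Optional[float]:
    """Average of (own Brier - median Brier of everyone on that question).

    Negative values mean the user beat the typical forecaster. Easy questions
    contribute little, because everyone's error on them is small.
    """
    diffs = []
    for question, prob in all_forecasts[user].items():
        if question not in outcomes:
            continue  # skip unresolved questions
        own = (prob - outcomes[question]) ** 2
        crowd = [
            (forecasts[question] - outcomes[question]) ** 2
            for forecasts in all_forecasts.values()
            if question in forecasts
        ]
        diffs.append(own - median(crowd))
    return sum(diffs) / len(diffs) if diffs else None
```

Because each question is scored against the people who actually answered it, a forecaster cannot improve this number just by picking questions where everyone does well.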
