
Thanks to Eli Lifland, Molly Hickman, Değer Turan, and Evan Miyazono for reviewing drafts of this post. The opinions expressed here are my own.

Summary:

  • Forecasters produce reasons and models that are often more valuable than the final forecasts
  • Most of this value is lost due to the historical practices and incentives of forecasting, and to the difficulty crowds have in “adversarially collaborating”
  • FutureSearch is a forecasting system with legible reasons and models at its core (examples at the end)

The Curious Case of the Missing Reasoning

Ben Landau-Taylor of Bismarck Analysis wrote a piece on March 6 called “Probability Is Not A Substitute For Reasoning”, citing a piece where he writes:

There has been a great deal of research on what criteria must be met for forecasting aggregations to be useful, and as Karger, Atanasov, and Tetlock argue, predictions of events such as the arrival of AGI are a very long way from fulfilling them.

Last summer, Tyler Cowen wrote on AGI ruin forecasts:

Publish, publish, not on blogs, not long stacked arguments or six hour podcasts or tweet storms, no, rather peer review, peer review, peer review, and yes with models too... if you wish to convince your audience of one of the most radical conclusions of all time…well, more is needed than just a lot of vertically stacked arguments.

Widely divergent views and forecasts on AGI persist, leading to FRI’s excellent adversarial collaboration on forecasting AI risk this month. Reading it, I saw… a lot of vertically stacked arguments.

There have been other big advances in judgmental forecasting recently, on non-AGI AI, Covid-19 origins, and scientific progress. How well justified are the forecasts?

  • Feb 28: Steinhardt’s lab’s impressive paper on “Approaching Human-Level Forecasting with Language Models” (press). The pipeline rephrases the question, lists arguments, ranks them, adjusts for biases, and then guesses the forecast. They note “The model can potentially generate weak arguments”, and the appendix shows some good ones (decision trees) and some bad ones.
  • March 11: Good Judgment’s 50-superforecast analysis of Covid-19 origins (substack). Reports that the forecasters used base rates, scientific evidence, geopolitical context, and views from intelligence communities, but not what these were. (Conversely, the RootClaim debate gives so much info that even Scott Alexander’s summary is a dozen pages.) 10 of the 50 superforecasters ended with a dissenting belief.
  • March 18: Metaculus and Federation of American Scientists’ pilot of forecasting expected value of scientific projects. “[T]he research proposals lacked details about their research plans, what methods and experimental protocols would be used, and what preliminary research the author(s) had done so far. This hindered their ability to properly assess the technical feasibility of the proposals and their probability of success.”
  • March 20: DeepMind’s “Evaluating Frontier Models for Dangerous Capabilities”, featuring Swift Centre forecasts (X). Reports forecaster themes: “Across all hypotheticals, there was substantial disagreement between individual forecasters.” Lists a few cruxes but doesn’t provide any complete arguments or models.

In these cases and the FRI collaboration, the forecasts are from top practitioners with great track records of accuracy (or “approaching” this, in the case of AI crowds). The questions are of the utmost importance.

Yet what can we learn from these? Dylan Matthews wrote last month in Vox about “the tight connection between forecasting and building a model of the world.” Where is this model of the world?

FRI’s adversarial collaboration did the best here. They list several “cruxes”, and measure how much of people’s disagreement can be explained by them. Still, I can’t use these cruxes to update my models of the world.

When DeepMind hired Swift Centre forecasters, as when OpenAI hired forecasters (see 2.12 in the GPT-4 paper), domain experts and elite generalist forecasters probably had great discussions and probed deeply. But the published result reminds me of critiques of crowd forecasting that Eli Lifland made, and Nuño Sempere and Alex Lawsen published, back in 2021. Eli put it simply to me:

In AI / AI safety, what we need most right now is deep research into specific topics rather than shallow guesses.

Those Who Seek Rationales, And Those Who Do Not

The lack of rationales in crowd forecasts has always been conspicuous. What were the primary sources? What were the models and base rates? My personal experience echoes Tyler Cowen, that this absence can be a dealbreaker to academics, journalists, executives, and the public. The Economist wrote on the Cosmic Bazaar, “The long-term viability of forecasting will depend, though, not just on accuracy, but also explainability.” A paper in Intelligence and National Security went so far as to conclude that the “most fundamental” issue was “decision-makers lacking interest in probability estimates.”

Some platforms have made progress on showing more than just probabilities. Metaculus (source) and INFER (source) produce AI summaries drawn from comments. Kalshi recently got approval to host comments; private prediction markets sometimes require commenting to get payouts. Good Judgment, Swift Centre, and Samotsvety do give (private) justifications to their clients.

But when rationales are just prose rather than models, and there’s a crowd producing them, the insights are lost. Vitalik predicted in January that crowds are getting larger:

Prediction markets have been a holy grail of epistemics technology for a long time […] one specific feature of prediction market ecosystems that we can expect to see in the 2020s that we did not see in the 2010s: the possibility of ubiquitous participation by AIs.

I created the 2024 Manifold humans vs. AI tournament, and I endorse “ubiquitous participation by AIs.” Methods to elicit useful models from large crowds may help, perhaps with the AI Objective Institute’s “Talk to the City” approach.

More likely, there’s a deep tradeoff here. Steinhardt’s paper produced decently good overall forecasts from an AI crowd drawing largely on variable-quality arguments.

Squint, and this is what all forecasting platforms look like: good aggregate forecasts, with a melange of variable-quality comments.

So What Do Elite Forecasters Actually Know?

At Google and at Metaculus, I hosted dozens of “Forecast Fridays”, where top forecasters worked through questions for the audience. I’ve spoken at length to some of the very best forecasters in the world. I’m consistently impressed with their clarity of thought, their objectivity, their humility. Their models are often simple yet clever, such as “defenders have a 3-5 fold military advantage compared to invaders” and “bills in congress have X%, Y%, and Z% chance of passing A, B, C committees and votes”.
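To make the flavor of these models concrete, here is a minimal sketch, in Python with made-up stage probabilities, of the "bill passing committees and votes" style of decomposition; a real model would also account for dependence between stages rather than multiplying them as if independent.

```python
# Minimal sketch of a "bill becomes law" decomposition.
# The stage probabilities are hypothetical placeholders, not real estimates.

stages = {
    "passes House committee":   0.60,
    "passes House floor vote":  0.70,
    "passes Senate committee":  0.55,
    "passes Senate floor vote": 0.50,
    "signed by the President":  0.90,
}

p_law = 1.0
for stage, p in stages.items():
    p_law *= p  # naive independence assumption
    print(f"{stage}: {p:.0%} (cumulative: {p_law:.1%})")

print(f"Overall forecast that the bill becomes law: {p_law:.1%}")
```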

Oddly, some Forecast Friday presenters reported spending more time preparing their talks than they did on a typical tournament question for big prizes. Spending 10+ hours per question is tough across the 30+ questions nearly all tournaments have, from Tetlock’s first IARPA tournament in 2011, to the ACX ones in 2023 and 2024. These tournaments need this volume to statistically identify the best forecasters.

Elite forecasters’ complaints of time constraints match my experience. Even on one single question, I “see a world in a grain of sand.” Every basic fact of the world is rich with assumptions and historical context. The more I pull on a thread, the more I unravel. And it’s not all chaos: the forecasting revolution is happening because patterns do exist and can be found, as evidenced by the great scores many achieve.

I recently spent half a day on one question I considered easy: how SCOTUS will rule in an upcoming 1st Amendment case. I started on a series of historical models: (1) on how ideology impacts justices’ rulings, (2) on how “swing voters” weigh the societal implications of their rulings, and (3) on the effect of lower court rulings on higher court rulings. Hour after hour, I made my models better, squeezing out more and more Brier score improvement. I could write a whole dissertation on this question.

So imagine how the elite forecasters feel, facing 30-50 questions in a tournament. As one put it to me, this is the “dirty secret” of forecasting tournaments - the winners are those who spend the most time. The best strategy is to spend your marginal hour getting into the right ballpark, or doing a quick-and-dirty update when something changes. In fact, I think our main discovery is that elite forecasters make far fewer big mistakes.

As Eli Lifland wrote:

Some questions are orders of magnitude more important than others but usually don’t get nearly orders of magnitude more effort on crowd forecasting platforms.

Such are the incentives of the academic approach to forecasting. Such are the incentives of even Metaculus's great new scoring rules, and such are the incentives of prediction markets. Even in the FRI collaboration, which focused on one big “AGI ruin by 2100?” question, the participants spent an average of 55 hours total on 38 questions.

The Rationale-Shaped Hole At The Heart Of Forecasting

These incentives in the forecasting ecosystem have produced great forecasters and accurate forecasts. But the ecosystem is not geared for knowledge generation. Political scientists like Tetlock treat forecasting as a psychology problem; economists and mathematicians treat it as an aggregation and scoring problem; systems thinkers treat forecasting as a black box, hoping to weave together a big causal graph of forecasts.

Individual forecasters, though, do treat forecasting as a knowledge generation problem. They gather facts, reason through possibilities, build models, and refine these through debate.

Imagine we gave the world’s elite forecasters unlimited time on a single question. How well would they do? How good of a rationale could they produce? How persuasive would they be to the wider world? Samotsvety’s 2022 nuclear risk forecasting is one indicator, but according to forecasting and nuclear risk expert Peter Scoblic, even that had much room for improvement.

For simplicity, I divide a forecasting rationale into three components: facts, reasons, and models. Let’s consider each in turn, asking what it would look like to focus on its production.

Facts: Cite Your Sources

As a general practice, published forecasts should list the key facts they depend on, and link these facts back to primary sources.

That’s it. Onto Reasons and Models.

Reasons: So You Think You Can Persuade With Words

The standard I’d like to see is Scott Alexander’s 2019 adversarial collaboration, “where two people with opposite views on a controversial issue work together to present a unified summary of the evidence and its implications.” The key benefit here is that “all the debate and rhetoric and disagreement have already been done by the time you start reading, so you’re just left with the end result” [emphasis mine].

Scott concluded this succeeded on: (a) whether calorie restriction slows aging; (b) the net benefit/harm of eating meat; (c) the net benefit/harm of circumcision; and (d) the economic fallout from AI automation.

It fell short (e.g. no convergence on a conclusion) on: (a) whether space colonization would reduce x-risk; (b) the net benefit/harm of gene editing; (c) the net benefit/harm of abortion; and (d) whether spiritual experiences are scientifically valid.

So in 4 of 8 cases, two people with opposite views on a controversial topic converged to a shared conclusion. And the final pieces are valuable: they persuaded each other, so they are likely to persuade the public too.

The pieces are long, and have some models in addition to reasons. But they're shorter than many recent works, for exactly the reason Scott gives: the debate is already done, so you just see the result. Scott himself did this for the RootClaim debate, producing a summary that is far more accessible than the original 15 hours of debates.

Both the aforementioned FRI paper and the 2022 Good Judgment effort to forecast AGI did produce summaries that highlighted key cruxes. FRI went further and quantitatively estimated how important each crux was - a great starting point towards an adversarially-collaborated synthesis.

Yet reading these papers still means wading into the debate. Publishing a crowd forecast is like publishing an unresolved debate. Perhaps that’s an accurate reflection of reality. But I’d like to see adversarial collaborations where the dissenting forecasters are the ones who write up the shared view, rather than the study investigators.

That leaves us with Models. Can they help us navigate the labyrinth of prose?

Models: So You Think You Can Model the World

Molly Hickman, of FRI and Samotsvety, wrote a great piece in Asterisk Magazine with the subtitle “Good forecasting thrives on a delicate balance of math, expertise, and… vibes.” She writes:

But there’s a more insidious second kind of error [after “trusting our preconceptions too much”] that bites forecasters — putting too much store in clever models that minimize the role of judgment. Just because there’s math doesn’t make it right.

Yet she concludes:

It’s almost always worth the effort to make a quantitative model — not because its results are the immutable truth but because practicing decomposing questions and generating specific probabilities are how you train yourself to become a better forecaster.

All models are wrong, but some are useful for forecast accuracy. For the purpose of producing useful knowledge, though, we don’t use models enough, especially in AGI forecasting.

Tyler Cowen again:

If the chance of existential risk from AGI is 99 percent, or 80 percent, or even 30 percent, surely some kind of modeled demonstration of the basic mechanics and interlocking pieces is possible.

It is possible! It’s much harder than modeling geopolitics, where the future more resembles the past. I’m partial to Nuño’s base rates of technological disruption which led him to posit “30% that AI will undergo a ‘large and robust’ discontinuity, at the rate of maybe 2% per year if it does so.” The beauty of his analysis is that you can inspect it. I think Nuño and I would converge, or get close to it, if we hashed it out.
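As a toy illustration of that inspectability, here is one way the two numbers could be turned into a by-year forecast. The combination rule is my own assumption for the sketch, not necessarily how Nuño aggregates them; the point is that you can swap either input and watch the conclusion change.

```python
# Sketch: combining a "does a discontinuity ever happen?" probability with a
# per-year hazard rate. The aggregation rule is an assumption for illustration.

P_DISCONTINUITY_POSSIBLE = 0.30  # adjustable: a "large and robust" discontinuity occurs at all
ANNUAL_HAZARD = 0.02             # adjustable: per-year chance, conditional on the above

def p_discontinuity_within(years: int) -> float:
    p_conditional = 1 - (1 - ANNUAL_HAZARD) ** years
    return P_DISCONTINUITY_POSSIBLE * p_conditional

for horizon in (5, 10, 25, 50):
    print(f"within {horizon:>2} years: {p_discontinuity_within(horizon):.1%}")
```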

Other great examples include Tom Davidson’s compute-centric model, Roodman’s “materialist” model, and Joe Carlsmith’s six ingredients model. These models are full of prose, yet unlike pure reasoning, they have facts you can substitute and numbers you can adjust that directly change the conclusion.

I bet that if the FRI adversarial collaborators had drawn from Sempere’s, Davidson’s, Roodman’s, or Carlsmith’s models, they would have converged more. A quick ctrl+f of the 150-page FRI report shows only two such references - both to Davidson’s... appearance on a podcast! The 2022 GJ report used the Carlsmith model to generate the questions, but it appears none of the superforecasters appealed to any models of any kind, not even Epoch data, in their forecasts.

This goes a long way towards explaining the vast gulf between superforecasters and AI researchers on AGI forecasts. The FRI effort was a true adversarial collaboration, yet as Scott wrote, “After 80 hours, the skeptical superforecasters increased their probability of existential risk from AI! All the way from 0.1% to . . . 0.12%.”

This may be rational, in that the superforecasters already knew the quality of the arguments of the “concerned” group. My guess is that they correctly judged them as not up to their superforecaster standards.

Yet the superforecasters themselves lack models on which to base their conclusions. Even after years of tournaments, it is hard for them to accumulate enough non-trivial, resolved questions to learn to tell 0.1% apart from 1%.

There Is No Microeconomics of AGI

This piece focuses on AGI forecasting because it’s important, and because it has been the focus of recent progress in the art and science of forecasting. But judgmental forecasting does much better on other topics. See e.g. Metaculus’s AI track record (Brier scores of 0.182, versus 0.11 overall, where 0.25 is chance), Nuño’s and Misha’s post on challenges in forecasting AI, Robert de Neufville’s “Forecasting Existential Risk is Hard”, or Yudkowsky in 2013 (page 23).
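For readers less familiar with the metric: the Brier score is the mean squared error between probabilistic forecasts and 0/1 outcomes, so lower is better and always answering 50% on binary questions scores 0.25. A quick sketch with made-up resolutions:

```python
# Brier score: mean squared error between forecasts and binary outcomes.
# Lower is better; always guessing 0.5 scores 0.25 ("chance").

def brier(forecasts, outcomes):
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

outcomes = [1, 0, 0, 1, 0]  # hypothetical question resolutions

print(brier([0.5] * 5, outcomes))                   # 0.25   -> chance level
print(brier([0.8, 0.1, 0.3, 0.7, 0.2], outcomes))   # ~0.054 -> much better
```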

If accuracy on AGI forecasting continues to be low, then the case for focusing on facts, reasons, and models as the primary output is even stronger. Epoch’s work on training data, compute performance, and algorithmic performance is a great example. It gets better and more useful over time, whereas the 2022 Good Judgment AGI forecasts already feel stale.

The Swift Centre’s probability distributions on questions like “When will self-proliferation capabilities be achieved?” in the DeepMind paper are provocative but shallow. Forecasts without rationales usually are. This is also true of the page of 50 key AI forecasts I put together at Metaculus.

Consider: would you rather have Brier-0.18-quality predictions on all 700+ Arb questions on AI safety? Or a distilled set of facts, reasons, and models on a few of the easier questions - say, semiconductors in China, EU AI regulation, and AI Safety labs funding?

700 AI questions you say? Aren’t We In the Age of AI Forecasters?

Yes, indeed we are. For the first 6 months of FutureSearch, we planned to generate big batches of questions, forecast them quickly, and update them frequently for our clients. After all, don’t we all want efficiently discovered causal graphs? Is this not the time for Robin Hanson’s combinatorial prediction markets?

Probably it is. But the best use of LLMs in forecasting that we see is on research tasks within individual questions. From Schoenegger and Tetlock, we know LLM Assistants Improve Human Forecasting Accuracy, even when the assistant is just GPT-4 prompted to follow the 10 commandments of superforecasting.

The more we tinkered with FutureSearch, building and improving (and often discarding) features to test in evals, the more we struggled to decide on the rationales to target, lacking good public examples. Imitating Metaculus comments couldn’t be the way.

Really good reasons are hard to generate with LLMs. But what did work was using LLMs to build simple models. Consider the antitrust suit against Apple. Does it matter that the DOJ initiated the suit, not the FTC? In antitrust suits, do tech companies fare better or worse than the base rate for all companies? What’s the distribution of outcomes of other suits against Apple over the years? What is the disposition of the US attorney general Merrick Garland? Research tasks like that are perfect for LLM-based systems. You can see all of FutureSearch’s conclusions on these things and judge them for yourself.
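To give a flavor of what "using LLMs to build simple models" can mean, here is a hypothetical sketch of folding reference-class research into a blended base rate. The classes, rates, and weights are placeholders, not FutureSearch's actual findings or method.

```python
# Hypothetical: blending reference-class base rates for an antitrust question.
# All numbers are placeholders, not real research findings.

reference_classes = [
    # (description, base rate of the government prevailing, weight)
    ("all DOJ antitrust suits",          0.55, 1.0),
    ("antitrust suits against big tech", 0.45, 2.0),
    ("prior suits against Apple",        0.40, 1.5),
]

blended = (sum(rate * w for _, rate, w in reference_classes)
           / sum(w for _, _, w in reference_classes))

# A starting point, to be adjusted with case-specific evidence.
print(f"blended base rate: {blended:.1%}")
```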

And, once you’ve done research like this on one question, you can’t help but store your facts, reasons, and models and use them for other questions too.

You can see where this is going.

Towards “Towards Rationality Engines”

In February, Scott wrote:

An AI that can generate probabilistic forecasts for any question seems like in some way a culmination of the rationalist project. And if you can make something like this work, it doesn’t sound too outlandish that you could apply the same AI to conditional forecasts, or to questions about the past and present (eg whether COVID was a lab leak).

And in March, he wrote:

But [Forecasting AIs] can’t answer many of the questions we care about most - questions that aren’t about prediction… One of the limitations of existing LLMs is that they hate answering controversial questions. They either say it’s impossible to know, or they give the most politically-correct answer. This is disappointing and unworthy of true AI.

Indeed. We at FutureSearch think a forecasting system with legible reasons and models at its core can contribute to such a rationality engine. If other orgs and platforms join us and FRI in putting more emphasis on rationales, we’ll see more mainstream adoption of the conclusions we draw.

Paraphrasing Yudkowsky and Sutskever: to predict the next token, you have to simulate the world. To paraphrase my cofounder Lawrence Phillips, forecasting is the ultimate loss function to optimize a world model against. Let’s build these world models.

Please reach out to hello@futuresearch.ai if you want to get involved!

Sample Forecasts With Reasons and Models

The 2024 U.S. Supreme Court case on whether to uphold emergency abortion care protections, Moyle v. United States

The 2024 U.S. Supreme Court case on whether to grant former presidents immunity from prosecution, Trump v. United States

The New York Times lawsuit on whether OpenAI can continue to serve models trained on NYT articles

The DOJ’s antitrust suit against Apple filed on March 21, 2024

Comments

FRI went further and quantitatively estimated how important each crux was - a great starting point towards an adversarially-collaborated synthesis.

And you can too! We evaluated cruxes on two axes: "value of information" (VOI) and "value of discrimination" (VOD). Essentially: VOI is how much someone expects to gain by finding out the answer to a given crux question (with respect to an ultimate question), and VOD is how much two people expect to converge on the ultimate question when they find out the answer to the crux question. 

There's a Google Sheets calculator, as well as an R library, which will be released on CRAN at some point.
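For the curious, here is a simplified, purely illustrative formalization of the two quantities for a binary crux question C and a binary ultimate question U; the calculator and library above are the authoritative definitions, and the numbers below are made up.

```python
# Illustrative only: a simplified formalization of VOI and VOD for a binary
# crux question C and a binary ultimate question U.

def voi(p_c, p_u_if_c, p_u_if_not_c):
    """Expected absolute change in P(U) from learning the answer to C."""
    p_u = p_c * p_u_if_c + (1 - p_c) * p_u_if_not_c
    return (p_c * abs(p_u_if_c - p_u)
            + (1 - p_c) * abs(p_u_if_not_c - p_u))

def vod(a, b):
    """Expected reduction in two people's disagreement about U after learning C.

    Each person is a dict with their own p_c, p_u_if_c, p_u_if_not_c.
    """
    def p_u(x):
        return x["p_c"] * x["p_u_if_c"] + (1 - x["p_c"]) * x["p_u_if_not_c"]

    gap_now = abs(p_u(a) - p_u(b))
    p_c = (a["p_c"] + b["p_c"]) / 2  # shared expectation over C (a simplification)
    expected_gap_later = (p_c * abs(a["p_u_if_c"] - b["p_u_if_c"])
                          + (1 - p_c) * abs(a["p_u_if_not_c"] - b["p_u_if_not_c"]))
    return gap_now - expected_gap_later

# Two people who mostly disagree about the crux, not about its implications:
sceptic   = {"p_c": 0.2, "p_u_if_c": 0.30, "p_u_if_not_c": 0.05}
concerned = {"p_c": 0.8, "p_u_if_c": 0.35, "p_u_if_not_c": 0.05}

print(voi(**concerned))         # how much the concerned person expects to move on U
print(vod(sceptic, concerned))  # how much the pair expects to converge on U
```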

Hi Dan, 

Thanks for writing this! Some (weakly-held) points of skepticism: 

  1. I find it a bit nebulous what you do and don't count as a rationale. Similarly to Eli,* I think on some readings of your post, “forecasting” becomes very broad and just encompasses all of research. Obviously, research is important!
  2. Rationales are costly! Taking that into account, I think there is a role to play for “just the numbers” forecasting, e.g.: 
    1. Sometimes you just want to defer to others, especially if an existing track record establishes that the numbers are reliable. For instance, when looking at weather forecasts, or (at least until last year) looking at 538’s numbers for an upcoming election, it would be great if you understood all the details of what goes into the numbers, but the numbers themselves are plenty useful, too. 
    2. Even without a track record, just-the-number forecasts give you a baseline of what people believe, which allows you to observe big shifts. I’ve heard many people express things like “I don’t defer to Metaculus on AGI arrival, but it was surely informative to see by how much the community prediction has moved over the last few years”.
    3. Just-the-number forecasts let you spot disagreements with other people, which helps you find out where talking about rationales/models is particularly important. 
       
  3. I’m worried that in the context of getting high-stakes decision makers to use forecasts, some of the demand for rationales is due to lack of trust in the forecasts. Replying to this demand with AI-generated rationales might shift the skeptical take from “they’re just making up numbers” to “it’s all based on LLM hallucinations”, and I’m not sure that really addresses the underlying problem. 
     

*OTOH, I think Eli is also hinting at a definition of forecasting that is too narrow. I do think that generating models/rationales is part of forecasting as it is commonly understood (including in EA circles), and certainly don't agree that forecasting by definition means that little effort was put into it!
Maybe the right place to draw the line between forecasting rationales and “just general research” is asking “is the model/rationale for the most part tightly linked to the numerical forecast?" If yes, it's forecasting, if not, it's something else. 


 

[Disclaimer: I'm working for FutureSearch]

on some readings of your post, “forecasting” becomes very broad and just encompasses all of research.

To add another perspective: reasoning helps when aggregating forecasts. Just consider one of the motivating examples for extremising, where, IIRC, some US president is handed several (well-calibrated, say) estimates of around 70% for P(head of some terrorist organisation is in location X). If these estimates came from different sources, the aggregate ought to be bigger than 70%, whereas if they are all based on the same few sources, 70% may be one's best guess.
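To sketch the aggregation point concretely: pooling in log-odds and then extremizing is one standard way to encode "independent sources should push the aggregate past 70%". The estimates and the extremizing exponent below are illustrative, not recommendations.

```python
# Sketch: log-odds pooling with extremization. Illustrative numbers only.
import math

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

estimates = [0.70, 0.72, 0.68, 0.71]
mean_log_odds = sum(logit(p) for p in estimates) / len(estimates)

print(f"plain log-odds pooling: {inv_logit(mean_log_odds):.1%}")       # ~70%

d = 2.0  # extremizing exponent; >1 when the estimates draw on largely independent evidence
print(f"extremized (d = {d}):    {inv_logit(d * mean_log_odds):.1%}")  # well above 70%
```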

This is also something that a lot of forecasters may just do subconsciously when considering different points of view (which may be something as simple as different base rates or something as complicated as different AGI arrival models).

So from an engineering perspective there is a lot of value in providing rationales, even if they don't show up in the final forecasts.

Yeah, I do like your four examples of "just the numbers" forecasts that are valuable: weather, elections, what people believe, and "where is there lots of disagreement?" I'm more skeptical that these are useful, rather than curiosity-satisfying.

Election forecasts are a case in point. People will usually prepare for all outcomes regardless of the odds. And if you work in politics, deciding who to choose for VP or where to spend your marginal ad dollar, you need models of voter behavior. 

Probably the best case for just-the-numbers is your point (b), shift-detection. I echo your point that many people seem struck by the shift in AGI risk on the Metaculus question.

I’m worried that in the context of getting high-stakes decision makers to use forecasts, some of the demand for rationales is due to lack of trust in the forecasts.

Undoubtedly some of it is. Anecdotally, though, high-level folks frequently take one (or zero) glances at the calibration chart, nod, and then say "but how am I supposed to use this?", even on questions I pick to be highly relevant to them, just like the paper I cited finding "decision-makers lacking interest in probability estimates."

Even if you're (rightly) skeptical about AI-generated rationales, I think the point holds for human rationales. One example: Why did DeepMind hire Swift Centre forecasters when they already had Metaculus forecasts on the same topics, as well as access to a large internal prediction market?

I do think that generating models/rationales is part of forecasting as it is commonly understood (including in EA circles), and certainly don't agree that forecasting by definition means that little effort was put into it!
Maybe the right place to draw the line between forecasting rationales and “just general research” is asking “is the model/rationale for the most part tightly linked to the numerical forecast?" If yes, it's forecasting, if not, it's something else. 

 

Thanks for clarifying! Would you consider OpenPhil worldview investigations reports such as Scheming AIs, Is power-seeking AI an existential risk, Bio Anchors, and Davidson's takeoff model to be forecasting? It seems to me that they are forecasting in a relevant sense, and (for all except Scheming AIs, maybe?) in the sense you describe of the rationale being tightly linked to a numerical forecast, but that they wouldn't fit under the OP forecasting program area (correct me if I'm wrong).

Maybe it's not worth spending too much time on these terminological disputes; perhaps the relevant question for the community is what the scope of your grantmaking program is. If the months- to year-long reports above indeed wouldn't be covered, then it seems to me that the amount of effort spent is a relevant dimension of what counts as "research with a forecast attached" vs. "forecasting as generally understood in EA circles and covered under your program", so it might be worth clarifying the boundaries there. If you would indeed consider reports like the worldview investigations ones under your program, then never mind, but it would be good to clarify, as I'd guess most would not assume that.

I think it’s borderline whether reports of this type are forecasting as commonly understood, but would personally lean no in the specific cases you mention (except maybe the bio anchors report).
 
I really don’t think that this intuition is driven by the amount of time or effort that went into them, but rather by the percentage of intellectual labor that went into something like “quantifying uncertainty” (rather than, e.g., establishing empirical facts, reviewing the literature, or analyzing the structure of commonly-made arguments).  

As for our grantmaking program: I expect we’ll have a more detailed description of what we want to cover later this year, where we might also address points about the boundaries to worldview investigations.

Thanks for writing this up, and I'm excited about FutureSearch! I agree with most of this, but I'm not sure framing it as more in-depth forecasting is the most natural given how people generally use the word forecasting in EA circles (i.e. associated with Tetlock-style superforecasting, often aggregation of very part-time forecasters' views, etc.). It might be imo more natural to think of it as being a need for in-depth research, perhaps with a forecasting flavor. Here's part of a comment I left on a draft.

However, I kind of think the framing of the essay is wrong [ETA: I might hedge wrong a bit if writing on EAF :p] in that it categorizes a thing as "forecasting" that I think is more naturally categorized as "research" to avoid confusion. See point (2)(a)(ii) at https://www.foxy-scout.com/forecasting-interventions/ ; basically I think calling "forecasting" anything where you slap a number on the end is confusing, because basically every intellectual task/decision can be framed as forecasting.

It feels like this essay is overall arguing that AI safety macrostrategy research is more important than AI safety superforecasting (and the superforecasting is what EAs mean when they say "forecasting"). I don't think the distinction being pointed to here is necessarily whether you put a number at the end of your research project (though I think that's usually useful as well), but the difference between deep research projects and Tetlock-style superforecasting.

I don't think they are necessarily independent btw, they might be complementary (see https://www.foxy-scout.com/forecasting-interventions/ (6)(b)(ii) ), but I agree with you that the research is generally more important to focus on at the current margin.

[...] Like, it seems more intuitive to call https://arxiv.org/abs/2311.08379 a research project rather than forecasting project even though one of the conclusions is a forecast (because as you say, the vast majority of the value of that research doesn't come from the number at the end).

Agreed Eli, I'm still working to understand where the forecasting ends and the research begins. You're right, the distinction is not whether you put a number at the end of your research project.

In AGI (or other hard sciences) the work may be very different, and done by different people. But in other fields, like geopolitics, I see Tetlock-style forecasting as central, even necessary, for research.

At the margin, I think forecasting should be more research-y in every domain, including AGI. Otherwise I expect AGI forecasts will continue to be used, while not being very useful.

I found this interesting, and a model I've recently been working on might be relevant - I've emailed you about it. One bit of feedback:
 

Please reach out to hello@futuresearch.ai if you want to get involved!

You might want to make it more clear what kind of collaboration you're hoping to receive.

I suppose I left it intentionally vague :-). We're early, and are interested in talking to research partners, critics, customers, job applicants, funders, forecaster copilots, writers.

We'll list specific opportunities soon, consider this to be our big hello.

I think a major issue is that the people who would be best at predicting AGI usually don't want to share their rationale.

Gears-level models of the phenomenon in question are highly useful in making accurate predictions. Those with the best models are either worriers who don't want to advance timelines, or enthusiasts who want to build it first. Neither has an incentive to convince the world it's coming soon by sharing exactly how that might happen.

The exceptions are people who have really thought about how to get from AI to AGI, but are not in the leading orgs and are either uninterested in racing or want to attract funding and attention for their approach. Yann LeCun comes to mind.

Imagine trying to predict the advent of heavier-than-air flight without studying either birds or mechanical engineering. You'd get predictions like the ones we saw historically - so wild as to be worthless, except those from the people actually trying to achieve that goal.

(copied from LW comment since the discussion is happening over here)

This seems plausible, perhaps more plausible 3 years ago. AGI is so mainstream now that I imagine there are many people who are motivated to advance the conversation but have no horse in the race.

If only the top cadre of AI experts are capable of producing the models, then yes, we might have a problem of making such knowledge a public good.

Perhaps philanthropists can provide bigger incentives to share than their incentives not to share.

Executive summary: The forecasting ecosystem produces accurate predictions but lacks sufficient focus on generating knowledge through facts, reasons, and models, especially for important questions like those related to AGI.

Key points:

  1. Recent high-profile forecasting efforts on AGI and other topics provide forecasts but lack detailed rationales and models.
  2. Elite forecasters face time constraints in tournaments, limiting their ability to deeply explore questions and build comprehensive models.
  3. Published forecasts should cite key facts and primary sources to support their conclusions.
  4. Adversarial collaborations where dissenting forecasters write up a shared view could help resolve debates and persuade the public.
  5. Quantitative models, even if imperfect, can help decompose questions, generate probabilities, and allow for inspection and adjustment.
  6. Focusing on facts, reasons, and models is especially important for AGI forecasting, where accuracy remains low and the stakes are high.

 

 

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.

A couple years ago I was wondering why all the focus is on superforecasters when really we should be emphasizing the best arguments or the best persuaders. It seems like knowing who is best at forecasting is less useful to me than knowing what (or who) would persuade me to change my mind (since I only care about forecasts insofar as they change my mind, anyway). 

The incentive system for this seems simple enough. Imagine instead of upvoting a comment, the comment has a "update your forecast" button. Comments that are persuasive get boosted by the algorithm. Authors who create convincing arguments can get prestige. Authors who create convincing argument that, on balance, lead to people making better forecasts, get even more prestige. 

It could even be a widget that you embed at the beginning and end of off-site articles. That way we could find the "super-bloggers" or "super-journalists" or whatever you want to call them. 

Heck, you could even create another incentive system for the people who are best at finding arguments worth updating on. 

The point is, you need to incentivize more than good forecasts. You need an entire knowledge generation economy. 

There are probably all kinds of ways this gets gamed. But it seems at least worth exploring. Forecasts by themselves are just not that useful. Explanations, not probabilities, are what expert decision-makers rely on. At least that is the case within my field of Naturalistic Decision Making, and it also seems true in Managerial Decision Making - managers don't seem to use probabilities to do Expected Utility calculations, but rather to try to understand the situation and its uncertainties.

This is the conclusion Dominic Cummings came to during the pandemic, as well. Summarized here:

> During the pandemic, Dominic Cummings said some of the most useful stuff that he received and circulated in the British government was not forecasting, it was qualitative information explaining the general model of what’s going on, which enabled decision-makers to think more clearly about their options for action and the likely consequences. If you’re worried about a new disease outbreak, you don’t just want a percentage probability estimate about future case numbers, you want an explanation of how the virus is likely to spread, what you can do about it, how you can prevent it. Not the best estimate for how many COVID cases there will be in a month, but why forecasters believe there will be X COVID cases in a month. 
https://www.samstack.io/p/five-questions-for-michael-story
 
