I’ve written a draft report evaluating a version of the overall case for existential risk from misaligned AI, and taking an initial stab at quantifying the risk from this version of the threat. I’ve made the draft viewable as a public Google Doc here (Edit: arXiv version here, video presentation here, human-narrated audio version here). Feedback would be welcome.
This work is part of Open Philanthropy’s “Worldview Investigations” project. However, the draft reflects my personal (rough, unstable) views, not the “institutional views” of Open Philanthropy.
It's great to see a new examination of what the core AI risk argument is (or should be). I like the focus on "power-seeking", and I think this is a clearer term than "influence-seeking".
I want to articulate a certain intuition that's pinging me. You write:
You also treat this as ~equivalent to:
This is equivalent to saying you're ~95% confident there won't be such a disaster between now and 2070. This seems like an awful lot of confidence to me!
(For the latter probability, you say that you'd "probably bump this up a bit [from 5%] -- maybe by a percentage point or two, though this is especially unprincipled (and small differences are in the noise anyway) -- to account for power-seeking scenarios that don’t strictly fit all the premises above". This still seems like an awful lot of confidence to me!)
To put my immediate reaction into words: From my perspective, the world just looks like the kind of world where "existential catastrophe from misaligned, power-seeking AI by 2070" is true. At least, that seems like the naive extrapolation I'd make if no exciting surprises happened (though I do think there's a decent chance of exciting surprises!).
If the proposition is true, then it's very important to figure that out ASAP. But if the current evidence isn't enough to raise your probability above ~6%, then what evidence would raise it higher? What would a world look like where this claim was obviously true, or at least plausibly true, rather than being (with ~94% confidence) false?
Another way of stating my high-level response: If the answer to a question is X, and you put a lot of work into studying the question and carefully weighing all the considerations, then the end result of your study shouldn't look like '94% confidence not-X'. From my perspective, that's beyond the kind of mistake you should make in any ordinary way, and should require some mistake in methodology.
(Caveat: this comment is my attempt to articulate a different framing than what I think is the more common framing in public, high-visibility EA writing. My sense is that the more common framing is something like "assigning very-high probabilities to catastrophe is extreme, assigning very-low probabilities is conservative". For a full version of my objection, it would be important that I go into the details of your argument rather than stopping here.)
There are some obvious responses to my argument here, like: 'X seems likely to you because of a conjunction fallacy; we can learn from this test that X isn't likely, though it's also not vanishingly improbable.' If a claim is conjunctive enough, and the conjuncts are individually unlikely enough, then you can obviously study a question for months or years and end up ~95% confident of not-X (e.g., 'this urn contains seventeen different colors of balls, so I don't expect the ball I randomly pick to be magenta').
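The urn case can be made concrete with a line or two of arithmetic (assuming, for illustration, that the seventeen colors are equally likely):

```python
# Toy version of the urn example: seventeen equally likely colors means
# high confidence that a randomly drawn ball is not magenta.
colors = 17
p_magenta = 1 / colors
p_not_magenta = 1 - p_magenta
print(round(p_not_magenta, 3))
```

With these numbers the print shows 0.941, i.e. ~94% confidence in not-X reached without any methodological mistake.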
I worry there's possibly something rude about responding to a careful analysis by saying 'this conclusion is just too wrong', without providing an equally detailed counter-analysis or drilling down on specific premises.
(I'm maybe being especially rude in a context like the EA Forum, where I assume a good number of people don't share the perspective that AI is worth worrying about even at the ~5% level!)
You mention the Multiple Stages Fallacy (also discussed here, as "the multiple-stage fallacy"), which is my initial guess as to a methodological crux behind our different all-things-considered probabilities.
But the more basic reason why I felt moved to comment here is a general worry that EAs have a track record of low-balling probabilities of AI risk and large-AI-impacts-soon in their public writing. E.g.:
Back in Sep. 2017, I wrote (based on some private correspondence with researchers):
80,000 Hours is summarizing a research field where 80+% of specialists think that there's >10% probability of existential catastrophe from event A; they stick their neck out to say that these 80+% are wrong, and in fact so ostentatiously wrong that their estimate isn't even in the credible range of estimates, which they assert to be 1-10%; and they seemingly go further by saying this is true for the superset 'severe catastrophes from A' and not just for existential catastrophes from A.
If this were a typical technical field, that would be a crazy thing to do in a career summary, especially without flagging that that's what 80,000 Hours is doing (so readers can decide for themselves how to weight the views of e.g. alignment researchers vs. ML researchers vs. meta-researchers like 80K). You could say that AI is really hard to forecast so it's harder to reach a confident estimate, but that should widen your range of estimates, not squeeze it all into the 1-10% range. Uncertainty isn't an argument for optimism.
There are obvious social reasons one might not want to sound alarmist about a GCR, especially a weird/novel GCR. But—speaking here to EAs as a whole, since it's a lot harder for me to weigh in on whether you're an instance of this trend than for me to weigh in on whether the trend exists at all—I want to emphasize that there are large potential costs to being more quiet about "high-seeming numbers" than "low-seeming numbers" in this domain, analogous to the costs e.g. of experts trying to play down their worries in the early days of the COVID-19 pandemic. Even if each individual decision seems reasonable at the time, the aggregate effect is a very skewed group awareness of reality.
If you're still making this claim now, want to bet on it? (We'd first have to operationalize who counts as an "AI safety researcher".)
I also think it wasn't true in Sep 2017, but I'm less confident about that, and it's not as easy to bet on.
(Am e-mailing with Rohin, will report back e.g. if we check this with a survey.)
Results are in this post.
(Continued from comment on the main thread)
I'm understanding your main points/objections in this comment as:
(as before, let’s call “there will be an existential catastrophe from power-seeking AI before 2070” p).
Re 1 (and 1c, from my response to the main thread): as I discuss in the document, I do think there are questions about multiple-stage fallacies here, though I also think that not decomposing a claim into sub-claims can risk obscuring conjunctiveness (and I don’t see “abandon the practice of decomposing a claim into subclaims” as a solution to this). As an initial step towards addressing some of these worries, I included an appendix that reframes the argument using fewer premises (and also, in positive (e.g., “p is false”) vs. negative (“p is true”) forms). Of course, this doesn’t address e.g. the “the conclusion could be true, but some of the premises false” version of the “multiple stage fallacy” worry; but FWIW, I really do think that the premises here capture the majority of my own credence on p, at least. In particular, the timelines premise is fairly weak, and premises 4-6 are implied by basically any p-like scenario, so it seems like the main contenders for false premises (even while p is true) are 2: (“There will be strong incentives to build APS systems”) and 3: (“It will be much harder to develop APS systems that would be practically PS-aligned if deployed, than to develop APS systems that would be practically PS-misaligned if deployed (even if relevant decision-makers don’t know this), but which are at least superficially attractive to deploy anyway”). Here, I note the scenarios most salient to me in footnote 173, namely: “we might see unintentional deployment of practically PS-misaligned APS systems even if they aren’t superficially attractive to deploy” and “practically PS-misaligned systems might be developed and deployed even absent strong incentives to develop them (for example, simply for the sake of scientific curiosity).” But I don’t see these as constituting more than e.g. 50% of the risk.
If your own probability is driven substantially by scenarios where the premises I list are false, I’d be very curious to hear which ones (setting aside scenarios that aren’t driven by power-seeking, misaligned AI), and how much credence you give them. I’d also be curious, more generally, to hear your more specific disagreements with the probabilities I give to the premises I list.
Re: 2, your characterization of the distribution of views amongst AI safety researchers (outside of MIRI) is in some tension with my own evidence; and I consulted with a number of people who fit your description of “specialists”/experts in preparing the document. That said, I’d certainly be interested to see more public data in this respect, especially in a form that breaks down in (rough) quantitative terms the different factors driving the probability in question, as I’ve tried to do in the document (off the top of my head, the public estimates most salient to me are Ord (2020) at 10% by 2100, Grace et al (2017)’s expert survey (5% median, with no target date), and FHI’s (2008) survey (5% on extinction from superintelligent AI by 2100), though we could gather up others from e.g. LW and previous X-risk books.) That said, importantly, and as indicated in my comment on the main thread, I don’t think of the community of AI safety researchers at the orgs you mention as in an epistemic position analogous to e.g. the IPCC, for a variety of reasons (and obviously, there are strong selection effects at work). Less importantly, I also don’t think the technical aspects of this problem are the only factors relevant to assessing risk; at this point I have some feeling of having “heard the main arguments”; and >10% (especially if we don’t restrict to pre-2070 scenarios) is within my “high-low” range mentioned in footnote 178 (e.g., .1%-40%).
Re: 3, I do think that the “conservative” thing to do here is to focus on the higher-end estimates (especially given uncertainty/instability in the numbers), and I may revise to highlight this more in the text. But I think we should distinguish between the project of figuring out “what to focus on”/what’s “appropriately conservative,” and what our actual best-guess probabilities are; and just as there are risks of low-balling for the sake of not looking weird/alarmist, I think there are risks of high-balling for the sake of erring on the side of caution. My aim here has been to do neither; though obviously, it’s hard to eliminate biases (in both directions).
I think I share Robby's sense that the methodology seems like it will obscure truth.
That said, I have neither your (Joe's) extensive philosophical background, nor have I spent the kind of substantial time on a report like this that you have, and I am interested in evidence to the contrary.
To me, it seems like you've tried to lay out an argument in 6 steps, each of which you think accurately carves the parts of reality that are relevant, and pondered each step for quite a while.
When I ask myself whether I've seen something like this produce great insight, it's hard. It's not something I've done much myself explicitly. However, I can think of a nearby example where I think this has produced great insight, which is Nick Bostrom's work. I think (?) Nick spends a lot of his time considering a simple, single key argument, looking at it from lots of perspectives, scrutinizing wording, asking what people from different scientific fields would think of it, poking and prodding and rotating and just exploring it. Through that work, I think he's been able to find considerations that were very surprising and invalidated the arguments, and proposed very different arguments instead.
When I think of examples here, I'm imagining that this sort of intellectual work produced the initial arguments about astronomical waste, and arguments since then about unilateralism and the vulnerable world hypothesis. Oh, and also simulation hypothesis (which became a tripartite structure).
I think of Bostrom as trying to consider a single worldview, and find out whether it's a consistent object. One feeling I have about turning it into a multi-step probabilistic argument is that it does the opposite, it does not try to examine one worldview to find falsehoods, but instead integrates over all the parts of the worldview that Bostrom would scrutinize, to make a single clump of lots of parts of different worldviews. I think Bostrom may have literally never published a six-step argument of the form that you have, where it was meant to hold anything of weight in the paper or book, and also never done so assigning each step a probability.
To be clear, probabilistic discussions are great. Talking about precisely how strong a piece of evidence is (is it 2:1, 10:1, 100:1?) helps a lot in noticing which hypotheses to even pay attention to. The suspicion I have is that they are fairly different from the kind of cognition Bostrom does when doing the sort of philosophical argumentation that produces simple arguments of world-shattering importance. I suspect you've set yourself a harder task than Bostrom ever has (a 6-step argument), and that you've made it seem easier by making it only probabilistic instead of deductive, whereas in fact this removes most of the tools that Bostrom was able to use to ensure he didn't take mis-steps.
But I am pretty interested if there are examples of great work using your methodology that you were inspired by when writing this up, or great works with nearby methodologies that feel similar to you. I'd be excited to read/discuss some.
I tried to look for writing like this. I think that people do multiple hypothesis testing, like Harry in chapter 86 of HPMOR. There Harry is trying to weigh some different hypotheses against each other to explain his observations. There isn't really a single train of conditional steps that constitutes the whole hypothesis.
My shoulder-Scott-Alexander is telling me (somewhat similar to my shoulder-Richard-Feynman) that there's a lot of ways to trick myself with numbers, and that I should only do very simple things with them. I looked through some of his posts just now (1, 2, 3, 4, 5).
Here's an example of a conclusion / belief from Scott's post Teachers: Much More Than You Wanted to Know:
I don't know any post where Scott says "there's a particular 6-step argument, and I assign 6 different probabilities to each step, and I trust that outcome number seems basically right". His conclusions read more like 1 key number with some uncertainty, which never came from a single complex model, but from aggregating loads of little studies and pieces of evidence into a judgment.
I can't think of a post like this by Scott or Robin or Eliezer or Nick or anyone. But I would be interested in an example that is like this (from other fields or wherever), or feels similar.
Maybe not 'insight', but re. 'accuracy' this sort of decomposition is often in the tool box of better forecasters. I think the longest path I evaluated in a question had 4 steps rather than 6, and I think I've seen other forecasters do similar things on occasion. (The general practice of 'breaking down problems' to evaluate sub-issues is recommended in Superforecasting IIRC).
I guess the story for why this works in geopolitical forecasting is that folks tend to overestimate the chance 'something happens' and tend to be underdamped in increasing the likelihood of something based on suggestive antecedents (e.g. chance of a war given an altercation, etc.). So attending to "Even if A, for it to lead to D one should attend to P(B|A), P(C|B), etc." tends to lead to downwards corrections.
Naturally, you can mess this up. Although it's not obvious you are at greater risk if you arrange your decomposed considerations conjunctively or disjunctively: "All of A-E must be true for P to be true" ~also means "if any of ¬A-¬E are true, then ¬P". In natural language and heuristics, I can imagine "Here are several different paths to P, and each of these seem not-too-improbable, so P must be highly likely" could also lead one astray.
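The symmetry here can be illustrated with toy numbers (the probabilities below are made up for illustration, not anyone's actual estimates):

```python
# Conjunctive decomposition: P requires all of A..E to hold, so each
# extra conjunct shrinks the product.
stage_probs = [0.65, 0.80, 0.40, 0.65, 0.40]
p_conjunctive = 1.0
for p in stage_probs:
    p_conjunctive *= p

# Disjunctive decomposition: P holds if any one of several independent
# paths occurs, so each extra path grows the total.
path_probs = [0.10, 0.15, 0.05]
p_none = 1.0
for p in path_probs:
    p_none *= (1 - p)
p_disjunctive = 1 - p_none

print(round(p_conjunctive, 3))
print(round(p_disjunctive, 3))
```

With these numbers the conjunctive product comes out around 0.054 while the disjunctive total comes out around 0.273, even though no individual path exceeds 0.15; either framing can lead one astray if the independence assumptions are wrong.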
Hi Ben,
A few thoughts on this:
Overall, my sense is that disagreement here is probably more productively focused on the object level -- e.g., on the actual probabilities I give to the premises, and/or on pointing out and giving weight to scenarios that the premises don’t cover -- rather than on the methodology in the abstract. In particular, I doubt that people who disagree a lot with my bottom line will end up saying: “If I was to do things your way, I’d roughly agree with the probabilities you gave to the premises; I just disagree that you should assign probabilities to premises in a multi-step argument as a way of thinking about issues like this.” Rather, I expect a lot of it comes down to substantive disagreement about the premises at issue (and perhaps, to people assigning significant credence to scenarios that don’t fit these premises, though I don't feel like I've yet heard strong candidates -- e.g., ones that seem to me to plausibly account for, say, >2/3rds of the overall X-risk from power-seeking, misaligned AI by 2070 -- in this regard).
Thanks for the thoughtful reply.
I do think I was overestimating how firmly you're treating your numbers and premises; it seems like you're holding them all much more lightly than I'd been envisioning.
FWIW I am more interested in engaging with some of what you wrote in your other comment than engaging on the specific probability you assign, for some of the reasons I wrote about here.
I think I have more I could say on the methodology, but alas, I'm pretty blocked up with other work atm. It'd be neat to spend more time reading the report and leave more comments here sometime.
This links to A Sketch of Good Communication, not whichever comment you were intending to link :)
Fixed, tah.
Great comment :)
The upshot seems to be that Joe, 80k, the AI researcher survey (2008), Holden-2016 are all at about a 3% estimate of AI risk, whereas AI safety researchers now are at about 30%. The latter is a bit lower (or at least differently distributed) than Rob expected, and seems higher than among Joe's advisors.
The divergence is big, but pretty explainable, because it concords with the direction that apparent biases point in. For the 3% camp, the credibility of one's name, brand, or field benefits from making lowball estimates, whereas the 30% camp is self-selected for severe concern. And risk perception all-round has increased a bit in the last 5-15 years due to Deep Learning.
Re 80K's 2017 take on the risk level: You could also say that the AI safety field is crazy and people in it are very wrong, as part of a case for lower risk probabilities. There are some very unhealthy scientific fields out there. Also, technology forecasting is hard. A career-evaluating group could investigate a field like climate change, decide that researchers in the field are very confused about the expected impact of climate change, but still think it's an important enough problem to warrant sending lots of people to work on the problem. But in that case, I'd still want 80K to explicitly argue that point, and note the disagreement.
I previously complained about this on LessWrong.
I think there is a tenable view that considers an AI catastrophe less likely than what AI safety researchers think but is not committed to anything nearly as strong as the field being "crazy" or people in it being "very wrong":
We might simply think that people are more likely to work on AI safety if they consider an AI catastrophe more likely. When considering their beliefs as evidence we'd then need to correct for that selection effect.
[ETA: I thought I should maybe add that even the direction of the update doesn't seem fully clear. It depends on assumptions about the underlying population. E.g. if we think that everyone's credence is determined by an unbiased but noisy process, then people with high credences will self-select into AI safety because of noise, and we should think the 'correct' credence is lower than what they say. On the other hand, if we think that there are differences in how people form their beliefs, then it at least could be the case that some people are simply better at predicting AI catastrophes, or are fast at picking up 'warning signs', and if AI risk is in fact high then we would see a 'vanguard' of people self-selecting into AI safety early who also will have systematically more accurate beliefs about AI risk than the general population.]
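A minimal Monte Carlo sketch of the first scenario in the ETA (all parameters hypothetical): everyone's credence is an unbiased but noisy estimate of the same true risk, and people self-select into AI safety when their credence clears a threshold.

```python
import random

random.seed(0)

TRUE_RISK = 0.10   # hypothetical underlying probability of catastrophe
NOISE_SD = 0.08    # everyone's credence = truth + unbiased noise
THRESHOLD = 0.10   # people enter the field if their credence exceeds this

# Clamp noisy credences to [0, 1].
population = [min(max(random.gauss(TRUE_RISK, NOISE_SD), 0.0), 1.0)
              for _ in range(100_000)]
field = [c for c in population if c > THRESHOLD]

mean_pop = sum(population) / len(population)
mean_field = sum(field) / len(field)

# Under this model, self-selection pushes the field's average credence
# above both the population average and the true risk.
print(round(mean_pop, 3), round(mean_field, 3))
```

Under this model the field's average stated credence overshoots the true risk, so a naive reading of the field's credences would call for a downward correction; the second scenario in the ETA, where belief-forming processes genuinely differ in accuracy, is not captured by this sketch.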
(I am sympathetic to "I'd still want 80K to explicitly argue that point, and note the disagreement.", though haven't checked to what extent they might do that elsewhere.)
Yeah, I like this correction.
Though in the world where the credible range of estimates is 1-10%, and 80% of the field believed the probability were >10% (my prediction from upthread), that would start to get into 'something's seriously wrong with the field' territory from my perspective; that's not a small disagreement.
(I'm assuming here, as I did when I made my original prediction, that they aren't all clustered around 15% or whatever; rather, I'd have expected a lot of the field to give a much higher probability than 10%.)
Hi Rob,
Thanks for these comments.
Let’s call “there will be an existential catastrophe from power-seeking AI before 2070” p. I’m understanding your main objections in this comment as:
One thing I’ll note at the outset is the content of footnote 178, which (partly prompted by your comment) I may revise to foreground more in the main text: “In sensitivity tests, where I try to put in ‘low-end’ and ‘high-end’ estimates for the premises above, this number varies between ~.1% and ~40% (sampling from distributions over probabilities narrows this range a bit, but it also fails to capture certain sorts of correlations). And my central estimate varies between ~1-10% depending on my mood, what considerations are salient to me at the time, and so forth. This instability is yet another reason not to put too much weight on these numbers. And one might think variation in the direction of higher risk especially worrying.”
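For concreteness, the kind of sensitivity test the footnote describes can be sketched as follows (the six premise ranges below are placeholders, not the report's actual numbers):

```python
import random

random.seed(1)

# Placeholder (low, high) ranges for six premise probabilities.
ranges = [(0.4, 0.8), (0.6, 0.9), (0.2, 0.6),
          (0.4, 0.8), (0.2, 0.6), (0.8, 1.0)]

def product(ps):
    out = 1.0
    for p in ps:
        out *= p
    return out

# 'Low-end' and 'high-end' estimates: multiply all lows, then all highs.
low_end = product(lo for lo, _ in ranges)
high_end = product(hi for _, hi in ranges)

# Monte Carlo: sample each premise independently and look at the spread
# of the product (the 5th-95th percentile band).
samples = sorted(
    product(random.uniform(lo, hi) for lo, hi in ranges)
    for _ in range(50_000)
)
p5, p95 = samples[2_500], samples[47_500]

print(round(low_end, 4), round(high_end, 4))
print(round(p5, 4), round(p95, 4))
```

As the footnote says, sampling independently narrows the band relative to the pure low-end/high-end products, and it also bakes in an independence assumption; correlated errors across premises would widen the spread again.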
Re 1a: I’m open to 5% being too low. Indeed, I take “95% seems awfully confident,” and related worries in that vein, seriously as an objection. However, as the range above indicates, I also feel open to 5% being too high (indeed, at times it seems that way to me), and I don’t see “it would be strange to be so confident that all of humanity won’t be killed/disempowered because of X” as a forceful argument on its own (quite the contrary): rather, I think we really need to look at the object-level evidence and argument for X, which is what the document tries to do (not saying that quote represents your argument; but hopefully it can illustrate why one might start from a place of being unsurprised if the probability turns out low).
Re 1b: I’m not totally sure I’ve understood you here, but here are a few thoughts. At a high level, one answer to “what sort of evidence would make me update towards p being more likely” is “the considerations discussed in the document that I see as counting against p don’t apply, or seem less plausible” (examples here include considerations related to longer timelines, non-APS/modular/specialized/myopic/constrained/incentivized/not-able-to-easily-intelligence-explode systems sufficing in lots/maybe ~all of incentivized applications, questions about the ease of eliminating power-seeking behavior on relevant inputs during training/testing given default levels of effort, questions about why and in what circumstances we might expect PS-misaligned systems to be superficially/sufficiently attractive to deploy, warning shots, corrective feedback loops, limitations to what APS systems with lopsided/non-crazily-powerful capabilities can do, general incentives to avoid/prevent ridiculously destructive deployment, etc, plus more general considerations like “this feels like a very specific way things could go”).
But we could also imagine more “outside view” worlds where my probability would be higher: e.g., there is a body of experts as large and established as the experts working on climate change, which uses quantitative probabilistic models of the quality and precision used by the IPCC, along with an understanding of the mechanisms underlying the threat as clear and well-established as the relationship between carbon emissions and climate change, to reach a consensus on much higher estimates. Or: there is a significant, well-established track record of people correctly predicting future events and catastrophes of this broad type decades in advance, and people with that track record predict p with >5% probability.
That said, I think maybe this isn’t getting at the core of your objection, which could be something like: “if in fact this is a world where p is true, is your epistemology sensitive enough to that? E.g., show me that your epistemology is such that, if p is true, it detects p as true, or assigns it significant probability.” I think there may well be something to objections in this vein, and I'm interested in thinking about them more; but I also want to flag that at a glance, it feels kind of hard to articulate them in general terms. Thus, suppose Bob has been wrong about 99/100 predictions in the past. And you say: “OK, but if Bob was going to be right about this one, despite being consistently wrong in the past, the world would look just like it does now. Show me that your epistemology is sensitive enough to assign high probability to Bob being right about this one, if he’s about to be.” But this seems like a tough standard; you just should have low probability on Bob being right about this one, even if he is. Not saying that’s the exact form of your objection, or even that it's really getting at the heart of things, but maybe you could lay out your objection in a way that doesn’t apply to the Bob case?
(Responses to 1c below)
Could you clarify what you mean by this? I think I don't understand what the word "true", italicized, is supposed to mean here. Are you just reporting the impression (i.e. a belief not adjusted to account for other people's beliefs) that you are ~100% certain an existential catastrophe from misaligned, power-seeking AI will (by default) occur by 2070? Or are you saying that this is what prima facie seems to you to be the case, when you extrapolate naively from current trends? The former seems very overconfident (even conditional on an existential catastrophe occurring by that date, it is far from certain that it will be caused by misaligned AI), whereas the latter looks pretty uninformative, given that it leaves open the possibility that the estimate will be substantially revised downward after additional considerations are incorporated (and you do note that you think "there's a decent chance of exciting surprises"). Or perhaps you meant neither of these things?
I guess the most helpful thing (at least to someone like me who's trying to make sense of this apparent disagreement between you and Joe) would be for you to state explicitly what probability assignment you think the totality of the evidence warrants (excluding evidence derived from the fact that other reasonable people have beliefs about this), so that one can then judge whether the discrepancy between your estimate and Joe's is so significant that it suggests "some mistake in methodology" on your part or his, rather than a more mundane mistake.
A pattern I think I've seen with a fair number of EAs is that they'll start with a pretty well-calibrated impression of how serious AGI risk is; but then they'll worry that if they go around quoting a P(doom) like "25%" or "70%" (especially if the cause is something as far-fetched as AI), they'll look like a crackpot. So the hypothetical EA tries to find a way to justify a probability more like 1-10%, so they can say the moderate-sounding "AI disaster is unlikely, but the EV is high", rather than the more crazy-sounding "AI disaster is likely".
This obviously isn't the only reason people assign low probabilities to AI x-catastrophe, and I don't at all know whether that pattern applies here (and I haven't read Joe's replies here yet); and it's rude to open a conversation by psychologizing. Still, I wanted to articulate some perspectives from which there's less background pressure to try to give small probabilities to crazy-sounding scenarios, on the off chance that Joe or some third party found it helpful:
The latter two points especially are what I was trying (and probably failing) to communicate with "'existential catastrophe from misaligned, power-seeking AI by 2070' is true."
Define a 'science AGI' system as one that can match top human thinkers in at least two big ~unrelated hard-science fields (e.g., particle physics and organic chemistry).
If the first such systems are roughly as opaque as 2020's state-of-the-art ML systems (e.g., GPT-3) and the world order hasn't already been upended in some crazy way (e.g., there isn't a singleton), then I expect an AI-mediated existential catastrophe with >95% probability.
I don't have an unconditional probability that feels similarly confident/stable to me, but I think those two premises have high probability, both individually and jointly. This isn't the same proposition Joe was evaluating, but it maybe illustrates why I have a very different high-level take on "probability of existential catastrophe from misaligned, power-seeking AI".
"X happens" and "X doesn't happen" are not symmetrical once I know that X is a specific event. Most things at the level of specificity of "humans build an AI that outmaneuvers humans to permanently disempower them" just don't happen.
The reason we are even entertaining this scenario is because of a special argument that it seems very plausible. If that's all you've got---if there's no other source of evidence than the argument---then you've just got to start talking about the probability that the argument is right.
And the argument actually is a brittle and conjunctive thing. (Humans do need to be able to build such an AI by the relevant date, they do need to decide to do so, the AI they build does need to decide to disempower humans notwithstanding a prima facie incentive for humans to avoid that outcome.)
That doesn't mean this is the argument or that the argument is brittle in this way---there might be a different argument that explains in one stroke why several of these things will happen. In that case, it's going to be more productive to talk about that.
(For example, in the context of the multi-stage argument undershooting success probabilities, it's that people will be competently trying to achieve X and most of uncertainty is estimating how hard and how effectively people are trying---which is correlated across steps. So you would do better by trying to go for the throat and reason about the common cause of each success, and you will always lose if you don't see that structure.)
And of course some of those steps may really just be quite likely and one shouldn't be deterred from putting high probabilities on highly-probable things. E.g. it does seem like people have a very strong incentive to build powerful AI systems (and moreover the extrapolation suggesting that we will be able to build powerful AI systems is actually about the systems we observe in practice and already goes much of the way to suggesting that we will do so). Though I do think that the median MIRI staff-member's view is overconfident on many of these points.
There's probably more, I haven't thought very long about it.
(Before responses of the form "what about e.g. the botched COVID response?", let me note that this is about additional evidence; I'm not denying that there is existing evidence.)
My basic perspective here is pretty well-captured by Being Half-Rational About Pascal's Wager is Even Worse, and by a related passage in Hero Licensing.
Can you talk about your estimate of the overall AI-related x-risk (see here for an attempt at a comprehensive list), as well as total x-risk from all sources? (If your overall AI-related x-risk is significantly higher than 5%, what do you think are the other main sources?) I think it would be a good idea for anyone discussing a specific type of x-risk to also give their more general estimates, for a few reasons:
One thing that I think would really help me read this document would be (from Joe) a sense of "here's the parts where my mind changed the most in the course of this investigation".
Something like (note that this is totally made up): "there's a particular exploration of alignment where I had conceptualized it as being about making the AI think right, but now I conceptualize it as being about not thinking wrong, which I explore in section a.b.c".
Also maybe something like a sense of which of the premises Joe changed his mind on the most – where the probabilities shifted a lot.
Hi Ben,
This does seem like a helpful kind of content to include (here I think of Luke’s section on this here, in the context of his work on moral patienthood). I’ll consider revising to say more in this vein. In the meantime, here are a few updates off the top of my head:
Great answer, thanks.
Hey Joe!
Great report, really fascinating stuff. It draws together lots of different writing on the subject, and I really like how you identify concerns that speak to different perspectives (e.g. to Drexler's CAIS and to classic Bostrom superintelligence).
Three quick bits of feedback:
Which is, unfortunately, a pretty key premise and the one I have the most questions about! My impression is that section 6.3 is where that argumentation is intended to occur, but I didn't leave it with a sense of how you thought this would scale, disempower everyone, and be permanent. Would love for you to say more on this.
Presumably we should also be worried about a small group doing this as well? For example, consider a scenario in which a power-hungry small group, or several competing groups, use aligned AI systems with advanced capabilities (perhaps APS, perhaps not) to the point of permanently disempowering ~all of humanity.
If I went through and find-replaced all the "PS-misaligned AI system" mentions with "power-hungry small group", would it read that differently? To borrow Tegmark's terms, does it matter if it's the Omega Team or Prometheus?
I'd be interested in seeing some more from you about whether you're also concerned about that scenario, whether you're more/less concerned, and how you think it's different from the AI system scenario.
Again, really loved the report, it is truly excellent work.
Hi Hadyn,
Thanks for your kind words, and for reading.
Oh and:
4. Cotra aims to predict when it will be possible for "a single computer program [to] perform a large enough diversity of intellectual labor at a high enough level of performance that it alone can drive a transition similar to the Industrial Revolution." - that is a "growth rate [of the world economy of] 20%-30% per year if used everywhere it would be profitable to use"
Your scenario is premise 4 "Some deployed APS systems will be exposed to inputs where they seek power in unintended and high-impact ways (say, collectively causing >$1 trillion dollars of damage), because of problems with their objectives" (italics added).
Your bar is (much?) lower, so we should expect your scenario to come (much?) earlier.
Thanks for this work!
I'm wondering about "crazy teenager builds misaligned APS system in a basement" scenarios and to what extent you see the considerations in this report as bearing on those.
To be a bit more precise: I'm thinking about worlds where "alignment is easy" for society at large (i.e. your claim 3 is not true), but building powerful AI is feasible even for people who are not interested in taking the slightest precautions, even those that would be recommended by ordinary self-interest. I think mostly about individuals or small groups rather than organizations.
I think these scenarios are distinct from misuse scenarios (which you mention below your report is not intended to cover), though the line is blurry. If someone who wanted to see enormous damage to the world built an AI with the intent of causing such damage, and was successful, I'd call that "misuse." But I'm interested more in "crazy" than "omnicidal" here, where I don't think it's clear whether to call this "misuse" or not.
Maybe you see this as a pretty separate type of worry than what the report is intended to cover.