I’ve written a draft report evaluating a version of the overall case for existential risk from misaligned AI, and taking an initial stab at quantifying the risk from this version of the threat. I’ve made the draft viewable as a public Google Doc here (Edit: arXiv version here, video presentation here, human-narrated audio version here). Feedback would be welcome.
This work is part of Open Philanthropy’s “Worldview Investigations” project. However, the draft reflects my personal (rough, unstable) views, not the “institutional views” of Open Philanthropy.
It's great to see a new examination of what the core AI risk argument is (or should be). I like the focus on "power-seeking", and I think this is a clearer term than "influence-seeking".
I want to articulate a certain intuition that's pinging me. You write:
You also treat this as ~equivalent to:
This is equivalent to saying you're ~95% confident there won't be such a disaster between now and 2070. This seems like an awful lot of confidence to me!
(For the latter probability, you say that you'd "probably bump this up a bit [from 5%] -- maybe by a percentage point or two, though this is especially unprincipled (and small differences are in the noise anyway) -- to account for power-seeking scenarios that don’t strictly fit all the premises above". This still seems like an awful lot of confidence to me!)
To put my immediate reaction into words: From my perspective, the world just looks like the kind of world where "existential catastrophe from misaligned, power-seeking AI by 2070" is true. At least, that seems like the naive extrapolation I'd make if no exciting surprises happened (though I do think there's a decent chance of exciting surprises!).
If the proposition is true, then it's very important to figure that out ASAP. But if the current evidence isn't enough to raise your probability above ~6%, then what evidence would raise it higher? What would a world look like where this claim was obviously true, or at least plausibly true, rather than being (with ~94% confidence) false?
Another way of stating my high-level response: If the answer to a question is X, and you put a lot of work into studying the question and carefully weighing all the considerations, then the end result of your study shouldn't look like '94% confidence not-X'. From my perspective, that's beyond the kind of mistake you should make in any ordinary way, and should require some mistake in methodology.
(Caveat: this comment is my attempt to articulate a different framing than what I think is the more common framing in public, high-visibility EA writing. My sense is that the more common framing is something like "assigning very-high probabilities to catastrophe is extreme, assigning very-low probabilities is conservative". For a full version of my objection, it would be important that I go into the details of your argument rather than stopping here.)
There are some obvious responses to my argument here, like: 'X seems likely to you because of a conjunction fallacy; we can learn from this test that X isn't likely, though it's also not vanishingly improbable.' If a claim is conjunctive enough, and the conjuncts are individually unlikely enough, then you can obviously study a question for months or years and end up ~95% confident of not-X (e.g., 'this urn contains seventeen different colors of balls, so I don't expect the ball I randomly pick to be magenta').
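The urn case can be made concrete with a line or two of arithmetic (assuming, for illustration, that the seventeen colors are equally likely):

```python
# Toy version of the urn example: seventeen equally likely colors means
# high confidence that a randomly drawn ball is not magenta.
colors = 17
p_magenta = 1 / colors
p_not_magenta = 1 - p_magenta
print(round(p_not_magenta, 3))
```

With these numbers the print shows 0.941, i.e. ~94% confidence in not-X reached without any methodological mistake.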
I worry there's possibly something rude about responding to a careful analysis by saying 'this conclusion is just too wrong', without providing an equally detailed counter-analysis or drilling down on specific premises.
(I'm maybe being especially rude in a context like the EA Forum, where I assume a good number of people don't share the perspective that AI is worth worrying about even at the ~5% level!)
You mention the Multiple Stages Fallacy (also discussed here, as "the multiple-stage fallacy"), which is my initial guess as to a methodological crux behind our different all-things-considered probabilities.
But the more basic reason why I felt moved to comment here is a general worry that EAs have a track record of low-balling probabilities of AI risk and large-AI-impacts-soon in their public writing. E.g.:
Back in Sep. 2017, I wrote (based on some private correspondence with researchers):
80,000 Hours is summarizing a research field where 80+% of specialists think that there's >10% probability of existential catastrophe from event A; they stick their neck out to say that these 80+% are wrong, and in fact so ostentatiously wrong that their estimate isn't even in the credible range of estimates, which they assert to be 1-10%; and they seemingly go further by saying this is true for the superset 'severe catastrophes from A' and not just for existential catastrophes from A.
If this were a typical technical field, that would be a crazy thing to do in a career summary, especially without flagging that that's what 80,000 Hours is doing (so readers can decide for themselves how to weight the views of e.g. alignment researchers vs. ML researchers vs. meta-researchers like 80K). You could say that AI is really hard to forecast so it's harder to reach a confident estimate, but that should widen your range of estimates, not squeeze it all into the 1-10% range. Uncertainty isn't an argument for optimism.
There are obvious social reasons one might not want to sound alarmist about a GCR, especially a weird/novel GCR. But—speaking here to EAs as a whole, since it's a lot harder for me to weigh in on whether you're an instance of this trend than for me to weigh in on whether the trend exists at all—I want to emphasize that there are large potential costs to being more quiet about "high-seeming numbers" than "low-seeming numbers" in this domain, analogous to the costs e.g. of experts trying to play down their worries in the early days of the COVID-19 pandemic. Even if each individual decision seems reasonable at the time, the aggregate effect is a very skewed group awareness of reality.
If you're still making this claim now, want to bet on it? (We'd first have to operationalize who counts as an "AI safety researcher".)
I also think it wasn't true in Sep 2017, but I'm less confident about that, and it's not as easy to bet on.
(Am e-mailing with Rohin, will report back e.g. if we check this with a survey.)
Results are in this post.
(Continued from comment on the main thread)
I'm understanding your main points/objections in this comment as:
(as before, let’s call “there will be an existential catastrophe from power-seeking AI before 2070” p).
Re 1 (and 1c, from my response to the main thread): as I discuss in the document, I do think there are questions about multiple-stage fallacies here, though I also think that not decomposing a claim into sub-claims can risk obscuring conjunctiveness (and I don’t see “abandon the practice of decomposing a claim into subclaims” as a solution to this). As an initial step towards addressing some of these worries, I included an appendix that reframes the argument using fewer premises (and also, in positive (e.g., “p is false”) vs. negative (“p is true”) forms). Of course, this doesn’t address e.g. the “the conclusion could be true, but some of the premises false” version of the “multiple stage fallacy” worry; but FWIW, I really do think that the premises here capture the majority of my own credence on p, at least. In particular, the timelines premise is fairly weak, and premises 4-6 are implied by basically any p-like scenario, so it seems like the main contenders for false premises (even while p is true) are 2: (“There will be strong incentives to build APS systems”) and 3: (“It will be much harder to develop APS systems that would be practically PS-aligned if deployed, than to develop APS systems that would be practically PS-misaligned if deployed (even if relevant decision-makers don’t know this), but which are at least superficially attractive to deploy anyway”). Here, I note the scenarios most salient to me in footnote 173, namely: “we might see unintentional deployment of practically PS-misaligned APS systems even if they aren’t superficially attractive to deploy” and “practically PS-misaligned systems might be developed and deployed even absent strong incentives to develop them (for example, simply for the sake of scientific curiosity).” But I don’t see these as constituting more than e.g. 50% of the risk.
If your own probability is driven substantially by scenarios where the premises I list are false, I’d be very curious to hear which ones (setting aside scenarios that aren’t driven by power-seeking, misaligned AI), and how much credence you give them. I’d also be curious, more generally, to hear your more specific disagreements with the probabilities I give to the premises I list.
Re: 2, your characterization of the distribution of views amongst AI safety researchers (outside of MIRI) is in some tension with my own evidence; and I consulted with a number of people who fit your description of “specialists”/experts in preparing the document. That said, I’d certainly be interested to see more public data in this respect, especially in a form that breaks down in (rough) quantitative terms the different factors driving the probability in question, as I’ve tried to do in the document (off the top of my head, the public estimates most salient to me are Ord (2020) at 10% by 2100, Grace et al (2017)’s expert survey (5% median, with no target date), and FHI’s (2008) survey (5% on extinction from superintelligent AI by 2100), though we could gather up others from e.g. LW and previous X-risk books.) That said, importantly, and as indicated in my comment on the main thread, I don’t think of the community of AI safety researchers at the orgs you mention as in an epistemic position analogous to e.g. the IPCC, for a variety of reasons (and obviously, there are strong selection effects at work). Less importantly, I also don’t think the technical aspects of this problem are the only factors relevant to assessing risk; at this point I have some feeling of having “heard the main arguments”; and >10% (especially if we don’t restrict to pre-2070 scenarios) is within my “high-low” range mentioned in footnote 178 (e.g., .1%-40%).
Re: 3, I do think that the “conservative” thing to do here is to focus on the higher-end estimates (especially given uncertainty/instability in the numbers), and I may revise to highlight this more in the text. But I think we should distinguish between the project of figuring out “what to focus on”/what’s “appropriately conservative,” and what our actual best-guess probabilities are; and just as there are risks of low-balling for the sake of not looking weird/alarmist, I think there are risks of high-balling for the sake of erring on the side of caution. My aim here has been to do neither; though obviously, it’s hard to eliminate biases (in both directions).
I think I share Robby's sense that the methodology seems like it will obscure truth.
That said, I have neither your (Joe's) extensive philosophical background, nor have I spent the kind of substantial time on a report like this that you have, and I am interested in evidence to the contrary.
To me, it seems like you've tried to lay out an argument in 6 steps, each of which you think accurately carves the parts of reality that are relevant, and pondered each step for quite a while.
When I ask myself whether I've seen something like this produce great insight, it's hard. It's not something I've done much myself explicitly. However, I can think of a nearby example where I think this has produced great insight, which is Nick Bostrom's work. I think (?) Nick spends a lot of his time considering a simple, single key argument, looking at it from lots of perspectives, scrutinizing wording, asking what people from different scientific fields would think of it, poking and prodding and rotating and just exploring it. Through that work, I think he's been able to find considerations that were very surprising and invalidated the arguments, and proposed very different arguments instead.
When I think of examples here, I'm imagining that this sort of intellectual work produced the initial arguments about astronomical waste, and arguments since then about unilateralism and the vulnerable world hypothesis. Oh, and also simulation hypothesis (which became a tripartite structure).
I think of Bostrom as trying to consider a single worldview, and find out whether it's a consistent object. One feeling I have about turning it into a multi-step probabilistic argument is that it does the opposite, it does not try to examine one worldview to find falsehoods, but instead integrates over all the parts of the worldview that Bostrom would scrutinize, to make a single clump of lots of parts of different worldviews. I think Bostrom may have literally never published a six-step argument of the form that you have, where it was meant to hold anything of weight in the paper or book, and also never done so assigning each step a probability.
To be clear, probabilistic discussions are great. Talking about precisely how strong a piece of evidence is (is it 2:1, 10:1, 100:1?) helps a lot in noticing which hypotheses to even pay attention to. The suspicion I have is that they are fairly different from the kind of cognition Bostrom does when doing the sort of philosophical argumentation that produces simple arguments of world-shattering importance. I suspect you've set yourself a harder task than Bostrom ever has (a 6-step argument), and that you've made it seem easier by making it only probabilistic instead of deductive, whereas in fact this removes most of the tools that Bostrom was able to use to ensure he didn't take mis-steps.
But I am pretty interested if there are examples of great work using your methodology that you were inspired by when writing this up, or great works with nearby methodologies that feel similar to you. I'd be excited to read/discuss some.
I tried to look for writing like this. I think that people do multiple hypothesis testing, like Harry in chapter 86 of HPMOR. There Harry is trying to weigh some different hypotheses against each other to explain his observations. There isn't really a single train of conditional steps that constitutes the whole hypothesis.
My shoulder-Scott-Alexander is telling me (somewhat similar to my shoulder-Richard-Feynman) that there's a lot of ways to trick myself with numbers, and that I should only do very simple things with them. I looked through some of his posts just now (1, 2, 3, 4, 5).
Here's an example of a conclusion / belief from Scott's post Teachers: Much More Than You Wanted to Know:
I don't know any post where Scott says "there's a particular 6-step argument, and I assign 6 different probabilities to each step, and I trust that outcome number seems basically right". His conclusions read more like 1 key number with some uncertainty, which never came from a single complex model, but from aggregating loads of little studies and pieces of evidence into a judgment.
I can't think of a post like this by Scott or Robin or Eliezer or Nick or anyone. But I would be interested in an example that is like this (from other fields or wherever), or feels similar.
Maybe not 'insight', but re. 'accuracy' this sort of decomposition is often in the tool box of better forecasters. I think the longest path I evaluated in a question had 4 steps rather than 6, and I think I've seen other forecasters do similar things on occasion. (The general practice of 'breaking down problems' to evaluate sub-issues is recommended in Superforecasting IIRC).
I guess the story for why this works in geopolitical forecasting is that folks tend to overestimate the chance 'something happens' and tend to be underdamped in increasing the likelihood of something based on suggestive antecedents (e.g. chance of a war given an altercation, etc.). So attending to "Even if A, for it to lead to D one should attend to P(B|A), P(C|B), etc." tends to lead to downwards corrections.
Naturally, you can mess this up. Although it's not obvious you are at greater risk if you arrange your decomposed considerations conjunctively or disjunctively: "All of A-E must be true for P to be true" ~also means "if any of ¬A-¬E are true, then ¬P". In natural language and heuristics, I can imagine "Here are several different paths to P, and each of these seem not-too-improbable, so P must be highly likely" could also lead one astray.
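The symmetry here can be illustrated with toy numbers (the probabilities below are made up for illustration, not anyone's actual estimates):

```python
# Conjunctive decomposition: P requires all of A..E to hold, so each
# extra conjunct shrinks the product.
stage_probs = [0.65, 0.80, 0.40, 0.65, 0.40]
p_conjunctive = 1.0
for p in stage_probs:
    p_conjunctive *= p

# Disjunctive decomposition: P holds if any one of several independent
# paths occurs, so each extra path grows the total.
path_probs = [0.10, 0.15, 0.05]
p_none = 1.0
for p in path_probs:
    p_none *= (1 - p)
p_disjunctive = 1 - p_none

print(round(p_conjunctive, 3))
print(round(p_disjunctive, 3))
```

With these numbers the conjunctive product comes out around 0.054 while the disjunctive total comes out around 0.273, even though no individual path exceeds 0.15; either framing can lead one astray if the independence assumptions are wrong.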
Hi Ben,
A few thoughts on this:
Overall, my sense is that disagreement here is probably more productively focused on the object level -- e.g., on the actual probabilities I give to the premises, and/or on pointing out and giving weight to scenarios that the premises don’t cover -- rather than on the methodology in the abstract. In particular, I doubt that people who disagree a lot with my bottom line will end up saying: “If I was to do things your way, I’d roughly agree with the probabilities you gave to the premises; I just disagree that you should assign probabilities to premises in a multi-step argument as a way of thinking about issues like this.” Rather, I expect a lot of it comes down to substantive disagreement about the premises at issue (and perhaps, to people assigning significant credence to scenarios that don’t fit these premises, though I don't feel like I've yet heard strong candidates -- e.g., ones that seem to me to plausibly account for, say, >2/3rds of the overall X-risk from power-seeking, misaligned AI by 2070 -- in this regard).
Thanks for the thoughtful reply.
I do think I was overestimating how firmly you're treating your numbers and premises; it seems like you're holding them all much more lightly than I'd been envisioning.
FWIW I am more interested in engaging with some of what you wrote in your other comment than engaging on the specific probability you assign, for some of the reasons I wrote about here.
I think I have more I could say on the methodology, but alas, I'm pretty blocked up with other work atm. It'd be neat to spend more time reading the report and leave more comments here sometime.
This links to A Sketch of Good Communication, not whichever comment you were intending to link :)
Fixed, tah.
Great comment :)
The upshot seems to be that Joe, 80k, the AI researcher survey (2008), Holden-2016 are all at about a 3% estimate of AI risk, whereas AI safety researchers now are at about 30%. The latter is a bit lower (or at least differently distributed) than Rob expected, and seems higher than among Joe's advisors.
The divergence is big, but pretty explainable, because it concords with the direction that apparent biases point in. For the 3% camp, the credibility of one's name, brand, or field benefits from making lowball estimates, whereas the 30% camp is self-selected for severe concern. And risk perception all-round has increased a bit in the last 5-15 years due to Deep Learning.
Re 80K's 2017 take on the risk level: You could also say that the AI safety field is crazy and people in it are very wrong, as part of a case for lower risk probabilities. There are some very unhealthy scientific fields out there. Also, technology forecasting is hard. A career-evaluating group could investigate a field like climate change, decide that researchers in the field are very confused about the expected impact of climate change, but still think it's an important enough problem to warrant sending lots of people to work on the problem. But in that case, I'd still want 80K to explicitly argue that point, and note the disagreement.
I previously complained about this on LessWrong.
I think there is a tenable view that considers an AI catastrophe less likely than what AI safety researchers think but is not committed to anything nearly as strong as the field being "crazy" or people in it being "very wrong":
We might simply think that people are more likely to work on AI safety if they consider an AI catastrophe more likely. When considering their beliefs as evidence we'd then need to correct for that selection effect.
[ETA: I thought I should maybe add that even the direction of the update doesn't seem fully clear. It depends on assumptions about the underlying population. E.g. if we think that everyone's credence is determined by an unbiased but noisy process, then people with high credences will self-select into AI safety because of noise, and we should think the 'correct' credence is lower than what they say. On the other hand, if we think that there are differences in how people form their beliefs, then it at least could be the case that some people are simply better at predicting AI catastrophes, or are fast at picking up 'warning signs', and if AI risk is in fact high then we would see a 'vanguard' of people self-selecting into AI safety early who also will have systematically more accurate beliefs about AI risk than the general population.]
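A minimal Monte Carlo sketch of the first scenario in the ETA (all parameters hypothetical): everyone's credence is an unbiased but noisy estimate of the same true risk, and people self-select into AI safety when their credence clears a threshold.

```python
import random

random.seed(0)

TRUE_RISK = 0.10   # hypothetical underlying probability of catastrophe
NOISE_SD = 0.08    # everyone's credence = truth + unbiased noise
THRESHOLD = 0.10   # people enter the field if their credence exceeds this

# Clamp noisy credences to [0, 1].
population = [min(max(random.gauss(TRUE_RISK, NOISE_SD), 0.0), 1.0)
              for _ in range(100_000)]
field = [c for c in population if c > THRESHOLD]

mean_pop = sum(population) / len(population)
mean_field = sum(field) / len(field)

# Under this model, self-selection pushes the field's average credence
# above both the population average and the true risk.
print(round(mean_pop, 3), round(mean_field, 3))
```

Under this model the field's average stated credence overshoots the true risk, so a naive reading of the field's credences would call for a downward correction; the second scenario in the ETA, where belief-forming processes genuinely differ in accuracy, is not captured by this sketch.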
(I am sympathetic to "I'd still want 80K to explicitly argue that point, and note the disagreement.", though haven't checked to what extent they might do that elsewhere.)
Yeah, I like this correction.
Though in the world where the credible range of estimates is 1-10%, and 80% of the field believed the probability were >10% (my prediction from upthread), that would start to get into 'something's seriously wrong with the field' territory from my perspective; that's not a small disagreement.
(I'm assuming here, as I did when I made my original prediction, that they aren't all clustered around 15% or whatever; rather, I'd have expected a lot of the field to give a much higher probability than 10%.)
Hi Rob,
Thanks for these comments.
Let’s call “there will be an existential catastrophe from power-seeking AI before 2070” p. I’m understanding your main objections in this comment as:
One thing I’ll note at the outset is the content of footnote 178, which (partly prompted by your comment) I may revise to foreground more in the main text: “In sensitivity tests, where I try to put in ‘low-end’ and ‘high-end’ estimates for the premises above, this number varies between ~.1% and ~40% (sampling from distributions over probabilities narrows this range a bit, but it also fails to capture certain sorts of correlations). And my central estimate varies between ~1-10% depending on my mood, what considerations are salient to me at the time, and so forth. This instability is yet another reason not to put too much weight on these numbers. And one might think variation in the direction of higher risk especially worrying.”
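For concreteness, the kind of sensitivity test the footnote describes can be sketched as follows (the six premise ranges below are placeholders, not the report's actual numbers):

```python
import random

random.seed(1)

# Placeholder (low, high) ranges for six premise probabilities.
ranges = [(0.4, 0.8), (0.6, 0.9), (0.2, 0.6),
          (0.4, 0.8), (0.2, 0.6), (0.8, 1.0)]

def product(ps):
    out = 1.0
    for p in ps:
        out *= p
    return out

# 'Low-end' and 'high-end' estimates: multiply all lows, then all highs.
low_end = product(lo for lo, _ in ranges)
high_end = product(hi for _, hi in ranges)

# Monte Carlo: sample each premise independently and look at the spread
# of the product (the 5th-95th percentile band).
samples = sorted(
    product(random.uniform(lo, hi) for lo, hi in ranges)
    for _ in range(50_000)
)
p5, p95 = samples[2_500], samples[47_500]

print(round(low_end, 4), round(high_end, 4))
print(round(p5, 4), round(p95, 4))
```

As the footnote says, sampling independently narrows the band relative to the pure low-end/high-end products, and it also bakes in an independence assumption; correlated errors across premises would widen the spread again.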
Re 1a: I’m open to 5% being too low. Indeed, I take “95% seems awfully confident,” and related worries in that vein, seriously as an objection. However, as the range above indicates, I also feel open to 5% being too high (indeed, at times it seems that way to me), and I don’t see “it would be strange to be so confident that all of humanity won’t be killed/disempowered because of X” as a forceful argument on its own (quite the contrary): rather, I think we really need to look at the object-level evidence and argument for X, which is what the document tries to do (not saying that quote represents your argument; but hopefully it can illustrate why one might start from a place of being unsurprised if the probability turns out low).
Re 1b: I’m not totally sure I’ve understood you here, but here are a few thoughts. At a high level, one answer to “what sort of evidence would make me update towards p being more likely” is “the considerations discussed in the document that I see as counting against p don’t apply, or seem less plausible” (examples here include considerations related to longer timelines, non-APS/modular/specialized/myopic/constrained/incentivized/not-able-to-easily-intelligence-explode systems sufficing in lots/maybe ~all of incentivized applications, questions about the ease of eliminating power-seeking behavior on relevant inputs during training/testing given default levels of effort, questions about why and in what circumstances we might expect PS-misaligned systems to be superficially/sufficiently attractive to deploy, warning shots, corrective feedback loops, limitations to what APS systems with lopsided/non-crazily-powerful capabilities can do, general incentives to avoid/prevent ridiculously destructive deployment, etc, plus more general considerations like “this feels like a very specific way things could go”).
But we could also imagine more “outside view” worlds where my probability would be higher: e.g., there is a body of experts as large and established as the experts working on climate change, which uses quantitative probabilistic models of the quality and precision used by the IPCC, along with an understanding of the mechanisms underlying the threat as clear and well-established as the relationship between carbon emissions and climate change, to reach a consensus on much higher estimates. Or: there is a significant, well-established track record of people correctly predicting future events and catastrophes of this broad type decades in advance, and people with that track record predict p with >5% probability.
That said, I think maybe this isn’t getting at the core of your objection, which could be something like: “if in fact this is a world where p is true, is your epistemology sensitive enough to that? E.g., show me that your epistemology is such that, if p is true, it detects p as true, or assigns it significant probability.” I think there may well be something to objections in this vein, and I'm interested in thinking about them more; but I also want to flag that at a glance, it feels kind of hard to articulate them in general terms. Thus, suppose Bob has been wrong about 99/100 predictions in the past. And you say: “OK, but if Bob was going to be right about this one, despite being consistently wrong in the past, the world would look just like it does now. Show me that your epistemology is sensitive enough to assign high probability to Bob being right about this one, if he’s about to be.” But this seems like a tough standard; you just should have low probability on Bob being right about this one, even if he is. Not saying that’s the exact form of your objection, or even that it's really getting at the heart of things, but maybe you could lay out your objection in a way that doesn’t apply to the Bob case?
(Responses to 1c below)
Could you clarify what you mean by this? I think I don't understand what the word "true", italicized, is supposed to mean here. Are you just reporting the impression (i.e. a belief not adjusted to account for other people's beliefs) that you are ~100% certain an existential catastrophe from misaligned, power-seeking AI will (by default) occur by 2070? Or are you saying that this is what prima facie seems to you to be the case, when you extrapolate naively from current trends? The former seems very overconfident (even conditional on an existential catastrophe occurring by that date, it is far from certain that it will be caused by misaligned AI), whereas the latter looks pretty uninformative, given that it leaves open the possibility that the estimate will be substantially revised downward after additional considerations are incorporated (and you do note that you think "there's a decent chance of exciting surprises"). Or perhaps you meant neither of these things?
I guess the most helpful thing (at least to someone like me who's trying to make sense of this apparent disagreement between you and Joe) would be for you to state explicitly what probability assignment you think the totality of the evidence warrants (excluding evidence derived from the fact that other reasonable people have beliefs about this), so that one can then judge whether the discrepancy between your estimate and Joe's is so significant that it suggests "some mistake in methodology" on your part or his, rather than a more mundane mistake.
A pattern I think I've seen with a fair number of EAs is that they'll start with a pretty well-calibrated impression of how serious AGI risk is; but then they'll worry that if they go around quoting a P(doom) like "25%" or "70%" (especially if the cause is something as far-fetched as AI), they'll look like a crackpot. So the hypothetical EA tries to find a way to justify a probability more like 1-10%, so they can say the moderate-sounding "AI disaster is unlikely, but the EV is high", rather than the more crazy-sounding "AI disaster is likely".
This obviously isn't the only reason people assign low probabilities to AI x-catastrophe, and I don't at all know whether that pattern applies here (and I haven't read Joe's replies here yet); and it's rude to open a conversation by psychologizing. Still, I wanted to articulate some perspectives from which there's less background pressure to try to give small probabilities to crazy-sounding scenarios, on the off chance that Joe or some third party found it helpful:
The latter two points especially are what I was trying (and probably failing) to communicate with "'existential catastrophe from misaligned, power-seeking AI by 2070' is true."
Define a 'science AGI' system as one that can match top human thinkers in at least two big ~unrelated hard-science fields (e.g., particle physics and organic chemistry).
If the first such systems are roughly as opaque as 2020's state-of-the-art ML systems (e.g., GPT-3) and the world order hasn't already been upended in some crazy way (e.g., there isn't a singleton), then I expect an AI-mediated existential catastrophe with >95% probability.
I don't have an unconditional probability that feels similarly confident/stable to me, but I think those two premises have high probability, both individually and jointly. This isn't the same proposition Joe was evaluating, but it maybe illustrates why I have a very different high-level take on "probability of existential catastrophe from misaligned, power-seeking AI".
"X happens" and "X doesn't happen" are not symmetrical once I know that X is a specific event. Most things at the level of specificity of "humans build an AI that outmaneuvers humans to permanently disempower them" just don't happen.
The reason we are even entertaining this scenario is because of a special argument that it seems very plausible. If that's all you've got---if there's no other source of evidence than the argument---then you've just got to start talking about the probability that the argument is right.
And the argument actually is a brittle and conjunctive thing. (Humans do need to be able to build such an AI by the relevant date, they do need to decide to do so, the AI they build does need to decide to disempower humans notwithstanding a prima facie incentive for humans to avoid that outcome.)
That doesn't mean this is the argument or that the argument is brittle in this way---there might be a different argument that explains in one stroke why several of these things will happen. In that case, it's going to be more productive to talk about that.
(For example, in the context of the multi-stage argument undershooting success probabilities, it's that people will be competently trying to achieve X and most of uncertainty is estimating how hard and how effectively people are trying---which is correlated across steps. So you would do better by trying to go for the throat and reason about the common cause of each success, and you will always lose if you don't see that structure.)
And of course some of those steps may really just be quite likely and one shouldn't be deterred from putting high probabilities on highly-probable things. E.g. it does seem like people have a very strong incentive to build powerful AI systems (and moreover the extrapolation suggesting that we will be able to build powerful AI systems is actually about the systems we observe in practice and already goes much of the way to suggesting that we will do so). Though I do think that the median MIRI staff-member's view is overconfident on many of these points.
There's probably more, I haven't thought very long about it.
(Before responses of the form "what about e.g. the botched COVID response?", let me note that this is about additional evidence; I'm not denying that there is existing evidence.)
My basic perspective here is pretty well-captured by Being Half-Rational About Pascal's Wager is Even Worse, and by a related passage in Hero Licensing.
Can you talk about your estimate of the overall AI-related x-risk (see here for an attempt at a comprehensive list), as well as total x-risk from all sources? (If your overall AI-related x-risk is significantly higher than 5%, what do you think are the other main sources?) I think it would be a good idea for anyone discussing a specific type of x-risk to also give their more general estimates, for a few reasons:
One thing that I think would really help me read this document would be (from Joe) a sense of "here's the parts where my mind changed the most in the course of this investigation".
Something like (note that this is totally made up): "there's a particular exploration of alignment where I had conceptualized it as being about making the AI think right, but now I conceptualize it as being about not thinking wrong, which I explore in section a.b.c".
Also maybe something like a sense of which of the premises Joe changed his mind on the most – where the probabilities shifted a lot.
Hi Ben,
This does seem like a helpful kind of content to include (here I think of Luke’s section on this here, in the context of his work on moral patienthood). I’ll consider revising to say more in this vein. In the meantime, here are a few updates off the top of my head:
Great answer, thanks.
Hey Joe!
Great report, really fascinating stuff. It draws together lots of different writing on the subject, and I really like how you identify concerns that speak to different perspectives (e.g. to Drexler's CAIS and to classic Bostrom superintelligence).
Three quick bits of feedback:
Which is, unfortunately, a pretty key premise and the one I have the most questions about! My impression is that section 6.3 is where that argumentation is intended to occur, but I didn't leave it with a sense of how you thought this would scale, disempower everyone, and be permanent. Would love for you to say more on this.
Presumably we should also be worried about a small group doing this as well? For example, consider a scenario in which a power-hungry small group, or several competing groups, use aligned AI systems with advanced capabilities (perhaps APS, perhaps not) to the point of permanently disempowering ~all of humanity.
If I went through and find-replaced all the "PS-misaligned AI system" mentions with "power-hungry small group", would it read that differently? To borrow Tegmark's terms, does it matter if it's the Omega Team or Prometheus?
I'd be interested in seeing some more from you about whether you're also concerned about that scenario, whether you're more/less concerned, and how you think it's different from the AI system scenario.
Again, really loved the report, it is truly excellent work.
Hi Hadyn,
Thanks for your kind words, and for reading.
Oh and:
4. Cotra aims to predict when it will be possible for "a single computer program [to] perform a large enough diversity of intellectual labor at a high enough level of performance that it alone can drive a transition similar to the Industrial Revolution." - that is a "growth rate [of the world economy of] 20%-30% per year if used everywhere it would be profitable to use"
Your scenario is premise 4 "Some deployed APS systems will be exposed to inputs where they seek power in unintended and high-impact ways (say, collectively causing >$1 trillion dollars of damage), because of problems with their objectives" (italics added).
Your bar is (much?) lower, so we should expect your scenario to come (much?) earlier.
Thanks for this work!
I'm wondering about "crazy teenager builds misaligned APS system in a basement" scenarios and to what extent you see the considerations in this report as bearing on those.
To be a bit more precise: I'm thinking about worlds where "alignment is easy" for society at large (i.e. your claim 3 is not true), but building powerful AI is feasible even for people who are not interested in taking the slightest precautions, even those that would be recommended by ordinary self-interest. I think mostly about individuals or small groups rather than organizations.
I think these scenarios are distinct from misuse scenarios (which you mention below your report is not intended to cover), though the line is blurry. If someone who wanted to see enormous damage to the world built an AI with the intent of causing such damage, and was successful, I'd call that "misuse." But I'm interested more in "crazy" than "omnicidal" here, where I don't think it's clear whether to call this "misuse" or not.
Maybe you see this as a pretty separate type of worry than what the report is intended to cover.