Trying to better understand the practical epistemology of EA, and how we can improve upon it.
I'm a bit unclear on why you characterise 80,000 Hours as having a "narrower" cause focus than (e.g.) Charity Entrepreneurship. CE's page cites the following cause areas:
Meanwhile, 80k provide a list of the world's "most pressing problems":
These areas feel comparably "broad" to me? Likewise, Longview, whom you list as part of the "AI x-risk community", states six distinct focus areas for their grantmaking, only one of which is AI. Unless I've missed a recent pivot from these orgs, both Longview & 80k feel more similar to CE in terms of breadth than to Animal Advocacy Careers.
I agree that you need "specific values and epistemic assumptions" to agree with the areas these orgs have highlighted as most important, but I think you need specific values and epistemic assumptions to agree with more standard near-termist recommendations for impactful careers and donations, too. So I'm a bit confused about what the difference between "question" and "answer" communities is meant to denote aside from the split between near/longtermism.[1] Is the idea that (for example) CE is more skeptically focused on exploring the relative priorities of distinct cause areas, whereas organizations like Longview and 80k are more focused on funnelling people+money into areas which have already been decided as the most important? Or something else?
I do think it's correct to note that the more 'longtermist' side of the community works with different values and epistemics to the more 'neartermist' side, and I think it would be beneficial to emphasise this more. But given that you note there are already distinct communities in some sense (e.g., there are x-risk specific conferences), what other concrete steps would you like to see implemented in order to establish distinct communities?
I'm aware that many people justify focus on areas like biorisk and AI in virtue of the risks posed to the present generation, and might not subscribe to longtermism as a philosophical thesis. I still think that the ‘longtermist’ moniker is useful as a sociological label — used to denote the community of people who work on cause areas that longtermists are likely to rate as among the highest priorities.
Ah, thanks! Sorry I’m late getting back to you. I’ll respond to various parts in turn.
“I don't find Carlsmith et al's estimates convincing because they are starting with a conjunctive frame and applying conjunctive reasoning. They are assuming we're fine by default (why?), and then building up a list of factors that need to go wrong for doom to happen.”
My initial interpretation of this passage is: you seem to be saying that conjunctive/disjunctive arguments are presented against a mainline model (say, one of doom/hope). In presenting a ‘conjunctive’ argument, Carlsmith presupposes a mainline model of hope. However, you doubt the mainline model of hope, and so his argument is unconvincing. If that reading is correct, then my view is that the mainline model of doom has not been successfully argued for. What do you take to be the best argument for a ‘mainline model’ of doom? If I’m correct in interpreting the passage below as an argument for a ‘mainline model’ of doom, then it strikes me as unconvincing:
“Any one of a vast array of things can cause doom. Just the 4 broad categories mentioned at the start of the OP (subfields of Alignment) and the fact that "any given [alignment] approach that might show some promise on one or two of these still leaves the others unsolved." is enough to provide a disjunctive frame!”
Under your framing, I don’t think that you’ve come anywhere close to providing an argument for your preferred disjunctive framing. On my way of viewing things, an argument for a disjunctive framing shows that “failure on intent alignment (with success in the other areas) leads to a high P(Doom | AGI), failure on outer alignment (with success in the other areas) leads to a high P(Doom | AGI), etc …”. I think that you have not shown this for any of the disjuncts, and an argument for a disjunctive frame requires showing this for all of the disjuncts.
I claimed that an argument for (my slight alteration of) Nate’s framing was likely to rely on the conjunction of many assumptions, and you (very reasonably) asked me to spell them out. To recap, here’s the framing:
For humanity to be dead by 2070, only one of the following needs to be true:
- Humanity has < 20 years to prepare for AGI
- The technical challenge of alignment isn’t “pretty easy”
- Research culture isn’t alignment-conscious in a competent way.
For this to be a disjunctive argument for doom, all of the following need to be true:
- If humanity has < 20 years to prepare for AGI, then doom is highly likely.
- Etc …
That is, the first point requires an argument which shows the following:
A Conjunctive Case for the Disjunctive Case for Doom:[1]
If I try to spell out the arguments for this framing, things start to look pretty messy. If technical alignment were “pretty easy”, and tackled by a culture which competently pursued alignment research, I wouldn’t feel >90% confident in doom. The claim “if humanity has < 20 years to prepare for AGI, then doom is highly likely” requires (non-exhaustively) the following assumptions:
So far, I’ve discussed just one disjunct, but I can imagine outlining similar assumptions for the other disjuncts. For instance: if we have >20 years to conduct AI alignment research conditional on the problem not being super hard, why can’t there be a decent chance that a not-super-competent research community solves the problem? Again, I find it hard to motivate the case for a claim like that without already assuming a mainline model of doom.
I’m not saying there aren’t interesting arguments here, but I think that arguments of this type mostly assume a mainline model of doom (or the adequacy of a ‘disjunctive framing’), rather than providing independent arguments for a mainline model of doom.
“This blog is ~1k words. Can you write a similar length blog for the other side, rebutting all my points?”
I think so! But I’m unclear what, exactly, your arguments are meant to be. Also, I would personally find it much easier to engage with arguments in premise-conclusion format. Otherwise, I feel like I have to put in a lot of work just to understand the logical structure of your argument, which requires a decent chunk of time.
Still, I’m happy to chat over DM if you think that discussing this further would be profitable. Here’s my attempt to summarize your current view of things.
We’re on a doomed path, and I’d like to see arguments which could allow me to justifiably believe that there are paths which will steer us away from the default attractor state of doom. The technical problem of alignment has many component pieces, and it seems like failure to solve any one of the many component pieces is likely sufficient for doom. Moreover, the problems for each piece of the alignment puzzle look ~independent.
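If that summary is roughly right, I can at least see why the numbers escalate quickly under this frame: with several ~independent pieces, even moderate per-piece failure probabilities compound. As a purely illustrative calculation (the four pieces come from the categories you mentioned; the 50% per-piece figure is made up by me, not something you've claimed):

```latex
P(\text{at least one of 4 independent pieces goes unsolved}) = 1 - (1 - 0.5)^{4} = 1 - 0.0625 \approx 94\%
```

My skepticism is less about this arithmetic and more about whether the per-piece probabilities are really that high, and really that independent.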
Suggestions for better argument names are not being taken at this time.
Based solely on my own impression, I'd guess that one reason for the lack of engagement with your original question is that it felt like you were operating within a very specific frame, and I sensed that untangling the specific assumptions of your frame (and consequently a high P(doom)) would take a lot of work. In my own case, I didn’t know which assumptions were driving your estimates, and so I felt unsure as to which counter-arguments you'd consider relevant to your key cruxes.
(For example: many reviewers of the Carlsmith report (alongside Carlsmith himself) put P(doom) ≤ 10%. If you've read these responses, why did you find the responses uncompelling? Which specific arguments did you find faulty?)
Here's one example from this post where I felt as though it would take a lot of work to better understand the argument you want to put forward:
“The above considerations are the basis for the case that disjunctive reasoning should predominantly be applied to AI x-risk: the default is doom.”
When I read this, I found myself asking “wait, what are the relevant disjuncts meant to be?”. I understand a disjunctive argument for doom to be saying that doom is highly likely conditional on any one of {A, B, C, … }. If each of A, B, C … is independently plausible, then obviously this looks worrying. If you say that some claim is disjunctive, I want an argument for believing that each disjunct is independently plausible, and an argument for accepting the disjunctive framing offered as the best framing for the claim at hand.
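To make the contrast explicit, here's a rough sketch of how I'm reading the two frames (my own notation, not something Nate or Carlsmith have signed up to):

```latex
% Disjunctive frame: each disjunct D_i is (roughly) sufficient for doom on its own,
% so P(doom) is bounded below (approximately) by the probability of the disjunction.
P(\text{doom} \mid D_i) \approx 1 \ \text{for each } i, \qquad
P(\text{doom}) \gtrsim P(D_1 \lor D_2 \lor \dots \lor D_n)

% Conjunctive frame (Carlsmith-style): doom requires a chain of steps to all hold,
% so the estimate is a product of conditional probabilities.
P(\text{doom}) = P(S_1)\, P(S_2 \mid S_1) \cdots P(S_n \mid S_1, \dots, S_{n-1})
```

On this way of carving things up, arguing for a disjunctive frame means arguing both that doom really is highly likely conditional on each disjunct, and that the disjunction itself carries substantial probability.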
For instance, here’s a disjunctive framing of something Nate said in his review of the Carlsmith Report.
For humanity to be dead by 2070, only one premise below needs to be true:
- Humanity has < 20 years to prepare for AGI
- The technical challenge of alignment isn’t “pretty easy”
- Research culture isn’t alignment-conscious in a competent way.
Phrased this way, Nate offers a disjunctive argument. And, to be clear, I think it’s worth taking seriously. But I feel like ‘disjunctive’ and ‘conjunctive’ are often thrown around a bit too loosely, and such terms mostly serve to impede the quality of discussion. It’s not obvious to me that Nate’s framing is the best framing for the question at hand, and I expect that making the case for Nate’s framing is likely to rely on the conjunction of many assumptions. And that’s fine! I think it’s a valuable argument to make! I just think there should be more explicit discussions and arguments about the best framings for predicting the future of AI.
Finally, I feel like asking for “a detailed technical argument for believing P(doom|AGI) ≤ 10%” is making an isolated demand for rigor. I personally don’t think there are ‘detailed technical arguments’ for believing P(doom|AGI) is greater than 10%, either. I don’t say this critically, because reasoning about the chances of doom given AGI is hard. I'm also >10% on many claims in the absence of 'detailed, technical arguments' for such claims, and I think we can do a lot better than we're doing currently.
I agree that it’s important to avoid squeamishness about proclamations of confidence in pessimistic conclusions, if that’s what we genuinely believe the arguments suggest. I'm also glad that you offered the 'social explanation' for people's low doom estimates, even though I think it's incorrect, and even though many people (including, tbh, me) will predictably find it annoying. In the same spirit, I'd like to offer an analogous argument: I think many arguments for p(doom | AGI) > 90% are the result of overreliance on a specific default frame, and insufficiently careful attention to argumentative rigor. If that claim strikes you as incorrect, or brings obvious counterexamples to mind, I'd be interested to read them (and to elaborate on my dissatisfaction with existing arguments for high doom estimates).
thnx! : )
Your analogy successfully motivates the “man, I’d really like more people to be thinking about the potentially looming Octopcracy” sentiment, and my intuitions here feel pretty similar to the AI case. I would expect the relevant systems (AIs, von-Neumann-Squidwards, etc) to inherit human-like cognitive properties (including normative cognition, like plan search), and I’d put a small-but-non-negligible chance on us ending up with extinction (or worse).
On maximizers: to me, the most plausible reason for believing that continued human survival would be unstable in Grace’s story consists in either the emergence of dangerous maximizers or the emergence of related behaviors like rapacious influence-seeking (e.g., Part II of What Failure Looks Like). I agree that maximizers aren't necessary for human extinction, but their emergence does seem like the most plausible route to ‘human extinction’ rather than ‘something else weird and potentially not great’.
Pushback appreciated! But I don’t think you show that “LLMs distill human cognition” is wrong. I agree that ‘next token prediction’ is very different to the tasks that humans faced in their ancestral environments; I just don’t see this as particularly strong evidence against the claim ‘LLMs distill human cognition’.
I initially stated that “LLMs distill human cognition” struck me as a more useful predictive abstraction than a view which claims that the trajectory of ML leads us to a scenario where future AIs are, “in the ways that matter”, doing something more like “randomly sampling from the space of simplicity-weighted plans”. My initial claim still seems right to me.
If you want to pursue the debate further, it might be worth talking about the degree to which you’re (un)convinced by Quintin Pope’s claims in this tweet thread. Admittedly, it sounds like you don’t view this issue as super cruxy for you:
“The cognitive machinery that represents human intelligence seems to be substantially decoupled from the cognitive machinery that represents human values”
I don’t know the literature on moral psychology, but that claim doesn’t feel intuitive to me (possibly I’m misunderstanding what you mean by ‘human values’; I’m also interested in any relevant sources). Some thoughts/questions:
A working attempt to sketch a simple three-premise argument for the claim: ‘TAI will result in human extinction’, and offer objections. Made mostly for my own benefit while working on another project, but I thought it might be useful to post here.
The structure of my preferred argument is similar to an earlier framing suggested by Katja Grace.
I’ll offer some rough probabilities, though they shouldn’t be taken too seriously. I don’t think probabilities are the best way to adjudicate disputes of this kind, but I thought offering a more quantitative sense of my uncertainty (based on my immediate impressions) might be helpful in this case. For the (respective) premises, I might go for 98%, 7%, and 83%, resulting in a ~6% chance of human extinction given TAI.
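(For what it's worth, the ~6% figure is just the product of the three numbers, treating each premise as conditional on the preceding ones:)

```latex
0.98 \times 0.07 \times 0.83 \approx 0.057 \approx 6\%
```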
Some more specific objections:
I also think the story by Katja Grace below is plausible, in which superhuman AI systems are “goal-directed”, but don’t lead to human extinction.
“AI systems proliferate, and have various goals. Some AI systems try to make money in the stock market. Some make movies. Some try to direct traffic optimally. Some try to make the Democratic party win an election. Some try to make Walmart maximally profitable. These systems have no perceptible desire to optimize the universe for forwarding these goals because they aren’t maximizing a general utility function, they are more ‘behaving like someone who is trying to make Walmart profitable’. They make strategic plans and think about their comparative advantage and forecast business dynamics, but they don’t build nanotechnology to manipulate everybody’s brains, because that’s not the kind of behavior pattern they were designed to follow. The world looks kind of like the current world, in that it is fairly non-obvious what any entity’s ‘utility function’ is. It often looks like AI systems are ‘trying’ to do things, but there’s no reason to think that they are enacting a rational and consistent plan, and they rarely do anything shocking or galaxy-brained.”
Perhaps the story above is unlikely because the AI systems in Grace’s story would (in the absence of strong preventative efforts) be dangerous maximizers. I think that this is most plausible on something like Eliezer’s model of agency, and if my views change my best bet is that I’ll have updated towards his view.
Finally, I sometimes feel confused by the concept of ‘capabilities’ as it’s used in discussions about AGI. From Jenner and Treutlein’s response to Grace’s counterarguments:
“Assuming it is feasible, the question becomes: why will there be incentives to build increasingly capable AI systems? We think there is a straightforward argument that is essentially correct: some of the things we care about are very difficult to achieve, and we will want to build AI systems that can achieve them. At some point, the objectives we want AI systems to achieve will be more difficult than disempowering humanity, which is why we will build AI systems that are sufficiently capable to be dangerous if unaligned.”
Maybe one thing I’m thinking here is that “more difficult” is hard to parse. The AI systems might be able to achieve some narrower outcome that we desire, without being “capable” of destroying humanity. I think this is compatible with having systems which are superhumanly capable of pursuing some broadly-scoped goals, without being capable of pursuing all broadly-scoped goals.
(Also, I’m no doubt missing a bunch of relevant information here. But this is probably true for most people, and I think it’s good for people to share objections even if they’re missing important details)
Nice post!
I think I’d want to revise your first taxonomy a bit. To me, one (perhaps the primary) disagreement among ML researchers regarding AI risk consists of differing attitudes to epistemological conservatism, which I think extends beyond making conservative predictions. Here’s why I prefer my framing:
I also think that the language of conservative epistemology helps counteract (what I see as) a mistaken frame motivating this post. (I’ll try to motivate my claim, but I’ll note that I remain a little fuzzy on exactly what I’m trying to gesture at.)
The mistaken frame I see is something like “modeling conservative epistemologists as if they were making poor strategic choices within a non-conservative world-model”. You state:
“The level of concern and seriousness I see from ML researchers discussing AGI on any social media platform or in any mainstream venue seems wildly out of step with "half of us think there's a 10+% chance of our work resulting in an existential catastrophe".”
I have concerns about you inferring this claim from the survey data provided,[1] but perhaps more pertinently for my point: I think you’re implicitly interpreting the reported probabilities as something like all-things-considered credences in the proposition researchers were queried about. I’m much more tempted to interpret the probabilities offered by researchers as meaning very little. Sure, they’ll provide a number on a survey, but this doesn’t represent ‘their’ probability of an AI-induced existential catastrophe.
I don’t think that most ML researchers have, as a matter of psychological fact, any kind of mental state that’s well-represented by a subjective probability about the chance of an AI-induced existential catastrophe. They’re more likely to operate with a conservative epistemology, in a way that isn’t neatly translated into probabilistic predictions over an outcome space that includes the outcomes you are most worried about. I think many people are likely to filter out the hypothesis given the perceived lack of evidential support for the outcome.
I actually do think the distinction between 'conservative predictions' and 'conservative decision-making' is helpful, though I'm skeptical about its relevance for analyzing different attitudes to AI risk.
If my analysis is right, then a first pass at the practical conclusions might consist in being more willing to center arguments about alignment from a more empirically grounded perspective (e.g. here), or more directly attempting to have conversations about the costs and benefits of more conservative epistemological approaches.
First, there are obviously selection effects present in surveying OpenAI and DeepMind researchers working on long-term AI. Citing this result without caveat feels similar to using (e.g.) PhilPapers survey results revealing that most specialists in philosophy of religion are theists to support the claim that most philosophers are theists. I can also imagine similar selection effects being present (though to lesser degrees) in the AI Impacts Survey. Given these selection effects, and given that response rates for the AI Impacts survey were ~17%, I think your claim is misleading.
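As a toy illustration of the mechanism (with entirely made-up response rates, not estimates of the actual survey):

```python
# Toy model of survey selection effects: if researchers who are concerned about x-risk
# are more likely to respond, the observed fraction of "concerned" respondents
# overstates the true population fraction. All numbers below are hypothetical.

def observed_fraction(true_fraction: float, concerned_rate: float, unconcerned_rate: float) -> float:
    """Fraction of respondents who are 'concerned', given differential response rates."""
    concerned_responders = true_fraction * concerned_rate
    unconcerned_responders = (1 - true_fraction) * unconcerned_rate
    return concerned_responders / (concerned_responders + unconcerned_responders)

# Suppose 25% of the surveyed population is concerned, concerned researchers respond
# at 30%, and everyone else responds at 15% (overall response rate ~19%, in the same
# ballpark as the ~17% figure above).
print(observed_fraction(0.25, 0.30, 0.15))  # -> 0.4: the survey would report 40% concerned
```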
I haven’t read Kosoy & Diffractor’s stuff, but I will now!
FWIW I’m pretty skeptical that their framework will be helpful for making progress in practical epistemology (which I gather is not their main focus anyway?). That said, I’d be very happy to learn that I'm wrong here, so I’ll put some time into understanding what their approach is.
Thanks :)
I’m sympathetic to the view that calibration on questions with larger bodies of obviously relevant evidence isn’t transferable to predictions on more speculative questions. Ultimately I believe that the amount of skill transfer is an open empirical question, though I think the absence of strong theorizing about the relevant mechanisms involved counts heavily against deferring to (e.g.) Metaculus predictions about AI timelines.
A potential note of disagreement on your final sentence. While I think focusing on calibration can Goodhart us away from some of the most important sources of epistemic insight, there are “predictions” (broadly construed) that I think we ought to weigh more highly than “domain-relevant specific accomplishments and skills”.
Here's a dynamic that I've seen pop up more than once.
Person A says that an outcome they judge to be bad will occur with high probability, while making a claim of the form "but I don't want (e.g.) alignment to be doomed — it would be a huge relief if I'm wrong!"
It seems uncontroversial that Person A would like to be shown that they're wrong in a way that vindicates their initial forecast as ex ante reasonable.
It seems more controversial whether Person A would like to be shown that their prediction was wrong, in a way that also shows their initial prediction to have been ex ante unreasonable.
In my experience, it's much easier to acknowledge that you were wrong about some specific belief (or the probability of some outcome), than it is to step back and acknowledge that the reasoning process which led you to your initial statement was misfiring. Even pessimistic beliefs can be (in Ozzie’s language) "convenient beliefs" to hold.
If we identify ourselves with our ability to think carefully, coming to believe that there are errors in our reasoning process can hit us much more personally than updates about errors in our conclusions. An optimistic update might mean coming to think that my projects have been less worthwhile than I thought, that my local community is less effective than I thought, or that my background framework or worldview was in error. I think these updates can be especially painful for people who are more liable to identify with their ability to reason well, or with the unusual merits of their chosen community.
To clarify: I'm not claiming that people with more pessimistic conclusions are, in general, more likely to be making reasoning errors. Obviously there are plenty of incentives towards believing rosier conclusions. I'm simply claiming that if someone arrives at a pessimistic conclusion based on faulty reasoning, then you shouldn't necessarily expect optimistic pushback to be uniformly welcomed, for all of the standard reasons that updates of the form "I could've done better on a task I care about" can be hard to accept.