Aaron_Scher

492 karma · Claremont, CA, USA

Bio

I'm Aaron. I've done university group organizing at the Claremont Colleges for a while. My current cause prioritization is AI alignment.

Comments (74)

How is the super-alignment team going to interface with the rest of the AI alignment community, and specifically what kind of work from others would be helpful to them (e.g., evaluations they would want to exist in 2 years, specific problems in interpretability that seem important to solve early, curricula for AIs to learn about the alignment problem while avoiding content we may not want them reading)? 

To provide more context on my thinking that leads to this question: I'm pretty worried that OpenAI is making themselves a single point of failure in existential security. Their plan seems to be a less-disingenuous version of "we are going to build superintelligence in the next 10 years, and we're optimistic that our alignment team will solve catastrophic safety problems, but if they can't then humanity is screwed anyway, because as mentioned, we're going to build the god machine. We might try to pause if we can't solve alignment, but we don't expect that to help much." Insofar as a unilateralist is taking existentially risky actions like this and they can't be stopped, other folks might want to support their work to increase the chance of the super-alignment team's success. Insofar as I want to support their work, I currently don't know what they need.

Another framing behind this question is just: "Many people in the AI alignment community are also interested in solving this problem; how can they indirectly collaborate with you?" (Some people will want to directly collaborate, but this has corporate-closedness limitations.)

I am not aware of modeling here, but I have thought about this a bit. Besides what you mention, some other ways I think this story may not pan out (very speculative):

  1. At the critical time, the cost of compute for automated researchers may be really high, such that it's actually not cost-effective to buy labor this way. This would mainly be because many people want to use the best hardware for AI training or productive work, and this demand just overwhelms suppliers and prices skyrocket. This is like the scenario where labs and governments pay a lot more, except that what they're buying is not altruistically-motivated research. Because autonomous labor is really expensive, it isn't a much better deal than 2023 human labor. 
  2. A similar problem is that there may not be a market for buying autonomous labor because somebody is restricting this. Perhaps a government implements compute controls including on inference to slow AI progress (because they think that rapid progress would lead to catastrophe from misalignment). Perhaps the lab that develops the first of these capable-of-autonomous-research models restricts who can use it. To spell this out more, say GPT-6 is capable of massively accelerating research, then OpenAI may only make it available to alignment researchers for 3 months. Alternatively, they may only make it available to cancer researchers. In the first case, it's probably relatively cheap to get autonomous alignment research (I'm assuming OpenAI is subsidizing this, though this may not be a good assumption). In the second case you can't get useful alignment research with your money because you're not allowed to. 
  3. It might be that the intellectual labor we can get out of AI systems at the critical time is bottlenecked by human labor (i.e., humans are needed to: review the output of AI debates, give instructions to autonomous software engineers, or construct high quality datasets). In this situation, you can't buy very much autonomous labor with your money because autonomous labor isn't the limiting factor on progress. This is pretty much the state of things in 2023; AI systems help speed up human researchers, but the compute cost of them doing so is still far below the human costs, and you probably didn't need to save significant money 5 years ago to make this happen. 

My current thinking is that there's a >20% chance that EA-oriented funders should be saving significant money to spend on compute for autonomous researchers, and it is an important thing for them to gain clarity on. I want to point out that there is probably a partial-automation phase (like point 3 above) before a full-automation phase. The partial-automation phase has less opportunity to usefully spend money on compute (plausibly still in the tens of millions of dollars), but our actions are more likely to matter. After that comes the full-automation phase where money can be scalably spent to e.g., differentially speed up alignment vs. AI capabilities research by hundreds of millions of dollars, but there's a decent chance our actions don't matter then. 

As you mention, perhaps our actions don't matter then because humans don't control the future. I would emphasize that if we have fully autonomous, no-humans-in-the-loop research happening without already having good alignment of those systems, it's highly likely that we get disempowered. That is, it might not make sense to aim to do alignment research at that point, because either the crucial alignment work was already done, or we lose. Conditional on having aligned systems at this point, having saved money to spend on altruistically motivated cognitive work probably isn't very important, because economic growth gets going really fast and there's plenty of money to be spent on non-alignment altruistic causes. On the other hand, something something at that point it's the last train on its way to the dragon, and it sure would be sad to not have money saved to buy those bed-nets. 

A few weeks ago I did a quick calculation for the amount of digital suffering I expect in the short term, which probably gets at your question about these relative sizes. tl;dr of my thinking on the topic: 

  • There is currently a global compute stock of ~1.4e21 FLOP/s (each second, we can do about that many floating point operations). 
  • It seems reasonable to expect this to grow by roughly two orders of magnitude in the next 10 years based on naively extrapolating current trends in spending and compute efficiency per dollar. That brings us to ~1.6e23 FLOP/s in 2033. 
  • Human brains do about 1e15 FLOP/s (each second, a human brain does about 1e15 floating point operations worth of computation)
  • We might naively assume that future AIs will have similar consciousness-compute efficiency to humans. We'll also assume that 63% of the 2033 compute stock is being used to run such AIs (makes the numbers easier). 
  • Then the number of human-consciousness-second-equivalent AIs that can be run each second in 2033 is 1e23 / 1e15 = 1e8, or 100 million. 
  • For reference, there are probably around 31 billion land animals alive in factory farms at any given time. I make a few adjustments based on brain size and guesses about the experience of suffering AIs, and get that digital suffering in 2033 seems to be similar in scale to factory farming. 
  • Overall my analysis is extremely uncertain, and I'm unsurprised if it's off by 3 orders of magnitude in either direction. Also note that I am only looking at the short term. 

You can read the slightly more thorough, but still extremely rough and likely wrong, BOTEC here.
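To spell out the arithmetic, here is a minimal sketch of the BOTEC in Python, using the round numbers from the bullets above (the 63% share and the 1e15 FLOP/s-per-brain figure are the same rough assumptions as in the bullets):

```python
# Rough BOTEC: scale of digital minds runnable in 2033, using the round numbers above.
compute_stock_2033 = 1.6e23   # FLOP/s, naive extrapolation of the current ~1.4e21 FLOP/s stock
fraction_running_ais = 0.63   # assumed share of 2033 compute running such AIs
human_brain_flops = 1e15      # FLOP/s, rough estimate of a human brain's computation

# Human-consciousness-second-equivalents of AI runnable each second in 2033
ai_equivalents = compute_stock_2033 * fraction_running_ais / human_brain_flops
print(f"{ai_equivalents:.0e}")  # ~1e+08, i.e. about 100 million

# For comparison: land animals alive in factory farms at any given time
factory_farmed_animals = 31e9
print(f"{factory_farmed_animals / ai_equivalents:.0f}x")  # ~300x more animals, before the
# brain-size and suffering adjustments mentioned above
```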

Thanks for your response. I'll just respond to a couple things. 

Re Constitutional AI: I agree normatively that it seems bad to hand over judging AI debates to AIs[1]. I also think this will happen. To quote from the original AI Safety via Debate paper, 

Human time is expensive: We may lack enough human time to judge every debate, which we can address by training ML models to predict human reward as in Christiano et al. [2017]. Most debates can be judged by the reward predictor rather than by the humans themselves. Critically, the reward predictors do not need to be as smart as the agents by our assumption that judging debates is easier than debating, so they can be trained with less data. We can measure how closely a reward predictor matches a human by showing the same debate to both.

Re 

We'd also really contest the 'perform very similarly to human raters' is enough---it'd be surprising if we already have a free lunch, no information lost, way to simulate humans well enough to make better AI. 

I also find this surprising, or at least I did the first 3 times I came across medium-quality evidence pointing in this direction. I don't find it as surprising anymore, because I've updated my understanding of the world to "welp, I guess 2023 AIs actually are that good on some tasks." Rather than making arguments to try to convince you, I'll just link some of the evidence that I have found compelling; maybe you will find it compelling too, maybe not: Model Written Evals, MACHIAVELLI benchmark, Alpaca (maybe the most significant for my thinking), this database, Constitutional AI.

I'm far from certain that this trend, of LLMs being useful for making better LLMs and for replacing human feedback, continues rather than hitting a wall in the next 2 years, but it does seem more likely than not to me, based on my read of the evidence. Some important decisions in my life depend on how soon this AI stuff is happening (for instance, if we have 20+ years I should probably aim to do policy work), so I'm pretty interested in having correct views. Currently, LLMs improving the next generation of AIs via more and better training data is one of the key factors in how I'm thinking about this. If you don't find these particular pieces of evidence compelling and are able to explain why, that would be useful to me! 

  1. ^

    I'm actually unsure here. I expect there are some times where it's fine to have no humans in the loop and other times where it's critical. It generally gives me the ick to take humans out of the loop, but I expect there are some times where I would think it's correct. 

The article doesn't seem to have a comment section so I'm putting some thoughts here. 

  • Economic growth: I don't feel I know enough about historical economic growth to comment on how much to weigh the claim that "the trend growth rate of GDP per capita in the world's frontier economy has never exceeded three percent per year." I'll note that I think the framing here is quite different from that of Christiano's Hyperbolic Growth, despite them looking at roughly the same data as far as I can tell. 
  • Scaling current methods: the article seems to cherrypick the evidence pretty significantly and makes the weak claim that "Current methods may also not be enough." It is obvious that my subjective probability that current methods are enough should be <1, but I have yet to come across arguments that push that credence below, say, 50%. 
    • "Scaling compute another order of magnitude would require hundreds of billions of dollars more spending on hardware." This is straightforwardly false. The table included in the article, from the Chinchilla paper with additions, is a bit confusing because it doesn't include where we are now, and because it lists only model size rather than total training compute (FLOP). Based on Epoch's database of models, PaLM 2 is trained with about 7.34e24 FLOP, and GPT-4 is estimated at 2.10e25 (note these are not official numbers). This corresponds to being around the 280B param (9.9e24 FLOP) or 520B param (3.43e25 FLOP) rows in the table. In this range, tens of millions of dollars are being spent on compute for the biggest training runs now. It should be obvious that you can get a couple more orders of magnitude more compute before hitting hundreds of billions of dollars. In fact, the 10 Trillion param row in the table, listed at $28 billion, corresponds to a total training compute of 1.3e28 FLOP, which is more than 2 orders of magnitude above the biggest publicly-known models are estimated. I agree that cost may soon become a limiting factor, but the claim that an order of magnitude would push us into hundreds of billions is clearly wrong given that currently costs are tens of millions. 
    • Re cherrypicking data, I guess one of the most important points that seems to be missing from this section is the rate of algorithmic improvement. I would point to Epoch's work here. 
  • "Constitutional AI, a state-of-the-art alignment technique that has even reached the steps of Capitol Hill, also does not aim to remove humans from the process at all: "rather than removing human supervision, in the longer term our goal is to make human supervision as efficacious as possible."" This seems to me like a misunderstanding of Constitutional AI, for which a main component is "RL from AI Feedback." Constitutional AI is all about removing humans from the loop in order to get high quality data more efficiently. There's a politics thing where developers don't want to say they're removing human supervision, and it's also true that human supervision will probably play a role in data generation in the future, but the human:total (AI+human) contribution to data ratio is surely going to go down. For example research using AIs where we used to use humans, see also Anthropic's paper Model Written Evaluations, and the AI-labeled MACHIAVELLI benchmark. More generally, I would bet the trend toward automating datasets and benchmarks will continue, even if humans remain in the loop somewhat; insofar as humans are a limiting factor, developers will try to make them less necessary, and we already have AIs that perform very similarly to human raters at some tasks. 
  • "We are constantly surprised in our day jobs as a journalist and AI researcher by how many questions do not have good answers on the internet or in books, but where some expert has a solid answer that they had not bothered to record. And in some cases, as with a master chef or LeBron James, they may not even be capable of making legible how they do what they do." Not a disagreement, but I do wonder how much of this is a result of information being diffuse and just hard to properly find, a kind of task I expect AIs to be good at. For instance, 2025 language models equipped with search might be similarly useful to if you had a panel of relevant experts you could ask questions to. 
  • Noting that section 3, "Even if technical AI progress continues, social and economic hurdles may limit its impact," matters for some outcomes and not for others. It matters given that the authors define "transformative AI in terms of its observed economic impact." It matters for many outcomes I care about, like human well-being, that are related to economic impacts. It applies less to worries around existential risk and human disempowerment, for which powerful AIs may pose risks even while not causing large economic impacts ahead of time (e.g., bioterrorism doesn't require first creating a bunch of economic growth). 
    • Overall I think the claim of section 3 is likely to be right. A point pushing the other direction is that there may be a regulatory race to the bottom where countries want to enable local economic growth from AI and so relax regulations, think medical tourism for all kinds of services. 
  • "Yet as this essay has outlined, myriad hurdles stand in the way of widespread transformative impact. These hurdles should be viewed collectively. Solving a subset may not be enough." I definitely don't find the hurdles discussed here to be sufficient to make this claim. It feels like there's a motte and bailey, where the easy to defend claim is "these 3+ hurdles might exist, and we don't have enough evidence to discount any of them", and the harder to defend claim is "these hurdles disjunctively prevent transformative AI in the short term, so all of them must be conquered to get such AI." I expect this shift isn't intended by the authors, but I'm noting that I think it's a leap. 
  • "Scenarios where AI grows to an autonomous, uncontrollable, and incomprehensible existential threat must clear the same difficult hurdles an economic transformation must." I don't think this is the case. For example, section 3 seems to not apply as I mentioned earlier. I think it's worth noting that AI safety researcher Eliezer Yudkowsky has made a similar argument to what you make in section 3, and he is also thinks existential catastrophe in the near term is likely. I think the point your making here is directionally right, however, that AI which poses existential risk is likely to be transformative in the sense you're describing. That is, it's not necessary for such AI to be economically transformative, and there are a couple other ways catastrophically-dangerous AI can bypass the hurdles you lay out, but I think it's overall a good bet that existentially dangerous AIs are also capable of being economically transformative, so the general picture of hurdles, insofar as they are real, will affect such risks as well [I could easily see myself changing my mind about this with more thought]. I welcome more discussion on this point and have some thoughts myself, but I'm tired and won't include them in this comment; happy to chat privately about where "economically transformative" and "capable of posing catastrophic risks" lie on various spectrums. 

While my comment has been negative and focused on criticism, I am quite glad this article was written. Feel free to check out a piece I wrote, laying out some of my thinking around powerful AI coming soon, which is mostly orthogonal to this article. This comment was written sloppily, partially as my off-the-cuff notes while reading, sorry for any mistakes and impolite tone. 

I'm not Buck, but I can venture some thoughts as somebody who thinks it's reasonably likely we don't have much time.

Given that "I'm skeptical that humans will go extinct in the near future" and that you prioritize preventing suffering over creating happiness, it seems reasonable for you to condition your plan on humanity surviving the creation of AGI. You might then back-chain from possible futures you want to steer toward or away from. For instance, if AGI enables space colonization, it sure would be terrible if we just had planets covered in factory farms. What is the path by which we would get there, and how can you change it so that we have e.g., cultured meat production planets instead. I think this is probably pretty hard to do; the term "singularity" has been used partially to describe that we cannot predict what would happen after it. That said, the stakes are pretty astronomical such that I think it would be pretty reasonable for >20% of animal advocacy effort to be specifically aimed at preventing AGI-enabled futures with mass animal suffering. This is almost the opposite of "we have ~7 years to deliver (that is, realise) as much good as we can for animals." Instead it might be better to have an attitude like "what happens after 7 years is going to be a huge deal in some direction, let's shape it to prevent animal suffering."

I don't know what kind of actions would be recommended by this thinking. To venture a guess: trying to accelerate meat alternatives, and doing lots of polling on public opinion about the moral questions around eating meat (with the goal of hopefully finding that humans think factory farming is wrong, so that a friendly AI system might adopt such a goal as well; human behavior in this regard seems like a particularly bad basis on which to train AIs). I'm pretty uncertain about these two ideas and wouldn't be surprised if they're actually quite bad. 

I agree that persuasion frames are often a bad way to think about community building.

I also agree that community members should feel valuable, much in the way that I want everybody in the world to feel valued/loved.

I probably disagree about the implications, as they are affected by some other factors. One intuition that helps me is to think about the donors who fund community building efforts. I expect that these donors are mostly people who care about preventing kids from dying of malaria, and many donors also donate lots of money towards charities that can save a kid's life for $5,000. They are, I assume, donating toward community building efforts because they think these efforts are on average a better deal, costing less than $5,000 per life saved in expectation.

For mental health reasons, I don't think people should generally hold themselves to this bar and be like "is my expected impact higher than where money spent on me would go otherwise?" But I think when you're using other people's altruistic money to community build, you should definitely be making trade-offs, crunching numbers, and otherwise aiming to maximize the impact from those dollars.
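To be concrete about the kind of number-crunching I have in mind, here is a toy sketch with entirely made-up figures (the grant size and the impact estimate are hypothetical):

```python
# Toy check with made-up numbers: does a community-building grant beat the ~$5,000-per-life bar?
grant_cost = 100_000              # hypothetical grant size, in dollars
expected_lives_saved_equiv = 25   # hypothetical expected counterfactual impact, in lives-saved equivalents
bar_per_life = 5_000              # rough cost to save a life via top global health charities

cost_per_life_equiv = grant_cost / expected_lives_saved_equiv
print(cost_per_life_equiv, "vs", bar_per_life)  # 4000.0 vs 5000 -> clears the bar in this toy case
```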

Furthermore, I would be extremely worried if I learned that community builders aren’t attempting to quantify their impact or think about these things carefully (noting that I have found it very difficult to quantify impact here). Community building is often indistinguishable (at least from the outside) from “spending money on ourselves” and I think it’s reasonable to have a super high bar for doing this in the name of altruism.

Noting again that I think it’s hard to balance mental health with the whacky terrible state of the world where a few thousand dollars can save a life. Making a distinction between personal dollars and altruistic dollars can perhaps help folks preserve their mental health while thinking rigorously about how to help others the most. Interesting related ideas:

https://www.lesswrong.com/posts/3p3CYauiX8oLjmwRF/purchase-fuzzies-and-utilons-separately
https://forum.effectivealtruism.org/posts/zu28unKfTHoxRWpGn/you-have-more-than-one-goal-and-that-s-fine

Sorry about the name mistake. Thanks for the reply. I'm somewhat pessimistic about us two making progress on our disagreements here because it seems to me like we're very confused about basic concepts related to what we're talking about. But I will think about this and maybe give a more thorough answer later. 

Edit: corrected name, some typos and word clarity fixed

Overall I found this post hard to read, and I spent far too long trying to understand it. I suspect the author is about as confused about key concepts as I am. David, thanks for writing this; I am glad to see writing on this topic and I think some of your points are gesturing in a useful and important direction. Below are some tentative thoughts about the arguments. For each core argument I first try to summarize your claim and then respond; hopefully this makes it clearer where we actually disagree vs. where I am misunderstanding.

High level: The author claims that the risk of deception arising is <1%, but they don't provide numbers elsewhere. They argue that 3 conditions must all be satisfied for deception, and that none of them is likely; how likely each one is determines that 1% number. My evaluation of the arguments (below) is that for each of these conjunctive conditions my rough probabilities (where higher means deception is more likely) are: (totally unsure, can't reason about it) * (unsure but maybe low) * (high), yielding an unclear but probably >1% probability (a quick arithmetic sketch follows the list below).

  • Key claims from post:
    • Why I expect an understanding of the base objective to happen before goal-directedness: "Models that are only pre-trained almost certainly don't have consequentialist goals beyond the trivial next token prediction. Because a pre-trained model will already have high-level representations of key base goal concepts, all it will have to do to become aligned is to point them." Roughly, the argument is that pretraining on tons of data will give a good idea of the base objective but not cause goal-directed behavior, and then we can just make the model do the base objective thing.
      • My take: It’s not obvious what the goals of pre-trained language models are or what the goals of RLHFed models; plausibly they both have a goal like “minimize loss on the next token” but the RLHF one is doing that on a different distribution. I am generally confused about what it means for a language model to have goals. Overall I’m just so unsure about this that I can’t reasonably put a probability on models developing an understanding of the base objective before goal directedness, but I wouldn’t confidently say this number is high or low. An example of the probability being high is if goal-directedness only emerges in response to RL (this seems unlikely); an example of the probability being low would be if models undergoing pre-training become goal-directed around predicting next tokens early on in training. Insofar as David thinks this probability is high, I do not understand why.
    • Why I expect an understanding of the base objective to happen significantly before optimizing across episodes/long-term goal horizons: You only get long-term goals via gradient descent finding them, but this is unlikely to happen because gradient descent operates on a hyper-local horizon. Training runs + oversight will be quite long periods, so even if gradient descent moves you to “slightly-long-term goals,” these won’t perform well.
      • My take: This argument makes the most sense to me, or at least I think we can reason about it more easily than the others. Pointing in the other direction, phase changes seem somewhat likely here; humans generally don't care about outcomes in the world 100 or 1,000 years out, and then some of them get sold on longtermism and suddenly care about 10,000 years out. "On what time-span do I care about my goals" is plausibly a measure that will be discontinuous. Perhaps this looks like the goal "where am I minimizing loss" jumping horizons from "next token" to "this sentence/response" and perhaps to "all of my performance ever" or "all of the performance of models similar to myself ever" or "all of the performance of models similar to myself in the multiverse". I'm also unconfident about how likely this is, including still being confused about having goals or understanding base objectives, but I would not be surprised if the author turned out to be right that models understand the base objective before doing long-term goal optimization. Unfortunately, I expect that competitive pressures will lead AI developers to want their AIs to pursue long-term objectives, and that might mess things up.
    • Why I expect understanding of the base objective to happen significantly before situational awareness: The deception-relevant situational awareness, which involves understanding one’s future gradient updates, is unlikely to be selected for by gradient descent. This is because, assuming the model has a goal acting only in the current training batch, reasoning about future gradient updates is unlikely to improve short-term reward by much. Models might also get situational awareness via human generated prompts, but this is unlikely because we won’t want them to have this awareness of their own gradient updates.
      • My take: I think that GPT-3 probably has the relevant situational awareness in that its world model understands gradient descent. With the current approach to LLMs it seems this just comes pre-loaded into the models. I am not saying GPT-3 is dangerous due to this information, I am saying that the factual information relevant to situational awareness gets baked into pre-training at some point by default, and this could very well be before the model has an understanding of its base objective.
      • David writes "For a model to have this property, it would need goal horizons that extended at least beyond the next gradient update." I'm not sure I buy this, but let's assume it for now. We care about how likely all 3 of these required-for-deception characteristics are conjunctively. If we're saying that the 3rd requires the 2nd, then we can just condition on the 2nd: given that the model gets long-term planning before an understanding of the base objective, how likely is dangerous situational awareness (before understanding of the base objective)? Seems pretty likely now. For the question "how likely are we to get deception-relevant situational awareness before an understanding of the base objective?" I'm probably at like 90% conditioning on long-term goals, and still pretty high without conditioning. Yet again I am confused by what understanding the base objective means here.
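To make the ">1%" arithmetic concrete, here is a tiny sketch. The specific values are placeholders I am inventing to match the qualitative labels in my summary above ("totally unsure," "unsure but maybe low," "high"); the point is just that plausible values for the three conjunctive conditions multiply to well above 1%:

```python
# Placeholder probabilities (my invention) that each deception-enabling condition holds,
# matching the qualitative labels in the summary above.
p_goal_directedness_before_base_objective = 0.5      # "totally unsure, can't reason about it"
p_long_horizons_before_base_objective = 0.2          # "unsure but maybe low"
p_situational_awareness_before_base_objective = 0.9  # "high"

p_all_three = (
    p_goal_directedness_before_base_objective
    * p_long_horizons_before_base_objective
    * p_situational_awareness_before_base_objective
)
print(f"{p_all_three:.0%}")  # 9% -- well above the post's <1% estimate
```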

FWIW I often vote on posts at the top without scrolling because I listened to the post via the Nonlinear podcast library or read it on a platform where I wasn't logged in. Not all that important a consideration, but worth being aware of. 
