John Wentworth has a post on Godzilla strategies where he claims that tasking an AGI with solving the alignment problem is like asking Godzilla to make a larger Godzilla behave. How will you ensure you don't overshoot the intelligence of the agent you're using to solve alignment and fall into the "Godzilla trap"?
TL;DR: I totally agree with the general spirit of this post: we need people to solve alignment, and we're not on track. Go and work on alignment, but before you do, try to engage with the existing research; there are reasons why it exists. There are a lot of things not getting worked on within AI alignment research, and I can almost guarantee that within six months to a year you can find things that people haven't worked on.
So go and find these underexplored areas in a way where you engage with what people have done before you!
There’s no secret elite SEAL team coming to save the day. This is it. We’re not on track.
If timelines are short and we don’t get our act together, we’re in a lot of trouble. Scalable alignment—aligning superhuman AGI systems—is a real, unsolved problem. It’s quite simple: current alignment techniques rely on human supervision, but as models become superhuman, humans won’t be able to reliably supervise them.
But my pessimism on the current state of alignment research very much doesn’t mean I’m an Eliezer-style doomer. Quite the opposite, I’m optimistic. I think scalable alignment is a solvable problem—and it’s an ML problem, one we can do real science on as our models get more advanced. But we gotta stop fucking around. We need an effort that matches the gravity of the challenge.[1]
I also agree that Eliezer's style of doom seems uncalled for and that this is a solvable but difficult problem. My personal p(doom) is something around 20%, which I think is quite reasonable.
Barely anyone is going for the throat of solving the core difficulties of scalable alignment. Many of the people who are working on alignment are doing blue-sky theory, pretty disconnected from actual ML models. Most of the rest are doing work that’s vaguely related, hoping it will somehow be useful, or working on techniques that might work now but predictably fail to work for superhuman systems.
Now I do want to push back on this claim, as I see it made by a lot of people who haven't fully engaged with the more theoretical alignment landscape. There are only 300 people working on alignment, but those people are actually doing things, and most of them aren't doing blue-sky theory.
A note on the ARC claim:
But his research now (“heuristic arguments”) is roughly “trying to solve alignment via galaxy-brained math proofs.” As much as I respect and appreciate Paul, I’m really skeptical of this: basically all deep learning progress has been empirical, often via dumb hacks[3] and intuitions, rather than sophisticated theory. My baseline expectation is that aligning deep learning systems will be achieved similarly.[4]
This is essentially a claim about the methodology of science: that working on existing systems yields more information and more breakthroughs than working on blue-sky theory. The current hypothesis for why is that real-world research is simply a lot more information-rich. Running experiments is, however, not the only way to get real-world feedback loops. Christiano is not working on blue-sky theory; he's using real-world feedback loops in a different way, by looking at the real world for information that's already there.
A discovery of this type is, for example, the tragedy of the commons; whilst we could have created computer simulations to see the process in action, it's 10x easier to look at the world and see the failures in real time. Christiano's research methodology is to tell stories about the future and see where they fail. This gives bits of information on where to do future experiments, much like how we would be able to tell that humans would fail to stop overfishing without actually running an experiment on it.
This is also what John Wentworth does with his research; he looks at the real world as a reference frame which is quite rich in information. Now a good question is why we haven't seen that many empirical predictions from Agent Foundations. I believe it is because alignment is quite hard, and specifically, it is hard to define agency in a satisfactory way due to some really fuzzy problems (boundaries, among others) and, therefore, hard to make predictions.
We don't want to mathematize things too early either, as doing so would put us into a predefined reference frame that it might be hard to escape from. We want to find the right ballpark for agents since if we fail we might base evaluations on something that turns out to be false.
In general, there's a difference between the types of problems in alignment and in empirical ML; the reference class of a "sharp left turn" is different from something empirically verifiable because it is unclearly defined, so a good question is how we should turn one into the other. This question of how we turn recursive self-improvement, inner misalignment and agent foundations into empirically verifiable ML experiments is actually something that most of the people I know in AI alignment are actively working on.
This post from Alexander Turner is a great example of doing this, as they try "just retargeting the search".
Other people are trying other things, such as bounding the maximisation in RL with quantilisers. This would, in turn, make the AI more "content" with not maximising. (A fun parallel to how utilitarianism shouldn't be unbounded.)
I could go on with examples, but what I really want to say here is that alignment researchers are doing things; it's just hard to see why they're doing them when you're not doing alignment research yourself. (If you want to start, book my Calendly and I might be able to help you.)
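To make the quantiliser idea a bit more concrete, here is a minimal Python sketch. It assumes a finite list of candidate actions standing in for samples from some trusted base distribution; the names `quantilize`, `actions`, `utility` and `q` are purely illustrative and not taken from any library or from the research mentioned above. Rather than returning the single highest-utility action, it picks at random from the top `q` fraction.

```python
import math
import random
from typing import Callable, Sequence, TypeVar

A = TypeVar("A")


def quantilize(
    actions: Sequence[A],
    utility: Callable[[A], float],
    q: float,
    rng: random.Random,
) -> A:
    """Pick uniformly at random from the top-q fraction of `actions`.

    `actions` is assumed to be a list of samples from a base
    distribution (hypothetical setup for illustration). With q = 1
    this just samples the base distribution; as q shrinks towards 0
    it approaches plain maximisation.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    if not actions:
        raise ValueError("actions must be non-empty")
    # Rank candidates from highest to lowest utility.
    ranked = sorted(actions, key=utility, reverse=True)
    # Keep at least one candidate so the choice is always defined.
    k = max(1, math.ceil(q * len(ranked)))
    return rng.choice(ranked[:k])


# Usage: 100 toy "actions" whose utility is just their own value.
rng = random.Random(0)
candidates = list(range(100))
print(quantilize(candidates, float, 0.25, rng))  # some value in the top quarter
```

The design choice is the point: the agent settles for "good enough relative to what a trusted baseline would do" rather than pushing utility as far as it will go, which limits how hard it can exploit errors in the utility function.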
So what does this mean for an average person? You can make a huge difference by going in and engaging with arguments and coming up with counter-examples, experiments and theories of what is actually going on.
I just want to say that it's most likely paramount to engage with the existing alignment research landscape before you start, as it's free information and it's easy to fall into traps if you don't. (A good resource for avoiding some traps is John's Why Not Just sequence.)
There's a couple of years' worth of research there; it is not worth rediscovering from the ground up. Still, this shouldn't stop you. Go and do it; you don't need a hero licence.
Thank you for this! I'm hoping that this enables me to spend a lot less time on hiring in the future. I feel that this is a topic that could easily have taken me 3x the effort to understand if I hadn't gotten some very good resources from this post so I will definitely check out the book and again, awesome post!
Good post; interesting point that the impact of the founder effect is probably higher in longtermism, and I would tend to agree that starting a new field can have a big impact. (Such as wild animal suffering in space, NO FISH ON MARS!)
Not to be the guy that points something out, but I will be that guy; why not use the classic EA jargon of counterfactual impact instead of contingent impact?
Essentially that the epistemics of EA are better than in previous longtermist movements. EA's frameworks are a lot more advanced, with things such as thinking about the tractability of a problem, not Goodharting on a metric, forecasting calibration, RCTs and so on: techniques that other movements didn't have.
Maybe frame it more as if you're talking to a child. Yes, you can tell the child to follow something, but how are you certain that it will do it?
Similarly, how can we trust the AI to actually follow the prompt? To trust it we would fundamentally have to understand the AI or safeguard against problems if we don't understand it. The question then becomes how your prompt is represented in machine language, which is very hard to answer.
To reiterate, ask yourself, how do you know that the AI will do what you say?