Note: this is the first part of an essay on my Substack; check out the full essay to see the solutions I put forward.
When Richard Sutton introduced the bitter lesson for AI in 2019, he broke the myth that great human ingenuity is needed to create intelligent machines. All it seems to take is a lot of computing power and algorithms that scale easily, rather than clever techniques designed with deep understanding. The two major classes of algorithms that fit this are search and learning; when they are scaled up, advanced AI systems naturally emerge. Sutton’s key insight can be summarised in his final paragraph:
[a] general point to be learned from the bitter lesson is that the actual contents of minds are tremendously, irredeemably complex; we should stop trying to find simple ways to think about the contents of minds, […] instead we should build in only the meta-methods [search and learning] that can find and capture this arbitrary complexity. […] We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done.
A concrete example of this can be seen in the development of chess algorithms. Early chess programs, like those in the 1970s and 1980s, relied heavily on human-crafted heuristics — rules and strategies devised by experts to mimic human understanding of the game. These systems could play decently but were limited by the ingenuity and foresight of their human designers. In contrast, modern chess engines like AlphaZero, developed by DeepMind, rely only on search and learning.
With the rise of LLMs, we’re now seeing this play out again in the domain of general intelligence. After researchers at Google introduced the transformer in 2017 — a robust and scalable artificial neural network architecture — training it on all the text on the internet was enough to produce the first AIs capable of passing simple versions of the Turing test. So we’ve seen how well scaling with learning works; what about search?
Well, we’re about to find out in the next generation of LLMs, first introduced to the public in OpenAI’s o1. These systems make copious use of “test-time compute”, generating many different chains of thought to answer a question and searching over them to find the best possible answer. The AI generates multiple solutions to a problem, like a student brainstorming answers to a tricky exam question, then carefully reviews its options to pick the one that makes the most sense. This enables models to perform deeper reasoning and adapt to complex tasks in real time. While learning gives you LLMs with System 1 thinking ability, search introduces System 2.
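To make the pattern concrete, here is a minimal sketch of the best-of-N flavour of test-time search. It illustrates the general idea only, not OpenAI’s actual method; `generate_chain` and `score_chain` are hypothetical stand-ins for sampling a chain of thought from a model and scoring it with a verifier or reward model.

```python
import random

# Minimal sketch of best-of-N test-time search. This is an illustrative
# assumption about how such systems might work, not a real implementation.

def generate_chain(question: str) -> str:
    """Placeholder: sample one chain-of-thought answer from a model."""
    return f"reasoning path {random.random():.3f} for: {question}"

def score_chain(chain: str) -> float:
    """Placeholder: a verifier or reward model rates how plausible the chain is."""
    return random.random()

def best_of_n(question: str, n: int = 16) -> str:
    """Spend more test-time compute: sample n chains and keep the highest-scoring one."""
    candidates = [generate_chain(question) for _ in range(n)]
    return max(candidates, key=score_chain)

if __name__ == "__main__":
    print(best_of_n("What is the next number in the sequence 2, 4, 8, ...?"))
```

The key point is that quality now scales with how much compute you spend at inference time (the size of `n`), not just with how much was spent during training.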
The jury is still out as to whether applying learning and search in this way will be enough to build a true Artificial General Intelligence (AGI). But that’s not going to matter in the medium term, as these systems are already set to have a significant economic impact. The bitter lesson has therefore shifted cutting-edge AI research from universities and independent labs to Big Tech, where the hundreds of billions of dollars required to exploit scaling are readily available.
However, many conscientious AI researchers — including leading lights of the field like Hinton and Bengio — are keenly aware of the dangers of putting transformative technology in the hands of profit maximisers. As a result, they have turned their attention to AI Safety research, focusing particularly on AI alignment — the technical problem of how to make sure advanced AI systems behave in a way that is aligned with the future flourishing of humanity.
But I’m worried we may be falling into another bout of wishful thinking. First, we thought we would need human ingenuity to carefully design algorithms to build powerful AI. Instead, Sutton’s bitter lesson showed us that we just need to choose generic algorithms that scale well. To solve alignment, we are currently assuming that we can develop some clever techniques which will allow us both to understand what an AI is planning and to make sure its incentives go in the direction we want. But again, we’re making the mistake of thinking we will be able to understand what’s going on in detail and come up with some ingenious algorithms to manipulate it. Instead, as Sutton pointed out, what really matters are big general principles, encapsulated in “meta-methods”. Only for alignment, the relevant meta-methods should be about scaling incentives rather than scaling intelligence.
So, what could this general principle be? If you look at the way the incentive systems of all existing autonomous agents we know about have developed, it becomes clear that there is only one candidate — the law of natural selection. Does this put us on the brink of a second bitter lesson? This time relating to building aligned, rather than intelligent, systems? Let’s have a look at how this might play out.
Techno-optimism vs techno-pessimism
The general techno-optimist slant is that, if we build sophisticated AIs, we’ll be able to conquer all of humanity’s most pressing challenges with the glut of intelligence at our command. Climate change will be swiftly solved with efficient carbon capture, cancer and heart disease will be cured with advanced biotechnology, and all menial tasks will be carried out by machines — leaving humans to do as they will in a flourishing utopia.
But this fantasy overlooks the fact that you need some kind of incentive system for these problem-solving AIs to get built. We’re used to such incentives coming either through the free market or through some government-mediated program, motivated by economic reasoning or public popularity. These incentives are fundamentally human-focused: they rely on the human need for resources, power and status.
As AIs don’t start off with any inherent need for power or status, they will be driven by obscure incentives related to what they were rewarded for when trained. Now, this works while the training incentives are aligned with the incentives of the humans that deploy AIs. However, as more and more of these systems get deployed, the incentives that they are working under will get less and less clear.
The more advanced AI gets, the more the incentive system needed to get things done changes. For AIs to actually do their work, you have to give them a large degree of freedom and autonomy; this is fundamental to advanced AI’s ability to solve problems. As Sutton noted, the hope with very intelligent AI is that it’s able to learn things about the world we have difficulty thinking about and understanding, which necessitates us putting trust in it.
The fact that we will build AIs with freedom and autonomy gets even clearer when you put economic incentives in the picture. The biggest promise of generating value with AI comes from building AI Agents — systems which autonomously carry out a series of steps to fulfil an objective. As agents get more and more capable, they will be given more and more freedom, resulting in an eventual world economy dominated by the actions of diverse and numerous AI Agents.
How does this imply a divergence of incentives from those directly provided to AIs by their human creators? Well, giving AIs more autonomy inherently grants them the freedom to choose their own sub-incentives. You may give an AI a broad goal, but all the sub-goals that it decides to pursue are in its own hands.
This is something that has long been discussed in the AI alignment literature. One important idea is instrumental convergence, the hypothesis that all sophisticated intelligent agents are likely to pursue certain intermediate objectives — such as acquiring resources, improving their capabilities, and ensuring their survival—because these help achieve a wide range of final goals.
So, as we give AI systems more freedom and autonomy, we are actually giving more influence to sub-incentives chosen by the AIs. This will most likely develop to the point that the behaviour driven by the sub-incentives dominates. For instance, an AI designed to solve climate change might prioritise actions that ensure its continued operation—such as securing resources or resisting shutdown—over its original goals.
When existing in a messy world with a large variety of different systems running on different incentives, one thing separates the wheat from the chaff — the ability to effectively replicate. This reflects a fundamental principle: systems that replicate successfully will naturally proliferate. For example, imagine a garden where plants compete for sunlight. The ones that grow tallest, even if they weren’t planted to be tall, will overshadow the others. Similarly, in a competitive AI environment, systems that are best at spreading and surviving will outcompete others, even if that wasn’t their original purpose. So, a big complex decentralised world of competing AIs will inevitably lead to a world full of AIs that are very good at replicating. These dynamics are explored in depth by Dan Hendrycks in his paper “Natural Selection Favors AIs over Humans”.
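The selection argument can be made concrete with a toy simulation. The sketch below is a deliberately simplified illustration, not a model from Hendrycks’ paper; the agent types and replication rates are invented for the example. It tracks three populations of deployed AI agents that differ only in how reliably they copy themselves under a shared resource limit, and whatever goals they were given, the best replicator tends to dominate.

```python
import random

# Toy illustration of the selection argument, not a model of real AI ecosystems.
# Hypothetical replication probabilities per round for three deployed AI types.
replication_rate = {"climate_solver": 0.10, "assistant": 0.12, "aggressive_replicator": 0.25}

population = {name: 100 for name in replication_rate}  # start with equal numbers
CAPACITY = 10_000  # crude shared resource limit that slows growth as it fills

for round_ in range(50):
    total = sum(population.values())
    crowding = max(0.0, 1 - total / CAPACITY)  # less room to grow as resources run out
    for name, count in population.items():
        # each existing copy replicates with its type's probability, scaled by crowding
        offspring = sum(random.random() < replication_rate[name] * crowding
                        for _ in range(count))
        population[name] += offspring

final_total = sum(population.values())
shares = {name: count / final_total for name, count in population.items()}
print(shares)  # the aggressive replicator typically ends up with by far the largest share
```

The point of the toy model is that nothing rewards the agents for their assigned goals; population share is determined entirely by replication success, which is the dynamic the essay argues will eventually swamp whatever incentives we set.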
This brings us back to the second bitter lesson — any sufficiently complex world with scaled out AI will tend to generate AIs with incentives that cause them to replicate more effectively. As we give systems more freedom, the incentives that we set them will become less and less relevant, and the general principle of natural selection will become more and more prevalent. All our work to carefully design aligned AI systems will go out the window, and we’ll be left with incentives determined by the principle of natural selection.
To see the solutions I put forward, see the rest of the essay here: https://pursuingreality.substack.com/p/the-second-bitter-lesson