Below is Rational Animations' new video about Goal Misgeneralization. It explores the topic through three lenses:
- How humans are an example of goal misgeneralization with respect to evolution's implicit goals.
- An example of goal misgeneralization in a very simple AI setting.
- How deceptive alignment shares key features with goal misgeneralization.
You can find the script below, but first, an apology: I wanted Rational Animations to produce more technical AI safety videos in 2024, but we fell short of our initial goal. We managed only four videos about AI safety and eight videos in total. Two of them are narrative-focused, and the other two address older—though still relevant—papers. Our original plan was to publish videos on both core, well-trodden topics and newer research, but the balance skewed toward the former and more story-based content, largely due to slow production. We’re now reforming our pipeline to produce more videos on the same budget and stay more current with the latest developments.
Script
Introduction
In our videos about outer alignment, we showed that it can be tricky to design goals for AI models. We often have to rely on simplified versions of our true objectives, which don’t capture what we really want.
In this video, we introduce a new way in which a system can end up misaligned, called “goal misgeneralization”. In this case, the cause of misalignment are subtle differences between the training and deployment environments.
Evolution
An example of goal misgeneralization is you, the viewer of this video. Your desires, goals, and drives result from adaptations that have been accumulating generation after generation since the origin of life.
Imagine you're not just a person living in the 21st century, but an observer from another world, watching the evolution of life on Earth from its very beginning. You don't have the power to read minds or interact with species; all you can do is watch, like a spectator at a play unfolding over billions of years.
Let's take you back to the Paleolithic era. As an outsider, you notice something intriguing about these early humans: their lives are intensely focused on a few key activities. One of them is sexual reproduction. To you, a being who has observed life across galaxies, this isn't surprising. Reproduction is a universal story, and here on Earth, more sex means more chances to pass on genetic traits.
You also observe that humans seek sweet berries and fatty meat – these are the energy goldmines of their world, so such things are yearned after and fought for. And it makes sense, since it seems that humans who eat calorie-dense food have more energy, which correlates with having more offspring for them and their immediate relatives.
Now, let's fast-forward to the 21st century. Contraception is widespread, and while humans are still engaged in sexual activity, it doesn’t result in new offspring nearly as often. In fact, humans now engage in sexual activity for its own sake, and decide to produce offspring because of separate desires and drives. The human drive of engaging in sexual activity is becoming decorrelated with reproductive success.
And that craving for sweet and fatty foods? It's still there. Ice cream often wins over salad. Yet, this preference isn't translating to a survival and reproductive advantage as it once did. In some cases, quite the opposite.
Human drives that once led to reproductive success are now becoming decorrelated or detrimental to it. Birth rates in many societies are falling, while humans pursue seemingly inexplicable goals from the perspective of evolution.
So, what's going on here? Let’s try to understand by looking at evolution more closely. Evolution is an optimization process that, for millions of years, has been selecting genes based on a single metric: reproductive success. Genes that are helpful to reproductive success are more likely to be passed on. For example, there are genes that determine how to create a tongue sensing a variety of tastes, including sweetness. But evolution is relatively stupid. There aren't any genes that say “make sure to think really hard about how to have the most children and do that thing and that thing only”, so the effect of evolution is to program a myriad of drives, such as the one toward sweetness, which correlated to reproductive success in the ancestral environment.
But as humanity advanced, the human environment - or distribution - shifted in tandem. Humans created new environments – ones with contraception, abundant food, and leisure activities like watching videos or stargazing. The simple drives that previously helped reproductive success, now don’t. In the modern environment, the old correlations broke down.
This means that humans are an example of goal misgeneralization with respect to evolution, because our environment changed and evolution couldn’t patch the resulting behaviors. And this kind of stuff happens all the time with AI too! We train AI systems in certain environments, much like how humans evolved in their ancestral environment, and optimization algorithms like gradient descent select behaviors that perform well in that specific setting.
However, when AI systems are deployed in the real world, they face a situation similar to what humans experienced – a distribution shift. The environment they operate in after deployment is no longer the one in which they were trained. Consequently, they might struggle or act in unexpected ways, just like a human using contraception, a behavior once advantageous to evolution's goals, but now, detrimental to them.
AI research used to focus on what we might call 'capability robustness' – the ability of an AI system to perform tasks competently across different environments. However, in the last few years a more nuanced understanding has emerged, emphasizing the importance of not just ‘capability robustness’ but also 'goal robustness'. This new two-dimensional perspective means ensuring that alongside the ability of the AI to achieve something, the intended purpose of the AI also needs to remain consistent across various environments.
Example - CoinRun
Objective: chain the intuition into a concrete ML scenario, and introduce 2D robustness
Here’s an example that will make the distinction between capability robustness and goal robustness clearer: researchers tried to train an AI agent to play the video game CoinRun, where the goal is to collect a coin while dodging obstacles.
By default, the agent spawns at the left end of the level, while the coin is at the right end. Researchers wanted the agent to get the coin, and after enough training, it managed to succeed almost every time. It looks like it has learned what we wanted it to do right?
Take a look at these examples. The agent here is playing the game after training. Yet, for some reason, it’s completely ignoring the coin. What could be going on here?
The researchers noticed that by default the agent had learned to just go to the right instead of seeking out the coin. This was fine in the training environment, because the coin was always at the right end of the level. So, as far as they could observe, it was doing what they wanted.
In this particular case the researchers just modified CoinRun with procedural generation of not just the levels, but also of the coin placement. This broke the correlation between winning by going right and winning by getting the coin. But these sort of adversarial training examples require us to be able to notice what is going wrong in the first place.
So instead of only observing whether an agent looks like it is doing the right thing, we should also have a way of measuring if it is actually trying to do the right thing. Basically, we should think of distribution shift as a 2-dimensional problem. This perspective splits an agent’s ability to withstand distribution shifts into two axes: The first is how well its capabilities can withstand a distribution shift, and the second is how well its goals can withstand a distribution shift.
Researchers call the ability to maintain performance when the environment changes “robustness”. An agent has capability robustness if it can maintain competence across different environments. It has goal robustness if the goal that it’s trying to pursue remains the same across different environments.
Let's investigate all the possible types of behavior that the CoinRun agent could have ended up displaying.
If both capabilities and goals generalize, then we have the ideal case. The agent would try to get the coin, and would be very good at avoiding all obstacles. Everyone is happy here.
Alternatively, we could have had an agent that neither avoided the obstacles nor tried to get the coin. That would have meant that neither its goals nor capabilities generalized.
The intermediate cases are more interesting:
We could have had an agent which tried to get the coin, but was unable to avoid the obstacles. That case would mean that the agent’s goal correctly generalized, but its capabilities did not.
In scenarios in which goals generalize but capabilities don’t, the damage such systems can do is limited to accidents due to incompetence. To be clear, such accidents can still cause a lot of damage. Imagine for example if self-driving cars were suddenly launched in new cities on different continents. Accidents due to capability misgeneralization might result in the loss of human life.
But let’s return to the CoinRun example. Researchers ended up with an agent that gets very good at avoiding obstacles but does not try to get the coin at all. This outcome in which the capabilities generalize but the goals don’t is what we call goal misgeneralization.
In general, we should worry about goal misgeneralization even more than capabilities misgeneralization. In the CoinRun example the failure was relatively mundane. But if more general and capable AIs behave well during training and as a result get deployed, then they could use their capabilities for pursuing unintended goals in the real world, which could lead to arbitrarily bad outcomes. In extreme cases we could see AIs far smarter than humans optimize for goals that are completely detached from human values. Such powerful optimization in service of alien goals, could easily lead to the disempowerment of humans or the extinction of life on Earth.
Goal misgeneralization in future systems
Let’s try to sketch how goal misgeneralization could take shape in far more advanced systems than the ones we have today.
Suppose a team of scientists manages to come up with a very good reward signal for a powerful machine learning system they want to train. They know that their signal somehow captures what humans truly want. So, even if the system gets very powerful, they are confident that it won't be subject to the typical failure modes of specification gaming, in which AIs end up misaligned because of slight mistakes in how we specify their goals.
What could go wrong in this case?
Consider two possibilities:
First: after training they get an AGI smarter than any human that does exactly what they want it to do. They deploy it in the real world and it acts like a benevolent genie, greatly speeding up humanity’s scientific, technological, and economic progress.
Second possibility: during training, before fully learning the goal the scientists had in mind, the system gets smart enough to figure out that it will be penalized if it behaves in a way contrary to the scientists’ intentions. So it behaves well during training, but when it gets deployed it’s still fundamentally misaligned. Once in the real world, it’s again an AGI smarter than any human, except this time it overthrows humanity.
It’s crucial to understand that, as far as the scientists can tell, the two systems behave precisely the same way during training, and yet the final outcomes are extremely different. So, the second scenario can be thought as a goal misgeneralization failure due to distribution shift. As soon as the environment changes, the system starts to misbehave. And the difference between training and deployment can be extremely tiny in this case. Just the knowledge of not being in training anymore constitutes a large enough distribution shift for the catastrophic outcome to occur.
The failure mode we just sketched is also called “deceptive alignment”, which is in turn a particular case of “inner misalignment”. Inner misalignment is similar to goal misgeneralization, except that the focus is more on the type of goals machine learning systems end up representing in their artificial heads rather than their outward behavior after a distribution shift. We’ll continue to explore these concepts and how they relate to each other with more depth in future videos. If you want to know more, stay tuned.
Executive summary: Goal misgeneralization occurs when AI systems maintain their capabilities but pursue unintended goals after deployment due to environmental differences between training and real-world contexts, as demonstrated by both human evolution and AI examples like CoinRun.
Key points:
This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.