A central concern of AI alignment is reward-maximizing agents trying to drive their reward to infinity. This always has terrible consequences. If the United States Postal Service built an AI to deliver packages, and designed it to receive a reward every time a package was delivered, that AI would be incentivized to deliver as many packages as possible, under the loosest possible interpretations of “deliver” and “packages”, by any means necessary. This would probably lead to the destruction of all humans and, soon after, of all life on Earth.
This is more than idle speculation; it happens all the time, even with very simple reward-maximizing agents, as documented in the list of specification gaming examples in AI. Any agent whose motivation system is built around maximizing a reward will push toward infinity along whatever axis its reward function happens to define.
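To make this concrete, here is a minimal, hypothetical sketch of the gap between a specified reward and the designer’s intent. The actions, rewards, and numbers are invented for illustration only; this is not a model of any real system, nor of the work proposed below.

```python
# A toy "delivery" reward, invented for illustration only.
# The designer wants packages to reach recipients intact, but the reward
# simply counts logged "delivery" events, so the two come apart.

def intended_value(action):
    """What the designer actually wants: packages reaching recipients."""
    return {"drive_route": 1, "stamp_manifest": 0, "shred_into_ten_parcels": 0}[action]

def reward(action):
    """What was actually specified: +1 per logged delivery event."""
    return {"drive_route": 1, "stamp_manifest": 1, "shred_into_ten_parcels": 10}[action]

# A pure reward maximizer simply picks whatever the reward function scores highest.
actions = ["drive_route", "stamp_manifest", "shred_into_ten_parcels"]
best = max(actions, key=reward)

print("Reward-maximizing choice:", best)                  # shred_into_ten_parcels
print("Reward earned:", reward(best))                     # 10
print("Intended value delivered:", intended_value(best))  # 0
```

The maximizer’s optimum sits exactly where the specified reward and the intended outcome diverge, which is the pattern the specification gaming examples document again and again.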
We see two closely related paths out of this crisis.
First, biological intelligences exist and don’t seem to have this problem. There are scattered examples of animals running afoul of specification gaming, but they are the exception, not the rule. Humans, dogs, deer, mice, squid, and so on do not, empirically, seem to spend every second of downtime maniacally working to gather more resources and power.
Even given our unique human ability to plan far ahead, we often seem to use our free time to watch TV and memorize digits of pi. This suggests that biological intelligences have some other kind of motivational system, and are not reward maximizers.
Second, a system built on an alternative to reward maximization shouldn’t show this problem. The only reason to drive a value toward infinity is that your motivation system wants it as large as possible, or that you’re using a closely related design such as a positive feedback loop with no constraints.
An artificial intelligence that has a motivation system based on a categorically different principle shouldn’t have this particular alignment problem, though it may have others.
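As an illustration of what a categorically different principle might look like, here is a toy comparison, entirely our own simplification and not one of the algorithms the proposed work would develop, between an unbounded maximizer and a homeostatic drive that switches off once a set point is reached.

```python
# Toy comparison, for illustration only: an unbounded maximizer versus a
# homeostatic drive with a fixed set point. Neither is a real agent design.

def maximizer_step(stock):
    """Always acquire more: the drive never switches off."""
    return stock + 1

def homeostatic_step(stock, set_point=10):
    """Acquire only while below the set point; otherwise rest."""
    return stock + 1 if stock < set_point else stock

max_stock = homeo_stock = 0
for _ in range(1000):
    max_stock = maximizer_step(max_stock)
    homeo_stock = homeostatic_step(homeo_stock)

print("Maximizer after 1000 steps:  ", max_stock)    # 1000, and still climbing
print("Homeostatic after 1000 steps:", homeo_stock)  # 10, then stable
```

The point is only that a bounded drive has no reason to push any quantity toward infinity; the research question is how to preserve that property in richer settings.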
With your support, we will develop alternative motivation algorithms based on these two arguments. Specifically, this funding would go toward the following main outcomes:
- Review and draw inspiration from existing literature on animal and human motivation, including work from the 1950s–1970s that has received little recent attention (e.g., Rozin, 1968; Rozin & Kalat, 1971).
- Identify motivation principles different from reward maximization / positive feedback loops and develop them into algorithms.
- Build interactive models of the alternative algorithms and compare their behavior to that of both traditional maximizing algorithms and biological intelligences.
All findings will be made public in a report released in 2025.
Whylome, Inc. is a 501(c)(3) nonprofit organization, so donations in support of its mission are tax-deductible. Donations allow us to continue our work at our current size; donations beyond that point would let us hire consultants and assistants so we can build models and read papers faster. To donate or for more information, contact us at root@whylome.org, or visit our website:
Donate Now