
My lab has started devoting some resources to AI safety work. As a transparency measure, and as a way of reaching out, I describe our approach here.

Overall Approach

I select small theoretical and practical work packages that...

  • seem manageable in view of our very limited resources,
  • match our mixed background in applied machine learning, game theory, agent-based modeling, complex networks science, dynamical systems theory, social choice theory, mechanism design, environmental economics, behavioral social science, pure mathematics, and applied statistics, and
  • appear under-explored or neglected but promising or even necessary, according to our subjective assessment, which is based on our reading of the literature and on exchanges with people from applied machine learning, computational linguistics, AI ethics, and, most importantly, AI alignment research (you?).

Initial Reasoning

I believe that the following are likely to hold:

  • We don't want the world to develop into a very low-welfare state.
  • Powerful AI agents that optimize for an objective not almost perfectly aligned with welfare can produce very low-welfare states.
  • Powerful AI agents will emerge soon enough.
  • It is impossible to specify sufficiently well what "welfare" means (welfare theorists have tried for centuries and still disagree, common people disagree even more).

My puzzling conclusion from this is:

  • We can't make sure that powerful AI agents optimize for an objective that is almost perfectly aligned with welfare.
  • Hence we must try to prevent any powerful AI agent from optimizing for any objective whatsoever.

Those of you who are Asimov fans like me might like the following...

Six Laws of Non-Optimizing

  1. Never attempt to optimize* your behavior with regard to any metric.
  2. Constrained by 1, don't cause suffering or do other harm.
  3. Constrained by 1-2, prevent other agents from violating 1 or 2.
  4. Constrained by 1-3, do what the stakeholders in your behavior would collectively decide you should do.
  5. Constrained by 1-4, cooperate with other agents.
  6. Constrained by 1-5, protect and improve yourself.

Rather than trying to formalize this or even define the terms precisely, I just use them to roughly guide my work.

*When saying "optimize" I mean it in the strict mathematical sense: aiming to find an exact or approximate, local or global maximum or minimum of some given function. When I mean mere improvements w.r.t. some metric, I just say "improve" rather than "optimize".

Agenda

We are currently pursuing two parallel approaches at a slow pace: the first relates to laws 1, 3, and 5 above, the second to law 4.

Non-Optimizing Agents

  • Explore several novel variants of "satisficing" policies and related learning algorithms for POMDPs, produce corresponding non-optimizing versions of classical to state-of-the-art tabular and ANN-based RL algorithms, and test and evaluate them in benchmark and safety-relevant environments from the literature, plus in tailor-made environments for testing particular hypotheses. This might or might not be seen as a contribution to Agent Foundations research. (Currently underway)
  • Test them in near-term relevant application areas such as autonomous vehicles, via state-of-the-art complex simulation environments. (Planned with partner from autonomous vehicles research)
  • Using our game-theoretical and agent-based modeling expertise, study them in multi-agent environments both theoretically and numerically.
  • Design evolutionarily stable non-optimizing strategies for non-optimizing agents that cooperate with others to punish violations of law 1 in paradigmatic evolutionary games.
  • Use our expertise in adaptive complex networks and dynamical systems theory to study dynamical properties of mixed populations of optimizing and non-optimizing agents: attractors, basins of attraction, their stability and resilience, critical states, bifurcations and tipping behavior, etc. (see the sketch below)
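To make the last item concrete, here is a minimal sketch (not one of our actual studies) of the kind of population dynamics meant there: replicator dynamics for a mixed population of "optimizers" and "non-optimizers", with a purely hypothetical payoff matrix. With these made-up payoffs the dynamics are bistable, with a tipping point near a 60% share of optimizers.

```python
import numpy as np

# Hypothetical 2x2 payoff matrix; strategy 0 = "optimizer",
# strategy 1 = "non-optimizer". The numbers are illustrative only.
A = np.array([[3.0, 1.0],
              [2.0, 2.5]])

def replicator_step(x, dt=0.01):
    """One Euler step of the replicator dynamics for the share x of optimizers."""
    p = np.array([x, 1.0 - x])
    f = A @ p                # expected payoff of each strategy
    avg = p @ f              # population-average payoff
    return x + dt * x * (f[0] - avg)

x = 0.3                      # initial share of optimizers (below the tipping point)
for _ in range(10_000):
    x = replicator_step(x)
print(f"long-run share of optimizers: {x:.3f}")  # tends to 0; start above ~0.6 and it tends to 1
```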

Collective Choice Aspects

  • Analyse existing schemes for Reinforcement Learning from Human Feedback (RLHF) from a Social Choice Theory perspective to study their implicit preference aggregation mechanism and its effects on inclusiveness, fairness, and diversity of agent behavior.
  • Reinforcement Learning from Collective Human Feedback (RLCHF): Plug suitable collective choice mechanisms from Social Choice Theory into existing RLHF schemes to make agents obey law 4. (Currently underway; see the sketch after this list)
  • Design collective AI governance mechanisms that focus on inclusion, fairness, and diversity.
  • Eventually merge the latter with the hypothetical approach to long-term high-stakes decision making described in this post.
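As a rough illustration of the RLCHF idea, one could aggregate several annotators' pairwise preferences with a collective choice rule before they enter a standard pairwise reward-model loss. The sketch below is only one possible reading: it uses simple majority voting and a Bradley-Terry loss, and all names and numbers are made up; the actual RLCHF scheme may look quite different.

```python
import math
from collections import Counter

def aggregate_pairwise_labels(votes):
    """Majority rule over annotators' pairwise preferences
    (1 = completion A preferred, 0 = completion B preferred).
    Any other collective choice rule could be plugged in here."""
    counts = Counter(votes)
    return int(counts[1] >= counts[0])   # ties broken toward A, for simplicity

def bradley_terry_loss(reward_a, reward_b, label):
    """Standard pairwise reward-model loss, applied to the aggregated label."""
    p_a = 1.0 / (1.0 + math.exp(reward_b - reward_a))
    return -math.log(p_a) if label == 1 else -math.log(1.0 - p_a)

votes = [1, 0, 1, 1, 0]                    # five annotators; the majority prefers A
label = aggregate_pairwise_labels(votes)   # -> 1
print(bradley_terry_loss(reward_a=0.3, reward_b=-0.1, label=label))
```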

Call for collaboration and exchange

Given almost non-existent funding, we currently rely on voluntary work by a few interns and students writing their theses, so I would be extremely grateful for additional collaborators and people who are willing to discuss our approach.

Thanks

I profited a lot from a few conversations with, amongst others, Yonatan Cale, Scott Garrabrant, Bob Jacobs, Jan Hendrik Kirchner, Vanessa Kosoy, David Manheim, Marcus Ogren (in alphabetical order). This is not meant to claim their endorsement of anything I wrote here, of course.

Comments (17)

Hey Jobst!

Regarding non-optimizing agents,

 

TL;DR: These videos from Robert Miles changed my mind about this, personally

(I think we talked about that but I'm not sure?)

 

A bit longer:

 

Robert (+ @edoarad ) convinced me that an agent that isn't optimizing anything isn't a coherent concept. Specifically, an agent that has a few things true about it, like "it won't trade things in a circle so that it will end up losing something and gaining nothing" will have a goal that can be described with a utility function.

If you agree with this, then I think it's less relevant to say that the agent "isn't maximizing anything" and more coherent to talk about "what is the utility function being maximized"

 

Informally:

If I am a paperclip maximizer, but every 100 seconds I pause for 1 second (and so, I am not "maximizing" paperclips), would this count as a non-optimizer, for you?

 

Also maybe obvious:

"5.  We can't just build a very weak system": Even if you succeed building a non-optimizer, it still needs to be pretty freaking powerful. So using a technique that just makes the AI very weak wouldn't solve the problem as I see it. (though I'm not sure if that's at all what you're aiming at, as I don't know the algorithms you talked about)

 

Ah,

And I encourage you to apply for funding if you haven't yet. For example here. Or if you can't get funding, I'd encourage you to try talking to a grantmaker who might have higher quality feedback than me. I'm mostly saying things based on 2 YouTube videos and a conversation.

Specifically, an agent that has a few things true about it, like "it won't trade things in a circle so that it will end up losing something and gaining nothing" will have a goal that can be described with a utility function.

Something is wrong here, because I fit the description of an "AGI", and yet I do not have a utility function. Within that theorem something is being smuggled in that is not necessary for general intelligence. 

Agree. Something that clarified my thinking on this (I still feel pretty confused!) is Katja Grace's counterarguments to the basic AI x-risk case. In particular, the section on "Different calls to 'goal-directedness' don't necessarily mean the same concept" and the discussion of "pseudo-agents" clarified how there are other ways for agents to take actions than purely optimizing a utility function (which humans don't do).

I mainly want to say I agree, this seems fishy to me too.

 

An answer I heard from an agent foundations researcher, if I remember correctly (I complained about almost the exact same thing): Humans do have a utility function, but they're not perfectly approximating it.

 

I'd add: Specifically, humans have a "feature" of (sometimes) being willing to lose all their money (in expectation) in a casino, and other such things. I don't think this is such a good safety feature (and also, if I had access to my own code, I'd edit that stuff away). But still this seems unsolved to me and maybe worth discussing more. (maybe MIRI people would just solve it in 5 seconds but not me)

It is interesting to think about the seeming contradiction here. Looking at the von Neumann-Morgenstern theorem you linked earlier, the specific theorem is about a rational agent choosing between several different options, and it says that if their preferences follow the axioms (no Dutch-booking etc.), you can build a utility function that describes those preferences.

First of all, humans are not rational, and can be Dutch-booked. But even if they were much more rational in their decision making, I don't think the average person would suddenly switch into "tile the universe to fulfill a mathematical equation" mode (with the possible exception of some people in EA).

Perhaps the problem is that the utility function describing an entity's preferences doesn't need to be constant. Perhaps today I choose to buy Pepsi over Coke because it's cheaper, but next week I see a good ad for Coke and decide to pay the extra money for the good associations it brings. I don't think the theorem says anything about that; it seems like the utility function just describes my current preferences, and says nothing about how my preferences change over time.

I agree.

Except for one detail: Humans who hold preferences that don't comply with the axioms cannot necessarily be "Dutch-booked" for real. That would require them not only to hold certain preferences but also to always act on those preferences like an automaton; see this nice summary discussion: https://plato.stanford.edu/entries/dutch-book/

"Humans do have a utility function"? I would say that depends on what one means by "have".

Does it mean that the value of a human's life can in principle be measured, only that the measure might not be known to the human? Then I would not be convinced – what would the evidence for this claim be?

Or does it mean that humans are imperfect maximizers of some imperfectly encoded state-action-valuation function that is somehow internally stored in their brains and might have been inherited and/or learned? Then I would also not be convinced, as long as one cannot point to evidence that such an evaluation function is actually encoded somewhere in the brain.

Or does it simply mean that the observable behavior of a human can be interpreted as (imperfectly) maximizing some utility function? This would be the classical "as if" argument that economists use to defend their modeling of humans as rational agents despite all the evidence from psychology.

Hey Yonatan,

First, excuse me for originally misspelling your name; I have fixed it now.

Thank you for your encouragement with funding. As it happens, we did apply for funding from several sources and are waiting for their response.

Regarding Rob Miles' videos on satisficing:

One potential misunderstanding relates to the probability with which the agent is required to reach a certain goal. If I understand him correctly, he assumes satisficing must imply maximizing the probability that some constraint is met, which would still constitute a form of optimization (namely of that probability). This is why our approach is different: In a Markov Decision Process, the client would for example specify a feasibility interval for the expected value of the return (= long-term discounted sum of rewards according to some reward function that we explicitly do not assume to be a proper measure of utility), and the learning algorithm would seek a policy that makes the expected return fall anywhere into this interval.
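For illustration only (this is a toy sketch, not necessarily the learning algorithm we are developing): given estimates of the expected return of several candidate policies, a satisficing learner in this sense would accept any policy whose estimated return falls into the client-specified interval, with no further maximization among the acceptable ones. All policy names and numbers below are made up.

```python
import random

def pick_satisficing_policy(estimated_returns, lo, hi):
    """Return any policy whose estimated expected return lies in [lo, hi];
    deliberately no tie-breaking by maximization among the feasible ones."""
    feasible = [name for name, ret in estimated_returns.items() if lo <= ret <= hi]
    return random.choice(feasible) if feasible else None  # None: keep learning

# Hypothetical candidate policies and their estimated expected returns.
estimated_returns = {"pi_cautious": 5.2, "pi_moderate": 6.8, "pi_aggressive": 9.9}
print(pick_satisficing_policy(estimated_returns, lo=5.0, hi=7.0))  # one of the first two
```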

The question of whether an agent somehow necessarily must optimize something is a little philosophical in my view. Of course, given an agent's behavior, one can always find some function that is maximal for the given behavior. This is a mathematical triviality. But this is not the problem we need to address here. The problem we need to address is that the behavior of the agent might get chosen by the agent or its learning algorithm by maximizing some objective function.

It is all about a paradigm shift: In my view, AI systems should be made to achieve reasonable goals that are well-specified w.r.t. one or more proxy metrics, not to maximize whatever metric. What would be the reasonable goal for your modified paperclip maximizer?

Regarding "weakness":

Non-maximizing does not imply weak, let alone "very weak". I'm not suggesting building a very weak system at all. In fact, maximizing an imperfect proxy metric will tend to score low on the real utility. Or, to turn this around: the maximum of the actual utility function is typically achieved by a policy that does not maximize the proxy metric. We will study this in example environments and report results later this year.

long-term discounted sum of rewards according to some reward function that we explicitly do not assume to be a proper measure of utility

Isn't this equivalent to building an agent (agent-2) that DID have that as their utility function?

 

Ah, you wrote:

The problem we need to address is that the behavior of the agent might get chosen by the agent or its learning algorithm by maximizing some objective function.

I don't understand this and it seems core to what you're saying. Could you maybe say it in other words?

When I said "actual utility" I meant that which we cannot properly formalize (human welfare and other values) and hence not teach (or otherwise "give" to) the agent, so no, the agent does not "have" (or otherwise know) this as their utility function in any relevant way.

In my use of the term "maximization", it refers to an act, process, or activity (as indicated by the ending "-ation") that actively seeks to find the maximum of some given function. First there is the function to be maximized, then comes the maximization, and finally one knows the maximum and where the maximum is (argmax).

On the other hand, one might object the following: if we are given a deterministic program P that takes input x and returns output y = P(x), we can of course always construct a mathematical function f that takes a pair (x,y) and returns some number r = f(x,y) so that it turns out that for each possible x we have P(x) = argmax_y f(x,y). A trivial choice for such a function is f(x,y) = 1 if y = P(x) and f(x,y) = 0 otherwise. Notice, however, that here the program P is given first, and then we construct a specific function f for this equivalence to hold.

In other words, any deterministic program P is functionally equivalent to another program P' that takes some input x, maximizes some function f(x,y), and returns the location y of that maximum. But being functionally equivalent to a maximizer is not the same as being a maximizer. 

In the learning agent context: If I give you a learned policy pi that takes a state s and returns an action a = pi(s) (or a distribution of actions), then you might well be able to construct a reward function g that takes a state-action pair (s,a) and returns a reward (or expected reward) r = g(s,a) so that, when I then calculate the corresponding optimal state-action-quality function Q* for this reward function, it turns out that for all states s we have pi(s) = argmax_a Q*(s,a). This means that the policy pi is the same policy as the one that a learning process would have produced that searches for the policy maximizing the long-term discounted sum of rewards according to reward function g. But it does not mean that the policy pi was actually determined by such a possible optimization procedure: the learning process that produced pi can very well be of a completely different kind than an optimization procedure.
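To illustrate the point numerically (a toy construction with made-up transition probabilities, not an argument about any particular learning process): take an arbitrary deterministic policy pi on a small random MDP, define the reward g(s,a) = 1 exactly when a = pi(s), and run ordinary value iteration; the greedy policy of the resulting Q* reproduces pi, even though pi was never produced by any optimization.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9

# Arbitrary MDP transition kernel and an arbitrary deterministic policy pi,
# neither of which was produced by any optimization.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = distribution over next states
pi = rng.integers(n_actions, size=n_states)

# Construct the reward function "after the fact": 1 iff the action agrees with pi.
g = np.zeros((n_states, n_actions))
g[np.arange(n_states), pi] = 1.0

# Ordinary value iteration for the optimal Q* of reward g.
Q = np.zeros((n_states, n_actions))
for _ in range(1000):
    Q = g + gamma * P @ Q.max(axis=1)

# The greedy policy of Q* coincides with pi.
print(np.array_equal(Q.argmax(axis=1), pi))  # True
```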

Hey! Can you elaborate a bit more on what you mean by "never optimise" here? It seems like the definition you have is broad enough to render an AI useless:

When saying "optimize" I mean it in the strict mathematical sense: aiming to find an exact or approximate, local or global maximum or minimum of some function When I mean mere improvements w.r.t. some metric, I just say "improve" rather than "optimize".

It seems like this definition would apply to anything that uses math to make decisions. If I ask the AI to find me the cheapest flight it can from London to New York tomorrow, will it refuse to answer?

Also, I don't understand the distinction with "improvement" here. If I try to "improve" the estimate of the cheapest flight, isn't that the same thing as trying to "optimise" to find the approximate local minimum of cost?

This is difficult to say. I have a relatively clear intuition of what I mean by optimization and by optimizing behavior. In your example, merely asking for the cheapest flight might be safe as long as you don't then automatically book that flight without spending a moment to think about whether taking that single-propeller plane without any safety belts that you have to pilot yourself is actually a good idea just because it turned out to be the cheapest. I mostly care about agents that have more agency than just printing text to your screen.

I believe what some people call "AI heaven" can be reached with AI agents that don't book the cheapest flight but instead book you a flight that costs no more than you specify, takes no longer than you specify, and has at least the safety equipment and other facilities that you specify. In other words: satisficing! Another example: Not "find me a job that earns me as much income as possible", but "find me a job that earns me at least enough income to satisfy all my basic needs and lets me have as much fun from leisure activities as I can squeeze into my lifetime". And so on...
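For concreteness, here is the flight example as a minimal sketch (the flight data are of course made up): the agent returns any flight meeting all of the client's constraints, rather than the cheapest one.

```python
import random

def book_satisficing_flight(flights, max_price, max_hours, required_features):
    """Return any flight that meets all of the client's constraints,
    chosen at random -- deliberately not the cheapest or fastest one."""
    acceptable = [f for f in flights
                  if f["price"] <= max_price
                  and f["hours"] <= max_hours
                  and required_features <= f["features"]]
    return random.choice(acceptable) if acceptable else None

flights = [
    {"id": "A", "price": 450, "hours": 9,  "features": {"seatbelts"}},
    {"id": "B", "price": 380, "hours": 14, "features": {"seatbelts"}},
    {"id": "C", "price": 300, "hours": 8,  "features": set()},
]
print(book_satisficing_flight(flights, max_price=500, max_hours=12,
                              required_features={"seatbelts"}))  # only flight A qualifies here
```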

Regarding "improvement": Replacing a state s by a state s' that scores higher on some metric r, so that r(s') > r(s), is an "improvement w.r.t. r", not an optimization for r. An optimization would require replacing s by that s' for which there is no other s'' with r(s'') > r(s'), or some approximate version of this.

One might think that a sequence of improvements must necessarily constitute an optimization, so that my distinction is unimportant. But this is not correct: While any sequence of improvements r(s1) < r(s2) < ... must make r(sn) converge to some value (at least if r is bounded), this limit value will in general be considerably lower than the maximal value r* = max r(s), unless the procedure that selects the improvements is especially designed to find that maximum, in other words, is an optimization algorithm. Note that optimization is a hard problem in most real-world cases, much harder than just finding some sequence of improvements.

With regard to your improvements definition, isn't "continuously improving until you reach a limit which is not necessarily the global limit" just a different way of describing local optimization? It sounds like you're just describing a hill climber.

I do agree with building a satisficer, as this describes more accurately what the user actually wants! I want a cheap flight, but I wouldn't be willing to wait 3 days for the program to find the cheapest possible flight that saved me 5 bucks. But on the other hand, if I told it to find me flights under 500 bucks, and it served me up a flight for 499 bucks even though there was another equally good option for 400 bucks, I'd be pretty annoyed. 

It seems like some amount of local optimisation is necessary for an AI to be useful.

That depends what you mean by "continuously improving until you reach a limit which is not necessarily the global limit".

I guess by "continuously" you probably do not mean "in continuous time" but rather "repeatedly, in discrete time steps"? So you imagine a sequence r(s1) < r(s2) < ... ? Well, that could converge to anything larger than each of the r(sn). E.g., if r(sn) = 1 - 1/n, it will converge to 1. (It will of course never "reach" 1 since it will always stay below 1.) This is completely independent of what the local or global maxima of r are. They can obviously be way larger. For example, if the function is r(s) = s and the sequence is sn = 1 - 1/n, then r(sn) converges to 1 but the maximum of r is infinity. So, as I said before, unless your sequence of improvements is part of an attempt to find a maximum (that is, part of an optimization process), there is no reason to expect that it will converge to some maximum.

Btw., this also shows that if you have two competing satisficers whose only goal is to outperform the other, and who therefore repeatedly improve their reward to be larger than the other agent's current reward, this does not imply that their rewards will converge to some maximum reward. They can easily be programmed to avoid this by just outperforming the other by an amount of 2**(-n) in the n-th step, so that their rewards converge to the initial reward plus one, rather than to whatever maximum reward might be possible.
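A few lines of arithmetic confirm this (the alternating-turns setup is one possible reading of the scenario):

```python
# Two satisficers take turns; in step n the active one only aims to exceed
# the other's current reward by 2**(-n). Both rewards approach the initial
# reward plus one -- not whatever maximum reward might be attainable.
rewards = [0.0, 0.0]
for n in range(1, 60):
    actor = (n - 1) % 2
    rewards[actor] = rewards[1 - actor] + 2.0 ** (-n)
print(rewards)   # both close to 1.0 = initial reward 0 plus 1
```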

Ah, well explained, thank you. Yes, I agree now that you can theoretically improve toward a limit without that limit being a local maximum. Although I'm unsure whether the procedure could end up being equivalent in practice to a local maximisation with a modified goal function (say, one that penalises going above "reward + 1" with exponential cost). Maybe something to think about going forward.

Thanks for answering the questions, best of luck with the endeavour!

If your goal is to prevent an agent from being incentivized to pursue narrow objectives in an unbounded fashion (e.g. "paperclip maximizer"), you can do this within the existing paradigm of reward functions by ensuring that the set of rewards simultaneously includes:

1) Contradictory goals, and
2) Diminishing returns

Either one of these on their own is insufficient.  With contradictory goals alone, the agent can maximize reward by calculating which of its competing goals is more valuable and disregarding everything else.  With diminishing returns alone, the agent can always get a little more reward by pursuing the goal further.  But when both are in place, diminishing returns provides automatic, self-adjusting calibration to bring contradictory goals into some point of equilibrium.  The end result looks like satisficing, but dodges all of the philosophical questions as to whether "satisficing" is a stable (or even meaningful) concept as discussed in the other comments.
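One possible toy reading of this (a sketch, assuming two goals competing for a fixed unit of effort, each with logarithmic returns): the reward-maximizing split is interior, i.e. the agent settles into an equilibrium rather than going all-in on either goal, whereas with linear (non-diminishing) returns the optimum would sit at an extreme.

```python
import numpy as np

def reward(x):
    """Effort x in [0, 1] goes to goal 1, the rest to the contradictory goal 2;
    both have diminishing (log) returns."""
    return np.log1p(x) + np.log1p(1.0 - x)

x = np.linspace(0.0, 1.0, 1001)
r = reward(x)
print("best split:", x[r.argmax()])       # 0.5 -- an interior equilibrium
print("all-in on goal 1:", reward(1.0))   # strictly worse than the equilibrium
```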

Obviously there are deep challenges with the above, namely:

(1) Both properties must be present across all dimensions of the agent's utility function.  Further, there must not be any hidden "win-win" solutions that bring competing goals into alignment so as to eliminate the need for equilibrium.

(2) The point of equilibrium must be human-compatible.

(3) 1 & 2 must remain true as the agent moves further from its training environment, as well as if it changes, such as by self-improvement.

(4) Calibrating equilibria requires the ability to reliably instill goals into an AI in the first place, currently lacking since ML only provides the indirect lever of reinforcement.

But most of these reflect general challenges within any approach to alignment.

Dear Will,

thanks for these thoughtful comments. I'm not sure I understand some aspects of what you say correctly, but let me try to make sense of this using the example of Zhuang et al., http://arxiv.org/abs/2102.03896. If the utility function is defined only in terms of a proper subset of the attributes, the optimizing agent will exploit the seemingly irrelevant remaining attributes in the optimization, whether or not some of the attributes it uses represent conflicting goals. Even when conflicting goals are "present across all dimensions of the agent's utility function", that utility function might simply ignore relevant side effects, e.g. because the designers and teachers have not anticipated them at all.

Their example in Fig. 2 shows this nicely. In contrast, with a satisficing goal of achieving only, say, 6 in Fig. 2, the agent will not exploit the unrepresented features as much and actual utility will be much larger.
