I'm a mathematician working on collective decision making, game theory, formal ethics, international coalition formation, and a lot of stuff related to climate change. Here's my professional profile.
My definition of value:
I need help with various aspects of my main project, which is to develop an open-source collective decision app (http://www.vodle.it):
I can help by ...
As long as state-of-the-art alignment attempts by industry involve eliciting human evaluations of actual or hypothetical AI behaviors (e.g. responses a chatbot might give to a prompt, as in RLHF), it seems important to understand the psychological aspects of such human-AI interactions. I plan to do some experiments on what I call collective RLHF myself, more from a social choice perspective (see http://amsterdam.vodle.it), and can imagine collaborating on similar questions.
Dear Will,
thanks for these thoughtful comments. I'm not sure I understand some aspects of what you say correctly, but let me try to make sense of this using the example of Zhuang et al. (http://arxiv.org/abs/2102.03896). If the utility function is defined only in terms of a proper subset of the attributes, the agent will exploit the seemingly irrelevant remaining attributes in the optimization, whether or not some of the attributes it uses represent conflicting goals. Even when conflicting goals are "present across all dimensions of the agent's utility function", that utility function might simply ignore relevant side-effects, e.g. because the designers and teachers have not anticipated them at all.
Their example in Fig. 2 shows this nicely. In contrast, with a satisficing goal of achieving only, say, 6 in Fig. 2, the agent will not exploit the unrepresented features as much and actual utility will be much larger.
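To make this concrete, here is a toy sketch in Python (not Zhuang et al.'s actual environment; the two utility functions, the budget, and the aspiration level are made up purely for illustration):

```python
# Toy illustration: true utility depends on two attributes, but the proxy
# utility the agent optimizes only sees the first one.
import numpy as np

def true_utility(a, b):
    # hypothetical "actual" utility: both attributes matter, with diminishing returns
    return np.log(1 + a) + np.log(1 + b)

def proxy_utility(a, b):
    # the designers only encoded attribute a
    return a

BUDGET = 10.0  # total resources to split between the two attributes

# Proxy maximizer: puts the whole budget into the represented attribute.
a_max, b_max = BUDGET, 0.0

# Proxy satisficer: stops once the proxy reaches an aspiration level of 6,
# leaving the rest of the budget for the unrepresented attribute.
ASPIRATION = 6.0
a_sat = min(ASPIRATION, BUDGET)
b_sat = BUDGET - a_sat

print("maximizer :", proxy_utility(a_max, b_max), true_utility(a_max, b_max))
print("satisficer:", proxy_utility(a_sat, b_sat), true_utility(a_sat, b_sat))
# The satisficer scores lower on the proxy but higher on the actual utility.
```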
That depends on what you mean by "continuously improving until you reach a limit which is not necessarily the global limit".
I guess by "continuously" you probably do not mean "in continuous time" but rather "repeatedly, in discrete time steps"? So you imagine a sequence r(s1) < r(s2) < ... ? Well, that could converge to anything larger than each of the r(sn). E.g., if r(sn) = 1 - 1/n, it will converge to 1. (It will of course never "reach" 1 since it will always be below 1.) This is completely independent of what the local or global maxima of r are. They can obviously be way larger. For example, if the function is r(s) = s and the sequence is sn = 1 - 1/n, then r(sn) converges to 1, but the supremum of r is infinity (there is no maximum at all). So, as I said before, unless your sequence of improvements is part of an attempt to find a maximum (that is, part of an optimization process), there is no reason to expect that it will converge to some maximum.
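For concreteness, here is a tiny numeric check of exactly that example (nothing beyond the sequence above):

```python
# The improvement sequence r(s_n) = 1 - 1/n increases forever but converges
# to 1, while the function r(s) = s being "improved on" is unbounded above.
r = lambda s: s                                 # the metric being improved
s_n = [1 - 1/n for n in range(1, 10001)]        # a sequence of improving states
values = [r(s) for s in s_n]

assert all(values[i] < values[i + 1] for i in range(len(values) - 1))  # strictly improving
print(values[-1])   # ~0.9999, approaching 1, nowhere near sup r = infinity
```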
Btw., this also shows that if you have two competing satisficers whose only goal is to outperform the other, and who therefore repeatedly improve their reward to be larger than the other agent's current reward, this does not imply that their rewards will converge to some maximum reward. They can easily be programmed to avoid this, e.g. by outperforming the other by only an amount of 2**(-n) in the n-th step, so that their rewards converge to the initial reward plus one, rather than to whatever maximum reward might be possible.
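A quick sketch of that construction (the initial rewards and the simple turn-taking scheme are arbitrary choices for illustration):

```python
# Two agents that each "outperform" the other's current reward by 2**(-n)
# in step n: both reward sequences increase forever, yet they converge to
# the initial reward plus one, not to any maximum reward.
rewards = {"A": 0.0, "B": 0.0}                 # both start from the same initial reward
for n in range(1, 60):
    mover = "A" if n % 2 == 1 else "B"         # the agents take turns
    other = "B" if mover == "A" else "A"
    rewards[mover] = rewards[other] + 2**(-n)  # outperform the other by 2**(-n)
print(rewards)   # both ~1.0 = initial reward + 1
```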
I agree.
Except for one detail: Humans who hold preferences that don't comply with the axioms cannot necessarily be Dutch-booked for real. That would require them not only to hold certain preferences but also to always act on those preferences like an automaton; see this nice summary discussion: https://plato.stanford.edu/entries/dutch-book/
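For readers unfamiliar with the argument, here is a toy sketch of the kind of money pump these theorems have in mind (the goods, the fee, and the cyclic preferences are made up; the point is precisely that the agent must trade like an automaton for the pump to work):

```python
# Toy money pump: an agent with cyclic preferences A > B > C > A who
# mechanically pays a small fee for every "upgrade" loses money without end.
prefers = {("A", "B"), ("B", "C"), ("C", "A")}   # (x, y) means x is strictly preferred to y
fee = 1.0

holding, money = "C", 100.0
for step in range(9):                            # three full cycles
    for candidate in ("A", "B", "C"):
        if (candidate, holding) in prefers:      # candidate is preferred to current holding
            holding, money = candidate, money - fee   # automaton-like: always trade up
            break
print(holding, money)   # back where it started in goods, but 9.0 poorer
```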
"Humans do have a utility function"? I would say that depends on what one means by "have".
Does it mean that the value of a human's life can in principle be measured, only that this measure might not be known to the human? Then I would not be convinced – what would the evidence for this claim be?
Or does it mean that humans are imperfect maximizers of some imperfectly encoded state-action-valuation function that is somehow internally stored in their brains and might have been inherited and/or learned? Then I would also not be convinced, as long as one cannot point to evidence that such a valuation function is actually encoded somewhere in the brain.
Or does it simply mean that the observable behavior of a human can be interpreted as (imperfectly) maximizing some utility function? This would be the classical "as if" argument that economists use to defend modeling humans as rational agents despite all the evidence from psychology.
This is difficult to say. I have a relatively clear intuition about what I mean by optimization and what I mean by optimizing behavior. In your example, merely asking for the cheapest flight might be safe, as long as you don't then automatically book that flight without pausing to consider whether taking that single-propeller machine without any safety belts, which you have to pilot yourself, is actually a good idea just because it turned out to be the cheapest. I mostly care about agents that have more agency than just printing text to your screen.
I believe what some people call "AI heaven" can be reached with AI agents that don't book the cheapest flights but that book you a flight that costs no more than you specify, takes no longer than you specify, and has at least the safety equipment and other facilities that you specify. In other words: satisficing! Another example: not "find me a job that earns me as much income as possible", but "find me a job that earns me enough income to satisfy all my basic needs and lets me have as much fun from leisure activities as I can squeeze into my lifetime". And so on...
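A minimal sketch of the flight example (the flight data, the thresholds, and the helper function are hypothetical, just to illustrate the satisficing pattern):

```python
# Satisficing flight search: return any flight meeting the user's explicit
# requirements instead of blindly optimizing the single metric "price".
from dataclasses import dataclass

@dataclass
class Flight:
    price: float          # in EUR
    duration_h: float     # in hours
    has_seatbelts: bool

flights = [
    Flight(49.0, 9.5, False),    # cheapest, but fails the requirements
    Flight(180.0, 3.0, True),
    Flight(150.0, 4.5, True),
]

def satisficing_choice(flights, max_price, max_duration_h):
    """Return the first flight that meets all constraints, or None."""
    for f in flights:
        if f.price <= max_price and f.duration_h <= max_duration_h and f.has_seatbelts:
            return f
    return None   # no acceptable flight: report back rather than "optimize" around it

print(satisficing_choice(flights, max_price=200.0, max_duration_h=5.0))
```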
Regarding "improvement": Replacing a state s by a state s' that scores higher on some metric r, so that r(s') > r(s), is an "improvement w.r.t. r", not an optimization for r. An optimization would require replacing s by that s' for which there is no other s'' with r(s'') > r(s'), or some approximate version of this.
One might think that a sequence of improvements must necessarily constitute an optimization, so that my distinction is unimportant. But this is not correct: While any sequence of improvements r(s1) < r(s2) < ... must make r(sn) converge to some value r° (at least if r is bounded), this limit value r° will in general be considerably lower than the maximal value r* = max r(s), unless the procedure that selects the improvements is especially designed to find that maximum, in other words, unless it is an optimization algorithm. Note that optimization is a hard problem in most real-world cases, much harder than just finding some sequence of improvements.
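Here is a minimal illustration of the distinction (the toy metric, the shrinking-step improvement rule, and the grid search are assumptions chosen only to make the point visible):

```python
# An "improvement loop" that only ever replaces s by some s' with
# r(s') > r(s), versus an explicit search for the maximizer.
import numpy as np

r = lambda s: -(s - 10.0)**2           # toy metric, maximal at s* = 10

# Improvement loop: every accepted step is an improvement, but the steps
# shrink, so the sequence converges far below the maximum r(10) = 0.
s = 0.0
for n in range(1, 200):
    step = 2.0**(-n)                   # shrinking steps: improvements, not optimization
    if r(s + step) > r(s):
        s = s + step
print("improvement loop:", s, r(s))    # s ~ 1.0, r ~ -81

# Optimization: deliberately search for the maximum over a grid of candidates.
grid = np.linspace(-20, 20, 4001)
s_star = grid[np.argmax(r(grid))]
print("optimization:   ", s_star, r(s_star))   # s* ~ 10, r ~ 0
```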
When I said "actual utility" I meant that which we cannot properly formalize (human welfare and other values) and hence cannot teach (or otherwise "give") to the agent, so no, the agent does not "have" (or otherwise know) this as its utility function in any relevant way.
In my use of the term "maximization", it refers to an act, process, or activity (as indicated by the ending "-ation") that actively seeks to find the maximum of some given function. First there is the function to be maximized, then comes the maximization, and finally one knows the maximum and where the maximum is (argmax).
On the other hand, one might raise the following objection: if we are given a deterministic program P that takes input x and returns output y = P(x), we can of course always construct a mathematical function f that takes a pair (x,y) and returns some number r = f(x,y), such that it turns out that for each possible input x we have P(x) = argmax_y f(x,y). A trivial choice for such a function is f(x,y) = 1 if y = P(x) and f(x,y) = 0 otherwise. Notice, however, that here the program P is given first, and then we construct a specific function f for this equivalence to hold.
In other words, any deterministic program P is functionally equivalent to another program P' that takes some input x, maximizes some function f(x,y), and returns the location y of that maximum. But being functionally equivalent to a maximizer is not the same as being a maximizer.
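The trivial construction, spelled out for a small finite output space (the program P here is arbitrary, chosen only for illustration):

```python
# Given a deterministic program P, define f(x, y) = 1 if y == P(x) else 0;
# then argmax over y recovers P(x), even though P itself maximizes nothing.
def P(x):
    # some arbitrary deterministic program (an assumption for illustration)
    return (3 * x + 1) % 7

def f(x, y):
    return 1.0 if y == P(x) else 0.0

Y = range(7)   # finite set of possible outputs
for x in range(5):
    y_star = max(Y, key=lambda y: f(x, y))   # argmax_y f(x, y)
    assert y_star == P(x)
print("P is functionally equivalent to 'maximize f', by construction")
```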
In the learning agent context: If I give you a learned policy pi that takes a state s and returns an action a = pi(s) (or a distribution of actions), then you might well be able to construct a reward function g that takes a state-action pair (s,a) and returns a reward (or expected reward) r = g(s,a), such that when I then calculate the corresponding optimal state-action-quality function Q* of this reward function, it turns out that for all states s, we have pi(s) = argmax_a Q*(s,a). This means that the policy pi is the same policy that would have been produced by a learning process searching for the policy that maximizes the long-term discounted sum of rewards according to reward function g. But it does not mean that the policy pi was actually determined by such a possible optimization procedure: the learning process that produced pi can very well be of a completely different kind than an optimization procedure.
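A tiny tabular check of this (the MDP, the policy pi, and the discount factor are random/arbitrary choices made up for illustration; value iteration stands in for "calculating Q*"):

```python
# Take an arbitrary policy pi on a toy MDP, define g(s, a) = 1 if a == pi(s)
# else 0, compute Q* by value iteration, and observe that argmax_a Q*(s, a)
# reproduces pi -- even though pi itself was simply written down, not produced
# by any optimization.
import numpy as np

n_states, n_actions, gamma = 4, 3, 0.9
rng = np.random.default_rng(0)

pi = rng.integers(n_actions, size=n_states)              # an arbitrary "given" policy
# random (but fixed) transition probabilities T[s, a] -> distribution over next states
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

g = np.zeros((n_states, n_actions))                      # the constructed reward function
g[np.arange(n_states), pi] = 1.0

Q = np.zeros((n_states, n_actions))
for _ in range(1000):                                    # value iteration on Q
    V = Q.max(axis=1)
    Q = g + gamma * T @ V

print("pi:              ", pi)
print("argmax_a Q*(s,a):", Q.argmax(axis=1))             # identical to pi
```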
Hey Yonatan,
first, excuse me for originally spelling your name incorrectly; I have fixed it now.
Thank you for your encouragement with funding. As it happens, we did apply for funding from several sources and are waiting for their response.
Regarding Rob Miles' videos on satisficing:
One potential misunderstanding relates to the probability with which the agent is required to reach a certain goal. If I understand him correctly, he assumes satisficing needs to imply maximizing the probability that some constraint is met, which would still constitute a form of optimization (namely of the probability). This is why our approach is different: In a Markov Decision Process, the client would for example specify a feasibility interval for the expected value of the return (= the long-term discounted sum of rewards according to some reward function that we explicitly do not assume to be a proper measure of utility), and the learning algorithm would seek a policy that makes the expected return fall anywhere within this interval.
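A sketch of that selection rule with made-up numbers (the candidate policies and their estimated expected returns are hypothetical):

```python
# The client specifies a feasibility interval for the expected return, and the
# learner looks for *any* policy whose expected return falls inside it, rather
# than the policy with the largest expected return.
candidate_policies = {        # hypothetical policies with estimated expected returns
    "cautious": 4.2,
    "moderate": 6.1,
    "aggressive": 9.8,        # highest return, possibly via unwanted side effects
}

def satisficing_select(estimates, lo, hi):
    """Return some policy whose estimated expected return lies in [lo, hi]."""
    feasible = [name for name, ret in estimates.items() if lo <= ret <= hi]
    return feasible[0] if feasible else None   # could also pick at random among the feasible ones

print(satisficing_select(candidate_policies, lo=5.0, hi=7.0))   # -> "moderate"
```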
The question of whether an agent somehow necessarily must optimize something is a little philosophical in my view. Of course, given an agent's behavior, one can always find some function that is maximized by that behavior. This is a mathematical triviality. But this is not the problem we need to address here. The problem we need to address is that the behavior of the agent might get chosen, by the agent or its learning algorithm, by maximizing some objective function.
It is all about a paradigm shift: In my view, AI systems should be made to achieve reasonable goals that are well-specified w.r.t. one or more proxy metrics, not to maximize whatever metric happens to be available. What would be the reasonable goal for your modified paperclip maximizer?
Regarding "weakness":
Non-maximizing does not imply weak, let alone "very weak". I'm not suggesting building a very weak system at all. In fact, maximizing an imperfect proxy metric will tend to give a low score on the real utility. Or, to turn this around: the maximum of the actual utility function is most likely achieved by a policy that does not maximize the proxy metric. We will study this in example environments and report results later this year.
Joe thinks, in contrast with the dominant theory of correct decision-making, that it’s clear you should send a million dollars to your twin.
I'm deeply confused about this. According to the premise, you are a deterministic AI system. That means what you will do is fully determined by your code and your input, both of which are already given. So at this point, there is no longer any freedom to make a choice – you will just do what your given code and input determine. So what does it mean to ask what you should do? Does that actually mean: (i) what code should your programmer have written? Or does it mean: (ii) what would the right choice be in the counterfactual situation in which you are not deterministic after all and do have a choice (while your twin doesn't? or does as well?).

In order to answer version (i), we need to know the preferences of the programmer (rather than your own preferences). If the programmer is interested in the joint payoff of both twins, she should have written code that makes you cooperate. In order to answer version (ii), we would need to know what consequences making either choice, in the counterfactual world where you do have a choice, has on the other twin's possibility of making a choice. If your choice does not influence the other twin's possibility of making a choice, the dominant strategy is defection, as in the simple PD. Otherwise, who knows...
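For the causally independent case, the dominance claim amounts to just this (the payoff numbers are assumptions in the spirit of the million-dollar story, not taken from the original post):

```python
# With standard PD-style payoffs, defecting yields more than cooperating
# against *each* fixed choice of the twin -- which is all that "dominant
# strategy" means when your choice cannot influence the twin's choice.
payoff = {                       # my payoff given (my move, twin's move)
    ("C", "C"): 2_000_000,
    ("C", "D"): 0,
    ("D", "C"): 3_000_000,
    ("D", "D"): 1_000_000,
}

for twin_move in ("C", "D"):
    assert payoff[("D", twin_move)] > payoff[("C", twin_move)]
print("If the twin's move is fixed independently of mine, D dominates C.")
```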
This becomes particularly important for human feedback/input on "higher-level" or more "abstract" questions, as in OpenAI's deliberative mini-public / citizen assembly idea (https://openai.com/blog/democratic-inputs-to-ai).