This term I took a reinforcement learning course at my university, hoping to learn something useful for the research directions I'm considering entering (one of which is AI safety; the others are speculative, so I'm not listing them).
I was about to start coding my first toy model when I recalled something I had read previously: Brian Tomasik's Which Computations Do I Care About? and Ethical Issues in Artificial Reinforcement Learning. So I re-read the two essays, and despite disagreeing with many of the points Brian made, I did become convinced that RL agents (and some other algorithms too), in expectation, deserve a tiny yet non-zero moral weight, and that this weight can accumulate over the many episodes of the training process to become significant.
This problem seems very counter-intuitive to me, but as a rational person I have to admit that it is a legitimate implication of the expected value framework, so I recognised the problem and started thinking about solutions.
The solution turns out to be obvious, but even more counter-intuitive. I only need to add an insanely large number (say, ) to every reward value that the agent receives; then, assuming that the agent can feel happiness, there should be a small yet non-negligible probability that its happiness increases linearly with the number added.
- One could object that utility should be scale-invariant and depend only on the temporal difference of expectations (i.e. how much the expectations of future reward have risen or fallen), as suggested by some relevant studies. My response is that 1. this problem is far from settled, and I'm only arguing for a non-negligible probability of linear correlation, and 2. I don't think the results of psychological studies imply scale-invariance of utility for all rewards (instead, they only imply scale-invariance of utility for monetary returns) - think about it: how on earth can extreme pain be neutralized simply by adjusting one's expectations?
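To make the objection concrete, here is a minimal sketch of why a temporal-difference signal ignores a constant added to every reward. It assumes a continuing task with discount factor `gamma`, where adding a constant `shift` to every reward raises every state's value by `shift / (1 - gamma)`. The numbers and names are hypothetical, and the sketch says nothing about whether happiness actually tracks this signal.

```python
def td_error(reward, value_s, value_next, gamma):
    """One-step temporal-difference error: r + gamma * V(s') - V(s)."""
    return reward + gamma * value_next - value_s


gamma = 0.5      # discount factor (assumed for illustration)
shift = 1024.0   # constant added to every reward (assumed for illustration)

# Arbitrary example numbers for a single transition s -> s'.
reward, value_s, value_next = 1.0, 3.0, 2.0

# In a continuing discounted task, shifting every reward by `shift`
# raises the value of every state by shift / (1 - gamma).
value_offset = shift / (1.0 - gamma)

delta_original = td_error(reward, value_s, value_next, gamma)
delta_shifted = td_error(
    reward + shift,
    value_s + value_offset,
    value_next + value_offset,
    gamma,
)

print(delta_original, delta_shifted)
```

Both errors come out the same, so under this assumption the shift is invisible to a signal built from changes in expectation.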
And once I accept this conclusion, the most counter-intuitive conclusion of all follows. As the computing power devoted to training these utility-improved agents increases, the utility produced grows exponentially (since more computing power means more digits with which to store the rewards). By contrast, the impact of all other attempts to improve the world (e.g. by improving our knowledge of artificial sentience so we can more efficiently promote their welfare) grows at only a polynomial rate with the amount of resources devoted to them. Therefore, running these training processes is the single most impactful thing that any rational altruist should do.
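The exponential-growth step can be checked with a small sketch. It reads "more computing power" as "more bits available to store each reward", which is an assumption on my part, and the helper name `max_unsigned` is hypothetical.

```python
def max_unsigned(bits):
    """Largest value an unsigned integer with the given number of bits can hold."""
    return (1 << bits) - 1


# Each extra bit roughly doubles the largest storable reward, so the
# maximum reward grows exponentially in the number of bits.
for bits in (8, 16, 32, 64):
    print(bits, max_unsigned(bits))
```

Going from 8 to 16 bits, for example, takes the ceiling from 255 to 65535, whereas doubling the resources of a typical intervention would at best be expected to roughly double its impact.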
Apparently, we're in a situation of Pascal's Mugging.
Quite a few hypothetical scenarios of Pascal's Mugging have already been proposed, but this one strikes me the most. It seems to me to be the first such scenario that has practical implications in real life, and one that I cannot dismiss using simple arguments like "the opposite outcome is equally likely to happen, which makes the net expected impact zero".
- One thing to note: GiveWell's article Why we can’t take expected value estimates literally uses a Bayesian prior as the remedy for Pascal's Mugging, but here, when estimating the probability of linear correlation (between utility and the number added to the reward), we have already taken our prior into account, so that reasoning does not work.
Is there anything that we can say about this situation, or about the EV framework in general?
Similar to what Carl said, my main response to questions like those you raise is that we'll have to defer a lot of this thinking to future generations. One can generate an almost indefinite stream of plausible Pascalian wagers like this. On this particular issue, the intervention of "improving our knowledge of artificial sentience so we can more efficiently promote their welfare" actually seems like it would help, because then more people could apply their minds to questions like those you raise.
In addition to just trying to store larger and larger binary integers in a computer, you could try to develop other representations that would express large numbers more compactly. One obvious way to do that would be to use a floating-point number instead of an integer, since then you can have a large exponent. Maybe instead of the exponent representing a power of 10, it could signify a power of 1000000, or a power of 3^^^3. In Python, you can represent infinity as float("inf"), and that could be the reward.
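As a rough sketch of the representations mentioned above, using only Python's standard library: a 64-bit float reaches far beyond a 64-bit integer, and `float("inf")` exceeds every finite float. The variable names are hypothetical. Note that the difference of two infinite rewards is NaN, so any learning rule built on reward differences would presumably break.

```python
import math
import sys

largest_uint64 = (1 << 64) - 1        # about 1.8e19
largest_finite = sys.float_info.max   # about 1.8e308 for IEEE 754 doubles
infinite_reward = float("inf")

# The same 64 bits reach much further as a float than as an integer.
print(largest_finite > largest_uint64)

# Infinity compares greater than every finite float.
print(infinite_reward > largest_finite)

# Adding to infinity changes nothing, and inf - inf is not a number.
print(infinite_reward + 1 == infinite_reward)
print(math.isnan(infinite_reward - infinite_reward))
```

All four lines print `True`.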
My own view is that the absolute scale of numbers doesn't matter if it doesn't affect the functional behavior of the agent. Of course, as you say, there's some chance that utility does increase with the absolute scale of reward, but is that factual uncertainty or moral uncertainty? If it's moral uncertainty (as I think it is), then one view plausibly shouldn't be able to dominate others just by having higher stakes, in much the same way that deontology shouldn't dominate utilitarianism just because deontology may regard murder as infinitely wrong while utilitarianism regards it as only finitely wrong.
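A minimal sketch of the "functional behavior" point, assuming a hypothetical three-armed bandit whose expected rewards are known: adding the same large constant to every reward leaves the greedy choice unchanged.

```python
def greedy_action(expected_rewards):
    """Index of the action with the highest expected reward."""
    return max(range(len(expected_rewards)), key=lambda a: expected_rewards[a])


expected_rewards = [0.2, 0.9, 0.5]   # hypothetical three-armed bandit
shift = 10**6                        # large constant added to every reward (assumed)
shifted_rewards = [r + shift for r in expected_rewards]

original_choice = greedy_action(expected_rewards)
shifted_choice = greedy_action(shifted_rewards)

print(original_choice, shifted_choice)
```

Both choices are the same arm. This holds only while the shift leaves the reward differences representable: a constant large enough to swamp them in floating point would collapse the rewards into ties, and in episodic tasks of variable length a constant shift can change which behavior is optimal.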
By the way, I tend to assume that RL computations at the scale you'd run for a course would have pretty negligible moral (dis)value, because the agents are so barebones. Good luck with the course. :)