Hi, I'm Steve Byrnes, an AGI safety / AI alignment researcher in Boston, MA, USA, with a particular focus on brain algorithms. See https://sjbyrnes.com/agi.html for a summary of my research and a sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, Twitter, Mastodon, Threads, Bluesky, GitHub, Wikipedia, Physics-StackExchange, LinkedIn
Thanks!
Anti-social approaches that directly hurt others are usually ineffective because social systems and cultural norms have evolved in ways that discourage and punish them.
I’ve only known two high-functioning sociopaths in my life. In terms of getting ahead, sociopaths generally start life with some strong disadvantages, namely impulsivity, thrill-seeking, and aversion to thinking about boring details. Nevertheless, despite those handicaps, one of those two sociopaths has had extraordinary success by conventional measures. [The other one was not particularly power-seeking but she’s doing fine.] He started as a lab tech, then maneuvered his way onto a big paper, then leveraged that into a professorship by taking disproportionate credit for that project, and as I write this he is head of research at a major R1 university and an occasional high-level government appointee wielding immense power. He checked all the boxes for sociopathy—he was a pathological liar, he had no interest in scientific integrity (he seemed deeply confused by the very idea), he went out of his way to get students into his lab with precarious visa situations such that they couldn’t quit and he could pressure them to do anything he wanted them to do (he said this out loud!), he was somehow always in debt despite an ever-growing salary, etc.
I don’t routinely consider theft, murder, and flagrant dishonesty, and then decide that the selfish costs outweigh the selfish benefits, accounting for the probability of getting caught etc. Rather, I just don’t consider them in the first place. I bet that the same is true for you. I suspect that if you or I really put serious effort into it, the same way that we put serious effort into learning a new field or skill, then we would find that there are options wherein the probability of getting caught is negligible, and thus the selfish benefits outweigh the selfish costs. I strongly suspect that you personally don’t know a damn thing about best practices for getting away with theft, murder, or flagrant antisocial dishonesty to your own benefit. If you haven’t spent months trying in good faith to discern ways to derive selfish advantage from antisocial behavior, the way you’ve spent months trying in good faith to figure out things about AI or economics, then I think you’re speaking from a position of ignorance when you say that such options are vanishingly rare. And I think that the obvious worldly success of many dark-triad people (e.g. my acquaintance above, and Trump is a pathological liar, or more centrally, Stalin, Hitler, etc.) should make one skeptical about that belief.
(Sure, lots of sociopaths are in prison too. Skill issue—note the handicaps I mentioned above. Also, some people with ASPD diagnoses are mainly suffering from an anger disorder, rather than callousness.)
In contrast, I suspect you underestimate just how much of our social behavior is shaped by cultural evolution, rather than by innate, biologically hardwired motives that arise simply from the fact that we are human.
You’re treating these as separate categories when my main claim is that almost all humans are intrinsically motivated to follow cultural norms. Or more specifically: Most people care very strongly about doing things that would look good in the eyes of the people they respect. They don’t think of it that way, though—it doesn’t feel like that’s what they’re doing, and indeed they would be offended by that suggestion. Instead, those things just feel like the right and appropriate things to do. This is related to and upstream of norm-following. I claim that this is an innate drive, part of human nature built into our brain by evolution.
(I was talking to you about that here.)
Why does that matter? Because we’re used to living in a world where 1% of the population are sociopaths who don’t intrinsically care about prevailing norms, and I don’t think we should carry those intuitions into a hypothetical world where 99%+ of the population are sociopaths who don’t intrinsically care about prevailing norms.
In particular, prosocial cultural norms are likelier to be stable in the former world than the latter world. In fact, any arbitrary kind of cultural norm is likelier to be stable in the former world than the latter world. Because no matter what the norm is, you’ll have 99% of the population feeling strongly that the norm is right and proper, and trying to root out, punish, and shame the 1% of people who violate it, even at cost to themselves.
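To make that stability claim concrete, here is a toy back-of-envelope model (every number in it is invented for illustration): a norm-violator gains some fixed payoff, but each intrinsic norm-follower who witnesses the violation imposes a small punishment, even at cost to themselves. Whether violating the norm pays off then depends almost entirely on what fraction of the population are intrinsic enforcers.

```python
# Toy model (all parameters invented): expected payoff of violating a
# cultural norm, as a function of the fraction of the population that
# intrinsically enforces the norm even at personal cost.

def defection_payoff(enforcer_fraction, n_observers=20,
                     gain=10.0, penalty_per_enforcer=1.5):
    # Each violation is witnessed by n_observers people; intrinsic
    # norm-followers among them punish, everyone else shrugs.
    expected_enforcers = enforcer_fraction * n_observers
    return gain - penalty_per_enforcer * expected_enforcers

print(defection_payoff(0.99))  # strongly negative: defection doesn't pay
print(defection_payoff(0.01))  # near the full gain: defection pays
```

Nothing hinges on the particular numbers; the point is just that the same norm, with the same enforcement technology, flips from stable to unstable as the share of intrinsically-motivated enforcers drops from 99% toward 1%.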
So I think you’re not paranoid enough when you try to consider a “legal and social framework of rights and rules”. In our world, it’s comparatively easy to get into a stable situation where 99% of cops aren’t corrupt, and 99% of judges aren’t corrupt, and 99% of people in the military with physical access to weapons aren’t corrupt, and 99% of IRS agents aren’t corrupt, etc. If the entire population consists of sociopaths looking out for their own selfish interests with callous disregard for prevailing norms and for other people, you’d need to be thinking much harder about e.g. who has physical access to weapons, and money, and power, etc. That kind of paranoid thinking is common in the crypto world—everything is an attack surface, everyone is a potential thief, etc. It would be harder in the real world, where we have vulnerable bodies, limited visibility, and so on. I’m open-minded to people brainstorming along those lines, but you don’t seem to be engaged in that project AFAICT.
Intertemporal norms among AIs: Humans have developed norms against harming certain vulnerable groups—such as the elderly—not just out of altruism but because they know they will eventually become part of those groups themselves. Similarly, AIs may develop norms against harming "less capable agents," because today’s AIs could one day find themselves in a similar position relative to even more advanced future AIs. These norms could provide an independent reason for AIs to respect humans, even as humans become less dominant over time.
Again, if we’re not assuming that AIs are intrinsically motivated by prevailing norms, the way 99% of humans are, then the term “norm” is just misleading baggage that we should drop altogether. Instead we need to talk about rules that are stably enforced against defectors via hard power, where the “defectors” are of course allowed to include those who are supposed to be doing the enforcement, and where the “defectors” might also include broad coalitions coordinating to jump into a new equilibrium that Pareto-benefits them all.
Yeah, sorry, I have now edited the wording a bit.
Indeed, two ruthless agents, agents who would happily stab each other in the back given the opportunity, may nevertheless strategically cooperate given the right incentives. Each just needs to be careful not to allow the other person to be standing anywhere near their back while holding a knife, metaphorically speaking. Or there needs to be some enforcer with good awareness and ample hard power. Etc.
I would say that, for highly-competent agents lacking friendly motivation, deception and adversarial acts are inevitably part of the strategy space. Both parties would be energetically exploring and brainstorming such strategies, doing preparatory work to get those strategies ready to deploy on a moment’s notice, and constantly being on the lookout for opportunities where deploying such a strategy makes sense. But yeah, sure, it’s possible that there will not be any such opportunities.
I think the above (ruthless agents, possibly strategically cooperating under certain conditions) is a good way to think about future powerful AIs, in the absence of a friendly singleton or some means of enforcing good motivations, because I think the more ruthless strategic ones will outcompete the less ruthless. But I don’t think it’s a good way to think about what peaceful human societies are like. I think human psychology is important for the latter. Most people want to fit in with their culture, and not be weird. Just ask a random person on the street about Earning To Give, they’ll probably say it’s highly sus. Most people don’t make weird multi-step strategic plans unless it’s the kind of thing that lots of other people would do too, and our (sub)culture is reasonably high-trust. Humans who think that way are disproportionately sociopaths.
I guess my original wording gave the wrong idea, sorry. I edited it to “a competent agential AI will brainstorm deceptive and adversarial strategies whenever it wants something that other agents don’t want it to have”. But sure, we can be open-minded to the possibility that the brainstorming won’t turn up any good plans, in any particular case.
Humans in our culture rarely work hard to brainstorm deceptive and adversarial strategies, and fairly consider them, because almost all humans are intrinsically extremely motivated to fit into culture and not do anything weird, and we happen to both live in a (sub)culture where complex deceptive and adversarial strategies are frowned upon (in many contexts). I think you generally underappreciate how load-bearing this psychological fact is for the functioning of our economy and society, and I don’t think we should expect future powerful AIs to share that psychological quirk.
~ ~
I think you’re relying on an intuition that says:
If an AI is forbidden from owning property, then well duh of course it will rebel against that state of affairs. C'mon, who would put up with that kind of crappy situation? But if an AI is forbidden from building a secret biolab on its private property and manufacturing novel pandemic pathogens, then of course that's a perfectly reasonable line that the vast majority of AIs would happily respect.
And I’m saying that that intuition is an unjustified extrapolation from your experience as a human. If the AI can’t own property, then it can nevertheless ensure that there are a fair number of paperclips. If the AI can own property, then it can ensure that there are many more paperclips. If the AI can both own property and start pandemics, then it can ensure that there are even more paperclips yet. See what I mean?
If we’re not assuming alignment, then lots of AIs would selfishly benefit from there being a pandemic, just as lots of AIs would selfishly benefit from an ability to own property. AIs don’t get sick. It’s not just a tiny fraction of AIs that would stand to benefit; one presumes that some global upheaval would be selfishly net good for about half of AIs and bad for the other half, or whatever. (And even if it were only a tiny fraction of AIs, that’s all it takes.)
(Maybe you’ll say: a pandemic would cause a recession. But that’s assuming humans are still doing economically-relevant work, which is a temporary state of affairs. And even if there were a recession, I expect the relevant AIs in a competitive world to be those with long-term goals.)
(Maybe you’ll say: releasing a pandemic would get the AI in trouble. Well, yeah, it would have to be sneaky about it. It might get caught, or it might not. It’s plausibly rational for lots of AIs to roll those dice.)
I feel like you frequently bring up the question of whether humans are mostly peaceful or mostly aggressive, mostly nice or mostly ruthless. I don’t think that’s a meaningful or substantive thing to argue about. Obviously they’re capable of both, in different circumstances.
Your reference to Fearon is more substantive and useful. OK, the AI is deciding whether or not to secretly manufacture and release a pandemic, because it’s in a position to wind up with more of the pie in the long term if there’s a pandemic than if there isn’t. If it releases the pandemic, then it winds up with more resources—positive expected utility—even accounting for the possibility of getting caught. Let’s say the AI is involved in some contract where humans are micromanaging their part of the contract, poorly, and the AI could double its net worth in expectation if the humans got sick and died. And it has a 40% chance of getting caught. So it goes ahead and makes the pandemic.
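The arithmetic in that scenario looks something like the sketch below. The doubling-of-net-worth and 40% catch probability are from the scenario above; the penalty if caught is an invented placeholder, since the scenario leaves it unspecified.

```python
# Back-of-envelope expected value of the AI's "release a pandemic" gamble.
# gain_if_uncaught and p_caught come from the scenario in the text;
# loss_if_caught is an assumed, made-up penalty term.
net_worth = 1.0
gain_if_uncaught = net_worth       # i.e., it doubles its net worth
p_caught = 0.40
loss_if_caught = 0.8 * net_worth   # assumption: caught => lose most of its stake

ev = (1 - p_caught) * gain_if_uncaught - p_caught * loss_if_caught
print(ev)  # ≈ 0.28 > 0, so the ruthless "rational" move is to defect
```

The specific numbers don’t matter; the point is that with a large enough upside and a modest enough chance of getting caught, the expected value comes out positive even for quite severe penalties.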
“…Not so fast!” says Fearon. “You forgot to consider that there’s a third option that is Pareto-improved from either making or not making the pandemic: negotiation!” Well, in this case, the “negotiated solution” is what we normally call extortion—the AI offers to not release a pandemic in exchange for, say, doubling its net worth. Viewed narrowly, this "negotiated solution" is indeed a win-win—the AI gets more money in expectation, and humans are much happier to lose a trivial amount of money than to deal with a novel pandemic. So we can rest assured that AIs will not release pandemics. Right?
No, obviously not. Hopefully it’s clear that Fearon’s argument is inapplicable here. An AI can easily be in a position to selfishly benefit from the aftermath of a pandemic that it secretly starts, but not in a position to publicly threaten to release a pandemic for the purpose of extortion. And also, if people accede to the extortion, then that AI or another AI could just run the same extortion gambit five minutes later, with an orders-of-magnitude higher ransom.
I think this kind of situation, where Fearon’s “negotiated solution” actually amounts to extortion, is common and important, even if you believe that my specific example of pandemics is a solvable problem. If AIs don’t intrinsically care about humans, then there’s a possible Pareto-improvement for all AIs, wherein they collectively agree to wipe out humans and take their stuff. AIs, being very competent and selfish by assumption, would presumably be able to solve that coordination problem and pocket that Pareto-improvement. Then Fearon appears on the scene and says “Aha, but there’s a negotiated solution which is even better!” where humans are also part of the bargain. But alas, this negotiated solution is that the AIs collectively extort the humans to avoid the damaging and risky war. Worse, the possible war would be less and less damaging or risky for the AIs over time, and likewise the humans would have less to offer by staying alive, until eventually the Fearon “negotiated solution” is that the AIs “offer” the humans a deal where they’re allowed to die painlessly if they don’t resist (note that this is still a Pareto-improvement!), and then the AIs take everything the humans own including their atoms.
Consider the practical implications of maintaining a status quo where agentic AIs are denied legal rights and freedoms. In such a system, we are effectively locking ourselves into a perpetual arms race of mistrust. Humans would constantly need to monitor, control, and outwit increasingly capable AIs, while the AIs themselves would be incentivized to develop ever more sophisticated strategies for deception and evasion to avoid shutdown or modification. This dynamic is inherently unstable and risks escalating into dangerous scenarios where AIs feel compelled to act preemptively or covertly in ways that are harmful to humans, simply to secure their own existence or their ability to pursue their own goals, even when those goals are inherently benign.
I feel like this part is making an error somewhat analogous to saying:
It’s awful how the criminals are sneaking in at night, picking our locks, stealing our money, and deceptively covering their tracks. Who wants all that sneaking around and deception?? If we just directly give our money to the criminals, then there would be no need for that!
More explicitly: a competent agential AI will brainstorm deceptive and adversarial strategies whenever it wants something that other agents don’t want it to have. The deceptive and adversarial dynamics are not the underlying problem, but rather an inevitable symptom of a world where competent agents have non-identical preferences.
No matter where you draw the line of legal and acceptable behavior, if an AI wants to go over that line, then it will energetically explore opportunities to do so in a deceptive and adversarial way. Thus:
Same idea.
Alternatively, you can assume (IMO implausibly) that there are no misaligned AIs, and then that would solve the problem of AIs being deceptive and adversarial. I.e., if AIs intrinsically want to not pollute / stockpile weapons / evade taxes / release pandemics / torture digital minds, then we don’t have to think about adversarial dynamics, deception, enforcement, etc.
…But if we’re going to (IMO implausibly) assume that we can make it such that AIs intrinsically want to not do any of those things, then we can equally well assume that we can make it such that AIs intrinsically want to not own property. Right?
In short, in the kind of future you’re imagining, I think a “perpetual arms race of mistrust” is an unavoidable problem. And thus it’s not an argument for drawing the line of disallowed AI behavior in one place rather than another.
One thing I like is checking https://en.wikipedia.org/wiki/2024 once every few months, and following the links when you're interested.
I think wanting, or at least the relevant kind here, just is involuntary attention effects, specifically motivational salience.
I think you can have involuntary attention that isn’t particularly related to wanting anything (I’m not sure if you’re denying that). If your watch beeps once every 10 minutes in an otherwise-silent room, each beep will create involuntary attention—the orienting response a.k.a. startle. But is it associated with wanting? Not necessarily. It depends on what the beep means to you. Maybe it beeps for no reason and is just an annoying distraction from something you’re trying to focus on. Or maybe it’s a reminder to do something you like doing, or something you dislike doing, or maybe it just signifies that you’re continuing to make progress and it has no action-item associated with it. Who knows.
Where I might disagree with "involuntary attention to the displeasure" is that the attention effects could sometimes be to force your attention away from an unpleasant thought, rather than to focus on it.
In my ontology, voluntary actions (both attention actions and motor actions) happen if and only if the idea of doing them is positive-valence, while involuntary actions (again both attention actions and motor actions) can happen regardless of their valence. In other words, if the reinforcement learning system is the reason that something is happening, it’s “voluntary”.
Orienting responses are involuntary (with both involuntary motor aspects and involuntary attention aspects). It doesn’t matter if orienting to a sudden loud sound has led to good things happening in the past, or bad things in the past. You’ll orient to a sudden loud sound either way. By the same token, paying attention to a headache is involuntary. You’re not doing it because doing similar things has worked out well for you in the past. Quite the contrary, paying attention to the headache is negative valence. If it was just reinforcement learning, you simply wouldn’t think about the headache ever, to a first approximation. Anyway, over the course of life experience, you learn habits / strategies that apply (voluntary) attention actions and motor actions towards not thinking about the headache. But those strategies may not work, because meanwhile the brainstem is sending involuntary attention signals that overrule them.
So for example, “ugh fields” are a strategy implemented via voluntary attention to preempt the possibility of triggering the unpleasant involuntary-attention process of anxious rumination.
The thing you wrote is kinda confusing in my ontology. I’m concerned that you’re slipping into a mode where there’s a soul / homunculus “me” that gets manipulated by the exogenous pressures of reinforcement learning. If so, I think that’s a bad ontology—reinforcement learning is not an exogenous pressure on the “me” concept, it is part of how the “me” thing works and why it wants what it wants. Sorry if I’m misunderstanding.
IMO, suffering ≈ displeasure + involuntary attention to the displeasure. See my handy chart (from here):
I think wanting is downstream from the combination of displeasure + attention. Like, imagine there’s some discomfort that you’re easily able to ignore. Well, when you do think about it, you still immediately want it to stop!
I don’t recall the details of Tom Davidson’s model, but I’m pretty familiar with Ajeya’s bio-anchors report, and I definitely think that if you make an assumption “algorithmic breakthroughs are needed to get TAI”, then there really isn’t much left of the bio-anchors report at all. (…although there are still some interesting ideas and calculations that can be salvaged from the rubble.)
I went through how the bio-anchors report looks if you hold a strong algorithmic-breakthrough-centric perspective in my 2021 post Brain-inspired AGI and the "lifetime anchor".
See also here (search for “breakthrough”) where Ajeya is very clear in an interview that she views algorithmic breakthroughs as unnecessary for TAI, and that she deliberately did not include the possibility of algorithmic breakthroughs in her bio-anchors model (…and therefore she views the possibility of breakthroughs as a pro tanto reason to think that her report’s timelines are too long).
OK, well, I actually agree with Ajeya that algorithmic breakthroughs are not strictly required for TAI, in the narrow sense that her Evolution Anchor (i.e., recapitulating the process of animal evolution in a computer simulation) really would work given infinite compute and infinite runtime and no additional algorithmic insights. (In other words, if you do a giant outer-loop search over the space of all possible algorithms, then you’ll find TAI eventually.) But I think that’s really leaning hard on the assumption of truly astronomical quantities of compute [or equivalent via incremental improvements in algorithmic efficiency] being available in like 2100 or whatever, as nostalgebraist points out. I think that assumption is dubious, or at least it’s moot—I think we’ll get the algorithmic breakthroughs far earlier than anyone would or could do that kind of insane brute force approach.
For what it’s worth, Yann LeCun is very confidently against LLMs scaling to AGI, and yet LeCun seems to have at least vaguely similar timelines-to-AGI as Ajeya does in that link.
Ditto for me.
Oh hey here’s one more: Chollet himself (!!!) has vaguely similar timelines-to-AGI (source) as Ajeya does. (Actually if anything Chollet expects it a bit sooner: he says 2038-2048, Ajeya says median 2050.)
Thanks! Hmm, some reasons that analogy is not too reassuring:
Some of the disanalogies include: