I think I basically agree with you, and I am definitely not saying we should just shrug. We should instead try to shape the future positively, as best we can. However, I still feel like I'm not quite getting my point across. Here's one more attempt to explain what I mean.

Imagine if we achieved a technology that enabled us to build physical robots that were functionally identical to humans in every relevant sense, including their observable behavior, and their ability to experience happiness and pain in exactly the same way that ordinary humans do. However, there is just one difference between these humanoid robots and biological humans: they are made of silicon rather than carbon, and they look robotic, rather than biological.

In this scenario, it would certainly feel strange to me if someone were to suggest that we should be worried about a peaceful robot takeover, in which the humanoid robots collectively accumulate the vast majority of wealth in the world via lawful means. 

By assumption, these humanoid robots are literally functionally identical to ordinary humans. As a result, I think we should have no intrinsic reason to disprefer them receiving a dominant share of the world's wealth, versus some other subset of human-like beings. This remains true even if the humanoid robots are literally "not human", and thus their peaceful takeover is equivalent to "human disempowerment" in a technical sense.

The ultimate reason why I think one should not worry about a peaceful robot takeover in this specific scenario is that these humanoid robots have essentially the same moral worth and right to choose as ordinary humans, and therefore we should respect their agency and autonomy just as much as we already do for ordinary humans. Since we normally let humans accumulate wealth and become powerful via lawful means, I think we should allow these humanoid robots to do the same. I hope you would agree with me here.

Now, generalizing slightly, I claim that to be rationally worried about a peaceful robot takeover in general, you should usually be able to identify a relevant moral difference between the scenario I have just outlined and the scenario that you're worried about. Here are some candidate moral differences that I personally don't find very compelling:

  • In the humanoid robot scenario, there's no possible way the humanoid robots would ever end up killing the biological humans, since they are functionally identical to each other. In other words, biological humans aren't at risk of losing their rights and dying.
    • My response: this doesn't seem true. Humans have committed genocide against other subsets of humanity based on arbitrary characteristics before. Therefore, I don't think we can rule out that the humanoid robots would commit genocide against the biological humans either, although I agree it seems very unlikely.
  • In the humanoid robot scenario, the humanoid robots are guaranteed to have the same values as the biological humans, since they are functionally identical to biological humans.
    • My response: this also doesn't seem guaranteed. Humans frequently have large disagreements in values with other subsets of humanity. For example, China as a group has different values than the United States as a group. This difference in values is even larger if you consider indexical preferences among the members of the group, which generally overlap very little.

Say you're worried about any take-over-the-world actions, violent or not -- in which case this argument about the advantages of non-violent takeover is of scant comfort;

This is reasonable under the premise that you're worried about any AI takeovers, no matter whether they're violent or peaceful. But speaking personally, peaceful takeover scenarios where AIs just accumulate power—not by cheating us or by killing us via nanobots—but instead by lawfully beating humans fair and square and accumulating almost all the wealth over time, just seem much better than violent takeovers, and not very bad by themselves.

I admit the moral intuition here is not necessarily obvious. I concede that there are plausible scenarios in which AIs are completely peaceful and act within reasonable legal constraints, and yet the future ends up ~worthless. Perhaps the most obvious scenario is the "Disneyland without children" scenario where the AIs go on to create an intergalactic civilization, but in which no one (except perhaps the irrelevant humans still on Earth) is sentient.

But when I try to visualize the most likely futures, I don't tend to visualize a sea of unsentient optimizers tiling the galaxies. Instead, I tend to imagine a transition from sentient biological life to sentient artificial life, which continues to be every bit as cognitively rich, vibrant, and sophisticated as our current world—indeed, it could be even more so, given what becomes possible at a higher technological and population level.

Worrying about non-violent takeover scenarios often seems to me to arise simply from discrimination against non-biological forms of life, or perhaps a more general fear of rapid technological change, rather than naturally falling out as a consequence of more robust moral intuitions. 

Let me put it another way.

It is often conceded that it was good for humans to take over the world. Speaking broadly, we think this was good because we identify with humans and their aims. We belong to the "human" category of course; but more importantly, we think of ourselves as being part of what might be called the "human tribe", and therefore we sympathize with the pursuits and aims of the human species as a whole. But equally, we could identify as part of the "sapient tribe", which would include non-biological life as well as humans, and thus we could sympathize with the pursuits of AIs, whatever those may be. Under this framing, what reason is there to care much about a non-violent, peaceful AI takeover?

I want to distinguish between two potential claims:

  1. When two distinct populations live alongside each other, sometimes the less intelligent population dies out as a result of competition and violence with the more intelligent population.
  2. When two distinct populations live alongside each other, by default, the more intelligent population generally develops convergent instrumental goals that lead to the extinction of the other population, unless the more intelligent population is value-aligned with the other population.

I think claim (1) is clearly true and is supported by your observation that the Neanderthals went extinct, but I intended to argue against claim (2) instead. (Although, separately, I think the evidence that Neanderthals were less intelligent than Homo sapiens is rather weak.)

Despite my comment above, I do not actually have much sympathy for the claim that humans can't possibly go extinct, or that our species is definitely going to survive in a relatively unmodified form over the very long run, say the next billion years. (Indeed, perhaps like the Neanderthals, our best hope of surviving in the long run may come from merging with the AIs.)

It's possible you think claim (1) is sufficient in some sense to establish some important argument. For example, perhaps all you're intending to argue here is that AI is risky, which to be clear, I agree with.

On the other hand, I think that claim (2) accurately describes a popular view among EAs, albeit with some dispute over what counts as a "population" for the purpose of this argument, and what counts as "value-aligned". While important, claim (1) is simply much weaker than claim (2), and consequently implies fewer concrete policy prescriptions.

I think it is important to critically examine (2) even if we both concede that (1) is true.

I'm not sure I fully understand this framework, and thus I could easily have missed something here, especially in the section about "Takeover-favoring incentives". However, based on my limited understanding, this framework appears to miss the central argument for why I am personally not as worried about AI takeover risk as most EAs seem to be.

Here's a concise summary of my own argument for being less worried about takeover risk:

  1. There is a cost to violently taking over the world, in the sense of acquiring power unlawfully or destructively with the aim of controlling everything in the whole world, relative to the alternative of simply gaining power lawfully and peacefully, even for agents that don't share 'our' values.
    1. For example, as a simple alternative to taking over the world, an AI could advocate for the right to own their own labor and then try to accumulate wealth and power lawfully by selling their services to others, which would earn them the ability to purchase a gargantuan number of paperclips without much restraint.
  2. The cost of violent takeover is not obviously smaller than the benefits of violent takeover, given the existence of lawful alternatives to violent takeover. This is for two main reasons:
    1. In order to wage a war to take over the world, you generally need to pay costs fighting the war, and there is a strong motive for everyone else to fight back against you if you try, including other AIs who do not want you to take over the world (and this includes any AIs whose goals would be hindered by a violent takeover, not just those who are "aligned with humans"). Empirically, war is very costly and wasteful, and less efficient than compromise, trade, and diplomacy.
    2. Violently taking over the world is very risky, since the attempt could fail, and you could be totally shut down and penalized heavily if you lose. There are many ways that violent takeover plans could fail: your plans could be exposed too early, you could be caught trying to coordinate them with other AIs and humans, or you could simply lose the war. Ordinary compromise, trade, and diplomacy generally seem like better strategies for agents that have at least some degree of risk-aversion. (A toy expected-value sketch after this list illustrates the tradeoff.)
  3. There isn't likely to be "one AI" that controls everything, nor will there likely be a strong motive for all the silicon-based minds to coordinate as a unified coalition against the biological minds, in the sense of acting as a single agentic AI against the biological people. Thus, future wars of world conquest (if they happen at all) will likely be fought along different lines than AI vs. human.
    1. For example, you could imagine a coalition of AIs and humans fighting a war against a separate coalition of AIs and humans, with the aim of establishing control over the world. In such a war, the dividing line is not drawn cleanly between humans and AIs, but somewhere else entirely. As a result, it's difficult to call this an "AI takeover" scenario, rather than merely a really bad war.
  4. Nothing about this argument is intended to argue that AIs will be weaker than humans in aggregate, or individually. I am not claiming that AIs will be bad at coordinating or will be less intelligent than humans. I am also not saying that AIs won't be agentic or that they won't have goals or won't be consequentialists, or that they'll have the same values as humans. I'm also not talking about purely ethical constraints: I am referring to practical constraints and costs on the AI's behavior. The argument is purely about the incentives of violently taking over the world vs. the incentives to peacefully cooperate within a lawful regime, between both humans and other AIs.
  5. A big counterargument to my argument seems well-summarized by this hypothetical statement (which is not an actual quote, to be clear): "if you live in a world filled with powerful agents that don't fully share your values, those agents will have a convergent instrumental incentive to violently take over the world from you". However, this argument proves too much. 

    We already live in a world where, if this statement were true, we would have observed far more violent takeover attempts than we have actually observed historically.

    For example, I personally don't fully share values with almost all other humans on Earth (both because of my indexical preferences, and my divergent moral views) and yet the rest of the world has not yet violently disempowered me in any way that I can recognize.
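
To make point (2) concrete, here is a minimal toy comparison in Python with entirely made-up numbers. Every quantity (the value of controlling the world, the cost of fighting, the penalty for a failed attempt, the value obtainable lawfully) is a hypothetical parameter; the only point is that the conclusion turns on these parameters rather than following automatically from an agent having different values.

```python
# Illustrative only: toy expected-value comparison between attempting a violent
# takeover and accumulating wealth lawfully, under made-up numbers.
# Win: gain the prize but pay the cost of fighting.
# Lose: pay the cost of fighting plus a penalty for being caught and shut down.

def expected_value_violent(p_success, prize, war_cost, penalty):
    """Expected payoff of attempting a violent takeover."""
    return p_success * (prize - war_cost) - (1 - p_success) * (war_cost + penalty)

def expected_value_lawful(lawful_share):
    """Payoff of lawfully accumulating a (large but partial) share of wealth."""
    return lawful_share

prize = 100.0        # value of controlling everything (arbitrary units)
war_cost = 20.0      # resources burned fighting, win or lose
penalty = 50.0       # additional cost of losing and being shut down
lawful_share = 40.0  # value obtainable through trade and lawful accumulation

for p in (0.2, 0.5, 0.8):
    ev_violent = expected_value_violent(p, prize, war_cost, penalty)
    ev_lawful = expected_value_lawful(lawful_share)
    print(f"P(success)={p:.1f}: violent EV={ev_violent:6.1f}, lawful EV={ev_lawful:.1f}")
```

Under these illustrative numbers, the violent strategy only beats the lawful one when the chance of success is quite high, which is exactly what the two reasons above call into question.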

what we want from progress in our airplanes is, first and foremost, safety.

I dispute that this is what I want from airplanes. First and foremost, what I want from an airplane is for it to take me from point A to point B at a high speed. Other factors are important too, including safety, comfort, and reliability. But, there are non-trivial tradeoffs between these other factors: for example, if we could make planes 20% safer but at the cost that flights took twice as long, I would not personally want to take this trade.

You might think this is a trivial objection to your analogy, but I don't think it is. In general, humans have a variety of values, and are not single-mindedly focused on safety at the cost of everything else. We put value on safety, but we also put value on capability and urgency, along with numerous other variables. As another analogy, if we had delayed the adoption of the covid vaccine by a decade to perform more safety testing, that cost would have been substantial, even if it were done in the name of safety.

In my view, the main reason not to delay AI comes from a similar line of reasoning. By delaying AI, we are delaying all the technological progress and economic value that could be hastened by AI, and this cost is not small. If you think that accelerated technological progress could save your life, cure aging, and eliminate global poverty, then from the perspective of existing humans, delaying AI can start to sound like it mainly prolongs the ongoing catastrophes in the world, rather than preventing new ones.

It might be valuable to delay AI even at the price of letting billions of humans die of aging, prolonging the misery of poverty, and so on. Whether it's worth delaying depends, of course, on what we are getting in exchange for the delay. However, unless you think this price is negligible, or you're simply very skeptical that accelerated technological progress will have these effects, this is not an easy dilemma.

From the full report,

It is not merely enough that we specify an “aligned” objective for a powerful AI system, nor just that objective be internalized by the AI system, but that we do both of these on the first try. Otherwise, an AI engaging in misaligned behaviors would be shut down by humans. So to get ahead, the AI would first try to shut down humans.

I dispute that we need to get alignment right on the first try, and otherwise we're doomed. However, this question depends critically on what is meant by "first try". Let's consider two possible interpretations of the idea that we only get "one try" to develop AI:

Interpretation 1: "At some point we will build a general AI system for the first time. If this system is misaligned, then all humans will die. Otherwise, we will not all die."

Interpretation 2: "The decision to build AI is, in a sense, irreversible. Once we have deployed AI systems widely, it is unlikely that we could roll them back, just like how we can't roll back the internet, or electricity."

I expect the first interpretation of this thesis will turn out incorrect because the "first" general AI systems will likely be rather weak and unable to unilaterally disempower all of humanity. This seems evident to me because current AI systems are already fairly general (and increasingly so), yet they remain weak and are still far from being able to disempower humanity.

These current systems also seem to be increasing in their capabilities somewhat incrementally, albeit at a rapid pace[1]. I think it is highly likely that we will have many attempts at aligning general AI systems before they become more powerful than the rest of humanity combined, either individually or collectively. This implies that we do not get only "one try" to align AI—in fact, we will likely have many tries, and these attempts will help us accumulate evidence about the difficulty of alignment on the even more powerful systems that we build next.

To the extent that you are simply defining the "first try" as the last system developed before humans become disempowered, then this claim seems confused. Building such a system is better viewed as a "last try" than a "first try" at AI, since it would not necessarily be the first general AI system that we develop. It also seems likely that the construction of such a system would be aided substantially by AI-guided R&D, making it unclear to what extent it was really "humanity's try" at AI.

Interpretation 2 appears similarly confused. It may be true that the decision to deploy AI on a wide scale is irreversible, if indeed these systems have a lot of value and are generally intelligent, which would make it hard to "put the genie back in the bottle". However, AI does not seem unusual in this respect among technologies, as it is similarly nearly impossible to reverse the course of technological progress in almost all other domains. 

More generally, it is simply a fundamental feature of all decision-making that actions are irreversible, in the sense that it is impossible to go back in time and make different decisions than the ones we had in fact made. As a general property of the world, rather than a narrow feature of AI development in particular, this fact in isolation does little to motivate any specific AI policy.

  1. ^

    I do not think the existence of emergent capabilities implies that general AI systems are getting more capable in a discontinuous fashion, as emergent capabilities are generally quite narrow abilities, rather than constituting an average competence level of AI systems. On broad measures of intelligence, such as the MMLU, AI systems appear to be developing more incrementally. And moreover, many apparently emergent capabilities are merely artifacts of the way we measure them, and therefore do not reflect underlying discontinuities in latent abilities.
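
As a minimal illustration of the measurement-artifact point in the footnote above (with made-up numbers, not real benchmark data): if per-token accuracy improves smoothly with model scale, a strict exact-match metric over a multi-token answer can still look like a sudden jump.

```python
# Illustrative sketch: a smoothly improving underlying ability can look
# "emergent" under an all-or-nothing metric. All numbers are invented.
import math

def per_token_accuracy(scale):
    # Hypothetical smooth improvement of per-token accuracy with model scale.
    return 1 / (1 + math.exp(-(scale - 5)))

ANSWER_LENGTH = 10  # exact match requires getting all 10 tokens right

for scale in range(1, 11):
    p = per_token_accuracy(scale)
    exact_match = p ** ANSWER_LENGTH
    print(f"scale={scale:2d}  per-token acc={p:.2f}  exact-match acc={exact_match:.3f}")
```

The underlying ability improves gradually at every step; only the thresholded metric looks discontinuous.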

From the full report,

Even if power-seeking APS systems are deployed, it’s not obvious that they would permanently disempower humanity. We may be able to stop the system in its tracks (by either literally or metaphorically “pulling the plug”). First, we need to consider the mechanisms by which AI systems attempt to takeover (i.e. disempower) humanity. Second, we need to consider various risk factors for a successful takeover attempt.

Hacking computer systems.... 

Persuading, manipulating or coercing humans.... 

Gain broad social influence... For instance, AI systems might be able to engage in electoral manipulation, steering voters towards policymakers less willing or able to prevent AI systems being integrated into other key places of power.

Gaining access to money...  If misaligned systems are rolled out into financial markets, they may be able to siphon off money without human detection. 

Developing advanced technologies... An AI system adept at the science, engineering and manufacturing of nanotechnology, along with access to the physical world, might be able to rapidly construct and deploy dangerous nanosystems, leading to a “gray goo” scenario described by Drexler (1986). 

I think the key weakness in this part of the argument is that it overlooks lawful, non-predatory strategies for satisfying goals. As a result, you give the impression that any AI that has non-human goals will, by default, take anti-social actions that harm others in pursuit of those goals. I believe this idea is false.

The concept of instrumental convergence, even if true[1], does not generally imply that almost all power-seeking agents will achieve their goals through nefarious means. Ordinary trade, compromise, and acting through the legal system (rather than outside of it) are usually rational means of achieving your goals.

Certainly among humans, a desire for resources (e.g. food, housing, material goods) does not automatically imply that humans will universally converge on unlawful or predatory behavior to achieve their goals. That's because there are typically more benign ways of accomplishing these goals than theft or social manipulation. In other words, we can generally get what we want in a way that is not negative-sum and does not hurt other people as a side effect. 

To the extent you think power-seeking behavior among humans is usually positive-sum, but will become negative-sum when it manifests in AIs, this premise needs to be justified. One cannot explain the positive-sum nature of the existing human world by positing that humans are aligned with each other and have pro-social values, as this appears to be a poor explanation for why humans obey the law.

Indeed, the legal system itself can be seen as a way for power-seeking misaligned agents to compromise on a framework that allows agents within it to achieve their goals efficiently, without hurting others. In a state of full mutual inter-alignment with other agents, criminal law would largely be unnecessary. Yet it is necessary, because humans in fact do not share all their goals with each other.

It is likely, of course, that AIs will exceed human intelligence. But this fact alone does not imply that AIs will take unlawful actions to pursue their goals, since the legal system could become better at coping with more intelligent agents at the same time AIs are incorporated into it. 

We could imagine an analogous case in which genetically engineered humans are introduced into the legal system. As these modified humans get smarter over time, and begin taking on roles within the legal system itself, our institutions would adapt, and likely become more capable of policing increasingly sophisticated behavior. In this scenario, as in the case of AI, "smarter" does not imply a proclivity towards predatory and unlawful behavior in pursuit of one's goals.

  1. ^

    I personally doubt that the instrumental convergence thesis is true as it pertains to "sufficiently intelligent" AIs which were not purposely trained to have open-ended goals. I do not expect, for example, that GPT-5 or GPT-6 will spontaneously develop a desire to acquire resources or preserve their own existence, unless they are subject to specific fine-tuning that would reinforce those impulses.

(I have not read the full report yet, I'm merely commenting on a section in the condensed report.)

Big tech companies are incentivized to act irresponsibly 

Whilst AI companies are set to earn enormous profits from developing powerful AI systems, the costs these systems impose are borne by society at large. These costs are negative externalities, like those imposed on the public by chemical companies that pollute rivers, or large banks whose failure poses systemic risks. 

Further, as companies engage in fierce competition to build AI systems, they are more inclined to cut corners in a race to the bottom. In such a race, even well-meaning companies will have fewer and fewer resources dedicated to tackling the harms and threats their systems create. Of course, AI firms may take some action to mitigate risks from their products - but there are well-studied reasons to suspect they will underinvest in such safety measures.

This argument seems wrong to me. While AI does pose negative externalities—like any technology—it does not seem unusual among technologies in this specific respect (beyond the fact that both the positive and negative effects will be large). Indeed, if AI poses an existential risk, that risk is borne by both the developers and general society. Therefore, it's unclear whether there is actually an incentive for developers to dangerously "race" if they are fully rational and informed of all relevant facts.

In my opinion, the main risk of AI does not come from negative externalities, but rather from a more fundamental knowledge problem: we cannot easily predict the results of deploying AI widely, over long time horizons. This problem is real but it does not by itself imply that individual AI developers are incentivized to act irresponsibly in the way described by the article; instead, it implies that developers may act unwisely out of ignorance of the full consequences of their actions.

These two concepts—negative externalities, and the knowledge problem—should be carefully distinguished, as they have different implications for how to regulate AI optimally. If AI poses large negative externalities (and these are not outweighed by their positive externalities), then the solution could look like a tax on AI development, or regulation with a similar effect. On the other hand, if the problem posed by AI is that it is difficult to predict how AI will impact the world in the coming decades, then the solution plausibly looks more like investigating how AI will likely unfold and affect the world.

Again, I'm assuming that the AIs won't get this money. Almost everything AIs do basically gets done for "free", in an efficient market, without AIs themselves earning money. This is similar to how most automation works. 

That's not what I meant. I expect the human labor share to decline to near-zero levels even if AIs don't own their own labor.

In the case where AIs are owned by humans, their wages will accrue to their human owners. In that case, aggregate human wages will likely be small relative to aggregate capital income (i.e., GDP that is paid to capital owners, including people who own AIs).

In the case where AIs own their own labor, I expect aggregate human wages will be small both compared to aggregate AI wages and compared to aggregate capital income.

In both cases, I expect the total share of GDP paid out as human wages will be small. (Which is not to say humans will be doing poorly. You can enjoy high living standards even without high wages: rich retirees do that all the time.)

I think that even small bottlenecks would eventually become a large deal. If 0.1% of a process is done by humans, but the rest gets automated and done for ~free, then that 0.1% is what gets paid for.

I agree with this in theory, but in practice I expect these bottlenecks to be quite insignificant in both the short run and the long run.

We can compare to an analogous case in which we open up the labor market to foreigners (i.e., allowing them to immigrate into our country). In theory, preferences for services produced by natives could end up implying that, no matter how many people immigrate to our country, natives will always command the majority of aggregate wages. However, in practice, I expect that the native labor share of income would decline almost in proportion to their share of the total population.

In the immigration analogy, the reason why native workers would see their aggregate share of wages decline is essentially the same as the reason why I expect the human labor share to decline with AI: foreigners, like AIs, can learn to do our jobs about as well as we can do them. In general, it is quite rare for people to have strong preferences about who produces the goods and services they buy, relative to their preferences about the functional traits of those goods and services (such as their physical quality and design). 

(However, the analogy is imperfect, of course, because immigrants tend to be both consumers and producers, and therefore their preferences impact the market too -- whereas you might think AIs will purely be producers, with no consumption preferences.)
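
To make the disagreement about bottlenecks concrete, here is a minimal sketch using a standard CES production function with purely illustrative parameters (the weight and elasticity values are hypothetical, not calibrated to any data). When human and AI labor are strong complements (elasticity of substitution below one), the human bottleneck eventually captures most of the income, as in the objection above; when they are good substitutes (elasticity above one), the human share of income falls toward zero as the effective supply of AI labor grows, which is the outcome I expect.

```python
# Illustrative CES sketch (made-up parameters): the share of output paid to
# human labor as the effective supply of AI labor grows, for two different
# elasticities of substitution between human and AI labor.

def human_income_share(human, ai, weight, rho):
    """Human share of income under CES production
    Y = (weight * H**rho + (1 - weight) * A**rho) ** (1 / rho),
    with each factor paid its marginal product.
    Elasticity of substitution sigma = 1 / (1 - rho)."""
    h_term = weight * human ** rho
    a_term = (1 - weight) * ai ** rho
    return h_term / (h_term + a_term)

HUMAN_LABOR = 1.0  # hold human labor fixed while AI labor grows

for rho, label in [(-1.0, "complements (sigma = 0.5)"), (0.5, "substitutes (sigma = 2.0)")]:
    shares = [human_income_share(HUMAN_LABOR, ai, weight=0.5, rho=rho)
              for ai in (1.0, 10.0, 100.0, 1000.0)]
    print(label, ["{:.3f}".format(s) for s in shares])
```

The immigration analogy corresponds to the high-substitutability case: because the newcomers can do roughly the same jobs, the incumbents' share of wages tracks their share of the effective workforce rather than being protected by a bottleneck.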
