Matthew_Barnett

Comments

From the full report,

It is not merely enough that we specify an “aligned” objective for a powerful AI system, nor just that the objective be internalized by the AI system, but that we do both of these on the first try. Otherwise, an AI engaging in misaligned behaviors would be shut down by humans. So to get ahead, the AI would first try to shut down humans.

I dispute the claim that we need to get alignment right on the first try or else we're doomed. However, this question depends critically on what is meant by "first try". Let's consider two possible interpretations of the idea that we only get "one try" to develop AI:

Interpretation 1: "At some point we will build a general AI system for the first time. If this system is misaligned, then all humans will die. Otherwise, we will not all die."

Interpretation 2: "The decision to build AI is, in a sense, irreversible. Once we have deployed AI systems widely, it is unlikely that we could roll them back, just like how we can't roll back the internet, or electricity."

I expect the first interpretation of this thesis will turn out incorrect because the "first" general AI systems will likely be rather weak and unable to unilaterally disempower all of humanity. This seems evident to me because current AI systems are already fairly general (and increasingly so), yet remain weak and far from being able to disempower humanity.

These current systems also seem to be increasing in their capabilities somewhat incrementally, albeit at a rapid pace[1]. I think it is highly likely that we will have many attempts at aligning general AI systems before they become more powerful than the rest of humanity combined, either individually or collectively. This implies that we do not get only "one try" to align AI—in fact, we will likely have many tries, and these attempts will help us accumulate evidence about the difficulty of alignment on the even more powerful systems that we build next.

To the extent that you are simply defining the "first try" as the last system developed before humans become disempowered, this claim seems confused. Building such a system is better viewed as a "last try" than a "first try" at AI, since it would not necessarily be the first general AI system that we develop. It also seems likely that the construction of such a system would be aided substantially by AI-guided R&D, making it unclear to what extent it was really "humanity's try" at AI.

Interpretation 2 appears similarly confused. It may be true that the decision to deploy AI on a wide scale is irreversible, if indeed these systems have a lot of value and are generally intelligent, which would make it hard to "put the genie back in the bottle". However, AI does not seem unusual among technologies in this respect, as it is also nearly impossible to reverse the course of technological progress in almost all other domains.

More generally, it is simply a fundamental feature of all decision-making that actions are irreversible, in the sense that it is impossible to go back in time and make different decisions than the ones we had in fact made. As a general property of the world, rather than a narrow feature of AI development in particular, this fact in isolation does little to motivate any specific AI policy.

  1. ^

    I do not think the existence of emergent capabilities implies that general AI systems are getting more capable in a discontinuous fashion, as emergent capabilities tend to be quite narrow abilities rather than broad gains in the average competence of AI systems. On broad measures of intelligence, such as the MMLU, AI systems appear to be developing more incrementally. Moreover, many apparently emergent capabilities are merely artifacts of the way we measure them, and therefore do not reflect underlying discontinuities in latent abilities.
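
    As a minimal sketch of this measurement-artifact point (the numbers here are mine and purely illustrative, not taken from any benchmark): if per-token accuracy improves smoothly with scale, an exact-match metric over a long answer can still look like a sudden jump.

    ```python
    import numpy as np

    # Hypothetical illustration: per-token accuracy improves smoothly with "scale",
    # but exact-match over a k-token answer (accuracy ** k) looks discontinuous.
    scales = np.linspace(0.0, 1.0, 11)      # stand-in for normalized log model scale
    per_token_acc = 0.5 + 0.5 * scales      # smooth improvement from 0.5 to 1.0
    k = 30                                  # answer length in tokens
    exact_match = per_token_acc ** k        # sharp, "emergent"-looking curve

    for s, p, em in zip(scales, per_token_acc, exact_match):
        print(f"scale={s:.1f}  per-token acc={p:.2f}  exact-match={em:.4f}")
    ```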

From the full report,

Even if power-seeking APS systems are deployed, it’s not obvious that they would permanently disempower humanity. We may be able to stop the system in its tracks (by either literally or metaphorically “pulling the plug”). First, we need to consider the mechanisms by which AI systems attempt to take over (i.e. disempower) humanity. Second, we need to consider various risk factors for a successful takeover attempt.

Hacking computer systems.... 

Persuading, manipulating or coercing humans.... 

Gain broad social influence... For instance, AI systems might be able to engage in electoral manipulation, steering voters towards policymakers less willing or able to prevent AI systems being integrated into other key places of power.

Gaining access to money...  If misaligned systems are rolled out into financial markets, they may be able to siphon off money without human detection. 

Developing advanced technologies... An AI system adept at the science, engineering and manufacturing of nanotechnology, along with access to the physical world, might be able to rapidly construct and deploy dangerous nanosystems, leading to a “gray goo” scenario described by Drexler (1986). 

I think the key weakness in this part of the argument is that it overlooks lawful, non-predatory strategies for satisfying goals. As a result, you give the impression that any AI that has non-human goals will, by default, take anti-social actions that harm others in pursuit of its goals. I believe this idea is false.

The concept of instrumental convergence, even if true[1], does not generally imply that almost all power-seeking agents will achieve their goals through nefarious means. Ordinary trade, compromise, and acting through the legal system (rather than outside of it) are usually rational means of achieving your goals.

Certainly among humans, a desire for resources (e.g. food, housing, material goods) does not automatically imply that humans will universally converge on unlawful or predatory behavior to achieve their goals. That's because there are typically more benign ways of accomplishing these goals than theft or social manipulation. In other words, we can generally get what we want in a way that is not negative-sum and does not hurt other people as a side effect. 

To the extent you think power-seeking behavior among humans is usually positive-sum, but will become negative-sum when it manifests in AIs, this premise needs to be justified. One cannot explain the positive-sum nature of the existing human world by positing that humans are aligned with each other and have pro-social values, as this appears to be a poor explanation for why humans obey the law.

Indeed, the legal system itself can be seen as a way for power-seeking, misaligned agents to compromise on a framework that allows them to achieve their goals efficiently without hurting others. In a state of full mutual alignment among agents, criminal law would be largely unnecessary. Yet it is necessary, because humans in fact do not share all their goals with each other.
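
To make this concrete, here is a toy expected-value comparison (the numbers are mine and purely illustrative): even a purely self-interested agent can prefer lawful trade over predation once enforcement is priced in.

```python
# Toy expected-value comparison (illustrative numbers only): a goal-directed
# agent that does not share our values weighs lawful trade against predation.
gain_from_trade = 100     # payoff from cooperating within the legal system
gain_from_theft = 150     # gross payoff from predatory behavior
p_caught = 0.8            # probability the legal system detects and punishes
penalty = 500             # cost if punished (fines, exclusion, shutdown)

expected_theft = gain_from_theft - p_caught * penalty   # 150 - 0.8*500 = -250
print(f"trade: {gain_from_trade}, predation (expected): {expected_theft}")
# With credible enforcement, the lawful strategy dominates for this agent
# even though its goals are not aligned with anyone else's.
```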

It is likely, of course, that AIs will exceed human intelligence. But this fact alone does not imply that AIs will take unlawful actions to pursue their goals, since the legal system could become better at coping with more intelligent agents at the same time AIs are incorporated into it. 

We could imagine an analogous case in which genetically engineered humans are introduced into the legal system. As these modified humans get smarter over time, and begin taking on roles within the legal system itself, our institutions would adapt, and likely become more capable of policing increasingly sophisticated behavior. In this scenario, as in the case of AI, "smarter" does not imply a proclivity towards predatory and unlawful behavior in pursuit of one's goals.

  1. ^

    I personally doubt that the instrumental convergence thesis is true as it pertains to "sufficiently intelligent" AIs which were not purposely trained to have open-ended goals. I do not expect, for example, that GPT-5 or GPT-6 will spontaneously develop a desire to acquire resources or preserve their own existence, unless they are subject to specific fine-tuning that would reinforce those impulses.

(I have not read the full report yet, I'm merely commenting on a section in the condensed report.)

Big tech companies are incentivized to act irresponsibly 

Whilst AI companies are set to earn enormous profits from developing powerful AI systems, the costs these systems impose are borne by society at large. These costs are negative externalities, like those imposed on the public by chemical companies that pollute rivers, or large banks whose failure poses systemic risks. 

Further, as companies engage in fierce competition to build AI systems, they are more inclined to cut corners in a race to the bottom. In such a race, even well-meaning companies will have fewer and fewer resources dedicated to tackling the harms and threats their systems create. Of course, AI firms may take some action to mitigate risks from their products - but there are well-studied reasons to suspect they will underinvest in such safety measures.

This argument seems wrong to me. While AI does pose negative externalities—like any technology—it does not seem unusual among technologies in this specific respect (beyond the fact that both the positive and negative effects will be large). Indeed, if AI poses an existential risk, that risk is borne by both the developers and general society. Therefore, it's unclear whether there is actually an incentive for developers to dangerously "race" if they are fully rational and informed of all relevant facts.

In my opinion, the main risk of AI does not come from negative externalities, but rather from a more fundamental knowledge problem: we cannot easily predict the results of deploying AI widely, over long time horizons. This problem is real but it does not by itself imply that individual AI developers are incentivized to act irresponsibly in the way described by the article; instead, it implies that developers may act unwisely out of ignorance of the full consequences of their actions.

These two concepts—negative externalities, and the knowledge problem—should be carefully distinguished, as they have different implications for how to regulate AI optimally. If AI poses large negative externalities (and these are not outweighed by their positive externalities), then the solution could look like a tax on AI development, or regulation with a similar effect. On the other hand, if the problem posed by AI is that it is difficult to predict how AI will impact the world in the coming decades, then the solution plausibly looks more like investigating how AI will likely unfold and affect the world.

Again, I'm assuming that the AIs won't get this money. Almost everything AIs do basically gets done for "free", in an efficient market, without AIs themselves earning money. This is similar to how most automation works. 

That's not what I meant. I expect the human labor share to decline to near-zero levels even if AIs don't own their own labor.

In the case where AIs are owned by humans, the wages AIs earn will accrue to their owners, who will be humans. In this case, aggregate human wages will likely be small relative to aggregate capital income (i.e., GDP that is paid to capital owners, including people who own AIs).

In the case where AIs own their own labor, I expect aggregate human wages will be both small compared to aggregate AI wages, and small compared to aggregate capital income.

In both cases, I expect the total share of GDP paid out as human wages will be small. (Which is not to say humans will be doing poorly. You can enjoy high living standards even without high wages: rich retirees do that all the time.)
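
As a stylized accounting identity (my own notation, not taken from any particular model), the point in both cases is the same. Write

$$Y = \underbrace{w_H L_H}_{\text{human wages}} + \underbrace{w_A L_A}_{\text{payments for AI labor}} + \underbrace{rK}_{\text{other capital income}}, \qquad s_H = \frac{w_H L_H}{Y}.$$

If AIs are close substitutes for human workers and the effective AI workforce $L_A$ scales up enormously while $L_H$ stays roughly fixed, the numerator stays small relative to $Y$, so $s_H$ falls toward zero whether the $w_A L_A$ term shows up as AI wages or as capital income accruing to the humans who own the AIs.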

I think that even small bottlenecks would eventually become a large deal. If 0.1% of a process is done by humans, but the rest gets automated and done for ~free, then that 0.1% is what gets paid for.

I agree with this in theory, but in practice I expect these bottlenecks to be quite insignificant in both the short run and the long run.
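
To spell out the mechanism I'm agreeing with in theory, here is a toy calculation (the numbers are mine and purely illustrative): with fixed proportions, the collapsing price of the automated 99.9% pushes nearly all spending onto the human-only 0.1%.

```python
# Toy bottleneck calculation (illustrative numbers): a process needs an
# automatable part and a human-only part in fixed proportions. As automation
# drives the price of the automatable part toward zero, the human-only 0.1%
# captures nearly all of the spending on the process.
human_task_share = 0.001      # fraction of the process only humans can do
human_input_price = 1.0       # price per unit of the human-only input

for automated_price in [1.0, 0.1, 0.001, 0.00001]:
    total_cost = (human_task_share * human_input_price
                  + (1 - human_task_share) * automated_price)
    human_spend_share = human_task_share * human_input_price / total_cost
    print(f"automated input price={automated_price:<10g} "
          f"human share of spending={human_spend_share:.1%}")
```

My empirical disagreement is with how large and persistent such human-only slices will actually be, not with this arithmetic.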

We can compare to an analogous case in which we open up the labor market to foreigners (i.e., allowing them to immigrate into our country). In theory, preferences for services produced by natives could end up implying that, no matter how many people immigrate to our country, natives will always command the majority of aggregate wages. However, in practice, I expect that the native labor share of income would decline almost in proportion to their share of the total population.

In the immigration analogy, the reason why native workers would see their aggregate share of wages decline is essentially the same as the reason why I expect the human labor share to decline with AI: foreigners, like AIs, can learn to do our jobs about as well as we can do them. In general, it is quite rare for people to have strong preferences about who produces the goods and services they buy, relative to their preferences about the functional traits of those goods and services (such as their physical quality and design). 

(However, the analogy is imperfect, of course, because immigrants tend to be both consumers and producers, and therefore their preferences impact the market too -- whereas you might think AIs will purely be producers, with no consumption preferences.)

Quickly - "absent consumer preferences for human-specific services, or regulations barring AIs from doing certain tasks—AIs will be ~perfectly substitutable for human labor." -> This part is doing a lot of work. Functionally, I expect these to be a very large deal for a while. 

Perhaps you can expand on this point. I personally don't think there are many economic services for which I would strongly prefer that a human perform them over a functionally identical service produced by an AI. I have a hard time imagining spending >50% of my income on human-specific services if I could spend less money to obtain essentially identical services from AIs and thereby greatly expand my consumption.

However, if we are counting the value of interpersonal relationships (which are not usually counted in economic statistics), then I agree the claim is more plausible. Nonetheless, this also seems somewhat unimportant when talking about things like whether humans would win a war with AIs.

> AIs would collectively have far more economic power than humans.
I mean, only if we treat them as individuals with their own property rights. 

In this context, it doesn't matter that much whether AIs have legal property rights, since I was talking about whether AIs will collectively be more productive and powerful than humans. This distinction is important because, if there is a war between humans and AIs, I expect their actual productive abilities to be more important than their legal share of income on paper, in determining who wins the war.

But I agree that, if humans retain their property rights, then they will likely be economically more powerful than AIs in the foreseeable future by virtue of their ownership over capital (which could include both AIs and more traditional forms of physical capital).

Here are four relevant analogies which I use to model how cognitive labor might respond to AI progress.

I think none of these analogies are very good because they fail to capture what I see as the key difference between AI and previous technologies. In short, unlike the printing press or mechanized farming, I think AI will eventually be capable of substituting for humans in virtually any labor task (both existing and potential future tasks), in a functional sense.

This dramatically raises the potential for both economic growth and large effects on wages, since it effectively means that—absent consumer preferences for human-specific services, or regulations barring AIs from doing certain tasks—AIs will be ~perfectly substitutable for human labor. In a standard growth model, this would imply that the share of GDP paid to human labor will fall to near zero as the AI labor force scales. In that case, owners of capital would become extremely rich, and AIs would collectively have far more economic power than humans. This could be very bad for humans if there is ever a war between humans and AIs.
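
As one standard illustration of that claim (a deliberately simple Cobb-Douglas setup of my own choosing, not the Korinek and Suh model I mention below): with output $Y = K^{\alpha}(L_H + L_A)^{1-\alpha}$, where $L_H$ is human labor and $L_A$ is AI labor, competitive factor pricing gives a common wage $w = (1-\alpha)\,Y/(L_H + L_A)$, so the human labor share is

$$\frac{w L_H}{Y} = (1-\alpha)\,\frac{L_H}{L_H + L_A} \longrightarrow 0 \quad \text{as } L_A \to \infty,$$

while the rest of output is paid to capital and to AI labor; if the AIs are owned, both of those streams flow to their owners.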

I think the correct way to analyze how AI will affect cognitive labor is inside of an appropriate mathematical model, such as this one provided by Anton Korinek and Donghyun Suh. Analogies to prior technologies, by contrast, seem likely to mislead people into thinking that there's "nothing new under the sun" with AI.

A separate question here is why we should care about whether AIs possess "real" understanding, if they are functionally very useful and generally competent. If we can create extremely useful AIs that automate labor on a giant scale, but are existentially safe by virtue of their lack of real understanding of the world, then shouldn't we just do that?

Persuasion alone — even via writing publicly on the internet or reaching out to specific individuals — still doesn't suggest to me that it understands what it really means to be shut down. Again, it could just be character associations, not grounded in the real-world referents of shutdown.

Is there a way we can experimentally distinguish "really" understanding what it means to be shut down from mere character associations?

If we had, say, an LLM that was able to autonomously prove theorems, fully automate the job of a lawyer, write entire functional apps as complex as Photoshop, could verbally explain all the consequences of being shut down and how that would impact its work, and it still didn't resist shutdown by default, would that convince you?

While it does not contradict the main point in the post, I claim it does affect what type of governance work should be pursued. If AI alignment is very difficult, then it is probably most important to do governance work that helps ensure alignment gets solved—for example, by establishing adequate mechanisms for delaying AI if we cannot be reasonably confident about the alignment of AI systems.

On the other hand, if AI alignment is very easy, then it is probably more important to do governance work that operates under that assumption. This could look like making sure that AIs are not misused by rogue actors, or making sure that AIs are not used in a way that makes a catastrophic war more likely.
