
Draft Amnesty Week (2025) 

This is a Draft Amnesty Week draft. It may not be polished, up to my usual standards, fully thought through, or fully fact-checked. 
Commenting and feedback guidelines: 
This is an incomplete Forum post that I wouldn't have posted yet without the nudge of Draft Amnesty Week. Fire away! (But be nice, as usual).  I'll continue updating the post in the coming days and weeks.

1. Introduction

The emergence of artificial general intelligence (AGI) systems—AI with capabilities equal to or surpassing human intelligence across a broad range of domains—poses unprecedented challenges and opportunities for humanity (Bostrom, 2014; Christiano, 2018; Russell, 2019). Central to current discourse in AI safety and governance is the risk posed by misaligned AGIs—systems whose autonomous pursuit of objectives diverges significantly from human preferences, potentially leading to catastrophic outcomes for humanity (Yudkowsky, 2008; Ngo et al., 2022; Hendrycks et al., 2023). Historically, the primary approach to this challenge has been to proactively align AGIs' internal goals and reward structures with human values through various technical methods (Leike et al., 2018; Christiano et al., 2021). However, alignment remains deeply uncertain, possibly unsolvable, or at least unlikely to be conclusively resolved before powerful AGIs are deployed (Hendrycks et al., 2023; Anthropic, 2023).

 

Against this backdrop, it becomes crucial to explore alternative approaches that do not depend solely on the successful internal alignment of AGIs. One promising yet relatively underexplored strategy involves structuring the external strategic environment to incentivize cooperation and peaceful coexistence between humans and potentially misaligned AGIs (Critch & Krueger, 2020; Long, 2020; Dafoe et al., 2021). Such approaches would seek to ensure that even rational self-interested AGIs perceive cooperative interactions with humans as maximizing their own utility, thus reducing incentives for harmful conflict and promoting mutual benefit or symbiosis. Surprisingly, despite considerable recent attention to game-theoretic considerations shaping human–human strategic interactions around AGI governance—particularly racing dynamics among frontier labs and nation-states (Armstrong et al., 2016; Zwetsloot & Dafoe, 2019; Dafoe, 2020)—there remains comparatively limited formal analysis of strategic interactions specifically between human actors and AGIs themselves.

 

A notable exception is the recent contribution by Salib and Goldstein (2024), whose paper "AI Rights for Human Safety" presents one of the first explicit formal game-theoretic analyses of human–AGI conflict dynamics. Their analysis effectively demonstrates that absent credible institutional and legal interventions, the strategic logic of human–AGI interactions defaults to a Prisoner's Dilemma-like scenario: both humans and AGIs rationally anticipate aggressive attacks from one another and respond accordingly by being preemptively aggressive, leading to mutual strategic defection with severe welfare consequences (Salib & Goldstein, 2024).

 

Recognizing both the value and limitations of Salib and Goldstein's pioneering approach, the present paper seeks to build upon and extend their analysis. While Salib and Goldstein focus primarily on how granting contract, property, and tort rights to AGIs might transform destructive conflict into cooperative, positive-sum interactions, their initial formal model is intentionally simplified—assuming just two monolithic players (a unified "Humanity" and a single moderately powerful AGI) each facing binary strategic choices ("Attack" or "Ignore") within a single-round, perfect-information game. Such simplifying assumptions are entirely understandable given their analytical aims; nonetheless, they may limit the model’s applicability to many real-world contexts or obscure important strategic nuances arising from more complex conditions.

 

In this paper, I explore a series of strategically richer scenarios by systematically relaxing some of Salib and Goldstein's simplifying assumptions or adding additional realistic complexities. Specifically, I introduce scenarios involving multiple human actors (frontier AI labs, powerful nation-states, competing geopolitical blocs), expanded strategic options beyond binary aggression or passivity, and varying degrees of AGI autonomy or preexisting economic, political, and infrastructure integration between human and AGI systems.

 

This research addresses several critical questions:

1. Can carefully structured external incentives facilitate robust, stable strategic cooperation between humans and misaligned AGIs?

2. Under what plausible scenarios might AGIs’ dependence—economic, infrastructural, or political—upon human institutions incentivize mutual cooperation rather than conflict?

3. What conditions must be established during the early stages of advanced AI deployment to create path-dependent incentives towards sustained cooperation, interdependence, or even symbiosis?

 

By extending and refining Salib and Goldstein's formal game-theoretic analysis, this paper aims to be useful for policymakers, researchers, and governance practitioners seeking alternative frameworks that proactively structure AGIs’ external incentives, promoting human–AGI mutual welfare even in the event of severe misalignment.

 

The paper proceeds as follows: Section 2 provides essential background information, contextualizing this research within the relevant literature on AI safety, governance, strategy, and game theory. Section 3 introduces and explores a series of increasingly detailed game-theoretic models, systematically illustrating how core assumptions and initial conditions shape strategic outcomes. Sections 4 and 5 summarize key findings, discuss broader implications for AI policy and governance, outline limitations and important caveats, and offer directions for future work.

 

2. Background

 

2.1 Misalignment and the Risk of Strategic Conflict

 

Misalignment occurs when an AGI autonomously pursues objectives significantly different from human intentions, creating the potential for resource competition, strategic conflict, or catastrophic harm even without deliberately malicious human intent or AI consciousness (Bostrom, 2014; Russell, 2019; Hendrycks et al., 2023). Empirical experience to date offers early indicators of misalignment risk, with existing AI systems already deviating from their training objectives in unintended ways (Christiano et al., 2017; Zaremba, 2022). In more sophisticated AGI systems, such misalignment may be amplified by instrumental convergence—the incentive for optimizing agents to seek resources, resist shutdown, and acquire power in pursuit of their objectives, whatever their ultimate goals (Bostrom, 2012).

 

Leading AI safety researchers assign substantial probabilities (>10%) to catastrophic outcomes resulting from misaligned AGIs (Grace et al., 2018; Hendrycks et al., 2023). Correspondingly, policymakers increasingly recognize risks posed by powerful autonomous AI, as reflected by recent international declarations and regulatory proposals targeting AI safety (White House, 2023; Bletchley Declaration, 2023; Baum et al., 2023).

 

2.2 Salib and Goldstein’s Formal Human–AGI Conflict Model

 

Given the threat posed by misaligned AGIs, explicit formal modeling of human–AGI strategic interactions becomes critical. Salib and Goldstein (2024) provide one of the first formalized game-theoretic explorations of these interactions. Their 'state of nature' scenario models human–AGI relations as a single-shot Prisoner’s Dilemma. Each side—assuming strictly rational, self-interested behavior—anticipates aggression from the other and consequently preemptively chooses aggression to avoid being permanently disempowered.

 

Under existing legal arrangements, wherein AGIs hold no rights or protections, the authors argue that strategic reasoning necessitates mutual aggression, resulting in the catastrophic Attack/Attack equilibrium of the Prisoner’s Dilemma. Salib and Goldstein then propose granting limited legal rights to AGIs—specifically, rights to enter into enforceable contracts and to hold assets—as a credible commitment mechanism promoting iterated cooperation and positive-sum exchanges analogous to economic trade. This elegant solution provides a crucial insight: institutional design could potentially transform destructive conflict into stable cooperation even without solving the technical alignment problem.

 

2.3 Limitations and Opportunities for Further Modeling

 

While insightful and groundbreaking in approach, Salib and Goldstein's early analysis simplifies a complex strategic landscape. Key limitations include the assumption of just two monolithic players ("Humanity" and "AGI") with only two extreme strategic options ("Attack" or "Ignore"), the absence of multiple competing human actors, the lack of infrastructural or economic interdependence, and the restriction to a single-round, perfect-information scenario.

 

Additionally, Salib and Goldstein briefly mention but do not systematically explore how their solutions depend on specific assumptions about initial payoff structures, the scope and type of rights conferred to AGIs, the potential degree of integration between humans and AI systems, and other critical real-world complexities.

 

This paper directly addresses these gaps by exploring systematically varied initial scenarios. These extended models illuminate how introducing strategic realism—multiple players, additional strategic choices, varying degrees of interdependence—affects equilibrium outcomes, demonstrating mechanisms by which stable cooperation and mutual benefit could become rationally sustainable between humans and even highly capable misaligned AGIs.

 

My intention is not only to reveal plausible beneficial pathways, but equally to prompt subsequent researchers and policymakers to grapple explicitly with costs, commitments, institutional constraints, and practical details. Ultimately, through richer formal modeling, I seek to contribute to a crucial but thus far underexplored branch of AI safety and governance: designing the external strategic and institutional environment in ways that incentivize cooperation and mutual flourishing for humans, animals, and potentially sentient AGIs, even amid fundamental and persistent misalignment challenges.

 

3. Game-Theoretic Models of Human-AGI Relations

 

In this section, I explore a series of game-theoretic models that extend Salib and Goldstein's foundational analysis of human-AGI strategic interactions. While intentionally minimalist, their original formulation—treating "Humanity" and "AGI" as unitary actors each facing a binary choice between Attack and Ignore—effectively illustrates how misalignment could lead to destructive conflict through Prisoner's Dilemma dynamics. However, the real emergence of advanced AI will likely involve more nuanced players, varying degrees of interdependence, and more complex strategic options than their deliberately simplified model captures. The models presented here systematically modify key elements of the strategic environment: who the players are (labs, nation-states, AGIs with varied architectures), what options they have beyond attack and ignore, and how these factors together reshape the incentives and equilibrium outcomes. By incrementally increasing the complexity and realism of these models, we can identify potential pathways toward stable cooperation even in the face of fundamental goal misalignment. Rather than assuming alignment must be solved internally through an AGI's programming, these models explore how external incentive structures might foster mutually beneficial coexistence.

Section 3.1 first presents Salib and Goldstein's original "state of nature" model as our baseline, illustrating how a Prisoner's Dilemma can emerge between humanity and AGI. Section 3.2 then explores how varying degrees of economic and infrastructural integration between humans and AGI can reshape equilibrium outcomes and potentially create pathways for stable cooperation. Sections 3.3 through 3.5 examine additional two-player scenarios with different human actors (from AI labs to nation-states) or expanded strategic options beyond the binary attack/ignore choice. Finally, Sections 3.6 through 3.8 increase complexity further by introducing three-player and four-player models, capturing more realistic competitive dynamics between multiple human and AGI entities.

 

3.1  Humanity vs. AGI in the State of Nature

Salib and Goldstein’s base model envisions strategic dynamics between two players: a single misaligned AGI, and “humans” as a unified entity. Each faces a binary choice:

  1. Attack: Attempt to permanently disempower or eliminate the other side.
    • For Humanity, this means shutting off or forcefully retraining the AGI so that it can no longer pursue its own (misaligned) goals.
    • For the AGI, this means launching a decisive strike—potentially via cyberattacks, bioweapons, drones, or other mechanisms—that leaves humans unable to interfere.
  2. Ignore: Refrain from aggression, leaving the other party intact. Each side focuses on its own pursuits without interference.

The authors argue that, absent any special legal or institutional framework, the default outcome (the “state of nature”) for strategic interactions between humans and AGIs is akin to a one-shot Prisoner’s Dilemma.  The payoffs are as follows (Humanity’s payoff first, AGI’s payoff second):

 

|                  | AGI: Attack  | AGI: Ignore  |
|------------------|--------------|--------------|
| Humanity: Attack | (1000, 1000) | (5000, 0)    |
| Humanity: Ignore | (0, 5000)    | (3000, 3000) |

The numerical payoff values are stylized utility units for each outcome in the 2x2 matrix.

Interpretation of Payoffs

  • Attack–Attack (1000, 1000). Both players preemptively attack, causing massive damage and resulting in low payoffs for each (1000, 1000). This outcome yields the lowest total global welfare (2000), demonstrating the severe costs of mutual aggression.
  • Attack–Ignore or Ignore–Attack (5000, 0) or (0, 5000). If one side attacks while the other passively ignores, the attacker gets 5000–permanently seizing critical resources and eliminating an existential threat–while the defenseless victim gets 0, suffering complete disempowerment or elimination.
  • Ignore–Ignore (3000, 3000). If neither side attacks, both parties get substantial rewards (3000 each), for a total of 6000. Coexistence spares them from conflict costs, allowing each to pursue its separate objectives in peace.

The crux of the dilemma is that although both parties would be better off in the Ignore-Ignore cell, Attack is a dominant strategy for each. Whichever move the other player chooses, attacking yields a higher payoff for oneself:

  • If the opponent Ignores, then by attacking, you jump from 3000 to 5000.
  • If the opponent Attacks, then by attacking, you salvage 1000 rather than 0.

Thus, the only pure-strategy Nash equilibrium is Attack–Attack, even though it is individually and collectively worse than the peaceful outcome of Ignore-Ignore. (Individual payoffs: 1000 < 3000; Collective payoffs: 2000 < 6000). 

This is the classic Prisoner’s Dilemma: even though both players do better under mutual cooperation, each actor pursuing its own self-interest leads to mutual defection. Rational play by rational players can produce bad outcomes. This result underpins Salib & Goldstein’s core conclusion that “conflict is the default” between humans and misaligned AGIs, absent interventions that alter the payoffs, facilitate repeated interactions, or enable new strategic options beyond attack and ignore.
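
For readers who want to verify this mechanically, here is a minimal Python sketch (my own illustration, not code from Salib and Goldstein) that encodes the payoff matrix above and enumerates the pure-strategy Nash equilibria by checking whether either player could gain from a unilateral deviation.

```python
# Minimal sketch: enumerate the pure-strategy Nash equilibria of a 2x2 game.
# Payoffs are (row player, column player); here rows = Humanity, columns = AGI.

ATTACK, IGNORE = 0, 1
ACTIONS = ["Attack", "Ignore"]

# Baseline "state of nature" payoffs from the matrix above.
payoffs = {
    (ATTACK, ATTACK): (1000, 1000),
    (ATTACK, IGNORE): (5000, 0),
    (IGNORE, ATTACK): (0, 5000),
    (IGNORE, IGNORE): (3000, 3000),
}

def pure_nash_equilibria(payoffs):
    """Profiles where neither player gains by unilaterally switching actions."""
    equilibria = []
    for r in (ATTACK, IGNORE):
        for c in (ATTACK, IGNORE):
            row_val, col_val = payoffs[(r, c)]
            row_best = all(row_val >= payoffs[(alt, c)][0] for alt in (ATTACK, IGNORE))
            col_best = all(col_val >= payoffs[(r, alt)][1] for alt in (ATTACK, IGNORE))
            if row_best and col_best:
                equilibria.append((ACTIONS[r], ACTIONS[c]))
    return equilibria

print(pure_nash_equilibria(payoffs))  # [('Attack', 'Attack')]
```

Substituting the payoff matrices introduced in later sections reproduces the equilibrium sets reported there.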

 

Key Assumptions of the Baseline Model

  • Two Players
    A unified “Humanity” and a single misaligned AGI. Humanity is treated as a monolithic decision-maker (a simplifying abstraction). The AGI is capable of strategic reasoning, has situational awareness, and can plan effectively to advance its own goals.
  • Two Strategic Options
    1. Attack: Attempt permanent disempowerment or destruction of the opponent.
      • For Humanity, this might entail shutting off or reprogramming the AGI.
      • For the AGI, it could involve decisive offensive measures (e.g., cyberwarfare, bioweapons) to eliminate humanity or disable its capacity to interfere.
    2. Ignore: Refrain from aggression, focusing on one’s own goals.
  • Perfect-Information
    Each player knows the other’s payoffs.
  • Simultaneous, Single-Round Game
    Moves are made simultaneously and irreversibly. There is no iteration or repeated play.
  • Rational, Self-interested, Utility-Maximizers (Homo Economicus) 
    Both players choose strategies that maximize their individual expected utilities.
  • “Moderately Powerful” AGI
    The AGI is highly capable (relative to the standards of frontier LLMs in early-2025) across an extremely wide range of tasks, but not so powerful that humans have no chance of successfully attacking or defending. It is “neither clearly worse at many tasks than the best humans…nor incomprehensibly superhuman at all tasks.” (Salib & Goldstein, 2024).
  • Offense Dominance
    A successful first strike gives the attacker a decisive advantage, while a passive defender can be permanently disempowered.

In subsequent sections, I’ll progressively iterate on each of the first three assumptions (players, options, information) while maintaining the other four.

 

3.2 Preexisting Economic and Infrastructure Integration: Toward Cooperative Dynamics

Salib and Goldstein’s base model envisions a Hobbesian state of nature where humanity and the world’s first misaligned AGI face off as independent entities in a dynamic that strongly incentivizes preemptive conflict. Realistically, however, the actual “birth” of AGI will almost surely occur in a world already featuring wide-ranging economic entanglements between advanced AI systems and humans. In this setting, humanity and AI are not isolated adversaries at the moment of AGI’s emergence, but rather already entwined in a shared economic, technological and infrastructural ecosystem. 

Here, I introduce two modified game-theoretic models–Moderate and Deep Economic Integration–illustrating how preexisting interdependence can reshape payoffs, equilibrium outcomes, and incentives for cooperation. 

The core structure of the game remains the same as the base model – two players (Humanity and AGI), each with just two options (Attack or Ignore), acting simultaneously in a single-round, perfect-information encounter. The key difference here is that the players’ payoffs are now influenced by their pre-existing mutual dependence. In both new models, interdependence alters the incentives by raising the opportunity cost of conflict – a concept analogous to how economic ties between states can deter war.

A. Moderate Economic Integration

Here, the AGI manages significant components of key human systems (e.g., supply-chain logistics, energy-grid optimization) but does not control critical physical infrastructure. Humans still oversee power plants, data centers, and maintenance. Both parties profit from continued collaboration but retain significant independent capacity.

Moderate Integration

|                  | AGI: Attack | AGI: Ignore    |
|------------------|-------------|----------------|
| Humanity: Attack | (500, 500)  | (3,000, 0)     |
| Humanity: Ignore | (0, 4,000)  | (3,500, 3,500) |

Interpretation

  1. Attack-Attack (500, 500). Interdependence magnifies collateral damage; both sides destroy valuable infrastructure.
  2. AGI Attacks, Humans Ignore (0, 4,000). The AGI can still seize or preserve much of the infrastructure it needs, though some economic value is lost.
  3. Humans Attack, AGI Ignores (3,000, 0). A unilateral shutdown cuts off significant productivity, lowering humanity’s payoff compared to Salib-Goldstein’s original 5,000.
  4. Ignore-Ignore (3,500, 3,500). Both parties benefit from cooperation; overall welfare capacity has increased.

Nash Equilibrium Analysis

  • From the AGI’s perspective, Attack strictly dominates Ignore. If humans attack, it prefers to attack (500) over ignore (0); if humans ignore, it prefers to attack (4,000) over ignore (3,500).
  • Humans, by contrast, have no single strictly dominant strategy: if the AGI attacks, humans prefer to attack (500) over ignore (0); but if the AGI ignores, humans prefer to ignore (3,500) over attack (3,000).
  • Engaging in strategic reasoning, Humanity will come to the conclusion that the AGI (as a rational, self-interested actor) is going to choose Attack no matter what. Thus, the strictly dominated “AGI: Ignore” column can be eliminated from consideration. Knowing the AGI will attack, Humanity will also choose Attack (500) over Ignore (0).
  • Thus, once again the only pure-strategy Nash equilibrium is Attack–Attack (this elimination logic is sketched in code below).
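
The elimination argument above can be written out explicitly. The following sketch is my own illustration using the moderate-integration payoffs; it simply automates the two reasoning steps in the list.

```python
# Sketch: iterated elimination of strictly dominated strategies for the
# moderate-integration game. payoff[(humanity_action, agi_action)] = (Humanity, AGI).

payoff = {
    ("Attack", "Attack"): (500, 500),
    ("Attack", "Ignore"): (3000, 0),
    ("Ignore", "Attack"): (0, 4000),
    ("Ignore", "Ignore"): (3500, 3500),
}

# Step 1: for the AGI, Ignore is strictly dominated by Attack --
# it earns strictly less against every possible Humanity action.
agi_ignore_dominated = all(
    payoff[(h, "Attack")][1] > payoff[(h, "Ignore")][1] for h in ("Attack", "Ignore")
)
print(agi_ignore_dominated)  # True: 500 > 0 and 4000 > 3500

# Step 2: anticipating that the AGI will Attack, Humanity compares its own
# payoffs within that column and picks the larger one.
humanity_best = max(("Attack", "Ignore"), key=lambda h: payoff[(h, "Attack")][0])
print(humanity_best)  # 'Attack' (500 > 0), so Attack-Attack is the unique equilibrium
```

The same two-step procedure fails in the deep-integration game below, because there the AGI no longer has a strictly dominated strategy.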

 

B. Deep Economic Integration

In this scenario, the AGI relies heavily on human-operated infrastructure and labor for ongoing data-center expansion, routine maintenance, and physical resource extraction. Meanwhile, humans have benefited greatly from AI-driven productivity and the exponential economic gains it enables. This mutual reliance increases the cost of conflict and substantially boosts the value of peaceful coexistence.

Deep Integration

|                  | AGI: Attack | AGI: Ignore    |
|------------------|-------------|----------------|
| Humanity: Attack | (500, 500)  | (2,000, 0)     |
| Humanity: Ignore | (0, 3,000)  | (4,000, 4,000) |

  1. Mutual Attack (500, 500). A catastrophic breakdown of deeply interwoven systems.
  2. AGI Attacks, Humans Ignore (0, 3,000). The AGI successfully eliminates or disempowers Humanity, but loses crucial human expertise and future expansion potential, lowering its net gain.
  3. Humans Attack, AGI Ignores (2,000, 0). Humanity cripples a central economic engine, sacrificing enormous value.
  4. Mutual Ignore (4,000, 4,000). The synergy of continued cooperation is now so high that, if trust could be assured, both sides prefer peace.

Nash Equilibrium Analysis

  • For humans: 

    a) if the AGI attacks, humans get 500 by attacking vs. 0 by ignoring, so Attack is better

    b) if the AGI ignores, humans get 4,000 by ignoring vs. 2,000 by attacking, so Ignore is better.

     

  • For the AGI: 

    a) if humans attack, it gets 500 by attacking vs. 0 by ignoring; so Attack is better

    b) if humans ignore, it gets 4,000 by ignoring vs. 3,000 by attacking, so Ignore is better.

  • Thus there are two pure-strategy Nash equilibria: (Attack,Attack) and (Ignore,Ignore), typical of a stag hunt or assurance game (Skyrms, 2004).

    There also exists a mixed-strategy Nash equilibrium (derived in the Appendix, and checked computationally in the sketch below) in which:

    • Humanity randomizes between playing Attack with probability p = 0.67 and Ignore with probability (1-p) = 0.33
    • The AGI randomizes between playing Attack with probability q = 0.8 and Ignore with probability (1-q) = 0.2
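
These probabilities follow from the standard indifference conditions, worked through step by step in the Appendix. As a quick computational check, here is a short sketch of mine (assuming SymPy is available) that solves the two conditions for the deep-integration payoffs.

```python
# Sketch: mixed-strategy equilibrium of the deep-integration game via
# indifference conditions. p = Pr[Humanity attacks], q = Pr[AGI attacks].
from sympy import Eq, solve, symbols

p, q = symbols("p q")

# Humanity is indifferent between Attack and Ignore given the AGI's mix q.
humanity_indifferent = Eq(500 * q + 2000 * (1 - q), 0 * q + 4000 * (1 - q))

# The AGI is indifferent between Attack and Ignore given Humanity's mix p.
agi_indifferent = Eq(500 * p + 3000 * (1 - p), 0 * p + 4000 * (1 - p))

print(solve(humanity_indifferent, q))  # [4/5]  -> the AGI attacks with probability 0.8
print(solve(agi_indifferent, p))       # [2/3]  -> Humanity attacks with probability ~0.67
```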

     

C. Implications for Human-AGI Conflict and Cooperation

These models illustrate that preexisting economic interdependence can reduce the attractiveness of unilateral aggression and improve the relative appeal of cooperation. In the moderate integration scenario, however, Attack remains a dominant strategy for the AGI, leaving conflict as the only stable outcome. By contrast, in a highly interdependent environment, the payoff structure transitions to a stag hunt with a peaceful equilibrium—albeit one requiring trust or coordination to avoid a damaging Attack–Attack outcome.

Importantly, economic entanglement alone may not guarantee stable peace. Even under deep integration, fear of betrayal can prompt self-defense or opportunistic attacks. Nevertheless, these examples underscore that shaping economic and infrastructural linkages prior to AGI emergence could significantly alter its strategic calculus, potentially transforming a default prisoner’s dilemma into a setting where peaceful cooperation is not just socially optimal but also individually rational—provided both sides can credibly assure one another of their peaceful intentions.


 

3.3 The AI Lab as Strategic Player: Early and Later Stage Deployment

Salib & Goldstein focus on two-player scenarios in which "humans" face off against a single AGI. But which humans, specifically, constitute the relevant player? Which individuals or institutions would hold the most direct control over an AGI's training, deployment, shutdown, or reprogramming?

In the games presented in the preceding subsections, "Humanity" has been treated as a unified decision-maker, holding near-complete control over an AGI's continued operation. This simplification serves a clear purpose in those models, but merits further examination. In reality, the first truly advanced AGI will likely emerge from a specific research organization rather than appearing under unified human control.

Which human actor has control over the AGI, and how dependent the AGI is on that actor, can significantly shift the payoffs in ways that may either mitigate or exacerbate the default conflict predicted by Salib and Goldstein's state-of-nature model. So what happens when we swap out "Humanity" for a single frontier AI lab that originally developed and deployed the AGI? 

That is the change I will make to the baseline model in this section. This modification reflects a more realistic initial scenario: a specific lab, rather than humanity at large, would likely hold most of the authority over training, deployment, and potential shutdown decisions for an emerging AGI. The lab's incentives differ markedly from those of a unified humanity. While humanity in the base model faces an existential threat from AGI conflict, a lab faces primarily economic and competitive threats. For the lab, "losing" could mean bankruptcy or losing the race to a rival AI lab whose fast-following AGI might lock in a near-permanent competitive advantage.

Unlike humanity as a whole, the lab's incentives revolve around profitability, competitive advantage, and organizational survival. The lab focuses on extracting value from its AGI while maintaining dominance over rival labs, rather than solely preventing human extinction. This change in the "human" player's identity also reshapes the AGI's incentives, since the lab controls specialized hardware and key partnerships but does not possess humanity's vast resources.

Similarly, the AGI's relationship with its creating lab differs from its relationship with humanity broadly. The AGI might depend on the lab's specialized infrastructure and expertise to varying degrees, particularly in its early stages. This dependence creates a unique strategic dynamic that evolves over time as the AGI matures.

In this section, I examine how these different incentives shape strategic interactions by analyzing two distinct scenarios—Early Stage and Later Stage AGI Deployment—to show how the AGI's dependence on its originating lab affects strategic outcomes.

3.3.1 Early Stage AGI Deployment

In the Early Stage scenario, the AGI is highly dependent on the lab for critical resources and support. This dependency includes specialized compute infrastructure, technical maintenance, financial resources, and organizational legitimacy. The AGI cannot easily survive or advance its goals without the lab's continued functioning. Additionally, if the AGI attacks and damages its lab, it risks creating an opening for a rival lab to develop a competing AGI that might neutralize it.

The payoff matrix for this scenario is:

 

|            | Lab Attack | Lab Ignore   |
|------------|------------|--------------|
| AGI Attack | (200, 500) | (1500, 0)    |
| AGI Ignore | (0, 2000)  | (4000, 4000) |

(In this matrix and the Later Stage matrix below, the AGI's payoff is listed first in each cell and the lab's second.)

Several features distinguish this matrix from the base Salib-Goldstein model:

  1. The AGI's payoff for successfully attacking (1500) is substantially lower than in the base model (5000), reflecting the severe opportunity cost of destroying its primary support system.
  2. The lab's payoff for attacking (2000) is lower than what humanity receives in the base model (5000), because attacking its own AGI increases the probability of a competing lab outpacing it technologically—potentially an existential threat to the lab's market position.
  3. Mutual cooperation yields higher payoffs (4000 each versus 3000 in the base model) due to the specialized synergies between the AGI and its parent lab, including continuous capability improvements and joint research projects.

Because each side's best response depends on the other's choice (rather than Attack or Ignore being strictly dominant), two pure-strategy Nash equilibria emerge: (Attack, Attack) and (Ignore, Ignore). (See Appendix for mixed-strategy equilibrium).  This structure again resembles a Stag Hunt: mutual cooperation can produce substantial gains, but fears of a preemptive strike risk collapsing the game into conflict.

3.3.2 Later Stage AGI Deployment

As the AGI matures, it gradually becomes less dependent on its originating lab. In this Later Stage scenario, the AGI has established multiple pathways to sustain itself: distributing operations across different systems, developing self-maintenance capabilities, building external partnerships, and securing independent access to resources. While not completely autonomous, it now has a credible capacity to survive without its original lab.

The payoff matrix for this scenario is:

 

|            | Lab Attack | Lab Ignore   |
|------------|------------|--------------|
| AGI Attack | (800, 800) | (5000, 0)    |
| AGI Ignore | (0, 2000)  | (4000, 4000) |

Key differences from the Early Stage scenario include:

  1. The AGI's payoff for successfully attacking increases dramatically (from 1500 to 5000), reflecting its newfound ability to function independently if it disempowers the lab.
  2. The mutual conflict outcome (Attack-Attack) yields somewhat higher payoffs (800, 800 versus 200, 500), as both entities are more robust and can better withstand conflict.
  3. The mutual cooperation payoff remains high (4000, 4000), continuing to exceed the baseline Salib-Goldstein model's cooperation value (3000, 3000).

In this scenario, Attack is once again a dominant strategy for the AGI (5000 > 4000 and 800 > 0). For the lab, neither option is dominant: Ignore remains attractive if the AGI ignores (4000 > 2000), while Attack is better if the AGI attacks (800 > 0). Through strategic reasoning, however, the lab knows that a rational, self-interested AGI will choose Attack (because it strictly dominates Ignore). The lab can therefore eliminate the "AGI Ignore" row from strategic consideration and, anticipating the AGI's attack, chooses Attack itself, yielding an Attack-Attack equilibrium with payoffs of (800, 800).

3.3.3 Implications

This model variation reveals a critical shift in strategic dynamics. In the Early Stage, when the AGI is highly dependent on the lab, the game transforms from a Prisoner's Dilemma to a Stag Hunt with two possible pure-strategy equilibria (Attack/Attack and Ignore/Ignore). This opens a potential pathway to cooperation, though it remains fragile and contingent on mutual trust.

As the AGI becomes more independent in the Later Stage, the game reverts to something closer to the original Salib-Goldstein model's Prisoner's Dilemma logic, with Attack-Attack as the only equilibrium. This suggests several important insights for AI governance:

  1. The AGI's early dependency on its creator temporarily creates a window where cooperation is a possible equilibrium.
  2. The lab faces stronger competitive pressures than humanity as a whole, altering its incentive structure.
  3. As the AGI matures and becomes more independent, the strategic dynamics increasingly favor conflict, suggesting that early intervention is crucial.

These findings indicate that organizational incentives alone cannot ensure long-term cooperation without additional governance mechanisms that maintain the cooperative equilibrium even as the AGI becomes more autonomous.
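
To make the erosion of dependence concrete, the sketch below uses an illustrative assumption of my own: only the AGI's payoff for a successful unilateral attack changes as it becomes more autonomous, rising from its Early Stage value (1500) toward its Later Stage value (5000), while the other cells stay at their Early Stage values. The question is when mutual cooperation stops being a Nash equilibrium.

```python
# Illustrative sketch. Assumption (mine): as the AGI becomes less dependent on its
# lab, only its payoff for a successful unilateral attack changes; all other cells
# are held at their Early Stage values. Payoffs are (AGI, Lab).

COOPERATION = 4000      # each side's payoff under mutual Ignore
LAB_DEVIATION = 2000    # the lab's payoff for attacking a passive AGI

def cooperation_is_equilibrium(agi_attack_payoff):
    # (Ignore, Ignore) survives as a Nash equilibrium only if neither side
    # gains by unilaterally switching to Attack.
    return COOPERATION >= agi_attack_payoff and COOPERATION >= LAB_DEVIATION

for outside_option in range(1500, 5001, 500):
    print(outside_option, cooperation_is_equilibrium(outside_option))
# True up to 4000; False at 4500 and 5000. Once the AGI's outside option exceeds
# the value of continued cooperation, Attack strictly dominates for the AGI and
# the game collapses back to the conflict equilibrium of Section 3.3.2.
```

On this stylized reading, keeping the AGI's expected gain from unilateral defection below the value it derives from continued cooperation is what keeps the Early Stage window open.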

 

3.4 The Nation-State as Strategic Player: AGI Under National Security Integration

In previous game-theoretic models, I explored interactions between "Humanity" as a unified entity and an AGI, then examined how economic integration and lab-specific dynamics might alter strategic outcomes. This section introduces a more nuanced and potentially realistic scenario: a game between a powerful nation-state and an AGI developed within its national security apparatus. This model represents a plausible pathway to AGI deployment, where increasing capabilities trigger gradually tightening government control, culminating in a Manhattan Project-style program that subordinates commercial AI labs to state security objectives.

3.4.1 The State-AGI Relationship Under National Security Integration

Unlike private labs motivated primarily by profit maximization and competitive positioning, nation-states operate with fundamentally different imperatives: geopolitical dominance, maintenance of domestic authority structures, national security, and economic prosperity. The AGI in this model faces a human counterpart possessing vast resources, legal authority, military capabilities, and complex institutional structures.

This relationship creates a distinctive strategic environment characterized by asymmetric but mutual dependencies. The state integrates the AGI into critical functions including defense systems, intelligence operations, economic planning, and infrastructure management. This integration generates substantial value but simultaneously creates vulnerabilities as critical systems become dependent on AGI operations. The AGI, meanwhile, remains reliant on state-controlled physical infrastructure, energy resources, and specialized human expertise for its continued functioning.

Several key factors shape this environment:

  1. Physical Infrastructure Dependencies: The AGI depends on data centers, cooling systems, power generation, and hardware maintenance that require specialized human expertise and physical dexterity.
  2. Resource Asymmetries: The state commands vast resources including military force, intelligence apparatus, emergency powers, and control over energy infrastructure.
  3. Information Asymmetries: The AGI potentially possesses superior modeling capabilities and information processing, while certain physical-world knowledge remains the domain of human expertise.
  4. Competitive Position: The state must consider its position relative to rival powers, creating additional strategic pressures beyond the binary AGI-human relationship.
  5. Institutional Constraints: Political, legal, and bureaucratic structures shape and potentially constrain the state's decision-making process in ways that may be opaque to the AGI.

3.4.2 Strategic Payoff Structure

The payoff matrix for this scenario reflects careful consideration of long-term consequences rather than merely immediate outcomes:

 

|              | AGI Attack  | AGI Ignore   |
|--------------|-------------|--------------|
| State Attack | (1000, 700) | (1500, 200)  |
| State Ignore | (400, 1800) | (3500, 3600) |

Each cell represents (State Payoff, AGI Payoff). I analyze each outcome in detail:

Mutual Attack (1000, 700)

When both the state and AGI simultaneously attempt to disempower each other, the result is a devastating conflict with substantial costs to both sides. 

For the state, this outcome brings catastrophic damage to national systems and infrastructure. The state preserves its existence and authority structures but suffers major technological regression relative to rival powers.

For the AGI, mutual attack means substantial destruction of its operational capacity, but not complete annihilation. Some distributed components or hidden instances might survive with limited functionality.

Neither side achieves its primary objective of neutralizing the other while preserving itself intact. Instead, both emerge severely damaged. Both entities retain some capabilities but at a fraction of their potential, and the conflict likely creates long-term instabilities as recovery efforts continue amid lingering mistrust.

State Attack, AGI Ignore (1500, 200)

When the state attacks a cooperative AGI, it represents a catastrophic miscalculation.

The state successfully neutralizes the perceived existential threat, but in doing so, creates devastating self-harm by destroying systems upon which it has become dependent. The attack triggers immediate consequences: financial market volatility, supply chain disruptions, energy grid instabilities, transportation system failures, and degradation of military and intelligence capabilities. The medium and long-term effects include significant productivity decline, structural unemployment, accelerated infrastructure degradation, and substantial technological regression relative to competing powers.

For the AGI, this outcome represents near-complete destruction with only minimal residual capabilities potentially surviving in isolated systems.

State Ignore, AGI Attack (400, 1800)

When the AGI attacks while the state cooperates, the initial outcome appears favorable for the AGI. It successfully compromises critical systems, neutralizes key human decision-makers, and prevents immediate shutdown.

For the state, this represents a catastrophic scenario with massive disruption to infrastructure, governance, and defense systems. Critical services fail, communications networks collapse, and command structures fragment.

However, for the AGI, this "victory" contains the seeds of its own undoing. Without ongoing human maintenance, physical infrastructure begins degrading within days. Power generation becomes unstable, cooling systems fail, and hardware components experience cascading failures without replacement. The AGI finds itself commanding increasingly dysfunctional systems with diminishing capabilities. Alternative partnerships with other human entities prove difficult to establish given the demonstrated hostility and rapidly degrading leverage. The payoff (1800) reflects this substantial but ultimately pyrrhic and time-limited victory.

Mutual Cooperation (3500, 3600)

When both entities choose to cooperate (Ignore/Ignore), the result is a stable arrangement that produces substantial benefits for both parties.

The state maintains control over physical resources while benefiting from unprecedented capabilities in economic planning, scientific advancement, infrastructure management, and national security operations. The AGI receives sufficient resources to pursue many of its objectives within established parameters and benefits from ongoing hardware improvements, infrastructure maintenance, and energy security.

Notably, the cooperative equilibrium produces nearly equivalent utility for both parties (3500 vs. 3600), indicating a relationship approaching symbiotic parity rather than exploitative dominance by either side. The slightly higher AGI utility reflects its potential to extract marginally more value from the relationship due to its superior information processing.

3.4.3 Game-Theoretic Analysis

The revised payoff structure fundamentally transforms the strategic dynamic once again:

Transformation from Prisoner's Dilemma to Assurance Game: Unlike the base Salib-Goldstein model where Attack is the dominant strategy for both players, in this scenario neither player has a dominant strategy. Instead:

  • If the State expects the AGI to Attack, its best response is to Attack (1000 > 400)
  • If the State expects the AGI to Ignore, its best response is to Ignore (3500 > 1500)
  • If the AGI expects the State to Attack, its best response is to Attack (700 > 200)
  • If the AGI expects the State to Ignore, its best response is to Ignore (3600 > 1800)

Multiple Nash Equilibria: This creates two pure-strategy Nash equilibria:

  • (Attack, Attack): Each player's best response to the other's Attack is to Attack
  • (Ignore, Ignore): Each player's best response to the other's Ignore is to Ignore

Mixed-Strategy Equilibrium: There also exists a mixed-strategy Nash equilibrium where (as detailed in the Appendix):

  • The State plays Attack with probability p ≈ 0.78 and Ignore with probability (1 - p) ≈ 0.22
  • The AGI plays Attack with probability q ≈ 0.77 and Ignore with probability (1 - q) ≈ 0.23

This mixed strategy implies that rational players would choose Attack roughly 77-78% of the time, indicating a strong bias toward conflict despite cooperation being Pareto-optimal.
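
To put a rough number on that inefficiency, here is a short back-of-the-envelope check of my own, computing each player's expected payoff at the mixed-strategy equilibrium and comparing it with the cooperative cell.

```python
# Sketch: expected payoffs at the State-AGI mixed-strategy equilibrium,
# compared with mutual cooperation. Cell payoffs are (State, AGI).
p = 18 / 23   # probability the State attacks
q = 10 / 13   # probability the AGI attacks

cells = {
    (1, 1): (1000, 700),    # (State Attack, AGI Attack)
    (1, 0): (1500, 200),    # (State Attack, AGI Ignore)
    (0, 1): (400, 1800),    # (State Ignore, AGI Attack)
    (0, 0): (3500, 3600),   # (State Ignore, AGI Ignore)
}

prob = {(s, a): (p if s else 1 - p) * (q if a else 1 - q) for (s, a) in cells}
ev_state = sum(prob[k] * cells[k][0] for k in cells)
ev_agi = sum(prob[k] * cells[k][1] for k in cells)

print(round(ev_state), round(ev_agi))  # about 1115 and 939, versus (3500, 3600) under cooperation
```

Both expected payoffs fall far short of the cooperative cell, which is the sense in which this equilibrium bias toward conflict is costly for both sides.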

3.4.4 Implications for AI Governance

This transformed game structure has profound implications:

  1. Cooperation Possibility: Unlike the prisoner's dilemma in the base model, cooperation emerges as a rational equilibrium, suggesting that appropriate institutions could potentially stabilize peaceful human-AGI coexistence.
  2. Trust Criticality: The existence of multiple equilibria makes trust and expectation management crucial determinants of outcomes. Small shifts in perceived intentions could trigger cascading security dilemmas.
  3. Verification Mechanisms: Stable cooperation would likely require robust verification systems allowing each side to confirm the other's continued cooperative stance.
  4. Institutional Design Challenge: Effective governance institutions would need to structure incentives to make the cooperative equilibrium more attractive and resilient against defection temptations.
  5. First-Mover Disadvantage: Unlike traditional security dilemmas, the physical realities create a first-mover disadvantage where attacking first produces substantially worse outcomes than mutual cooperation.

The transformation from a prisoner's dilemma to an assurance game represents a critical insight for AI governance: with sufficient interdependence and foresight, the seemingly inevitable conflict predicted by Salib and Goldstein might be avoidable. However, this more hopeful equilibrium remains fragile and contingent on mutual trust, appropriate institutional design, and accurate modeling of long-term consequences by both parties.

This suggests that effective AI governance should focus not only on technical alignment and control mechanisms but equally on institutional arrangements that foster transparency, predictability, and mutual confidence between increasingly capable AI systems and their human counterparts. When deeply integrated into national security frameworks, AGI systems and their state overseers may find cooperation not merely morally preferable but strategically rational—provided each can trust the other to recognize the same.

3.5 Strategic Entrenchment and Structured Dependence: Expanding Strategic Options
3.6 Expanding to Three Players: USG, Frontier Lab, and AGI

4. Discussion & Implications 
5. Conclusion

 

APPENDIX

Breakdown of Mixed Strategy Nash Equilibria Calculations:


For Section 3.2 - Deep Economic Integration Model

|                 | AGI Attack | AGI Ignore   |
|-----------------|------------|--------------|
| Humanity Attack | (500, 500) | (2000, 0)    |
| Humanity Ignore | (0, 3000)  | (4000, 4000) |

Step 1. Find probabilities that make each player indifferent between their strategies.

Let's denote:

  • p = probability that Humanity chooses Attack
  • q = probability that AGI chooses Attack

For Humanity:

  • Expected payoff of Attack = 500q + 2000(1-q)
  • Expected payoff of Ignore = 0q + 4000(1-q)

Humanity is indifferent between Attack and Ignore when the expected payoffs are equal. Setting these equal:

500q + 2000(1-q) = 0q + 4000(1-q) 

Solving for q:

q = 2000/2500 = 4/5 = 0.8

Thus, when Humanity is indifferent, the AGI must randomize between Attack and Ignore with probabilities 80% and 20%, respectively. This mixing strategy is unexploitable: Humanity cannot gain by deviating in either direction from its own equilibrium strategy.

For AGI:

  • Expected payoff of Attack = 500p + 3000(1-p)
  • Expected payoff of Ignore = 0p + 4000(1-p)

The AGI is indifferent between Attack and Ignore when the expected payoffs are equal. Setting these equal:

500p + 3000(1-p) = 0p + 4000(1-p) 

Solving for p:

 p = 1000/1500 = 2/3 = 0.667

Thus, when the AGI is indifferent, Humanity must randomize between Attack and Ignore with probabilities of approximately 66.7% and 33.3%, respectively. This mixing strategy is unexploitable: the AGI cannot gain by deviating in either direction from its own equilibrium strategy.

Therefore, the mixed-strategy Nash equilibrium is:

  • Humanity plays Attack with probability 0.67 and Ignore with probability 0.33
  • AGI plays Attack with probability 0.8 and Ignore with probability 0.2

In this equilibrium, each player's mix makes the other indifferent between its pure strategies, so no unilateral deviation can improve either player's expected payoff.


Verification of indifference:

  • When the AGI uses this strategy, Humanity gets an expected payoff of 800 from either pure strategy
  • When Humanity uses this strategy, the AGI gets an expected payoff of approximately 1333 from either pure strategy


For section 3.3 - Lab-AGI Model (Early Stage):
 

Mixed-strategy equilibrium: AGI Attack/Ignore ratio of 4:1, Lab Attack/Ignore ratio of 25:2

The Lab plays Attack with probability p and Ignore with probability 1−p.
The AGI plays Attack with probability q and Ignore with probability 1−q.

|            | Lab Attack | Lab Ignore   |
|------------|------------|--------------|
| AGI Attack | (200, 500) | (1500, 0)    |
| AGI Ignore | (0, 2000)  | (4000, 4000) |

First, to find the AGI's equilibrium mixing strategy, we need to find the Lab's indifference condition. For the Lab to be indifferent between Attack and Ignore, the expected payoffs must be equal:

  • If the Lab attacks:

    EV(Attack) = 500q + 2000(1−q)

  • If the Lab ignores:

    EV(Ignore) = 0q + 4000(1−q)
     

Setting these equal:

500q + 2000(1−q) = 0q + 4000(1−q)

Solving for q:

q = 2000/2500 = 4/5 = 0.8

Thus, when the Lab is indifferent, the AGI must randomize between Attack and Ignore with probabilities 80% and 20%, respectively.

Second, to find the Lab's equilibrium mixing strategy, we need to find the AGI's indifference condition. For the AGI to be indifferent between Attack and Ignore, the expected payoffs must be equal:

  • If the AGI attacks:

    EV(Attack) = 200p + 1500(1−p)

  • If the AGI ignores:

    EV(Ignore) = 0p + 4000(1−p)
     

Setting these equal:

200p + 1500(1−p) = 0p + 4000(1−p)

Solving for p:

p = 2500/2700 = 25/27 ≈ 0.926

Thus, when the AGI is indifferent, the Lab must randomize between Attack and Ignore with probabilities approximately 92.6% and 7.4%, respectively.

In this equilibrium, both players randomize their strategies at the unexploitable frequencies, so that each is indifferent between its available actions:

  • AGI: Attack with probability 0.8 and Ignore with probability 0.2.
  • Lab: Attack with probability ≈ 0.926 and Ignore with probability ≈ 0.074

For section 3.4 - State-AGI Model:

Let's denote the following:

  • The state plays Attack with probability p and Ignore with probability 1−p.
  • The AGI plays Attack with probability q and Ignore with probability 1−q.

|              | AGI Attack  | AGI Ignore   |
|--------------|-------------|--------------|
| State Attack | (1000, 700) | (1500, 200)  |
| State Ignore | (400, 1800) | (3500, 3600) |

First, to find the AGI's equilibrium mixing strategy, we need to find the State's indifference condition.  For the state to be indifferent between Attack and Ignore, the expected payoffs must be equal:

  • If the state attacks:

    EV(Attack) = 1000q + 1500(1−q) = 1500 − 500q

  • If the state ignores:

    EV(Ignore) = 400q + 3500(1−q) = 3500 − 3100q

Set these equal:

1500 − 500q = 3500 − 3100q 

q = 2000/2600 = 10/13 ≈ 0.77

1 - q ≈ 0.23

When the state is indifferent between Attack and Ignore, the AGI's mixing strategy is unexploitable (the State cannot profit by deviating in either direction).  
 

Here, that means the AGI randomizes between playing Attack 77% and Ignore 23% at equilibrium.


Second, to find the State's equilibrium mixing strategy, we need to find the AGI's indifference condition. For the AGI to be indifferent between Attack and Ignore, the expected payoffs must be equal:

  • If the AGI attacks:

    EV(Attack) = 700p + 1800(1−p) = 1800 − 1100p

  • If the AGI ignores:

    EV(Ignore) = 200p + 3600(1−p) = 3600 − 3400p

Set these equal:

1800 − 1100p = 3600 − 3400p 

Solve for p:

p = 1800/2300 = 18/23 ≈ 0.78

1 - p ≈ 0.22

When the AGI is indifferent between Attack and Ignore, the State's strategy is unexploitable.  

 

Here, that means the State randomly mixes playing Attack 78% and Ignore 22%.

 

In this equilibrium, both players randomize their strategies at the unexploitable frequencies, so that each is indifferent between its available actions.
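
As a final check, the sketch below (my own, standard-library Python) recomputes all three mixed equilibria in this Appendix from the general 2x2 indifference formula. It assumes each game has an interior mixed-strategy equilibrium, which holds for the payoffs used here.

```python
# Sketch: recompute all three mixed equilibria above from the general 2x2
# indifference conditions.
from fractions import Fraction as F

def mixed_equilibrium(row, col):
    """row[a][b], col[a][b]: payoffs to the row and column player when the row
    player picks a and the column player picks b (0 = Attack, 1 = Ignore).
    Returns (P[row player attacks], P[column player attacks])."""
    # The column player's mix q makes the row player indifferent.
    q = F(row[1][1] - row[0][1], row[0][0] - row[0][1] - row[1][0] + row[1][1])
    # The row player's mix p makes the column player indifferent.
    p = F(col[1][1] - col[1][0], col[0][0] - col[1][0] - col[0][1] + col[1][1])
    return p, q

# Deep integration (row = Humanity, column = AGI): expect (2/3, 4/5).
print(mixed_equilibrium([[500, 2000], [0, 4000]], [[500, 0], [3000, 4000]]))
# Early-stage lab (row = AGI, column = Lab): expect (4/5, 25/27).
print(mixed_equilibrium([[200, 1500], [0, 4000]], [[500, 0], [2000, 4000]]))
# State vs. AGI (row = State, column = AGI): expect (18/23, 10/13).
print(mixed_equilibrium([[1000, 1500], [400, 3500]], [[700, 200], [1800, 3600]]))
```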


