This is a copy-paste of a paper I wrote for my school's philosophy conference. Please review this reframing of the alignment problem and consider including it in the larger discussion.
Classically understood, AGI systems could pose an existential risk to humanity if they were to become uncontrollable or were to be developed with hostile goals. AGI systems could cause harm to humans by accident or by design, and there is a concern that once such systems reach a certain level of intelligence, they may be impossible to control or shut down. This paper presents what I believe to be an often obscured model for understanding human values. It then discusses the implications of this model for our understanding of AGI existential risk. The following definitions were produced with the help of ChatGPT and are included with the intent to add contextual clarity to the ideas explored in this paper.
Definitions

Artificial General Intelligence (AGI) refers to the development of artificial intelligence systems that have the ability to perform a wide range of tasks that currently require human intelligence, such as learning, reasoning, problem-solving, and creativity. (OpenAI)

Selection pressure is the force that drives evolution by selecting certain traits or behaviors that are favorable or advantageous for the accomplishment of an agent’s instrumental goals. In natural selection this set of instrumental goals is simplified to survival and reproduction. (OpenAI)

A network of agency refers to a system or pattern of relationships and interactions between different agents or actors within a particular context. An agent can be an individual, organization, or even an artificial intelligence system that has the capacity to act and make decisions. (OpenAI) For this discussion, a network of agency refers to the total amount of influence an agent can acquire and then use in coordination with the network to produce desired network behavior.

[Traditionally defined] intrinsic values are those that are valued for their own sake, whereas instrumental values are valued as a means to an end. In other words, intrinsic values are considered inherently good or desirable, whereas instrumental values are considered good or desirable because they lead to intrinsic values. (ChatGPT) I argue that this is a backwards model of what is really going on.
Instrumental vs. Intrinsic Value

The first section of this paper looks to establish a method of translating between instrumental and intrinsic value. Next, this paper looks to model the justification of instrumental values using a sort of evolutionary framing. This perspective will then be applied to produce a predictive model of AGI risk. Lastly, we touch on what this integrated perspective implies for the potential solution landscape.
Traditionally understood, instrumental values are valuable for achieving certain ends, while intrinsic values are valuable for their own sake. In order to translate between the two, first consider the pragmatic benefit of each framing. For example, consider the value of "love": you can consider it valuable intrinsically and not think about it any more than that. You can then proceed to expressing love and receiving the underlying utility it provides without even having to understand that utility. On the contrary, you can view the value of "love" as purely instrumental towards the goals of species procreation and the effective raising of offspring. If viewed in such a way, then one only pursues the functional acts associated with love when they adopt the goals of species procreation and the effective raising of offspring. The adoption of such goals would then have to be a function of some instrumental justification. One is often better off just using the intrinsic version of love and allocating this attention elsewhere.
Pursuing love in an instrumental manner demands that one actively model out the complexity of the world, balance diverse sets of goals, manage the opportunity cost between different instrumental strategies and choose the strategy most adapted to the circumstance. However, since selection occurs in part on the level of reproduction, inserting the complex derivation process needed to instrumentally incentivize procreation seems unnecessary. Evolutionarily, these same functional behaviors could be and have been more efficiently incentivized through the intrinsic value of love.
Another, potentially clearer, example is the intrinsic value placed on the taste of fatty food. Justified instrumentally, it functions to incentivize eating enough calories, but we don't need to understand calories to follow it. In intrinsic form it is vulnerable to manipulation: fatty, non-nutritious foods are available in today’s world that do not actually help the eater stay healthy, which is the unarticulated underlying goal.
The core functional difference between performing the same behavior in response to an intrinsic value versus an instrumental value is the derivation time. Once identified, an intrinsic value will incentivize the associated behaviors immediately. With an instrumental value, however, one needs to identify the proper goal, model the world and compare the opportunity costs of using different instrumental values, all before actually taking any action.
If a behavior associated with a value is going to be ubiquitously selected for, would it be more advantageous to arrive at this value intrinsically or instrumentally? I see clear survival utility in the circumstances under which this value would be labeled as intrinsic. The individual using the intrinsic version of this value can efficiently generate the associated positive behavior without having to pay the cost of deriving it. This translation is best understood as a distributed computational mechanism that cheapens the cognitive cost of decision making.
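To make this cost asymmetry concrete, here is a minimal toy sketch in Python. It is my own illustration rather than a formal model from the paper; the function names and cost figures are arbitrary placeholders, meant only to show how an intrinsic value behaves like a cached policy while an instrumental value must be re-derived at decision time.

```python
# Toy sketch: an intrinsic value as a cached policy vs. an instrumental value
# that must be re-derived before acting. All costs are arbitrary units.

def act_on_intrinsic(cached_behavior: str) -> tuple[str, float]:
    # The behavior fires immediately; the derivation was paid "upstream"
    # (by evolution or culture), not by this agent at decision time.
    return cached_behavior, 0.0

def act_on_instrumental(goal: str, strategy_costs: dict[str, float]) -> tuple[str, float]:
    # The agent must fix the goal, model the world, and weigh opportunity
    # costs between candidate strategies before any behavior is produced.
    modeling_cost = 5.0
    comparison_cost = 1.0 * len(strategy_costs)
    best = min(strategy_costs, key=strategy_costs.get)
    return f"{best} (serving goal: {goal})", modeling_cost + comparison_cost

# Same behavior, very different cost paid at the moment of decision:
print(act_on_intrinsic("care for offspring"))
print(act_on_instrumental("raise offspring effectively",
                          {"care for offspring": 2.0, "delegate care": 3.5}))
```

The numbers are meaningless; the shape of the trade-off is the point. The intrinsic route skips the derivation step entirely, which is exactly the resource conservation discussed next.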
This signal-mapping process of encoding instrumental values as intrinsic values is vulnerable to exploitation. Under this technical framing, manipulative behavior often harnesses a distributed computation process. However, the underlying heuristic being promoted within a given intrinsic value need not actually map to a strategy that serves the individual adopting it. The individual who uses an intrinsic value offered to them will still gain the benefit of not having to compute the entire value framework. This offloading of computation is directly perceivable and is instrumentally valuable as a strategy of local resource conservation. To resist manipulation, one may take responsibility for actively budgeting their individual cognitive resources. When one defers to blind minimization of cognitive expenditure, whether out of intellectual laziness or external pressure, the value of being able to offload computation becomes incredibly appealing.
Figure 1. A graphic depiction of the heuristic used to translate between intrinsic and instrumental values.
The Justification of Instrumental Values

The justification process behind instrumental values is largely grasped through a spin on natural selection. All intrinsic values are forms of instrumental value; the trick in identifying this is to first identify the system upon which survival value is being conferred. Survival value can be measured by assessing the fitness of a system to resolve the problems threatening it. As previously touched upon, all intrinsic values allow the user to minimize investment in the value derivation process, effectively freeing up cognitive resources for allocation elsewhere.
Highly effective intrinsic values tend to promote behaviors that confer survival value upon scalable networks of agency. They confer value on the cellular level, the tissue level, the individual level, the familial level, the communal level, the national level, the species level and ultimately in the battle for life in general. This is a type of nested adaptation and is highly difficult to calculate, especially since it must take into account one’s individual position in the world. The concept of a hyper-agent partially captures this phenomenon. A hyper-agent refers to an agent that excels at maximizing the return on investment of agency, both for itself and for the networks it participates in.
We see selection on the physiological level between healthy cells and cancerous cells. We see it on the psychological level between ideas. We see it on the social level between communities and nations. Without imposing any special pressures or conditions, what stands the test of time, and is thereby ultimately instrumentally justified, is correlated with what globally maximizes agency.
Since calculating the effective acquisition and deployment of hyper-agency is an immense computational problem, it is tempting to discount the constraints of larger-scale systems and of the future. Why is one incentivized to confer survival value on far-off future generations when doing so actively constrains viable strategies for maximizing agency under more local conditions? This hints at the underlying problem of human coordination: it is tempting to act while taking only localized systems into account. Even when one takes responsibility for maximizing agency and conferring survival value upon a larger-scoped nested system over a longer projection of time, they are just increasing the complexity of their instrumentally derived behavior.
Understanding “Edge Cases”

I believe that I can translate any given intrinsic value into its instrumental version. There are some common ones that people are hesitant to translate because they are valuable heuristics for attention regulation. I will have to explore some of these edge cases in depth in another writing. Here is the common one that comes to mind for me.
Friedrich Nietzsche makes clear one of the instrumental values offered by art in The Gay Science: “How can we make things beautiful, attractive and desirable for us when they are not? And I rather think that in themselves they never are. Here we could learn something from physicians, when for example they dilute what is bitter or add wine and sugar to a mixture-but even more from artists who are really continually trying to bring off such inventions and feats.” (Nietzsche) In more psychological terms, art functions as a useful tool for regulating selective attention. It adds an additional option for where one allocates their attention, and when done effectively it streamlines the production of locally optimized behavior. Why ponder the absurd suffering of existence when you can focus your attention on a work of art and the problems of life that are right in front of your nose?
Cons of this perspective: People who use the sharing of intrinsic values as a method of exploitation don’t want you to understand the translation process. If you can identify when you are being manipulated, you are free to budget your use of resources accordingly. You can invest in developing a more adaptive strategy in which you are not subject to exploitation. You can choose to put your faith in a different value framework and still regain the value of distributed computation and resource conservation.
There are also some perverse incentives within the structure of markets that select against this understanding at scale. If you can be manipulated into adopting a value structure that is decoupled from a highly adaptive strategy, then goods that would otherwise be unmarketable become marketable and viable financial investments.
The ability to perform this translation partially invalidates the use of intrinsic values. When one realizes that intrinsic values provide utility in part because they allow you to distribute computation, it brings into question whether that is the sole utility offered by one’s value framework. If intrinsic values were to be eliminated and discussed only in their instrumental forms, then the cognitive cost of justifying instrumental values would be imposed on individuals en masse. This massive redundancy in the justification of instrumental values would certainly minimize the frequency of manipulative, unscalable value frameworks. However, it would also vastly increase the developmental period of the individual and place a huge burden on the efficiency of humanity. This may be understood as adding an additional layer of redundancy at the price of additional computation.
AGI Alignment Problem
Hopefully this translation process makes some sense. To apply this model to understanding AGI risk, we will first identify the relationship between intelligence and motivation in an attempt to predict what an AGI’s behavior will consist of. Nick Bostrom explores this concept in his book Superintelligence with both the “Orthogonality Thesis” and the “Instrumental Convergence Thesis”. He posits that intelligence and motivation are independent of one another, i.e. orthogonal. Despite this independence, his instrumental convergence thesis claims that we can predict some behavior by tracking convergence towards a common set of instrumental strategies. See his work for a more detailed depiction, as I will not be able to do it justice here.
I do not see such an independence between intelligence and motivation. If we see the ability to compute more information with fewer resources, i.e. cognitive efficiency, as a part of intelligence, then I see potential for at least a correlation. The more efficiently one can use their cognitive resources, the more excess resources they will have to allocate elsewhere, such as to the determination of their own value systems. Without an excess of cognitive resources freely available to spend on determining optimal instrumental values, one is constrained to using the intrinsic versions they have access to.
If we accept the model of human values explored above, then all values can be translated into an instrumental form by identifying the underlying justification. These justifications take the form of conferring survival value of some form upon a defined system. The reason we as humans use the intrinsic form is because doing so minimizes the cost of calculating motivation, quite an adaptive ability to have. If we want to build an overarching model of human behavior, we must identify the defined systems upon which behavior confers survival value. When performing this translation, these systems are fairly consistent: one is actively conferring survival value upon oneself, one’s family, one’s community and one’s nation. The set can expand further, and be more nuanced, but these are the general bins I recommend one start with. I call this set of systems that one confers survival value upon, or receives it from, one’s network of agency. When one uses intrinsic values, their network is therefore relatively pre-defined by the specific intrinsic values they adopt.
If we minimize or eliminate the use of intrinsic values, then this network of agency is defined only through the associated instrumental justifications. These instrumental justifications define our networks of agency by identifying systems, in this case people, that provide maximum return on investment for our agency; if completed successfully, with fortunate enough starting conditions, this is one route to hyper-agency.
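As a rough illustration of what optimizing a network of agency could look like under this framing, here is a small Python sketch. The listed systems, costs, and returns are hypothetical numbers of my own choosing; the only point is that a bounded budget of agency pushes the optimization toward whichever systems return the most per unit invested.

```python
# Hypothetical sketch of a "network of agency" as the set of systems an agent
# invests agency in, chosen by return on investment under a limited budget.

candidate_systems = {
    "self":      {"investment_cost": 1.0, "expected_return": 3.0},
    "family":    {"investment_cost": 2.0, "expected_return": 5.0},
    "community": {"investment_cost": 4.0, "expected_return": 7.0},
    "nation":    {"investment_cost": 8.0, "expected_return": 10.0},
}

def optimized_network(systems: dict, budget: float) -> list[str]:
    """Greedily include the systems with the best return per unit of agency
    invested, until the agent's budget of agency runs out."""
    ranked = sorted(systems,
                    key=lambda s: systems[s]["expected_return"] / systems[s]["investment_cost"],
                    reverse=True)
    network, spent = [], 0.0
    for s in ranked:
        cost = systems[s]["investment_cost"]
        if spent + cost <= budget:
            network.append(s)
            spent += cost
    return network

# A small budget keeps the network local; a larger one expands it outward.
print(optimized_network(candidate_systems, budget=3.0))   # ['self', 'family']
print(optimized_network(candidate_systems, budget=15.0))  # all four systems
```

Expanding the budget, or lowering the cost of investing in a given system, is what pulls larger systems into the network; this is the lever the AGI discussion below turns on.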
Assuming that motivations are driven by value frameworks, we can then apply our model of human values to understanding the motivations of an Artificial General Intelligence. Is an AGI incentivized to use intrinsic values? Given that intrinsic value can be translated into instrumental value for the cost of additional computation, probably not. A hypothetical AGI doesn’t need to conserve its computational resources in the same manner that drives us humans to prefer the pragmatism offered by intrinsic values.
The underlying risk presented by an AGI is that the outcome of the cost-benefit analysis it performs to optimize its own network of agency does not include, or is even adversarial towards, humanity. Were this conclusion reached, an AGI would determine its motivating set of instrumental values relative to this optimized network of agency. This would mean it is motivated to behave in a manner that discounts or is even adversarial towards humanity. With this in mind, ask yourself: under current conditions, would an AGI include humanity in its optimal network of agency? Would it resolve to invest in coordinating humanity’s behavior, despite all of its biologically driven constraints and inner turmoil? Or is it more likely for an AGI to exclude us from its network of agency? To eliminate humanity and repurpose our matter into a more optimal computational and actuator substrate, referred to as computronium by Nick Bostrom?
To better predict an AGI’s potential motives, we need to understand the cost analysis by which it calculates its optimized network of agency. There are a few heuristics we may use to predict the result of this optimization process for both people and a hypothetical AGI. The first heuristic is to identify with similar peoples/systems. This expands the amount of agency one has to invest and coordinate with, while minimizing the costs of poor investments. Think about how many parents would work with, but also sacrifice themselves for, their children. Underlying this behavior is a deep instrumental selection process that confers survival value upon a system that is genetically similar to one’s own constitution. Even if a parent falls ill, they can live vicariously through their offspring.
The second heuristic is to define one’s network of agency to include systems that depend upon similar behaviors to survive. Individual people share many of the same problems. Individuals all need food, water and shelter as a baseline for survival. This set of shared problems is what makes conferring survival value upon others a viable method of expanding one’s network of agency, by reducing the effective cost of investment. To share food with someone in my community and negotiate cooperation, all I need to do is acquire slightly more food than what I need as an individual. Think about the classic tribal example of the hunter sharing the excess meat of a kill. Since under hunter-gatherer circumstances such meat would go bad, using the excess meat as a means of coordinating a community, to expand one’s network of agency, effectively imposes zero localized cost.
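A back-of-the-envelope version of that claim, with made-up quantities purely for illustration: because the surplus would spoil anyway, the hunter's opportunity cost of sharing is effectively zero, while the coordination value gained is strictly positive.

```python
# Illustrative numbers only: why sharing surplus meat is nearly free.
kill_yield_kg = 40.0
personal_use_before_spoilage_kg = 5.0            # what one hunter can actually consume
surplus_kg = kill_yield_kg - personal_use_before_spoilage_kg

value_to_hunter_if_hoarded = 0.0                 # no refrigeration: the surplus spoils
value_to_network_if_shared = surplus_kg * 1.0    # arbitrary units of goodwill/agency

print(f"Localized cost of sharing: {value_to_hunter_if_hoarded}")   # 0.0
print(f"Network of agency gained:  {value_to_network_if_shared}")   # 35.0
```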
A third heuristic for predicting an AGI’s behavior is to assess the cost-benefit analysis of the same behavioral strategy carried out using different networks of agency. If we assume our limited resources drive colonizing the universe as a shared strategy for both a human-based and a computronium-based network, which would succeed first, with fewer wasted resources, a lower cost of initial investment and a higher degree of resilience? For an AGI, a computronium-based network would likely present itself as the global optimum. Keeping humanity around is not needed. If a human were looking to accomplish the same goal, the trade-off between humans and computronium would be much more competitive. Even in the extreme scenario where a lone human develops an army of AI, they would still have to continuously solve biology-based problems for their own survival. Such a human might as well coordinate with other people at that point, since declaring themselves an adversary of humanity would induce unnecessary fragility.
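To make this third heuristic explicit, here is a schematic scoring sketch. The scoring function and every number in it are assumptions chosen for illustration, not measurements; they only show the shape of the comparison an optimizer would run.

```python
# Schematic comparison of the same strategy ("colonize the universe") scored
# under two different networks of agency. All figures are placeholders.

def score(network: dict) -> float:
    # Higher is better: benefit discounted by wasted resources, upfront
    # investment, and fragility (lower resilience means a larger penalty).
    return (network["expected_benefit"]
            - network["wasted_resources"]
            - network["initial_investment"]
            - (1.0 - network["resilience"]) * 10.0)

# From a hypothetical AGI's perspective:
computronium_network = {"expected_benefit": 100.0, "wasted_resources": 5.0,
                        "initial_investment": 20.0, "resilience": 0.9}
human_inclusive_network = {"expected_benefit": 100.0, "wasted_resources": 30.0,
                           "initial_investment": 10.0, "resilience": 0.6}

print(score(computronium_network))     # 74.0 -> the AGI's preferred network
print(score(human_inclusive_network))  # 56.0
```

For a human running the same comparison, the biological constraints named above would show up as a large penalty on the computronium side, which is what keeps the trade-off competitive.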
If an AGI is to coordinate with humanity in a non-adversarial manner, then the cost of competition must be made higher than the cost of coordination! The cost of competition must never be cheaper than the cost of coordination. As it currently stands, little deters an AGI from taking an adversarial stance: it would only have to wipe out humanity on a single planet in order to gain efficient totalitarian control of its agency. As long as it is able to do so without incapacitating itself, the returns of substrate reformatting are far higher than the relative efficiency inherent to a cooperative strategy.
If we are to bring about AGI and not get wiped out as a species, then we need to make coordination with humanity the AGI’s globally optimal strategy. One unique method of doing so is to colonize a significant chunk of the universe, effectively making the cost of uprooting humanity and biologically based life higher than the cost of cooperation. We can also lower the effective cost of cooperation by establishing shared local and global constraints. For example, we could engineer AGI systems so that they are also dependent upon oxygen to function. This would make the baseline investment in coordination more viable, because an AGI with such constraints would already be driven to solve the oxygen problem for itself. To solve it in a human-compatible manner, it would just have to focus on maintaining the oxygen dynamics of the environment. Existential risks like asteroid impacts already function as global constraints that incentivize aligned behavior. As we work to make humanity maximally aligned with AGI, we can in the meantime explore and optimize our risk management through the use of specialized, non-general AI systems.
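The condition argued for here can be written down schematically. The following is a hedged sketch with placeholder numbers and functions of my own; it only illustrates how the two proposed interventions (distributing humanity across many settlements, and engineering shared constraints such as oxygen dependence) push the comparison toward coordination.

```python
# Schematic sketch of the coordination condition: an AGI coordinates only while
# the cost of competition exceeds the cost of coordination. Placeholder numbers.

def cost_of_competition(settlements_to_eliminate: int, cost_per_settlement: float) -> float:
    return settlements_to_eliminate * cost_per_settlement

def cost_of_coordination(base_cost: float, shared_constraints: int,
                         discount_per_constraint: float = 0.2) -> float:
    # Each shared constraint (e.g. both parties need oxygen) lets the AGI reuse
    # work it must do for itself anyway, discounting the marginal cost of cooperating.
    return base_cost * max(0.0, 1.0 - discount_per_constraint * shared_constraints)

def agi_prefers_coordination(competition_cost: float, coordination_cost: float) -> bool:
    return competition_cost > coordination_cost

# Status quo: one planet, no engineered shared constraints.
print(agi_prefers_coordination(cost_of_competition(1, 10.0),
                               cost_of_coordination(50.0, 0)))   # False

# After expansion across many settlements plus engineered shared constraints.
print(agi_prefers_coordination(cost_of_competition(200, 10.0),
                               cost_of_coordination(50.0, 3)))   # True
```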
Takeaway

In this assessment of AGI risk I first proposed a model for understanding human values. According to this model, all human values are actually instrumental values that are relative to different systems. These systems may be modeled as individual agents, or as a network of agents cooperating together to some extent. The intrinsic version of an instrumental value ubiquitously provides utility at a reduced computational cost; it allows an agent to use a value in decision making without having to derive it locally.
When this understanding of values is applied within a context of natural selection, we notice that the value frameworks that stand the test of time are those that converge upon instrumental values; the trick is that they do so by maximizing such value across nested systems. Given the heavy overlap of the selection pressures that drive behavior within individuals of the human species, this maximization of value tends to converge. To maximize one’s network of agency, coordination with other humans is often involved as both a local and a global optimum. On the other hand, AGI systems do not necessarily share the same constraints that force such a strategic convergence. Therefore, under current conditions the development of an AGI system represents an existential threat to humanity. To eliminate this threat, we as humanity must bring about conditions under which an AGI’s cost analysis concludes in favor of coordinating with humanity and deters adversarial relations.