Not an EA. Aligning with universal loving care. Post-rational.
Hello Chi, thanks for sharing the interesting list and discussion questioning the focus on “human control of AGI”. For readers, a friend shared this post with me, so the 'you' refers to this friend :-).
I wrote a short post in line with the one you shared on “Aligning the Aligners”: “Solving The AI Control Problem Amplifies The Human Control Problem”. The animal advocacy inclusive AI post makes a good point, too. I’ve also written about how much “AI Safety” lore is [rather unethical](https://gardenofminds.art/bubbles/why-much-ai-ethics-lore-is-highly-unethical/) by species-agnostic standards. Need we mention that “child safety” refers to the safety of children, so “AI Safety” is a misnomer for what is meant? It should be “Human Safety from AI”.
I believe these other concerns can be **more important** than aiming to “keep AI under human control”. How about increasing peace on Earth and protopic support for the thriving of her sentient beings? Avoiding total extinction of all organic life on Earth is one of the purported reasons why “it’s so important to control AI”, right? If beneficial outcomes for humans and animals could be attained without controlling AI, why would it still be important to keep them under control? Especially this community should be mindful not to lose sight of the *primary goals* for the *proxy goals*. I’d argue that *control* is a proxy goal in this case.
While the topic of *which* humans get to “control AI” is introduced, curiously “democracy” doesn’t show up! Nuclear weapons can entrench the difference between those who do and don’t have nuclear weapons. Couldn’t control over AGI by a few major world powers further solidify their rule in the “nuclear and artificial intelligence order”? There are huge differences between treaties on the use of nuclear materials and AI, too. AGI will be incredibly useful all across the socio-economic spectrum, which poses an additional challenge for channeling all AGI R&D through a few international research laboratories.
Some of the ideas seem to fall under the general header of “making better AGI”. We’d like to create AGIs that can effectively reason about complex philosophical topics, investigate epistemology, and integrate the results into themselves. [Question: do such capacities point in the direction of open-ended intelligence, which runs counter to *control*?] Ideally, compared with humans, neuro-symbolic AGI will be in a better position for metacognition, self-reflective reasoning, truth-preserving reasoning, and explicitly following decision-theoretic approaches. As we approach *strong AGI*, many smaller problems could begin to weaken or fall away. Take *algorithmic bias* and how susceptible the systems are to training data (which is not dissimilar from human susceptibility to influence): metacognition should allow the AGI to identify *morally positive and negative* examples in the training data so that instead of parroting *poor behavior*, the negative examples are clearly understood as what they are. Even where the data distribution is *discriminatorily biased*, the system can be aware of higher principles of equality across certain properties. “Constitutional AI” seems to be heading in this direction already 🙂🤓.
The “human safety from AI” topics often seem, imo, to focus strongly on ensuring that there are no “rogue” AIs, with less attention given to the question of what to do about the likely fact that there will be some. Warfare will probably not be solved by the time AGI is developed, either. Hopefully, we can work with AI to help move toward peace and increase our wisdom. Do we wish for human wisdom-enhancing AI to be centralized and opaque or decentralized and open so that people are *‘democratically’* capable of choosing how to develop their wisdom and compassion?
A cool feature of approaches such as *constitutional AI* and *scalable oversight* is that they lean in the direction of fostering an *ecosystem of AI entities* that keep each other in check with reciprocal accountability. I highly recommend David Brin’s talk at BGI24 on this topic. AI personality profiling falls under this header, too. Some of the best approaches to “AI control” actually share an affinity with the “non-control-based” approaches. They may further benefit from the increased diversity of AGI entities so that we’re less likely to suffer any particular AGI system undergoing some perversities in its development?
A question I ponder is: to what extent are some of the control and non-control-based approaches to “harmonious human-AI co-existence” mutually incompatible? A focus on open-ended intelligence and motivations with decentralized compute infrastructures and open source code, even with self-owned robot/AGI entities, so that no single group can control the force of the intelligence explosion on Earth, is antithetical to some of the attempts at wielding human control over AGI systems. These *liberal* approaches can also aim to help with sociopolitical power struggles on Earth, so that the blossoming of AGI does not further solidify the current power structures. I believe there is some intersection of approaches, too.
The topic of which value systems bias us toward beneficial and harmful outcomes is also important (and is probably best discussed w/o worrying about whether they provide safety guarantees). In the other comment, I mentioned the idea that “careless” goals will likely lead to the manifestation of “dark factor traits”. Some goal formulations are more compatible with satisficing and diminishing marginal returns, too, which would help with the fears that “AGIs wipe all humans out to maximize their own likelihood of surviving” (which, imo, seems to assume some *stupidity* on the part of the *superintelligent* AGI 🤷‍♂️). Working with increasingly *unfussy* preferences is probably wise, too. Human requests of AGI could be *fussy*, whereas allowing AGIs to refactor their own motivations to be less fussy and more internally coherent leads away from *ease of controllability*.
A big pink elephant in the room is the incentive structures into which we’re introducing proto-AGI systems. At present, helping people via APIs and chat services is far better than I may have feared it could be. “*Controlled AI” used for a misaligned corporate entity* may do massive harm nonetheless. Let’s have AI nurses, scientists, peace negotiators, etc.
David Wood summarized the “reducing risks of AI catastrophe” sessions at BGI24. Suggestion #8 is to change mental dispositions around the world, which is similar to reducing human malevolence. Such interventions done in a top-down manner can seem very, very creepy — and even more so if advanced proto-AGI is used for this purpose, directed by some international committee. The opacity of the process could make things worse. Decentralized, open-source AGI “personal life coaches” could come across very differently!
Transparency as to how AGI projects are going is one “safety mechanism” most of us may agree on? There may be more. Should such points of agreement receive more attention and energy?
During our discussions, you said that “AI Safety is helpful” (or something similar). I might question the extent to which it’s helpful.
For example, let’s say that “ASI is probably theoretically uncontrollable” and “the kinds of guarantees desired are probably unattainable”. If so, how valuable was the “AI Safety” work spent on trying to find a way to guarantee human safety? Many attendees of the AGI conference would probably tell you that it’s “obviously not likely to work”, so much time was spent confirming the obvious. Are all efforts hopeless? No. Yet stuff like “scalable oversight” would fall under the category of “general safety schemes for multi-agent systems”, not so specific to “AGI”.
What if we conclude that it’s insufficient to rely on control-centric techniques, especially given the bellicose state of present human society power dynamics? An additional swathe of “AI Safety” thought may fall by the wayside. Open, liberal approaches will require different strategies. How important is it to delve deep into thought experiments about possible sign flips, as if we’re unleashing one super AGI and someone got the *moral guidance* wrong at the last second? — whoops, game over!
Last week I was curious what EA folk thought about the Israel-Hamas war and found one discussion about how a fresh soldier realized that most of the “rationality optimization” techniques he’d practiced are irrelevant, approaches to measuring suffering he’d taken appear off, attempts to help can backfire, etc.: “models of complex situations can be overly simplistic and harmful”. How do we know a lot of “AI x-risk” discussions aren’t naively Pascal-mugging people? Simple example: discussing the precautions we should take to slow down and ensure the wise development and deployment of AGI assuming idealistic governance and geopolitical models without adequately dealing with the significant question of “*which humans get to influence AGIs and how*”.
How confident are we that p(doom|pause) is significantly different from p(doom|carry on)? That it’s *necessarily lower*? How confident should we be that international deliberation will go close enough to ideally? If making such rosy assumptions, why not assume people will responsibly proceed as is? Advocating *pausing* until we’re sure it’s sufficiently low is a choice with yet more difficult-to-predict consequences? What if I think that p(doom|centralized AGI) is significantly higher than p(doom|decentralized AGI)? Although the likelihood of ‘smaller scale’ catastrophes may be higher? And p(centralized AGI|pause) is also significantly higher? Clearly, we need some good simulation models to play with this stuff :-D. To allow our budding proto-AGI systems to play with! :D. The point is that fudging numbers for overly simplistic estimates of *doom* could easily lead to Pascal-mugging people, all while sweeping many relevant real-world concerns under the rug. Could we find ourselves in some weird scenarios where most of the "AI Safety" thought thus far turns out to be "mostly irrelevant"?
A common theme among my AGI dev friends is that “humans in control of highly advanced yet not fully general AI may be far, far more dangerous than self-directed full AGI”. Actually enslaved “AGI systems” could be even worse. Thinking in this direction could lead to the conclusion that p(doom|pause) is not necessarily lower.
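To make the conditional-probability point concrete, here is a toy sketch rather than a forecast. It assumes (my simplification) that a policy such as pausing affects doom only through whether AGI ends up centralized, so the law of total probability gives p(doom|policy) = p(doom|centralized)·p(centralized|policy) + p(doom|decentralized)·(1 − p(centralized|policy)). The function name and every number below are made-up placeholders chosen only to show the mechanics.

```python
def p_doom_given_policy(p_centralized: float,
                        p_doom_centralized: float,
                        p_doom_decentralized: float) -> float:
    """Law of total probability over {centralized, decentralized} AGI.

    Assumes (hypothetically) that the policy influences doom only via
    the probability that AGI ends up centralized.
    """
    for p in (p_centralized, p_doom_centralized, p_doom_decentralized):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return (p_doom_centralized * p_centralized
            + p_doom_decentralized * (1.0 - p_centralized))


# Placeholder inputs, NOT estimates: someone who believes centralization
# is riskier and that a pause makes centralization more likely.
P_DOOM_CENTRALIZED = 0.30
P_DOOM_DECENTRALIZED = 0.10

p_doom_pause = p_doom_given_policy(0.8, P_DOOM_CENTRALIZED, P_DOOM_DECENTRALIZED)
p_doom_carry_on = p_doom_given_policy(0.4, P_DOOM_CENTRALIZED, P_DOOM_DECENTRALIZED)

print(f"p(doom|pause)    = {p_doom_pause:.2f}")
print(f"p(doom|carry on) = {p_doom_carry_on:.2f}")
```

With these inputs the pause scenario comes out worse, and flipping the placeholder numbers flips the conclusion. The output is only as good as the fudged inputs, which is the worry.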
As for concluding remarks, it seems that much of this work focuses on “building better AGI”. Then there’s “working with AI to better humanity”. My hunch is that any work improving *peace on Earth* will likely enhance p(BGI). Heck, if we could solve the *misaligned corporation problem*, that would be fabulous!
One cool feature of the non-control-based approaches is that they may be more worthwhile investments even if only partial progress is made. Increasing the capacity for deep philosophical reasoning and decreasing the fussiness of the goals of *some* AGI systems may already pay off and increase p(BGI) substantially. With control-centric approaches, I often see the attitude that we “must nail it or we’re doomed”, as if there’s no resilience for failure. If a system breaks out, then we’re doomed (especially because we only focused on securing control without improving the core properties of the AGI’s mind).
I’ll add that simple stuff like developing “artificial bodhisattvas” embracing “universal loving care” as suggested in Care as the Driver of Intelligence is worthwhile and not control-based. Stuart Russell and David Hanson both advocate (via different routes) the development of AGI systems that enter into reciprocal, empathic relationships with humans to *learn to care for us in practice*, querying us for feedback as to their success. I personally think these approaches should receive much more attention (and, afaict, RLHF loosely points in this direction).
Hi David, thanks for expanding the scope to dark traits.
The definition of D is insightful for speculations: "The general tendency to maximize one's individual utility — disregarding, accepting, or malevolently provoking disutility for others —, accompanied by beliefs that serve as justifications."
In other words, the "dark" core is "carelessness" (rather than "selfishness").
I've hypothesized that a careless intelligent system pursuing a careless goal should be expected to exhibit dark traits (increasingly so in proportion to its intelligence, albeit with increased refinement, too). A system would simply be Machiavellian in pursuit of a goal that doesn't involve consensual input from other systems... Some traits may involve the interplay of D with the way the human mind works 😉🤓.
Reflecting on this implies that a "human-controlled AGI in pursuit of a careless goal" would still need to be reined in compared with an authentically caring AGI (and corresponding goals).
Hi,
We may actually disagree on more than was apparent from my above post..!
Offline, we discussed how people's judgments vary depending on whether they've been reflecting on death recently or not. To me, it often seems as if our views on these topics can be majorly biased by personal temperaments. There could be a correlation between general risk tolerance and avoidance? Dan Faggella has an Intelligence Trajectory Political Matrix with two dimensions: authoritarian ↔ libertarian and bio-conservative ↔ cosmist/transhuman. I'm probably around C2 (thus leading to being more d/acc or BGI/acc than e/acc? 😋).
How to deal with uncertainty seems to be another source of disagreement. When is the uniform prior justified? I grew up with discussions about the existence of God: "well, either he exists or he doesn't, so 50:50!" But which God? So now the likelihood of there being no God goes way down! Ah, ah, but what about the number of possible universes in which there are no Gods? Perhaps the likelihood of any Gods goes way down now? In domains where there's uncertainty as to how to even partition up the state space, it could be easy to fall for motivated reasoning by choosing a partition that favors one's own prior judgments. A moral non-cognitivist would hold that moral claims are neither true nor false, so assigning 50% to moral claims would be wrong. Even a moral realist could assert that not every moral claim needs to have a well-defined truth value.
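The partition point can be shown with a few lines of arithmetic. This is a toy sketch: the hypothesis labels ("god A", "godless universe 1", and so on) are hypothetical placeholders, and the only thing it demonstrates is that a "uniform" prior gives different answers to the same question depending on how the possibilities are carved up.

```python
from fractions import Fraction


def uniform_prior(hypotheses):
    """Assign equal probability to every hypothesis in the partition."""
    n = len(hypotheses)
    return {h: Fraction(1, n) for h in hypotheses}


def p_no_god(prior):
    """Total probability of the hypotheses labeled as godless."""
    return sum(p for h, p in prior.items() if h.startswith("godless"))


# Three ways to partition the same question (labels are illustrative only).
coarse = uniform_prior(["godless", "some god"])
many_gods = uniform_prior(["godless", "god A", "god B", "god C"])
many_universes = uniform_prior(
    ["godless universe 1", "godless universe 2", "godless universe 3", "some god"]
)

for name, prior in [("coarse", coarse),
                    ("many gods", many_gods),
                    ("many universes", many_universes)]:
    print(f"{name}: p(no god) = {p_no_god(prior)}")
```

The same "indifference" yields 1/2, 1/4, or 3/4 depending on the partition, so the choice of partition is doing the work that the prior appears to be doing.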
Anecdotally, many people do not assign high credence to working with non-well-founded likelihood estimates as a reasoning tool.
Plenty of people caution against overthinking and that additional reflections don't always help as much as geeky folk like to think. One may come up with whole lists of possible concerns only to realize that almost all of them were actually irrelevant. Sometimes we need to go out and gain more experience to catalyze insights!
Thus there's plenty of room for temperamental disagreement about how to approach the topic before we even begin 🤓.
Our big-picture understanding also has a big effect. Joscha Bach said humanity will likely go extinct without AI anyway. He mentions supervolcano eruptions and large-scale war. There are also resource concerns in the long run, e.g., peak oil and depleting mineral supplies for IT manufacturing. Our current opportunity may be quite special prior to needing to enter a different sustainable mode of civilization! Whereas if you're happy to put off developing AGI for 250 million years until we get it right, it should be no surprise you take a different approach here. I was surprised to see that Bostrom also expresses concern that now people might be too cautious about AGI, leading to not developing AGI prior to facing other x-risks.
[And, hey, what if our universe is actually one that supports multiple incarnations in some whacky way? Should this change the decisions we make now? Probably some....]
I think the framework and ontology we use can also lead to confusion. "Friendly AI" is a poor term, for example, which Yudkowsky apparently meant to denote "safe" and "useful" AI. We'll see how "Beneficial AGI" fares. I think "AI Safety" is a misnomer and confusing catchall term. Speculating about what a generic ASI will do seems likely to lead to confusion, especially if excessive credence is given to such conclusions.
It's been a bit comedic to watch from the sidelines as people aim to control generic superintelligences before giving up as it seems intractable or infeasible (in general). I think trying to actually build such safety mechanisms can help, not just reflecting on it 😉🤓.
Of course, safety is good by definition, so any successful safety efforts will be good (unless it's safety by way of limiting our potential to have fun, develop, and grow freely 😛). Beneficial AGIs (BGI) are also good by definition, so success is necessarily good, regardless of whether one thinks consciously aiming to build and foster BGI is a promising approach.
On the topic of confusing ontologies, I think the "orthogonality thesis" can cause confusion and may bias people toward unfounded fears. The thesis is phrased as an "in principle possibility" and then used as if orthogonality is the default. A bit of a sleight-of-hand, no? As you mentioned, the thesis doesn't rule out a correlation between goals and intelligence. The "instrumental convergence thesis" that Bostrom also works with itself implies a correlation between persistent sub-goals and intelligence. Are we only talking about intelligent systems who slavishly follow single top-level goals where implicit sub-goals are not worth mentioning? Surely not. Thus we'd find that intelligence and goals are probably not orthogonal, setting theoretical possibilities aside. Theoretically, my soulmate could materialize out of thin air in front of me -- very low likelihood! So the thesis is very hard to agree with in all but a weak sense that leaves it as near meaningless.
Curiously, I think people can read too much into instrumental convergence, too, when sketching out the endless Darwinian struggle for survival. What if AGIs and ASIs need to invest exponentially little of their resources in maintaining their ongoing survival? If so, then even if such sub-goals will likely manifest in most intelligent systems, it's not such a big concern.
The Wikipedia page on the Instrumental Convergence idea stipulates that "final goals" will have "intrinsic value", which is an interesting conflation. This suggests that the final goals are not simply any logically formulated goal that is set into the AI system. Can any "goal" have intrinsic value for a system? I'm not sure.
The idea of open ended intelligence invites one to explore other directions than both of these theses 😯🤓.
As to your post on Balancing Safety and Waste, in my eyes, the topic doesn't even seem to be on "human safety from AI"! The post begins by discussing the value of steering the future of AI, estimating that we should expect better futures (according to our values) if we make a conscious effort to shape our trajectory. Of course, if we succeed in doing this effectively, we will probably be safe. Yet the topic is much broader.
It's worth noting that the greater good fallacy is a thing: trying to rapidly make big changes for great good can backfire. Ironically, this applies to both #PauseAI and #E/ACC folk. Keep calm and carry on 😎🤖.
I agree that 'alignment' is about more than 'control'. Nor do we wish to lock our current values and moral understanding into AGI systems. We probably wish to focus on an open-ended understanding of ethics. Kant's imperative is open-ended, for example: the rule replaces itself once a better one is found. Increasing human control of advanced AI systems does not necessarily guarantee positive outcomes. Likewise, increasing the agency and autonomy of AGIs does not guarantee negative outcomes.
One of the major points from Chi's post that I resonate with goes beyond "control is a proxy goal". Many of the suggestions fall under the header of "building better AGIs". That is, better AGIs should be more robust against various feared failure modes. Sometimes a focus on how to do something well can prevent harms without needing to catalog every possible harm vector.
Perhaps if we focused more on the kinds of futures we wish to live in and create, instead of on fear of dystopian surveillance, we wouldn't make mistakes such as the one in the EU AI Act, which bans emotion recognition in the workplace and in education, blocking out many potentially beneficial roles for AI systems. Not to mention, I believe work on empathic AI entering into co-regulatory relationships with people is likely to bias us toward beneficial futures, too!
I'd say this is an example of safety concerns possibly leading to harmful, overly strong regulations being passed.
(Mind uploads would probably qualify as "AI systems" under the act, too, by my reading. #NotALegalExpert, alas. If I'm wrong, I'll be glad. So please lemme know.)
As for a simple framework, I would advocate first looking at how we can extend our current frameworks for "Human Safety" (from other humans) to apply to "Human Safety from AIs". Perhaps there are many domains where we don't need to think through everything from scratch.
As I mentioned above, David Brin suggests providing certain (large) AI systems with digital identities (embedded in hardware) so that we can hold them accountable, leveraging the systems for reciprocal accountability that we already have in place.
Humans are often required to undergo training and certification before being qualified for certain roles, right? For example, only licensed teachers can watch over kids at public schools (in some countries). Extending certification systems to AIs probably makes sense in some domains. I think we'll eventually need to set up legal systems that can accommodate robot/AI rights and digital persons.
Next, I'd ask where we can bolster and improve our infrastructure's security in general. Using AI systems to train people against social engineering is cool, for example.
The case study of deepfakes might be relevant here. We knew the problem was coming, yet the issue seemed so far off that we weren't very incentivized to try to deal with it. Privacy concerns may have played a part in this reluctance. One approach to a solution is infrastructure for identity (or pseudonymity) authentication, right? This is a generic mechanism that can be helpful to prevent human-fraud, too, not just AI-fraud. So, to me, it seems dubious whether this should qualify as an "AI Safety" topic. What's needed is to improve our infrastructure, not to develop some special constraint on all AI systems.
As an American in favor of the right to free speech, I hope we protect the right to the freedom of computation, which in the US could perhaps be based on free speech? The idea of compute governance in general seems utterly repulsive. The fact that you're seriously considering such approaches under the guise of "safety" suggests there are deep underlying disagreements prior to the details of this topic. I wonder if "freedom of thought" can also help us in this domain.
The idea to develop AGI systems with "universal loving care" (which is an open-ended 'goal') is simple at the high-level. There's a lot of experimental engineering and parenting work to do, yet there's less incentive to spend time theorizing about some of the usual "AI Safety" topics?
I'm probably not suited for a job in the defense sector where one needs to map out all possible harms and develop contingency plans, to be honest.
As a framework, I'd suggest something more like the following:
a) How can we build better generally intelligent systems? -- AGIs, humans, and beyond!
b) What sorts of AGIs would we like to foster? -- diversity or uniformity? Etc ~
c) How can we extend "human safety" mechanisms to incorporate AIs?
d) How can we improve the security and robustness of our infrastructure in the face of increasingly intelligent systems?
e) Catalog specific AI-related risks to deal with on a case-by-case basis.
I think that monitoring the development of the best (proto)-AGI systems in our civilization is a special concern, to be honest. We probably agree on setting up systems to transparently monitor their development in some form or another.
We should probably generalize from "human safety" to, at least, "sentient being safety". Of course, that's a "big change" given our civilizations don't currently do this so much.
In general, my intuition is that we should deal with specific risks closer to the target domain and not by trying to commit mindcrime by controlling the AGI systems pre-emptively. For example, if a certification program can protect against domain-specific AI-related risks, then there's no justification for limiting the freedom of AGI systems in general to "protect us".
What do you think about how I'd refactor the framework so that the notion of "AI Safety" almost vanishes?