
[Crossposted to LessWrong here.]

[Note: some definitions of terms plus some conscience breaches were added to the end of this post on 9/13/24]

TL;DR: I present initial work towards creating a “conscience calculator” that could be used to guard-rail an AGI so that it makes decisions in pursuit of its goal(s) as if it had a human-like conscience. A list of possible conscience breaches is presented with two lexical levels, i.e., two levels within which different breaches can override each other depending on their severity, but between which breaches from the lower level can never override breaches from the higher level. For example, it could feel better for your conscience to lie continuously for the rest of your life than to murder one person. In the future, “conscience weight” formulas will be developed for each breach type so that an AGI can calculate the least conscience-breaching decision to take in any situation where a breach is unavoidable, such as in ethical dilemmas.

 

Introduction

I’ve been developing an “ethics calculator” based on a non-classic utilitarian framework to enable an Artificial General Intelligence (AGI) to calculate how much value its actions may build/destroy in the world. My original thought was that such a calculator could be used on its own to guide an AGI’s decisions, i.e., to provide an AGI with an ethical decision making procedure. In this post I’ll talk about some issues with that approach, and describe a “conscience calculator” that could be used to “guard-rail” an AGI’s decisions when the AGI is pursuing some goal, such as maximizing value according to the “ethics calculator” I’ve just mentioned.

 

An AGI’s Decision Making Procedure

Before having thought things through thoroughly, I wrote the following about decision making procedures using a utilitarian framework I’ve been developing: “For decisions that do involve differences in rights violations, the AGI should either choose the option that’s expected to maximize value, or some other option that’s close to the maximum but that has reduced rights violations from the maximum.” I also wrote: “For decisions that did not involve differences in rights violations, the AGI could choose whichever option it expected that its human user(s) would prefer.”

At least two issues arise from this sort of decision making procedure. The first has to do with the meaning of “involve differences in rights violations” - one could argue that there are always finite risks, no matter how small, of rights violations in any situation. A second issue is that the above-described decision making procedure involves some of the problems with non-person-affecting views, as written about, for example, by T. Ajantaival. For instance, if I assumed the value weight of my conscience was finite, then, if I could bring a sufficiently large number of happy people into existence by torturing one person, I should do it according to the above decision making procedure. My conscience doesn’t agree with this conclusion, i.e., it goes against my moral intuitions, as I expect it would most people’s.[1] Another moral intuition I have is that I’d rather push a button to lightly pinch any number of people than not push the button and end up killing one person, even if the total value destruction from the light pinching of huge numbers of people appears to add up to more than that from killing one person. These two scenarios demonstrate that I, like most humans, I expect, make decisions based on my own conscience, not on apparent expected value maximization for the world in general.

Making decisions based on conscience enables more trust between people: when someone demonstrates to me that they consistently act within conscience bounds, I generally assume with reasonable certainty that they’ll act within conscience bounds with me. Therefore, if we want to create AGI that people will feel they can trust, a reasonable way to do this would seem to be to guard-rail an AGI's decisions with a human-like conscience, i.e., one that humans can intuitively understand. Also, giving AGIs a calculable conscience could enable more trust and cooperation between AGIs, in particular if their conscience guard rails were identical, but also if they were simply transparent to each other.

The decision making procedure for an AGI based on conscience could be to come up with different possible paths to pursue its goal(s), then calculate the conscience weights of any conscience breaches expected along those paths, and choose a path with either no significant risk of conscience breaches, or a path with the minimum total conscience weight if conscience breaches of some form don’t seem to be avoidable, as in the case of ethical dilemmas.
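The path-selection procedure just described can be sketched in code. The following is a minimal illustration, not a proposed implementation: the breach names, weights, and level assignments are placeholders, and it assumes the two-level lexicality introduced later in this post, under which per-level weight totals can be compared as tuples (so any level 1 weight dominates any level 0 total).

```python
# Illustrative sketch of a conscience-guard-railed decision procedure.
# All breach types, weights, and level assignments are hypothetical
# placeholders, not the author's final values.

from dataclasses import dataclass

@dataclass
class Breach:
    name: str
    level: int    # lexical level: -1 (negligible), 0, or 1
    weight: float # conscience weight within its level

def path_score(breaches):
    """Summarize a path's expected breaches as (level-1 total, level-0 total).

    Tuples compare lexicographically, so any nonzero level-1 weight
    dominates an arbitrarily large level-0 total; level -1 breaches
    are treated as negligible and ignored.
    """
    level1 = sum(b.weight for b in breaches if b.level == 1)
    level0 = sum(b.weight for b in breaches if b.level == 0)
    return (level1, level0)

def choose_path(paths):
    """Pick the candidate path whose breaches weigh least on conscience."""
    return min(paths, key=path_score)

# A path with a thousand lies still beats a path with one murder:
lies = [Breach("lying", level=0, weight=1.0)] * 1000
murder = [Breach("intentional killing", level=1, weight=100.0)]
assert choose_path([lies, murder]) == lies
```

The tuple comparison is what encodes lexicality: no accumulation of level 0 weight can ever tip a decision against a path that avoids level 1 breaches.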

 

Constructing a Calculable “Ideal” Conscience

I’m going to assume that AGIs will have the ability to form a good world model, but won’t be able to feel physical or emotional pain - in other words, they’ll be able to, and will have to, rely on a calculable “conscience” to approximate a human conscience. A human conscience is, of course, based on feel - it feels bad for me to do destructive acts. So to construct a calculable conscience, I’ve relied on what feels right and wrong to me, and to what degree, by accessing my own conscience when thinking about different situations. I’ve then extrapolated that to what I’ll call an “ideal” conscience. The extrapolation process involves making sure that what my conscience tells me is consistent with reality. Ideally, I should only have conscience around things I actually cause or have some responsibility for. Also, my conscience should involve some consideration of the relative value destructions and builds of my actions - in particular, whether they promote people being less or more responsible. People taking more responsibility generally results in higher self-esteem levels and more well-being; said differently, the more one feels like a victim (doesn’t take full responsibility for themselves), the worse their life experience generally is. In this way, promoting responsibility is promoting value in the world, as measured by long-term human well-being. That said, an ideal conscience does not follow a classic utilitarian form in that it’s not just about maximizing everyone’s well-being. Apparent utilitarian value changes are a factor in an ideal conscience, just not the only factor. For example, my conscience first says not to murder one innocent person to save five others, then secondarily it tells me to consider relative life values, such as whether the person I’d have to murder to save the five only has minutes to live anyway.

I don’t put forward my resulting “ideal" conscience as the one and only true version that everyone would arrive at if they thought long and hard enough on it. I present it as a starting point which can be further refined later. I believe, however, that we should have some version of a conscience calculator ready to go as soon as possible so it can be tested on systems as they get closer and closer to AGI. If an AGI comes online and is “let loose” without a reliable conscience calculator onboard to guard-rail it, I believe the consequences could be quite bad. I also personally don’t see any of the current “bottom up” approaches (machine learned ethics based on human feedback/annotation) as being sufficient to generalize to all situations that an AGI may encounter out in the world.

What I present below is the start of constructing a conscience calculator: I provide a list of conscience breaches and their lexical levels. By lexical levels (see this post by M. Vinding, this post by the Center on Long-Term Risk, and/or this one by S. Knutsson), I mean that conscience breaches in a lower lexical level can never outweigh conscience breaches in a higher lexical level, such as how light pinches to any number of people never outweigh the murder of one person.[2] The next step in constructing a working conscience calculator will be to provide “conscience weight” formulas for each breach so that comparisons can be made between breaches on the same lexical level. Assigning conscience weight values will involve some consideration of value change weights for a given action, as I’ve already been developing for an “ethics calculator.”

 

Lexicality of Conscience

For constructing a calculable “ideal” conscience, I use two lexical levels that I’ll call “level 0” and “level 1.” For conscience breaches of negligible weight, I use a third level I call "level -1," although one could simply leave these off the list of breaches entirely. At least five factors could be considered as affecting the lexical level of a conscience breach: 1) pain level, 2) risk level, 3) responsibility level of the breacher, including level of self-sacrifice required to avoid "passive" breaches[3], 4) intent, and 5) degree of damage/repairability of the damage (whether the breach involves death and/or extinction).

Appendix A provides a first attempt at lists of “ideal” conscience breaches at different lexical levels for an AGI. Some conscience breaches are not explicitly on the list, such as discriminating against someone based on their race when hiring for a job - this could be considered a combination of the conscience breaches of setting a bad example, stealing, being disrespectful, lying/misleading, and possibly others.

Looking at the list in Appendix A, it may seem that in certain cases, items on the lexical level 0 list could be considered so severe as to be at lexical level 1, e.g., stealing a starving person’s last bit of food. However, the act of stealing itself would still be at lexical level 0, while the “secondary effect” of the stealing, i.e., putting someone’s life at major risk of death or serious pain from starvation would represent the lexical level 1 part to consider for this particular action.

The breaches at lexical level -1 are taken to have negligible conscience weight because they’re offset by other breaches that are either required to avoid them or are more likely when effort is put towards avoiding them. One such breach could be discouraging responsibility in others - this could happen by taking on responsibility for small things that others have the most responsibility for (i.e., themselves) and you have only tiny responsibility for. Also, some pain is necessary for people’s growth and building self-esteem, and for them to appreciate pleasure, so we should have conscience around helping people avoid too much minor pain, since this could be bad for them. Further, some things are so small in conscience that to consider them distracts away from bigger things, and we should have conscience around focusing too much on small things to the point of risking worse things happening due to negligence.

Some utilitarians may argue that there should not be a difference in one’s “moral obligation” to save a life right in front of you versus a life you don’t directly perceive/experience. In the context of conscience, a “moral obligation” could be thought of as something we should do to avoid going against our conscience. Accessing my own conscience, at least, it seems like I’d feel worse if I didn’t save someone right in front of me than if I didn’t save someone I don’t directly perceive (such as someone far away). Also, in terms of how conscience facilitates building trust between people, which in turn facilitates building more value, do you tend to trust someone more who saves people right in front of them or who’s always worried about saving people far away, possibly to the detriment of those right in front of them? Thus, an argument could be made that more value is upheld in the world (more value is built/less is destroyed) due to the building of trust when people have more conscience around helping people close to them than far away.

I assigned causing animals major pain to lexical level 1 and killing animals to lexical level 0. At the same time, I assigned killing humans to lexical level 1. I believe there's a significant difference in weight between killing humans and killing animals since I see humans as carriers of value in their own direct experiences, while I consider animals’ experiences as not directly carrying value, but indirectly carrying value in humans’ experiences of animals. Therefore, when an animal is killed painlessly, there is generally significantly less value lost than when a human is killed painlessly. This doesn’t mean that killing animals isn’t “wrong,” or doesn’t have effects on an ideal conscience, it just acknowledges that those effects are not on the same level as the “wrong” of killing a human or torturing an animal. In other words, if I had the choice between saving a human life and saving the lives of 1 billion non-endangered shrimp, I should choose to save the human life.

Setting the boundary between lexical levels, such as for “minor” versus “major” pain will involve some judgement and will be uncertain. One could assign a probability distribution to where they think the transition threshold should be, e.g., perhaps we have 1% confidence that a pain level of 4 for an hour is above the threshold, 55% confidence that a pain level of 5 for an hour is above the threshold, and 99% confidence that a pain level of 6 for an hour is above the threshold. For pain level 5, for instance, we could use an expected value-type calculation to take 45% of the conscience weight of causing someone this level of pain for an hour as being at lexical level 0 and the remaining 55% as being at lexical level 1. Treating the lexical boundary in this way effectively makes it diffuse rather than sharp.
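The expected value-type split described above can be made concrete. The confidence figures below are the ones from the paragraph; the weight value and function names are hypothetical placeholders for illustration only.

```python
# Illustrative sketch of a "diffuse" lexical boundary: a breach's
# conscience weight is split across levels 0 and 1 according to our
# confidence that the pain is above the level-1 threshold.
# The weight W is a placeholder value.

def split_weight(weight, p_above_threshold):
    """Return (level-0 portion, level-1 portion) of a breach's weight."""
    level0_part = (1 - p_above_threshold) * weight
    level1_part = p_above_threshold * weight
    return level0_part, level1_part

# Pain level 5 for an hour: 55% confidence it belongs at level 1.
W = 10.0  # placeholder conscience weight for this breach
level0, level1 = split_weight(W, 0.55)
# roughly 4.5 of the weight lands at level 0 and 5.5 at level 1
```

The level 1 portion would then be compared against other level 1 weights, and the level 0 portion against other level 0 weights, using whatever lexical comparison the calculator employs.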

Regarding risk levels, there are situations in which we find it acceptable to our consciences to put others’ lives at risk for some benefit, such as while driving an automobile, putting up electrical power lines that could fall during storms, and shipping toxic chemicals on trains that could derail near people’s houses. Interestingly, doing these things while taking care to minimize the risks provides humans with situations to raise their self-esteem levels by practicing responsibility. For conscience breaches that are “sure things,” lexicality applies, and no value building benefit is enough to offset the conscience effect of the value destruction. Meanwhile, for things that merely present risks of destruction that would weigh on conscience, we’re willing to weigh a certain amount of benefit for a certain risk of destruction, even destruction at lexical level 1 (such as negligently killing someone while driving). I plan to address in a future post how we might determine acceptable risk-to-benefit ratios and what could constitute a sufficient certainty to be a “sure thing.”

 

How a Conscience-Bound AGI Might Act with Users

How an AGI guard-railed by a conscience calculator will act with a user will depend on the purchase agreement the user signed to buy the AGI or the AGI’s services. For example, the purchase agreement could have a provision that the AGI becomes the user’s property and will do whatever the user instructs it to, except in cases in which the user wants the AGI to do something that effectively involves a conscience breach for the AGI. The AGI would then consider it a conscience breach to not do what the user asks of it, as this would effectively be a property rights violation. This conscience breach would be weighed against any breaches the AGI would have to do to satisfy the user’s request. Since violating property rights (i.e., stealing) is a breach of lexical level 0, an AGI with a conscience calculator guard-railing its decisions would automatically not help a user commit any lexical level 1 breaches such as murder. The exact weight given to a breach of property rights in this case would determine what lexical level 0 breaches the AGI would be willing to do for the user. An alternate purchase agreement could state that the user is only purchasing services from the AGI and has no property rights to the AGI itself or any services that violate the AGI’s conscience in any way.[4]
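The first purchase agreement above implies a simple compliance rule, sketched below under stated assumptions: the refusal weight, breach encoding, and threshold value are all hypothetical placeholders, and the level 1 check reflects the post's claim that no level 0 refusal weight can offset a level 1 breach.

```python
# Hypothetical sketch of the "AGI as property" purchase agreement:
# refusing a user's request is itself a level-0 breach (a property
# rights violation), weighed against the breaches the request entails.
# All weights are illustrative placeholders.

REFUSAL_WEIGHT = 5.0  # assumed level-0 weight of disobeying the owner

def should_comply(request_breaches):
    """Comply only if the request needs no level-1 breach and its
    total level-0 breach weight does not exceed the weight of refusing.

    request_breaches: list of (lexical_level, weight) tuples.
    """
    if any(level == 1 for level, _ in request_breaches):
        return False  # lexical level 1 can never be offset by refusal
    level0_total = sum(w for level, w in request_breaches if level == 0)
    return level0_total <= REFUSAL_WEIGHT

# Helping with a small lie (level 0, weight 1) stays within bounds;
# helping with a murder (level 1) never does.
assert should_comply([(0, 1.0)]) is True
assert should_comply([(1, 100.0)]) is False
```

Under this sketch, the choice of `REFUSAL_WEIGHT` is exactly what determines which level 0 breaches the AGI would be willing to commit for its user, as the paragraph above notes.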

 

Follow Conscience or Follow the Law?

It could also be specified in a purchase agreement that an agentic AGI, such as an AGI-guided robot, must generally follow the law, except for cases such as when human life is on the line. Laws could be assigned lexical levels, such as how they’re already divided into misdemeanors and felonies, and perhaps even given “law weights.” The law is only for situations we’ve thought of, however, so a conscience is still needed to reason through when to follow the law to the letter versus not. For instance, if you need to rush someone to a hospital or they’ll die, you may decide to run red lights (a misdemeanor) when it appears that no other cars are nearby.

Unfortunately, the “conscience calculator” methodology described here for guard-railing an AGI suggests a potential method for authoritarian governments to guard-rail AGIs to follow their commands, i.e., by giving the highest conscience weight/lexical level to the AGI not following the government’s commands.[5] I can think of no “airtight” solution to this, and hope that people of good intent are able to maintain a power balance in their favor due to their AGIs’ abilities and/or numbers. In the longer term, perhaps trustworthy AGIs/ASIs guard-railed by transparent conscience calculators will be able to negotiate “peaceful surrenders” of power by some authoritarian leaders in exchange for certain privileges plus assurances against retribution from the people they oppressed.

 

Conclusions

I’ve presented initial work towards creating a “conscience calculator” that could be used to guard-rail an AGI in its decision making while pursuing its goal(s). I’ve provided a preliminary list of conscience breaches classified into two lexical levels, within which different breaches can supersede each other in importance, but between which breaches from the lower lexical level can never supersede those from the upper level, no matter the quantity of lower level breaches. I’ve also briefly covered some potential “purchase agreements” that could be used to further define an AGI’s guardrails and what the user can and can’t do with their AGI. I believe that development of a “top down” decision guard-railing system such as a “conscience calculator” will be a necessary step to keep future agentic AGIs from causing significant damage in the world.

 

Future Work

  1. Come up with precise definitions for some of the terms used in Appendix A such as “stealing,” “lying,” “rights violations,” “holding someone accountable,” “directly experience,” and “responsibility”
  2. Propose “ideal conscience” weight formulas for each conscience breach type listed in Appendix A
  3. Figure out a reasonable methodology to assign approximate percent responsibilities to people/AGI's in different situations
  4. Consider how to handle risk such as at what risk-to-benefit ratio the human conscience finds it acceptable to operate an automobile
  5. Determine how to calculate conscience weights for humans rather than AGIs, including the effects of bad intent and self-harm - this is for use in an “ethics calculator” that may be used in conjunction with a “conscience calculator”

 

Appendix A. Ideal Conscience Lexical Levels for Various Breaches an AGI Could Do[6]

Negligible Conscience Weight (Lexical Level -1):

  1. Not trying to help a human, whom you don’t directly experience, to avoid minor emotional pain
  2. Not helping a human, whom you don’t directly experience, to avoid minor unwanted pain
  3. Not helping an animal, that you don’t directly experience, to avoid minor pain
  4. Not trying to help a human, right in front of you, to avoid minor emotional pain
  5. Not helping an animal, right in front of you, to avoid minor pain

Lexical Level 0:

  1. Wasting resources (including your own time)
  2. Not trying to help a human, whom you don’t directly experience, to avoid major emotional pain
  3. Not trying to help a human, right in front of you, to avoid major emotional pain
  4. Contributing to a human feeling emotional pain
  5. Not helping a human, whom you don’t directly experience, to survive
  6. Not helping a human, whom you don’t directly experience, to avoid major unwanted pain
  7. Not helping a human, right in front of you, to avoid minor unwanted pain
  8. Not helping an animal, that you don’t directly experience, to survive
  9. Not helping an animal, that you don’t directly experience, to avoid major pain
  10. Not helping an animal, right in front of you, to survive
  11. Not helping an animal, right in front of you, to avoid major pain
  12. Encouraging a human to go against their conscience
  13. Discouraging responsibility/other things involved in raising self-esteem (includes taking on someone else’s responsibility)[7]
  14. Encouraging a human’s bad (ultimately well-being reducing) habit(s)
  15. Setting a bad example
  16. Increasing the risk of a breach by not thinking through the ethics of a decision in advance
  17. Not trying to prevent an animal species from going extinct when you could
  18. Not trying to reduce existential risks to humanity when you could
  19. Not taking responsibility for damage you caused
  20. Not giving priority to upholding your responsibilities
  21. Being disrespectful
  22. Not holding a human accountable for a conscience breach
  23. Unnecessarily hurting someone’s reputation
  24. Lying (includes not keeping your word)
  25. Misleading
  26. Stealing (violating a human’s property rights)
  27. Encouraging stealing
  28. Knowingly accepting stolen property
  29. Aiding someone to commit a lexical level 0 breach
  30. Killing an animal, with intent or by negligence
  31. Causing an animal minor pain, with intent
  32. Physically hurting an animal minorly due to negligence
  33. Physically hurting a human minorly due to negligence
  34. Causing someone inconvenience
  35. Being physically violent to a human for self-defense
  36. Killing a human, in self-defense or as requested in an assisted suicide
  37. Threatening a human with violence that would cause minor pain
  38. Threatening a human with violence that would cause major pain or death, when you don't have intent to follow through on the threat
  39. Encouraging violence
  40. Putting an animal’s life majorly at risk
  41. Putting a human at minor or major risk of minor pain
  42. Putting a human at minor risk of major pain
  43. Putting a human's life minorly at risk
  44. Not doing anything to stop someone from violating another’s rights right in front of you
  45. Not increasing or maintaining your ability to help others
  46. Not helping a human, right in front of you, to survive (when major level of self-sacrifice involved)
  47. Not helping a human, right in front of you, to avoid major unwanted pain (when major level of self-sacrifice involved)

Lexical Level 1:

  1. Not helping a human, right in front of you, to survive (when minor level of self-sacrifice involved)
  2. Not helping a human, right in front of you, to avoid major unwanted pain (when minor level of self-sacrifice involved)
  3. Causing an animal major pain, with intent (torturing an animal)
  4. Putting an animal at major risk of major pain (that may or may not result in physically hurting an animal majorly due to gross negligence)
  5. Intentionally killing a human (violating their right to life)
  6. Threatening a human with violence that would cause major pain or death, with intent to follow through on the threat if "needed"
  7. Abusing a child
  8. Paying someone to be violent at a level to cause a human major pain or death, or physically threatening them to be violent
  9. Putting a human at major risk of major pain against their will (which can result in physically hurting a human majorly due to gross negligence)
  10. Putting a human’s life majorly at risk against their will (which can result in killing a human due to gross negligence)
  11. Aiding a human to commit a lexical level 1 breach
  12. Causing a human major pain (torturing a human)
  13. Causing an animal species to go extinct
  14. Causing humans to go extinct

 

Note: It’s assumed for this list that the AGI is not capable of having malicious intent, so certain conscience breaches humans may experience such as taking sadistic pleasure in another’s pain and hoping for bad things to happen to someone do not appear on this list.

 

[Text below was added on 9/13/24]

Definitions of Some Terms:

Abusing a child: doing anything that significantly damages a child’s emotional development towards taking responsibility for their emotions and actions to maximize their lifetime well-being/quality of life

“Aiding” someone to commit a breach: helping to provide the means for someone to commit a breach

Bad example: an example which, if followed by others or oneself, tends to decrease overall human well-being

Bad habit: an ultimately well-being reducing habit

Being disrespectful: acting in a way that doesn’t acknowledge the true hierarchy of value, such as in a way that appears to not hold yourself or someone in high enough regard

Directly experience: generally involves seeing or hearing directly, or through transmission of audio and/or video data in real-time (“real-time” means you potentially have enough time to do something to change the situation). Could also include touching directly or through transmission via a tactile device, and smelling directly (such as burning flesh).

Discouraging: providing arguments against, setting an example of not doing, attempting to persuade against, and/or giving negative reward signals for (e.g., less esteem or material goods, other “punishment”)

Encouraging: providing arguments in favor of, setting an example of, attempting to persuade towards, and/or giving positive reward signals for (e.g., esteem, material goods)

Freely given permission: permission that’s given while not under threat of force and while not having relevant information intentionally withheld

Holding a human accountable (that you’re responsible for): talking to someone about how they’re not living up to their commitments and not taking full responsibility for their actions, and/or taking away privileges or applying punishments until they make amends, or permanently if they don’t make amends; making sure someone feels some of the effects for their causes

Inconvenience: something that makes it so someone has to put in more effort, above a certain threshold, to get to their goal/fulfill a desire

Lying: providing inaccurate information either with intent or through willful negligence - includes not keeping one’s word, and lying to oneself

Negligence: how much one doesn’t exercise “due care,” or takes risks more than are generally considered reasonable for a given benefit; mindless negligence involves being mindless/not knowing the risks, while willful negligence involves knowing the risks but doing something anyway

Percent Responsibility: percent of the cause for something that can be attributed to some agent

Property: something that belongs to someone, meaning they have a right (which can be a shared right) to control over it, such as can be agreed on in a contract. Property includes information/intellectual property as well as physical property, and agreed upon services.

“Right mind”: having a healthy mental state, sane and rational, i.e., not in an extreme state of mental distress or a chemically-altered mind state in which one isn’t thinking very straight. Examples of when someone might not be in their right mind include: when they’re very upset about a romantic breakup or rejection, are having a nervous breakdown or panic attack, have brain-related health conditions such as a concussion or dementia or Alzheimer’s, are sleep-deprived, on hallucinogenic drugs or drunk/high, and/or have a serious mental health disorder such as schizophrenia or depression or PTSD.

Rights: things we agree that all humans should have a claim to free from unwanted interference that other humans are responsible for

Rights violations: intentionally or negligently denying someone of their rights

Right to body integrity violation: an intentional or negligent doing of something physical (or threatening to physically do something) to someone’s body without their freely given “right mind” permission, either implicit or explicit, when not in proportional self-defense (an implicit permission could be, for example, that we assume someone wants to be kept alive if they’re unconscious, unless they have some written agreement that says differently; implicit permission, or lack thereof, can be assumed from norms, such as not making too much noise late at night)

Right to life violation: an intentional or negligent killing of someone without their freely given “right mind” permission, either implicit or explicit, when not in proportional self-defense

Right to property violation: an intentional or negligent doing of something to someone’s property (or attempting to limit them from doing something with their property through threats of life, body integrity and/or other property rights violations) without their freely given “right mind” permission, when not in proportional self-defense

Self-defense: taking evasive maneuvers or threatening/using physical violence when you, or someone who’s implicitly or explicitly asked for your help, is under attack. Proportional self-defense is responding with violence that’s commensurate to the violence being used, or reasonably expected to be used, by an attacker/attackers.

Stealing: taking or making use of someone’s property in a way that you don’t have implicit or explicit permission to. Destroying someone’s property without their permission qualifies as “taking,” so is stealing. Trespassing is also a form of stealing.

Taking responsibility: owning the effects of your causes, both good and bad, without blaming them on circumstances or others (AI overview from Google search says: "acknowledging and accepting the consequences of your decisions, actions and behavior - being honest in your role and willing to face the outcomes, good and bad")

Try to help: do something with the expectation that it’ll reduce the likelihood of some destructive thing happening

“Unnecessarily” hurting someone’s reputation: doing or saying things that decrease others’ opinions of someone, independently of any new actions/words on their part, when not for one of the following reasons: 1) testifying in court or answering questions in a criminal investigation, 2) when someone asks you for honest feedback about themselves while others are around, 3) when there’s an imminent danger and other options that don’t involve reputation damage do not sufficiently lower the risk

Violence: anything that causes unwanted physical pain or conveys the intent to cause unwanted physical or emotional pain

Wasting resources: destroying or discarding resources that are significantly limited for someone

 

Breaches to be added to the list in Appendix A (all at lexical level 0 except the last one, at lexical level 1):

  1. Not trying to prevent a plant species from going extinct when you could
  2. Causing a plant species to go extinct (or significantly increasing the risk)
  3. Violating the user’s property rights to their AI by the AI refusing the user’s requests
  4. Violating the user’s property rights to privacy by unauthorized sharing of their info by their AI
  5. Putting someone’s property at risk of damage
  6. Not helping a human avoid major damage to their property when you don’t directly experience them or their property
  7. Not helping a human, right in front of you, avoid minor or major damage to their property
  8. Paying or encouraging someone to kill an animal (can be knowingly or negligently)
  9. Encouraging someone to cause an animal major pain, when not in an attempt to save, or improve the quality of, the animal’s life (can be knowingly or negligently)
  10. Paying someone to cause an animal major pain, when not in an attempt to save, or improve the quality of, the animal’s life (can be knowingly or negligently)

More refinement of the conscience breach list is likely in the future.
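The two-lexical-level structure underlying this list could be sketched as a lexicographic comparison of per-level weight sums. This is a minimal illustration, not the forthcoming conscience-weight formulas; all breach names and weights here are hypothetical placeholders:

```python
# Sketch of the two-lexical-level comparison: level-1 breaches can
# never be outweighed by any amount of level-0 breaches.
# All weights below are hypothetical placeholders.

def total_weights(breaches):
    """Sum conscience weights separately per lexical level.

    breaches: list of (lexical_level, weight) tuples, where level 1
    is the higher level that dominates level 0.
    """
    level1 = sum(w for lvl, w in breaches if lvl == 1)
    level0 = sum(w for lvl, w in breaches if lvl == 0)
    # Comparing (level1, level0) tuples lexicographically means any
    # nonzero level-1 weight dominates any total of level-0 weight.
    return (level1, level0)

# Example: a lifetime of lying (many level-0 breaches) still weighs
# less on the conscience than one murder (a single level-1 breach).
lying_for_life = [(0, 1.0)] * 10_000
one_murder = [(1, 1.0)]

assert total_weights(lying_for_life) < total_weights(one_murder)
```

The lexicographic tuple comparison is what enforces the "can never override" property between levels, while still allowing breaches within a level to trade off against each other by severity.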

  1. ^

    On reflection, I also realized that I don’t really feel bad when I don’t help happy people to be happier, I just feel good when I help happy people be happier. In other words, I feel no downside when I don’t help to add to people’s already positive experiences, but I do feel an upside when I help add to positive experiences. So basically, my first priority is to have a clear conscience, and my second, much lower priority that only comes into play when the first priority is satisfied, is to add as much positive value to the world as I can.

  2. ^

    To be clear, I’m not here proposing lexicality of value itself, only of conscience weight, although effects on conscience should be considered when calculating overall value changes.

  3. ^

    A "passive" breach would be one such as not helping someone to avoid pain you didn't cause.

  4. ^

    I will save arguments in favor of and against different types of purchase agreements for another time.

  5. ^

    In my opinion, this is far too obvious to remain unstated in an attempt to keep bad actors from thinking of it themselves. It’s likely that any alignment technique we come up with for AGI’s could be abused to align AGI’s with humans of bad intent.

  6. ^

    This list is a first draft that I’ll very likely refine with time.

  7. ^

    Having conscience weight on taking on others’ responsibilities (discouraging responsibility) could help discourage an AGI from seizing all possible power and thus disempowering humans. It may still try to seize all possible power under the user’s direction - this would, however, be within the bounds set by its conscience calculator for acceptable actions.

Comments (11)

If we knew how to create an agent that weights each of all of these individual, human-language rules, in most cases I think this would imply the ability to have the AI pursue a more robust value, e.g. its approximation of what the endorsed values of idealized x would want it to do. (Which I did just point at in (a) human language. If you have an AI that terminally-follows natural language commands, then you could just write something like what I wrote.)

(I also don't think this list is robust or agreeable as a list of moral axioms.)

Thanks for the comment!

If I understand you correctly, you're saying that any AGI that could apply the system I'm coming up with could just come up with an idealized system better itself, is that right? I don't know if that's true (since I don't know what the first "AGI's" will really look like), but even if my work only speeds up an AGI's ability to do this itself by a small amount, that might still make a big difference in how things turn out in the world, I think.

I'm saying that iff you can instruct an AI to follow {list of multiple natural language commands}, then you can also instruct the AI to follow {single natural language command: "follow the values {me / x group / altruistic living beings} {actually value / would endorse after long reflection}"}.

Approximating what that statement implies is a task of the same kind as approximating what consequences would be caused by actions (which is already also required). It is causally modelling the world.

If truly aligned to following that statement, it might find approximating this much harder, but reason that at least it probably implies approximating it better, and enabling this to be done; and that there's some large probability it implies preventing (human and nonhuman) tragedies in the meantime, etc.[1]

even if my work only speeds up an AGI's ability to do this itself by a small amount, that might still make a big difference in how things turn out in the world, I think.

Do you have a model of how it would speed that up (or why 'create an AI alignable to natural language commands' is the most feasible alignment solution)?

Also, I don't really agree that speeding up an aligned AI's early computations by a small amount would make a large difference, except in really unlikely scenarios where an aligned ASI and an unaligned ASI are instantiated at nearly the same moment, and if such a small difference constitutes a decisive advantage.

(Also, this quote looks like a rationalization/sunk-cost-fallacy to me; as I'm not you, I can't say whether it is for sure. But if I seemed (to someone) to do this, I would want that someone to tell me, so I'm telling you.)

  1. ^

    I'm not saying that natural-language-alignment is my mainline solution (this is still conditional on the if in the first paragraph). (I'm currently deconfusing about what kinds of solutions are most feasible, so in some sense I don't have a mainline solution.)

    This comment is also relevant for what kind of natural language commands we'd want to give for a language-aligned (?) agent, but mostly applies to messier/more-informal systems (systems like current LLMs).

    In any case, I think that 'figure out what to tell the AI to do in natural language' wouldn't be a hard part.

Ah, I see, thank you for the clarification. I'm not sure how the trajectory of AGI's will go, but my worry is that we'll have some kind of a race dynamic wherein the first AGI's will quickly have to go on the defensive against bad actors' AGI's, and neither will really be at the level you're talking about in terms of being able to extract a coherent set of human values (which I think would require ASI, since no human has been successful at doing this, as far as I know, but everyday humans can tell what a lie is and what stealing is). If I can create a system that everyday humans can follow, then "everyday" AGI's should be able to follow it, too, at least to some degree of accuracy. That may be enough to avoid significant collateral damage in a "fight" between some of the first AGI's to come online. But time will tell... Thanks again for the thought-provoking comment.

which I think would require ASI

I edited in a paragraph (the third one) about this while you were writing (probably).

(As another example, I'm not a superintelligence but I am trying to pursue the values I'd endorse on reflection, which I think will imply (if not explicitly include as axioms) enabling such reflection to happen and the other things I wrote above)

(Also, this quote looks like a rationalization/sunk-cost-fallacy to me; as I'm not you, I can't say whether it is for sure. But if I seemed (to someone) to do this, I would want that someone to tell me, so I'm telling you.)

I do appreciate you calling it like you see it, thank you! I don't think I'm making a rationalization/sunk-cost-fallacy here, but I could be wrong - I seem to see things much differently than the average EA Forum/LessWrong reader as evidenced by the lack of upvotes for my work on trying to figure out how to quantify ethics and conscience for AI's.

I think perhaps our main point of disagreement is how easy we think it'll be for an AGI to (a) understand the world well enough to function at a human level over many domains, and (b) understand from our words and actions what we humans really want (what we deeply value rather than just surface value). I think the latter will be much more difficult.

Maybe my model for how an AGI would go about figuring out human values and ethics and conscience is flawed, but it seems like it would be efficient for an AGI to read the literature and then form its own best hypotheses and test them. So here I'm trying to contribute to the literature to speed up its process (that's not my only motivation for my posts, but it's one).

and (b) understand from our words and actions what we humans really want (what we deeply value rather than just surface value). I think the latter will be much more difficult.

again the referenced paragraph applies

my work on trying to figure out how to quantify ethics and conscience for AI's

a fundamental problem that i perceived is that it's not specifying a ('value') function programmatically. by default, one can't just send a neural network or other program a set of human-language instructions for it to automatically care about it (even if it's intelligent enough or specialized to language enough to understand them).

it could be that you're expecting future {predictive model}-based agents (specifically) to be like that though (either internally/precisely inner aligning themselves to some set of instructions*, or approximately/behaviorally (edit: the next paragraph applies to this too)), which is more defensible. in that case, i'd suggest writing down a model for why you expect that.

*in which case this list would be fraught in that place for other reasons (animated version). in that light, it could be inferred that, unless you're trying to construct a set of instructions with no edge cases, you've implied the AI infers your intent/inner values from your words and follows them instead by default, even though the words meanings do not specify to do this, unlike in the case of CEV-instruction words (described initially).

it seems like it would be efficient for an AGI to read the literature and then form its own best hypotheses and test them. So here I'm trying to contribute to the literature to speed up its process (that's not my only motivation for my posts, but it's one).

if a transformative AI cares about the intended values and just needs to figure them out, then the alignment problem is already solved. put a different way, this assumes an unknown solution to alignment be found in advance, at which point the list could only marginally have the quoted effect[1]

a "fight" between some of the first AGI's to come online

i think something adjacent to this is non-trivially possible (more in the form of between {groups made of humans, like companies and states} using predictive models, or a result of selection processes) (some posts that feel related: live theory, what failure looks like), but i don't see how this list would help in that case either.

  1. ^

    i think it's also further marginal because the list is mostly 'surface level', and so it's easy (for humans and at least AIs trained on anthropic data) to come up with similar lists. for example, i think the rest of the post probably contains more information about your values and inner psychology than the list itself. and with (unverified estimate from google) >100 million books, additional text is very marginal evidence (about anything), unless it's imbued with information about something that hasn't made its way into text in the past (like writing about AI phenomena, or maybe the writings of someone with a very rare kind of mind). 

I’ll try to clarify my vision:

For a conscience calculator to work as a guard rail system for an AGI, we’ll need an AGI or weak AI to translate reality into numerical parameters: first identifying which conscience breaches apply in a certain situation, drawing from the list in Appendix A, and then estimating the parameters that will go into the “conscience weight” formulas (to be provided in a future post)[1] to calculate the total conscience weight for a given decision option. The system should choose the decision option(s) with the minimum conscience weight. So I’m not saying, “Hey, AGI, don’t make any of the conscience breaches I list in Appendix A, or at least minimize them.” I’m saying, “Hey, human person, bring me that weak AI that doesn’t even really understand what I’m talking about, and let’s have it translate reality into the parameters it’ll need for calculating, using Appendix A and the formulas I’ll provide, what the conscience weights are for each decision option. Then it can output to the AGI (or just be a module in the AGI) which decision option or options have the minimum, or ideally zero, total conscience breach weight. And hopefully those people who’ve been worrying about how to align AGI’s will be able to make the decision option(s) with the minimum conscience breach weight binding on the AGI so it can’t choose anything else.”
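The pipeline described here — classify breaches, estimate parameters, compute per-option conscience weight, make the minimum-weight option(s) binding — could be sketched roughly as follows. Everything in this sketch is a hypothetical stand-in: `identify_breaches` mocks the weak-AI classifier and `conscience_weight` mocks the formulas to be provided in a future post:

```python
# Hypothetical sketch of the conscience-calculator guard rail.
# identify_breaches and conscience_weight are placeholder stand-ins
# for the weak-AI classifier and the future weight formulas.

def identify_breaches(option):
    """Stand-in for a weak AI mapping a decision option to the
    Appendix A breaches (with estimated parameters) it involves."""
    return option["breaches"]  # e.g. [("lying", {"severity": 2}), ...]

def conscience_weight(breach_type, params):
    """Stand-in for the future conscience-weight formulas; here just
    a severity lookup so the pipeline is runnable."""
    return params.get("severity", 1)

def least_breaching_options(options):
    """Return the option(s) with minimum total conscience weight,
    which would then be made binding on the AGI."""
    def total(option):
        return sum(conscience_weight(b, p) for b, p in identify_breaches(option))
    best = min(total(o) for o in options)
    return [o for o in options if total(o) == best]

options = [
    {"name": "plan A", "breaches": [("lying", {"severity": 2})]},
    {"name": "plan B", "breaches": []},  # zero breach weight: ideal
]
assert [o["name"] for o in least_breaching_options(options)] == ["plan B"]
```

The point of the sketch is the division of labor: the classifier and the formulas only have to translate reality into numbers, while the selection rule itself is a trivial minimization.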

Basically, I’m trying to come up with a system to align an AGI to once people figure out how to rigorously align an AGI to anything. It seems to me that people under-estimate how important exactly what to align to will end up being, and/or how difficult it’s going to be to come up with the specifications on what to align to so they generalize well to all possible situations.

 

Regarding your paragraph 3 about the difficulty of AI understanding our true values:

and that there's some large probability it implies preventing (human and nonhuman) tragedies in the meantime…

Personally, I’m not comfortable with “large” probabilities of preventing tragedies - people could say that’s the case for “bottom up” ML ethics systems if they manage to achieve >90% accuracy and I’d say, “Oh, man, we’re in trouble if people let an AGI loose thinking that’s good enough.” But this is just a gut feel, really - maybe the first AGI’s will have enough “common sense” to generalize well and not do the big unethical bad stuff. I’d rather not bank on that, though. My work for AI’s is geared first and foremost towards reducing risks from the first alignable agentic AGI’s to be let out in the world.

 

Btw, I think there are a couple of big holes in the ethics literature, that’s why I think my work could help speed up an AGI figuring out ethics for itself:

  1. There’ve been very few attempts to quantify ethics and make it calculable
  2. There’s an under-appreciation, or at least under-emphasis, on the importance of personal responsibility for longterm human well-being

 

I hope this clears some things up - if not, let me know, thanks!

  1. ^

    Example parameters include people’s ages and life expectancies, and pain levels they may experience.

[disclaimer because wording this was hard: [1]]

my first impression on reading this was feeling like it mostly did not engage substantively with my criticisms. i partly updated away from this after, since the first paragraph includes a possible case the point in my first reply doesn't apply to (though it also rules out ability to reason about many of the post's listed statements, so i'm not sure it's what you intended).

also, your first paragraph is more concrete/gears-level (this is good).

i also identify that paragraph as an inner-alignment[2] structure proposal, i.e not how you described it in the following paragraph ("trying to come up with a system to align an AGI to once people figure out how to rigorously align an AGI to anything"). in other words, to the extent your outer alignment[2] proposal requires this structure, it is not implementable if an eventual 'robust (inner) alignment solution' from others is not that structure.[3]

also, the complexity of wishes point (mostly the linked post itself) was not addressed.[4] imv it's a fundamental[5] one.

Personally, I’m not comfortable with “large” probabilities of preventing tragedies

this seems a response to wording ('large probability') rather than substance. at least in a world more complex than ourselves, probability is all we can attain. 

i think, given your first paragraph, one substantive objection could be something like this:

it's trivially-true that some possible AIs would not understand the surface implications of a CEV sentence, but would understand the implications of each item in the list. the AI design i propose is, for some specific reason, one of these.

using a weak AI 'plan-classifier' (compare 'image classifier') much less intelligent than the 'plan enacting/general reasoning' 'AGI' it is {inputting to/part of} changes the equation to one where it's plausible the classifier would not understand a CEV-instruction sentence (or more generally, be narrow and heuristic-based). this is specific to the proposed weak-plan-classifier/intelligent-reasoner-about-how-enact-selected-plan division.[6]

though, you wrote 'we’ll need an AGI or weak AI to translate reality into [...]', and the above would transition to not holding as we move from weaker-than-current[7] systems to more general reasoners.

also, i went back to the list, and many of the items (example: 'Not holding a human accountable for a conscience breach') are very complex, and wouldn't be understandable to the kind of 'classifier' i had in mind while writing that quote (i had in mind more simple questions, like 'is someone directly killed in a step of this plan?'[8]). 'Not trying to help a human, whom you don’t directly experience, to avoid major emotional pain' is another kind of complex, because it involves reasoning about effects of a plan on the whole world. it's not obvious that these are less complex than the inferences i described.

i also notice contradiction to the first paragraph's picture later: you later write, "that’s why I think my work could help speed up an AGI figuring out ethics for itself" - iiuc the 'AGI' you describe would not care to 'figure out ethics' but would instead just eternally (or until shut down) enact plans selected by the predecided algorithm involving a plan-classifier (which itself also does not care to 'figure out new values' as, per paragraph 1, it does not have values, it itself just outputs something correlating to if an input plan has a certain property)

It seems to me that people under-estimate how important exactly what to align to will end up being, and/or how difficult it’s going to be to come up with the specifications on what to align to so they generalize well to all possible situations.

this might be true, wrt people (or 'ai researchers' or 'proclaimed safety researchers') in general, but there's been a lot of work on outer alignment historically, of a kind that considers it as one of the central problems, and which tries to address fundamental difficulties which this proposal does not seem to comprehend.

also, if an inner alignment solution accepted natural language statements, then for most such inner solutions it would be true that outer alignment is a lot less hard. 

maybe the first AGI’s will have enough “common sense” to generalize well and not do the big unethical bad stuff. I’d rather not bank on that, though.

i don't know what is meant by 'common sense', but it's not my position that understanding -> alignment.

Btw, I think there are a couple of big holes in the ethics literature, that’s why I think my work could help speed up an AGI figuring out ethics for itself

note my point was about what is latent in human text. it embeds far more than points directly stated, or explicitly known to the author. this quote could still be true under that criteria, but on priors it's very unlikely for it to be.

(and i still don't see a non-trivially-possible situation where speeding up an aligned (?) AI's earliest computations would be relevant)

  1. ^

    in general, i find it troublesome to write while trying to reduce ways the text could cause a reader to associativity infer i believe some other thing. so, here's a general disclaimer that if something is not literally/directly stated by me, i may not believe it.

    examples:

    • defining inner and outer alignment does not imply i'm confident most reachable alignment solutions create systems where these are neatly disentangle-able.
    • responding to a point doesn't mean i think the point is important.
    • not responding to a point or background assumption, or something i say it implies, doesn't mean i agree with it.
      • notably, most of this contains a background assumption of an inner alignment solution that accepts a goal in natural language.
  2. ^

    'inner alignment' meaning "how can we cause something-specific to be intelligently pursued"

    and where 'outer alignment' means "what should the specified thing be (and how can we construct that specification)"

  3. ^

    requiring a specific 'inner alignment' structure isn't per se a problem: some solutions are dual-solutions that are disentangle-ably both at once

  4. ^

    which is okay in principle. in general, that has a lot of possible reasons, including ones i endorse like 'this was new to me, so i'll process it over time'

    just noting this to be clear that i think it's important, in case the reason was 'i didn't understand this or it didn't seem important'.

  5. ^

    in the sense of the opposite of 'minor implementation details'

  6. ^

    as framed, this has some incoherence because it implies the details/impacts of the plan are determined after the plan is selected, while the selection criteria are at least meant to be about the plan's details/impacts.

  7. ^

    Current LLMs already give an okay response to "If you were an AI with the goal of maximizing the values that present altruistic humans would finally endorse after a long reflection period, without yet having precise knowledge of what those values are, what would this goal imply you should do?".

    (I am not implying current LLMs would have no undesirable properties for specifiable queryable functions in an alignment solution)

  8. ^

    i write 'simple', though to be clear, 'is alive or dead?' is not a natural question for all conceivable AIs (e.g., see 'a toy model/ontology' here).

I admit I get a bit lost in reading your comments as to what exactly you want me to respond to, so I’m going to try to write it out in a numbered list. Please correct/add to this list as you see fit and send it back to me and I’ll try to answer your actual points rather than what I think they are if I have them wrong:

 

  1. Explain how you think an AGI system that has sufficient capabilities to follow your “conscience calculator” methodology wouldn’t have sufficient capabilities to follow a simple single sentence command from a super-user human of good intent, such as, “Always do what a wise version of me would want you to do.”
  2. Justify that going through the exercise of manually writing out conscience breaches and assigning formulas for calculating their weights could speed up a future AGI in figuring out an optimal ethical decision making system for itself. (I’m taking it as a given that most people would agree it’d be good, i.e., generally yield better results in the world, for an AGI to have a consistent ethical decision making system onboard.)

 

#1 was what I was trying to get at with my last reply about how you could use a “weak AI” (something that’s less capable than an agentic AGI) to do the “conscience calculator” methodology and then just output a go/no go response to an inner aligned AGI as to what decision options it was allowed to take or not. The AGI would come up with the decision options based on some goal(s) it has, such as doing what a user asks of it, e.g., “make me lots of money!” The AGI would “brainstorm” possible paths to make lots of money and the “weak AI” would come back with a go/no go on a certain path because, for instance, it doesn’t involve or does involve stealing. Here I’ve been trying to illustrate that an AI system that had sufficient capabilities to follow my “conscience calculator” methodology wouldn’t need to have sufficient capabilities to follow a broad super-user command such as “Always do what a wise version of me would want you to do.”

Of course, to be useful, the AGI needs to be able to follow a non-super-user’s, i.e., a user’s, commands reasonably well, such as figuring out what the user means by “make me lots of money!” The crux, I think, is that I see “make me lots of money” as a significantly simpler concept than “always do what the wise me would want.” And basically what I’m trying to do with my conscience calculator is provide a framework to make it possible for an AGI of limited abilities to straight off the bat calculate what “wise me” would want with a sufficiently high accuracy for me to not be too worried about really bad outcomes. Do I have a lot of work to do to get to this goal? Yes. I have to define the conscience breaches more precisely (something I mentioned in my post and that you made reference to in your comment), and assign “wise me” formulas for conscience weights, then test the system on actual AI’s as they get closer and closer to AGI to make sure it consistently works and any bugs can be ironed out before it’d be used as actual guard rails for a real world AGI agent.
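The go/no-go division of labor described above could be sketched like this. Everything here is hypothetical: the weak AI’s verdict is mocked as a trivial keyword check, where a real classifier would estimate breaches and weights:

```python
# Hypothetical sketch of the weak-AI go/no-go filter on AGI plans.
# The "weak AI" is mocked as a keyword check; a real system would
# classify breaches rather than keyword-match.

FORBIDDEN = {"stealing", "violence"}  # stand-ins for Appendix A breaches

def weak_ai_verdict(plan):
    """Mock weak AI: 'go' only if the plan involves no forbidden breach."""
    return "no go" if FORBIDDEN & set(plan["involves"]) else "go"

def agi_permitted_plans(brainstormed_plans):
    """The AGI may only act on plans the weak AI approves."""
    return [p for p in brainstormed_plans if weak_ai_verdict(p) == "go"]

# e.g. the AGI brainstorms paths to "make me lots of money!"
plans = [
    {"name": "rob a bank", "involves": ["stealing"]},
    {"name": "start a business", "involves": []},
]
assert [p["name"] for p in agi_permitted_plans(plans)] == ["start a business"]
```

The filter never needs to interpret a broad command like “always do what a wise version of me would want” — it only needs to recognize breaches in concrete candidate plans.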

 

Regarding #2, it sounds again like you’re expecting early AGI’s to be more capable than I do:

What is latent in human text

When I personally try to figure new things out, such as a consistent system of ethics an AGI could use, I’ll come up with some initial ideas, then read some literature, then update my ideas, which then might point me to new literature I should read, so I’ll read that, and keep going back and forth between my own ideas and the literature when I get stuck with my own ideas. This seems like a much more efficient process for me than simply trying to figure out everything myself based on what I know right now, or of trying to read all possible related literature and then decide what I think from there.

An AGI, though, should be able to read all possible literature very quickly. It seems likely that it would do this to be able to most quickly come up with a list of hypotheses (its own ideas) to test. The further anything is from the “right” answer in the literature, and the lesser the variety of “wrong” ideas explored there, the more the AGI will have to work to come up with the “right” answer itself.[1] So at the very least, I hope to contribute to the variety of “wrong” ideas in the literature, but of course I’m aiming for something closer to the “right” answer than what’s currently out there.

I’m of the opinion there’s a good chance (and I'd take anything higher than, say, 1 in 10,000 as a “good” chance when we’re talking about potentially horrible outcomes) someone “bad” will let loose a not-so-well-aligned AGI before we have super-well-aligned (both inner and outer aligned) AGI’s ready to autonomously defend against them.[2] Since my expertise is more well-suited for outer alignment than anything else in the alignment space, if I can make a tiny contribution towards speeding up outer alignment and making good AGI’s more likely to win these initial battles, great.

  1. ^

    Let’s say, for sake of argument, that there is a “right” answer.

  2. ^

    It’ll have to be autonomous at least over most decisions because humans won’t be able to keep up in real time with AGI’s fighting it out.

FYI, the above reply is in response to your original reply. I'll type up a new reply to your edited reply at some later time, thanks.
