RobertM

Software Engineer @ Lightcone Infrastructure
661 karma · Joined · Working (6-15 years)

Bio

LessWrong dev & admin as of July 5th, 2022.

Posts
1

Comments
42

Topic contributions
5

We don't know how to do that.  It's something that falls out of its training, but we currently don't know how to even predict what goal any particular training setup will result in, let alone aim for a specific one.

Answer by RobertM

The goal you specify in the prompt is not the goal that the AI is acting on when it responds.  Consider: if someone tells you, "Your goal is now [x]", does that change your (terminal) goals?  No, because those don't come from other people telling you things (or other environmental inputs)[1].

Understanding a goal that's been put into writing, and having that goal, are two very different things.

  1. ^

    This is a bit of an exaggeration, because humans don't generally have very coherent goals, and will "discover" new goals or refine existing ones as they learn new things.  But I think it's basically correct to say that there's no straightforward relationship between telling a human to have a goal, and them having it, especially for adults (i.e. a trained model).

I think that's strongly contra Eliezer's model, which is shaped something like "succeeding at solving the alignment problem eliminates most sources of existential risk, because aligned AGI will in fact be competent to solve for them in a robust way".  This does obviously imply something about the ability of random humans to spin up unmonitored nanofactories or push a bad yaml file.  Maybe there'll be some much more clever solution(s) for various possible problems?  /shrug

I do agree with this, in principle:

A system being ‘cognitively efficient wrt humanity’ doesn’t automatically entail ‘whatever goals the system has – and whatever constraints the system might otherwise face – the cognitively efficient system gets what it wants’.

...though I don't think it buys us more than a couple of points; I think people dramatically underestimate how high the ceiling is for humans, and that a reasonably smart human familiar with the right ideas would stand a decent chance of executing a takeover if placed into the position of an AI (assuming a speedup of cognition, plus whatever actuators current systems typically possess).

However, I think this is wrong:

LLMs distill human cognition

LLMs have whatever capabilities they have because those are the capabilities discovered by gradient descent which, given their architecture, improved their performance on the training task (next token prediction).  This task is extremely unlike the tasks represented in the environment where human evolution occurred, and the kind of cognitive machinery which would make a system effective at next token prediction seems very different from whatever it is that humans do.  (Humans are capable of next token prediction, but notably we are much worse at it than even GPT-3.)
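
For concreteness, by "next token prediction" I just mean the standard autoregressive cross-entropy objective (notation mine; a sketch of the generic setup, not any particular lab's exact recipe):

$$\mathcal{L}(\theta) = -\sum_{t} \log p_\theta(x_t \mid x_{<t})$$

Gradient descent adjusts $\theta$ to put more probability on whichever token actually came next; the "capabilities" are whatever internal machinery turned out to help with that.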

Separately, the cognitive machinery that represents human intelligence seems to be substantially decoupled from the cognitive machinery that represents human values (and/or the cognitive machinery that causes humans to develop values after birth), so if it turned out that LLMs did, somehow, share the bulk of their cognitive algorithms with humans, that would be a slight positive update for me, but not an overwhelming one, since I wouldn't expect an LLM to want anything remotely resembling what humans want.  (Most of the things that humans want are lossy proxies for things that improved inclusive genetic fitness in the ancestral environment, many of which generalized extremely poorly out of distribution.  What are the lossy proxies for minimizing prediction loss that a sufficiently intelligent LLM would end up with?  I don't know, but I don't see why they'd have anything to do with the very specific things that humans value.)

None of those obviously mean the same thing ("runaway AI" might sort of gesture at it, but it's still pretty ambiguous).  Intelligence explosion is the thing it's pointing at, though I think there are still a bunch of conflated connotations that don't necessarily make sense as a single package.

I think "hard takeoff" is better if you're talking about the high-level "thing that might happen", and "recursive self improvement" is much clearer if you're talking about the usually-implied mechanism by which you expect hard takeoff.

I think people should take a step back and take a bird's-eye view of the situation:

  • The author persistently conflates multiple communities: "tech, EA (Effective Altruists), rationalists, cybersecurity/hackers, crypto/blockchain, Burning Man camps, secret parties, and coliving houses".  In the Bay Area, "tech" is literally a double-digit percentage of the population.
  • The first archived snapshot of the website of the author's consultancy ("working with survivors, communities, institutions, and workplaces to prevent and heal from sexual harassment and sexual assault") was recorded in August 2022.
  • According to the CEA Community Health team: "The author emailed the Community Health team about 7 months ago, when she shared some information about interpersonal harm; someone else previously forwarded us some anonymous information that she may have compiled. Before about 7 months ago, we hadn’t been in contact with her."
    • This would have been late July 2022.
  • From the same comment by the CEA Community Health team: "We have emailed the author to tell her we will not be contracting her services."
    • Implied: the author attempted to sell her professional services to CEA.
  • The author, in the linked piece: "To be clear, I’m not advocating bans of the accused or accusers - I am advocating for communities to do more, for thorough investigations by trained/experienced professionals, and for accountability if an accusation is found credible. Untrained mediators and community representatives/liaisons who are only brought on for their popularity and/or nepotistic ties to the community, without thought to expertise, experience, or qualifications, such as the one in the story linked above (though there are others), often end up causing the survivors greater trauma." (Emphasis mine.)
  • The author: "In February 2023, I calculated that I personally knew of/dealt with thirty different incidents in which there was a non-trivial chance the Centre for Effective Altruism or another organization(s) within the EA ecosystem could potentially be legally liable in a civil suit for sexual assault, or defamation/libel/slander for their role/action (note: I haven’t added the stories I’ve received post-February to this tally, nor do I know if counting incidents an accurate measure (eg, accused versus accusers) also I’ve gotten several stories since that time; nor is this legal advice and to get a more accurate assessment, I’d want to present the info to a legal team specializing in these matters). Each could cost hundreds of thousands and years to defend, even if they aren’t found liable. Of course, without discovery, investigation, and without consulting legal counsel, this is a guess/speculative, and I can’t say whether they’d be liable or rise to the level of a civil suit - not with certainty without formal legal advice and full investigations."  (Emphasis in original.)
  • The author: "In response to my speculation, the community health team denied they knew of my work prior to August 2022, and that it was not connected to EA. Three white community health team members have strongly insinuated that I’ve lied and treated me – an Asian-American – in much the gaslighting, silencing way that survivors reporting rape fear being treated. Many of the women who have publicly spoken up about sexual misconduct in EA are of Asian descent. As I stated in the previous paragraph, I haven’t yet consulted with lawyers, but I personally believe this is defamatory. Additionally, the Centre and Effective Ventures Foundation are in headquartered in a jurisdiction that is much more harsh on defamation than the one I’m in."  (Emphasis in original.)
  • The author: "Unlike most of these mediators and liaisons, I have training/formal education, mentorship, and years of specific experience. If/When I choose to consult with lawyers about the events described in the paragraph above, there might be a settlement if my speculations of liability are correct (or just to silence me on the sexual misconduct and rapes I do know of). If (again, speculative) that doesn’t happen and we continue into a discovery process, I’m curious as to what could be uncovered." (Emphasis in original.)

 

I don't doubt that the author cares about preventing sexual assault and mitigating the harms that come from it.  They do also seem to care about something that requires dropping dark hints of potential legal remedies they might pursue, with scary-sounding numbers and mentions of venue-shopping attached.

Answer by RobertM

Relevant, I think, is Gwern's later writing on Tool AIs:

There are similar general issues with Tool AIs as with Oracle AIs:

  • a human checking each result is no guarantee of safety; even Homer nods. An extremely dangerous or subtly dangerous answer might slip through; Stuart Armstrong notes that the summary may simply not mention the important (to humans) downside to a suggestion, or frame it in the most attractive light possible. The more a Tool AI is used, or trusted by users, the less checking will be done of its answers before the user mindlessly implements it.
  • an intelligent, never mind superintelligent Tool AI, will have built-in search processes and planners which may be quite intelligent themselves, and in ‘planning how to plan’, discover dangerous instrumental drives and the sub-planning process execute them. (This struck me as mostly theoretical until I saw how well GPT-3 could roleplay & imitate agents purely by offline self-supervised prediction on large text databases—imitation learning is (batch) reinforcement learning too! See Decision Transformer for an explicit use of this.)
  • developing a Tool AI in the first place might require another AI, which itself is dangerous

Personally, I think the distinction is basically irrelevant in terms of safety concerns, mostly for reasons outlined by the second bullet-point above.  The danger lies in the fact that the "useful answers" you might get out of a Tool AI are those answers which let you steer the future to hit narrow targets (approximately described by Eliezer & such as "apply optimization power").

If you manage to construct a training regime for something that we'd call a Tool AI, which nevertheless gives us something smart enough that it does better than humans in terms of creating plans which affect reality in specific ways[1], then it approximately doesn't matter whether or not we give it actuators to act in the world[2].  It has to be aiming at something; whether or not that something is friendly to human interests won't depend on what name we give the AI.

I'm not sure how to evaluate the predictions themselves.  I continue to think that the distinction is basically confused and doesn't carve reality at the relevant joints, and I think progress to date supports this view.

  1. ^

    Which I claim is a reasonable non-technical summary of OpenAI's plan.

  2. ^

    Though note that even if whatever lab develops it doesn't do so, the internet has helpfully demonstrated that people will do it themselves, and quickly, too.

I think you are somewhat missing the point.  The point of a treaty with an enforcement mechanism which includes bombing data centers is not to engage in implicit nuclear blackmail, which would indeed be dumb (from a game theory perspective).  It is to actually stop AI training runs.  You are not issuing a "threat" which you will escalate into greater and greater forms of blackmail if the first one is acceded to; the point is not to extract resources in non-cooperative ways.  It is to ensure that the state of the world is one where there is no data center capable of performing AI training runs of a certain size.

The question of whether this would be correctly understood by the relevant actors is important but separate.  I agree that in the world we currently live in, it doesn't seem likely.  But if you in fact lived in a world which had successfully passed a multilateral treaty like this, it seems much more possible that people in the relevant positions had updated far enough to understand that whatever was happening was at least not the typical realpolitik.

2. If the world takes AI risk seriously, do we need threats?

Obviously if you live in a world where you've passed such a treaty, the first step in response to a potential violation is not going to be "bombs away!", and nothing Eliezer wrote suggests otherwise.  But having those options available ultimately bottoms out in the fact that your BATNA (best alternative to a negotiated agreement) is still to bomb the data center.

3. Don't do morally wrong things 

I think conducting cutting-edge AI capabilities research is pretty immoral, and in this counterfactual world that is a much more normalized position, even if the consensus is that the chance of x-risk absent a very strong plan for alignment is something like 10%.  You can construct the least convenient possible world such that some poor country has decided, for perfectly innocent reasons, to build data centers that will predictably get bombed, but unless you think the probability mass on something like that happening is noticeable, I don't think it should be a meaningful factor in your reasoning.  Like, we do not let people involuntarily subject others to Russian roulette, which is similar to the epistemic state of a world where 10% x-risk is a consensus position, and our response to someone actively preparing to play roulette, while declaring their intention to do so in order to get some unrelated real benefit out of it, would be to stop them.
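
To spell out the analogy (my arithmetic, and only an order-of-magnitude comparison, not a claim of equivalence): one pull of the trigger in Russian roulette kills with probability

$$P(\text{death}) = \tfrac{1}{6} \approx 17\%$$

which is in the same ballpark as a 10% consensus x-risk estimate.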

4. Nuclear exchanges could be part of a rogue AI plan

I mean, no, in this world you're already dead, and also a nuclear exchange would in fact cost the AI quite a lot, so I expect many fewer nuclear wars in worlds where we've accidentally created an unaligned ASI.

He proposes instituting an international treaty, which seems to be aiming for the reference class of existing treaties around the proliferation of nuclear and biological weapons.  He is not proposing that the United States issue unilateral threats of nuclear first strikes.

This advocates for risking nuclear war for the sake of preventing mere "AI training runs".  I find it highly unlikely that this risk-reward payoff is logical at a 10% x-risk estimate. 

All else equal, this depends on what increase in risk of nuclear war you're trading off against what decrease in x-risk from AI.  We may have "increased" risk of nuclear war by providing aid to Ukraine in its war against Russia, but if it was indeed an increase it was probably small and worth the trade-off[1] against our other goals (such as disincentivizing the beginning of wars which might lead to nuclear escalation in the first place).  I think approximately the only unusual part of Eliezer's argument is the fact that he doesn't beat around the bush in spelling out the implications.
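
To make the kind of trade-off I have in mind concrete (my notation; all numbers purely illustrative, not estimates I'm defending): if a policy increases the probability of nuclear war by $\Delta p_{\text{nuc}}$ and decreases the probability of AI catastrophe by $\Delta p_{\text{AI}}$, then on a crude expected-harm view it's worth it when

$$\Delta p_{\text{nuc}} \cdot H_{\text{nuc}} < \Delta p_{\text{AI}} \cdot H_{\text{AI}}$$

where $H$ is the badness of each outcome.  With illustrative numbers like $\Delta p_{\text{nuc}} = 0.1\%$ and $\Delta p_{\text{AI}} = 5\%$, and $H_{\text{AI}}$ at least comparable to $H_{\text{nuc}}$, the comparison isn't close.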

  1. ^

    Asserted for the sake of argument; I haven't actually demonstrated that this is true but my point is more that there are many situations where we behave as if it is obviously a worthwhile trade-off to marginally increase the risk of nuclear war.
