Remmelt

551 karma

Bio

Program Coordinator of AI Safety Camp.

Sequences
3

Bias in Evaluating AGI X-Risks
Developments toward Uncontrollable AI
Why Not Try Build Safe AGI?

Comments
140

Topic contributions
3

We can then use vast amounts of compute to scale our efforts, and iteratively align superintelligence.

Worth reading:

No control method exists to safely contain the global feedback effects of self-sufficient learning machinery. What if this control problem turns out to be an unsolvable problem?

https://www.lesswrong.com/posts/xp6n2MG5vQkPpFEBH/the-control-problem-unsolved-or-unsolvable

See below a text I wrote 9 months ago (with light edits) regarding the limits of error correction in practice. It was one of 10+ attempts to summarise Forrest Landry's arguments, which culminated in this forum post 🙂

If you want to talk more, I'm also happy to have a call.
I realise I was quite direct in my comments. I don't want that to come across as rude. I really appreciate your good-faith effort here to engage with the substance of the post. We are all busy with our own projects, so the time you spent here is something I'm grateful for!

I want to make sure we maintain integrity in our argumentation, given what's at stake. If you are open to going through the reasoning step-by-step, I'd love to do that. Also understand that you've got other things going on.


~ ~ ~

4. Inequality of Monitoring

Takes more code (multiple units) to monitor local environmental effects of any single code unit.

We cannot determine the vast majority of microscopic side-effects that code variants induce and could get selected for in interaction with the surrounding environment. 

Nor could AGI, because of a macroscopic-to-microscopic mismatch: it takes a collection of many pieces of code, say of neural network circuits, to ‘kinda’ determine the innumerable microscopic effects that one circuit running on hardware has in interaction with all surrounding (as topologically connected) and underlying (as at lower layers of abstraction) virtualized and physical circuitry.

In turn, each circuit in that collection will induce microscopic side-effects when operated – so how do you track all those effects? With even more and bigger collections of circuits? It is logically inconsistent to claim that it is possible for internals to detect and correct (and/or predict and prevent) all side-effects caused by internals during computation.

Even if able to generally model and exploit regularities of causation across macroscopic space, it is physically impossible for AGI to track all side-effects emanating from their hardware components at run-time, for all variations introduced in the hardware-embedded code (over >10² layers of abstraction, starting lower than the transistor-bit layer), contingent on all possible (frequent and infrequent) degrees of inputs and on all possible transformations/changes induced by all possible outputs, via all possibly existing channels from and to the broader environment.

Note the emphasis above on interactions between the code substrate and the rest of the environment, from the microscopic level all the way up to the macroscopic level.
To quote Eliezer Yudkowsky: "The outputs of an AGI go through a huge, not-fully-known-to-us domain (the real world) before they have their real consequences. Human beings cannot inspect an AGI's output to determine whether the consequences will be good."

Q: What about scaling up capability so an AGI can track more side-effects simultaneously? 

Scaling the capability of any (superficially aligned) AI makes it worse-equipped at tracking all interactions between and with its internals. The number of possible interactions (hypothetically, if they were countable) between AI components and the broader environment would scale at minimum exponentially with a percentage-wise scaling of the AI's components.
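
As a toy illustration of that scaling claim (my own made-up count, not part of the original argument): even if you only tally which non-empty subsets of components could jointly interact with some environmental channel, the tally already grows exponentially in the number of components, so a ~10% increase in components multiplies the interaction space rather than adding to it.

```python
# Toy count (hypothetical, for illustration only): treat each non-empty
# subset of components as one potential joint interaction with the
# environment. This count alone grows exponentially with component count.

def possible_interaction_subsets(n_components: int) -> int:
    """Number of non-empty subsets of n components that could co-interact."""
    return 2 ** n_components - 1

for n in (100, 110, 121):  # scaling components up by ~10% per step
    print(f"{n} components -> ~{possible_interaction_subsets(n):.3e} subsets")
```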

Scaling interpretability schemes is counterproductive too, in that it leads researchers to miscalibrate even more on what general capabilities and degrees of freedom of interaction (eg. closed-loop, open-ended, autonomous) they can safely allow the interpreted ML architectures to scale to. If, for example, you were to scale up interpretation to detect and correct out any misaligned mesa-optimiser, the mesa-optimisers you are leaving to grow in influence are those successfully escaping detection (effectively deceiving researchers into miscalibrated beliefs). Same goes for other locally selected-for optimisers, which we will get to later.

 

5. Combinatorial Complexity of Machine Learning

Increasingly ambiguous to define & detect novel errors to correct at higher abstraction layers.

Mechanistic interpretability emphasizes first inspecting neural network circuits, then piecing the local details of how those circuits work into a bigger picture of how the model functions. Based on this macroscopic understanding of functionality, you would then detect and correct out local malfunctions and misalignments (before these errors overcome forward pass redundancies).

This is a similar exercise to inspecting how binary bits stored on eg. a server's hard drive are logically processed – to piece together how the architecture stack functions and malfunctions:

  1. Occasionally, a local bit flips (eg. induced by outside electromagnetic interference). 
    So you make redundant copies of the binary code to compare and correct against.
  2. At the packet layer, you find distortions in packets transmitted over wires to topologically adjacent hardware. You append CRC checksums to correct those errors.
  3. At the application layer, you find that a Trojan horse transmitted from adjacent hardware caused an application to malfunction. You add in virus detection signatures.
  4. At the layer of neural networks, trained through an application running on the server, you fear that more complex Trojan horses could infiltrate this layer too. 

Each time we scale the architecture to a new layer of abstraction, we discover that we need to solve a completely new set of problems. Over time, variance is introduced into code computed at this layer, and some of this variance may lead to malfunctions. That a mechanism corrects errors at a lower layer says little about how to detect and correct errors at the new higher layer.

The higher up the architecture stack you design for, the more ambiguous and multi-faceted in meaning is the 'error' to correct out. Mechanisms you design for a neural network must detect errors over a combinatorial explosion of possible causal connections in order to correct out all defined errors.
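
As a minimal sketch of that layering point (my own example, using Python's standard zlib CRC32): a checksum reliably flags an accidental bit flip at the packet layer, yet an intact malicious payload passes the same check, because 'malicious intent' is not an error that layer even defines.

```python
import zlib

# Packet-layer check: a CRC32 appended to a payload catches accidental bit flips.
payload = b"run_backup --target=/srv/data"   # hypothetical payload
sent_crc = zlib.crc32(payload)

corrupted = bytearray(payload)
corrupted[0] ^= 0x01                              # a single bit flips in transit
assert zlib.crc32(bytes(corrupted)) != sent_crc   # flip detected -> retransmit

# Application-layer 'error': a syntactically valid but malicious payload
# transmitted intact. The packet layer sees nothing to correct.
malicious = b"run_backup --target=/srv/data; rm -rf /srv/data"
received, received_crc = malicious, zlib.crc32(malicious)
assert zlib.crc32(received) == received_crc       # integrity check passes
print("CRC validates integrity at the packet layer; it says nothing about intent.")
```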

Now, there is no mechanism that makes a deep neural network receiving Internet data robust against more than a tiny fraction of possible Trojan backdoor attacks; the countermeasures that do exist are easily circumvented. Here, an adversary introduces variation into the input data that a model is training on, such that in later deployment a chosen (series of) inputs will trigger the model to behave out of line. The model is set to act like a perfect Manchurian Candidate. For the model's developers, under usual interactions, the model plainly serves some overall purpose. In parallel, the model is storing latent causal structure that, under particular interactions with the outside environment, causes outputs and cascading effects.

For the adversary, the model also serves another, conflicting purpose. In effect, they select for internal variance misaligned with the original developers’ purpose. Worse, if ‘random’ initialisation of internals was directed by the adversary, they could code in persistent misalignment that is undetectable – making it computationally infeasible for an interpreter with full access to internals to find a single input for which the model’s outputs differ from the outputs of an untampered copy (for details, see section 2.3 of the paper).



 

6. Delays in Measuring Changes

Time delay in measuring (and correcting) the extent of any misalignment detectable in the monitored (or simulated) changes caused by an internal code unit.

Any decision-theoretical premise of coherent unitary agency of future AGI fails to account for locally selected-for effects. It asserts a priori that any AGI would consistently and continuously operate as, and thus can be distilled and represented monolithically as, a coherent agentic unit. More precisely, that ‘AGI’ can be soundly represented as (a) single unit(s) that make(s) decisions over time based on (orderings of) preferences that are consistent with each other. 

Not so in practice: if AGI internals induce local effects that loop out of the control of higher-level regulation (in reference to eg. decided metric extents or numbers of end states), then lower-level behavior is inconsistent with higher-level decision-making. And even where internals detect & correct (or predict & prevent) internal inconsistencies, they can only do so discontinuously.

There is a time delay to inspecting, say, a new circuit or module. Take, for example, a scheme that measures the extent to which the variant, as computed in interaction with connected embedded code, causes external effects that diverge from (are misaligned with) reference values. But to measure, the code's detected/predicted effects must first be transformed to be commensurable (measured by a common unit standard against the reference) over dimensions learned to be semantically descriptive of mis/alignment (as eg. generalized from past feedback expressed by humans within the environmental contexts that they and/or the AI were operating in at the time).

Then, over the number of clock cycles this scheme took to detect the internal misalignment (let alone correct the misalignment, without breaking the system in another way), any nonlinear effects of the computed code variant may have looped outside the bounds of what the scheme can detect and correct for. The scheme’s mechanistic operations cannot predetermine what happens during the gap between inspecting code and measuring+correcting for misalignment. 
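
A toy simulation of that timing gap (my own sketch, with made-up numbers): an effect that compounds every clock cycle, against a monitor that only gets to measure and correct every ten cycles, can cross the monitor's correctable bound before the next inspection point arrives.

```python
# Toy model (made-up numbers): a compounding side-effect vs. a monitoring
# scheme that can only measure and correct once per `inspection_period`.

growth_per_cycle = 1.6     # nonlinear amplification of the effect per cycle
inspection_period = 10     # cycles the scheme needs to detect + measure
correctable_bound = 100.0  # beyond this, the effect has looped outside what
                           # the scheme can still correct for

effect = 1.0               # tiny initial side-effect of a code variant
for cycle in range(1, 31):
    effect *= growth_per_cycle
    if cycle % inspection_period == 0:       # the monitor finally gets to look
        if effect > correctable_bound:
            print(f"cycle {cycle}: effect {effect:.1f} already exceeds the "
                  f"correctable bound of {correctable_bound:.0f}")
            break
        effect = 1.0                         # corrected back to baseline
else:
    print("effect stayed within correctable bounds in this toy run")
```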

Q: But AGI is a fast global optimiser, so it must be fast enough at detecting or predicting the effects of local variants before any misaligned effects break outside AGI, right?

Any superficially human-aligned AGI running on digital hardware would take some huge multiple of a clock cycle to inspect (and, if misaligned, correct) every piece of internal code that could be running over the next period.

Q: Could AGI not pause computation of that internal code in the meantime?

AGI pausing computation of all internal code while inspecting/interpreting internals is not an option, because the inspector/interpreter is effectively topologically connected within the internals. Instead, AGI could continue computing non-varied code, including the interpreter, but how can one ascertain that no changes have actually been made to 'non-varied' hardware memory locations since the last inspection point through any physical pathway (e.g. quantum jumps between nanometer-spaced transistor gaps)? By inspecting the code: here basically requiring a massive parallel-processed, redundancy-based error checker that is perfectly accurate and stays so over millennia+ (also error-checking the error checker, and so on).

For the compute-paused portion of varied code, the catch is that the interpreter would not be able to simulate the full range of the new variants' effects without first computing them in interaction with connected code, as in deployment (variants interact with the interpreter in any case).

Finally, AGI must upgrade their interpreter to be somewhat able to interpret new layers of variants integrated into internals, which requires creating new variations on the interpreter itself.

^— All of this is to say ‘Indeed, AGI inspecting all internal code that could be running in a next period does take some huge multiple of a clock cycle, and that code needs to actually be running for inspection to be anything remotely close to sound.’

‘Built-in alignment’ does not work either, since this notion of ‘built-in’ fails to account for the malfunctioning or misalignment of variants that are introduced and newly connected up within the code pool over time. 

 

7. Computationally-Irreducible Causal Trajectories

Nonlinear feedback cycles can amplify a tiny local change into a large global divergence in the final conditions. 

Even if any effect starts microscopic in scope and small in magnitude, we cannot a priori rule out that it cascades into larger macroscopic effects. If that tiny 'side-effect' feeds into a chaotic system, found across eg. biological lifeforms and Internet networks, the minor change it causes in the initial conditions can get recursively amplified into much larger changes (vs. the non-amplified case) in the final conditions.
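
A standard illustration of this kind of sensitive dependence (not specific to AGI; the logistic map is just the textbook chaotic system): two trajectories that start one part in a billion apart typically end up macroscopically different within a few dozen iterations.

```python
# Logistic map in its chaotic regime (r = 3.9): a 1e-9 difference in the
# initial condition gets amplified into an order-one difference.

def logistic_trajectory(x0: float, r: float = 3.9, steps: int = 50) -> float:
    x = x0
    for _ in range(steps):
        x = r * x * (1.0 - x)
    return x

a = logistic_trajectory(0.200000000)
b = logistic_trajectory(0.200000001)   # perturbed by one part in a billion
print(f"after 50 steps: {a:.4f} vs {b:.4f} (difference {abs(a - b):.4f})")
```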

Any implicitly captured structure causing (repeated) microscopic effects does not have to have captured macroscopic regularities (ie. a natural abstraction) of the environment to run amok. Resulting effects just have to stumble into a locally-reachable positive feedback loop. 

It is dangerous to assume otherwise, ie. to assume that:

  • selected-for microscopic effects fizzle out and get lost within the noise-floor over time.
  • reliable mechanistic interpretation involves piecing together elegant causal regularities, natural abstractions or content invariances captured by neural circuits.

 


 

if you mean a feedback loop involving actions into the world and then observations going back to the AI,

Yes, I mean this basically.

i insist that in one-shot alignment, this is not a thing at least for the initial AI, and it has enough leeway to make sure that its single-action, likely itself an AI, will be extremely robust.

I can insist that a number can be divided by zero as the first step of my reasoning process. That does not make my reasoning process sound.

Nor should anyone here rely on you insisting that something is true as the basis of why machinery that could lead to the deaths of all current living species on this planet could be aligned after all – to be ‘extremely robust’ in all its effects on the planet.

The burden of proof is on you.

a one-shot aligned AI (let's call it AI₀) can, before its action, design a really robust AI₁ which will definitely keep itself aligned, be equipped with enough error-codes to ensure that its instances will get corrupted approximately 0 times until heat death

You are attributing a magical quality to error correction code, across levels of abstraction of system operation, that is not available to you nor to any AGI.

I see this more often with AIS researchers with pure mathematics or physics backgrounds (note: I did not check yours).

There is a gap in practical understanding of what implementing error correction code in practice necessarily involves.

The first time a physicist insisted that all of this could be solved with “super good error correction code”, Forrest wrote this (just linked that into the doc as well): https://mflb.com/ai_alignment_1/agi_error_correction_psr.html

I will also paste below my more concrete explanation for prosaic AGI:

it would make sure that the conditions on earth and throughout the universe are not up to selection effects, but up to its deliberate decisions. the whole point of aligned powerful agents is that they steer things towards desirable outcomes rather than relying on selection effects.

This presumes that AGI can do something which, as I tried to clarify in this post, a (superintelligent) AGI could actually not do. I cannot really argue with your reasoning except to point back at the post explaining why that is not a sound premise to base one's reasoning on.

Alignment of effects in the outside world requires control feedback loops.

Any formal alignment scheme implemented in practice will need to contend with the fact that functionally complex machinery (AGI) will be interacting with an even more complex outside world – a space of (in effect, uncountable) interactions that unfortunately cannot be completely measured, and then simply continue to be modelled, by the finite set of signal-processing AGI hardware components themselves. There is a fundamental inequality here with real practical consequences. The AGI will have to run some kind of detection and correction loop(s) so that its internal modelling and simulations are less likely to diverge from outside processes, at least over the short term.
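
A minimal sketch of what I mean by a detection-and-correction loop (my own toy example with made-up numbers): the internal model of even one outside variable drifts from the actual process, and each correction only arrives after the divergence has already happened.

```python
import random

random.seed(0)

# Toy detection-and-correction loop: an internal estimate of one outside
# variable drifts from the actual process; corrections arrive only at
# discrete measurement points, after the divergence has occurred.

outside = 10.0   # the actual outside process (not directly known internally)
model = 10.0     # the system's internal estimate of that process

for step in range(1, 21):
    outside += random.uniform(-1.0, 1.0)   # the outside world keeps changing
    if step % 5 == 0:                      # measurement + correction cycle
        divergence = outside - model       # only detected at this point
        model = outside                    # correct the internal estimate
        print(f"step {step}: model had diverged by {divergence:+.2f}")
```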

The question I’d suggest looking into is whether any explicit reasoning process that happens across the connected AGI components can actually ensure (top-down) that the iterative (chaotic) feedback of physical side-effects caused by interactions with those components are still aimed at ‘desirable outcomes’ or at least away from ‘non-desirable outcomes’.

Glad to read your thoughts, Ben.

You’re right about this:

  • Even if long-term AGI safety were possible, you would still have to deal with limits on modelling, and on consistently acting on, the preferences expressed by humans from their (perceived) context. https://twitter.com/RemmeltE/status/1620762170819764229

  • And with ensuring the system does not consistently represent the preferences of malevolent, parasitic or short-term human actors who want to misuse/co-opt the system through any attack vectors they can find.

  • And with the fact that the preferences of many possible future humans, and of non-human living beings, will not automatically get represented in a system that AI corporations have by default built to represent only currently living humans (preferably, those who pay).

A humble response to layers on layers of fundamental limits on the possibility of aligning AGI, even in principle, is to ask how we got so stuck on this project in the first place.

Nice, thanks for sharing.

The host, Jim Rutt, is actually the former chairman of the Santa Fe Institute, so he gets complexity theory (which is core to the argument, but not deeply understood in terms of its implications in the alignment community, so I tried conveying those in other ways in this post).

The interview questions jump around a lot, which makes it harder to follow.

Forrest’s answers on Rice's Theorem also need more explanation: https://mflb.com/ai_alignment_1/si_safety_qanda_out.html#p6

Many AI researchers have signed pledges not to develop lethal autonomous weapons (LAWs), such as ‘slaughterbots’.

Despite that, the US military has been investing billions in automating network-centric warfare over recent years.

Check out #KillCloud on Twitter: https://twitter.com/search?q=%23KillCloud&src=typed_query

amicus briefs from AI alignment, development, or governance organizations, arguing that AI developers should face liability for errors in or misuse of their products.

Sounds like a robustly useful thing to do to create awareness of the product liability issues of buggy spaghetti code.

Actually, there are many plaintiffs I'm in touch with (especially those representing visual artists, writers, and data workers) who need funds to pay for legal advice and to start class-action lawsuits (given that they would have to pay court fees if a case is unsuccessful).

A friend in AI Governance just shared this post with me.

I was blunt in my response, which I will share below:

~ ~ ~

Two cruxes for this post:

  1. Is aligning AGI to be long-term safe even slightly possible – practically, given default AI training and deployment scaling trends and the complexity of the problem (see Yudkowsky's list of AGI lethalities), or theoretically, given strict controllability limits (Yampolskiy) and uncontrollable substrate-needs convergence (Landry)?

If pre-aligning AGI not to cause a mass extinction is clearly not even slightly possible, then IMO splitting hairs about “access to good data that might help with alignment” is counterproductive.

  2. Is a “richer technological world” worth the extent to which corporations are going to automate away our ability to make our own choices (starting with our own data), the increasing destabilisation of society, and the toxic environmental effects of automating technological growth?

These are essentially rhetorical questions, but they cover the points I would ask someone who proposes desisting from collaborating with other groups who notice related harms and risks from corporations scaling AI.

To be honest, the reasoning in this post seems rather motivated without examination of underlying premises.

These sentences particularly:

“A world that restricts compute will end up with different AGI than a world that restricts data. While some constraints are out of our control — such as the difficulty of finding certain algorithms — other constraints aren't. Therefore, it's critical that we craft these constraints carefully, to ensure the trajectory of AI development goes well. Passing subpar regulations now — the type of regulations not explicitly designed to provide favorable differential technological progress — might lock us into bad regime.”

It assumes AGI is inevitable, and therefore we should be picky about how we constrain developments towards AGI.

It also implicitly assumes that continued corporate scaling of AI counts as positive “progress” – at least for the kind of world they imagine would result and want to live in.

The tone also comes across as uncharitable. It reads as if they are talking down to others whom they have not spent time carefully listening to, taking the perspective of, or paraphrasing the reasoning back to (at least, nothing about such attempts is mentioned in the post).

Frankly, we cannot let motivated techno-utopian arguments hold us back from taking collective action against exponentially increasing harms and risks (in both their scale and their local impacts). We need to work with other groups to gain traction.

~ ~ ~
