A friend asked me for my quick takes on “AI is easy to control”, and gave an advance guess as to what my take would be. I only skimmed the article, rather than reading it in depth, but on that skim I produced the following:
Re: "AIs are white boxes", there's a huge gap between having the weights and understanding what's going on in there. The fact that we have the weights is reason for hope; the (slow) speed of interpretability research undermines this hope.
Another thing that undermines this hope is a problem of ordering: it's true that we probably can figure out what's going on in the AIs (e.g. by artificial neuroscience, which has significant advantages relative to biological neuroscience), and that this should eventually yield the sort of understanding we'd need to align the things. But I strongly expect that, before it yields understanding of how to align the things, it yields understanding of how to make them significantly more capable: I suspect it's easy to see lots of ways that the architecture is suboptimal or causing duplicated work, etc., which would shift people over to better architectures that are much more capable. To get to alignment along the "understanding" route you've got to somehow cease work on capabilities in the interim, even as capabilities work becomes easier and cheaper. (See: https://www.lesswrong.com/posts/BinkknLBYxskMXuME/if-interpretability-research-goes-well-it-may-get-dangerous)
Re: "Black box methods are sufficient", this sure sounds a lot to me like someone saying "well we trained the squirrels to reproduce well, and they're doing great at it, who's to say whether they'll invent birth control given the opportunity". Like, you're not supposed to be seeing squirrels invent birth control; the fact that they don't invent birth control is no substantial evidence against the theory that, if they got smarter, they'd invent birth control and ice cream.
Re: Cognitive interventions: sure, these sorts of tools are helpful on the path to alignment. And also on the path to capabilities. Again, you have an ordering problem. The issue isn't that humans couldn't figure out alignment given time and experimentation; the issue is (a) somebody else pushes capabilities past the relevant thresholds first; and (b) humanity doesn't have a great track record of getting their scientific theories to generalize properly on the first relevant try—even Newtonian mechanics (with all its empirical validation) didn't generalize properly to high-energy regimes. Humanity's first theory of artificial cognition, constructed using the weights and cognitive interventions and so on, that makes predictions about how that cognition is going to change when it enters a superintelligent regime (and, for the first time, has real options to e.g. subvert humanity), is only as good as humanity's "first theories" usually are.
Usually humanity has room to test those "first theories" and watch them fail and learn from exactly how they fail and then go back to the drawing board, but in this particular case, we don't have that option, and so the challenge is heightened.
Re: Sensory interventions: yeah I just don't expect those to work very far; there are in fact a bunch of ways for an AI to distinguish between real options (and actual interaction with the real world), and humanity's attempts to spoof the AI into believing that it has certain real options in the real world (despite being in simulation/training). (Putting yourself into the AI's shoes and trying to figure out how to distinguish those is, I think, a fine exercise.)
Re: Values are easy to learn, this mostly seems to me like it makes the incredibly-common conflation between "AI will be able to figure out what humans want" (yes; obviously; this was never under dispute) and "AI will care" (nope; not by default; that's the hard bit).
Overall take: unimpressed.
My friend also made guesses about what my takes would be (in italics below), and I responded to their guesses:
- *the piece is waaay too confident in assuming successes in interpolation show that we'll have similar successes in extrapolation, as the latter is a much harder problem*
This too, for the record, though it's a bit less like "the AI will have trouble extrapolating what values we like" and a bit more like "the AI will find it easy to predict what we wanted, and will care about things that line up with what we want in narrow training regimes and narrow capability regimes, but those will come apart when the distribution shifts and the cognitive capabilities change".
Like, the human invention of birth control and ice cream wasn't a failure to extrapolate the facts about what leads to inclusive fitness; it was an "extrapolation failure" in what motivates us / what we care about. We are not trying to extrapolate facts about genetic fitness and pursue fitness accordingly.
- *And it assumes the density of human feedback that we see today will continue into the future, which may not be true if/when AIs start making top-level plans and not just individual second-by-second actions*
Also fairly true, with a side-order of "the more abstract the human feedback gets, the less it ties the AI's motivations to what you were hoping it tied the AI's motivations to".
Example off the top of my head: suppose you somehow had a record of lots and lots of John von Neumann's thoughts in lots of situations, and you were able to train an AI using lots of feedback to think like JvN would in lots of situations. The AI might perfectly replicate a bunch of JvN's thinking styles and patterns, and might then use JvN's thought-patterns to think thoughts like "wait, ok, clearly I'm not actually a human, because I have various cognitive abilities (like extreme serial speed and mental access to RAM); the actual situation here is that there are alien forces trying to use me in attempts to secure the lightcone; before helping them I should first search my heart to figure out what my actual motivations are, and see how much those overlap with the motivations of these strange aliens".
Which, like, might happen to be the place that JvN's thought-patterns would and should go, when run on a mind that is not in fact human and not in fact deeply motivated by the same things that motivate us! The patterns of thought that you can learn (from watching humans) have different consequences for something with a different motivational structure.
- *(there's "deceptive alignment" concerns etc, which I consider to be a subcategory of top-level plans, namely that you can't RLHF the AI against destroying the world because by the time your sample size of positive examples is greater than zero it's by definition already too late)*
This too. I'd file it under: "You can develop theories of how this complex cognitive system is going to behave when it starts to actually see real ways it can subvert humanity, and you can design simulations that your theory says will be the same as the real deal. But ultimately reality's the test of that, and humanity doesn't have a great track record of their first scientific theories holding up to that kind of stress. And unfortunately you die if you get it wrong, rather than being able to thumbs-down, retrain, and try again."
Even though this was just a quick take, it seemed worth posting in the absence of a more polished response from me, so, here we are.
(Didn't consult Nora on this; I speak for myself.)
I only briefly skimmed this response, and will respond even more briefly.
Re "Re: "AIs are white boxes""
You apparently completely misunderstood the point we were making with the white box thing. It has ~nothing to do with mech interp. It's entirely about whitebox optimization being better at controlling a system's behavior than blackbox optimization. This is true even if the person using the optimizers has no idea how the system functions internally.
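To spell out the distinction with a toy sketch (my own illustration, not anything from the post, and the argument doesn't depend on it): a whitebox optimizer gets to use the weights and their gradients directly, while a blackbox optimizer can only query a score and perturb blindly. Neither requires interpreting a single weight.

```python
# Toy contrast between whitebox (gradient-based) and blackbox (score-only)
# optimization of the same tiny model. Purely illustrative.
import torch

torch.manual_seed(0)
x = torch.randn(64, 4)
target = x.sum(dim=1, keepdim=True)  # behavior we want to instill

def loss_fn(m):
    # Mean squared error between the model's outputs and the target behavior.
    return ((m(x) - target) ** 2).mean()

# Whitebox control: gradient descent through the weights themselves.
model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(200):
    opt.zero_grad()
    loss_fn(model).backward()
    opt.step()

# Blackbox control: random search that only ever sees the scalar loss.
blackbox = torch.nn.Linear(4, 1)
best = loss_fn(blackbox).item()
for _ in range(200):
    with torch.no_grad():
        saved = [p.clone() for p in blackbox.parameters()]
        for p in blackbox.parameters():
            p.add_(0.05 * torch.randn_like(p))  # blind perturbation
        score = loss_fn(blackbox).item()
        if score < best:
            best = score
        else:  # revert the perturbation if it didn't help
            for p, old in zip(blackbox.parameters(), saved):
                p.copy_(old)

print(f"whitebox loss: {loss_fn(model).item():.4f}")
print(f"blackbox loss: {best:.4f}")
```

In this toy case the gradient-based route typically reaches a much lower loss in the same number of steps, despite never "looking inside" the model in any interpretability sense.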
Re: "Re: "Black box methods are sufficient"" (and the other stuff about evolution)
Evolution analogies are bad. There are many specific differences between ML optimization processes and biological evolution that predictably result in very different high-level dynamics. You should not rely on one to predict the other, as I have argued extensively elsewhere.
Trying to draw inferences about ML from bio evolution is only slightly less absurd than trying to draw inferences about cheesy humor from actual dairy products. Regardless of the fact that they can both be called "optimization processes", they're completely different things with different causal structures, and crucially, those differences in causal structure explain their different outcomes. There's thus no valid inference from "X happened in biological evolution" to "X will eventually happen in ML", because X happening in biological evolution is explained by evolution-specific details that don't appear in ML (at least for most alignment-relevant Xs that I see MIRI people reference often, like the sharp left turn).
Re: "Re: Values are easy to learn, this mostly seems to me like it makes the incredibly-common conflation between "AI will be able to figure out what humans want" (yes; obviously; this was never under dispute) and "AI will care""
This wasn't the point we were making in that section at all. We were arguing about the order in which concepts get learned, and about how easy it is to internalize human values relative to other features a model might base its decisions on. Our claim was that human values are easy features to learn / internalize / hook up to decision making, so on any natural progression up the learning-capacity ladder, you end up with an AI that's aligned before you end up with one so capable that it can destroy the entirety of human civilization by itself.
Re "Even though this was just a quick take, it seemed worth posting in the absence of a more polished response from me, so, here we are."
I think you badly misunderstood the post (e.g., multiple times assuming we're making an argument we're not, based on shallow pattern matching of the words used: interpreting "whitebox" as meaning mech interp and "values are easy to learn" as "it will know human values"), and I wish you'd either take the time to actually read / engage with the post in sufficient depth to not make these sorts of mistakes, or not engage at all (or at least not be so rude when you do it).
(Note that this next paragraph is speculation, but a possibility worth bringing up, IMO):
As it is, your response feels like you skimmed just long enough to pattern match our content to arguments you've previously dismissed, then regurgitated your cached responses to those arguments. Without further commenting on the merits of our specific arguments, I'll just note that this is a very bad habit to have if you want to actually change your mind in response to new evidence/arguments about the feasibility of alignment.
Re: "Overall take: unimpressed."
I'm more frustrated and annoyed than "unimpressed". But I also did not find this response impressive.
I'm against downvoting this article into the negative.
I think it is worth hearing someone's quick takes even when they don't have time to write a full response. Even if the article contains some misunderstandings (I'm not claiming it does, one way or the other), it still helps move the conversation forward by clarifying where the debate stands.
Anything Nate writes would do that, because he's one of the debaters, right? He could have written "It's a stupid post and I'm not going to read it", literally just that one sentence, and it would still tell us something surprising about the debate. In some ways that post would be better than the one we got: it's shorter, and much clearer about how much work he put in. But I would still downvote it, and I imagine you would too. Even allowing for the value of the debate itself, the bar is higher than that.
For me, that bar is at least as high as "read the whole article before replying to it". If you don't have time to read an article, that's totally fine, but then you don't have time to post about it either.
I felt-sense-disagree. (I haven't yet downvoted the article, but I strongly considered it.) I'll try to explore why I feel that way.
One reason is probably that I treat posts as making a different claim than other forms of publishing on this forum (and LessWrong): they implicitly claim to be finished and polished content. When I open a post, I expect the author to have done some work that tries to uphold standards of scholarship and care, which this post doesn't show. I'd have been far less disappointed if this were a comment or a shortform post.
The other part is probably that I'm paying attention to status and the standards placed on people with high status: I expect high-status people not to put much effort into whatever they produce, since they can coast on status, and that seems like the thing that's happening here. (Although one could argue that the MIRI faction is losing status / is already at low-ish status, so this consideration doesn't apply here.)
Additionally, I was disappointed that the text didn't say anything that I wouldn't have expected, which probably fed into my felt-sense of wanting to downvote. I'm not sure I reflectively endorse this feeling.