
I've got a sabbatical coming up in 2024, and, as a psych professor concerned about AI X-risk, I'd like to spend the year doing some empirical psychology research that helps inform AI safety and alignment.

What are some key questions on these topics that you'd like to see addressed by new behavioral sciences research, e.g. using surveys, experiments, interviews, or literature reviews?






Just brainstorming here, I have zero experience with actual psychology research:

- It might be interesting to try to identify the psychological traits that lead people to become EAs or alignment researchers, in order to aid future recruitment, education, and community-building efforts.

- This is a medium-term concern rather than one about alignment itself, but I would be interested to get a clearer picture of how "botpocalypse" concerns will play out. (See this ACX article for more detail, as well as the relevant recurring section of theZvi's AI newsletter -- basically, how worried should I be that a world full of convincing AI chatbots will totally destroy our culture's ability to have important collective conversations, to trust that we're talking to real humans, etc.?) Some psych experiments might be informative here, e.g. hosting a political debate on a controversial topic among 10 humans in a chatroom, versus among 5 humans and 5 AI instances in a chatroom. Can the AIs successfully derail the conversation? Or maybe people easily detect the AIs, but this nevertheless results in a lower-trust space and less productive conversation even between fellow humans?

- A related concern about medium-term AI is that people might find AIs much more persuasive or empathetic once they are capable of conversing via realistic Zoom-call-esque CGI video, rather than just chatting back and forth in text. It might be interesting to see how easily people are swayed by an argument for "AI rights", depending on whether it is presented as static text, as an interactive chat, as AI-synthesized audio, or as AI-deepfaked video.

- Research organizations like "Ought" have run some rudimentary AI alignment experiments with human subjects in place of the AI, like seeing how well a human non-expert can judge complex technical debates between a deceptive expert and an honest expert. Perhaps there are other psych-style experiments that could be set up to test various alignment concepts? For example, imagine a situation where an expert computer programmer is tasked by a novice programmer with writing some code that accomplishes a simple task X, but the expert programmer is also trying to sneak in "dangerous" functionality Y. You could experiment with equipping the novice programmer with various abilities (consulting another expert, running snippets of the code but not the whole thing, creating unit tests designed to detect dangerous functionality, etc.) and see what helps detect deception.

Then there are a bunch of things that are basically more "focus testing" than psych experiments:
- Figuring out which AI safety messages are most persuasive
- Polling people about their attitudes towards AI regulation
- Doing some research about different attitudes towards doom -- how have people reacted in similar situations, e.g. activists and researchers who get depressed about climate change, or people living during the Cuban missile crisis who thought nuclear war was probably imminent?

Jackson - great ideas; thanks so much for your thoughtful and creative suggestions here!

I'd love to see research into what I called "human safety problems" (or sometimes "human-AI safety"), fleshing out the idea more or giving some empirical evidence as to how much of a problem it really is. Here's a short description of the idea from AI design as opportunity and obligation to address human safety problems:

Many AI safety problems are likely to have counterparts in humans. AI designers and safety researchers shouldn’t start by assuming that humans are safe (and then try to inductively prove that increasingly powerful AI systems are safe when developed/trained by and added to a team of humans) or try to solve AI safety problems without considering whether their designs or safety approaches exacerbate human safety problems relative to other designs / safety approaches. At the same time, the development of AI may be a huge opportunity to address human safety problems, for example by transferring power from probably unsafe humans to de novo AIs that are designed from the ground up to be safe, or by assisting humans’ built-in safety mechanisms (such as moral and philosophical reflection).

I go into a bit more detail in Two Neglected Problems in Human-AI Safety.

Wei Dai - thanks for these helpful links! Will have a look. :)

There seems to be a nascent field in academia of using psychology tools/methods to understand LLMs, e.g. https://www.pnas.org/doi/10.1073/pnas.2218523120; it might be interesting to think about the intersection of this with alignment e.g. what experiments to perform, etc.

Maybe more on the neuroscience side, I'd be very excited to see (more) people think about how to build a neuroconnectionist research programme for alignment (I've also briefly mentioned this in the linkpost).

Another relevant article on "machine psychology" https://arxiv.org/abs/2303.13988 (interestingly, it's by a co-author of Peter Singer's first AI paper)

Something relevant to the moratorium / protests would be useful 😏

As long as state-of-the-art alignment attempts by industry involve eliciting human evaluations of actual or hypothetical AI behaviors (e.g. responses a chatbot might give to a prompt, as in RLHF), it seems important to understand the psychological aspects of such human-AI interactions. I plan to do some experiments on what I call collective RLHF myself, more from a social choice perspective (see http://amsterdam.vodle.it ), and can imagine collaborating on similar questions.

Jobst - yes, I think we need a lot more psych research on how to elicit the human values that AI systems are trying to align with. Especially given that some of our most important values either can't be articulated very well, or are too 'obvious' and 'common-sensical' to be discussed much, or are embodied in our physical phenotypes rather than articulated in our brains.

Jobst Heitzig (vodle.it)
This becomes particularly important in human feedback/input about "higher-level" or more "abstract" questions, as in OpenAI's deliberative mini-public / citizen assembly idea (https://openai.com/blog/democratic-inputs-to-ai).

I have some interest in cluster B personality disorders, on the theory that something (or several things) in human brains makes people tend to be nice to their friends and family. Whatever that thing is, it would be nice to understand it better, because maybe we can put something like it into future AIs, assuming those future AIs have a sufficiently similar high-level architecture to the human brain, which I think is plausible.

And whatever that thing is, it evidently isn’t working in the normal way in cluster B personality disorder people, so maybe better understanding the brain mechanisms behind cluster B personality disorders would get a “foot in the door” in understanding that thing.

Sigh. This comment won’t be very helpful. Here’s where I’m coming from. I have particular beliefs about how social instincts need to work (short version), beliefs which I think we mostly don’t share—so an “explanation” that would satisfy you would probably bounce off me and vice-versa. (I’m happy to work on reconciling if you think it’s a good use of your time.) If it helps, I continue to be pretty happy about the ASPD theory I suggested here, with the caveat that I now think that it’s only an explanation of a subset of ASPD cases. I’m pretty confused on borderline, and I’m at a total loss on narcissism. There’s obviously loads of literature on borderline & narcissism, and I can’t tell you concretely any new studies or analysis that I wish existed but don’t yet. But anyway, if you’re aware of gaps in the literature on cluster B stuff, I’m generally happy for them to be filled. And I think there’s a particular shortage of “grand theorizing” on what’s going on mechanistically in narcissism (or at least, I’ve been unable to find any in my brief search). (In general, I find that “grand theorizing” is almost always helpfully thought-provoking, even if it’s often wrong.)

Steven - well, I think the Cluster B personality disorders (including antisocial, borderline, histrionic, and narcissistic disorders) are probably quite important to understand in AI alignment. 

Antisocial personality disorder (which is closely related to the more classical notion of 'psychopathy') seems likely to characterize a lot of 'bad actors' who might (mis)use AI for trolling, crime, homicide, terrorism, etc. It also provides a model of how we don't want AGIs to behave.
