Basically all ideas, insights, and research about AI are potentially exfohazardous. At the very least, it's hard to know in advance when some idea, insight, or research result will actually make things better, especially in a world where building an aligned superintelligence (let's call this work "alignment") is much harder than building any superintelligence (let's call this work "capabilities"), where a lot more people are trying to do the latter than the former, and where those people have a lot more material resources.
Ideas about AI, let alone insights about AI, let alone research results about AI, should be kept to private communication between trusted alignment researchers. On lesswrong, we should focus on teaching people the rationality skills which could, in principle, help them figure out insights useful for building any superintelligence, but which are more likely to first give them the insight that building one is a bad idea.
For example, OpenAI has demonstrated that they're just gonna cheerfully head towards doom. If you give OpenAI, say, interpretability insights, they'll just use them to work towards doom faster; what you need is to either give OpenAI enough rationality to slow down (even just a bit), or at least not give them anything. To be clear, I don't think the people working at OpenAI know that they're working towards doom; a much more likely hypothesis is that they've memed themselves into not thinking very hard about the consequences of their work, and into feeling vaguely (and erroneously) optimistic about those consequences due to cognitive biases such as wishful thinking.
It's very rare that any research purely helps alignment, because any alignment design is a fragile target that is just a few changes away from being unaligned. There is no alignment plan which fails harmlessly if you fuck up implementing it, and people tend to fuck things up unless they try really hard not to (and often even if they do), and people don't tend to try really hard not to. This applies doubly to work that aims to make AI understandable or helpful, rather than aligned — a helpful AI will help anyone, and the world has more people trying to build any superintelligence (let's call those "capabilities researchers") than people trying to build an aligned superintelligence (let's call those "alignment researchers").
Worse yet: if focusing on alignment is correlated with higher rationality, and thus with being better at figuring out what you need to solve your problems, then alignment researchers are more likely than capabilities researchers to already have the ideas/insights/research they need, and thus publishing ideas/insights/research about AI is more likely to differentially help capabilities researchers. Note that this is another relative statement; I'm not saying "alignment researchers have everything they need", I'm saying "in general you should expect them to need less in the way of outside ideas/insights/research on AI than capabilities researchers do".
Alignment is a differential problem. We don't need alignment researchers to succeed as fast as possible; what we really need is for alignment researchers to succeed before capabilities researchers. Don't ask yourself "does this help alignment?", ask yourself "does this help alignment more than capabilities?".
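(To put the same point in symbols, as a rough sketch rather than a precise model, with the notation being my own: let $T_A$ be the sequential work-time still needed for alignment to succeed, and $T_C$ the time until someone builds an unaligned superintelligence. Then, roughly,

$$\text{things go well} \iff T_A < T_C.$$

Publishing something that shaves $\delta_A$ off $T_A$ and $\delta_C$ off $T_C$ turns the condition into $T_A - \delta_A < T_C - \delta_C$, which is an improvement only when $\delta_A > \delta_C$. The absolute sizes of the deltas don't change the sign of the effect, which is why "it only helps capabilities a little" isn't reassuring.)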
- "But superintelligence is so far away!" — even if this was true (it isn't) then it wouldn't particularly matter. There is nothing that makes differentially helping capabilities "fine if superintelligence is sufficiently far away". Differentially helping capabilities is just generally bad.
- "But I'm only bringing up something that's already out there!" — something "already being out there" isn't really a binary thing. Bringing attention to a concept that's "already out there" is an exfohazard if it's worse for people to think about that concept more often. In many contexts, the concept of AI is an exfohazard, because in some contexts it's better for the world if people to think a bit less often about AI, even though they're already familiar with the concept. The same applies often for situations where people say "this ship has sailed": often, it is the case that the ship has, in fact, less-than-maximally sailed, and every bit of sailing-it-a-bit-less helps. If a ship has 85% sailed, let's not bring that up to 87% sailed. No, not everyone is already maximally-efficiently allocating their focus to the concepts that would help them the most — in fact, barely anyone is, and controlling what already-out-there concepts people pay attention to is an important part of exfohazard policy.
- "But my ideas/insights/research is not likely to impact much!" — that's not particularly how it works? It needs to somehow be differenially helpful to alignment, which I think is almost never the case. There is nothing that makes differentially helping capabilities "fine if you're only differentially helping them a little bit". Differentially helping capabilities is just generally bad. Overall p(doom) depends, among other things, on many small impacts, so you can see this as "doing your part". But really, you should just go "is this the right direction?" and if the answer is not then "by how much is this the wrong direction" doesn't matter a whole lot. I, for example, will continue to take actions that direct capabilities researchers' attention away from concept that'd help them with capabilities, even if my impact is very small; not just because it'd have a bunch of impact if everyone in my reference class did this, but because my very small impact is still in the right direction.
- "But I'm explicitely labelling these ideas as scary and bad!" — and? Pointing at which things are powerful and thus they're scary still points people at the things which are powerful.
- "So where do I privately share such research?" — good question! There is currently no infrastructure for this. I suggest keeping your ideas/insights/research to yourself. If you think that's difficult for you to do, then I suggest not thinking about AI, and doing something else with your time, like getting into factorio 2 or something.
- "If nobody publishes anything, how will alignment get solved?" — sure, it's harder for alignment researchers to succeed if they don't communicate publicly with one another — but it's not impossible. That's what dignity is about. And "this is bad so I'll do the alternative" isn't really a plan: the alternative might be worse. My whole point is that, as bad as that situation would be, it'd be better than the status quo where people just casually post eg interpretability and other prosaic research on lesswrong or in papers. This is the case because alignment takes more sequential work-time, because it's harder.
Any small increment towards alignment-succeeding-before-capabilities helps. Even if there's a bunch of damage every day from people posting prosaic research on lesswrong and in papers, you can help by not making it worse. This isn't even a prisoner's dilemma; not publishing ideas/insights/research about AI gets you lower p(doom) — and thus, also, more LDT-value-handshake utility in worlds where we do survive.
So favor posting things that help people be more rational and make better judgments (such as the judgment to not work on capabilities). Favor posting things that help capabilities researchers realize that that's what they are: that their alignment plans won't pan out and they're really just doing capabilities research.
Or, at the very least, just shut up and post nothing, rather than posting ideas/insights/research about AI.