This other Ryan Greenblatt is my old account[1]. Here is my LW account.
Account lost to the mists of time and expired university email addresses.
I think the AI Notkilleveryoneism Memes ⏸️ (@AISafetyMemes) Twitter account reasonably often says things that feel at least close to crying wolf. (E.g., in response to our recent paper "Alignment Faking in Large Language Models", they posted a tweet which implied that we caught the model trying to escape in the wild. I tried to correct possible misunderstandings here.)
I wish they would stop doing this.
They are on the fringe IMO and often get called out for this.
The Long Term Future Fund (LTFF) also looks pretty good IMO, especially if you're less optimistic about policy.
I don't think non-myopia is required to prevent jailbreaks. A model can in principle not care about the effects of training on it and not care about longer-term outcomes while still implementing a policy that refuses harmful queries.
I think we should want models to be quite deontological about corrigibility.
This isn't responding to this overall point and I agree by default there is some tradeoff (in current personas) unless you go out of your way to avoid this.
(And, I don't think training your model to seem myopic and corrigible necessarily suffices as it could just be faked!)
This is an old thread, but I'd like to confirm that a high fraction of my motivation for being vegan[1] is signaling to others and myself. (So, n=1 for this claim.) (A reasonable fraction of my motivation is more deontological.)
I eat fish rarely as I was convinced that the case for this improving productivity is sufficiently strong.
I suppose the complement to the naive thing I said before is "80k needs a compelling reason to recruit people to EA, and needs EA to be compelling to the people to recruit to it as well; by doing an excellent job at some object-level work, you can grow the value of 80k recruiting, both by making it easier to do and by making the outcome a more valuable outcome. Perhaps this might be even better for recruiting than doing recruiting."
I think there are a bunch of meta effects from working in an object-level job:
I think people wouldn't normally consider it Pascalian to enter a positive total returns lottery with a 1 / 20,000 (50 / million) chance of winning?
And people don't consider it to be Pascalian to vote, to fight in a war, or to advocate for difficult-to-pass policy that might reduce the chance of nuclear war?
Maybe you have a different-than-typical perspective on what it means for something to be Pascalian?
A large reason to focus on opaque components of larger systems is that difficult-to-handle and existentially risky misalignment concerns are most likely to occur within opaque components rather than emerge from human-built software.
I don't see any plausible x-risk threat models that emerge directly from AI software written by humans? (I can see some threat models due to AIs building other AIs by hand such that the resulting system is extremely opaque and might take over.)
In the comment you say "LLMs", but I'd note that a substantial fraction of this research probably generalizes fine to arbitrary DNNs trained with something like SGD. More generally, various approaches that work for DNNs trained with SGD plausibly generalize to other machine learning approaches.