Ryan Greenblatt

A major reason to focus on opaque components of larger systems is that difficult-to-handle and existentially risky misalignment concerns are most likely to occur within opaque components rather than emerge from human-built software.

I don't see any plausible x-risk threat models that emerge directly from AI software written by humans? (I can see some threat models due to AIs building other AIs by hand such that the resulting system is extremely opaque and might take over.)

In the comment you say "LLMs", but I'd note that a substantial fraction of this research probably generalizes fine to arbitrary DNNs trained with something like SGD. More generally, various approaches that work for DNNs trained with SGD plausibly generalize to other machine learning approaches.

Are AI safetyists crying wolf?

Ryan Greenblatt

Here is that tweet.

Are AI safetyists crying wolf?

Ryan Greenblatt

I think the AI Notkilleveryoneism Memes ⏸️ (@AISafetyMemes) Twitter account reasonably often says things that feel at least close to crying wolf. (E.g., in response to our recent paper "Alignment Faking in Large Language Models", they posted a tweet which implied that we caught the model trying to escape in the wild. I tried to correct possible misunderstandings here.)

I wish they would stop doing this.

They are on the fringe IMO and often get called out for this.

It looks like there are some good funding opportunities in AI safety right now

Ryan Greenblatt

The Long Term Future Fund (LTFF) also looks pretty good IMO, especially if you're less optimistic about policy.

Alignment Faking in Large Language Models

Ryan Greenblatt

I don't think non-myopia is required to prevent jailbreaks. A model can in principle not care about the effects of training on it and not care about longer term outcomes while still implementing a policy that refuses harmful queries.

I think we should want models to be quite deontological about corrigibility.

This isn't responding to this overall point, and I agree that, by default, there is some tradeoff (in current personas) unless you go out of your way to avoid it.

(And, I don't think training your model to seem myopic and corrigible necessarily suffices as it could just be faked!)

Yanni Kyriacos's Quick takes

Ryan Greenblatt

This is an old thread, but I'd like to confirm that a high fraction of my motivation for being vegan[1] is signaling to others and myself. (So, n=1 for this claim.) (A reasonable fraction of my motivation is more deontological.)

[1] I rarely eat fish, as I was convinced that the case for this improving productivity is sufficiently strong.

Ben Millwood's Quick takes

Ryan Greenblatt

I suppose the complement to the naive thing I said before is "80k needs a compelling reason to recruit people to EA, and needs EA to be compelling to the people to recruit to it as well; by doing an excellent job at some object-level work, you can grow the value of 80k recruiting, both by making it easier to do and by making the outcome a more valuable outcome. Perhaps this might be even better for recruiting than doing recruiting."

I think there are a bunch of meta effects from working in an object-level job:

- The object-level work makes people more likely to enter the field, as you note. (Though this doesn't just route through 80k; it goes through a bunch of mechanisms.)
- You'll probably have some conversations with people considering entering the field from a slightly more credible position, at least if the object-level stuff goes well.
- Part of the work will likely involve fleshing stuff out so people with less context can more easily join/contribute. (True for most / many jobs.)

JWS's Quick takes

Ryan Greenblatt

I think people wouldn't normally consider it Pascalian to enter a positive total returns lottery with a 1/20,000 (50 per million) chance of winning?

And people don't consider it to be Pascalian to vote, to fight in a war, or to advocate for difficult-to-pass policy that might reduce the chance of nuclear war?

Maybe you have a different-than-typical perspective on what it means for something to be Pascalian?
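For concreteness, the odds in the lottery example above convert as follows. Here V (the prize) and c (the ticket price) are illustrative symbols introduced only for this sketch, not figures from the original comment; "positive total returns" is read as positive expected value.

```latex
p = \frac{1}{20{,}000} = \frac{50}{1{,}000{,}000} = 5 \times 10^{-5},
\qquad
pV - c > 0 \iff V > \frac{c}{p} = 20{,}000\,c
```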

That Alien Message - The Animation

Ryan Greenblatt

I agree that it is a poor analogy for AI risk. However, I do think it is a semi-reasonable intuition pump for why AIs that are very superhuman would be an existential problem if misaligned (and without other serious countermeasures).

JWS's Quick takes

Ryan Greenblatt

I think that the political activation of Silicon Valley is the sort of thing which could reshape American politics, and that Twitter is a leading indicator.

I don't disagree with this statement, but I also think the original comment is reading into Twitter way too much.

Ryan Greenblatt

Bio

Posts 4

Comments 191

Topic contributions 2
