I've heard OpenAI employees talk about the relatively high amount of compute superalignment has (complaining superalignment has too much and they, employees outside superalignment, don't have enough). In conversations with superalignment people, I noticed they talk about it as a real strategic asset ("make sure we're ready to use our compute on automated AI R&D for safety") rather than just an example of safety washing. This was something Ilya pushed for back when he was there.
A brief overview of the contents, page by page.
1: most important century and hinge of history
2: wisdom needs to keep up with technological power or else self-destruction / the world is fragile / cuban missile crisis
3: unilateralist's curse
4: bio x-risk
5: malicious actors intentionally building power-seeking AIs / anti-human accelerationism is common in tech
6: persuasive AIs and eroded epistemics
7: value lock-in and entrenched totalitarianism
8: story about bioterrorism
9: practical malicious use suggestions
10: LAWs as an on-ramp to AI x-risk
11: automated cyberwarfare -> global destabilization
12: flash war, AIs in control of nuclear command and control
13: security dilemma means AI conflict can bring us to brink of extinction
14: story about flash war
15: erosion of safety due to corporate AI race
16: automation of AI research; autonomous/ascended economy; enfeeblement
17: AI development reinterpreted as evolutionary process
18: AI development is not aligned with human values but with competitive and evolutionary pressures
19: gorilla argument, AIs could easily outclass humans in so many ways
20: story about an autonomous economy
21: practical AI race suggestions
22: examples of catastrophic accidents in various industries
23: potential AI catastrophes from accidents, Normal Accidents
24: emergent AI capabilities, unknown unknowns
25: safety culture (with nuclear weapons development examples), security mindset
26: sociotechnical systems, safety vs. capabilities
27: safetywashing, defense in depth
28: story about weak safety culture
29: practical suggestions for organizational safety
30: more practical suggestions for organizational safety
31: Bing and Microsoft Tay demonstrate how AIs can be surprisingly unhinged/difficult to steer
32: proxy gaming/reward hacking
33: goal drift
34: spurious cues can cause AIs to pursue wrong goals/intrinsification
35: power-seeking (tool use, self-preservation)
36: power-seeking continued (AIs with different goals could be uniquely adversarial)
37: deception examples
38: treacherous turns and self-awareness
39: practical suggestions for AI control
40: how AI x-risk relates to other risks
41: conclusion
I don't think Redwood's project had identical goals, and would strongly disagree with someone saying it's duplicative.
I agree it is not duplicative. It's been a while, but if I recall correctly the main difference seemed to be that they chose a task that gave them an extra nine of reliability (started with an initially easier task) and pursued it more thoroughly.
I think I'm comparably skeptical of all of the evidence on offer for claims of the form "doing research on X leads to differential progress on Y,"
I think if we find that improvement of X leads to improvement on Y, then that's some evidence, but it doesn't establish that it's differential. If we find that improvement on X also leads to progress on thing Z that is highly indicative of general capabilities, then that's evidence against. If we find that it mainly affects Y but not other things Z, then that's reasonable evidence it's differential. For example, so far, transparency hasn't affected general capabilities, so I read that as evidence of differential technological progress. As another example, I think trojan defense research differentially improves our understanding of trojans; I don't see it making models better at coding or gaining new general instrumental skills.
I think common sense is too unreliable a guide when thinking about deep learning; deep learning findings and phenomena are often unintelligible even in hindsight (I still don't understand why some of my research papers' methods work). That's why I'd prefer empirical evidence. Empirical research claiming to differentially improve safety should demonstrate a differential safety improvement empirically.
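To make the X/Y/Z check above concrete, here is a minimal sketch of how one might tabulate it. Everything in it is an illustrative assumption rather than an established protocol: the function name `differential_progress`, the tolerance `tol`, the verdict strings, and the example scores are all hypothetical, and a real evaluation would also need noise estimates and more than one capabilities metric.

```python
from statistics import mean


def differential_progress(y_before, y_after, z_before, z_after, tol=0.01):
    """Compare the mean change in a safety metric Y with the mean change in a
    general-capabilities metric Z after applying intervention X.

    Each argument is a list of scores, one per model, in the same order.
    `tol` is a hypothetical threshold below which a change counts as "no effect".
    """
    dy = mean(after - before for before, after in zip(y_before, y_after))
    dz = mean(after - before for before, after in zip(z_before, z_after))

    if dy <= tol:
        verdict = "no measured improvement on Y"
    elif dz > tol:
        # Y improved, but so did a proxy for general capabilities.
        verdict = "evidence against differential progress"
    else:
        # Y improved while the general-capabilities proxy stayed flat.
        verdict = "evidence for differential progress"
    return dy, dz, verdict


# Hypothetical scores for three models, before and after intervention X.
y_before, y_after = [0.50, 0.52, 0.48], [0.60, 0.63, 0.58]
z_flat_before, z_flat_after = [0.70, 0.71, 0.69], [0.70, 0.71, 0.69]
z_up_before, z_up_after = [0.70, 0.71, 0.69], [0.75, 0.76, 0.74]

print(differential_progress(y_before, y_after, z_flat_before, z_flat_after)[2])
print(differential_progress(y_before, y_after, z_up_before, z_up_after)[2])
```

The point of the sketch is only that the claim is checkable: a paper asserting differential safety progress can report both deltas side by side.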
The failure of Redwood's adversarial training project is unfortunately wholly unsurprising given almost a decade of similarly failed attempts at defenses to adversarial examples from hundreds or even thousands of ML researchers. For example, the RobustBench benchmark shows the best known robust accuracy on ImageNet is still below 50% for attacks with a barely perceptible perturbation.
The better reference class is adversarially mined examples for text models. Meta and other researchers were working on similar projects before Redwood started doing that line of research. https://github.com/facebookresearch/anli is an example. (Reader: evaluate your model's consistency for what counts as alignment research--does this mean non-x-risk-pilled Meta researchers do some alignment research, if we believe RR project constituted exciting alignment research too?)
Separately, I haven't seen empirical demonstrations that pursuing this line of research can have limited capabilities externalities or result in differential technological progress. Robustifying models against some kinds of automatic adversarial attacks (1,2) does seem to be separable from improving general capabilities though, and I think it'd be good to have more work on that.
We recommend this article by an MIT CS professor which is partly about how creating a sustainable work culture can actually increase productivity.
This researcher's work attitude is only part of a spectrum. Many researchers find great returns working 80+ hours a week. Some labs differentiate themselves by having usual hours, but many successful labs have their members work a lot, and that works out well. For example, Dawn Song's students work a ton, and some other Berkeley grad students in other labs are intimidated by her lab's hours, but that's OK because her graduate students find that environment suitable. It'd be nice if this post were more specific about how much of the work culture discontent is about hours vs other issues.
Let me be clear: I find the Bay Area EA Community on AI risk intellectually dissatisfying and have ever since I started my PhD in Berkeley. Contribution/complaint ratio is off, ego/skill ratio is off, tendency to armchair analyze deep learning systems instead of having experiments drive decisions was historically off, intellectual diversity/monoculture/overly deferential patterns are really off.
I am not a "strong axiological longtermist" and weigh normative factors such as special obligations and, especially, desert.
The Bay Area EA Community was the only game in town on AI risk for a long time. I do hope AI safety outgrows EA.