I can see a worldview in which prioritizing raising awareness is more valuable, but I don't see the case for believing "that we have concrete proposals". Or at least, I haven't seen any; could you link them, or explain what you mean by a concrete proposal?
My guess is that you're underestimating how concrete a proposal needs to be before you can actually muster political will behind it. For example, you don't just need "let's force labs to pass evals", you actually need to have solid descriptions of the evals you want them to pass.
I also think that recent events have been strong evidence in favor of my position: we got a huge amount of political will "for free" from AI capabilities advances, and the best we could do with it was to push a deeply flawed "let's all just pause for 6 months" proposal.
Hi Richard! I really appreciate this post. It struck me as highly informative, and I believe others will feel the same. To make it accessible to Spanish speakers, I have translated it into Spanish, since I'm sure they will also find it valuable.
Thanks again!
Thanks! I'll update it to include the link.
I'll piggyback on this (excellent) post to mention that we're working on some of the governance questions mentioned here at Rethink Priorities.
For example: @Onni Aarne is working on hardware-enabled mechanisms for compute governance (this touches on a bunch of stuff that comes up in Yonadav's paper, like tamper-evident logging), and I am working on China's access to ML compute post October export controls. @MichaelA is supervising those projects.
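Since "tamper-evident logging" may be unfamiliar, the core idea can be sketched in a few lines: each log entry commits to the hash of the previous entry, so any later edit or deletion of the history is detectable when the chain is re-verified. The snippet below is only a toy illustration of that idea, with made-up record fields; it is not a description of the hardware-enabled mechanisms Onni or Yonadav's paper actually propose.

```python
# Toy hash-chained log: editing or deleting an earlier entry changes every
# later hash, so tampering shows up on verification. Detecting truncation of
# the tail would additionally require anchoring the latest hash externally.
import hashlib
import json

def _entry_hash(prev_hash: str, record: dict) -> str:
    payload = prev_hash + json.dumps(record, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def append(log: list, record: dict) -> None:
    prev_hash = log[-1]["hash"] if log else "genesis"
    log.append({"record": record, "hash": _entry_hash(prev_hash, record)})

def verify(log: list) -> bool:
    prev_hash = "genesis"
    for entry in log:
        if entry["hash"] != _entry_hash(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True

log: list = []
append(log, {"chip_id": "A100-0001", "event": "training_job_start"})
append(log, {"chip_id": "A100-0001", "event": "training_job_end"})
assert verify(log)
log[0]["record"]["event"] = "idle"   # tamper with history...
assert not verify(log)               # ...and verification fails
```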
We're definitely happy to hear from others who are working on these (or related) things, are considering working on these things, or are simply interested in them! (You can reach any of us at <firstname>@rethinkpriorities.org.)
We expect to open a hiring round for another compute governance person soon™.
Our other projects are summarized in this two-pager, and some are also relevant to problems listed in this post.
Note: this comment is cross-posted on LessWrong.
Classification of AI safety work
Here I proposed a systematic framework for classifying AI safety work. This is a matrix, where one dimension is the system level:
A monolithic AI system, e.g., a conversational LLM
AGI lab (= the system that designs, manufactures, operates, and evolves monolithic AI systems and systems of AIs)
A cyborg, human + AI(s)
A system of AIs with emergent qualities (e.g., https://numer.ai/, but in the future, we may see more systems like this, operating on a larger scope, up to fully automatic AI economy; or a swarm of CoEms automating science)
A human+AI group, community, or society (scale-free consideration, supports arbitrary fractal nestedness): collective intelligence, e.g., The Collective Intelligence Project
The whole civilisation (civilisational intelligence)
Another dimension is the "time" of consideration:
Design time: research into how the corresponding system should be designed (engineered, organised): considering its functional ("capability", quality of decisions) properties, adversarial robustness (= misuse safety, memetic virus security), and security. AGI labs: org design and charter.
Manufacturing and deployment time: research into how to create the desired designs of systems successfully and safely:
AI training and monitoring of training runs.
Offline alignment of AIs during (or after) training.
AI strategy (= research into how to transition into the desirable civilisational state = design).
Designing upskilling and educational programs for people to become cyborgs is also here (= designing efficient procedures for manufacturing cyborgs out of people and AIs).
Operations time: ongoing (online) alignment of systems on all levels to each other, ongoing monitoring, inspection, anomaly detection, and governance.
Evolutionary time: research into how the (evolutionary lineages of) systems at the given level evolve long-term:
How the human psyche evolves when it is part of a cyborg
How humans will evolve over generations as cyborgs
How AI safety labs evolve into AGI capability labs :/
How groups, communities, and society evolve.
Designing feedback systems that don't let systems "drift" into undesired states over evolutionary time.
Considering the system property of flexibility of values (i.e., the property opposite to value lock-in, Riedel (2021)).
IMO, it (sometimes) makes sense to think about this separately from alignment per se. Systems could be perfectly aligned with each other but drift into undesirable states and not even notice this if they don't have proper feedback loops and procedures for reflection.
There would be 6*4 = 24 slots in this matrix, and almost all of them have something interesting to research and design, and none of them is "too early" to consider.
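To make the matrix concrete, here is a minimal sketch that enumerates its cells as a cross product; the level and time names are shortened paraphrases of the bullets above rather than canonical labels, and the example mapping at the end is just one of the classifications given below.

```python
# The 6 x 4 classification matrix written out as a cross product.
# Names are paraphrased from the lists above, not official terminology.
from itertools import product

SYSTEM_LEVELS = [
    "monolithic AI system",
    "AGI lab",
    "cyborg (human + AIs)",
    "system of AIs",
    "human+AI group / society",
    "whole civilisation",
]
TIMES = ["design", "manufacturing/deployment", "operations", "evolutionary"]

matrix = list(product(SYSTEM_LEVELS, TIMES))
assert len(matrix) == 6 * 4 == 24

# Example: file a research direction under one or more cells.
scalable_oversight = [("monolithic AI system", "manufacturing/deployment")]
```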
Richard's directions within the framework
Scalable oversight: (monolithic) AI system * manufacturing time
Mechanistic interpretability: (monolithic) AI system * manufacturing time, also design time (e.g., in the context of the research agenda of weaving together theories of cognition and cognitive development, ML, deep learning, and interpretability through the abstraction-grounding stack, interpretability plays the role of empirical/experimental science work)
Alignment theory: Richard phrases it vaguely, but his references primarily to MIRI-style work suggest he means mainly "(monolithic) AI system * design, manufacturing, and operations time".
Evaluations, unrestricted adversarial training: (monolithic) AI system * manufacturing, operations time
Threat modeling: system of AIs (rarely), human + AI group, whole civilisation * deployment time, operations time, evolutionary time
Governance research, policy research: human + AI group, whole civilisation * mostly design and operations time.
Takeaways
To me, it seems almost certain that many current governance institutions and democratic systems will not survive the AI transition of civilisation. Bengio recently hinted at the same conclusion.
Human+AI group design (scale-free: small group, org, society) and civilisational intelligence design must be modernised.
Richard mostly classifies this as "governance research", which carries a connotation that it is a sort of "literary" work rather than science, with which I disagree. There is a ton of cross-disciplinary hard science to be done on group intelligence and civilisational intelligence design: game theory, control theory, resilience theory, linguistics, political economy (rebuilt as a hard science, of course, on the basis of resource theory, bounded rationality, economic game theory, etc.), cooperative reinforcement learning, etc.
I feel that the design of group intelligence and civilisational intelligence is an area under-appreciated by the AI safety community. Some people do work on this (Eric Drexler, davidad, the cip.org team, ai.objectives.institute, the Digital Gaia team, and the SingularityNET team, although the latter are less concerned about alignment), but I feel that far more work is needed here.
There is also a place for "literary", strategic research, but I think it should mostly concern the deployment time of group and civilisational intelligence designs, i.e., the questions of transition from current governance systems to the next generation of computation- and AI-assisted systems.
Also, operations and evolutionary time concerns of everything (AI systems, systems of AIs, human+AI groups, civilisation) seem to be under-appreciated and under-researched: alignment is not a "problem to solve" but an ongoing, manufacturing-time and operations-time process.
Thank you so much for your insightful and detailed list of ideas for AGI safety careers, Richard! I really appreciate your excellent post.
I would propose explicitly grouping some of your ideas and additional ones under a third category: “identifying and raising public awareness of AGI’s dangers.” In fact, I think this category may plausibly contain some of the most impactful ideas for reducing catastrophic and existential risks, given that alignment seems potentially difficult to achieve in a reasonable period of time (if ever) and the implementation of governance ideas is bottlenecked by public support.
For a similar argument that I found particularly compelling, please check out Greg Colbourn’s recent post: https://forum.effectivealtruism.org/posts/8YXFaM9yHbhiJTPqp/agi-rising-why-we-are-in-a-new-era-of-acute-risk-and
I don't actually think the implementation of governance ideas is mainly bottlenecked by public support; I think it's bottlenecked by good concrete proposals. And to the extent that it is bottlenecked by public support, that will change by default as more powerful AI systems are released.
Richard, I hope you turn out to be correct that public support for AI governance ideas will become less of a bottleneck as more powerful AI systems are released!
But I think it is plausible that we should not leave this to chance. Several of the governance ideas you have listed as promising (e.g., global GPU tracking, data center monitoring) are probably infeasible at the moment, to say the least. It is plausible that these ideas will only become globally implementable once a critical mass of people around the world become highly aware of and concerned about AGI dangers.
This means that timing may be an issue. Will the most detrimental of the AGI dangers manifest before meaningful preventative measures are implemented globally? It is plausible that before the necessary critical mass of public support builds up, a catastrophic or even existential outcome may already have occurred. It would then be too late.
The plausibility of this scenario is why I agree with Akash that identifying and raising public awareness of AGI’s dangers is an underrated approach.
Thanks for sharing!
If such a model is a strong success it may shift my credences from, say, 25% to 75% in a given proposition. But that’s only a factor of 3 difference, whereas one plan for how to solve governance could be one or two orders of magnitude more effective than another.
Do you have any thoughts on the value of models to determine the effectiveness of plans to solve governance?
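To spell out the arithmetic in the comment above (the plan-impact numbers are hypothetical, chosen only to match the "one or two orders of magnitude" range it mentions):

```python
# A "strong success" for a model shifts credence in a proposition from 25% to
# 75%, a ~3x change, while two candidate governance plans might differ by
# 10x-100x in effectiveness. Numbers below are illustrative only.
credence_before, credence_after = 0.25, 0.75
model_factor = credence_after / credence_before    # 3.0

plan_a_impact, plan_b_impact = 1.0, 30.0           # arbitrary units
plan_factor = plan_b_impact / plan_a_impact        # 30.0

print(f"model shifts credence by ~{model_factor:.0f}x")
print(f"plans differ in impact by ~{plan_factor:.0f}x")
# The comment's point: the second ratio dwarfs the first, so generating and
# comparing concrete plans plausibly matters more than refining credences.
```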
I really appreciate specific career advice from people working in relevant jobs and the ideas and considerations outlined here, and am curating the post. (I'm also really interested in the discussion happening here.)
Personal highlights (note that I'm interested in hearing disagreement with these points!):
The emphasis on fast feedback loops, especially for people who are newer to a field (see also the bit about becoming an expert in something for governance)
"the best option for mentorship may be outside of alignment—but PhDs are long enough, and timelines short enough, that you should make sure that your mentor would be excited about supervising some kind of alignment-relevant research."
This bit (I'd be interested in hearing disagreement, if there is much, though!):
"You’ll need to get hands-on. The best ML and alignment research engages heavily with neural networks (with only a few exceptions). Even if you’re more theoretically-minded, you should plan to be interacting with models regularly, and gain the relevant coding skills. In particular, I see a lot of junior researchers who want to do “conceptual research”. But you should assume that such research is useless until it cashes out in writing code or proving theorems, and that you’ll need to do the cashing out yourself (with threat modeling being the main exception, since it forces a different type of concreteness). ..."
"You can get started quickly. People coming from fields like physics and mathematics often don’t realize how much shallower deep learning is as a field, and so think they need to spend a long time understanding the theoretical foundations first. You don’t..." [read the rest above]
The specific directions and research topics listed! (With links and commentary!)
On governance:
"The main advice I give people who want to enter this field: pick one relevant topic and try to become an expert on it."
"In general I think people overrate “analysis” and underrate “proposals”.
"You’ll need to get hands-on. The best ML and alignment research engages heavily with neural networks (with only a few exceptions). Even if you’re more theoretically-minded, you should plan to be interacting with models regularly, and gain the relevant coding skills. In particular, I see a lot of junior researchers who want to do “conceptual research”. But you should assume that such research is useless until it cashes out in writing code or proving theorems, and that you’ll need to do the cashing out yourself (with threat modeling being the main exception, since it forces a different type of concreteness). ..."
Yeah, I agree on priors & some arguments about feedback loops, although note that I don't really have relevant experience. But I remember hearing someone try to defend something like the opposite claim to me in some group setting where I wasn't able to ask the follow-up questions I wanted to ask — so now I don't remember what their main arguments were and don't know if I should change my opinion.
I expect a bunch of more rationalist-type people disagree with this claim, FWIW. But I also think that they heavily overestimate the value of the types of conceptual research I'm talking about here.
CC https://www.lesswrong.com/posts/fqryrxnvpSr5w2dDJ/touch-reality-as-soon-as-possible-when-doing-machine that expands on "hands-on" experience in alignment.
I don't know of any writing that directly contradicts these claims. I think https://www.lesswrong.com/s/v55BhXbpJuaExkpcD/p/3pinFH3jerMzAvmza indirectly contradicts these claims as it broadly criticizes most empirical approaches and is more open to conceptual approaches.
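As a concrete illustration of the "hands-on" claim being discussed here, the entry point is often nothing more elaborate than training a small model end to end. The sketch below assumes PyTorch is installed and is a generic starter exercise, not something drawn from the post's recommended curriculum.

```python
# Toy "hands-on" exercise: fit a tiny MLP to a synthetic classification task.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 2)
y = ((X[:, 0] * X[:, 1]) > 0).long()        # simple XOR-like labels

model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for step in range(200):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

accuracy = (model(X).argmax(dim=1) == y).float().mean().item()
print(f"final loss {loss.item():.3f}, train accuracy {accuracy:.2f}")
```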
Loved reading this post. As a person considering working in AI Safety, this is a great resource and answers many questions, including some I hadn't thought of asking. Thanks so much for writing this!
One question: I am curious to hear anyone's perspective on the following "conflict":
Point 1: "There is a specific skill of getting things done inside large organizations that most EAs lack (due to lack of corporate experience, plus lack of people-orientedness), but which is particularly useful when pushing for lab governance proposals. If you have it, lab governance work may be a good fit for you."
Point 2: "You need to get hands on" and, related: "Coding skill is a much more important prerequisite, though."
There may be exceptions, but I would guess (partly based on my own experience) that the kind of people who have a lot of experience getting things done in large organisations typically do not spend much time coding ML models.
And yet, as I say, I believe both of these are necessary. If I want to influence a major AI / ML company, I will lack credibility in their eyes if I have no experience working with and in large organisations. But I will also lack credibility if I don't have an in-depth understanding of the models and an ability to discuss them specifically rather than just abstractly.
Specific question: What might the typical learning curve be for the second aspect, to get to the point where I could get hands-on with models? My starting point would be having studied FORTRAN in college (!! - yes, that long ago!) and having taken only one online course in Python. There may be others with different starting points.
I suppose, ultimately, it still seems likely that it would be quicker even for a total novice to coding to reach some level of meaningful competence than for someone with no experience of organisations to become expert in how decisions are made and plans are approved or rejected, and how to influence this.
Also are there good online courses anyone would recommend?
One question: I am curious to hear anyone's perspective on the following "conflict":
The former is more important for influencing labs, the latter is more important for doing alignment research.
And yet, as I say, I believe both of these are necessary.
FWIW when I talk about the "specific skill", I'm not talking about having legible experience doing this, I'm talking about actually just being able to do it. In general I think it's less important to optimize for having credibility, and more important to optimize for the skills needed. Same for ML skill—less important for gaining credibility, more important for actually just figuring out what the best plans are.
Also are there good online courses anyone would recommend?
See the resources listed here.
Thanks Richard, this is clear now.
And thank you (and others) for sharing the resources link - this indeed looks like a fantastic resource.
Denis