Redwood Research is a longtermist organization based in Berkeley, California, working on AI alignment. We're doing an AMA this week; we'll answer questions mostly on Wednesday and Thursday (October 6th and 7th). I expect to answer a bunch of questions myself; Nate Thomas, Bill Zito, and perhaps other people will also be answering questions.
Here's an edited excerpt from this doc that describes our basic setup, plan, and goals.
Redwood Research is a longtermist research lab focusing on applied AI alignment. We’re led by Nate Thomas (CEO), Buck Shlegeris (CTO), and Bill Zito (COO/software engineer); our board is Nate, Paul Christiano and Holden Karnofsky. We currently have ten people on staff.
Our goal is to grow into a lab that does lots of alignment work that we think is particularly valuable and wouldn’t have happened elsewhere.
Our current approach to alignment research:
- We’re generally focused on prosaic alignment approaches.
- We expect to mostly produce value by doing applied alignment research. I think of applied alignment research as research that takes ideas for how to align systems, such as amplification or transparency, and then tries to figure out how to make them work in practice. I expect that this kind of practical research will be a big part of making alignment succeed. See this post for a bit more about how I think about the distinction between theoretical and applied alignment work.
- We are interested in thinking about our research from an explicit perspective of wanting to align superhuman systems.
- When choosing between projects, we’ll be thinking about questions like “to what extent is this class of techniques fundamentally limited? Is this class of techniques likely to be a useful tool to have in our toolkit when we’re trying to align highly capable systems, or is it a dead end?”
- I expect us to be quite interested in doing research of the form “fix alignment problems in current models” because it seems generally healthy to engage with concrete problems, but we’ll want to carefully think through exactly which problems along these lines are worth working on and which techniques we want to improve by solving them.
We're hiring researchers, engineers, and an office operations manager.
You can see our website here. Other things we've written that might be interesting:
- A description of our current project
- Some docs/posts that describe aspects of how I'm thinking about the alignment problem at the moment: The theory-practice gap. The alignment problem in different capability regimes.
We're up for answering questions about anything people are interested in.
How do we know the AMA answers are coming from real Redwood staff and not cleverly trained text models?
GPT-3 suggests: "We will post the AMA with a disclaimer that the answers are coming from Redwood staff. We will also be sure to include a link to our website in the body of the AMA, with contact information if someone wants to verify with us that an individual is staff."
That's quite a good answer.
But wait, how do we know that was really written by an algorithm? ^^
"Click here to prove you are a robot"
"What needs to happen in order for the field of x-risk-motivated AI alignment research to employ a thousand ML researchers and engineers?"
(I’ll use this comment to also discuss some aspects of some other questions that have been asked.)
I think there are currently something like three categories of bottlenecks on alignment research:

1. Tractable projects / theoretical understanding
2. Institutional structures
3. People
Regarding 1 (“tractable projects / theoretical understanding”): Maybe in the next few years we will come to have clearer and more concrete schemes for aligning superhuman AI, and this might make it easier to scope engineering-requiring research projects that implement or test parts of those plans. ARC, Paul Christiano’s research organization, is one group that is working towards this.
Regarding 2 (“institutional structures”), I think of there being 5 major categories of institutions that could house AI alignment researchers: alignment-focused research organizations, industry labs, academia, independent researchers, and government agencies.
Redwood Research is currently focused on the alignment-focused research organization category. One of the hypotheses behind Redwood’s current organizational structure is “it’s important for organizations to focus closely on alignment research if they want to produce a lot of high-quality alignment research” (see, for example, common startup advice such as “The most important thing for startups to do is to focus” (Paul Graham)). My guess is that it’s generally tricky to stay focused on the problems that are most likely to be core alignment problems, and I’m not sure how to do it well in some institutions. I’m excited about the prospect of alignment-focused research organizations that are carefully focused on x-risk-reducing alignment work and willing to deploy resources and increase headcount toward this work.
At Redwood, our current plan is to
There are various reasons why a focus on focus might not be the right call, such as “it’s important to have close contact with top ML researchers, even if they don’t care about working on alignment right now, otherwise you’ll be much worse at doing ML research” or “it’s important to use the latest technology, which could require developing that technology in house”. This is why I think industry labs may be a reasonable bet. My guess is that (with respect to quality-adjusted output of alignment research) they have lower variance but also lower upside. Roughly speaking, I am currently somewhat less excited about academia, independent work, and government agencies, but I’m also fairly uncertain, and also there are definitely people and types of work that might be much better in these homes.
To wildly speculate, I could imagine a good and achievable distribution across institutions being 500 in alignment-focused research organizations (who might be much more willing and able to productively absorb people for alignment research), 300 in industry labs, 100 in academia, 50 independent researchers, and 50 in government agencies (but plausibly these numbers should be very different in particular circumstances). Of course “number of people working in the field” is far from an ideal proxy for total productivity, so I’ve tried to adjust for targetedness and quality of their output in my discussion here.
I estimate the current size of the field of x-risk-reduction-motivated AI alignment research is 100 people (very roughly speaking, rounded to an order of magnitude), so 1000 people would constitute something like a 10x increase. (My guess for the current distribution is 30 in alignment orgs, 30 in industry labs, 30 in academia, 10 independent researchers, and 0 in government (very rough numbers, rounded to nearest half order of magnitude).) I’d guess there are at this time something like 30 - 100 people who, though they are not currently working on x-risk-motivated AI alignment research, would start working on this if the right institutions existed. I would like this number (of potential people) to grow a lot in the future.
Regarding 3 (“people”), the spread of the idea that it would be good to reduce x-risks from TAI (and maybe general growth of the EA movement) could increase the size and quality of the pool of people who would develop and execute on alignment projects. I am excited for the work that Open Philanthropy and university student groups such as Stanford EA are doing towards this end.
I’m currently unsure what an appropriate fraction of the technical staff of alignment-focused research organizations should be people who understand and care a lot about x-risk-motivated alignment research. I could imagine that ratio being something like 10%, or like 90%, or in between.
I think there’s a case to be made that alignment research is bottlenecked by current ML capabilities, but I (unconfidently) don’t think that this is currently a bottleneck; I think there is a bunch more alignment research that could be done now with current capabilities (eg my guess is that less than 50% of the alignment work that could be done at current levels of capabilities has been done -- I could imagine there being something like 10 or more projects that are as helpful as “Deep RL from human preferences” or “Learning to summarize from human feedback”).
It's 2027, and Redwood has failed to be useful while spending hundreds of person-years of researcher time. What happened?
In most worlds where we fail to produce value, I think we fail before we spend a hundred researcher-years. So I’m also going to include possibilities for wasting 30 researcher-years in this answer.
Here are some reasons we might have failed to produce useful research:
Some of the value of Redwood comes from building capacity to do more good research in the future (including building up this capacity for other orgs, eg by them being able to poach our employees). So you also have to imagine that this didn’t work out.
Unlike some other places, Redwood doesn't seem to be directly trying to create AGI, so value will have to come from its techniques being used by other labs. Assuming Redwood finds some promising techniques, how does Redwood plan to influence the biggest research labs that are working towards AGI? Do you hope for your techniques to be useful enough to AGI research that labs adopt them anyway? Do you want to heavily evangelize your techniques in publications/the press/etc.? Or do you expect the work of persuading the biggest players to be better done by somebody else?
So to start with, I want to note that I imagine something a lot more like “the alignment community as a whole develops promising techniques, probably with substantial collaboration between research organizations” than “Redwood does all the work themselves”. Among other things, we don’t have active plans to do much theoretical alignment work, and I’d be fairly surprised if it was possible to find techniques I was confident in without more theoretical progress--our current plan is to collaborate with theory researchers elsewhere.
In this comment, I mentioned the simple model of “labs align their AGI if the amount of pressure on them to use sufficiently reliable alignment techniques is greater than the inconvenience associated with using those techniques.” The kind of applied alignment work we’re doing is targeted at reducing the cost of using these techniques, rather than increasing the pressure. We’re hoping to make it cheaper and easier for capabilities labs to apply alignment techniques that they’re already fairly motivated to use, eg by ensuring that these techniques have been tried out in miniature, so that labs feel pretty optimistic that the practical kinks have been worked out, and there are people who have implemented the techniques before who can help them.
Organizations grow and change over time, and I wouldn’t be shocked to hear that Redwood eventually ended up engaging in various kinds of efforts to get capabilities labs to put more work into alignment. We don’t currently have plans to do so.
That would be great, and seems plausible.
I don’t imagine wanting to heavily evangelize techniques in the press. I think that getting prominent publications about alignment research is probably useful.
This looks brilliant, and I want to strong-strong upvote!
What do you foresee as your biggest bottlenecks or obstacles in the next 5 years? Eg. finding people with a certain skillset, or just not being able to hire quickly while preserving good culture.
Thanks for the kind words!
Our biggest bottlenecks are probably going to be some combination of:
Suppose you successfully modify GPT models as desired, at moderate cost in compute and human classification. How might your process generalize?
So there’s this core question: "how are the results of this project going to help with the superintelligence alignment problem?" My claim can be broken down as follows:
I don’t think that the process we develop will generalize, in the sense that I don’t think we’ll be able to apply it directly to the problems we actually care about, but I think it’s still likely to be a useful step.
There are more advanced techniques that have been proposed for ensuring models don’t do bad things. For example, relaxed adversarial training, or adversarial training where the humans have access to powerful tools that help them find examples where the model does bad things (eg as in proposal 2 here). But it seems easier to research those things once we’ve done this research, for a few reasons:
I often think of our project as being kind of analogous to Learning to summarize with human feedback. That paper isn’t claiming that if we know how to train models by getting humans to choose which of two options they prefer, we’ll have solved the whole alignment problem. But it’s still probably the case that it’s helpful for us to have sorted out some of the basic questions about how to do training from human feedback, before trying to move on to more advanced techniques (like training using human feedback where the humans have access to ML tools to help them provide better feedback).
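To make the pairwise-comparison training idea concrete, here is a minimal toy sketch. This is entirely my own illustration, not code from that paper: a linear reward model is fit with the logistic (Bradley-Terry) loss so that the human-preferred option of each pair scores higher.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(w, x):
    # Toy reward model: a linear score over feature vectors.
    return x @ w

def loss_and_grad(w, preferred, rejected):
    # Bradley-Terry loss: -log sigmoid(r(preferred) - r(rejected)).
    margin = reward(w, preferred) - reward(w, rejected)
    p = 1.0 / (1.0 + np.exp(-margin))  # P(model agrees with the human)
    loss = -np.log(p).mean()
    # Gradient of the mean loss with respect to w.
    grad = ((p - 1.0)[:, None] * (preferred - rejected)).mean(axis=0)
    return loss, grad

# Synthetic comparisons: the "human" always prefers the option whose
# first feature is larger.
true_w = np.array([1.0, 0.0, 0.0])
a = rng.normal(size=(256, 3))
b = rng.normal(size=(256, 3))
prefer_a = (a @ true_w) > (b @ true_w)
preferred = np.where(prefer_a[:, None], a, b)
rejected = np.where(prefer_a[:, None], b, a)

w = np.zeros(3)
for _ in range(200):
    loss, grad = loss_and_grad(w, preferred, rejected)
    w -= 0.5 * grad

# The learned reward model now ranks pairs the way the human did.
accuracy = (reward(w, preferred) > reward(w, rejected)).mean()
```

In the real setting the reward model is a neural network scoring text, and the learned reward then drives RL fine-tuning; the loss and the pairwise setup are the part this sketch is meant to show.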
What might be an example of a "much better weird, theory-motivated alignment research" project, as mentioned in your intro doc? (It might be hard to say at this point, but perhaps you could point to something in that direction?)
I think the best examples would be if we tried to practically implement various schemes that seem theoretically doable and potentially helpful, but quite complicated to do in practice. For example, imitative generalization or the two-head proposal here. I can imagine that it might be quite hard to get industry labs to put in the work of getting imitative generalization to work in practice, and so doing that work (which labs could perhaps then adopt) might have a lot of impact.
Some questions that aren't super related to Redwood/applied ML AI safety, so feel free to ignore if not your priority:
Assuming that it's taking too long to solve the technical alignment problem, what might be some of our other best interventions to reduce x-risk from AI? E.g., regulation, institutions for fostering cooperation and coordination between AI labs, public pressure on AI labs/other actors to slow deployment, ...
If we solve the technical alignment problem in time, what do you think are the other major sources of AI-related x-risk that remain? How likely do you think these are, compared to x-risk from not solving the technical alignment problem in time?
So one thing to note is that I think there are varying degrees of solving the technical alignment problem. In particular, you’ve solved the alignment problem more if you’ve made it really convenient for labs to use the alignment techniques you know about. If next week some theory people told me “hey, we think we’ve solved the alignment problem, you just need to use IDA, imitative generalization, and this new crazy thing we just invented”, then I’d think the main focus of the applied alignment community should be trying to apply these alignment techniques to the most capable currently available ML systems, in the hope of working out all the kinks in these techniques, and then repeating this every year. That way, whenever it comes time to actually build the AGI with these techniques, the relevant lab can just hire all the applied alignment people who are experts on these techniques and get them to apply them. (You might call this fire drills for AI safety, or having an “anytime alignment plan” (someone else invented this latter term; I don’t remember who).)
I normally focus my effort on the question “how do we solve the technical alignment problem and make it as convenient as possible to build aligned systems, and then ensure that the relevant capabilities labs put effort into using these alignment techniques”, rather than this question, because it seems relatively tractable, compared to causing things to go well in worlds like those you describe.
One way of thinking about your question is to ask how many years the deployment of existentially risky AI could be delayed (which might buy time to solve the alignment problem). I don’t have super strong takes on this question. I think that there are many reasonable-seeming interventions, such as all of those that you describe. I guess I’m more optimistic about regulation and voluntary coordination between AI labs (eg, I’m happy about “Therefore, if a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project.” from the OpenAI Charter) than about public pressure, but I’m not confident.
Again, I think that maybe 30% of AI accident risk comes from situations where we sort of solved the alignment problem in time but the relevant labs don’t use the known solutions. Excluding that, I think that misuse risk is serious and worth worrying about. I don’t know how much value I think is destroyed in expectation by AI misuse compared to AI accident. I can also imagine various x-risks related to narrow AI.
How crucial a role do you expect x-risk-motivated AI alignment will play in making things go well? What are the main factors you expect will influence this? (e.g. the occurrence of medium-scale alignment failures as warning shots)
We could operationalize this as “How does P(doom) vary as a function of the total amount of quality-adjusted x-risk-motivated AI alignment output?” (A related question is “Of the quality-adjusted AI alignment research, how much will be motivated by x-risk concerns?” This second question feels less well defined.)
I’m pretty unsure here. Today, my guess is like 25% chance of x-risk from AI this century, and maybe I imagine that being 15% if we doubled the quantity of quality-adjusted x-risk-motivated AI alignment output, and 35% if we halved that quantity. But I don’t have explicit models here and just made these second two numbers up right now; I wouldn’t be surprised to hear that they moved noticeably after two hours of thought. I guess that one thing you might learn from these numbers is that I think that x-risk-motivated AI alignment output is really important.
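Those three guesses happen to lie on a line in log-space. Purely as my own back-of-envelope illustration (the function and the log-linear form are my assumptions, not anything stated above): they are consistent with P(doom) falling about ten percentage points per doubling of quality-adjusted output, a relation that can only hold locally, since extrapolating far enough would leave [0, 1].

```python
from math import log2

# Toy log-linear fit to the three guesses above (my own framing, not a
# model anyone in this thread endorses). output_ratio is the quantity of
# quality-adjusted alignment output relative to today's level.
def p_doom(output_ratio, base=0.25, drop_per_doubling=0.10):
    return base - drop_per_doubling * log2(output_ratio)

assert abs(p_doom(1.0) - 0.25) < 1e-9  # current output: 25%
assert abs(p_doom(2.0) - 0.15) < 1e-9  # doubled output: 15%
assert abs(p_doom(0.5) - 0.35) < 1e-9  # halved output: 35%
```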
I definitely think that AI x-risk seems lower in worlds where we expect medium-scale alignment failure warning shots. I don’t know whether I think that x-risk-motivated alignment research seems less important in those worlds or not--even if everyone thinks that AI is potentially dangerous, we have to have scalable solutions to alignment problems, and I don’t see a reliable route that takes us directly from “people are concerned” to “people solve the problem”.
I think the main factor that affects the importance of x-risk-motivated alignment research is whether it turns out that most of the alignment problem occurs in miniature in sub-AGI systems. If so, much more of the work required for aligning AGI will be done by people who aren’t thinking about how to reduce x-risk.
It’s 2035, and Redwood has built an array of alignment tools that make SOTA models far less existentially risky while sacrificing hardly any performance. But these tools don’t end up being used by enough of the richest labs, so we still face doom. What happened?
One simple model for this is: labs build aligned models if the amount of pressure on them to use sufficiently reliable alignment techniques is greater than the inconvenience associated with using those techniques.
Here are various sources of pressure:
In practice, all of these sources of pressure are involved in companies spending resources on, eg, improving animal welfare standards, reducing environmental costs, or DEI (diversity, equity, and inclusion).
And here are various sources of inconvenience that could be associated with using particular techniques, even assuming they’re in principle competitive (in both the performance-competitive and training-competitive senses).
And so when I’m thinking about labs not using excellent alignment strategies that had already been developed, I imagine the failures differently depending on how much inconvenience there was:
Thanks for the response! I found the second set of bullet points especially interesting/novel.
Also, how important does it seem like governance is here versus other kinds of coordination? Any historical examples that inform your beliefs?
This is a great question and I don't have a good answer.
What factors do you think would have to be in place for some other people to set up some similar but different organisation in 5 years time?
I imagine this is mainly about the skills and experience of the team, but also interested in other things if you think that's relevant
I think the main skillsets required to set up organizations like this are:
Of course, if you had some of these properties but not the others, many people in EA (eg me) would be very motivated to help you out, by perhaps introducing you to cofounders or helping you with parts you were less experienced with.
People who wanted to start a Redwood competitor should plausibly consider working on an alignment research team somewhere (preferably leading it) and then leaving to start their own team. We’d certainly be happy to host people who had that aspiration (though we’d think that such people should consider the possibility of continuing to host their research inside Redwood instead of leaving).
Does it make sense to think of your work as aimed at reducing a particular theory-practice gap? If so, which one (what theory / need input for theoretical alignment scheme)?
I think our work is aimed at reducing the theory-practice gap of any alignment schemes that attempt to improve worst-case performance by training the model on data that was selected in the hope of eliciting bad behavior from the model. For example, one of the main ingredients of our project is paying people to try to find inputs that trick the model, then training the model on these adversarial examples.
Many different alignment schemes involve some type of adversarial training. The kind of adversarial training we’re doing, where we just rely on human ingenuity, isn’t going to work for ensuring good behavior from superhuman models. But getting good at the simple, manual version of adversarial training seems like plausibly a prerequisite for being able to do research on the more complicated techniques that might actually scale.
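The manual loop described above can be caricatured in a few lines. This is purely my own toy sketch under strong simplifying assumptions (a logistic-regression classifier and a random-search "red team" standing in for the paid human attackers), not Redwood's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def features(x):
    # Bias, both raw inputs, and a quadratic term: the classifier *could*
    # represent the true boundary, if its training data ever revealed it.
    return np.column_stack([np.ones(len(x)), x[:, 0], x[:, 1], x[:, 1] ** 2])

def is_bad(x):
    # Ground truth that the human labellers can judge.
    return (x[:, 0] + 0.5 * x[:, 1] ** 2 > 1.0).astype(float)

def train(x, y, steps=2000, lr=0.1):
    # Plain logistic regression fit by gradient descent.
    f = features(x)
    w = np.zeros(f.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(f @ w)))
        w -= lr * f.T @ (p - y) / len(y)
    return w

def misses(w, x):
    # Bad inputs the classifier passes as safe: what a red team hunts for.
    p = 1 / (1 + np.exp(-(features(x) @ w)))
    return x[(is_bad(x) == 1) & (p < 0.5)]

def probe():
    # The "red team" searches a wider input distribution than training covered.
    return np.column_stack([rng.normal(size=5000), rng.normal(scale=2.0, size=5000)])

# Ordinary training data barely exercises the second input, so the model
# never learns that the quadratic term matters.
x_train = np.column_stack([rng.normal(size=2000), rng.normal(scale=0.1, size=2000)])
w = train(x_train, is_bad(x_train))
before = len(misses(w, probe()))

# Fold the red team's finds back into the training set and retrain.
adv = misses(w, probe())
w2 = train(np.vstack([x_train, adv]),
           np.concatenate([is_bad(x_train), np.ones(len(adv))]))
after = len(misses(w2, probe()))
# Expect: before is large, after is much smaller.
```

The point of the sketch is the loop structure: attackers expose failures the training distribution never surfaced, and retraining on those failures shrinks the space of inputs that still trick the model.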
What are examples of reasonably scoped non-alignment/non-technical research questions, if any, that you think would be helpful for your work?
I think that most questions we care about are either technical or related to alignment. Maybe my coworkers will think of some questions that fit your description. Were you thinking of anything in particular?
Well for me, better research on correlates of research performance would be pretty helpful for research hiring. Like it's an open question to me whether I should expect a higher or lower (within-distribution) correlation of {intelligence, work sample tests, structured interviews, resume screenings} to research productivity when compared to the literature on work performance overall. I expect there are similar questions for programming.
But the selfish reason I'm interested in asking this is that I plan to work on AI gov/strategy in the near future, and it'll be useful to know if there are specific questions in those domains that you'd like an answer to, as this may help diversify or add to our paths to impact.
Okay, "How alignment research might look different five or ten years from now?"
Here are some things I think are fairly likely:
What's the main way that you think resources for onboarding people have improved?
[Edited] How important do you think it is to have ML research projects be led by researchers who have had a lot of previous success in ML? Maybe it's the case that the most useful ML research is done by the top ML researchers, or that the ML community won't take Redwood very seriously (e.g. won't consider using your algorithms) if the research projects aren't led by people with strong track records in ML.
Additionally, what are/how strong are the track records of Redwood's researchers/advisors?
The people we seek advice from on our research most often are Paul Christiano and Ajeya Cotra. Paul is a somewhat experienced ML researcher, who among other things led some of the applied alignment research projects that I am most excited about.
On our team, the people with the most relevant ML experience are probably Daniel Ziegler, who was involved with GPT-3 and also several OpenAI alignment research projects, and Peter Schmidt-Nielsen. Many of our other staff have research backgrounds (including publishing ML papers) that make me feel pretty optimistic about our ability to have good ML ideas and execute on the research.
I think it kind of depends on what kind of ML research you’re trying to do. I think our projects require pretty similar types of expertise to eg Learning to Summarize with Human Feedback, and I think we have pretty analogous expertise to the team that did that research (and we’re advised by Paul, who led it).
I think that there are particular types of research that would be hard for us to do, due to not having certain types of expertise.
I think that a lot of the research we are most interested in doing is not super bottlenecked on having the top ML researchers, in the same way that Learning to Summarize with Human Feedback doesn’t seem super bottlenecked on having the top ML researchers. I feel like the expertise we end up needing is some mixture of ML stuff like “how do we go about getting this transformer to do better on this classification task”, reasoning about the analogy to the AGI alignment problem, and lots of random stuff like making decisions about how to give feedback to our labellers.
I don’t feel very concerned about this; in my experience, ML researchers are usually pretty willing to consider research on its merits, and we have had good interactions with people from various AI labs about our research.
Do you think that different trajectories of prosaic TAI have big impacts on the usefulness of your current project? (For example, perhaps you think that TAI that is agentic would just be taught to deceive). If so, which? If not, could you say something about why it seems general?
(NB: the above is not supposed to imply criticism of a plan that only works in some worlds).
I think this is a great question.
We are researching techniques that are simpler precursors to adversarial training techniques that seem most likely to work if you assume that it’s possible to build systems that are performance-competitive and training-competitive, and do well on average on their training distribution.
There are a variety of reasons to worry that this assumption won’t hold. In particular, it seems plausible that humanity will only have the ability to produce AGIs that will collude with each other if it’s possible for them to do so. This seems especially likely if it’s only affordable to train your AGI from scratch a few times, because then all the systems you’re using are similar to each other and will find collusion easier. (It’s not training-competitive to assume you’re able to train the AGI from scratch multiple times, if you believe that there’s a way of building an unaligned powerful system that only involves training it from scratch once.) But even if we train all our systems from scratch separately, it’s pretty plausible to me that models will collude, either via acausal trade or because the systems need to be able to communicate with each other for some competitiveness reason.
So our research is most useful if we’re able to assume a lack of such collusion.
I think that some people think you might be able to apply these techniques even in cases where you don’t have an a priori reason to be confident that the models won’t collude; I don’t have a strong opinion on this.
Hm, could you expand on why collusion is one of the most salient ways in which "it’s possible to build systems that are performance-competitive and training-competitive, and do well on average on their training distribution" could fail?
Is the thought here that — if models can collude — then they can do badly on the training distribution in an unnoticeable way, because they're being checked by models that they can collude with?
Yeah basically.
I think it is fair to say that so far alignment research is not a standard research area in academic machine learning, unlike for example model interpretability. Do you think that would be desirable, and if so what would need to happen?
In particular, I had this toy idea of making progress legible to academic journals: formulating problems and metrics that are "publishing-friendly" could, despite the problems that optimizing for flawed metrics brings, allow researchers at regular universities to conduct work in these areas.
It seems definitely good on the margin if we had ways of harnessing academia to do useful work on alignment. Two reasons for this are that 1. perhaps non-x-risk-motivated researchers would produce valuable contributions, and 2. it would mean that x-risk-motivated researchers inside academia would be less constrained and so more able to do useful work.
Three versions of this:
What type of legal entity is Redwood Research operating as/under? Is it plausible that at some point the project will be funded by investors and that shareholders will be able to financially profit?
We're a nonprofit. We don't have plans to make profits, and it seems less likely for us than for e.g. OpenAI that we would transition from a nonprofit to a tandem for-profit/nonprofit structure, but there are a variety of revenue-generating things I can imagine us doing (e.g. consulting with industry labs to help them align their models).
Two hiring (and personally-motivated) questions:
Re 1:
It’s probably going to be easier to get good at the infrastructure engineering side of things than the ML side of things, so I’ll assume that that’s what you’re going for.
For our infra engineering role, we want to hire people who are really productive and competent at engineering various web systems quickly. (See the bulleted list of engineering responsibilities on the job page.) There are some people who are qualified for this role without having much professional experience, because they’ve done a lot of Python programming and web programming as hobbyists. Most people who want to become more qualified for this work should seek out a job that’s going to involve practicing these skills. For example, being a generalist backend engineer at a startup, especially if you’re going to be working with ML, is likely to teach you a bunch of the skills that are valuable to us. You’re more likely to learn these skills quickly if you take your job really seriously and try hard to be very good at it--you should try to take on more responsibilities when you get the opportunity to do so, and generally practice the skill of understanding the current technical situation and business needs and coming up with plans to quickly and effectively produce value.
Re 2:
Currently our compensation packages are usually entirely salary. We don’t have equity because we’re a nonprofit. We’re currently unsure how to think about compensation policy--we’d like to be able to offer competitive salaries so that we can hire non-EA talent for appropriate roles (because almost all the talent is non-EA), but there are a bunch of complexities associated with this.
How likely do you think it would be for standard ML research to solve the problems you're working on in the course of trying to get good performance? Do such concerns affect your project choices much?