Why I think it's net harmful to do technical safety research at AGI labs

Remmelt

IMO it is harmful in expectation for a technical safety researcher to work at DeepMind, OpenAI or Anthropic.

Four reasons:

Interactive complexity. The intractability of catching up – by trying to invent general methods for AI corporations to somehow safely contain model interactions, as other engineers scale models' combinatorial complexity and outside connectivity.
Safety-capability entanglements
1. Commercialisation. Model inspection and alignment techniques can support engineering and productisation of more generally useful automated systems.
2. Infohazards. Researching capability risks within an AI lab can inspire researchers hearing about your findings to build new capabilities.
Shifts under competitive pressure
1. DeepMind merged with Google Brain to do commercialisable research,
  OpenAI set up a company and partnered with Microsoft to release ChatGPT,
  Anthropic pitched to investors they'd build a model 10 times more capable.
2. If you are an employee at one of these corporations, higher-ups can instruct you to do R&D you never signed up to do.^[1] You can abide, or get fired.
3. Working long hours surrounded by others paid like you are, by a for-profit corp, is bad for maintaining bearings and your epistemics on safety.^[2]
Safety-washing. Looking serious about 'safety' helps labs to recruit idealistic capability researchers, lobby politicians, and market to consumers.
1. 'let's build AI to superalign AI'
2. 'look, pretty visualisations of what's going on inside AI'

This is my view. I would want people to engage with the different arguments, and think for themselves what ensures that future AI systems are actually safe.

^[1]
I heard secondhand that Google managers are forcing DeepMind safety researchers to shift some of their hours to developing Gemini for product-ready launch.
I cannot confirm whether that's correct.
^[2]
For example, I was in contact with a safety researcher at an AGI lab who kindly offered to read my comprehensive outline on the AGI control problem, to consider whether to share with colleagues. They also said they're low energy. They suggested I'd remind them later, and I did, but they never got back to me. They're simply too busy it seems.


Comments (29)


Derek Shiller · Feb 7 2024 · 38

Do you think it would be better if no one who worked at OpenAI / Anthropic / Deepmind worked on safety? If those organizations devoted less of their budget to safety? (Or do you think we should want them to hire for those roles, but hire less capable or less worried people, so individuals should avoid potentially increasing the pool of talent from which they can hire?)

Remmelt · Feb 7 2024 · 3

(Let me get back on this when I find time, hopefully tomorrow)

Remmelt · Feb 8 2024 · 3

Do you think it would be better if no one who worked at OpenAI / Anthropic / Deepmind worked on safety?

It depends on what you mean by 'work on safety'.
Standard practice for designing machine products to be safe in other established industries is to first narrowly scope the machinery's uses, the context of use, and the user group.

If employees worked at OpenAI / Anthropic / Deepmind on narrowing their operational scopes, all power to them! That would certainly help. It seems that leadership, who aim to design unscoped automated machinery to be used everywhere for everyone, would not approve though.

If working on safety means in effect playing close to a ceremonial role, where even though you really want to help, you cannot hope to catch up with the scaling efforts, I would reconsider. In other industries, when conscientious employees notice engineering malpractices that are already causing harms across society, sometimes one of them has the guts to find an attorney and become a whistleblower.

Also, in that case, I would prefer the AGI labs to not hire for those close-to-ceremonial roles.
I'd prefer them to be bluntly transparent to people in society that they are recklessly scaling ahead, and that they are just adding local bandaids to the 'Shoggoth' machinery.
Not that that is going to happen anyway.

If those organizations devoted less of their budget to safety?

If AGI labs can devote their budget to constructing operational design domains, I'm all for it.
Again, that's counter to the leaders' intentions. Their intention is to scale everywhere and rely on the long-term safety researchers to tell them that there must be some yet-undiscovered general safe control patch.

so individuals should avoid potentially increasing the pool of talent from which they can hire?

I think we should avoid promoting AGI labs as a place to work at, or a place that somehow will improve safety. One of the reasons is indeed that I want us to be clear to idealistic talented people that they should really reconsider investing their career into supporting such an organisation.

BTW, I'm not quite answering from your suggested perspective of what an AGI lab "should do".
What feels relevant to me is what we can personally consider doing – as individuals connected into larger communities – so things won't get even worse.

Derek Shiller · Feb 8 2024 · 5

I think I agree that safety researchers should prefer not to take a purely ceremonial role at a big company if they have other good options, but I'm hesitant to conclude that no one should be willing to do it. I don't think it is remotely obvious that safety research at big companies is ceremonial.

There are a few reasons why some people might opt for a ceremonial role:

It is good for some AI safety researchers to have access to what is going on at top labs, even if they can't do anything about it. They can at least keep tabs on it and can use that experience later in their careers.
It seems bad to isolate capabilities researchers from safety concerns. I bet capabilities researchers would take safety concerns more seriously if they eat lunch every day with someone who is worried than if they only talk to each other.
If labs do engage in behavior that is flagrantly reckless, employees can act as whistleblowers. Non-employees can't. Even if they can't prevent a disaster, they can create a paper trail of internal concerns which could be valuable in the future.
Internal politics might change and it seems better to have people in place already thinking about these things.

Remmelt · Feb 11 2024 · 2

If labs do engage in behavior that is flagrantly reckless, employees can act as whistleblowers.

This is the crux for me.

If some employees actually have the guts to whistleblow on current engineering malpractices, I have some hope left that having AI safety researchers at these labs still turns out “net good”.

If this doesn’t happen, then they can keep having conversations about x-risks with their colleagues, but I don’t quite see when they will put up a resistance to dangerous tech scaling. If not now, when?

Internal politics might change

We’ve seen in which directions internal politics change under competitive pressures.

Nerdy intellectual researchers can wait that out as much as they like. That would confirm my concern here.

Remmelt · Feb 11 2024 · 2

If some employees actually have the guts to whistleblow on current engineering malpractices…

Plenty of concrete practices you can whistleblow on that will be effective in getting society to turn against these companies:

The copying of copyrighted and person-identifying information without permission (pass on evidence to publishers and they will have a lawsuit feast).
The exploitation and underpayment of data workers and coders from the Global South (inside information on how OpenAI staff hid that they instructed workers in Kenya to collect images of child sexual abuse, anyone?).
The unscoped misdesign and failure to test these systems for all the uses the AI company promotes.
The extent of AI hardware’s environmental pollution.

Pick what you’re in a position to whistleblow on.

Be very careful to prepare well. You’re exposing a multi-billion-dollar company. First meet in person with an attorney experienced in protecting whistleblowers.

Once you start collecting information, take photographs with your personal phone, rather than screenshots or USB copies that might be tracked by software. Make sure you’re not in line of sight of an office camera or webcam. Etc. Etc.

Preferably, before you start, talk with an experienced whistleblower about how to maintain anonymity. The more at ease you are there, the more you can bide your time, carefully collecting and storing information.

If you need information to get started, email me at remmelt.ellen[a/}protonmail<d0t>com.

~ ~ ~

But don’t wait it out until you can see some concrete dependable sign of “extinction risk”. By that time, it’s too late.

Remmelt · Feb 7 2024 · 25

80,000 Hours handpicks jobs at AGI labs.

Some of those jobs don't even focus on safety – instead they look like policy lobbying roles or engineering support roles.

Nine months ago, I wrote my concerns to 80k staff:

Hi [x, y, z]
I noticed the job board lists positions at OpenAI and AnthropicAI under the AI Safety category:

Not sure whom to contact, so I wanted to share these concerns with each of you:

Capability races
OpenAI's push for scaling the size and applications of transformer-network-based models has led Google and others to copy and compete with them.
Anthropic now seems on a similar trajectory.
By default, these should not be organisations supported by AI safety advisers with a security mindset.
No warning
Job applicants are not warned of the risky past behaviour by OpenAI and Anthropic. Given that 80K markets to a broader audience, I would not be surprised if 50%+ are not much aware of the history. The subjective impression I get is that taking the role will help improve AI safety and policy work.
At the top of the job board, positions are described as "Handpicked to help you tackle the world's most pressing problems with your career."
If anything, "About this organisation" makes the companies look more comprehensively careful about safety than they have actually acted:
"Anthropic is an AI safety and research company that’s working to build reliable, interpretable, and steerable AI systems."
"OpenAI is an AI research and deployment company, with roles working on AI alignment & safety."
It is understandable that people aspiring for AI safety & policy careers are not much aware, and therefore should be warned.
However, 80K staff should be tracking the harmful race dynamics and careless deployment of systems by OpenAI, and now Anthropic.
The departure of OpenAI's safety researchers was widely known, and we have all been tracking the hype cycles around ChatGPT.
Various core people in the AI Safety community have mentioned concerns about Anthropic.
Oliver Habryka mentions this as part of the reasoning for shutting down the LightCone offices:
I feel quite worried that the alignment plan of Anthropic currently basically boils down to "we are the good guys, and by doing a lot of capabilities research we will have a seat at the table when AI gets really dangerous, and then we will just be better/more-careful/more-reasonable than the existing people, and that will somehow make the difference between AI going well and going badly". That plan isn't inherently doomed, but man does it rely on trusting Anthropic's leadership, and I genuinely only have marginally better ability to distinguish the moral character of Anthropic's leadership from the moral character of FTX's leadership, and in the absence of that trust the only thing we are doing with Anthropic is adding another player to an AI arms race.
More broadly, I think AI Alignment ideas/the EA community/the rationality community played a pretty substantial role in the founding of the three leading AGI labs (Deepmind, OpenAI, Anthropic), and man, I sure would feel better about a world where none of these would exist, though I also feel quite uncertain here. But it does sure feel like we had a quite large counterfactual effect on AI timelines.
Not safety focussed
Some jobs seem far removed from positions of researching (or advising on restricting) the increasing harms of AI-system scaling.
For OpenAI:
IT Engineer, Support: "The IT team supports Mac endpoints, their management tools, local network, and AV infrastructure"
Software Engineer, Full-Stack: "to build and deploy powerful AI systems and products that can perform previously impossible tasks and achieve unprecedented levels of performance."
For Anthropic:
Technical Product Manager: "Rapidly prototype different products and services to learn how generative models can help solve real problems for users."
Prompt Engineer and Librarian: "Discover, test, and document best practices for a wide range of tasks relevant to our customers."
Align-washing
Even if an accepted job applicant gets to be in a position of advising on and restricting harmful failure modes, how do you trade this off against:
the potentially large marginal relative difference in skills of top engineering candidates you sent OpenAI's and Anthropic's way, and who are accepted to do work for scaling their technology stack?
how these R&D labs will use the alignment work to market the impression that they are safety-conscious, to:
avoid harder safety mandates (eg. document their copyright-infringing data, don't allow API developers to deploy spaghetti code all over the place)?
attract other talented idealistic engineers and researchers?
and so on?
I'm confused and, to be honest, shocked that these positions are still listed for R&D labs heavily invested in scaling AI system capabilities (without commensurate care for the exponential increase in the number of security gaps and ways to break our complex society and supporting ecosystem that opens up). I think this is pretty damn bad.
Preferably, we can handle this privately and not make it bigger. If you can come back on these concerns in the next two weeks, I would very much appreciate that.

If not, or not sufficiently addressed, I hope you understand that I will share these concerns in public.

Warm regards,

Remmelt

80k removed one of the positions I flagged:
Software Engineer, Full-Stack, Human Data Team (reason given: it looked potentially more capabilities-focused than the original job posting that came into their system).

For the rest, little has changed:

80k still lists jobs that help AGI labs scale commercially,
- Jobs with similar names:
  research engineer product, prompt engineer, IT support, senior software engineer.
80k still describes these jobs as "Handpicked to help you tackle the world's most pressing problems with your career."
80k still describes Anthropic as "an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems".
80k staff still have not accounted for the possibility that >50% of the broad audience checking 80k's handpicked jobs is not much aware of the potential issues of working at an AGI lab.
- Readers there don't get informed. They get to click on the button 'VIEW JOB DETAILS', taking them straight to the job page. From there, they can apply and join the lab unprepared.

Two others in AI Safety also discovered the questionable job listings. They are disappointed in 80k.

Feeling exasperated about this. Thinking of putting out another post just to discuss this issue.

Benjamin Hilton · Feb 7 2024 · 38

Hi Remmelt,

Thanks for sharing your concerns, both with us privately and here on the forum. These are tricky issues and we expect people to disagree about how to weigh all the considerations, so it’s really good to have open conversations about them.

Ultimately, we disagree with you that it's net harmful to do technical safety research at AGI labs. In fact, we think it can be the best career step for some of our readers to work in labs, even in non-safety roles. That’s the core reason why we list these roles on our job board.

We argue for this position extensively in my article on the topic (and we only list roles consistent with the considerations in that article).

Some other things we’ve published on this topic in the last year or so:

A range of opinions from anonymous experts about the upsides and downsides of working on AI capabilities
How policy roles in AI companies can be valuable for career capital and for direct impact (as well as the potential downsides)
We recently released a podcast episode with Nathan Labenz on some of the controversy around OpenAI, including his concerns about some of their past safety practices, whether ChatGPT’s release was good or bad, and why its mission of developing AGI may be too risky.

Benjamin

Conor Barnes · Feb 12 2024 · 9

Hi Remmelt,

Just following up on this — I agree with Benjamin’s message above, but I want to add that we actually did add links to the “working at an AI lab” article in the org descriptions for leading AI companies after we published that article last June.

It turns out that a few weeks ago the links to these got accidentally removed when making some related changes in Airtable, and we didn’t notice these were missing — thanks for bringing this to our attention. We’ve added these back in and think they give good context for job board users, and we’re certainly happy for more people to read our articles.

We also decided to remove the prompt engineer / librarian role from the job board, since we concluded it’s not above the current bar for inclusion. I don’t expect everyone will always agree with the judgement calls we make about these decisions, but we take them seriously, and we think it’s important for people to think critically about their career choices.

Remmelt · Feb 13 2024 · 0

Hi Conor,

Thank you.

I’m glad to see that you already linked to clarifications before. And that you gracefully took the feedback, and removed the prompt engineer role. I feel grateful for your openness here.

It makes me feel less like I’m hitting a brick wall. We can have more of a conversation.

~ ~ ~

The rest is addressed to people on the team, and not to you in particular:

There are grounded reasons why 80k’s approaches to recommending work at AGI labs – with the hope of steering their trajectory – have supported AI corporations to scale, while disabling efforts that may actually prevent AI-induced extinction.

This concerns work on your listed #1 most pressing problem. It is a crucial consideration that can flip your perceived total impact from positive to negative.

I noticed that 80k staff responses so far started by stating disagreement (with my view), or agreement (with a colleague’s view).

This doesn’t do discussion of it justice. It’s like responding to someone’s explicit reasons for concern that they must be “less optimistic about alignment”. This ends reasoned conversations, rather than opens them up.

Something I would like to see more of is individual 80k staff engaging with the reasoning.

Remmelt · Feb 8 2024 · 3

Ben, it is very questionable that 80k is promoting non-safety roles at AGI labs as 'career steps'.

Consider that your model of this situation may be wrong (account for model error).

The upside is that you enabled some people to skill up and gain connections.
The downside is that you are literally helping AGI labs to scale commercially (as well as indirectly supporting capability research).

Remmelt · Feb 8 2024 · 9

A range of opinions from anonymous experts about the upsides and downsides of working on AI capabilities

I did read that compilation of advice, and responded to that in an email (16 May 2023):

"Dear [a],

People will drop in and look at job profiles without reading your other materials on the website. I'd suggest just writing a do-your-research cautionary line about OpenAI and Anthropic in the job descriptions themselves.

Also suggest reviewing whether to trust advice on whether to take jobs that contribute to capability research.

Particularly advice by nerdy researchers paid/funded by corporate tech.
Particularly by computer-minded researchers who might not be aware of the limitations of developing complicated control mechanisms to contain complex machine-environment feedback loops.

Totally up to you of course.

Warm regards,

Remmelt"

We argue for this position extensively in my article on the topic

This is what the article says:
"All that said, we think it’s crucial to take an enormous amount of care before working at an organisation that might be a huge force for harm. Overall, it’s complicated to assess whether it’s good to work at a leading AI lab — and it’ll vary from person to person, and role to role."

So you are saying that people are making a decision about working for an AGI lab that might be (or actually is) a huge force for harm. And that whether it's good (or bad) to work at an AGI lab depends on the person – ie. people need to figure this out for them personally.

Yet you are openly advertising various jobs at AGI labs on the job board. People are clicking through and applying. Do you know how many read your article beforehand?

~ ~ ~
Even if they did read through the article, both the content and framing of the advice seem misguided. Notice what is emphasised in your considerations.

Here are the first sentences of each consideration section:
(ie. as what readers are most likely to read, and what you might most want to convey).

"We think that a leading — but careful — AI project could be a huge force for good, and crucial to preventing an AI-related catastrophe."
- Is this your opinion about DeepMind, OpenAI and Anthropic?
"Top AI labs are high-performing, rapidly growing organisations. In general, one of the best ways to gain career capital is to go and work with any high-performing team — you can just learn a huge amount about getting stuff done. They also have excellent reputations more widely. So you get the credential of saying you’ve worked in a leading lab, and you’ll also gain lots of dynamic, impressive connections."
- Is this focussing on gaining prestige and (nepotistic) connections as an instrumental power move, with the hope of improving things later...?
- Instead of on actually improving safety?
"We’d guess that, all else equal, we’d prefer that progress on AI capabilities was slower."
- Why is only this part stated as a guess?
  - I did not read "we'd guess that a leading but careful AI project, all else equal, could be a force for good".
  - Or inversely: "we think that continued scaling of AI capabilities could be a huge force for harm."
  - Notice how those framings come across very differently.
- Wait, reading this section further is blowing my mind.
  - "But that’s not necessarily the case. There are reasons to think that advancing at least some kinds of AI capabilities could be beneficial. Here are a few"
  - "This distinction between ‘capabilities’ research and ‘safety’ research is extremely fuzzy, and we have a somewhat poor track record of predicting which areas of research will be beneficial for safety work in the future. This suggests that work that advances some (and perhaps many) kinds of capabilities faster may be useful for reducing risks."
    - Did you just argue for working on some capabilities because it might improve safety? This is blowing my mind.
  - "Moving faster could reduce the risk that AI projects that are less cautious than the existing ones can enter the field."
    - Are you saying we should consider moving faster because there are people less cautious than us?
    - Do you notice how a similarly flavoured argument can be used by and is probably being used by staff at three leading AGI labs that are all competing with each other?
    - Did OpenAI moving fast with ChatGPT prevent Google from starting new AI projects?
  - "It’s possible that the later we develop transformative AI, the faster (and therefore more dangerously) everything will play out, because other currently-constraining factors (like the amount of compute available in the world) could continue to grow independently of technical progress."
    - How would compute grow independently of AI corporations deciding to scale up capability?
    - The AGI labs were buying up GPUs to the point of shortage. Nvidia was not able to supply them fast enough. How is that not getting Nvidia and other producers to increase production of GPUs?
    - More comments on the hardware overhang argument here.
  - "Lots of work that makes models more useful — and so could be classified as capabilities (for example, work to align existing large language models) — probably does so without increasing the risk of danger"
    - What is this claim based on?
"As far as we can tell, there are many roles at leading AI labs where the primary effects of the roles could be to reduce risks."
1. As far as I can tell, this is not the case.
  1. For technical research roles, you can go by what I just posted.
  2. For policy, I note that you wrote the following:
    "Labs also often don’t have enough staff... to figure out what they should be lobbying governments for (we’d guess that many of the top labs would lobby for things that reduce existential risks)."
    1. I guess that AI corporations use lobbyists for lobbying to open up markets for profit, and to not get actually restricted by regulations (maybe to move focus to somewhere hypothetically in the future, maybe to remove upstart competitors who can't deal with the extra compliance overhead, but don't restrict us now!).
    2. On prior, that is what you should expect, because that is what tech corporations do everywhere. We shouldn't expect on prior that AI corporations are benevolent entities that are not shaped by the forces of competition. That would be naive.

~ ~ ~
After that, there is a new section titled "How can you mitigate the downsides of this option?"

That section reads as thoughtful and reasonable.
How about on the job board, you link to that section in each AGI lab job description listed, just above the 'VIEW JOB DETAILS' button?
- For example, you could append and hyperlink 'Suggestions for mitigating downsides' to the organisational descriptions of Google DeepMind, OpenAI and Anthropic.
That would help guide potential applicants to AGI lab positions to think through their decision.

William the Kiwi · Feb 8 2024 · 5

"This distinction between ‘capabilities’ research and ‘safety’ research is extremely fuzzy, and we have a somewhat poor track record of predicting which areas of research will be beneficial for safety work in the future. This suggests that work that advances some (and perhaps many) kinds of capabilities faster may be useful for reducing risks."

This seems like an absurd claim. Are 80k actually making it?

EDIT: the claim is made by Benjamin Hilton, one of 80k's analysts and the person the OP is replying to.

Remmelt · Feb 8 2024 · 0

It is an extreme claim to make in that context, IMO.

I think Benjamin made it to be nuanced. But the nuance in that article is rather one-sided.

If anything, the nuance should be on the side of identifying any ways you might accidentally support the development of dangerous auto-scaling technologies.

First, do no harm.

Remmelt · Feb 8 2024 · 1

Note that we are focussing here on decisions at the individual level.
There are limitations to that.

See my LessWrong comment.

William the Kiwi · Feb 8 2024 · 2

I would agree with Remmelt here. While upskilling people is helpful, if those people then go on to increase the rate of capabilities gain by AI companies, this is reducing the time the world has available to find solutions to alignment and AI regulation.

While, as a rule, I don't disagree with industries increasing their capabilities, I do disagree with this when those capabilities knowingly lead to human extinction.

Owen Cotton-Barratt · Feb 7 2024 · 15

I think 1 and 3 seem like arguments that reduce the desirability of these roles but it's hard to see how they can make them net-negative.

Arguments 4 and to some extent 2 give a real case that could in principle make something net-negative but I'm sceptical that the effect scales that far. In particular if this were right, I think it would effectively say that it would be better if AI labs invested less rather than more in safety. I can't rule out that that's correct, but it seems like a pretty galaxy-brained take and I would want some robust arguments before I took it seriously, and I don't think these are close to meeting that threshold for me personally.

Further, I think that there are a bunch of arguments for the value of safety work within labs (e.g. access to sota models; building institutional capacity and learning; cultural outreach) which seem to me to be significant and you're not engaging with.

Remmelt · Feb 7 2024 · 3

I think 1 and 3 seem like arguments that reduce the desirability of these roles but it's hard to see how they can make them net-negative.

Yes, specifically by claim 1, positive value can only asymptotically approach 0 (ignoring opportunity costs).

For small specialised models (designed for specific uses in a specific context of use for a specific user group), we see in practice that safety R&D can make a big difference.
For 'AGI', I would argue that the system cannot be controlled sufficiently to stay safe.
Unscoped everything-for-everyone models (otherwise called 'general purpose AI') sit somewhere in between.
- I think progress on generalisable safety R&D is practically intractable at the rates at which AI corporations are competing to scale model sizes and uses.
- The functioning of the model weights during computation is too variable depending on changes in the input (distribution), and the contexts the models output into are also too varied (too many possible paths through which the propagated effects could cause failures in irregular locality-dependent ways, given the complexity of the nested societies and ecosystems we humans depend on to live and to live well).

Arguments 4 and to some extent 2 give a real case that could in principle make something net-negative but I'm sceptical that the effect scales that far.

Some relevant aspects are missing in what you shared so far.

Particularly, we need to consider that any one AGI lab is (as of now) beholden to the rest of society to continue operating.

This is clearly true in the limit. Imagine some freak mass catastrophe caused by OpenAI:
staff would leave, consumers would stop buying, and regulators would shut the place down.

But it is also true in practice.
From the outside, these AGI labs may look like institutional pillars of strength. But from the inside, management is constantly jostling, trying to source enough investments and/or profitable productisation avenues to cover high staff salaries and compute costs. This is why I think DeepMind allowed themselves to be acquired by Google in the first place. They ran a $649 million loss in 2019, and could simply not maintain that burn rate without a larger tech corporation covering their losses for them.

In practice, AGI labs are constantly finding ways to make themselves look serious about safety, and finding ways to address safety issues customers are noticing. Not just because some employees there are paying attention to those harms and taking care to avoid them. But also because they're dealing with newly introduced AI products that already have lots of controversies associated with them (in these rough categories: data laundering, worker exploitation, design errors and misuses, resource-intensive and polluting hardware).

If we think about this in simplified dimensions:

There is an actual safety dimension, as would be defined by how the (potential) effects of this technology impact us humans and the world contexts we depend on to live.
There is a perceived safety dimension, which is defined by how safe we perceive the system to be, based in a small part on our own direct experience and reasoning, and for a large part on what we hear/read from others around us.

Outside stakeholders would need to perceive the system to be unsafe to restrict further scaling and/or uses (which IMO is much more effective than trying to make scaled open-ended systems comprehensively safe after the fact). Where 'the system' can include the institutional hierarchies and infrastructure through which AI models are developed and deployed.

Corporations have a knack for finding ways to hide product harms, while influencing people to not notice or to dismiss those harms. See the cases of Big Tobacco, Big Pharma, and Big Oil.

Corporations that manage to do that can make profit from selling products without getting shut down. This is what capitalism – open market transactions and private profit reinvestment – in part selects for. This is what Big Tech companies that win out over time manage to do.

(it feels like I'm repeating stuff obvious to you, but it bears repeating to set the context)

In particular if this were right, I think it would effectively say that it would be better if AI labs invested less rather than more in safety.

Are you stating an intuition that it would be surprising if AGI labs investing less in improving actual safety turned out to be less harmful overall?

I am saying with claim 4. that there is another dimension, perceived safety.
The more that an AI corporation is able to make the system be or at least look *locally* safe to users and other stakeholders (even if globally much more unsafe), the more the rest of society will permit and support the corporations to scale on. And the more that the AI corporation can promote that they are responsibly scaling toward some future aligned system that is *globally* safe, the more that nerdy researchers and other stakeholders open to that kind of messaging can treat that as a sign of virtue and give the corporation a pass there too.

And unfortunately, by claim 1, actual safety is intractable when scaling such open-ended (and increasingly automated) systems. Which is why in established safety-critical industries – eg. for medical devices, cars, planes, industrial plants, even kitchen devices – there are best practices for narrowly scoping the design of the machines to specific uses and contexts of use.

So actual safety is intractable for such open-ended systems, but AI corporations can and do disproportionately support research and research communication that increases perceived safety.

But actual safety is tractable for restricting corporate AI scaling (if you reduce the system's degrees of freedom of interaction, you reduce the possible ways things can go wrong). Unfortunately, fewer people move to restrict corporate-AI scaling if the corporate activities are perceived to be safe.

By researching safety at AGI labs, researchers are therefore predominantly increasing perceived safety, and as a result closing off realistic opportunities to improve actual safety.

Owen Cotton-Barratt · Feb 7 2024 · 9

Thanks. I'm now understanding your central argument to be:

Improving the quality/quantity of output from safety teams within AI labs has a (much) bigger impact on perceived safety of the lab than it does on actual safety of the lab. This is therefore the dominant term in the impact of the team's work. Right now it's negative.

Is that a fair summary?

If so, I think:

Conditional on the premise, the conclusion appears to make sense
- It still feels kinda galaxy-brained, which may make me want to retain some scepticism
However, I feel way less confident than you in the premise, for (I believe) a number of reasons:
- I'm more optimistic e.g. that control turns out to be useful, or that there are hacky alignment techniques which work long enough to get through to the automation of crucial safety research
- I think that there are various non-research pathways for such people to (in expectation) increase the safety of the lab they're working at
- It's unclear to me what the sign is of quality-of-safety-team-work on perceived-safety to the relevant outsiders (investors/regulators?)
  - e.g. I think that one class of work people in labs could do is capabilities monitoring, and I think that if this were done to a good standard it could in fact help to reduce perceived-safety to outsiders in a timely fashion
  - I guess I'm quite sceptical that signals like "well the safety team at this org doesn't really have any top-tier researchers and is generally a bit badly thought of" will be meaningfully legible to the relevant outsiders, so I don't really think that reducing the quality of their work will have too much impact on perceived safety

Remmelt · Feb 7 2024 · 3

Thanks, I appreciate the paraphrase. Yes, that is a great summary.

I'm more optimistic e.g. that control turns out to be useful, or that there are hacky alignment techniques which work long enough to get through to the automation of crucial safety research

I hear this all the time, but I also notice that people saying it have not investigated the fundamental limits to controllability that you would encounter with any control system.

As a philosopher, would you not want to have a more generalisable and robust argument that this is actually going to work out?

I think that there are various non-research pathways for such people to (in expectation) increase the safety of the lab they're working at

I'm curious about the pathways you have in mind. I may have missed something here.

e.g. I think that one class of work people in labs could do is capabilities monitoring, and I think that if this were done to a good standard it could in fact help to reduce perceived-safety to outsiders in a timely fashion

I'm skeptical that that would work in this corporate context.

"Capabilities" are just too useful economically and can creep up on you. Putting aside whether we can even measure comprehensively enough for "dangerous capabilities".

In the meantime, it's great marketing to clients, to the media, and to national interests:
You are working on AI systems that could become so capable, that you even have an entire team devoted to capabilities monitoring.

I guess I'm quite sceptical that signals like "well the safety team at this org doesn't really have any top-tier researchers and is generally a bit badly thought of" will be meaningfully legible to the relevant outsiders, so I don't really think that reducing the quality of their work will have too much impact on perceived safety

This is interesting. And a fair argument. Will think about this.

Owen Cotton-Barratt · Feb 7 2024 · 5

I'm curious about the pathways you have in mind. I may have missed something here.

I think it's basically things flowing in some form through "the people working on the powerful technology spend time with people seriously concerned with large-scale risks". From a very zoomed out perspective it just seems obvious that we should be more optimistic about worlds where that's happening compared to worlds where it's not (which doesn't mean that necessarily remains true when we zoom in, but it sure affects my priors).

If I try to tell more concrete stories they include things of the form "the safety-concerned people have better situational awareness and may make better plans later", and also "when systems start showing troubling indicators, culturally that's taken much more seriously". (Ok, I'm not going super concrete in my stories here, but that's because I don't want to anchor things on a particular narrow pathway.)

Remmelt · Feb 8 2024 · 1

Thanks for clarifying.

Owen Cotton-Barratt · Feb 7 2024 · 4

I hear this all the time, but I also notice that people saying it have not investigated the fundamental limits to controllability that you would encounter with any control system.

As a philosopher, would you not want to have a more generalisable and robust argument that this is actually going to work out?

Of course I'd prefer to have something more robust. But I don't think the lack of that means it's necessarily useless.

I don't think control is likely to scale to arbitrarily powerful systems. But it may not need to. I think the next phase of the problem is like "keep things safe for long enough that we can get important work out of AI systems", where the important work has to be enough that it can be leveraged to something which sets us up well for the following phases.

Remmelt · Feb 8 2024 · 1

I don't think control is likely to scale to arbitrarily powerful systems. But it may not need to... which sets us up well for the following phases.

Under the concept of 'control', I am including the capacity of the AI system to control its own components' effects.

I am talking about the fundamental workings of control, i.e. control theory and cybernetics.
That is, general enough that the results are applicable to any following phases as well.

Anders Sandberg has been digging lately into fundamental controllability limits.
Could be interesting to talk with Anders.

William the Kiwi · Feb 8 2024 · 2

I would agree that this is a good summary:

Improving the quality/quantity of output from safety teams within AI labs has a (much) bigger impact on perceived safety of the lab than it does on actual safety of the lab. This is therefore the dominant term in the impact of the team's work. Right now it's negative.

If perception of safety is higher than actual safety, it will lead to underinvestment in future safety, which increases the probability of failure of the system.

Remmelt · Feb 7 2024 · 1

Further, I think that there are a bunch of arguments for the value of safety work within labs (e.g. access to sota models; building institutional capacity and learning; cultural outreach) which seem to me to be significant and you're not engaging with.

Let's dig into the arguments you mentioned then.

Access to SOTA models
- Given that safety research is intractable where open-ended and increasingly automated systems are scaled anywhere near current rates, I don't really see the value proposition here.
- I guess if researchers noticed a bunch of bad design practices and violations of the law in inspecting the SOTA models, they could leak information about that to the public?
Building institutional capacity and learning
- Inside a corporation competing against other corporations, where the more power-hungry individuals tend to find ways to the top, the institutional capacity-building and learning you will see will be directed towards extracting more profit and power.
- I think this argument considered within its proper institutional context actually cuts against your current conclusion.
Cultural outreach
- This reminds me of the cultural exchanges between US and Soviet scientists during the Cold War. Are you thinking of something like that?
- Saying that, I notice that the current situation is different in the sense that AI Safety researchers are not one side racing to scale proliferation of dangerous machines in tandem with the other side (AGI labs).
- To the extent, though, that AI Safety researchers can come to share collectively important insights with colleagues at AGI labs (such as on why and how to stop scaling dangerous machine technology), this cuts against my conclusion.
- Looking from the outside, I haven't seen that yet. Early AGI safety thinkers (eg. Yudkowsky, Tegmark) and later funders (eg. Tallinn, Karnofsky) instead supported AGI labs to start up, even if they did not mean to.
- But I'm open (and hoping!) to change my mind.
  It would be great if safety researchers at AGI labs start connecting to collaborate effectively on restricting harmful scaling.

I'm going off the brief descriptions you gave.
Does that cover the arguments as you meant them? What did I miss?

William the Kiwi · Feb 8 2024 · 8

Of the four reasons you listed, reason 4 (safety-washing) seems the most important. Safety-washing, alongside the related ethics-washing and green-washing, is an effective technique that industry uses to improve people's perception of the industry. Lizka wrote a post on this. These techniques are used by many industries, particularly by industries that produce significant externalities such as the oil industry. These techniques are used because they work, because they give people an out. It is easier to think about the shiny flowers on an ad than it is to think about the reality of an industry killing people.

Safety-washing of AI is harmful as it gives people an out, a chance to repeat the line "well at least they are allegedly doing some safety stuff", which is a convenient distraction from the fact that AI labs are knowingly developing a technology that can cause human extinction. This distraction causes otherwise safety-conscious people to invest in or work in an industry that they would reconsider if they had access to all the information. By pointing out this distraction, we can help people make more informed decisions.

Remmelt · Feb 8 2024 · 3

Safety-washing of AI is harmful as it gives people an out, a chance to repeat the line "well at least they are allegedly doing some safety stuff", which is a convenient distraction from the fact that AI labs are knowingly developing a technology that can cause human extinction. This distraction causes otherwise safety-conscious people to invest in or work in an industry that they would reconsider if they had access to all the information.

Very much agreed.

Vaipan · Feb 7 2024 · 4

Yes, I think this is a very useful phenomenon to point at, and some people have a very naïve understanding of what these labs do, especially technical AI safety researchers whose technical education did not have critical thinking skills at its heart. I heard a lot of very candid remarks about the political influence carried out by these labs, and I am worried that these researchers lack a more global understanding of the effects of their work.

Given OpenAI's recent updates on military bans and transparency of documents, I find myself more and more cautious when it comes to trusting anyone working on AI safety. I would love to see representatives of these labs addressing the concerns raised in this post in a credible way.