AI safety
Studying and reducing the existential risks posed by advanced artificial intelligence

Quick takes

We should expect the incentives and culture of AI-focused companies to make them uniquely terrible for producing safe AGI.    From a “safety from catastrophic risk” perspective, I suspect an “AI-focused company” (e.g. Anthropic, OpenAI, Mistral) is abstractly pretty close to the worst possible organizational structure for getting us towards AGI. I have two distinct but related reasons: 1. Incentives 2. Culture From an incentives perspective, consider realistic alternative organizational structures to “AI-focused company” that nonetheless have enough firepower to host successful multibillion-dollar scientific/engineering projects: 1. As part of an intergovernmental effort (e.g. CERN’s Large Hadron Collider, the ISS) 2. As part of a governmental effort of a single country (e.g. Apollo Program, Manhattan Project, China’s Tiangong) 3. As part of a larger company (e.g. Google DeepMind, Meta AI) In each of those cases, I claim that there are stronger (though still not ideal) organizational incentives to slow down, pause/stop, or roll back deployment if there is sufficient evidence or reason to believe that further development can result in major catastrophe. In contrast, an AI-focused company has every incentive to go ahead on AI when the case for pausing is uncertain, and minimal incentive to stop or even take things slowly.  From a culture perspective, I claim that without knowing any details of the specific companies, you should expect AI-focused companies to be more likely than plausible contenders to have the following cultural elements: 1. Ideological AGI Vision AI-focused companies may have a large contingent of “true believers” who are ideologically motivated to make AGI at all costs and 2. No Pre-existing Safety Culture AI-focused companies may have minimal or no strong “safety” culture where people deeply understand, have experience in, and are motivated by a desire to avoid catastrophic outcomes.  The first one should be self-explanatory. Th
Most possible goals for AI systems are concerned with process as well as outcomes. People talking about possible AI goals sometimes seem to assume something like "most goals are basically about outcomes, not how you get there". I'm not entirely sure where this idea comes from, and I think it's wrong. The space of goals which are allowed to be concerned with process is much higher-dimensional than the space of goals which are just about outcomes, so I'd expect that on most reasonable senses of "most" process can have a look-in. What's the interaction with instrumental convergence? (I'm asking because vibe-wise it seems like instrumental convergence is associated with an assumption that goals won't be concerned with process.) * Process-concerned goals could undermine instrumental convergence (since some process-concerned goals could be fundamentally opposed to some of the things that would otherwise get converged-to), but many process-concerned goals won't * Since instrumental convergence is basically about power-seeking, there's an evolutionary argument that you should expect the systems which end up with most power to have the power-seeking behaviours * I actually think there are a couple of ways for this argument to fail: 1. If at some point you get a singleton, there's now no evolutionary pressure on its goals (beyond some minimum required to stay a singleton) 2. A social environment can punish power-seeking, so that power-seeking behaviour is not the most effective way to arrive at power * (There are some complications to this I won't get into here) * But even if it doesn't fail, it pushes towards things which have Omohundro's basic AI drives (and so pushes away from process-concerned goals which could preclude those), but it doesn't push all the way to purely outcome-concerned goals In general I strongly expect humans to try to instil goals that are concerned with process as well as outcomes. Even if that goes wrong, I mostly expect
I spent way too much time organizing my thoughts on AI loss-of-control ("x-risk") debates without any feedback today, so I'm publishing perhaps one of my favorite snippets/threads: A lot of debates seem to boil down to under-acknowledged and poorly-framed disagreements about questions like “who bears the burden of proof.” For example, some skeptics say “extraordinary claims require extraordinary evidence” when dismissing claims that the risk is merely “above 1%”, whereas safetyists argue that having >99% confidence that things won’t go wrong is the “extraordinary claim that requires extraordinary evidence.”  I think that talking about “burdens” might be unproductive. Instead, it may be better to frame the question more like “what should we assume by default, in the absence of definitive ‘evidence’ or arguments, and why?” “Burden” language is super fuzzy (and seems a bit morally charged), whereas this framing at least forces people to acknowledge that some default assumptions are being made and consider why.  To address that framing, I think it’s better to ask/answer questions like “What reference class does ‘building AGI’ belong to, and what are the base rates of danger for that reference class?” This framing at least pushes people to make explicit claims about what reference class building AGI belongs to, which should make it clearer that it doesn’t belong in your “all technologies ever” reference class.  In my view, the "default" estimate should not be “roughly zero until proven otherwise,” especially given that there isn’t consensus among experts and the overarching narrative of “intelligence proved really powerful in humans, misalignment even among humans is quite common (and is already often observed in existing models), and we often don’t get technologies right on the first few tries.”
I find it encouraging that EAs have quickly pivoted to viewing AI companies as adversaries, after a long period of uneasily viewing them as necessary allies (cf. Why Not Slow AI Progress?). Previously, I worried that social/professional entanglements and image concerns would lead EAs to align with AI companies even after receiving clear signals that AI companies are not interested in safety. I'm glad to have been wrong about that. Caveat: we've only seen this kind of scrutiny applied to OpenAI, and it remains to be seen whether Anthropic and DeepMind will get the same scrutiny.
Being mindful of the incentives created by pressure campaigns I've spent the past few months trying to think about the whys and hows of large-scale public pressure campaigns (especially those targeting companies — of the sort that have been successful in animal advocacy). A high-level view of these campaigns is that they use public awareness and corporate reputation as a lever to adjust corporate incentives. But making sure that you are adjusting the right incentives is more challenging than it seems. Ironically, I think this is closely connected to specification gaming: it's often easy to accidentally incentivize companies to do more to look better, rather than doing more to be better. For example, an AI-focused campaign calling out RSPs recently began running ads that single out AI labs for speaking openly about existential risk (quoting leaders acknowledging that things could go catastrophically wrong). I can see why this is a "juicy" lever — most of the public would be pretty astonished/outraged to learn some of the beliefs that are held by AI researchers. But I'm not sure if pulling this lever is really incentivizing the right thing. As far as I can tell, AI leaders speaking openly about existential risk is good. It won't solve anything in and of itself, but it's a start — it encourages legislators and the public to take the issue seriously. In general, I think it's worth praising this when it happens. I think the same is true of implementing safety policies like RSPs, whether or not such policies are sufficient in and of themselves. If these things are used as ammunition to try to squeeze out stronger concessions, it might just incentivize the company to stop doing the good-but-inadequate thing (i.e. CEOs are less inclined to speak about the dangers of their product when it will be used as a soundbite in a campaign, and labs are probably less inclined to release good-but-inadequate safety policies when doing so creates more public backlash than they were
American Philosophical Association (APA) announces two $10,000 AI2050 Prizes for philosophical work related to AI, with June 23, 2024 deadline:
A lot of policy research seems to be written with an agenda in mind to shape the narrative. This undermines the point of policy research, which is supposed to inform stakeholders rather than actively convince or nudge them. It might cause polarization on some topics and is itself probably eroding the legitimacy of the space. I have seen concerning parallels in the non-profit space, where some third-sector actors endorse or do things which they see as good but which destroy trust in the whole space. This gives me scary unilateralist's curse vibes.
Y-Combinator wants to fund Mechanistic Interpretability startups "Understanding model behavior is very challenging, but we believe that in contexts where trust is paramount it is essential for an AI model to be interpretable. Its responses need to be explainable. For society to reap the full benefits of AI, more work needs to be done on explainable AI. We are interested in funding people building new interpretable models or tools to explain the output of existing models." Link (Scroll to 12) What they look for in startup founders