Rethink Priorities is working on a project called ‘Defense in Depth Against Catastrophic AI Failures’. “Defense in depth” refers to the use of multiple redundant layers of safety and/or security measures such that each layer reduces the chance of catastrophe. Our project is intended to (1) make the case for taking a defense in depth approach to ensuring safety when deploying near-term, high-stakes AI systems and (2) identify many defense layers/measures that may be useful for this purpose.
If you can think of any possible layers, please mention them below. We’re hoping to collect a very long list of such layers, either for inclusion in our main output or to potentially investigate further in the future, so please err on the side of commenting even if your ideas are quite speculative, may not actually be useful, or may be things we’ve already thought of. Pointers to any relevant writing would also be useful.
If we end up including layers you suggest in our outputs, we’d be happy to either leave you anonymous or credit you, depending on your preference.
Some further info about the project: By “catastrophic AI failure”, we mean a harmful accident or harmful unintended use of a computer system that performs tasks typically associated with intelligent behavior (especially a machine learning system), leading to at least 100 fatalities or $1 billion in economic loss. This could include failures in contexts like power grid management, autonomous weapons, or cyber offense (if you’re interested in more concrete examples, see here).
Defense layers can relate to any phase of a technology’s development and deployment, from early development to monitoring of deployed systems to learning from failures, and can concern personnel, procedures, institutional setup, technical standards, etc.
Some examples of defense layers for AI include (find more here):
- Procedures for vetting and deciding on institutional partners, investors, etc.
- Methods for scaling human supervision and feedback during and after training high-stakes ML systems
- Tools for blocking unauthorized use of developed/trained IP, akin to the permissive action links (PALs) on nuclear weapons (see the rough sketch after this list)
- Technical methods and process methods (e.g. certification, as in Cihon et al. 2021, or perhaps benchmarks) for gaining high confidence in certain properties of ML systems, and of the inputs to ML systems (e.g. datasets), at all stages of development (à la Ashmore et al. 2019)
- Background checks & similar for people being hired or promoted to certain types of roles
- Methods for avoiding or detecting supply chain attacks
- Procedures for deciding when and how to engage one's host government to help with security and related issues
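
To make the PAL-style layer above a bit more concrete, here is a very rough, purely illustrative sketch (in Python) of what “blocking unauthorized use of trained IP” could look like in software: model weights are only loaded if the caller presents a token minted by whoever controls deployment approval. The names and the HMAC-based token scheme are just illustrative assumptions, not a proposal; a real layer would presumably also involve hardware protections, key management, audit logging, etc.

```python
import hashlib
import hmac

# Hypothetical key held by whoever is authorized to approve deployments.
RELEASE_AUTHORITY_KEY = b"example-key-held-by-the-release-authority"


def issue_authorization(model_id: str, purpose: str) -> str:
    """Mint a token out of band for a specific model and approved purpose."""
    message = f"{model_id}:{purpose}".encode()
    return hmac.new(RELEASE_AUTHORITY_KEY, message, hashlib.sha256).hexdigest()


def load_model_weights(model_id: str, purpose: str, token: str) -> None:
    """Refuse to load the trained weights unless a valid token is presented."""
    expected = issue_authorization(model_id, purpose)
    if not hmac.compare_digest(expected, token):
        raise PermissionError(f"Unauthorized use of model '{model_id}' blocked.")
    # ...decrypt and load the actual weights here...
    print(f"Loading '{model_id}' for approved purpose: {purpose}")


# Example: only a caller holding a token minted for this model and purpose gets through.
token = issue_authorization("grid-controller-v3", "scheduled-maintenance-test")
load_model_weights("grid-controller-v3", "scheduled-maintenance-test", token)
```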