The Mechanism Design for AI Safety (MDAIS) reading group, announced here, is currently in its eighth of twelve weeks. I'm very excited by the quality of discussions we've had so far, and for the potential of future work from members of this group. If you're interested in working at the intersection of mechanism design and AI safety, please send me a message so that I can keep you in mind for future opportunities.
Edit: we have completed this initial list and are now meeting on a monthly basis. You can sign up to attend the meetings here.
A number of people have reached out to ask me for the reading list we're using. Until now, I've had to tell them that it was still being developed, but at long last it has been finalized. This post shares the list publicly for anyone curious about what we've been discussing, or who would like to follow along themselves. It goes week by week, listing the papers covered, the topics of discussion, and any notes I have. After the first two weeks, the order of the papers covered is largely inconsequential.
Reading List
Updated as of October 25th, 2024
Week 1
Papers:
- The Principal-Agent Alignment Problem in Artificial Intelligence by Dylan Hadfield-Menell
- Incomplete Contracting and AI Alignment by Dylan Hadfield-Menell and Gillian Hadfield
Discussion: Introductions, formalization of the alignment problem, inverse reinforcement learning and cooperative inverse reinforcement learning
Notes: The Principal-Agent Alignment Problem in Artificial Intelligence is extremely long, essentially multiple papers concatenated, so discussing it in the first week gave people more prep time to read it. Incomplete Contracting and AI Alignment is much shorter and less formal but did not add much; in hindsight, I would not have included it.
Week 2
Paper: Risks from Learned Optimization in Advanced Machine Learning Systems by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant
Discussion: Inner vs. outer alignment, what applications mechanism design has for each "step" in alignment
Week 3
Paper: Decision Scoring Rules (Extended Version) by Caspar Oesterheld and Vincent Conitzer
Discussion: Oracle AI, making predictions safely
Week 4
Paper: Discovering Agents by Zachary Kenton, Ramana Kumar, Sebastian Farquhar, Jonathan Richens, Matt MacDermott, and Tom Everitt
Discussion: Defining agents, using causal influence diagrams in AI safety
Week 5
Papers:
- Model-Free Opponent Shaping by Chris Lu, Timon Willi, Christian Schroeder de Witt, and Jakob Foerster
- The Good Shepherd: An Oracle Agent for Mechanism Design by Jan Balaguer, Raphael Koster, Christopher Summerfield, and Andrea Tacchetti
Discussion: Mechanism design affecting learning, how deception might arise
Notes: Almost everything in The Good Shepherd was also covered in Model-Free Opponent Shaping, so in hindsight including it as well was redundant.
Week 6
Paper: Fully General Online Imitation Learning by Michael Cohen, Marcus Hutter, and Neel Nanda
Discussion: Advantages, disadvantages, and extensions for the mechanism proposed in the paper
Week 7
Papers:
- Corrigibility by Nate Soares, Benja Fallenstein, Eliezer Yudkowsky, and Stuart Armstrong
- The Off-Switch Game by Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell
Discussion: Formalizing issues with corrigibility, approaches to instill corrigibility
Week 8
Paper: Investment Incentives in Truthful Approximation Mechanisms by Mohammad Akbarpour, Scott Kominers, Kevin Li, Shengwu Li, and Paul Milgrom
Discussion: Implementing mechanisms with AI, issues with approximation
Week 9
Paper: Cooperation, Conflict, and Transformative Artificial Intelligence - A Research Agenda by Jesse Clifton
Discussion: Various topics from the agenda with a focus on S-risks and bargaining
Week 10
Paper: Getting Dynamic Implementation to Work (excluding sections 3 and 4) by Yi-Chun Chen, Richard Holden, Takashi Kunimoto, Yifei Sun, and Tom Wilkening
Discussion: Ensemble models, AI monitoring AI
Notes: Sections 3 and 4 of the paper were excluded as they focus on experimental results with humans, which are of minimal relevance.
Week 11
Papers:
- Learning to Communicate with Deep Multi-Agent Reinforcement Learning by Jakob N. Foerster, Yannis M. Assael, Nando de Freitas, and Shimon Whiteson
- Emergent Covert Signaling in Adversarial Reference Games by Dhara Yu, Jesse Mu, and Noah Goodman
Discussion: Detecting communication, intercepting communication
Week 12
Paper: Functional Decision Theory: A New Theory of Instrumental Rationality by Eliezer Yudkowsky and Nate Soares
Discussion: Functional decision theory, mechanism design for superrational agents and functional decision theorists
Next Steps
Once we have finished going through this reading list, I would like to move to a more infrequent and irregular schedule. Meetings would be to discuss new developments in the space, the research produced by reading group members, or topics missed during the first twelve weeks. I expect this ongoing reading group would expand beyond the initial members and be open to anyone interested.
If there is sufficient interest, another iteration going through the above reading list can be run, although likely with several updates.
Finally, we plan to collaborate on an agenda laying out promising research directions at the intersection of mechanism design and AI safety. Ideally, we will have interested members transition to a working group where we can collaborate on research to address the challenge of ensuring AI is a positive development for humanity.
Edit: We have completed the initial readings and are now meeting once a month for further readings. You can sign up to be notified here.
Ongoing Readings
Meeting 13
Paper: Safe Pareto Improvements for Delegated Game Playing by Caspar Oesterheld and Vincent Conitzer
Meeting 14
Papers:
- Quantilizers: A Safer Alternative to Maximizers for Limited Optimization by Jessica Taylor
- Safety Considerations for Online Generative Modeling by Sam Marks
Meeting 16
Paper: A Robust Bayesian Truth Serum for Small Populations by Jens Witkowski and David C. Parkes
Meeting 17
Paper: Misspecification in Inverse Reinforcement Learning by Joar Skalse and Alessandro Abate
Meeting 18
Paper: Hidden Incentives for Auto-Induced Distributional Shift by David Krueger, Tegan Maharaj, and Jan Leike
Meeting 19
Paper: Evolution of Preferences by Eddie Dekel, Jeffrey Ely, and Okan Yilankaya
Meeting 20
Paper: A Theory of Rule Development by Glenn Ellison and Richard Holden
Meeting 21
Paper: Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback by Stephen Casper et al.
Meeting 22
Paper: Natural Selection of Artificial Intelligence by Jeffrey Ely and Balazs Szentes
Meeting 23
Paper: The Shutdown Problem: Incomplete Preferences as a Solution by Elliott Thornley
Meeting 24
Paper: Conservative Agency via Attainable Utility Preservation by Alex Turner, Dylan Hadfield-Menell, and Prasad Tadepalli
This is really cool, a whole topic within AI safety that I haven’t seen much focus on. I’ll plan to read some of these papers, but if any of your participants is interested in writing up summaries or key insights from the curriculum I’d definitely be interested in reading them.