Interpretability is the degree to which the decision processes and inner workings of AI and machine learning systems can be understood by humans or other outside observers.^[1]

Present-day machine learning systems are typically not very transparent or interpretable. A model's output can be used, but the model cannot explain why it produced that output. This opacity makes it hard to determine the cause of biases in ML models.^[1]

Interpretability is a focus of Chris Olah's and Anthropic's work, though most AI alignment organisations, such as Redwood Research, work on interpretability to some extent.^[2]

Related entries

AI risk | AI safety | Artificial intelligence

^[1] Multicore (2020) Transparency / Interpretability (ML & AI), AI Alignment Forum, August 1.
^[2] Shlegeris, Buck (2022) Answer to 'How might a herd of interns help with AI or biosecurity research tasks/questions?', Effective Altruism Forum, March 21.