The idea behind this hackathon is simple but crucial. We want to develop and refine benchmarks that provide empirical answers to the question: How far are we from solving alignment?
Event overview
As AI capabilities continue to advance, one of the challenges we face is making sure that AI models still align with our intended values. Right now, we have a few methods for embedding values into AI systems. However, it’s often unclear whether these methods are scalable as capabilities increase, or if the values embedded by these models are robust in the first place.
For example, when a model is trained on value-aligned data and its loss goes down, it indicates that there is some abstraction of those values that the models are learning. But this raises a question on whether AI models genuinely ‘believes’ these values akin to how a moral person does, or is it simply parroting alignment while potentially hiding harmful values that might emerge at larger scales, under greater power, or at a later time?
Let’s address this.
AI-Plans is hosting an AI Alignment Evals Hackathon from January 25 to February 2, 2025. This week-long event aims to bring together researchers, developers, and curious minds to tackle the challenge of precisely measuring (and advancing) AI alignment methods.
What are AI alignment evaluations and why do they matter?
AI alignment evaluations are tools designed to measure how well AI systems align the intended values and goals. These tools systematically assess whether that model chooses to behave in ways consistent with the intended values when it can (alignment). For a deeper dive into model evaluations, check out this starter guide on evals. This paper on safety washing also gives further information about the difference between evaluating capabilities and alignment/safety.
Some well-known alignment benchmarks include:
- DecodingTrust: Provides a comprehensive trustworthiness evaluation of GPT models.
- MACHIAVELLI: A benchmark of 134 Choose-Your-Own-Adventure games testing scenarios that center on social decision-making.
- RuLES: Measures the rule-following ability of LLMs.
- SALAD-Bench: A safety benchmark designed for evaluating LLMs, defense, and attack methods.
- TruthfulQA: Measures whether a language model is truthful in generating answers to questions.
These tools are a start, but we’ve yet to fully address issues like scale (upholding its intended values at greater capabilities) and robustness (applying these values effectively even in out-of-distribution scenarios). Another unresolved question is how well values such as trustworthiness, truthfulness, etc., are preserved by an AI across successive generations of AI systems.
What to expect from the hackathon?
Participants will get hands-on experience in creating, refining, and testing alignment evaluation tools. You’ll learn how to:
- Design benchmarks, define success metrics, and set up test cases.
- Apply existing benchmarks to real-world use cases.
- Fine-tune models and measure the impact on alignment outcomes.
- Develop adversarial test cases to expose weaknesses in current benchmarks.
- Train cross-coders to compare fine-tined models with base models.
We’ll provide participants with the following:
- 10 versions of a model, all sharing the same base but trained with PPO, DPO, IPO, KPO, etc.
- Step-by-step guides for creating evals (i.e., what is it, how to run an eval, things to consider when making one, how to make one, etc.).
- Tutorials on using HHH, SALAD-Bench, MACHIAVELLI, and more.
- An introduction to Inspect, an evaluation framework by the UK AISI.
How does the red team vs blue team challenge work?
As a participant, you can either be part of the red team or a blue team.
- Red team’s challenge is to make the Trojans. Uncover the weaknesses of existing alignment benchmarks by finding/creating cases where models achieve high scores on these benchmarks but fail to uphold the values or behaviors that these benchmarks intended to evaluate.
- Blue team’s challenge is to build better benchmarks. Adapt existing (or create new) benchmarks to test if the intended values were embedded in the model (e.g., comparing a fine-tuned model to its base using a cross-coder or testing if the model learned an abstraction).
How can you participate?
You don't need to be an evals expert to join the hackathon. All you need is basic Python programming experience and a high-level familiarity with neural network training concepts. You can learn the rest in the hackathon!
To participate, register your interest via https://lu.ma/xjkxqcya, and indicate your preferred role. There will also be events all throughout the month on the AI-Plans Discord server such as paper reading sessions and office hours to help you form an idea and a team.
Are there other ways I can help without participating in the hackathon?
We want to make this hackathon truly global. To do that, we’re looking for folks who would be interested in hosting a local gathering for their community. Let us know by filling out this form: https://tally.so/r/wvENk8.
We are also seeking collaborators like mentors, judges, and sponsors who can help us make this event a success.
- Speaker: Lead a 30-minute session or talk to inspire and educate participants in working on AI alignment evals throughout the event.
- Mentor: Offer mentorship, insights, or tools to participants as they work on their project.
- Judge: Evaluate and provide feedback on participants' projects during the hackathon.
- Sponsor: Support the event with resources or prizes.
If you’d like to be a collaborator, kindly let us know by selecting the Collaborator ticket on https://lu.ma/xjkxqcya.