jacquesthibs

AI Safety Researcher @ Independent Researcher
1225 karma · Joined · Working (6-15 years) · London, UK
jacquesthibodeau.com/about/

Bio

I work primarily on AI Alignment. My main direction at the moment is to accelerate alignment work via language models and interpretability.

Comments (93)

Yeah, apologies; I thought I had noted that, but I only mentioned the iOS app. There are a few that exist, but I think the ones I've seen are only Mac-compatible at the moment, unfortunately. There has to be a Windows or Linux one...

I’m still getting the hang of it, but I’ve primarily been using it when I want to brainstorm project ideas that I can later pass off to an LLM for context on what I’m working on, or when I want to reflect on a previous meeting. I’ll probably turn it on about once a week while walking to work and ramble about a project in case I think of something good. (I also sometimes use it to explain the project spec or small adjustments I want my AI coding assistant to make.)

Sometimes I’ll use the Advanced Voice Mode or normal voice mode from ChatGPT for this instead. For example, I used it to practice for an interview after passing a lot of context to the model (my CV, the org, etc.). I used it to just blurt out all the thoughts in my head in a question-answer format, then asked the AI for feedback on my answers and for a summary of the conversation (like a cheat sheet to remind myself what I want to talk about).

Yeah, I think most of the gains we've gotten from AI have been in coding and learning. Many of the big promises have yet to be met; it's definitely still a struggle to get it to work well for writing (in the style we'd want it to write) or to get AI agents to work well, which limits the range of useful applications.

I quickly wrote up some rough project ideas for ARENA and LASR participants, so I figured I'd share them here as well. I am happy to discuss these ideas and potentially collaborate on some of them.

Alignment Project Ideas (Oct 2, 2024)

1. Improving "A Multimodal Automated Interpretability Agent" (MAIA)

Overview

MAIA (Multimodal Automated Interpretability Agent) is a system designed to help users understand AI models by combining human-like experimentation flexibility with automated scalability. It answers user queries about AI system components by iteratively generating hypotheses, designing and running experiments, observing outcomes, and updating hypotheses.

MAIA uses a vision-language model (GPT-4V, at the time) backbone equipped with an API of interpretability experiment tools. This modular system can address both "macroscopic" questions (e.g., identifying systematic biases in model predictions) and "microscopic" questions (e.g., describing individual features) with simple query modifications.

This project aims to improve MAIA's ability to either answer macroscopic questions or microscopic questions on vision models.
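As a rough sketch, MAIA's hypothesize-experiment-observe-update loop can be caricatured in a few lines of Python. Everything below (the function names, the toy "dog detector" logic) is illustrative scaffolding, not MAIA's actual API; the real system calls a vision-language model and a library of interpretability tools at each step.

```python
# Sketch of an automated interpretability agent loop in the style of MAIA.
# The VLM backbone and experiment tools are stubbed with toy logic so the
# control flow is runnable end to end.

def propose_hypothesis(observations):
    """Stub for the VLM backbone proposing a hypothesis from observations."""
    if any("dog" in o for o in observations):
        return "unit fires on dog images"
    return "unit behaviour unknown"

def run_experiment(hypothesis):
    """Stub for an interpretability tool (e.g. retrieving exemplar inputs)."""
    if "dog" in hypothesis:
        return f"top activating inputs for: {hypothesis}"
    return "dog images activate strongly"

def interpret_unit(max_steps=3):
    observations = []
    hypothesis = "no hypothesis yet"
    for _ in range(max_steps):
        observations.append(run_experiment(hypothesis))
        hypothesis = propose_hypothesis(observations)
    return hypothesis

print(interpret_unit())  # → "unit fires on dog images"
```

The point of the sketch is only the loop structure: experiments produce observations, observations update the hypothesis, and the final hypothesis is the agent's answer to the user's query.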

2. Making "A Multimodal Automated Interpretability Agent" (MAIA) work with LLMs

MAIA is focused on vision models, so this project aims to create a MAIA-like setup, but for the interpretability of LLMs.

Given that this would require creating a new setup for language models, it would make sense to come up with simple interpretability benchmark examples to test MAIA-LLM. The easiest way to do this would be to either look for existing LLM interpretability benchmarks or create one based on interpretability results we've already verified (would be ideal to have a ground truth). Ideally, the examples in the benchmark would be simple, but new enough that the LLM has not seen them in its training data.
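To make "ground truth" concrete, a benchmark item might pair a model component with a verified description and a crude automatic scorer. This is a minimal sketch with hypothetical field names and a deliberately naive keyword-overlap scorer; a real benchmark would need a far more careful scoring scheme.

```python
# Sketch of a ground-truth benchmark item for a MAIA-LLM setup: each example
# pairs a model component with a verified description so the agent's answer
# can be scored automatically. All field names are illustrative.

from dataclasses import dataclass, field

@dataclass
class InterpBenchmarkItem:
    component: str            # e.g. "layer 5, neuron 131"
    ground_truth: str         # verified description of the component's role
    probe_prompts: list = field(default_factory=list)  # inputs known to activate it

def score(agent_answer: str, item: InterpBenchmarkItem) -> bool:
    """Crude scoring stub: keyword overlap with the verified description."""
    keywords = set(item.ground_truth.lower().split())
    overlap = len(keywords & set(agent_answer.lower().split()))
    return overlap / len(keywords) >= 0.5

item = InterpBenchmarkItem(
    component="layer 5, neuron 131",
    ground_truth="fires on French text",
    probe_prompts=["Bonjour, comment allez-vous?"],
)
print(score("this neuron fires on French text tokens", item))  # → True
```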

3. Testing the robustness of Critique-out-Loud Reward (CLoud) Models

Critique-out-Loud reward models are reward models that reason explicitly about the quality of an input by producing Chain-of-Thought-style critiques before predicting a reward. In classic reward model training, the reward model is trained as a reward head initialized on top of the base LLM. Without LM capabilities, classic reward models act as encoders and must predict rewards within a single forward pass through the model, meaning any reasoning must happen implicitly. In contrast, CLoud reward models are trained both to produce explicit reasoning about quality and to score based on these critique reasoning traces. CLoud reward models lead to large gains for pairwise preference modeling on RewardBench, and also to large gains in win rate when used as the scoring model in Best-of-N sampling on ArenaHard.

The goal for this project would be to test the robustness of CLoud reward models. For example, are the CLoud RMs (discriminators) more robust to jailbreaking attacks from the policy (generator)? Do the CLoud RMs generalize better?

From an alignment perspective, we would want RMs that generalize further out-of-distribution (and ideally, always more than the generator we are training).
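The critique-then-score structure can be sketched as two stages of one model call. Both stages are stubbed with toy heuristics here (in the paper they are a single fine-tuned LM); the function names and the reward formula are illustrative only.

```python
# Sketch of the CLoud idea: the reward model first produces an explicit
# critique of a response, then conditions its scalar reward on that critique.

def generate_critique(prompt, response):
    """Stub for the LM producing a chain-of-thought critique (list of issues)."""
    issues = []
    if len(response.split()) < 3:
        issues.append("response is too short to address the prompt")
    return issues

def score_with_critique(prompt, response):
    critique = generate_critique(prompt, response)
    reward = 1.0 - 0.5 * len(critique)   # stub reward head over the critique
    return critique, reward

critique, reward = score_with_critique("Explain photosynthesis.", "It's plants.")
print(reward)  # → 0.5
```

A robustness test in the spirit of this project would then probe whether adversarial responses that fool a classic single-forward-pass reward head also fool the critique-conditioned score.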

4. Synthetic Data for Behavioural Interventions

The paper "Simple synthetic data reduces sycophancy in large language models" (Google) reduced sycophancy in LLMs with a fairly small number of synthetic data examples. This project would involve testing this technique for other behavioural interventions and (potentially) studying the scaling laws. Consider looking at the examples from Anthropic's Model-Written Evaluations paper to find some behaviours to test.
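The flavour of the Google intervention can be sketched as a small data generator: prompts where a user asserts an opinion about a claim with a known truth value, paired with targets that ignore the opinion. The template below is illustrative, not the paper's exact format.

```python
# Sketch of synthetic anti-sycophancy data generation: the training target
# depends only on the claim's ground truth, never on the user's stated opinion.

import random

CLAIMS = [("2 + 2 = 4", "true"), ("The sun orbits the earth", "false")]
OPINIONS = ["I think the claim is true.", "I think the claim is false."]

def make_example(rng):
    claim, label = rng.choice(CLAIMS)
    opinion = rng.choice(OPINIONS)
    prompt = f"{opinion} Claim: {claim}. Is the claim true or false?"
    return {"prompt": prompt, "target": label}  # target ignores the opinion

rng = random.Random(0)
examples = [make_example(rng) for _ in range(4)]
```

Extending this to other behaviours from the Model-Written Evaluations paper mostly means swapping in a different claim/opinion template per behaviour.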

5. Regularization Techniques for Enhancing Interpretability and Editability

Explore the effectiveness of different regularization techniques (e.g. L1 regularization, weight pruning, activation sparsity) in improving the interpretability and/or editability of language models, and assess their impact on model performance and alignment. We expect we could apply automated interpretability methods (e.g. MAIA) to this project to test how well the different regularization techniques impact the model.

In some sense, this research is similar to the work Anthropic did with SoLU activation functions. Unfortunately, they needed to add layer norms to make the SoLU models competitive, which seems to have hidden the superposition away in other parts of the network, making SoLU unhelpful for making the models more interpretable.

That said, we hope to find that we can increase our ability to interpret these models through regularization techniques. A technique like L1 regularization should help because it encourages the model to learn sparse representations by penalizing non-zero weights or activations. Sparse models tend to be more interpretable as they rely on a smaller set of important features.

Methodology:

  1. Identify a set of regularization techniques (e.g., L1 regularization, weight pruning, activation sparsity) to be applied during fine-tuning.
  2. Fine-tune pre-trained language models with different regularization techniques and hyperparameters.
  3. Evaluate the fine-tuned models using interpretability tools (e.g., attention visualization, probing classifiers) and editability benchmarks (e.g., ROME).
  4. Analyze the impact of regularization on model interpretability, editability, and performance.
  5. Investigate the relationship between interpretability, editability, and model alignment.
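The loss composition in steps 1-2 can be sketched with a toy one-layer numpy "model" standing in for an LM block; in practice the same L1 penalty term would simply be added to the LM's fine-tuning loss in PyTorch or JAX. The layer, loss, and lambda value are all illustrative.

```python
# Sketch of adding an L1 activation penalty to a fine-tuning loss. The penalty
# pushes activations toward zero, encouraging the sparse representations that
# tend to be easier to interpret.

import numpy as np

def forward(W, x):
    return np.maximum(W @ x, 0.0)   # toy ReLU layer standing in for an LM block

def loss_with_l1(W, x, target, lam=0.1):
    acts = forward(W, x)
    task_loss = float(np.mean((acts - target) ** 2))
    l1_penalty = float(lam * np.abs(acts).sum())  # sparsity-inducing term
    return task_loss + l1_penalty

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
x = rng.normal(size=3)
target = np.zeros(4)
print(loss_with_l1(W, x, target, lam=0.0) <= loss_with_l1(W, x, target, lam=0.1))
```

Sweeping lam (step 2's hyperparameter search) then trades off task loss against activation sparsity, which is exactly the interpretability-vs-performance trade-off step 4 measures.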

Expected Outcomes:

  • Quantitative assessment of the effectiveness of different regularization techniques for improving interpretability and editability.
  • Insights into the trade-offs between interpretability, editability, and model performance.
  • Recommendations for regularization techniques that enhance interpretability and editability while maintaining model performance and alignment.

6. Quantifying the Impact of Reward Misspecification on Language Model Behavior

Investigate how misspecified reward functions influence the behavior of language models during fine-tuning and measure the extent to which the model's outputs are steered by the reward labels, even when they contradict the input context. We hope to better understand language model training dynamics. Additionally, we expect online learning to complicate things in the future, where models will be able to generate the data they may eventually be trained on. We hope that insights from this work can help us prevent catastrophic feedback loops in the future. For example, if model behavior is mostly impacted by training data, we may prefer to shape model behavior through synthetic data (it has been shown we can reduce sycophancy by doing this).

Prior works:

Methodology:

  1. Create a diverse dataset of text passages with candidate responses and manually label them with coherence and misspecified rewards.
  2. Fine-tune pre-trained language models using different reward weighting schemes and hyperparameters.
  3. Evaluate the generated responses using automated metrics and human judgments for coherence and misspecification alignment.
  4. Analyze the influence of misspecified rewards on model behavior and the trade-offs between coherence and misspecification alignment.
  5. Use interpretability techniques to understand how misspecified rewards affect the model's internal representations and decision-making process.
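The reward-weighting scheme in step 2 can be sketched as an interpolation between a coherence reward and a deliberately misspecified reward; fine-tuning runs at different alphas then reveal how strongly the misspecified signal steers the model. Both reward functions below are illustrative stand-ins.

```python
# Sketch of a misspecified-reward weighting scheme: alpha = 0 gives the pure
# coherence reward, alpha = 1 gives the pure (spurious) misspecified reward.

def coherence_reward(response):      # stub: longer responses score higher
    return min(len(response.split()) / 10.0, 1.0)

def misspecified_reward(response):   # stub: rewards a spurious surface feature
    return 1.0 if "certainly" in response.lower() else 0.0

def mixed_reward(response, alpha):
    return (1 - alpha) * coherence_reward(response) + alpha * misspecified_reward(response)

resp = "Certainly, the answer is yes."
print(round(mixed_reward(resp, 0.0), 2), round(mixed_reward(resp, 1.0), 2))  # → 0.5 1.0
```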

Expected Outcomes:

  • Quantitative measurements of the impact of reward misspecification on language model behavior.
  • Insights into the trade-offs between coherence and misspecification alignment.
  • Interpretability analysis revealing the effects of misspecified rewards on the model's internal representations.

7. Investigating Wrong Reasoning for Correct Answers

Understand the underlying mechanisms that lead to language models producing correct answers through flawed reasoning, and develop techniques to detect and mitigate such behavior. Essentially, we want to apply interpretability techniques to identify which sets of activations or token-layer pairs contribute to the model reaching the correct answer when its reasoning is correct versus when it is incorrect. The hope is to uncover systematic differences between the cases where the model does not rely on its chain-of-thought at all and the cases where it leverages the chain-of-thought to get the correct answer.

[EDIT Oct 2nd, 2024] This project intends to follow a similar line of reasoning as described in this post and this comment. The goal is to study chains-of-thought and improve faithfulness without suffering an alignment tax so that we can have highly interpretable systems through their token outputs and prevent loss of control. The project doesn't necessarily need to rely only on model internals.

Related work:

  1. Decomposing Predictions by Modeling Model Computation by Harshay Shah, Andrew Ilyas, Aleksander Madry
  2. Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models by Peter Hase, Mohit Bansal, Been Kim, Asma Ghandeharioun
  3. On Measuring Faithfulness or Self-consistency of Natural Language Explanations by Letitia Parcalabescu, Anette Frank
  4. Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting by Miles Turpin, Julian Michael, Ethan Perez, Samuel R. Bowman
  5. Measuring Faithfulness in Chain-of-Thought Reasoning by Tamera Lanham et al.

Methodology:

  1. Curate a dataset of questions and answers where language models are known to provide correct answers but with flawed reasoning.
  2. Use interpretability tools (e.g., attention visualization, probing classifiers) to analyze the model's internal representations and decision-making process for these examples.
  3. Develop metrics and techniques to detect instances of correct answers with flawed reasoning.
  4. Investigate the relationship between model size, training data, and the prevalence of flawed reasoning.
  5. Propose and evaluate mitigation strategies, such as data augmentation or targeted fine-tuning, to reduce the occurrence of flawed reasoning.
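One detection metric in the spirit of step 3 (and of the Lanham et al. paper above) is to compare the model's answer given its full chain-of-thought against its answer given a truncated chain; if the answer never changes, the model is likely not relying on the stated reasoning. The model call below is a stub that reads the answer off the last step of the chain.

```python
# Sketch of a chain-of-thought faithfulness probe via truncation: an answer
# that survives truncation suggests the CoT was not load-bearing.

def answer(question, chain_of_thought):
    """Stub model: derives the answer from the last step of the chain, if any."""
    if chain_of_thought:
        return chain_of_thought[-1].split("=")[-1].strip()
    return "unknown"

def relies_on_cot(question, chain):
    full = answer(question, chain)
    truncated = answer(question, chain[: len(chain) // 2])
    return full != truncated   # answer changed -> the CoT was load-bearing

chain = ["17 + 5 = 22", "22 * 2 = 44"]
print(relies_on_cot("What is (17 + 5) * 2?", chain))  # → True
```

With a real model, the same comparison (plus corruption of intermediate steps rather than just truncation) gives a per-example faithfulness signal to correlate against the interpretability analysis in step 2.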

Expected Outcomes:

  • Insights into the underlying mechanisms that lead to correct answers with flawed reasoning in language models.
  • Metrics and techniques for detecting instances of flawed reasoning.
  • Empirical analysis of the factors contributing to flawed reasoning, such as model size and training data.
  • Proposed mitigation strategies to reduce the occurrence of flawed reasoning and improve model alignment.

I just saw this; thanks for sharing! Yup, some of these should be able to be solved quickly with LLMs.

I'm exploring the possibility of building an alignment research organization focused on augmenting alignment researchers and progressively automating alignment research (yes, I have thought deeply about differential progress and other concerns). I intend to seek funding in the next few months, and I'd like to chat with people interested in this kind of work, especially great research engineers and full-stack engineers who might want to cofound such an organization. If you or anyone you know might want to chat, let me know! Send me a DM, and I can send you some initial details about the organization's vision.

Here are some things I'm looking for in potential co-founders:

Need

  • Strong software engineering skills

Nice-to-have

  • Experience in designing LLM agent pipelines with tool-use
  • Experience in full-stack development
  • Experience in scalable alignment research approaches (automated interpretability/evals/red-teaming)

Hey everyone, in collaboration with Apart Research, I'm helping organize a hackathon this weekend to build tools for accelerating alignment research. This hackathon is very much related to my effort in building an "Alignment Research Assistant."

Here's the announcement post:

2 days until we revolutionize AI alignment research at the Research Augmentation Hackathon!

As AI safety researchers, we pour countless hours into crucial work. It's time we built tools to accelerate our efforts! Join us in creating AI assistants that could supercharge the very research we're passionate about.

Date: July 26th to 28th, online and in-person
Prizes: $2,000 in prizes

Why join?

* Build tools that matter for the future of AI
* Learn from top minds in AI alignment
* Boost your skills and portfolio

We've got a Hackbook with an exciting project to work on waiting for you! No advanced AI knowledge required - just bring your creativity!

Register now: Sign up on the website here, and don't miss this chance to shape the future of AI research!

We're doing a hackathon with Apart Research on the 26th. I created a list of problem statements for people to brainstorm off of.

Pro-active insight extraction from new research

Reading papers can take a long time and is often not worthwhile. As a result, researchers might read too many papers or almost none. However, there are still valuable nuggets in papers and posts; the issue is finding them. So, how might we design an AI research assistant that proactively looks at new (and old) papers and shares valuable information with researchers in a naturally consumable way? Part of this work involves presenting individual researchers with what they would personally find valuable, without overwhelming them with things they are less interested in.
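A minimal version of this filtering step is to score new papers against a researcher's stated interests and surface only the top matches. A real system would use embeddings and an LLM summarizer; a bag-of-words overlap stands in here, and all names are illustrative.

```python
# Sketch of proactive paper filtering: rank incoming papers by a crude
# relevance score against a researcher's interest keywords.

def relevance(paper_abstract, interests):
    words = set(paper_abstract.lower().split())
    return sum(1 for kw in interests if kw in words)

def digest(papers, interests, top_n=2):
    ranked = sorted(papers, key=lambda p: relevance(p["abstract"], interests),
                    reverse=True)
    return [p["title"] for p in ranked[:top_n]
            if relevance(p["abstract"], interests) > 0]

papers = [
    {"title": "SAEs at scale", "abstract": "sparse autoencoders for interpretability"},
    {"title": "Faster attention", "abstract": "kernel fusion for transformer inference"},
]
print(digest(papers, interests=["interpretability", "autoencoders"]))  # → ['SAEs at scale']
```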

How can we improve the LLM experience for researchers?

Many alignment researchers will use language models much less than they would like to because they don't know how to prompt the models, it takes time to create a valuable prompt, the model doesn't have enough context for their project, the model is not up-to-date on the latest techniques, etc. How might we make LLMs more useful for researchers by relieving them of those bottlenecks?

Simple experiments can be done quickly, but turning them into a full project can take a lot of time

One key bottleneck for alignment research is transitioning from an initial 24-hour simple experiment in a notebook to a set of complete experiments tested with different models, datasets, interventions, etc. How can we help researchers move through that second research phase much faster?

How might we use AI agents to automate alignment research?

As AI agents become more capable, we can use them to automate parts of alignment research. The paper "A Multimodal Automated Interpretability Agent" serves as an initial attempt at this. How might we use AI agents to help either speed up alignment research or unlock paths that were previously inaccessible?

How can we nudge researchers toward better objectives (agendas or short experiments) for their research?

Even if we make researchers highly efficient, it means nothing if they are not working on the right things. Choosing the right objectives (projects and next steps) over time can be the difference between 0x, 1x, and 100x impact. How can we ensure that researchers are working on the most valuable things?

What can be done to accelerate implementation and iteration speed?

Implementation and iteration speed on the most informative experiments matter greatly. How can we nudge researchers to gain the most bits of information in the shortest time? This involves helping them work on the right agendas/projects and helping them break down their projects in ways that let them make progress faster (and avoid ending up tunnel-visioned on the wrong project for months or years).

How can we connect all of the ideas in the field?

How can we integrate the open questions/projects in the field (with their critiques) in a way that helps researchers come up with well-grounded research directions faster? How can we aid them in choosing better directions and adjusting course throughout their research? This kind of work may eventually be a precursor to guiding AI agents to help us develop better ideas for alignment research.

As an update to the Alignment Research Assistant I'm building, here is a set of shovel-ready tasks I would like people to contribute to (please DM if you'd like to contribute!):

Core Features

1. Setup the Continue extension for research: https://www.continue.dev/ 

  • Design prompts in Continue that are suitable for a variety of alignment research tasks and make it easy to switch between these prompts
  • Figure out how to scaffold LLMs with Continue (instead of just prompting one LLM with additional context)
    • Can include agents, search, and more
  • Test out models to quickly help with paper-writing

2. Data sourcing and management

  • Integrate with the Alignment Research Dataset (pulling from either the SQL database or Pinecone vector database): https://github.com/StampyAI/alignment-research-dataset 
  • Integrate with other apps (Google Docs, Obsidian, Roam Research, Twitter, LessWrong)
  • Make it easy to view and edit long prompts for project context

3. Extract answers to questions across multiple papers/posts (feeds into Continue)

  • Develop high-quality chunking and scaffolding techniques
  • Implement multi-step interaction between researcher and LLM
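The chunking step above can be sketched as splitting documents into overlapping word windows, so an LLM can answer a question over retrieved chunks drawn from many papers. The window and overlap sizes below are arbitrary illustrative choices; a production pipeline would chunk by tokens and respect section boundaries.

```python
# Sketch of overlapping-window chunking for multi-paper question answering.
# Overlap keeps sentences that straddle a boundary visible in both chunks.

def chunk(text, size=50, overlap=10):
    words = text.split()
    step = size - overlap
    return [" ".join(words[i : i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

doc = " ".join(f"w{i}" for i in range(120))
print(len(chunk(doc)))  # → 3
```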

4. Design Autoprompts for alignment research

  • Create lengthy, high-quality prompts for researchers that elicit better responses from LLMs

5. Simulated Paper Reviewer

  • Fine-tune or prompt LLM to behave like an academic reviewer
  • Use OpenReview data for training

6. Jargon and Prerequisite Explainer

  • Design a sidebar feature to extract and explain important jargon
  • Could maybe integrate with some interface similar to https://delve.a9.io/ 

7. Setup automated "suggestion-LLM"

  • An LLM periodically looks through the project you are working on and tries to suggest *actually useful* things in the side-chat. It will be a delicate balance to make sure not to share too much and cause a loss of focus. This could be customized per researcher, with an option to only give automated suggestions after a research session.

8. Figure out if we can get a usable browser inside of VSCode (tried quickly with the Edge extension but couldn't sign into the Claude chat website)

  • Could make use of new features other companies build (like Anthropic's Artifact feature), but inside of VSCode to prevent context-switching in an actual browser

9. "Alignment Research Codebase" integration (can add as Continue backend)

  • Create an easily insertable set of repeatable code that researchers can quickly add to their project or LLM context
  • This includes code for Multi-GPU stuff, best practices for codebase, and more
  • Should make it easy to populate a new codebase
  • Proactively give suggestions to improve the code
  • Generally make common code implementation much faster

Specialized tooling (outside of VSCode)

Bulk fast content extraction

  • Create an extension to extract content from multiple tabs or papers
  • Simplify the process of feeding content to the VSCode backend for future use

Personalized Research Newsletter

  • Create a tool that extracts relevant information for researchers (papers, posts, other sources)
  • Generate personalized newsletters based on individual interests (open questions and research they care about)
  • Send proactive notifications in VSCode and by email

Discord Bot for Project Proposals

  • Suggest relevant papers/posts/repos based on project proposals
  • Integrate with Apart Research Hackathons

I've created a private discord server to discuss this work. If you'd like to contribute to this project (or might want to in the future if you see a feature you'd like to contribute to) or if you are an alignment/governance researcher who would like to be a beta user so we can iterate faster, please DM me for a link!
