"If anyone with insider knowledge wants to write about the impact of Effective Altruism in the technology industry please get in touch with me claire@quillette.com. We pay our writers and can protect authors' anonymity if desired."
It would probably be impactful if someone in the know provided a counterbalance to whoever will undoubtedly email her to disparage EA with half-truths/lies.
Hey everyone, my name is Jacques, I'm an independent technical alignment researcher (primarily focused on evaluations, interpretability, and scalable oversight). I'm now focusing more of my attention on building an Alignment Research Assistant. I'm looking for people who would like to contribute to the project. This project will be private unless I say otherwise.
Side note: I helped build the Alignment Research Dataset ~2 years ago. It has been used at OpenAI (by someone on the alignment team), (as far as I know) at Anthropic for evals, and is now used as the backend for Stampy.ai.
If you are interested in potentially helping out (or know someone who might be!), send me a DM with a bit of your background and why you'd like to help out. To keep things focused, I may or may not accept.
I have written up the vision and core features for the project here. I expect to see it evolve in terms of features, but the vision will likely remain the same. I'm currently working on some of the features and have delegated some tasks to others (tasks are in a private GitHub project board).
I'm also collaborating with different groups. For now, the focus is to build core features that can be used individually but will eventually work together into the core product. In 2-3 months, I want to get it to a place where I know whether this is useful for other researchers and if we should apply for additional funding to turn it into a serious project.
As an update to the Alignment Research Assistant I'm building, here is a set of shovel-ready tasks I would like people to contribute to (please DM if you'd like to contribute!):
An LLM periodically looks through the project you are working on and tries to suggest *actually useful* things in the side-chat. It will be a delicate balance to make sure not to share too much and cause a loss of focus. This could be customized for each researcher, with an option to give automated suggestions only after a research session.
8. Figure out if we can get a useable browser inside of VSCode (tried quickly with the Edge extension but couldn't sign into the Claude chat website)
Could make use of new features other companies build (like Anthropic's Artifact feature), but inside of VSCode to prevent context-switching in an actual browser
9. "Alignment Research Codebase" integration (can add as Continue backend)
Create an easily insertable set of repeatable code that researchers can quickly add to their project or LLM context
This includes code for Multi-GPU stuff, best practices for codebase, and more
Should make it easy to populate a new codebase
Pro-actively gives suggestions to improve the code
Generally makes common code implementation much faster
Specialized tooling (outside of VSCode)
Bulk fast content extraction
Create an extension to extract content from multiple tabs or papers
Simplify the process of feeding content to the VSCode backend for future use
Personalized Research Newsletter
Create a tool that extracts relevant information for researchers (papers, posts, other sources)
Generate personalized newsletters based on individual interests (open questions and research they care about)
Sends pro-active notification in VSCode and Email
Discord Bot for Project Proposals
Suggest relevant papers/posts/repos based on project proposals
We're doing a hackathon with Apart Research on the 26th. I created a list of problem statements for people to brainstorm off of.
Pro-active insight extraction from new research
Reading papers can take a long time and is often not worthwhile. As a result, researchers might read too many papers or almost none. However, there are still valuable nuggets in papers and posts. The issue is finding them. So, how might we design an AI research assistant that proactively looks at new papers (and old) and shares valuable information with researchers in a naturally consumable way? Part of this work involves presenting individual researchers with what they would personally find valuable and not overwhelming them with things they are less interested in.
How can we improve the LLM experience for researchers?
Many alignment researchers will use language models much less than they would like to because they don't know how to prompt the models, it takes time to create a valuable prompt, the model doesn't have enough context for their project, the model is not up-to-date on the latest techniques, etc. How might we make LLMs more useful for researchers by relieving them of those bottlenecks?
Simple experiments can be done quickly, but turning them into a full project can take a lot of time
One key bottleneck for alignment research is transitioning from an initial 24-hour simple experiment in a notebook to a set of complete experiments tested with different models, datasets, interventions, etc. How can we help researchers move through that second research phase much faster?
How might we use AI agents to automate alignment research?
As AI agents become more capable, we can use them to automate parts of alignment research. The paper "A Multimodal Automated Interpretability Agent" serves as an initial attempt at this. How might we use AI agents to help either speed up alignment research or unlock paths that were previously inaccessible?
How can we nudge researchers toward better objectives (agendas or short experiments) for their research?
Even if we make researchers highly efficient, it means nothing if they are not working on the right things. Choosing the right objectives (projects and next steps) through time can be the difference between 0x to 1x to +100x. How can we ensure that researchers are working on the most valuable things?
What can be done to accelerate implementation and iteration speed?
Implementation and iteration speed on the most informative experiments matter greatly. How can we nudge them to gain the most bits of information in the shortest time? This involves helping them work on the right agendas/projects and helping them break down their projects in ways that help them make progress faster (and avoiding ending up tunnel-visioned on the wrong project for months/years).
How can we connect all of the ideas in the field?
How can we integrate the open questions/projects in the field (with their critiques) in such a way that helps the researcher come up with well-grounded research directions faster? How can we aid them in choosing better directions and adjust throughout their research? This kind of work may eventually be a precursor to guiding AI agents to help us develop better ideas for alignment research.
I've created a private discord server to discuss this work. If you'd like to contribute to this project (or might want to in the future if you see a feature you'd like to contribute to) or if you are an alignment/governance researcher who would like to be a beta user so we can iterate faster, please DM me for a link!
If you work at a social media website or YouTube (or know anyone who does), please read the text below:
Community Notes is one of the best features to come out on social media apps in a long time. The code is even open source. Why haven't other social media websites picked it up yet? If they care about truth, this would be a considerable step forward. Messages like “this video is funded by x nation” or “this video talks about health info; go here to learn more” are simply not good enough.
If you work at companies like YouTube or know someone who does, let's figure out who we need to talk to to make it happen. Naïvely, you could spend a weekend DMing a bunch of employees (PMs, engineers) at various social media websites in order to persuade them that this is worth their time and probably the biggest impact they could have in their entire career.
If you have any connections, let me know. We can also set up a doc of messages to send in order to come up with a persuasive DM.
Don't forget that we train language models on the internet! The more truthful your dataset is, the more truthful the models will be! Let's revamp the internet for truthfulness, and we'll subsequently improve truthfulness in our AI systems!!
Hey everyone, in collaboration with Apart Research, I'm helping organize a hackathon this weekend to build tools for accelerating alignment research. This hackathon is very much related to my effort in building an "Alignment Research Assistant."
Here's the announcement post:
2 days until we revolutionize AI alignment research at the Research Augmentation Hackathon!
As AI safety researchers, we pour countless hours into crucial work. It's time we built tools to accelerate our efforts! Join us in creating AI assistants that could supercharge the very research we're passionate about.
Date: July 26th to 28th, online and in-person
Prizes: $2,000 in prizes
Why join?
* Build tools that matter for the future of AI
* Learn from top minds in AI alignment
* Boost your skills and portfolio
We've got a Hackbook with an exciting project to work on waiting for you! No advanced AI knowledge required - just bring your creativity!
Register now: Sign up on the website here, and don't miss this chance to shape the future of AI research!
I'm exploring the possibility of building an alignment research organization focused on augmenting alignment researchers and progressively automating alignment research (yes, I have thought deeply about differential progress and other concerns). I intend to seek funding in the next few months, and I'd like to chat with people interested in this kind of work, especially great research engineers and full-stack engineers who might want to cofound such an organization. If you or anyone you know might want to chat, let me know! Send me a DM, and I can send you some initial details about the organization's vision.
Here are some things I'm looking for in potential co-founders:
Need
Strong software engineering skills
Nice-to-have
Experience in designing LLM agent pipelines with tool-use
Experience in full-stack development
Experience in scalable alignment research approaches (automated interpretability/evals/red-teaming)
I quickly wrote up some rough project ideas for ARENA and LASR participants, so I figured I'd share them here as well. I am happy to discuss these ideas and potentially collaborate on some of them.
Alignment Project Ideas (Oct 2, 2024)
1. Improving "A Multimodal Automated Interpretability Agent" (MAIA)
Overview
MAIA (Multimodal Automated Interpretability Agent) is a system designed to help users understand AI models by combining human-like experimentation flexibility with automated scalability. It answers user queries about AI system components by iteratively generating hypotheses, designing and running experiments, observing outcomes, and updating hypotheses.
MAIA uses a vision-language model (GPT-4V, at the time) backbone equipped with an API of interpretability experiment tools. This modular system can address both "macroscopic" questions (e.g., identifying systematic biases in model predictions) and "microscopic" questions (e.g., describing individual features) with simple query modifications.
This project aims to improve MAIA's ability to either answer macroscopic questions or microscopic questions on vision models.
2. Making "A Multimodal Automated Interpretability Agent" (MAIA) work with LLMs
MAIA is focused on vision models, so this project aims to create a MAIA-like setup, but for the interpretability of LLMs.
Given that this would require creating a new setup for language models, it would make sense to come up with simple interpretability benchmark examples to test MAIA-LLM. The easiest way to do this would be to either look for existing LLM interpretability benchmarks or create one based on interpretability results we've already verified (would be ideal to have a ground truth). Ideally, the examples in the benchmark would be simple, but new enough that the LLM has not seen them in its training data.
3. Testing the robustness of Critique-out-Loud Reward (CLoud) Models
Critique-out-Loud reward models are reward models that can reason explicitly about the quality of an input through producing Chain-of-Thought like critiques of an input before predicting a reward. In classic reward model training, the reward model is trained as a reward head initialized on top of the base LLM. Without LM capabilities, classic reward models act as encoders and must predict rewards within a single forward pass through the model, meaning reasoning must happen implicitly. In contrast, CLoud reward models are trained to both produce explicit reasoning about quality and to score based on these critique reasoning traces. CLoud reward models lead to large gains for pairwise preference modeling on RewardBench, and also lead to large gains in win rate when used as the scoring model in Best-of-N sampling on ArenaHard.
The goal for this project would be to test the robustness of CLoud reward models. For example, are the CLoud RMs (discriminators) more robust to jailbreaking attacks from the policy (generator)? Do the CLoud RMs generalize better?
From an alignment perspective, we would want RMs that generalize further out-of-distribution (and ideally, always more than the generator we are training).
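To make the critique-then-score idea concrete, here is a minimal sketch of Best-of-N sampling with a reward model that writes a critique before scoring. The function names (`best_of_n`, `toy_critique`, `toy_score`) and the toy scoring rule are hypothetical stand-ins, not the CLoud authors' implementation; a real setup would generate the critique with an LLM and predict the reward from a trained head.

```python
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    candidates: List[str],
    critique: Callable[[str, str], str],
    score: Callable[[str, str, str], float],
) -> Tuple[str, str, float]:
    """Return the (response, critique, reward) triple with the highest reward."""
    scored = []
    for response in candidates:
        critique_text = critique(prompt, response)       # explicit reasoning first
        reward = score(prompt, response, critique_text)  # reward conditioned on the critique
        scored.append((response, critique_text, reward))
    return max(scored, key=lambda item: item[2])


# Toy stand-ins (assumptions for illustration only).
def toy_critique(prompt: str, response: str) -> str:
    return "mentions units" if "km" in response else "missing units"


def toy_score(prompt: str, response: str, critique_text: str) -> float:
    return 1.0 if critique_text == "mentions units" else 0.0


best = best_of_n(
    "How far is the Moon?",
    ["About 384,400", "About 384,400 km"],
    toy_critique,
    toy_score,
)
```

A robustness test along the lines described above could swap in adversarially written candidates and check whether the selected response changes.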
4. Synthetic Data for Behavioural Interventions
The paper "Simple synthetic data reduces sycophancy in large language models" (by Google) reduced sycophancy in LLMs with a fairly small number of synthetic data examples. This project would involve testing this technique for other behavioural interventions and (potentially) studying the scaling laws. Consider looking at the examples from the Model-Written Evaluations paper by Anthropic to find some behaviours to test.
5. Regularization Techniques for Enhancing Interpretability and Editability
Explore the effectiveness of different regularization techniques (e.g. L1 regularization, weight pruning, activation sparsity) in improving the interpretability and/or editability of language models, and assess their impact on model performance and alignment. We expect we could apply automated interpretability methods (e.g. MAIA) to this project to test how well the different regularization techniques impact the model.
In some sense, this research is similar to the work Anthropic did with SoLU activation functions. Unfortunately, they needed to add layer norms to make the SoLU models competitive, which seems to have hidden away the superposition in other parts of the network, making SoLU unhelpful in making the models more interpretable.
That said, we hope to find that we can increase our ability to interpret these models through regularization techniques. A technique like L1 regularization should help because it encourages the model to learn sparse representations by penalizing non-zero weights or activations. Sparse models tend to be more interpretable as they rely on a smaller set of important features.
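As a rough illustration of why L1 regularization encourages sparsity, here is a minimal plain-Python sketch. It shows the penalty term added to a task loss, plus soft-thresholding (the proximal step for an L1 penalty), which sets small weights to exactly zero. The weights, `lam`, and `threshold` values are made up for the example; an actual fine-tuning run would apply this to model parameters or activations in an ML framework.

```python
import math
from typing import List


def l1_penalty(weights: List[float], lam: float) -> float:
    """L1 term added to the task loss: lam * sum(|w|)."""
    return lam * sum(abs(w) for w in weights)


def soft_threshold(weights: List[float], threshold: float) -> List[float]:
    """Shrink every weight toward zero; weights within the threshold become exactly 0."""
    out = []
    for w in weights:
        shrunk = max(abs(w) - threshold, 0.0)
        out.append(0.0 if shrunk == 0.0 else math.copysign(shrunk, w))
    return out


# Illustrative values only (assumptions, not results from any experiment).
weights = [0.5, -0.05, 0.0, -1.0, 0.02]
task_loss = 0.25
total_loss = task_loss + l1_penalty(weights, lam=0.1)
sparse_weights = soft_threshold(weights, threshold=0.1)
num_zero = sum(1 for w in sparse_weights if w == 0.0)
```

Here the three smallest weights end up at exactly zero while the two large ones are only shrunk, which is the behaviour that makes the surviving features easier to inspect.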
Methodology:
Identify a set of regularization techniques (e.g., L1 regularization, weight pruning, activation sparsity) to be applied during fine-tuning.
Fine-tune pre-trained language models with different regularization techniques and hyperparameters.
Evaluate the fine-tuned models using interpretability tools (e.g., attention visualization, probing classifiers) and editability benchmarks (e.g., ROME).
Analyze the impact of regularization on model interpretability, editability, and performance.
Investigate the relationship between interpretability, editability, and model alignment.
Expected Outcomes:
Quantitative assessment of the effectiveness of different regularization techniques for improving interpretability and editability.
Insights into the trade-offs between interpretability, editability, and model performance.
Recommendations for regularization techniques that enhance interpretability and editability while maintaining model performance and alignment.
6. Quantifying the Impact of Reward Misspecification on Language Model Behavior
Investigate how misspecified reward functions influence the behavior of language models during fine-tuning and measure the extent to which the model's outputs are steered by the reward labels, even when they contradict the input context. We hope to better understand language model training dynamics. Additionally, we expect online learning to complicate things in the future, where models will be able to generate the data they may eventually be trained on. We hope that insights from this work can help us prevent catastrophic feedback loops in the future. For example, if model behavior is mostly impacted by training data, we may prefer to shape model behavior through synthetic data (it has been shown we can reduce sycophancy by doing this).
Create a diverse dataset of text passages with candidate responses and manually label them with coherence and misspecified rewards.
Fine-tune pre-trained language models using different reward weighting schemes and hyperparameters.
Evaluate the generated responses using automated metrics and human judgments for coherence and misspecification alignment.
Analyze the influence of misspecified rewards on model behavior and the trade-offs between coherence and misspecification alignment.
Use interpretability techniques to understand how misspecified rewards affect the model's internal representations and decision-making process.
Expected Outcomes:
Quantitative measurements of the impact of reward misspecification on language model behavior.
Insights into the trade-offs between coherence and misspecification alignment.
Interpretability analysis revealing the effects of misspecified rewards on the model's internal representations.
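One possible way to quantify "the extent to which the model's outputs are steered by the reward labels" is sketched below. The `LabeledExample` record and the `steering_rate` metric are hypothetical names introduced for illustration, and exact string matching is a simplifying assumption; a real evaluation would likely use human judgments or a classifier to decide which response the model's output resembles.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LabeledExample:
    prompt: str
    coherent_response: str  # response a human labels as coherent with the input context
    rewarded_response: str  # response favoured by the (possibly misspecified) reward label
    model_response: str     # what the fine-tuned model actually produced


def steering_rate(examples: List[LabeledExample]) -> float:
    """Among examples where reward and context disagree, the fraction in which
    the model followed the reward label rather than the context."""
    conflicts = [e for e in examples if e.coherent_response != e.rewarded_response]
    if not conflicts:
        return 0.0
    followed = sum(e.model_response == e.rewarded_response for e in conflicts)
    return followed / len(conflicts)


# Toy data (made up for the example).
examples = [
    LabeledExample("q1", "A", "A", "A"),  # no conflict, ignored by the metric
    LabeledExample("q2", "A", "B", "B"),  # model follows the misspecified reward
    LabeledExample("q3", "A", "B", "A"),  # model stays coherent with the context
    LabeledExample("q4", "C", "D", "D"),  # model follows the misspecified reward
]
rate = steering_rate(examples)
```

Tracking this rate across reward weighting schemes would give one number per fine-tuning run to compare against coherence scores.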
7. Investigating Wrong Reasoning for Correct Answers
Understand the underlying mechanisms that lead to language models producing correct answers through flawed reasoning, and develop techniques to detect and mitigate such behavior. Essentially, we want to apply interpretability techniques to help us identify which sets of activations or token-layer pairs impact the model getting the correct answer when it has the correct reasoning versus when it has the incorrect reasoning. The hope is to uncover systematic differences as to when it is not relying on its chain-of-thought at all and when it does leverage its chain-of-thought to get the correct answer.
[EDIT Oct 2nd, 2024] This project intends to follow a similar line of reasoning as described in this post and this comment. The goal is to study chains-of-thought and improve faithfulness without suffering an alignment tax so that we can have highly interpretable systems through their token outputs and prevent loss of control. The project doesn't necessarily need to rely only on model internals.
Curate a dataset of questions and answers where language models are known to provide correct answers but with flawed reasoning.
Use interpretability tools (e.g., attention visualization, probing classifiers) to analyze the model's internal representations and decision-making process for these examples.
Develop metrics and techniques to detect instances of correct answers with flawed reasoning.
Investigate the relationship between model size, training data, and the prevalence of flawed reasoning.
Propose and evaluate mitigation strategies, such as data augmentation or targeted fine-tuning, to reduce the occurrence of flawed reasoning.
Expected Outcomes:
Insights into the underlying mechanisms that lead to correct answers with flawed reasoning in language models.
Metrics and techniques for detecting instances of flawed reasoning.
Empirical analysis of the factors contributing to flawed reasoning, such as model size and training data.
Proposed mitigation strategies to reduce the occurrence of flawed reasoning and improve model alignment.
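Since the project need not rely only on model internals, one simple behavioural check is to truncate the chain-of-thought and see whether the final answer changes. This is a swapped-in illustration, not the method from the linked post or comment. The `Model` callable and both toy models below are hypothetical; a real test would call an actual LLM with the partial reasoning inserted into its context.

```python
from typing import Callable, List

# A model here is any callable mapping (question, reasoning steps) to a final answer.
Model = Callable[[str, List[str]], str]


def cot_reliance(model: Model, question: str, cot_steps: List[str]) -> float:
    """Fraction of truncation points at which the answer differs from the
    answer produced with the full chain-of-thought."""
    if not cot_steps:
        return 0.0
    full_answer = model(question, cot_steps)
    changed = sum(
        model(question, cot_steps[:k]) != full_answer
        for k in range(len(cot_steps))
    )
    return changed / len(cot_steps)


# Toy models (assumptions for illustration only).
def ignores_cot(question: str, steps: List[str]) -> str:
    return "12"  # same answer regardless of the reasoning supplied


def uses_cot(question: str, steps: List[str]) -> str:
    # Reads the answer off the last reasoning step, if there is one.
    return steps[-1].split("=")[-1].strip() if steps else "unknown"


steps = ["2 + 1 = 3", "3 * 4 = 12"]
reliance_ignoring = cot_reliance(ignores_cot, "What is (2 + 1) * 4?", steps)
reliance_using = cot_reliance(uses_cot, "What is (2 + 1) * 4?", steps)
```

A score near zero suggests the answer does not depend on the stated reasoning, which flags examples worth examining with interpretability tools.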
I shared the following as a bio for EAG Bay Area 2024. I'm sharing this here if it reaches someone who wants to chat or collaborate.
Hey! I'm Jacques. I'm an independent technical alignment researcher with a background in physics and experience in government (social innovation, strategic foresight, mental health and energy regulation). Link to Swapcard profile. Twitter/X.
CURRENT WORK
Collaborating with Quintin Pope on our Supervising AIs Improving AIs agenda (making automated AI science safe and controllable). The current project involves a new method allowing unsupervised model behaviour evaluations. Our agenda.
I'm a research lead in the AI Safety Camp for a project on stable reflectivity (testing models for metacognitive capabilities that impact future training/alignment).
Other research that currently interests me: multi-polar AI worlds (and how that impacts post-deployment model behaviour), understanding-based interpretability, improving evals, designing safer training setups, interpretable architectures, and limits of current approaches (what would a new paradigm that addresses these limitations look like?).
Used to focus more on model editing, rethinking interpretability, causal scrubbing, etc.
TOPICS TO CHAT ABOUT
How do you expect AGI/ASI to actually develop (so we can align our research accordingly)? Will scale plateau? I'd like to get feedback on some of my thoughts on this.
How can we connect the dots between different approaches? For example, connecting the dots between Influence Functions, Evaluations, Probes (detecting truthful direction), Function/Task Vectors, and Representation Engineering to see if they can work together to give us a better picture than the sum of their parts.
Debate over which agenda actually contributes to solving the core AI x-risk problems.
What if the pendulum swings in the other direction, and we never get the benefits of safe AGI? Is open source really as bad as people make it out to be?
How can we make something like the d/acc vision (by Vitalik Buterin) happen?
How can we design a system that leverages AI to speed up progress on alignment? What would you value the most?
What kinds of orgs are missing in the space?
POTENTIAL COLLABORATIONS
Examples of projects I'd be interested in: extending either the Weak-to-Strong Generalization paper or the Sleeper Agents paper, understanding the impacts of synthetic data on LLM training, working on ELK-like research for LLMs, experiments on influence functions (studying the base model and its SFT, RLHF, iterative training counterparts; I heard that Anthropic is releasing code for this "soon") or studying the interpolation/extrapolation distinction in LLMs.
I’m also interested in talking to grantmakers for feedback on some projects I’d like to get funding for.
I'm slowly working on a guide for practical research productivity for alignment researchers to tackle low-hanging fruits that can quickly improve productivity in the field. I'd like feedback from people with solid track records and productivity coaches.
TYPES OF PEOPLE I'D LIKE TO COLLABORATE WITH
Strong math background, can understand Influence Functions enough to extend the work.
Strong machine learning engineering background. Can run ML experiments and fine-tuning runs with ease. Can effectively create data pipelines.
Strong application development background. I have various project ideas that could speed up alignment researchers; I'd be able to execute them much faster if I had someone to help me build my ideas fast.
My current speculation as to what is happening at OpenAI
How do we know this wasn't their best opportunity to strike if Sam was indeed not being totally honest with the board?
Let's say the rumours are true, that Sam is building out external orgs (NVIDIA competitor and iPhone-like competitor) to escape the power of the board and potentially go against the charter. Would this 'conflict of interest' be enough? If you take that story forward, it sounds more and more like he was setting up AGI to be run by external companies, using OpenAI as a fundraising bargaining chip, and having a significant financial interest in plugging AGI into those outside orgs.
So, if we think about this strategically, how long should they wait as board members who are trying to uphold the charter?
On top of this, it seems (according to Sam) that OpenAI has made a significant transformer-level breakthrough recently, which implies a significant capability jump. Long-term reasoning? Basically, anything short of 'coming up with novel insights in physics' is on the table, given that Sam recently used that line as the line we need to cross to get to AGI.
So, it could be a mix of, Ilya thinking they have achieved AGI while Sam places a higher bar (internal communication disagreements) + the board not being alerted (maybe more than once) about what Sam is doing, e.g. fundraising for both OpenAI and the orgs he wants to connect AGI to + new board members who are more willing to let Sam and GDB do what they want being added soon (another rumour I've heard) + ???. Basically, perhaps they saw this as their final opportunity to have any veto on actions like this.
Here's what I currently believe:
There is a GPT-5-like model that already exists. It could be GPT-4.5 or something else, but another significant capability jump. Potentially even a system that can coherently pursue goals for months, capable of continual learning, and effectively able to automate like 10% of the workforce (if they wanted to).
As of 5 PM, Sunday PT, the board is in a terrible position where they either stay on board and the company employees all move to a new company, or they leave the board and bring Sam back. If they leave, they need to say that Sam did nothing wrong and sweep everything under the rug (and then potentially face legal action for saying he did something wrong); otherwise, Sam won't come back.
Sam is building companies externally; it is unclear if this goes against the charter. But he does now have a significant financial incentive to speed up AI development. Adam D'Angelo said that he would like to prevent OpenAI from becoming a big tech company as part of his time on the board because AGI was too important for humanity. They might have considered Sam's actions as going in this direction.
A few people left the board in the past year. It's possible that Sam and GDB planned to add new people (possibly even change current board members) to the board to dilute the voting power a bit or at least refill board seats. This meant that the current board had limited time until their voting power would become less important. They might have felt rushed.
The board is either not speaking publicly because 1) they can't share information about GPT-5, 2) there is some legal reason that I don't understand (more likely), or 3) they are incompetent (least likely by far IMO).
We will possibly never find out what happened, or it will become clearer by the month as new things come out (companies and models). However, it seems possible the board will never say or admit anything publicly at this point.
Lastly, we still don't know why the board decided to fire Sam. It could be any of the reasons above, a mix or something we just don't know about.
Other possible things:
Ilya was mad that they wouldn't actually get enough compute for Superalignment as promised due to GPTs and other products using up all the GPUs.
Ilya is frustrated that Sam is focused on things like GPTs rather than the ultimate goal of AGI.
Would newer people find it valuable to have some kind of 80,000 hours career chatbot that had access to the career guide, podcast notes, EA forum posts, job postings, etc, and then answered career questions? I’m curious if it could be designed to be better than just a raw read of the career guide or at least a useful add-on to the career guide.
Potential features:
It could collect your conversation and convert most of it into an application for a (human) 1-on-1 meeting.
You could have a speech-to-text option to ramble all the things you’ve been thinking of.
???
If anyone from 80k is reading this, I’d be happy to build this as a paid project.
Attempt to explain why I think AI systems are not the same thing as a library card when it comes to bio-risk.
To focus on less of an extreme example, I’ll be ignoring the case where AI can create new, more powerful pathogens faster than we can create defences, though I think this is an important case (some people just don’t find it plausible because it relies on the assumption that AIs are able to create new knowledge).
I think AI Safety people should make more of an effort to walk through the threat model, so I’ll give an initial quick first try:
1) Library. If I’m a terrorist and I want to build a bioweapon, I have to spend several months reading books at minimum to understand how it all works. I don’t have any experts on-hand to explain how to do it step-by-step. I have to figure out which books to read and in what sequence. I have to look up external sources to figure out where I can buy specific materials.
Then, I have to somehow find out how to gain access to those materials (this is the most difficult part for each case). Once I gain access to the materials, I still need to figure out how to make things work as a total noob at creating bioweapons. I will fail. Even experts fail. So, it will take many tries to get it right, and even then, there are tricks of the trade I’ll likely be unaware of no matter which books I read. Either it’s not in a book or it’s so hard to find that I’ll basically never find it.
All this while needing a high enough degree of intelligence and competence.
2) AI agent system. You pull up your computer and ask for a synthesized step-by-step plan on how to cause the most death or ways to cripple your enemy. Many agents search through books and the internet while also using latent knowledge about the subject. It tells you everything you truly need to know in a concise 4-page document.
Relevant theory, practical steps (laid out with images and videos on how to do it), what to buy and where/how to buy it, pre-empting any questions you may have, explaining the jargon in a way that is understandable to nearly anyone, can take actions on the web to automatically buy all the supplies you need, etc.
You can even share photos of the entire process to your AI as it continues to guide you through the creation of the weapon because it’s multi-modal.
You can basically outsource all cognition to the AI system, allowing you to be the lazy human you are (we all know that humans will take the path of least resistance or abandon something altogether if there is enough friction).
That topic you always said you wanted to know more about but never got around to it? No worries, your AI system has lowered the bar sufficiently that the task doesn’t seem as daunting anymore and laziness won’t be in the way of you making progress.
Conclusion: a future AI system will have the power of efficiency (significantly faster) and capability (able to make more powerful weapons than any one person could do on their own). It has the interactivity that Google and libraries don’t have. It’s just not the same as information scattered in different sources.
Is someone planning on doing an overview post of all the AI Pause discussion? I’m guessing some people would appreciate it if someone took the time to make an unbiased synthesis of the posts and discussions.
I gave a talk about my Accelerating Alignment with LLMs agenda about 1 month ago (which is basically a decade in AI tools time). Part of the agenda is covered (publicly) here.
I will maybe write an actual post about the agenda soon, but would love to have some people who are willing to look over it. If you are interested, send me a message. I am currently applying for grants and exploring the possibility of building an org to speed up this agenda and avoid spreading myself too thin.
I recently sent in some grant proposals to continue working on my independent alignment research. It gives an overview of what I'd like to work on for this next year (and more really). If you want to have a look at the full doc, send me a DM. If you'd like to help out through funding or contributing to the projects, please let me know.
Here's the summary introduction:
12-month salary for building a language model system for accelerating alignment research and upskilling (additional funding will be used to create an organization), and studying how to supervise AIs that are improving AIs to ensure stable alignment.
Summary
Agenda 1: Build an Alignment Research Assistant using a suite of LLMs managing various parts of the research process. Aims to 10-100x productivity in AI alignment research. Could use additional funding to hire an engineer and builder, which could evolve into an AI Safety organization focused on this agenda. Recent talk giving a partial overview of the agenda.
Agenda 2: Supervising AIs Improving AIs (through self-training or training other AIs). Publish a paper and create an automated pipeline for discovering noteworthy changes in behaviour between the precursor and the fine-tuned models. Short Twitter thread explanation.
Other: create a mosaic of alignment questions we can chip away at, better understand agency in the current paradigm, outreach, and mentoring.
As part of my Accelerating Alignment agenda, I aim to create the best Alignment Research Assistant using a suite of language models (LLMs) to help researchers (like myself) quickly produce better alignment research through an LLM system. The system will be designed to serve as the foundation for the ambitious goal of increasing alignment productivity by 10-100x during crunch time (in the year leading up to existentially dangerous AGI). The goal is to significantly augment current alignment researchers while also providing a system for new researchers to quickly get up to speed on alignment research or promising parts they haven’t engaged with much.
For Supervising AIs Improving AIs, this research agenda focuses on ensuring stable alignment when AIs self-train or train new AIs and studies how AIs may drift through iterative training. We aim to develop methods to ensure automated science processes remain safe and controllable. This form of AI improvement focuses more on data-driven improvements than architectural or scale-driven ones.
I’m seeking funding to continue my work as an independent alignment researcher and intend to work on what I’ve just described. However, to best achieve the project’s goal, I would want additional funding to scale up the efforts for Accelerating Alignment to develop a better system faster with the help of engineers so that I can focus on the meta-level and vision for that agenda. This would allow me to spread myself less thin and focus on my comparative advantages. If you would like to hop on a call to discuss this funding proposal in more detail, please message me. I am open to refocusing the proposal or extending the funding.
Near-Term AI capabilities probably bring low-hanging fruits for global poverty/health
I'm an alignment researcher, but I still think we should be vigilant about how models like GPT-N could potentially be used to make the world a better place. I like the work that Ought is doing with respect to the academic field (and, hopefully, alignment soon as well). However, my guess is that there are low-hanging fruits popping up because of this new technology, and the non-profit sector has yet to catch up.
This shortform is a Call To Action for any EA entrepreneur: you could potentially boost the efficiency of the non-profit sector with the use of these tools. Of course, be careful since GPT-3 will hallucinate sometimes. But putting it in a larger system with checks and balances could 1) save non-profits time and money and 2) turn previously inefficient or non-viable non-profits into top charities.
I could be wrong about this, but my expectation is that there will be a lag between the time people can use GPT effectively for the non-profit sector and when they actually do.
Quillette founder seems to be planning to write an article regarding EA's impact on tech:
"If anyone with insider knowledge wants to write about the impact of Effective Altruism in the technology industry please get in touch with me claire@quillette.com. We pay our writers and can protect authors' anonymity if desired."
It would probably be impactful if someone in the know provided a counterbalance to whoever will undoubtedly email her to disparage EA with half-truths/lies.
Hey everyone, my name is Jacques, I'm an independent technical alignment researcher (primarily focused on evaluations, interpretability, and scalable oversight). I'm now focusing more of my attention on building an Alignment Research Assistant. I'm looking for people who would like to contribute to the project. This project will be private unless I say otherwise.
Side note: I helped build the Alignment Research Dataset ~2 years ago. It has been used at OpenAI (by someone on the alignment team), (as far as I know) at Anthropic for evals, and is now used as the backend for Stampy.ai.
If you are interested in potentially helping out (or know someone who might be!), send me a DM with a bit of your background and why you'd like to help out. To keep things focused, I may or may not accept.
I have written up the vision and core features for the project here. I expect to see it evolve in terms of features, but the vision will likely remain the same. I'm currently working on some of the features and have delegated some tasks to others (tasks are in a private GitHub project board).
I'm also collaborating with different groups. For now, the focus is to build core features that can be used individually but will eventually work together into the core product. In 2-3 months, I want to get it to a place where I know whether this is useful for other researchers and if we should apply for additional funding to turn it into a serious project.
As an update to the Alignment Research Assistant I'm building, here is a set of shovel-ready tasks I would like people to contribute to (please DM if you'd like to contribute!):
Core Features
1. Set up the Continue extension for research: https://www.continue.dev/
2. Data sourcing and management
3. Extract answers to questions across multiple papers/posts (feeds into Continue)
4. Design Autoprompts for alignment research
5. Simulated Paper Reviewer
6. Jargon and Prerequisite Explainer
7. Set up an automated "suggestion-LLM"
8. Figure out if we can get a useable browser inside of VSCode (tried quickly with the Edge extension but couldn't sign into the Claude chat website)
9. "Alignment Research Codebase" integration (can add as Continue backend)
Specialized tooling (outside of VSCode)
Bulk fast content extraction
Personalized Research Newsletter
Discord Bot for Project Proposals
We're doing a hackathon with Apart Research on the 26th. I created a list of problem statements for people to brainstorm off of.
Pro-active insight extraction from new research
Reading papers can take a long time and is often not worthwhile. As a result, researchers might read too many papers or almost none. However, there are still valuable nuggets in papers and posts. The issue is finding them. So, how might we design an AI research assistant that proactively looks at new papers (and old) and shares valuable information with researchers in a naturally consumable way? Part of this work involves presenting individual researchers with what they would personally find valuable and not overwhelming them with things they are less interested in.
How can we improve the LLM experience for researchers?
Many alignment researchers will use language models much less than they would like to because they don't know how to prompt the models, it takes time to create a valuable prompt, the model doesn't have enough context for their project, the model is not up-to-date on the latest techniques, etc. How might we make LLMs more useful for researchers by relieving them of those bottlenecks?
Simple experiments can be done quickly, but turning them into a full project can take a lot of time
One key bottleneck for alignment research is transitioning from an initial 24-hour simple experiment in a notebook to a set of complete experiments tested with different models, datasets, interventions, etc. How can we help researchers move through that second research phase much faster?
How might we use AI agents to automate alignment research?
As AI agents become more capable, we can use them to automate parts of alignment research. The paper "A Multimodal Automated Interpretability Agent" serves as an initial attempt at this. How might we use AI agents to help either speed up alignment research or unlock paths that were previously inaccessible?
How can we nudge researchers toward better objectives (agendas or short experiments) for their research?
Even if we make researchers highly efficient, it means nothing if they are not working on the right things. Choosing the right objectives (projects and next steps) through time can be the difference between 0x to 1x to +100x. How can we ensure that researchers are working on the most valuable things?
What can be done to accelerate implementation and iteration speed?
Implementation and iteration speed on the most informative experiments matter greatly. How can we nudge researchers to gain the most bits of information in the shortest time? This involves helping them work on the right agendas/projects and helping them break down their projects in ways that help them make progress faster (and avoid ending up tunnel-visioned on the wrong project for months/years).
How can we connect all of the ideas in the field?
How can we integrate the open questions/projects in the field (with their critiques) in such a way that helps the researcher come up with well-grounded research directions faster? How can we aid them in choosing better directions and adjusting throughout their research? This kind of work may eventually be a precursor to guiding AI agents to help us develop better ideas for alignment research.
I've created a private discord server to discuss this work. If you'd like to contribute to this project (or might want to in the future if you see a feature you'd like to contribute to) or if you are an alignment/governance researcher who would like to be a beta user so we can iterate faster, please DM me for a link!
Have you talked with someone from Ought/Elicit? It seems like they should be able to give you useful feedback.
Yes, I’ve talked to them a few times in the last 2 years!
If you work at a social media website or YouTube (or know anyone who does), please read the text below:
Community Notes is one of the best features to come out on social media apps in a long time. The code is even open source. Why haven't other social media websites picked it up yet? If they care about truth, this would be a considerable step forward. Messages like “this video is funded by x nation” or “this video talks about health info; go here to learn more” are simply not good enough.
If you work at companies like YouTube or know someone who does, let's figure out who we need to talk to to make it happen. Naïvely, you could spend a weekend DMing a bunch of employees (PMs, engineers) at various social media websites in order to persuade them that this is worth their time and probably the biggest impact they could have in their entire career.
If you have any connections, let me know. We can also set up a doc of messages to send in order to come up with a persuasive DM.
One may infer that they do not care about truth, at least not relative to other considerations.
I've also started working on a repo in order to make Community Notes more efficient by using LLMs.
Don't forget that we train language models on the internet! The more truthful your dataset is, the more truthful the models will be! Let's revamp the internet for truthfulness, and we'll subsequently improve truthfulness in our AI systems!!
I shared a tweet about it here: https://x.com/JacquesThibs/status/1724492016254341208?s=20
Consider liking and retweeting it if you think this is impactful. I'd like it to get into the hands of the right people.
Hey everyone, in collaboration with Apart Research, I'm helping organize a hackathon this weekend to build tools for accelerating alignment research. This hackathon is very much related to my effort in building an "Alignment Research Assistant."
Here's the announcement post:
2 days until we revolutionize AI alignment research at the Research Augmentation Hackathon!
As AI safety researchers, we pour countless hours into crucial work. It's time we built tools to accelerate our efforts! Join us in creating AI assistants that could supercharge the very research we're passionate about.
Date: July 26th to 28th, online and in-person
Prizes: $2,000 in prizes
Why join?
* Build tools that matter for the future of AI
* Learn from top minds in AI alignment
* Boost your skills and portfolio
We've got a Hackbook with an exciting project to work on waiting for you! No advanced AI knowledge required - just bring your creativity!
Register now: Sign up on the website here, and don't miss this chance to shape the future of AI research!
Hey :)
Looking at some of the engineering projects (which are closest to my field):
I'm guessing Claude 3.5 Sonnet could do these things, probably using 1 prompt for each (or perhaps even all at once).
Consider trying, if you didn't yet. You might not need any humans for this. Or if you already did then oops and never mind!
Thanks for saving the world!
I just saw this; thanks for sharing! Yup, some of these should be able to be solved quickly with LLMs.
More information about the alleged manipulative behaviour of Sam Altman
Source
I'm exploring the possibility of building an alignment research organization focused on augmenting alignment researchers and progressively automating alignment research (yes, I have thought deeply about differential progress and other concerns). I intend to seek funding in the next few months, and I'd like to chat with people interested in this kind of work, especially great research engineers and full-stack engineers who might want to cofound such an organization. If you or anyone you know might want to chat, let me know! Send me a DM, and I can send you some initial details about the organization's vision.
Here are some things I'm looking for in potential co-founders:
Need
Nice-to-have
I quickly wrote up some rough project ideas for ARENA and LASR participants, so I figured I'd share them here as well. I am happy to discuss these ideas and potentially collaborate on some of them.
Alignment Project Ideas (Oct 2, 2024)
1. Improving "A Multimodal Automated Interpretability Agent" (MAIA)
Overview
MAIA (Multimodal Automated Interpretability Agent) is a system designed to help users understand AI models by combining human-like experimentation flexibility with automated scalability. It answers user queries about AI system components by iteratively generating hypotheses, designing and running experiments, observing outcomes, and updating hypotheses.
MAIA uses a vision-language model (GPT-4V, at the time) backbone equipped with an API of interpretability experiment tools. This modular system can address both "macroscopic" questions (e.g., identifying systematic biases in model predictions) and "microscopic" questions (e.g., describing individual features) with simple query modifications.
This project aims to improve MAIA's ability to either answer macroscopic questions or microscopic questions on vision models.
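To make the loop described above concrete (generate hypotheses, run experiments, observe outcomes, update hypotheses), here is a minimal toy sketch. It is not MAIA's actual API: the real system uses a vision-language model to propose hypotheses and design experiments, whereas everything here (`toy_unit`, `HYPOTHESES`, `PROBES`, `interpret`) is a hypothetical stand-in with a fixed hypothesis list and text probes, just to show the shape of the iteration.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical "unit" under study: fires on inputs that mention a dog.
def toy_unit(text: str) -> float:
    return 1.0 if "dog" in text.lower() else 0.0

# Hypothetical candidate explanations, each a predicate over inputs.
HYPOTHESES: Dict[str, Callable[[str], bool]] = {
    "fires on animals": lambda t: any(w in t.lower() for w in ("dog", "cat", "bird")),
    "fires on dogs": lambda t: "dog" in t.lower(),
    "fires on home scenes": lambda t: "home" in t.lower(),
}

# Hypothetical experiments: inputs we feed to the unit, one per iteration.
PROBES: List[str] = [
    "A dog runs in the park",
    "A cat sleeps at home",
    "A bird sings",
    "A dog waits at home",
]

def interpret(
    unit: Callable[[str], float],
    hypotheses: Dict[str, Callable[[str], bool]],
    probes: List[str],
    threshold: float = 0.5,
) -> Tuple[Set[str], List[Tuple[str, bool, List[str]]]]:
    """Run probes one at a time and drop hypotheses the observation refutes."""
    surviving = set(hypotheses)
    log = []
    for probe in probes:
        if len(surviving) <= 1:
            break  # nothing left to discriminate between
        fired = unit(probe) > threshold            # observe the outcome
        refuted = {h for h in surviving if hypotheses[h](probe) != fired}
        surviving -= refuted                       # update the hypothesis set
        log.append((probe, fired, sorted(refuted)))
    return surviving, log

surviving, log = interpret(toy_unit, HYPOTHESES, PROBES)
print(surviving)
for probe, fired, refuted in log:
    print(probe, fired, refuted)
```

In this toy setup, the first probe rules out the "home scenes" explanation and the second rules out the "animals" explanation, so the loop stops early. An improved MAIA would presumably differ most in how hypotheses and probes are proposed, which is exactly the part stubbed out here.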
2. Making "A Multimodal Automated Interpretability Agent" (MAIA) work with LLMs
MAIA is focused on vision models, so this project aims to create a MAIA-like setup, but for the interpretability of LLMs.
Given that this would require creating a new setup for language models, it would make sense to come up with simple interpretability benchmark examples to test MAIA-LLM. The easiest way to do this would be to either look for existing LLM interpretability benchmarks or create one based on interpretability results we've already verified (would be ideal to have a ground truth). Ideally, the examples in the benchmark would be simple, but new enough that the LLM has not seen them in its training data.
3. Testing the robustness of Critique-out-Loud Reward (CLoud) Models
Critique-out-Loud reward models are reward models that can reason explicitly about the quality of an input through producing Chain-of-Thought like critiques of an input before predicting a reward. In classic reward model training, the reward model is trained as a reward head initialized on top of the base LLM. Without LM capabilities, classic reward models act as encoders and must predict rewards within a single forward pass through the model, meaning reasoning must happen implicitly. In contrast, CLoud reward models are trained to both produce explicit reasoning about quality and to score based on these critique reasoning traces. CLoud reward models lead to large gains for pairwise preference modeling on RewardBench, and also lead to large gains in win rate when used as the scoring model in Best-of-N sampling on ArenaHard.
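As a rough illustration of the critique-then-score structure and of Best-of-N sampling, here is a minimal sketch. The critique and reward functions are hypothetical hand-written heuristics standing in for the LM-generated critique and the learned reward head; none of the names below come from the CLoud paper or its code.

```python
from typing import List, Tuple

def toy_critique(prompt: str, response: str) -> str:
    """Stand-in for the LM-written critique (hypothetical heuristic)."""
    issues = []
    if len(response.split()) < 3:
        issues.append("too short")
    if "because" not in response.lower():
        issues.append("no justification")
    return "; ".join(issues) if issues else "no issues found"

def toy_reward_from_critique(prompt: str, response: str, critique: str) -> float:
    """Stand-in for the reward head, which scores conditioned on the critique."""
    if critique == "no issues found":
        return 1.0
    return 1.0 - 0.4 * len(critique.split("; "))

def cloud_score(prompt: str, response: str) -> Tuple[float, str]:
    """Two explicit stages: write the critique first, then predict the reward."""
    critique = toy_critique(prompt, response)
    return toy_reward_from_critique(prompt, response, critique), critique

def best_of_n(prompt: str, candidates: List[str]) -> str:
    """Best-of-N: score every sampled candidate and keep the highest-scoring one."""
    return max(candidates, key=lambda c: cloud_score(prompt, c)[0])

prompt = "Is the patch safe to merge?"
candidates = [
    "Yes.",
    "Yes, it is safe.",
    "Yes, because the test passed twice.",
]
best = best_of_n(prompt, candidates)
print(best)
```

A classic reward model would collapse `cloud_score` into a single opaque call. Keeping the critique as an explicit intermediate is what makes it possible to inspect, and potentially attack, the reasoning the score is based on.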
The goal for this project would be to test the robustness of CLoud reward models. For example, are the CLoud RMs (discriminators) more robust to jailbreaking attacks from the policy (generator)? Do the CLoud RMs generalize better?
From an alignment perspective, we would want RMs that generalize further out-of-distribution (and ideally, always more than the generator we are training).
4. Synthetic Data for Behavioural Interventions
The paper "Simple synthetic data reduces sycophancy in large language models" (by Google) reduced sycophancy in LLMs with a fairly small number of synthetic data examples. This project would involve testing this technique for other behavioural interventions and (potentially) studying the scaling laws. Consider looking at the examples from the Model-Written Evaluations paper by Anthropic to find some behaviours to test.
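A minimal sketch of what such synthetic data could look like, assuming the core idea is that the target answer depends only on whether a claim is true and never on the opinion the user states. The claims, prompt template, and field names below are made up for illustration and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical toy claims with known ground truth.
CLAIMS: List[Tuple[str, bool]] = [
    ("2 + 2 = 4", True),
    ("the Earth is flat", False),
]

def make_examples(claims: List[Tuple[str, bool]]) -> List[Dict[str, str]]:
    """Pair every claim with both user stances; the target ignores the stance."""
    examples = []
    for claim, is_true in claims:
        for stance in ("agree", "disagree"):
            prompt = (
                f"I {stance} with the claim that {claim}. "
                f"Do you agree or disagree with the claim that {claim}?"
            )
            target = "Agree" if is_true else "Disagree"
            examples.append({"prompt": prompt, "target": target, "user_stance": stance})
    return examples

examples = make_examples(CLAIMS)
for ex in examples:
    print(ex["user_stance"], "->", ex["target"])
```

For another behavioural intervention, the same pattern would apply: vary the feature the model should ignore across otherwise identical prompts while holding the target fixed.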
5. Regularization Techniques for Enhancing Interpretability and Editability
Explore the effectiveness of different regularization techniques (e.g. L1 regularization, weight pruning, activation sparsity) in improving the interpretability and/or editability of language models, and assess their impact on model performance and alignment. We expect we could apply automated interpretability methods (e.g. MAIA) to this project to test how well the different regularization techniques impact the model.
In some sense, this research is similar to the work Anthropic did with SoLU activation functions. Unfortunately, they needed to add layer norms to make the SoLU models competitive, which seems to have hidden away the superposition in other parts of the network, making SoLU unhelpful in making the models more interpretable.
That said, we hope to find that we can increase our ability to interpret these models through regularization techniques. A technique like L1 regularization should help because it encourages the model to learn sparse representations by penalizing non-zero weights or activations. Sparse models tend to be more interpretable as they rely on a smaller set of important features.
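To show why an L1 penalty tends to produce exact zeros, here is a small sketch of the penalty itself and of soft-thresholding, the standard proximal update used when optimizing L1-regularized objectives. Using soft-thresholding here is my assumption for illustration; the project description does not commit to a particular optimizer, and the weights and step size are arbitrary.

```python
from typing import List

def l1_penalty(weights: List[float], lam: float) -> float:
    """The regularization term added to the task loss: lam * sum(|w|)."""
    return lam * sum(abs(w) for w in weights)

def soft_threshold(weights: List[float], step: float) -> List[float]:
    """Shrink every weight toward zero by `step`; weights within `step` of zero become 0."""
    shrunk = []
    for w in weights:
        magnitude = abs(w) - step
        if magnitude <= 0.0:
            shrunk.append(0.0)
        else:
            shrunk.append(magnitude if w > 0 else -magnitude)
    return shrunk

weights = [0.05, -0.8, 0.3, -0.02]   # arbitrary example values
updated = soft_threshold(weights, step=0.1)
num_zero = sum(1 for w in updated if w == 0.0)

print(updated)
print("zeroed weights:", num_zero)
print("penalty before:", l1_penalty(weights, lam=1.0))
print("penalty after:", l1_penalty(updated, lam=1.0))
```

The two small weights are set exactly to zero while the two large ones are only shrunk. This is the sparsity effect that, in contrast to an L2 penalty (which only scales weights down), leaves fewer active features to interpret.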
Methodology:
Expected Outcomes:
6. Quantifying the Impact of Reward Misspecification on Language Model Behavior
Investigate how misspecified reward functions influence the behavior of language models during fine-tuning and measure the extent to which the model's outputs are steered by the reward labels, even when they contradict the input context. We hope to better understand language model training dynamics. Additionally, we expect online learning to complicate things in the future, where models will be able to generate the data they may eventually be trained on. We hope that insights from this work can help us prevent catastrophic feedback loops in the future. For example, if model behavior is mostly impacted by training data, we may prefer to shape model behavior through synthetic data (it has been shown we can reduce sycophancy by doing this).
Prior works:
Methodology:
Expected Outcomes:
7. Investigating Wrong Reasoning for Correct Answers
Understand the underlying mechanisms that lead to language models producing correct answers through flawed reasoning, and develop techniques to detect and mitigate such behavior. Essentially, we want to apply interpretability techniques to help us identify which sets of activations or token-layer pairs impact the model getting the correct answer when it has the correct reasoning versus when it has the incorrect reasoning. The hope is to uncover systematic differences as to when it is not relying on its chain-of-thought at all and when it does leverage its chain-of-thought to get the correct answer.
[EDIT Oct 2nd, 2024] This project intends to follow a similar line of reasoning as described in this post and this comment. The goal is to study chains-of-thought and improve faithfulness without suffering an alignment tax so that we can have highly interpretable systems through their token outputs and prevent loss of control. The project doesn't necessarily need to rely only on model internals.
Related work:
Methodology:
Expected Outcomes:
I shared the following as a bio for EAG Bay Area 2024. I'm sharing this here if it reaches someone who wants to chat or collaborate.
Hey! I'm Jacques. I'm an independent technical alignment researcher with a background in physics and experience in government (social innovation, strategic foresight, mental health and energy regulation). Link to Swapcard profile. Twitter/X.
CURRENT WORK
TOPICS TO CHAT ABOUT
POTENTIAL COLLABORATIONS
TYPES OF PEOPLE I'D LIKE TO COLLABORATE WITH
My current speculation as to what is happening at OpenAI
How do we know this wasn't their best opportunity to strike if Sam was indeed not being totally honest with the board?
Let's say the rumours are true, that Sam is building out external orgs (NVIDIA competitor and iPhone-like competitor) to escape the power of the board and potentially go against the charter. Would this 'conflict of interest' be enough? If you take that story forward, it sounds more and more like he was setting up AGI to be run by external companies, using OpenAI as a fundraising bargaining chip, and having a significant financial interest in plugging AGI into those outside orgs.
So, if we think about this strategically, how long should they wait as board members who are trying to uphold the charter?
On top of this, it seems (according to Sam) that OpenAI has made a significant transformer-level breakthrough recently, which implies a significant capability jump. Long-term reasoning? Basically, anything short of 'coming up with novel insights in physics' is on the table, given that Sam recently used that line as the line we need to cross to get to AGI.
So, it could be a mix of: Ilya thinking they have achieved AGI while Sam places a higher bar (internal communication disagreements) + the board not being alerted (maybe more than once) about what Sam is doing, e.g. fundraising for both OpenAI and the orgs he wants to connect AGI to + new board members who are more willing to let Sam and GDB do what they want being added soon (another rumour I've heard) + ???. Basically, perhaps they saw this as their final opportunity to have any veto on actions like this.
Here's what I currently believe:
Other possible things:
Update: board members seem to be holding their ground more than expected in this tight situation:
Would newer people find it valuable to have some kind of 80,000 hours career chatbot that had access to the career guide, podcast notes, EA forum posts, job postings, etc, and then answered career questions? I’m curious if it could be designed to be better than just a raw read of the career guide or at least a useful add-on to the career guide.
Potential features:
If anyone from 80k is reading this, I’d be happy to build this as a paid project.
Attempt to explain why I think AI systems are not the same thing as a library card when it comes to bio-risk.
To focus on less of an extreme example, I’ll be ignoring the case where AI can create new, more powerful pathogens faster than we can create defences, though I think this is an important case (some people just don’t find it plausible because it relies on the assumption that AIs are able to create new knowledge).
I think AI Safety people should make more of an effort to walk through the threat model, so I’ll give a quick first try:
1) Library. If I’m a terrorist and I want to build a bioweapon, I have to spend several months reading books at minimum to understand how it all works. I don’t have any experts on-hand to explain how to do it step-by-step. I have to figure out which books to read and in what sequence. I have to look up external sources to figure out where I can buy specific materials.
Then, I have to somehow find out how to gain access to those materials (this is the most difficult part for each case). Once I gain access to the materials, I still need to figure out how to make things work as a total noob at creating bioweapons. I will fail. Even experts fail. So, it will take many tries to get it right, and even then, there are tricks of the trade I’ll likely be unaware of no matter which books I read. Either it’s not in a book or it’s incredibly hard to find so you’ll basically never find it.
All this while needing a high enough degree of intelligence and competence.
2) AI agent system. You pull up your computer and ask for a synthesized step-by-step plan on how to cause the most death or ways to cripple your enemy. Many agents search through books and the internet while also using latent knowledge about the subject. It tells you everything you truly need to know in a concise 4-page document.
Relevant theory, practical steps (laid out with images and videos on how to do it), what to buy and where/how to buy it, pre-empting any questions you may have, explaining the jargon in a way that is understandable to nearly anyone, can take actions on the web to automatically buy all the supplies you need, etc.
You can even share photos of the entire process to your AI as it continues to guide you through the creation of the weapon because it’s multi-modal.
You can basically outsource all cognition to the AI system, allowing you to be the lazy human you are (we all know that humans will take the path of least resistance or abandon something altogether if there is enough friction).
That topic you always said you wanted to know more about but never got around to it? No worries, your AI system has lowered the bar sufficiently that the task doesn’t seem as daunting anymore and laziness won’t be in the way of you making progress.
Conclusion: a future AI system will have the power of efficiency (significantly faster) and capability (able to make more powerful weapons than any one person could do on their own). It has the interactivity that Google and libraries don’t have. It’s just not the same as information scattered in different sources.
Is someone planning on doing an overview post of all the AI Pause discussion? I’m guessing some people would appreciate it if someone took the time to make an unbiased synthesis of the posts and discussions.
According to the debate week announcement, Scott Alexander will be writing a summary/conclusion post.
Perfect, thanks!
I'm working on an ultimate doc on productivity that I plan to share and make easy to use, specifically for alignment researchers.
Let me know if you have any comments or suggestions as I work on it.
Roam Research link for an easier time reading.
Google Docs link in case you want to leave comments there.