J

jacquesthibs

AI Safety Researcher @ Independent Researcher
905 karmaJoined Working (6-15 years)
jacquesthibodeau.com/about/

Bio

I work primarily on AI Alignment. My main direction at the moment is to accelerate alignment work via language models and interpretability.

Comments
55

Thanks for all your work, JJ! Good luck with whatever you end up doing next!

(Note in case anybody else wants to pick up where AISS left off: This is a bit unfortunate for me given that not having an org to sponsor work visas in the UK might affect my decision for moving to there. We had talked about AISS trying to do the work to get that setup in the next 1-2 years.)

In this framework, I propose that Xnder ³amoXnt of mone\ to donate ́ Ze bXndle
considerations relating to taxes, weakness of will and uncertainty as well as financial
investment.

One thing I think is missing from the "how much you should donate" section above is a discussion about what kind of job the person is doing. Should the percentage be the same for someone doing Earning to Give vs someone working on a direct cause area?

I recently sent in some grant proposals to continue working on my independent alignment research. It gives an overview of what I'd like to work on for this next year (and more really). If you want to have a look at the full doc, send me a DM. If you'd like to help out through funding or contributing to the projects, please let me know.

Here's the summary introduction:

12-month salary for building a language model system for accelerating alignment research and upskilling (additional funding will be used to create an organization), and studying how to supervise AIs that are improving AIs to ensure stable alignment.

Summary

  • Agenda 1Build an Alignment Research Assistant using a suite of LLMs managing various parts of the research process. Aims to 10-100x productivity in AI alignment research. Could use additional funding to hire an engineer and builder, which could evolve into an AI Safety organization focused on this agenda. Recent talk giving a partial overview of the agenda.
  • Agenda 2Supervising AIs Improving AIs (through self-training or training other AIs). Publish a paper and create an automated pipeline for discovering noteworthy changes in behaviour between the precursor and the fine-tuned models. Short Twitter thread explanation.
  • Other: create a mosaic of alignment questions we can chip away at, better understand agency in the current paradigm, outreach, and mentoring.

As part of my Accelerating Alignment agenda, I aim to create the best Alignment Research Assistant using a suite of language models (LLMs) to help researchers (like myself) quickly produce better alignment research through an LLM system. The system will be designed to serve as the foundation for the ambitious goal of increasing alignment productivity by 10-100x during crunch time (in the year leading up to existentially dangerous AGI). The goal is to significantly augment current alignment researchers while also providing a system for new researchers to quickly get up to speed on alignment research or promising parts they haven’t engaged with much.

For Supervising AIs Improving AIsthis research agenda focuses on ensuring stable alignment when AIs self-train or train new AIs and studies how AIs may drift through iterative training. We aim to develop methods to ensure automated science processes remain safe and controllable. This form of AI improvement focuses more on data-driven improvements than architectural or scale-driven ones.

I’m seeking funding to continue my work as an independent alignment researcher and intend to work on what I’ve just described. However, to best achieve the project’s goal, I would want additional funding to scale up the efforts for Accelerating Alignment to develop a better system faster with the help of engineers so that I can focus on the meta-level and vision for that agenda. This would allow me to spread myself less thin and focus on my comparative advantages. If you would like to hop on a call to discuss this funding proposal in more detail, please message me. I am open to refocusing the proposal or extending the funding.

I gave talk about my Accelerating Alignment with LLMs agenda about 1 month ago (which is basically a decade in AI tools time). Part of the agenda covered (publicly) here.

I will maybe write an actual post about the agenda soon, but would love to have some people who are willing to look over it. If you are interested, send me a message. I am currently applying for grants and exploring the possibility of building an org focused on speeding up this agenda and avoid spreading myself too thin.

I agree with this post. I've been reading many more papers since first entering this field because I've been increasingly convinced of the value of treating alignment as an engineering problem and pulling insights from the literature. I've also been trying to do more thinking about how to update on the current paradigm from the classic Yud and Bostrom alignment arguments. In this respect, I applaud Quintin Pope for his work.

This week, I will send a grant proposal to continue my work in alignment. I'd be grateful if you could look at my proposal and provide some critique. It would be great to have an outside view (like yours) to give feedback on it.

Current short summary: "This project comprises two main interrelated components: accelerating AI alignment research by integrating large language models (LLMs) into a research system, and conducting direct work on alignment with a focus on interpretability and steering the training process towards aligned AI. The "accelerating alignment" agenda aims to impact both conceptual and empirical aspects of alignment research, with the ambitious long-term goal of providing a massive speed-up and unlocking breakthroughs in the field. The project also includes work in interpretability (using LLMs for interpreting models; auto-interpretability), understanding agency in the current deep learning paradigm, and designing a robustly aligned training process. The tools built will integrate seamlessly into the larger alignment ecosystem. The project serves as a testing ground for potentially building an organization focused on using language models to accelerate alignment work."

Please send me a DM if you'd like to give feedback!

So, things have blown up way more than I expected and things are chaotic. Still not sure what will happen or if a treaty is actually in the cards, but I’m beginning to see a path to tons of more investment in alignment potentially. One example why is that Jeff Bezos just followed Eliezer on Twitter and I think it may catch the attention of pretty powerful and rich people who want to see AI go well. We are so off-distribution, could go in any direction.

Thanks for reporting back! I’m sharing it with my friends as well (none of which are in tech and most of them live in fairly rural parts in Canada) to see their reaction.

In case we have very different feeds, here’s a set of tweets critical about the article:

Here’s a comment I shared on my LessWrong shortform.

——

I’m still thinking this through, but I am deeply concerned about Eliezer’s new article for a combination of reasons:

  • I don’t think it will work.
  • Given that it won’t work, I expect we lose credibility and it now becomes much harder to work with people who were sympathetic to alignment, but still wanted to use AI to improve the world.
  • I am not convinced as he is about doom and I am not as cynical about the main orgs as he is.

In the end, I expect this will just alienate people. And stuff like this concerns me.

I think it’s possible that the most memetically powerful approach will be to accelerate alignment rather than suggesting long-term bans or effectively antagonizing all AI use.

Load more