Supported by Rethink Priorities
This is part of a weekly series summarizing the top posts on the EA and LW forums - you can see the full collection here. The first post includes some details on purpose and methodology. Feedback, thoughts, and corrections are welcomed.
If you'd like to receive these summaries via email, you can subscribe here.
Podcast version: prefer your summaries in podcast form? A big thanks to Coleman Snell for producing these! Subscribe on your favorite podcast app by searching for 'EA Forum Podcast (Summaries)'. More detail here.
Author's note: I'm heading on holidays, so this will be the last weekly summary until mid-January. Hope you all have a great end of year!
Top / Curated Readings
Designed for those without the time to read all the summaries. Everything here is also within the relevant sections later on so feel free to skip if you’re planning to read it all. These are picked by the summaries’ author and don’t reflect the forum ‘curated’ section.
Announcing WildAnimalSuffering.org, a new resource launched for the cause
by David van Beveren
Vegan Hacktivists released this website, which educates the viewer on issues surrounding Wild Animal Suffering, and gives resources for getting involved or learning more. Their focus was combining existing resources into something visually engaging and accessible, as an intro point for those interested in learning about it. Please feel free to share with your networks!
The winners of the Change Our Mind Contest—and some reflections
by GiveWell
First place winners of GiveWell’s contest for critiques of their cost-effectiveness analyses:
- GiveWell’s Uncertainty Problem by Noah Haber: The author argues that without properly accounting for uncertainty, GiveWell is likely to allocate its portfolio of funding suboptimally, and proposes methods for addressing uncertainty.
- An Examination of GiveWell’s Water Quality Intervention Cost-Effectiveness Analysis by Matthew Romer and Paul Romer: The authors suggest several changes to GiveWell's analysis of water chlorination programs, which overall make Dispensers for Safe Water's program appear less cost-effective.
They assigned two first place winners due to the quality of submissions, in addition to 8 honorable mentions, and $500 prizes for all others of the 49 entries meeting contest criteria.
GiveWell thinks the contest was worth doing: it provided new ideas and affected their prioritization of issues they were aware of but hadn’t addressed. Currently they expect the contest entries to shift the allocation of resources between programs, but think it’s unlikely they’ll lead to adding or removing programs from their list of recommended charities. They’ve identified ~100 discrete suggestions from entries which they’re tracking and prioritizing now.
Revisiting algorithmic progress
by Tamay, Ege Erdil
Summary of the authors’ research paper on the effect of algorithmic progress in image classification on ImageNet. They find that every 9 months (95% CI: 4 to 25 months), better algorithms contribute the equivalent of a doubling of compute budgets. Progress in image classification has been roughly 45% scaling of compute, 45% better algorithms, and 10% scaling of data. The better algorithms primarily act by using compute more effectively (as opposed to using data more effectively).
EA Forum
Philosophy and Methodologies
Octopuses (Probably) Don't Have Nine Minds
by Bob Fischer
Part of the Moral Weight Project Sequence.
Based on the split-brain condition in humans, some people have wondered whether some humans “house” multiple subjects.
There are superficial parallels between the split-brain condition and the apparent neurological structures of some animals, including octopuses and chickens. To assign a non-negligible credence to these animals housing multiple subjects in a way that matters morally, we’d need evidence that different parts of the animals have valenced conscious states (like pain). This is difficult to get for several reasons outlined in the post. The author therefore recommends not assuming multiple subjects in a single animal for the purposes of the Moral Weight Project.
Overall, the author places up to a 0.1 credence that there are multiple subjects in the split-brain case, but no higher than 0.025 for the 1+8 model of octopuses.
GiveWell’s Moral Weights Underweight the Value of Transfers to the Poor
by Trevor Woolley and Ethan Ligon
GiveWell baselines their cost-effectiveness analyses on the value of doubling consumption. This assumes that the functional form of marginal utility over consumption is 1/x (where x is real consumption). There is strong evidence this doesn’t match the preferences of the Kenyan beneficiaries of GiveDirectly, and therefore underweights the value of cash transfers to the very poor.
The authors suggest GiveWell was likely intending to value “halving marginal utility of expenditure”. They empirically estimate the marginal utility over consumption (λ) as revealed by Kenyan beneficiaries of GiveDirectly’s cash transfers program and conclude the value per dollar of cash transfers is 2.6 times GiveWell’s current number (from 0.0034 to 0.009).
The full paper can be read here.
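To make the 1/x assumption concrete, here is a minimal sketch. It assumes log utility, u(x) = ln(x), which is the utility function whose marginal utility is 1/x. Under that form, doubling consumption is worth the same amount (ln 2) at every income level. The isoelastic alternative and its `gamma` parameter are hypothetical illustrations of a steeper marginal utility curve, not the functional form or estimate from the paper. The last line only checks the ratio of the two per-dollar values quoted above.

```python
import math

def log_utility_gain(baseline, multiplier=2.0):
    """Utility gain under u(x) = ln(x), whose marginal utility is 1/x."""
    return math.log(baseline * multiplier) - math.log(baseline)

def isoelastic_utility_gain(baseline, multiplier=2.0, gamma=1.5):
    """Utility gain when marginal utility is x**(-gamma).

    gamma is a hypothetical curvature parameter chosen for illustration;
    gamma = 1 recovers the log-utility (1/x) case.
    """
    if gamma == 1.0:
        return log_utility_gain(baseline, multiplier)

    def utility(x):
        return x ** (1.0 - gamma) / (1.0 - gamma)

    return utility(baseline * multiplier) - utility(baseline)

# Under 1/x, doubling is worth the same at any baseline consumption.
gain_poor = log_utility_gain(1.0)
gain_rich = log_utility_gain(100.0)

# With steeper curvature (gamma > 1), doubling is worth more to the poorer person.
steep_poor = isoelastic_utility_gain(1.0)
steep_rich = isoelastic_utility_gain(4.0)

# Ratio of the two per-dollar values quoted in the summary.
value_ratio = 0.009 / 0.0034
```

If marginal utility falls faster than 1/x, a fixed "value of doubling consumption" understates the benefit to the poorest recipients, which is the direction of the authors' critique.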
Neuron Count-Based Measures May Currently Underweight Suffering in Farmed Fish
by MHR
Neuron counts have historically been used as a proxy for the moral weight of different animal species. While alternate systems have been proposed, they are often still an input.
The only publicly-available empirical reports of fish neuron counts sample exclusively from species of <1g bodyweight, while farmed fish are at least 1000x larger. Some sources apply these neuron counts to farmed fish without correction, which is likely to underweight them. Even where corrections are applied, there is uncertainty in how to extrapolate the counts to larger fish.
Because of this, the author suggests animal welfare advocates be highly skeptical of current neuron-count based estimates of the moral weight of farmed fish, and consider funding studies to empirically measure neuron counts in these species.
Object Level Interventions / Reviews
Creating a database for base rates
by nikos
The author is creating a database to collect base rates for various categories of events eg. protests that have (or have not) led to regime change, developments of new antibiotics, elections with small margins of victory. You can suggest new base rate categories you’d like looked into here.
The main goal is to develop a better understanding of the merits and limitations of reference class forecasting, with a secondary goal of collecting information useful to forecasters and EA stakeholders. Anyone is free to use the data for their own research.
The next decades might be wild
by mariushobbhahn
The author imagines “what [they] would expect the world to look like if (median compute for transformative AI ~2036) were true”. They claim tech can be disruptive, and reach widespread adoption within a few decades of introduction (eg. phones, internet), with the rate of adoption accelerating. AI is getting useful in the real world (including in assisting human coders), transformers work astonishingly well in multiple domains, it seems like AI hype is not slowing down, and AI accomplishments have been unexpected in the past (eg. many were surprised by the first Chess AIs, or by GPT-2, GPT-3, or DALL-E). Based on these points the author writes predictions for each decade between now and 2050+, in the form of vignettes.
Radical tactics can increase support for more moderate groups
by James Ozden
Surveys were conducted on the same 1.4K people before and after a ‘Just Stop Oil’ campaign. The campaign was radical, and 92% of those surveyed were aware of it afterwards. The survey asked about support for climate policies and identification with a more moderate climate organization (Friends of the Earth). Identification increased from 50.3% to 52.9% (p = 0.007), showing a ‘radical flank effect’ - a benefit to the moderate organization from the more radical organization’s campaigning. However, it also showed increased polarization - those with low baseline identification with Friends of the Earth reduced their support for climate policies after the campaign (and vice versa).
Concrete actionable policies relevant to AI safety (written 2019)
by weeatquince
An unedited copy of the author’s 2019 notes on UK AI policy. They took best practices from nuclear safety policy and applied them to AI safety. They no longer agree with everything written. Key recommendations (excluding those marked as ‘now unsure of’) include:
- Support more long-term thinking in policy / politics.
- Improve the processes for identifying, mitigating, and planning for future risks.
- Improve the ability of the government to draw on technical and scientific expertise.
- Have civil servants research policy issues around ethics and technology and AI.
- Set up a regulator in the form of a well-funded body of technical experts, to ensure safe and ethical behavior of the tech industry and government.
Opportunities
The winners of the Change Our Mind Contest—and some reflections
by GiveWell
First place winners of GiveWell’s contest for critiques of their cost-effectiveness analyses:
- GiveWell’s Uncertainty Problem by Noah Haber: The author argues that without properly accounting for uncertainty, GiveWell is likely to allocate its portfolio of funding suboptimally, and proposes methods for addressing uncertainty.
- An Examination of GiveWell’s Water Quality Intervention Cost-Effectiveness Analysis by Matthew Romer and Paul Romer: The authors suggest several changes to GiveWell's analysis of water chlorination programs, which overall make Dispensers for Safe Water's program appear less cost-effective.
They assigned two first place winners due to the quality of submissions, in addition to 8 honorable mentions, and $500 prizes for all others of the 49 entries meeting contest criteria.
They think the contest was worth doing, providing both new ideas, and increasing their prioritization on issues they were aware of but hadn’t addressed. Currently they expect the contest entries to shift the allocation of resources between programs, but think it’s unlikely they’ll lead to adding or removing programs from their list of recommended charities. They identified ~100 discrete suggestions from entries which they’re tracking and prioritizing now.
Announcing the Forecasting Research Institute (we’re hiring)
by Tegan
The Forecasting Research Institute (FRI) is a new organization focused on advancing the science of forecasting for the public good. Their strategy is based around:
- Filling in gaps in the science of forecasting eg. how to handle low probability events or complex topics that can’t be captured in a single forecast.
- Adapting forecast methods to practical purposes eg. identifying where forecasting could be most useful, and increasing decision-relevance of questions.
Concrete upcoming projects include developing a forecasting proficiency test to quickly identify accurate forecasters, identifying leading indicators of increased risk from AI, and exploring ways to judge and incentivize answers to far-future questions.
They have open, fully remote positions for research analysts, data analysts, content editors and research assistants. Apply here.
Open Philanthropy is hiring for (lots of) operations roles!
by maura
Open Philanthropy is hiring for a Business Operations Lead, Business Operations Generalists, Finance Operations Assistant, Grants Associates, People Operations Generalist, Recruiter and Salesforce Administrator & Technical Project Manager. Most but not all roles are worldwide remote, if you can overlap with some US working hours. Applications and referrals are open now (there’s a $5K referral bonus).
by CEEALAR
The Centre for Enabling EA Learning & Research (CEEALAR) is an EA hotel that provides grants in the form of food and accommodation on-site in Blackpool, UK. They have lots of space and encourage applications from those wishing to learn or work on research or charitable projects in any cause area. This includes study and upskilling with the intent to move into those areas.
Since opening 4.5 years ago, they’ve supported ~100 EAs with their career development, and hosted another ~200 visitors for events / networking / community building. It costs CEEALAR ~£800/month to host someone - including free food, logistics, and project guidance. This is ~13% of the cost of an established EA worker, and an example of hits-based giving.
They have plans to expand, and are fixing up a next door property that will increase capacity by ~70%. They welcome donations, though aren’t in imminent need (they have 12 - 20 months of runway, depending on factors covered in the post). They’re also looking for a handy-person.
Applications open for AGI Safety Fundamentals: Alignment Course
by Jamie Bernardi, richard_ngo
Apply by 5th January to join the AGI Safety Fundamentals: Alignment Course. It will run Feb - Apr 2023, with 8 weeks of reading and virtual discussions, and a 4-week capstone. Commitment is ~4 hours per week.
Community & Media
by Linch
The General Longtermism team at Rethink Priorities has existed for just under a year, with an average of ~5 FTE. Its theory of change was facilitating the creation of scalable longtermist megaprojects, and improving strategic clarity on intermediate goals longtermists should pursue.
Outputs included:
- Supporting creation of the Special Projects team, which provides fiscal sponsorship to external entrepreneurial projects.
- Cofounding and running Condor Camp, a project to engage world-class talent in Brazil for longtermist causes.
- Cofounding and running Pathfinder, a project to help mid-career professionals find high impact work.
- 13 shallow research dives into specific projects, with deeper dives on air sterilization techniques, whistleblowing, AI safety recruitment, and infrastructure for independent researchers.
- Founder search for multiple promising projects.
- A model for prioritizing between longtermist projects.
- Research and database of resources on nanotech strategy.
The team is currently reorienting strategy for 2023. Recent changes to EA funding mean megaprojects seem less relevant (and some research questions more relevant), but it’s still plausible entrepreneurial longtermist projects might be a main research direction for the team.
Ideas for highly impactful research projects, donations, expressions of interest, and feedback on plans are all highly appreciated.
EA career guide for people from LMICs
by Surbhi B, Mo Putera, varun_agr, AmAristizabal
The authors broadly recommend the following for EAs from low and middle income countries (LMICs):
- Build career capital early on
- Work on global issues over local ones, unless there are clear reasons for the latter
- Some individuals should do local versions of: community building, priorities research, charity-related activities, or career advising
They discuss pros, cons, and concrete next steps for each. Individuals can use the scale / neglectedness / tractability framework, marginal value, and personal fit to assess options. They suggest looking for local comparative advantage at global priorities, and taking the time to upskill and engage deeply with EA ideas before jumping into direct work.
Announcing WildAnimalSuffering.org, a new resource launched for the cause
by David van Beveren
Vegan Hacktivists released this website, which educates the viewer on issues surrounding Wild Animal Suffering, and gives resources for getting involved or learning more. Their focus was combining existing resources into something visually engaging and accessible, as an intro point for those interested in learning about it. Please feel free to share with your networks!
Announcing ERA: a spin-off from CERI
by Nandini Shiralkar
The CERI Fellowship has spun off from the Cambridge Existential Risks Initiative (CERI), and will be run by a new nonprofit called ERA from 2023. This allows CERI to re-focus on local community projects for the University of Cambridge, and reduces name confusion with the many EA projects / groups ending in ‘ERI’.
Applications for their July - August ERA Cambridge Fellowship (8-week paid programme focused on existential risk mitigation projects) will open in Jan / Feb - register your interest here to be notified when they do. They’re also looking for mentors, and expressions of interest for joining the team.
The Rules of Rescue - out now!
by Theron
The Rules of Rescue is a new book by the post author, which “defends a novel picture of the moral reasons and requirements to use time, money, and other resources to help others the most.” It’s open access and you can read the PDF for free here, visit the website, or buy an ebook or printed copy.
Reflections on the PIBBSS Fellowship 2022
by nora, particlemania
PIBBSS (Principles of Intelligent Behavior in Biological and Social Systems) facilitates research on parallels between intelligent behavior in natural and artificial systems, with the aim to use this towards building safe and aligned AI.
They ran a 3-month summer research fellowship with 20 scholars from varying fields - including 6 weeks reading, 2 research retreats, biweekly speakers, and individual research support. ~12 had a significant counterfactual move toward engaging in the AI safety field, 6-10 made interesting progress on promising research programs like intrinsic reward-shaping in brains, and 3-5 started long-term collaborations. They also developed a multi-disciplinary research network beyond just Fellows.
They think they’ll run it again, with more structured support, encouragement to produce communicable research outputs faster, higher weighting of ML experience for those with prosaic projects, and more care about accepting fellows with conflicting incentives (eg. from academia).
I went to the Progress Summit. Here’s What I Learned.
by Nick Corvino
The Progress Summit is run by The Atlantic, to “highlight the most exciting ideas in science and technology” and “discuss how we can invent our way to a better world”. The author thinks the Progress Studies community is reasonably aligned with what EAs care about and could be a good alternative for those who find EA too intimidating, intense, or too longtermism-focused.
The author shares some reflections from attending, including:
- It felt more professional than EA events (cocktails, food, outfits, smooth Ops).
- Talks were fluffy, but speakers were eloquent and engaging. They were often ‘selling’ their products in their talks, to appeal to investors and venture capitalists.
- Networking was less intense - more small talk.
- The majority of attendees were bullish on tech progress and weren’t familiar with x-risks like AGI or biorisk. Where risk was addressed, it was economic risk, climate change, or war.
by NicoleJaneway
The author went to EAGxBerkeley, and found many young EAs don’t have a strong grasp of personal finance. They suggest EAs (especially student groups) could benefit from education here eg. how to use low-cost index funds for investing, or setting up rainy day funds. Because EAs have different needs to the general population (eg. they can take more risk with assets they plan to donate), they also suggest the next EAGx have a talk that covers smart ways to maximize giving strategies, geared towards the rules of the country hosting it.
by DavidNash
The UK has three EA hubs - London, Oxford, and Cambridge. In addition there are many student groups, a city group in Bristol, and the EA hotel in Blackpool. The post details EA communities, organisations, and offices in each city.
We should say more than “x-risk is high”
by OllieBase
Some posts have argued that in order to persuade people to work on high priority issues like AI Safety and biosecurity, we only need to point to high x-risk this century, not to longtermism or broader EA principles. The author agrees this could convince people, but disagrees with that approach in general, because:
- Our situation could change (eg. x-risk lower than we thought)
- Our priorities could change (eg. the best interventions could be something indirect like ensuring global peace)
- It risks losing what makes EA distinctive, and being dismissed as alarmist - other movements also focus on x-risk arguments (eg. Extinction Rebellion).
Therefore, the author suggests outlining the case for longtermism and how it implies that x-risk should be a top priority even if x-risk is low, to make the community robust to these scenarios.
by David van Beveren
Kurzgesagt is an educational YouTube channel that has ~20M subscribers. They’ve done several videos on EA and longtermism-related topics, and have funding from Open Philanthropy for this.
Their latest video, “How to Terraform Mars - WITH LASERS” promotes the idea of seeding wildlife on other planets. It doesn’t mention anything about the welfare of these animals, which could involve suffering from adapting to hostile and unfamiliar environments. The author argues not addressing this issue is a common problem in almost all major plans and discussions on terraforming or space colonization.
by Omnizoid
There are amazing opportunities to help the global poor (see GiveWell recommendations), some of whose incomes are ~1% of those of poor people in the USA. The author asks readers to please support this cause, even if they think badly of EA / don’t want to be part of the EA community.
The Effective Altruism movement is not above conflicts of interest
by sphor
Linkpost and excerpts from an EA criticism contest entry published by a pseudonymous author on 31st August 2022 (before the collapse of FTX).
The post notes that EA relying on ultra-wealthy individuals like Sam Bankman-Fried (SBF) incentivizes the community to accept political stances and moral judgments based on their alignment with the interests of its wealthy donors. They argue EA has failed to identify and publicize these conflicts of interest, and suggest that EA should do so, and then consider what systematic safeguards might be needed. So far EA has relied on the promotion of debates, which isn’t sufficient because individuals can’t consciously free themselves of bias.
As an example, they discuss how cryptocurrency is inherently political, how attacks on it affect EA’s reputation, and the risk to EA if SBF were involved in an ethical or legal scandal. Because of this, EA has an incentive to protect SBF’s reputation, think positively of cryptocurrency, and counter critics.
FTX-Related
EA is probably undergoing "Evaporative Cooling" right now
by freedomandutility
When a group goes through a crisis (eg. the FTX collapse), those who hold the group's beliefs least strongly leave, and those who hold the group's beliefs most strongly stay. This might leave the remaining group less able to identify weaknesses within group beliefs or course-correct, or "steer". The author suggests one way to combat this would be to move community building focus to producing moderately-engaged EAs instead of highly-engaged EAs.
Cryptocurrency is not all bad. We should stay away from it anyway.
by titotal
The author argues that “the crypto industry as a whole has significant problems with speculative bubbles, ponzis, scams, frauds, hacks, and general incompetence”, and that EA orgs should avoid being significantly associated with it until the industry becomes stable.
In the last year, at least 4 crypto firms collapsed, excluding FTX. Previous downturns have included the collapse of Mt. Gox, the largest crypto exchange at the time. Crypto’s use is dominated by people using it to get rich - after 14 years, there are almost no widespread uses outside of this. This all means it’s a speculative bubble, and it will likely collapse again (maybe not in the same way). If we’re associated with it, this could lead to a negative reputation that EA “keeps getting scammed”.
"I'm as approving of the EA community now as before the FTX collapse" (Duncan Sabien)
by Will Aldred
Given a community of thousands, the author expects some bad things to happen. FTX feels more like bad luck than “how dare we not have predicted this / why weren’t we robust to this”. We should reflect and potentially act, which they believe is happening, but the level of vigilance being proposed by some would paralyze the movement. They continue to support and endorse EA to the same level as before this black swan event.
I'm less approving of the EA community now than before the FTX collapse
by throwaway790
The author is specifically less approving of CEA / EVF, Will MacAskill, and donation practices during the “funding overhang”. They are more approving of Peter Wildeford, Rethink Priorities, Rob Wiblin, and Dustin Moskovitz. The reasons for disapproval include:
- Statements during the events being minimizing, ambiguous, or missing
- Not taking enough responsibility for involvement with FTX
- Funding decisions like buying Wytham Abbey
They are still in favor of EA principles, and plan to donate to EA causes.
Reflections on Vox's "How effective altruism let SBF happen"
by Richard Y Chappell
Richard believes the article correctly identifies that EA needs more respect for established procedures, and suggests a culture of consulting with senior advisors who understand how institutions work and why. He disagrees with the framing from Vox that “the problem is the dominance of philosophy”.
Sam Bankman-Fried has been arrested
by Markus Amalthea Magnuson
On 12th December, SBF was arrested in the Bahamas following receipt of formal notification from the United States that it has filed criminal charges against SBF and is likely to request his extradition.
The US Attorney for the SDNY asked that FTX money be returned to victims. What are the moral and legal consequences to EA?
by Fermi–Dirac Distribution
In a December 13 press conference the United States Attorney for the Southern District of New York said: “To any person, entity or political campaign that has received stolen customer money, we ask that you work with us to return that money to the innocent victims.” This post is an open thread for discussion on this.
Didn’t Summarize
Hugh Thompson Jr (1943–2006) by Gavin
EA's Achievements in 2022 by ElliotJDavies (open thread)
Today is Draft Amnesty Day (December 16-18) by Lizka
[Expired] $50 TisBest Charity Gift Card to the first 20,000 people who sign up by Michael Huang
LW Forum
AI Related
by Collin
The author and collaborators published Discovering Latent Knowledge in Language Models Without Supervision. This post discusses how it fits into a broader alignment scheme.
Paper summary (summarized from this Twitter thread): Existing language model training techniques have the issue that human data contains human-like errors, so that eg. a model trained to generate highly-rated text can output errors human evaluators don’t notice. Instead, the authors propose finding latent “truth-like” features without human supervision, by searching for implicit beliefs or knowledge learned by a model. They use CCS (Contrast-Consistent Search) to outperform model outputs on accuracy, even when model outputs are unreliable or uninformative.
Post summary (Author’s tl;dr): unsupervised methods are more scalable than supervised methods, deep learning has special structure that we can exploit for alignment, and we may be able to recover superhuman beliefs from deep learning representations in a totally unsupervised way.
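As a rough illustration of the CCS idea mentioned in the paper summary, the sketch below scores a pair of probe outputs for a statement and its negation. It assumes the objective combines a consistency term (the two probabilities should sum to 1) and a confidence term (discouraging the degenerate answer of 0.5 for both), which is how the paper describes it. The function name and the plain-float inputs are hypothetical simplifications. In the actual method the probabilities come from a probe trained on the model's hidden states, which is not shown here.

```python
def ccs_loss(p_pos, p_neg):
    """Contrast-consistency loss for one statement / negation pair.

    p_pos: probe's probability that the statement is true.
    p_neg: probe's probability that the negated statement is true.
    """
    # Consistency: p(statement) should equal 1 - p(negation).
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: penalize sitting on the fence (e.g. 0.5 for both).
    confidence = min(p_pos, p_neg) ** 2
    return consistency + confidence

confident_and_consistent = ccs_loss(1.0, 0.0)
fence_sitting = ccs_loss(0.5, 0.5)
contradictory = ccs_loss(0.9, 0.9)
```

No labels appear anywhere in the loss, which is what makes the search unsupervised: only the logical relationship between a statement and its negation is used.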
AI alignment is distinct from its near-term applications
by paulfchristiano
Existing AI systems are misaligned, ie. they will often do things their designers don’t want, like saying offensive things. These systems are a good empirical testbed for alignment research. However, if companies train AIs to be very conservative and inoffensive, it risks a backlash against alignment and a misunderstanding of what alignment is. The main purpose of alignment is to stop AI killing everyone, and it could be very bad if efforts to prevent this were undermined by a vague public conflation between AI alignment and corporate policies.
Revisiting algorithmic progress
by Tamay, Ege Erdil
Summary of the authors’ research paper on the effect of algorithmic progress in image classification on ImageNet. They find that every 9 months (95% CI: 4 to 25 months), better algorithms contribute the equivalent of a doubling of compute budgets. Progress in image classification has been roughly 45% scaling of compute, 45% better algorithms, and 10% scaling of data. The better algorithms primarily act by using compute more effectively (as opposed to using data more effectively).
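To get a feel for what a 9-month doubling time implies, here is a small sketch that converts elapsed time into an "effective compute" multiplier. It assumes steady exponential progress at the stated doubling time, which is a simplification. The function name is hypothetical, and the 4- and 25-month values are simply the ends of the confidence interval quoted above.

```python
def effective_compute_multiplier(months_elapsed, doubling_time_months=9.0):
    """Multiplier on effective compute from algorithmic progress alone,
    assuming a constant doubling time (a simplifying assumption)."""
    return 2.0 ** (months_elapsed / doubling_time_months)

# Point estimate and the two ends of the 95% CI, over a three-year horizon.
three_years_central = effective_compute_multiplier(36, doubling_time_months=9.0)
three_years_fast = effective_compute_multiplier(36, doubling_time_months=4.0)
three_years_slow = effective_compute_multiplier(36, doubling_time_months=25.0)
```

The wide interval matters: over three years the same calculation gives a 16x gain at the central estimate, under 3x at the slow end, and several hundred times at the fast end.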
[Interim research report] Taking features out of superposition with sparse autoencoders
by Lee Sharkey, Dan Braun, beren
Author’s TL;DR: Recent results from Anthropic suggest that neural networks represent features in superposition. This motivates the search for a method that can identify those features. Here, we construct a toy dataset of neural activations and see if we can recover the known ground truth features using sparse coding. We show that, contrary to some initial expectations, it turns out that an extremely simple method – training a single layer autoencoder to reconstruct neural activations with an L1 penalty on hidden activations – doesn’t just identify features that minimize the loss, but actually recovers the ground truth features that generated the data. We’re sharing these observations quickly so that others can begin to extract the features used by neural networks as early as possible. We also share some incomplete observations of what happens when we apply this method to a small language model and our reflections on further research directions.
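The method described in the TL;DR (a single layer autoencoder trained to reconstruct activations with an L1 penalty on hidden activations) can be sketched as a loss function. This is a dependency-free illustration, not the authors' implementation: the ReLU nonlinearity, the bias terms, the mean-squared reconstruction error, and the `l1_coeff` value are all assumptions made for the example, and no training loop is shown.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sparse_autoencoder_loss(x, enc_w, enc_b, dec_w, dec_b, l1_coeff=0.1):
    """Reconstruction error plus an L1 penalty on the hidden activations.

    x: one activation vector. enc_w / dec_w: encoder and decoder weight
    matrices. l1_coeff: hypothetical strength of the sparsity penalty.
    """
    # Encode: single hidden layer with an assumed ReLU nonlinearity.
    hidden = relu([h + b for h, b in zip(matvec(enc_w, x), enc_b)])
    # Decode: linear reconstruction of the input activations.
    recon = [r + b for r, b in zip(matvec(dec_w, hidden), dec_b)]
    mse = sum((r - a) ** 2 for r, a in zip(recon, x)) / len(x)
    # L1 penalty pushes most hidden units toward zero (sparsity).
    l1 = sum(abs(h) for h in hidden)
    return mse + l1_coeff * l1

identity = [[1.0, 0.0], [0.0, 1.0]]
zeros_2x2 = [[0.0, 0.0], [0.0, 0.0]]
no_bias = [0.0, 0.0]
sample = [1.0, 2.0]

perfect_recon_loss = sparse_autoencoder_loss(sample, identity, no_bias, identity, no_bias)
all_zero_loss = sparse_autoencoder_loss(sample, zeros_2x2, no_bias, zeros_2x2, no_bias)
```

The two example values show the trade-off the penalty creates: a perfect reconstruction still pays for every active hidden unit, while an all-zero code pays nothing for sparsity but everything for reconstruction.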
Trying to disambiguate different questions about whether RLHF is “good”
by Buck
Conversations on whether Reinforcement Learning from Human Feedback (RLHF) is a promising alignment strategy, ‘won’t work’, or ‘is just capabilities research’ are muddled. The author distinguishes 11 related questions, and gives their opinions on them.
Overall they think RLHF by itself (with non-aided human overseers) is unlikely to be a promising alignment strategy, and there are failure modes like RLHF selecting for models that look aligned but aren’t. However, they think a broader version (eg. with AI-assisted humans) could be a part of an alignment strategy, and that researching alignment schemes involving RLHF could be one of the most promising research directions.
by g1
The author has been aware of AI x-risk arguments for a while, and often agreed with them, but in a detached way. Spending time observing ChatGPT has brought their gut feelings into line with their beliefs.
Can we efficiently explain model behaviors?
by paulfchristiano
Alignment Research Center’s (ARC’s) current plan for Eliciting Latent Knowledge (ELK) has 3 major challenges. The author describes why they expect significant progress on #1 and #3 over the next 6 months, and why that would be a big deal even if #2 turns out to be extremely challenging. The challenges are:
- Formalizing probabilistic heuristic argument as an operationalization of ‘explanation’
- Finding sufficiently specific explanations for important model behaviors
- Checking whether particular instances of a behavior are ‘because of’ a particular explanation
Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic)
by LawrenceC
Linkpost for this paper.
Author’s summary: The authors propose a method for training a harmless AI assistant that can supervise other AIs, using only a list of rules (a "constitution") as human oversight. The method involves two phases: first, the AI improves itself by generating and revising its own outputs; second, the AI learns from preference feedback, using a model that compares different outputs and rewards the better ones. The authors show that this method can produce a non-evasive AI that can explain why it rejects harmful queries, and that can reason in a transparent way, better than standard RLHF.
Didn’t Summarize
Consider using reversible automata for alignment research by Alex_Altair
[Interim research report] Taking features out of superposition with sparse autoencoders by Lee Sharkey, Dan Braun, beren