I worry there's a negative example bias in the section about working with AI companies/accumulating power and influence, vs. working outside the system.
You point to cases where something bad happened, and say that some of the people complicit in the bad thing didn't protest because they wanted to accumulate power/influence within the system.
But these should be matched by looking for cases where something good happened because people tried to accumulate power/influence within a system.
I think this is a significant percent of all good things that have ever happened. Just to give a trivial example, slavery ended because people like Abraham Lincoln successfully accumulated power within the federal government, which at the time was pro-slavery and an enforcer of slavery. If abolitionists had tried to "stay pure" by refusing to run for office, they probably would have gotten nowhere.
Or an even clearer example: Jimmy Carter ended segregation in Georgia by pretending to be extremely racist, winning the gubernatorial election on the strength of the racist vote, then showing his true colors and ending segregation.
(is it cheating to use government as an example? I don't think so - you mentioned Vietnam)
You also mention academia and say that maybe the desire of academics to "work within the system" prevents intellectual change. I would argue that any time an academic consensus changes - which historically has been pretty common - it's been because someone worked within the system, got their PhD, and used their prestige to advocate for the new better position. If nobody who disagreed with an academic consensus ever did that, paradigms would never change, and academia would be much worse.
(here I think a good example is wokeness - the colleges are full of people who said they decided that such-and-such a field was racist and it was their duty to change it from within. Those people won, and they'll keep winning until people with alternate ideologies are equally dedicated)
I also think there's a bias in this space towards thinking that the current AI situation is maximally cursed compared to all counterfactuals. Suppose nobody who cared about alignment had founded an AI company. We'd still have Moore's Law and compute costs would still go down. Using modern chips, it costs $20 to train a GPT-2 equivalent (this might be slightly conditioning on chip or algorithmic progress spurred by OpenAI, but I think it's a useful comparison point). If OpenAI hadn't done it, eventually someone else would have. So maybe in this world, since OpenAI/Anthropic/DeepMind don't exist, the top AI companies are Google (not DeepMind), Meta, and Baidu, they're 1-5 years of algorithmic progress and getting-scaling-running behind where they are now, and they all have the Yann LeCun approach to alignment (or in Baidu's case have never even heard the term). Is subtracting 1-5 years from timelines in exchange for making most big AI companies have alignment teams and at-least-mildly-concerned CEOs a good trade? I can't really say, but I don't understand everyone else's strong conviction that it isn't. What would we have done with 1-5 years extra timeline? MIRI-style agent foundations research? Try to lobby politicians to pause a thing that wasn't happening?
(in fact, for this counterfactual to be fair, there can't be any alignment discussion at all - if there's alignment discussion, it inspires Sam Altman and the rest. So I think we would just let those 1-5 years pass by without using them strategically, unless we can somehow do the alignment research in secret with no public community to speak of.)
I don't want to argue that working within the system is definitely better - I'm on the fence, because of a combination of your considerations and the ones above. My cruxes are -
1. What is the chance that PauseAI activism will work?
2. If it does work, is there a plan for what to do with the pause?
3. Does pro-pause activism now complement or substitute for pro-pause activism later? (eg mid-intelligence-explosion when the case will hopefully be more obvious)
4. How much goodwill do we burn with AI companies per percent likelihood of an actually-useful AI pause that we gain? Are there different framings / forms of activism / target laws that would buy us better chances of a useful pause per unit of goodwill burnt?
5. If pro-pause activism burns goodwill, how effectively can we pull off a good cop / bad cop strategy as opposed to having the Unilateralist's Curse poison the whole movement?
6. What's the difference in p(doom) between a world where AI companies have 75th percentile vs. 25th percentile levels of goodwill towards the concept of alignment / friendly professional relationship with the alignment community?
Of these, I find myself thinking most about the last. My gut feeling is that nothing we do is going to matter, and the biggest difference between good outcomes and bad outcomes is how much work the big AI labs put into alignment during the middle of the intelligence explosion when progress moves fastest. The fulcrum for human extinction might look like a meeting between Sam Altman + top OpenAI executives where they decide whether to allocate 10% vs. 20% of their GPT-6 instances to alignment research. And the fulcrum for that fulcrum might be whether the executives think "Oh yeah, alignment, that thing that all the cool Silicon Valley people whose status hierarchy we want to climb agree is really important" vs. "Oh, the ideology of the hated socialist decel outgroup who it would be social suicide to associate ourselves with". If we get five SB 1047 style bills at the cost of shifting from the first perspective to the second, I'm not sure we're winning here (even if those bills don't get vetoed). And the more you think that all past EA interventions have made things worse, the more concerned you should be about this (arguably - I admit it depends how you generalize).
Right now I lean towards trying to chart a muddy middle course, something like "support activism that seems especially efficient in getting things done per unit of AI company goodwill burnt". But I am most optimistic about laying the groundwork for a pause campaign that might come later, in the middle of the intelligence explosion, when it will become obvious that something crazy is happening, and when Sam Altman will have all those spare GPT-6 instances which - if paused from doing capabilities research - can be turned to alignment.
"To redeem their version of morality from the demandingness objection, the tweeters assert that some good deeds are supererogatory, which is philosophy for “nice to do, but not obligatory.” The problem is that they do not present a reason why doing more good would ever be supererogatory, other than the implicit convenience of ducking the demandingness objection."
I think this might be addressed to me. My reasoning is at https://slatestarcodex.com/2017/08/28/contra-askell-on-moral-offsets/ . Other than that, I'm not sure how you get a coherent theory of obligatory vs. supererogatory.
What would it mean for a thing which nobody does (donate literally all their money beyond minimum required to live) to be obligatory? I think of "obligatory" as meaning that if someone doesn't do a thing, then we all agree to consider them a bad person. But we can't make that agreement regarding donating 100% of income beyond survival level, because we'd never stick to it - I think of the EAs who donate 50% of their income as extremely good people, I can't self-modify to not do that even if I wanted to, and I don't want to.
Without something like that, how do you distinguish "obligatory" from simply "a good thing to do" (which I and, IIUC, everyone in the discussion agree that donating more is)?
If you admit 84% of people, but also feel like many people who you would like to have are turned off by the perception of a high admissions bar, wouldn't it make sense to admit everyone (or have a default-admit policy that you stray from only in cases of extreme poor culture fit)?
I won't quite say "worst case scenario is that there are an extra 16% of people there who you don't like", because the worst case scenario is that the marginal applicant lured in by the lack of an admissions bar is much worse than the current applicant pool, but it seems like something like that could be true (ie it doesn't seem like there's currently a large pool of unqualified applicants who it would overwhelm the conference to let in).
Habryka referred me to https://forum.effectivealtruism.org/posts/A47EWTS6oBKLqxBpw/against-anthropic-shadow , whose "Possible Solution 2" is what I was thinking of. It looks like anthropic shadow holds if you think there are many planets (which seems true) and you are willing to accept weird things about reference classes (which seems like the price of admission to anthropics). I appreciate the paper you linked for helping me distinguish between the claim that anthropic shadow is transparently true without weird assumptions, vs. the weaker claim in Possible Solution 2 that it might be true with about as much weirdness as all the other anthropic paradoxes.
I'm having trouble understanding this. The part that comes closest to making sense to me is this summary:
"The fact that life has survived so long is evidence that the rate of potentially omnicidal events is low... [this and the anthropic shadow effect] cancel out, so that overall the historical record provides evidence for a true rate close to the observed rate."
Are they just applying https://en.wikipedia.org/wiki/Self-indication_assumption_doomsday_argument_rebuttal to anthropic shadow without using any of the relevant terms, or is it something else I can't quite get?
Also, how would they respond to the fine-tuning argument? That is, it seems like most planets (let's say 99.9%) cannot support life (eg because they're too close to their sun). It seems fantastically surprising that we find ourselves on a planet that does support life, but anthropics provides an easy way out of this apparent coincidence. That is, anthropics tells us that we overestimate the frequency of things that allow us to be alive. This seems like reverse anthropic shadow, where anthropic shadow is underestimating the frequency of things that cause us to be dead. So is the paper claiming that anthropics does change our estimates of the frequency of good things, but can't change our estimate of the frequency of bad things? Why would this be?
I mostly agree with this. The counterargument I can come up with is that the best AI think tanks right now are asking for grants in the range of $2 - $5 million and seem to be pretty influential, so it's possible that a grantmaker who got $8 million could improve policy by 5%, in which case it's correct to equate those two.
I'm not sure how that fits with the relative technical/policy questions.
2. I agree I'm assuming there will be a slow takeoff (operationalized as let's say a ~one year period where GPT-integer-increment-level-changes happen on a scale of months, before any such period where they happen on a scale of days).
3. AI companies being open to persuasion seems kind of trivial to me. They already have alignment teams. They already (I assume) have budget meetings where they discuss how many resources these teams should get. I'm just imagining inputs into this regular process. I agree that issues around politics could be a lesser vs. greater input.
1. I wouldn't frame this as alignment is easy/hard, so much as "alignment is more refractory to 10,000 copies of GPT-6 working for a subjective century" vs. "alignment is more refractory to one genius, not working at a lab, coming up with a new paradigm using only current or slightly-above-current AIs as model organisms, in a sense where we get one roll at this per calendar year".