What AI Safety Materials Do ML Researchers Find Compelling?

Vael Gates

aogara · Dec 28 2022

Really interesting stuff! Another question that could be useful is how much each piece shifted their views on existential risk.

Some of the better liked pieces are less ardent about the possibility of AI x-risk. The two pieces that are most direct about x-risk might be the unpopular Cotra and Carlsmith essays. I’m open to the idea that gentler introductions to ideas about safety could be more persuasive, but it might also result in people working on topics that are less relevant for existential safety. Hopefully we’ll be able to find or write materials that are both persuasive to the ML community and directly communicate the most pressing concerns about alignment.

Separately, is your sample size 28 for each document, or did different documents have different numbers of readers? It might be informative to see the individual sample sizes, especially for a long report like Carlsmith’s, where you might expect that not many readers put in the hour+ necessary to read it.

Edit: Discussion of this point here: https://www.lesswrong.com/posts/gpk8dARHBi7Mkmzt9/what-ai-safety-materials-do-ml-researchers-find-compelling?commentId=Cxoa577LadGYwC49C#comments

Vael Gates · Dec 29 2022

(in response to the technical questions)

Mostly n = 28 for each document; some had n = 29 or n = 30. You can see details in the Appendix, quantitative section.

The Carlsmith link is to the YouTube talk version, not the full report; we chose materials based on their being pretty short.

wANIEL · Jan 4 2023

Was each piece of writing read by a fresh set of n researchers (i.e., a total of ~30 × 8 researchers participated)? I understand the alternative to be that the same ~30 researchers read all 8 pieces of writing.

If the latter is true, the following question interests me:
Did you specify the order in which they should read the pieces?

I expect somebody making their first contact with AIS to have a very path-dependent response. For instance, encountering Carlsmith first versus encountering Carlsmith last would likely produce different effects, possibly extending to the researchers' ratings of the other pieces.

Unrelatedly, I'm wondering whether researchers were exposed only to the transcripts of the videos as opposed to the videos themselves.

Vael Gates · Jan 4 2023

No, the same set of ~28 researchers read all of the readings.

The order of the readings was indeed specified:

1. Concise overview (Stuart Russell, Sam Bowman; 30 minutes)
2. Different styles of thinking about future AI systems (Jacob Steinhardt; 30 minutes)
3. A more in-depth argument for highly advanced AI being a serious risk (Joe Carlsmith; 30 minutes)
4. A more detailed description of how deep learning models could become dangerously "misaligned" and why this might be difficult to solve with current ML techniques (Ajeya Cotra; 30 minutes)
5. An overview of different research directions (Paul Christiano; 30 minutes)
6. A study of what ML researchers think about these issues (Vael Gates; 45 minutes)
7. Some common misconceptions (John Schulman; 15 minutes)

Researchers had the option to read the transcripts where transcripts were available; we said that consuming the content in either form (video or transcript) was fine.

Vasco Grilo · Dec 29 2022

Thanks for sharing!

I was wondering how likely it is that the results are a fluke, so I calculated the p-value for the null hypothesis that the true mean scores for the question “Overall, how much did you like this content?” were equal for Steinhardt (S) and Gates (G).

Assumption: S and G follow normal distributions.
Sample sizes: n_S = n_G = 29.
Sample means: mu_S = 5.7, mu_G = 5.4.
Standard errors of the sample means: SE_S = SE_G = 0.2.
T-score: t = (mu_S - mu_G)/(SE_S^2 + SE_G^2)^0.5 = 1.06.
Degrees of freedom: D = n_S + n_G - 2 = 56.
P-value (Excel): 2*(1-T.DIST(t, D, 1)) = 29.3 %.

The p-value would be lower if we compared Steinhardt with authors whose materials got a lower mean score. I think it would be nice to include this type of statistical analysis in the report, so that it is easier to quickly assess how robust the conclusions are.
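As a quick check, the calculation above can be reproduced in plain Python. This is a minimal sketch, not the analysis used in the report: it takes the summary statistics quoted above, uses the same SE-based t-score and D = n_S + n_G - 2, and swaps Excel's T.DIST for a Simpson's-rule integration of the Student's t density. The helper names `t_pdf` and `two_sided_p` are hypothetical.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with `df` degrees of freedom."""
    # Log-gamma keeps the normalizing constant finite for large df.
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef) * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t: float, df: int, steps: int = 2000) -> float:
    """Two-sided p-value P(|T| >= |t|), via Simpson's rule on [0, |t|]."""
    b = abs(t)
    h = b / steps  # `steps` must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3  # P(0 <= T <= |t|)
    return 1 - 2 * central


# Summary statistics quoted in the comment above (Steinhardt vs. Gates).
n_s = n_g = 29
mu_s, mu_g = 5.7, 5.4
se_s = se_g = 0.2

t = (mu_s - mu_g) / math.sqrt(se_s**2 + se_g**2)
df = n_s + n_g - 2
p = two_sided_p(t, df)
print(f"t = {t:.2f}, df = {df}, p = {p:.1%}")
```

If SciPy is installed, `2 * scipy.stats.t.sf(abs(t), df)` should give the same p-value without the hand-rolled integration.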

Vael Gates · Dec 29 2022

Nice, yeah! I wouldn't have expected a statistically significant difference between means of 5.7 and 5.4 with those standard errors, but it's nice to see that confirmed here.

I considered doing a statistical test, and then spent some time googling how to do something like a "3-paired" ANOVA on data that looks like ("s" is subject, "r" is reading):

[s1 r1 "like"] [s1 r1 "agreement"] [s1 r1 "informative"]

[s2 r1 "like"] [s2 r1 "agreement"] [s2 r1 "informative"]

... [s28 r1 "like"] [s28 r1 "agreement"] [s28 r1 "informative"]

[s1 r2 "like"] [s1 r2 "agreement"] [s1 r2 "informative"]

[s2 r2 "like"] [s2 r2 "agreement"] [s2 r2 "informative"]

...

because I'd like to do an ANOVA on the raw scores rather than the means. I did not resolve my confusion about what to do with the 3-paired data (I guess you could lump each subject's data into one column, or analyze "like", "agreement", and "informative" separately, but I'm interested in how good each of the readings is summed across the three metrics). I then gave up and just presented the summary statistics. (You can extract the raw scores from the Appendix if you put some work into it, though; or I could pass along the raw scores, or you could tell me how to do this sort of analysis in Python if you want me to do it!)
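One possible sketch of the "summed across the three metrics" option, in plain Python: sum each subject's like/agreement/informative scores per reading, then run a one-way repeated-measures ANOVA with reading as the within-subject factor. The function name `rm_anova_on_totals`, the long-format tuple layout, and the toy scores are hypothetical; they are not the study's data or code. It assumes complete data (every subject scored every reading on all three metrics) and applies no sphericity correction.

```python
from collections import defaultdict

METRICS = ("like", "agreement", "informative")


def rm_anova_on_totals(rows):
    """One-way repeated-measures ANOVA, reading as the within-subject factor.

    rows: iterable of (subject, reading, metric, score) tuples (long format).
    Each subject's three metric scores for a reading are summed first.
    Returns (F, df_reading, df_error).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for subject, reading, metric, score in rows:
        totals[(subject, reading)] += score
        counts[(subject, reading)] += 1

    subjects = sorted({s for s, _ in totals})
    readings = sorted({r for _, r in totals})
    n, k = len(subjects), len(readings)
    for s in subjects:
        for r in readings:
            if counts.get((s, r)) != len(METRICS):
                raise ValueError(f"incomplete scores for {s}, {r}")

    x = {(s, r): totals[(s, r)] for s in subjects for r in readings}
    grand = sum(x.values()) / (n * k)
    reading_means = {r: sum(x[(s, r)] for s in subjects) / n for r in readings}
    subject_means = {s: sum(x[(s, r)] for r in readings) / k for s in subjects}

    ss_total = sum((v - grand) ** 2 for v in x.values())
    ss_reading = n * sum((m - grand) ** 2 for m in reading_means.values())
    ss_subject = k * sum((m - grand) ** 2 for m in subject_means.values())
    # Residual = reading-by-subject interaction, the error term in this design.
    ss_error = ss_total - ss_reading - ss_subject

    df_reading, df_error = k - 1, (k - 1) * (n - 1)
    f_stat = (ss_reading / df_reading) / (ss_error / df_error)
    return f_stat, df_reading, df_error


# Hypothetical toy scores, NOT the study's data:
# subject -> reading -> (like, agreement, informative)
toy = {
    "s1": {"r1": (6, 5, 6), "r2": (5, 4, 5), "r3": (4, 4, 3)},
    "s2": {"r1": (5, 5, 5), "r2": (5, 5, 4), "r3": (3, 4, 4)},
    "s3": {"r1": (7, 6, 6), "r2": (6, 5, 6), "r3": (5, 5, 4)},
    "s4": {"r1": (4, 5, 5), "r2": (4, 4, 4), "r3": (4, 3, 3)},
}
rows = [
    (s, r, metric, score)
    for s, by_reading in toy.items()
    for r, triple in by_reading.items()
    for metric, score in zip(METRICS, triple)
]
f_stat, df1, df2 = rm_anova_on_totals(rows)
print(f"F({df1}, {df2}) = {f_stat:.2f}")
```

With SciPy available, `scipy.stats.f.sf(f_stat, df1, df2)` turns the F statistic into a p-value. statsmodels' `AnovaRM` (in `statsmodels.stats.anova`) fits the same design from a pandas DataFrame of per-subject, per-reading totals, if you'd rather not hand-roll it.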

When I look at these tables, I'm also usually squinting at the median rather than the mean, though I look at both. You can see the distributions in the Appendix, which I like even better. But point taken that it'd be nice to have stats.

Vasco Grilo · Dec 30 2022

You can extract the raw scores from the Appendix if you put some work into it though, or I could pass along the raw scores, or you could tell me how to do this sort of analysis in Python if you wanted me to do it!

Ah, thanks for the suggestion! To be honest, I only have basic knowledge of stats, so I do not know how to do the more complex analysis you described. My (quite possibly flawed) intuition for analysing all questions would be:

1. Determine, for each subject and material, "overall score" = ("score of question 1" + "score of question 2" + "score of question 3")/3.
   - If some subjects did not answer all 3 questions, "overall score" = "sum of the scores of the answered questions"/"number of answered questions".
2. Calculate the mean and standard error of the overall scores for each of the AI safety materials.
3. Repeat the p-value calculation I illustrated above for the pairs of AI safety materials (best, 2nd best), (2nd best, 3rd best), ..., (2nd worst, worst), or just analyse all possible pairs.
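The three steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not a vetted analysis: the material names and answers are made-up toy data (None marks an unanswered question), and the helper names are hypothetical. Like the earlier Steinhardt-vs-Gates calculation, it treats the two materials' scores as independent samples with df = n_1 + n_2 - 2; since the same researchers read every piece, a paired test on per-subject differences is a variant worth considering.

```python
import math
from statistics import mean, stdev


def overall_scores(answers):
    """Step 1: per-subject mean of the answered questions (None = unanswered)."""
    scores = []
    for triple in answers:
        answered = [q for q in triple if q is not None]
        if answered:  # skip subjects who answered nothing
            scores.append(sum(answered) / len(answered))
    return scores


def summarize(scores):
    """Step 2: sample mean, standard error of the mean, and sample size."""
    return mean(scores), stdev(scores) / math.sqrt(len(scores)), len(scores)


def t_pdf(x, df):
    """Student's t density with `df` degrees of freedom."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef) * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """P(|T| >= |t|) via Simpson's rule on [0, |t|] (steps must be even)."""
    h = abs(t) / steps
    total = t_pdf(0.0, df) + t_pdf(abs(t), df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * total * h / 3


def compare(a, b):
    """Step 3: SE-based t-score, df, and two-sided p-value for two materials."""
    (mu_a, se_a, n_a), (mu_b, se_b, n_b) = a, b
    t = (mu_a - mu_b) / math.sqrt(se_a**2 + se_b**2)
    df = n_a + n_b - 2
    return t, df, two_sided_p(t, df)


# Hypothetical toy answers, NOT the study's data: one (q1, q2, q3) per subject.
answers = {
    "A": [(6, 5, 6), (5, None, 5), (7, 6, 6), (4, 5, 5)],
    "B": [(5, 4, 5), (5, 5, 4), (6, 5, None), (4, 4, 4)],
    "C": [(4, 4, 3), (3, 4, 4), (5, 5, 4), (4, 3, 3)],
}
stats = {m: summarize(overall_scores(a)) for m, a in answers.items()}
ranked = sorted(stats, key=lambda m: stats[m][0], reverse=True)
for better, worse in zip(ranked, ranked[1:]):  # (best, 2nd best), ...
    t, df, p = compare(stats[better], stats[worse])
    print(f"{better} vs {worse}: t = {t:.2f}, df = {df}, p = {p:.3f}")
```

To analyse all possible pairs instead of adjacent ones, `itertools.combinations(ranked, 2)` can replace the `zip` in the loop.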

Gavin · Dec 29 2022

I take it the authors weren't anonymised? Not actually that important though.

Vael Gates · Dec 29 2022

The authors were not anonymized, no.

CEvans · Dec 28 2022

Do you plan on doing any research into the cruxes of disagreement with ML researchers?

I realise there is some information on this in the qualitative data you collected (though I admit I haven't read all 60 pages), but it surprises me that this isn't more of a focus. From my incredibly quick scan of the qualitative data (so apologies for any inaccurate conclusions), many of the ML researchers seemed familiar with basic thinking about safety but didn't buy it, for reasons that didn't look fully worked out.

It seems to me that there is a risky presupposition that the arguments made in the papers you used are correct, and that what matters now is framing. Given the proportion of resources EA stakes on AI safety, I think it would be worth trying to understand why people (particularly knowledgeable ML researchers) have a different set of priorities from many in EA. It seems suspicious how little intellectual credit is given to ML/AI people who aren't EAs.

I am curious to hear your thoughts. I really appreciate the research done here and am very much in favour of more rigorous community/field building being done as you have here.

Vael Gates · Dec 29 2022

I'm not going to comment too much here, but if you haven't seen my talk (“Researcher Perceptions of Current and Future AI” (first 48m; skip the Q&A) (Transcript)), I'd recommend it! Specifically, you want the 23m-48m segment of that talk, where I discuss the results of interviewing ~100 researchers about AI safety arguments. We're going to publish much more on this interview data within the next month or so, but the major results, which describe some of AI researchers' cruxes, are already there.

[anonymous] · Dec 28 2022

To me, given the proportion of resources EA stakes on AI safety, it would be worth trying to understand why people (particularly knowledgeable ML researchers) have a different set of priorities to many in EA. It seems suspicious how little intellectual credit that ML/AI people who aren't EA are given.

I don't see this as suspicious, because I suspect different goals are driving EAs compared to AI researchers. I'm not surprised by the fact that they disagree, since even if AI risk is high, if you have a selfish worldview, it's probably still rational to work on AI research.

Effective Altruism Forum