
David_Althaus

1556 karma

Comments (68)

Thanks for writing this sequence; it makes lots of great points and matches my experience.

I think there is evidence in mainstream psychology for your recommendation of replacing fear with more excitement-based motivation. They use the terms "avoidance motivation" and "approach motivation", but my sense is that these map onto fear and excitement very well. It appears that (excessive) avoidance motivation reduces well-being and performance, at least in the long term (e.g., Roskes et al., 2014).

ETA: The above reference is not good. Scholer et al. (2019) provide a more nuanced and comprehensive overview of the literature and also discuss when (and in which contexts) avoidance motivation can be useful or even superior.

Thanks for sharing, I thought this was interesting and relatable. 

For what it's worth, you seem like a really committed person to me, so I wouldn't call you lazy (if you're "lazy", how could you work 50-hour weeks and perform well in the military?). In some cheeky sense, you might have benefited from being lazier and "giving up" sooner, rather than pushing yourself for years to make it work, always hoping that change was around the corner.

In my early twenties I also tried to study computer science and programming for similar reasons (AI safety research, EtG potential). I think I basically gave up after like 1-2 weeks because I did not like it. In some sense, you could say that my own laziness saved me from making the potentially huge mistake of pursuing something for a few years and then burning out/getting stuck in the sunk cost fallacy, etc.

Though that's usually not how I view it. Over the years I've often blamed myself for being a lazy quitter and that I should have tried harder back then to study CS. Otoh, stories like yours are (weak) evidence that it probably wouldn't have ended well and that I should be glad to have continued to study where my personal fit was higher even though it was (way) less impactful. 

Anyways, enough rambling about myself. In my book, you tried really hard to have impact and showed real courage in sharing your story. I think you're cool. :)



 

That's certainly possible, though I think it's more likely that the public will become more and more concerned as more and more powerful AIs are deployed.

Looks to me like Yudkowsky was wrong and there was a fire alarm. (To be fair, if you had asked me in 2017 there is no way I'd have predicted that AI risk is as mainstream as it is now.)

Really great post, agree with almost everything, thanks for writing!

(More speculatively, it seems plausible to me that many EAs have worse judgement of character than average, because e.g. they project their good intentions onto others.)

Agreed. Another plausible reason is that system 1 / gut instincts play an important role in character judgment, but many EAs dismiss their system 1 intuitions more, or experience them less strongly, than the average human. This is partly due to selection effects (EA appeals more to analytical people) but perhaps also because several EA principles emphasize putting more weight on reflective, analytical reasoning than on instincts and emotions (e.g., the heuristics and biases literature, the fact that several top cause areas (like AI) aren't intuitive at all, and so on).[1]

That's at least what I experienced first hand when interacting with a dangerous EA several years ago. I met a few people who had negative impressions of this person's character but couldn't really back them up with any concrete evidence or reasoning, and this EA continued to successfully deceive me for more than a year.[2] Personally, I didn't have a negative impression in the first place (partly because the concept of a non-trustworthy EA was completely out of my hypothesis space back then) so other people were clearly able to pick up on something that I couldn't. 

  1. ^

    To be clear, I'm not saying that reflective reasoning is bad (it's awesome) or that we now should all trust our gut instincts when it comes to character judgment. Gut instincts are clearly fallible. The average human certainly isn't amazing at character judgment; for example, ~50% of US Americans have voted for clearly dangerous people like Trump.

  2. ^

    FWIW, my experiences with this person were a major inspiration for this post.

Selecting RLHF human raters for desirable traits?

Epistemic status: I wrote this quickly (for my standards) and I have ~zero expertise in this domain.

Introduction

It seems plausible that language models such as GPT-3 inherit (however haphazardly) some of the traits, beliefs, and value judgments of human raters doing RLHF. For example, Perez et al. (2022) find that models trained via RLHF are more prone to make statements corresponding to Big Five agreeableness than models not trained via RLHF. This is presumably (in part) because human raters gave positive ratings to any behavior exhibiting such traits.

Given this, it seems plausible that selecting RLHF raters for more desirable traits—e.g., low malevolence, epistemic virtues / truth-seeking, or altruism—would result in LLMs instantiating more of these characteristics. (In a later section, I will discuss which traits seem most promising to me and how to measure them.)

It’s already best practice to give human RLHF raters reasonably long training instructions and have them undergo some form of selection process. For example, for InstructGPT, the instruction manual was 17 pages long and raters were selected based on their performance in a trial which involved things like ability to identify sensitive speech (Ouyang et al., 2022, Appendix B). So adding an additional (brief) screening for these traits wouldn’t be that costly or unusual.
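To make this concrete, a trait screen could be layered on top of the existing trial-based selection. The sketch below is a toy illustration under assumptions of mine: the field names, the scales (a Short Dark Tetrad mean and an actively open-minded thinking mean), and all cutoffs are hypothetical, not any AI company's actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class RaterApplicant:
    name: str
    trial_score: float  # performance on the existing trial task (0-1)
    sd4_score: float    # hypothetical Short Dark Tetrad mean (1-5, lower is better)
    aot_score: float    # hypothetical actively open-minded thinking mean (1-5, higher is better)

# Illustrative cutoffs; real thresholds would need empirical validation.
TRIAL_MIN, SD4_MAX, AOT_MIN = 0.75, 2.5, 3.5

def passes_screen(a: RaterApplicant) -> bool:
    """Combine the existing trial-task screen with the proposed trait screen."""
    return (a.trial_score >= TRIAL_MIN
            and a.sd4_score <= SD4_MAX
            and a.aot_score >= AOT_MIN)

applicants = [
    RaterApplicant("A", trial_score=0.9, sd4_score=1.8, aot_score=4.2),
    RaterApplicant("B", trial_score=0.9, sd4_score=3.9, aot_score=4.0),  # high dark tetrad score
    RaterApplicant("C", trial_score=0.6, sd4_score=1.5, aot_score=4.5),  # fails existing trial
]
accepted = [a.name for a in applicants if passes_screen(a)]
print(accepted)  # -> ['A']
```

The point of the sketch is only that the trait screen composes cheaply with whatever trial-based filter already exists; it adds one short questionnaire per applicant, not a new pipeline.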

Clarification

Talking about stable traits or dispositions of LLMs is inaccurate. Given different prompts, LLMs simulate wildly different characters with different traits. So the concept of inheriting dispositions from human RLHF raters is misleading. 

We might reformulate the path to impact as follows: If we train LLMs with RLHF raters with traits X, then a (slightly) larger fraction of characters or simulacra that LLMs tend to simulate will exhibit the traits X. This increases the probability that the eventual character(s) that transformative AIs will “collapse on” (if this ever happens) will have traits X.

Open questions

I don’t know how the RLHF process works in detail. For example, i) to what extent is the behavior of individual RLHF raters double-checked or scrutinized, either by AI company employees or other RLHF raters, after the initial trial period is over, and ii) do RLHF raters know when the trial period has ended? In the worst case, trolls could behave well during the initial trial period but then, e.g., deliberately reward offensive or harmful LLM behavior for the lulz. 

Fortunately, I expect that at most a few percent of people would behave like this. Is this enough to meaningfully affect the behavior of LLMs?

Generally, it could be interesting to do more research on whether and to what extent the traits and beliefs of RLHF raters influence the type of feedback they give. For example, it would be good to know whether RLHF raters that score highly on some dark triad measure in fact systematically reward more malevolent LLM behavior.
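As a toy version of the kind of analysis suggested above, one could check whether raters' dark-triad scores correlate with how often they reward harmful completions. Everything below is fabricated for illustration: the data are made up, and a real study would need proper sample sizes, significance testing, and controls.

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation coefficient, no external dependencies."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-rater data: dark-triad score and the fraction of
# harmful completions that rater rewarded (both invented for illustration).
dark_triad = [1.2, 1.8, 2.1, 2.9, 3.4, 4.0]
harmful_reward_rate = [0.01, 0.02, 0.02, 0.05, 0.06, 0.09]

r = pearson(dark_triad, harmful_reward_rate)
print(f"r = {r:.2f}")
```

If such a correlation held up in real rating data, that would directly support screening; if it didn't, the whole proposal would matter much less.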

Which traits precisely should we screen RLHF raters for? I make some suggestions in a section below.

Positive impact, useless, or negative impact?

Why this might be positive impact

  • Pushing for adopting such selection processes now increases the probability that they will be used when training truly transformative AI. Arguably, whether or not current-day LLMs exhibit desirable traits doesn’t really matter all that much. However, if we convince AI companies to adopt such selection processes now, this will plausibly increase the probability that they will continue to use these selection processes (if only because of organizational inertia) once they train truly transformative AIs. If we wait to do so six months before the singularity, AI companies might be too busy to adopt such practices. 
    • Of course, the training setup and architecture of future transformative AIs might be totally different. But they might also be at least somewhat similar. 
  • If (transformative) AIs really inherit, even if in a haphazard fashion, the traits and beliefs of RLHF raters, then this increases the expected value of the long-term future as long as RLHF raters are selected for desirable traits. For example, it seems fairly clear that transformative AIs with malevolent traits increase s-risks and x-risks. 
    • This is probably especially valuable if we fail at aligning AIs. That is, if we successfully align our AIs, the idiosyncratic traits of RLHF raters won’t make a difference because the values of the AI are fully aligned with the human principals anyways. But unaligned AIs might differ a lot in their values. For example, an unaligned AI with some sadistic traits will create more expected disvalue than an unaligned AI that just wants to create paper clips.
  • It might already be valuable to endow non-transformative, present-day AIs with more desirable traits. For example, having more truthful present-day AI assistants seems beneficial for various reasons, such as a more informed populace, more truth-tracking and nuanced political discourse, and increased cooperation and trust. Ultimately, truthful AI assistants would also help us with AI alignment. For much more detail, see Evans et al. (2021, chapter 3).

Why this is probably not that impactful

  • This doesn’t solve any problems related to inner alignment or mesa optimization. (In fact, it might increase risks related to deceptive alignment but more on this below.)
  • Generally, it’s not clear that the dispositions or preferences of AIs will correspond in some predictable way to the kind of human feedback they received. It seems clear that current AIs will inherit some of the traits, views, and values of human RLHF raters, at least on distribution. However, as the CoinRun example showcases, it’s difficult to know what values an AI is actually learning as a result of our training. That is, off-distribution behavior might be radically different from what we expect.
  • There will probably be many RLHF raters. Many of the more problematic traits, such as psychopathy or sadism, seem relatively rare, so they wouldn’t have much of an influence anyway.
  • People won’t just give feedback based on what appeals to their idiosyncratic traits or beliefs; they are given detailed instructions on what to reward. This means that working on the instructions RLHF raters receive is probably more important. However, as mentioned above, malevolent RLHF raters or “trolls” might deliberately do the opposite of what they are instructed to do and reward, e.g., sadistic or psychopathic behavior. Also, instructions cannot cover every possible case, so in unclear situations the idiosyncratic traits and beliefs of human RLHF raters might make a (tiny) difference.
  • The values AGIs learn during training might change later as they reflect more and resolve internal conflicts. This process might be chaotic and thus reduces the expected magnitude of any intervention focused on instilling particular values right now. 
  • Generally, what matters are not the current LLMs but the eventual transformative AIs. These AIs might have completely different architectures or training setups than current systems.

Why this might be negative impact

  • RLHF might actually be net negative, and selecting for desirable traits in RLHF raters (insofar as it has an effect at all) might exacerbate these negative effects. For instance, Oliver Habryka argues: “In most worlds RLHF, especially if widely distributed and used, seems to make the world a bunch worse from a safety perspective (by making unaligned systems appear aligned at lower capabilities levels, meaning people are less likely to take alignment problems seriously, and by leading to new products that will cause lots of money to go into AI research, as well as giving a strong incentive towards deception at higher capability levels)”. For example, the fact that Bing Chat was blatantly misaligned was arguably positive because it led more people to take AI risks seriously. 
    • On the other hand, Paul Christiano addresses (some of) these arguments here and overall believes that RLHF has been net positive.
  • In general, this whole proposal is not an intervention that makes substantial, direct progress on the central parts of the alignment problem. Thus, it might just distract from the actually important and difficult parts of the problem. It might even be used as some form of safety washing.
  • Another worry is that pushing for selection processes will mutate into selecting traits we don’t particularly care about. For instance, OpenAI seems primarily concerned with issues that are important to the political left.[1] So maybe pitching OpenAI (or other AI companies) the idea of selecting RLHF raters according to desirable traits will mostly result in a selection process that upholds a long list of “woke” constraints, which, in some instances, might conflict with other desirable traits such as truthfulness. However, it might still be net positive. 

Which traits and how?

I list a few suggestions for traits we might want to select for below. All of the traits I list arguably have the following characteristics: 

  • i) plausibly affects existential or suffering risks if present in transformative AIs;
  • ii) is beneficial (or at least not harmful) for the long-term future when exhibited by AI assistants;
  • iii) is uncontroversially viewed as (un)desirable;
  • iv) is (reliably and briefly) measurable in humans. 
    • If we can’t reliably measure a trait in humans, we obviously cannot select for it. 
    • The shorter the measures, the cheaper they are to employ, and the easier it is to convince AI companies to use them.

Ideally, any trait which we want to include in an RLHF rater selection process should have these characteristics. The reasons for these criteria are fairly obvious, but I briefly elaborate on them in this footnote.[2]

This isn’t a definitive or exhaustive list by any means. In fact, which traits to select for, and how to measure them (perhaps even developing novel measurements) could arguably be a research area for psychologists or other social scientists. 

Dark tetrad traits / malevolence

One common operationalization of malevolence is the dark tetrad, comprising Machiavellianism, narcissism, psychopathy, and sadism. I have previously written on the nature of dark tetrad traits and the substantial risks they pose. It seems obvious that we don’t want any AIs to exhibit these traits. 

Fortunately, these traits have been studied extensively by psychologists. Consequently, brief and reliable measures of these traits exist, e.g., the Short Dark Tetrad (Paulhus et al., 2020) or the Short Dark Triad (Jones & Paulhus, 2014). However, since these are merely self-report scales, it’s unclear how well they work in situations where people know they are being assessed for a job.
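For illustration, scoring such short self-report scales is mechanically simple: average the items after flipping reverse-keyed ones. The items and answers below are hypothetical placeholders, not actual scale content.

```python
def score_scale(responses, reverse_keyed, scale_max=5):
    """Mean-score a Likert scale (1..scale_max), flipping reverse-keyed items.

    `responses` maps item index -> raw answer; `reverse_keyed` lists the
    indices whose wording runs opposite to the measured construct.
    """
    total = 0
    for item, raw in responses.items():
        total += (scale_max + 1 - raw) if item in reverse_keyed else raw
    return total / len(responses)

# Hypothetical 4-item subscale; item 3 is reverse-keyed.
answers = {0: 2, 1: 1, 2: 2, 3: 5}  # raw 1-5 Likert answers
print(score_scale(answers, reverse_keyed={3}))  # -> 1.5
```

The harder problem, as noted above, is not computing the score but the validity of self-reports when applicants know a job depends on the answer.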

Truthfulness and epistemic virtues

(I outlined some of the benefits of truthfulness above, in the third bullet point of this section.)

It’s not easy to measure how truthful humans are, especially in assessment situations.[3] Fortunately, there exist reliable measures for some epistemic virtues that correlate with truthfulness, for example, the argument evaluation test (Stanovich & West, 1997) or the actively open-minded thinking scale (e.g., Baron, 2019). See also Stanovich and West (1998) for a classic overview of various measures of epistemic rationality.

Still, none of these measures are all that great. For example, some of these measures, especially the AOT scale, have strong ceiling effects. Developing more powerful measures would be useful.

Pragmatic operationalization: forecasting ability

One possibility would be to select for human raters above some acceptable threshold of forecasting ability as forecasting skills correlate with epistemic virtues. The problem is that very few people have a public forecasting track record and measuring people’s forecasting ability is a lengthy and costly process. 

Cooperativeness, harm aversion, altruism

In some sense, altruism or benevolence is just the opposite of malevolence[4], so perhaps we could just use one or the other. HEXACO honesty-humility (e.g., Ashton et al., 2014) is one very well-studied measure of benevolence. Alternatives include the self-report altruism scale (Rushton et al., 1981) or behavior in economic games such as the dictator game.

Cooperativeness, however, is a somewhat distinct construct. Others have written about the benefits of making AIs more cooperative in this sense. One measure of cooperativeness is the cooperative personality scale by Lu et al. (2013).

Harm aversion could also be desirable because it might translate into (some form of) low-impact AIs. On the other hand, (excessive) instrumental harm aversion can come into conflict with consequentialist principles.

Other traits

As mentioned above, this is by no means an exhaustive list. There are many other traits which could be desirable, such as empathy, tolerance, helpfulness, fairness, intelligence, effectiveness-focus, compassion, or wisdom. Other possibly undesirable traits include spite, tribalism, partisanship, vengefulness, or (excessive) retributivism.

References

Ashton, M. C., Lee, K., & De Vries, R. E. (2014). The HEXACO Honesty-Humility, Agreeableness, and Emotionality factors: A review of research and theory. Personality and Social Psychology Review, 18(2), 139-152.

Baron, J. (2019). Actively open-minded thinking in politics. Cognition, 188, 8-18.

Evans, O., Cotton-Barratt, O., Finnveden, L., Bales, A., Balwit, A., Wills, P., ... & Saunders, W. (2021). Truthful AI: Developing and governing AI that does not lie. arXiv preprint arXiv:2110.06674.

Forsyth, L., Anglim, J., March, E., & Bilobrk, B. (2021). Dark Tetrad personality traits and the propensity to lie across multiple contexts. Personality and Individual Differences, 177, 110792.

Lee, K., & Ashton, M. C. (2014). The dark triad, the big five, and the HEXACO model. Personality and Individual Differences, 67, 2-5.

Lu, S., Au, W. T., Jiang, F., Xie, X., & Yam, P. (2013). Cooperativeness and competitiveness as two distinct constructs: Validating the Cooperative and Competitive Personality Scale in a social dilemma context. International Journal of Psychology, 48(6), 1135-1147.

Perez, E., Ringer, S., Lukošiūtė, K., Nguyen, K., Chen, E., Heiner, S., ... & Kaplan, J. (2022). Discovering Language Model Behaviors with Model-Written Evaluations. arXiv preprint arXiv:2212.09251.

Rushton, J. P., Chrisjohn, R. D., & Fekken, G. C. (1981). The altruistic personality and the self-report altruism scale. Personality and Individual Differences, 2(4), 293-302.

Stanovich, K. E., & West, R. F. (1997). Reasoning independently of prior belief and individual differences in actively open-minded thinking. Journal of Educational Psychology, 89(2), 342.

Stanovich, K. E., & West, R. F. (1998). Individual differences in rational thought. Journal of Experimental Psychology: General, 127(2), 161.

  1. ^

    Though, to be fair, this snapshot of the instruction guidelines actually seems fair and balanced.

  2. ^

    i) matters because otherwise the trait is not very consequential; ii) is obvious; iii) is more or less necessary because AI companies would otherwise refuse to select for these traits, whether out of disagreement or fear of public backlash; iv) is required because if we can’t reliably measure a trait in humans, we obviously cannot select for it. The shorter the measures, the cheaper they are to employ, and the easier it is to convince AI companies to use them.

  3. ^

    Though dark tetrad traits correlate with a propensity to lie (Forsyth et al., 2021).

  4. ^

    For instance, HEXACO honesty-humility correlates highly negatively with dark triad traits (e.g., Lee & Ashton, 2014).

ETA July:

I regret posting my comments for several reasons. I'm sorry to anyone I upset.

Specifically, I regret not putting more effort into ensuring that my first comment would not be misinterpreted, and into putting things into context. For example, “putting SBF on a pedestal”, if it means something like “holding SBF up as a role model”, was, in the vast majority of instances, reasonable and understandable at the time (certainly for those who didn’t know him well in person!), and I could easily have done the same. (Some things, like tying EA’s reputation so closely to SBF, were perhaps not super wise, but much of this is probably hindsight bias.)

I feel also bad about mentioning the flyers, mostly because I got important information wrong which is a grave mistake in such a situation, and partly because my phrasing was too harsh/critical (if the person who created those flyers is ever reading this, I'm sorry, you had good intentions and it wasn't a big deal at all!).

I wrote the comment because I was disconcerted by the original comment (and its initial high karma count), which seemed to seriously question whether we “elevated [SBF] as a moral paragon and someone to emulate”, or “tied EA's reputation closely to his”, and asked for specific examples. I still feel like it’s a no-brainer that EAs, in general, obviously and understandably elevated SBF as a role model and someone worth emulating! Seriously questioning this still strikes me as defensive motivated reasoning and concerning. Many people seemed to agree with the commenter, which made me worry that EAs would refuse to learn any lessons from this whole scandal, risking a repeat of a similar monumental catastrophe. (Me being upset[1] is not meant to be an excuse but, if anything, a further reason why I should have written this comment differently. As a general rule, it's simply bad to write comments when one is upset because it clouds one's judgment and reduces compassion, and this is an obvious mistake which I should not have made.)

So why did the FTX scandal happen? One simplistic perspective is (ignoring many other, more important causal factors!): someone with dark triad/malevolent traits got more and more power and ended up doing something extremely bad. Other people did not realize that this person was malevolent, or suspected it and didn’t speak up (e.g., because of fear, miscalculation, motivated reasoning, or opportunism). I’ve seen that story before and it really shaped my outlook on life. 

That’s why I wanted to make the following argument: let’s not put too much faith in the character judgment of the people who championed SBF (and have known him very well) going forward, to make sure that something like this doesn’t happen again. Let’s not be like “oh well, there is nothing we can learn from this, no need to change anything”. That does seem like a very important point to me and I stand by that. 

Now, importantly, I wasn’t trying to imply that any EA leader knew about the fraud or did something illegal. I also wasn’t trying to imply that mistakes like ‘suboptimal character judgment’ are even remotely comparable to the mistakes that SBF made. Of course, it’s not even close. In some sense, it’s a minor mistake that probably more than 95% of people would have made (because lots of things would have to come together to not make such mistakes). In fact, in my experience, many amazing people don’t have great character judgment. 

But on the other hand, it’s still substantial and worth keeping in mind and should be factored in when making, e.g., board decisions (as board members appoint executive directors and those should have good character) or when trusting these people’s character judgment in the future. (Also, I feel like some comments seemed to suggest that being naive and overly trusting is just cute but not worth worrying about which I don’t agree with.)

As I also wrote in the original comment, I’m not even sure that EA leaders, including Will, made any mistakes ex ante given the enormous uncertainty and complexity of the whole situation and all the important trade-offs involved. I do think that it’s plausible though that some mistakes were made, including significant ones. 

I also regret having singled out Will, and I’m sorry if this comment upset anyone. I worry that others may have interpreted my comment as trying to put all the blame on him, which I really didn’t want to. I did it because, to my knowledge, Will was the EA leader who championed SBF the most and had the closest personal connection to him (aside from people like Caroline, etc., of course). And generally, I think it’s valuable to give specific examples when possible. It’s important to note that many others were involved in this too and could have stepped in! 

To be perfectly clear, I think EA leaders, including Will, have done tremendous good and worked very hard to make the world a better place. I don’t want to belittle their extraordinary contributions.

Last, I worry that my comment was interpreted as taking the side of EA critics which is not the case. I think that much criticism of EA and EA leaders in the media has been unfair and exaggerated.

There is more I could write about all of this but this issue is emotionally taxing and I already spent several days on this comment, and I’m trying to move on. (Several days for just writing this crappy comment? Yeah, most of this was just feeling guilty without being able to do anything else productive. This ties to the general issue of how much time to put into comments. FWIW, in the months before writing the comments in March, I was actively challenging myself to write comments more quickly (and often). In hindsight, this could have been a mistake since I may lack the necessary verbal intelligence to pull this off.)

--------------------------------------------------------------------
[Original comment.]

Thanks, these are good points. 

I do think it's plausible that (some!) EA leaders made substantial mistakes. Spotting questionable behavior or character is hard but not impossible, especially if you have known them for 10 years and work very closely with them and basically were in a mentee-mentor relationship (like e.g. Will, is my impression). I don't fault other people, e.g. those who rarely or never interacted with SBF, for not having done more.

Either people ignored warning signs -> a clear mistake. Or they didn't notice anything even though others had noticed signs (like, e.g., Habryka) -> suboptimal character judgment. I think the ability to spot such people and not let them into positions of power is extremely important.

Of course, the crucial question is what could have been done even if you knew with 100% certainty that SBF was not at all trustworthy. It's plausible to me that not much could have been done because SBF had already accumulated so much power. So it's plausible that no one made substantial mistakes. On the other hand, no one forced Will to write Musk and vouch for SBF, which perhaps wasn't wise if you have concerns about SBF. Then again, it's perhaps also reasonable to gamble on SBF given the inevitable uncertainty about others' character and the large possible upsides. Perhaps I'm just suffering from hindsight bias.

Also, just to be clear, I agree that much of the criticism against EAs and EA leaders we see in the media is unfairly exaggerated. I'm wary of contributing to what I perceive as others unjustly piling on a movement of moral activists, probably fueled by do-gooder derogation, and so on (as Geoffrey mentions in his comment). 

  1. ^

    Why have I been so upset? The usual. The ideals of EA are very close to my heart so it made me very sad to see so many people (outside of EA) hate on EA ideals and to ridicule so many important values and concepts. That's a terrible sign for the long-term trajectory of humanity and it has reduced the global level of good-will, cooperation, and trust. It made many people more cynical about the very ideas of altruism and truth-seeking itself.

(ETA: Sorry for not engaging with everything you wrote. I'm short on time and I'll try to elaborate on my views in a week or so.)

Just to clarify my position: I think it's clear that we put SBF on a pedestal and promoted him as someone worth emulating; I don't really know what to say to someone who disagrees with this. (Perhaps you interpret the phrase "put someone on a pedestal" differently; yes, we didn't build statues of SBF, I agree.) 

But I also think that almost all of this was completely understandable. I mean, the guy makes 10B dollars and wants to donate it all? One would need to be deranged not to try to emulate him, not to want to learn from him, and not to paint him as highly morally praiseworthy. I certainly tried emulating and learning from SBF (with little success, obviously). At the time, I didn't think that we went too far. I even thought the sticker thing was kinda funny (if weird and inadvisable), but I didn't really give it much thought at all at the time.

Ah thanks, I didn't know that! Sorry, could have noticed my confusion here. I edited the above comment.

ETA July: I regret posting the following comment for several reasons, partly because I got crucial information wrong and failed to put things into context and prevent misunderstandings. Please consider reading my longer explanation at the top of my follow-up comment here. I'm sorry to anyone I upset.

------------------------------------------------------------------

At EAG London 2022, they [ETA: this was an individual without consent of the organizers] distributed hundreds of stickers depicting Sam on a bean bag with the text "what would SBF do?". To my knowledge, never before were flyers depicting individual EAs distributed at EAG. (Also, such behavior seems generally unusual to me; like, imagine going to a conference and seeing hundreds of flyers and stickers all depicting one guy. Doesn't that seem a tad culty?)

On the 80k website, they had several articles mentioning SBF as someone highly praiseworthy and worth emulating. 

Will vouched for SBF "very much" when talking to Elon Musk.

Sam was invited to many discussions between EA leaders. 

There are probably more examples.

Generally, almost everyone was talking about how great Sam is and how much good he has achieved and how, as a good EA, one should try to be more like him.  
