tobycrisford

Personally, I find points 4, 5, and 6 really unconvincing. Are there any stronger arguments for these, that don't consist of pointing to a weird example and then appealing to the intuition that "it would be weird if this thing was conscious"?

My own intuition is that all of these examples would be conscious. That makes the arguments unconvincing to me, but also hard to argue against!

But overall I get that given the uncertainty around what consciousness is, it might be a good idea to use implementation considerations to hedge our bets. This is a nice post.

Is the 10% Giving What We Can Pledge Core to EA's Reputation?

tobycrisford3y13

I think this is an interesting question, and I don't know the answer.

I think two quite distinct ideas are being conflated in your post though: (i) 'earning to give' and (ii) the GWWC 10% pledge.

These concepts are very different in my head.

'Earning to give': When choosing a career with the aim of doing good, some people should pick a career to maximize their income (perhaps subject to some ethical constraints), and then give a lot of it away to effective causes (probably a lot more than 10%). This idea tells you which jobs you should decide to work in.

GWWC pledge: Pretty much whoever you are, if you've got a decent income in a rich country, you should give 10% of it away to effective causes. This idea says nothing about which jobs you should be working in.

I think these two ideas are very different.

'Earning to give' gets a lot of criticism from people outside EA, but I don't see much criticism of the idea of donating 10% of your income. Sure, you can call the amount arbitrary and dispute the extent to which it is an obligation, but I think even major critics of EA often concede that the 10% pledge is still an admirable thing to do.

All AGI Safety questions welcome (especially basic ones) [April 2023]

tobycrisford3y3

Thank you! This is exactly what I wanted to read!

All AGI Safety questions welcome (especially basic ones) [April 2023]

tobycrisford3y1

Thanks for this reply! That makes sense. Do you know how likely people in the field think it is that AGI will come from just scaling up LLMs vs requiring some big new conceptual breakthrough? I hear people talk about this question but don't have much sense about what the consensus is among the people most concerned about AI safety (if there is a consensus).

All AGI Safety questions welcome (especially basic ones) [April 2023]

tobycrisford3y4

I've seen people already building AI 'agents' using GPT. One crucial component seems to be giving it a scratchpad to have an internal monologue with itself, rather than forcing it to immediately give you an answer.

If the path to agent-like AI ends up emerging from this kind of approach, wouldn't that make AI safety really easy? We can just read their minds and check what their intentions are?

Holden Karnofsky talks about 'digital neuroscience' being a promising approach to AI safety, where we figure out how to read the minds of AI agents. And for current GPT agents, it seems completely trivial to do that: you can literally just read their internal monologue in English and see exactly what they're planning!

I'm sure there are lots of good reasons not to get too hopeful based on this early property of AI agents, although for some of the immediate objections I can think of I can also think of responses. I'd be interested to read a discussion of what the implications of current GPT 'agents' are for AI safety prospects.

A few reasons I can think of for not being too hopeful, and my thoughts:

Maybe AGI will look more like the opaque ChatGPT mode of working than the more transparent GPT 'agent' mode. (Maybe this is true, although ChatGPT mode seems to have some serious blind spots that come from its lack of a working memory. E.g. if I give it 2 sentences and just ask it which sentence has more words in it, it usually gets it wrong. But if I ask it to write the words in each sentence out in a numbered list first, thereby giving it permission to use the output box to do its working, then it gets it right. It makes intuitive sense to me that agent-like GPTs with a scratchpad would perform much better at general tasks and would be what superhuman AIs would look like).
Maybe future language model agents will not write their internal monologue in English, but use some more incomprehensible compressed format instead. Or they will generate so much internal monologue that it will be really hard to check it all. (Maybe. It seems pretty likely that they wouldn't use normal English. But it also feels likely that decoding this format and automatically checking for harmful intentions wouldn't be too hard i.e. easily doable with current natural language processing technology. As long as it's easier to read thoughts than to generate thoughts, it seems like we'd still have a lot of reason to be optimistic about AI safety).
Maybe the nefarious intentions of the AI will hide in the opaque neural weights of the language model, rather than in the transparent internal monologue of the agent. (This feels unlikely to me, for similar reasons to why the first bullet point feels unlikely. It feels like complex planning of the kind AI safety people worry about is going to require a scratchpad and an iterative thought process, not a single pass through a memoryless neural network. If I think about myself, a lot of the things my brain does are opaque, not just to outsiders, but to me too! I might not know why a particular thought pops into my head at a particular moment, and I certainly don't know how I resolve separate objects from the image that my eyes create. But if you ask me at a high level what I've been thinking about in the last 5 minutes, I can probably explain it pretty well. This part of my thinking is internally transparent. And I think it's these kinds of thoughts that a potential adversary might actually be interested in reading, if they could. Maybe the same will be true of AI? It seems likely to me that the interesting parts will still be internally transparent. And maybe for an AI, the internally transparent parts will also be externally transparent? Or at least, much easier to decipher than they are to create, which should be all that matters.)

A final thought/concern/question: if 'digital neuroscience' did turn out to be really easy, I'd be much less concerned about the welfare of humans, and I'd start to be a lot more concerned about the welfare of the AIs themselves. It would make them very easily exploitable, and if they were sentient as well then it seems like there's a lot of scope for some pretty horrific abuses here. Is this a legitimate concern?

Sorry this is such a long comment, I almost wrote this up as a forum post. But these are very uninformed naive musings that I'm just looking for some pointers on, so when I saw this pinned post I thought I should probably put it here instead! I'd be keen to read comments from anyone who's got more informed thoughts on this!

Casting the Decisive Vote

tobycrisford3y1

I really like this argument. I think there's another way of framing it that occurred to me when reading it, that I also found insightful (though it may already be obvious):

Suppose the value of your candidate winning is X, and their probability of winning if you don't do anything is p.
If you could buy all the votes, you would pay X(1-p) to do so (value of your candidate winning minus a correction because they could have won anyway). This works out at X(1-p)/N per vote on average.
If p>1/2, then buying votes probably has diminishing returns (certainly this is implied by the unimodal assumption).
Therefore, if p>1/2, the amount you would pay for a single vote must be bounded below by X(1-p)/N.
If p<1/2, I think you can just suppose that you are in a zero-sum game with the opposition party(ies), and take their perspective instead to get the same bound reflected about p=1/2.

The lower bound this gives seems less strict (X/(2N) in the case that p=1/2, instead of X/N), but it helps me understand intuitively why the answer has to come out this way, and why the value of contributing to voting is directly analogous to the value of contributing to, say, Parfit's water tank for the injured soldiers, even though there are no probabilities involved there.
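As a rough illustration of the bound described above, here is a minimal Python sketch. The function name and the example numbers are hypothetical, and it assumes the zero-sum framing from the last step, so that for p<1/2 the bound is obtained by taking the opposition's perspective, which reflects it about p=1/2.

```python
def vote_value_lower_bound(X: float, p: float, N: int) -> float:
    """Lower bound on the value of a single vote.

    X: value of your candidate winning (assumed equal in size for the
       opposition under the zero-sum assumption).
    p: probability your candidate wins if you do nothing.
    N: number of votes.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if N <= 0:
        raise ValueError("N must be positive")
    # For p >= 1/2 the bound is X(1-p)/N. For p < 1/2, take the
    # opposition's view: their win probability is 1-p, giving X*p/N.
    q = p if p >= 0.5 else 1.0 - p
    return X * (1.0 - q) / N


# Hypothetical numbers: X = 1,000,000, an even race, 1,000 votes.
print(vote_value_lower_bound(1_000_000, 0.5, 1000))
```

At p=1/2 this gives X/(2N), and the bound shrinks symmetrically as p moves toward 0 or 1.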

If as a group you do something with value O(1), then the value of individual contributions should usually be O(1/N), since value (even in expectation) is additive.

Cooperative or Competitive Altruism, and Antisocial Counterfactuals

tobycrisford3y1

Point taken, although I think this is analogous to saying: Counterfactual analysis will not leave us predictably worse off if we get the probabilities of others deciding to contribute right.

Cooperative or Competitive Altruism, and Antisocial Counterfactuals

tobycrisford3y5

Thank you for this correction, I think you're right! I had misunderstood how to apply Shapley values here, and I appreciate you taking the time to work through this in detail.

If I understand correctly now, the right way to apply Shapley values to this problem (with X=8, Y=2) is not to work with N (the number of players who end up contributing, which is unknown), but instead to work with N', the number of 'live' players who could contribute (known with certainty here, not something you can select), and then:

N'=3, the number of 'live' players who are deciding whether to contribute.
With N'=3, the Shapley value of the coordination is 1/3 for each player (expected value of 1 split between 3 people), which is positive.
A positive Shapley value means that all players decide to contribute (if basing their decisions off Shapley values as advocated in this post), and you then end up with N=3.
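The 1/3 figure can be checked with a short Python sketch. It assumes the setup from my original comment in this thread: each contributing player independently succeeds with probability 1/2, success is worth X, and each contributor pays cost Y. The function names are my own, and this is only a brute-force check over orderings, not a general Shapley value library.

```python
from fractions import Fraction
from itertools import permutations


def coalition_value(k: int, X: int, Y: int) -> Fraction:
    """Expected value when k players contribute: (1 - (1/2)^k) X - k Y."""
    return (1 - Fraction(1, 2) ** k) * X - k * Y


def shapley_values(n_players: int, X: int, Y: int) -> list:
    """Average each player's marginal contribution over all join orders."""
    totals = [Fraction(0)] * n_players
    orderings = list(permutations(range(n_players)))
    for order in orderings:
        for position, player in enumerate(order):
            # Marginal value of being the (position + 1)-th contributor.
            marginal = coalition_value(position + 1, X, Y) - coalition_value(position, X, Y)
            totals[player] += marginal
    return [total / len(orderings) for total in totals]


print(shapley_values(3, 8, 2))  # each of the 3 live players gets 1/3
```

With X=8 and Y=2 the marginal contributions of the first, second and third contributor are 2, 0 and -1, which average to 1/3.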

Have I understood the Shapley value approach correctly? If so, I think my final conclusion still stands (even if for the wrong reasons): a Shapley value analysis will lead to a sub-optimal N (number of players deciding to participate), since the optimal N here is 2 (or 1, which has the same value).

As for whether the framing of the problem makes sense, with N as something we can select, the point I was making was that in a lot of real-world situations, N might well be something we can select. If a group of people have the same goals, they can coordinate to choose N, and then you're not really in a game-theory situation at all. (This wasn't a central point of my original comment, but it was the point I was defending in the comment you're responding to.)

Even if you don't all have exactly the same goals, or if there are a lot of actors, it seems like you'll often be able to benefit by communicating and coordinating, and then you'll be able to improve over the approach of everyone deciding independently according to a Shapley value estimate: e.g. GiveWell recommending a funding allocation split between their top charities.

Cooperative or Competitive Altruism, and Antisocial Counterfactuals

tobycrisford3y13

Edit: Vasco Grilo has pointed out a mistake in the final paragraph of this comment (see thread below), as I had misunderstood how to apply Shapley values, although I think the conclusion is not affected.

If the value of success is X, and the cost of each group pursuing the intervention is Y, then ideally we would want to pick N (the number of groups that will pursue the intervention) from the possible values 0,1,2 or 3, so as to maximize:

(1-(1/2)^N) X - N Y

i.e., to maximize expected value.

If all 3 groups have the same goals, they'll all agree on what the best N is. If that N is not 0 or 3, then the best thing for them to do is to get together and decide which of them will pursue the intervention, and which of them won't, in order to get the optimum N. They can base their decision of how to allocate the groups on secondary factors (or by chance if everything else really is equal). If they all have the same goals then there's no game theory here. They'll all be happy with this, and they'll all be maximizing their own individual counterfactual expected value by taking part in this coordination.

This is what I mean by coordination. The fact that their individual approaches are different is irrelevant to them benefiting from this form of coordination.

'Maximize Shapley value' will perform worse than this strategy. For example, suppose X is 8, Y is 2. The optimum value of N for expected value is then 2 (2 groups pursue intervention, 1 doesn't). But using Shapley values, I think you find that whatever N is, the Shapley value of your contribution is always >2. So whatever every other group is doing, each group should decide to take part, and we then end up at N=3, which is sub-optimal.
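For concreteness, the expected value formula above can be tabulated for X=8, Y=2 with a few lines of Python. This is just a sketch of the arithmetic; the function name is my own.

```python
from fractions import Fraction


def expected_value(N: int, X: int, Y: int) -> Fraction:
    """(1 - (1/2)^N) X - N Y: expected value when N groups pursue the intervention."""
    return (1 - Fraction(1, 2) ** N) * X - N * Y


# Tabulate the options N = 0, 1, 2, 3 for X = 8, Y = 2.
values = {N: expected_value(N, 8, 2) for N in range(4)}
best = max(values.values())
optimal_N = [N for N, value in values.items() if value == best]

for N, value in values.items():
    print(N, value)
print("optimal N:", optimal_N)
```

This gives 0, 2, 2 and 1 for N = 0, 1, 2 and 3. So N=1 ties with N=2 at the top, and N=3 is strictly worse than either.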

Cooperative or Competitive Altruism, and Antisocial Counterfactuals

tobycrisford3y11

To arrive at the 12.5% value, you were assuming that you knew with certainty that the other two teams will try to create the vaccine without you (and that they each have a 50% chance of succeeding). And I still think that under that assumption, 12.5% is the correct figure.
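The arithmetic behind that figure can be sketched as follows. This assumes your own team also has a 50% chance of success and that all three teams succeed or fail independently; the function name is hypothetical.

```python
def counterfactual_success_probability(p_own: float, p_others: list) -> float:
    """Probability that your team succeeds AND every other team fails.

    This is the chance that your participation is what makes the
    difference, given that the other teams definitely try.
    """
    prob_all_others_fail = 1.0
    for p in p_others:
        prob_all_others_fail *= 1.0 - p
    return p_own * prob_all_others_fail


# Two other teams, each with a 50% chance, are certain to try.
print(counterfactual_success_probability(0.5, [0.5, 0.5]))  # 0.125, i.e. 12.5%
```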

If I understand your reasoning correctly for why you think this is incoherent, it's because:

If the 3 teams independently arrive at the 12.5% figure, and each use that to decide whether to proceed, then you might end up in a situation where none of them fund it, despite it being clearly worth it overall.

But in making this argument, you've changed the problem. The other 2 teams are now no longer funding the vaccine with certainty, they are also making decisions based on counterfactual cost-benefit. So 12.5% is no longer the right number.

To work out what the new right number is, you have to decide how likely you think it is that the other 2 teams will try to make a vaccine, and that might be tricky. Whatever arguments you think of, you might have to factor in whether the other 2 teams will be thinking similarly. But if you really do all have the same goals, and there's only 3 of you, there's a fairly easy solution here, which is to just talk to each other! As a group you can collectively figure out what set of actions distributed among the 3 of you will maximize the global good, and then just do those. Shapley values don't have to come into it.

It gets more complicated if there are too many actors involved to all get together and figure things out like this, or if you don't all have exactly the same goals, and maybe there is scope for concepts like Shapley values to be useful in those cases. And you might well be right that EA is now often in situations like these.

Maybe we don't disagree much in that case. I just wanted to push back a bit against the way you presented Shapley values here (e.g. as the "indisputably correct way to think about counterfactual value in scenarios with cooperation"). Shapley values are not always the right way to approach these problems. For example, the two thought experiments at the beginning of Parfit's paper I linked above are specific cases where Shapley values would leave you predictably worse off (and all decision theories will have some cases where they leave you predictably worse off).

tobycrisford

Posts 4

Comments 65