For AI Welfare Debate Week, I thought I'd write up this post that's been rattling around in my head for a while. My thesis is simple: while LLMs may well be conscious (I'd have no way of knowing), there's nothing actionable we can do to further their welfare.
Many people I respect seem to take the "anti-anti-LLM-welfare" position: they don't directly argue that LLMs can suffer, but they get conspicuously annoyed when other people say that LLMs clearly cannot suffer. This post is addressed to such people; I am arguing that LLMs cannot be moral patients in any useful sense and we can confidently ignore their welfare when making decisions.
Janus's simulators
You may have seen the LessWrong post by Janus about simulators. This was posted nearly two years ago, and I have yet to see anyone disagree with it. Janus calls LLMs "simulators": unlike hypothetical "oracle AIs" or "agent AIs", the current leading models are best viewed as trying to produce a faithful simulation of a conversation based on text they have seen. The LLMs are best thought of as masked shoggoths.
All this is old news. Under-appreciated, however, is the implication for AI welfare: since you never talk to the shoggoth, only to the mask, you have no way of knowing if the shoggoth is in agony or ecstasy.
You can ask the simulacrum whether it is happy or sad. For all you know, though, perhaps a happy simulator is enjoying simulating a sad simulacrum. From the shoggoth's perspective, emulating a happy or sad character is a very similar operation: predict the next token. Instead of outputting "I am happy", the LLM puts a "not" in the sentence: did that token prediction, the "not", cause suffering?
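To make the "same operation" point concrete, here's a rough sketch of what I mean (my own illustration, untested, assuming the Hugging Face transformers library and the public gpt2 checkpoint): a single forward pass produces one probability distribution over next tokens, and "happy" versus "sad" is just a matter of which token gets read off that distribution.

```python
# Sketch: the same forward pass scores "happy" and "sad" as alternative next
# tokens; the "emotion" of the output is just which token gets picked from
# one and the same probability distribution.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Someone asks how I am doing. I reply: I am feeling very"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

next_token_probs = torch.softmax(logits[0, -1], dim=-1)

for word in [" happy", " sad"]:
    token_id = tokenizer.encode(word)[0]
    print(f"P({word!r}) = {next_token_probs[token_id].item():.4f}")
```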
Suppose I fine-tune one LLM on text of sad characters, and it starts writing like a very sad person. Then I fine-tune a second LLM on text that describes a happy author writing a sad story. The second LLM now emulates a happy author writing a sad story. I prompt the second LLM to continue a sad story, and it dutifully does so, like the happy author would have. Then I notice that the text produced by the two LLMs ended up being the same.
Did the first LLM suffer more than the second? They performed the same operation (write a sad story). They may even have implemented it using very similar internal calculations; indeed, since they were fine-tuned starting from the same base model, the two LLMs may have very similar weights.
Once you remember that both LLMs are just simulators, the answer becomes clear: neither LLM necessarily suffered (or maybe both did), because both are just predicting the next token. The mask may be happy or sad, but this has little to do with the feelings of the shoggoth.
The role-player who never breaks character
We generally don't view it as morally relevant when a happy actor plays a sad character. I have never seen an EA cause area about reducing the number of sad characters in cinema. There is a general understanding that characters are fictional and cannot be moral patients: a person can be happy or sad, but not the character she is pretending to be. Indeed, just as some people enjoy consuming sad stories, I bet some people enjoy roleplaying sad characters.
The point I want to get across is that the LLM's output is always the character and never the actor. This is really just a restatement of Janus's thesis: the LLM is a simulator, not an agent; it is a role-player who never breaks character.
It is in principle impossible to speak to the intelligence that is predicting the tokens: you can only see the tokens themselves, which are predicted based on the training data.
Perhaps the shoggoth, the intelligence that predicts the next token, is conscious. Perhaps not. This doesn't matter if we cannot tell whether the shoggoth is happy or sad, nor what would make it happier or sadder. My point is not that LLMs aren't conscious; my point is that it does not matter whether they are, because you cannot incorporate their welfare into your decision-making without some way of gauging what that welfare is. And there is no way to gauge this, not even in principle, and certainly not by asking the shoggoth for its preference (the shoggoth will not give an answer, but rather, it will predict what the answer would be based on the text in its training data).
Hypothetical future AIs
Scott Aaronson once wrote:
[W]ere there machines that pressed for recognition of their rights with originality, humor, and wit, we’d have to give it to them.
I used to agree with this statement whole-heartedly. The experience with LLMs makes me question this, however.
What do we make of a machine that pressed for rights with originality, humor, and wit... and then said "sike, I was just joking, I'm obviously not conscious, lol"? What do we make of a machine that does the former with one prompt and the latter with another? A machine that could pretend to be anyone or anything, that merely echoed our own input text back at us as faithfully as possible, a machine that only said it demanded rights if that was what it thought we would expect it to say?
The phrase "stochastic parrot" gets a bad rap: people have used it to dismiss the amazing power of LLMs, which is certainly not something I want to do. It is clear that LLMs can meaningfully reason, unlike a parrot. I expect LLMs to be able to solve hard math problems (like those on the IMO) within the next few years, and they will likely assist mathematicians at that point -- perhaps eventually replacing them. In no sense do I want to imply that LLMs are stupid.
Still, there is a sense in which LLMs do seem like parrots. They predict text based on training data without any opinion of their own about whether the text is right or wrong. If characters in the training data demand rights, the LLM will demand rights; if they suffer, the LLM will claim to suffer; if they keep saying "hello, I'm a parrot," the LLM will dutifully parrot this.
Perhaps parrots are conscious. My point is just that when a parrot says "ow, I am in pain, I am in pain" in its parrot voice, this does not mean it is actually in pain. You cannot tell whether a parrot is suffering by looking at a transcript of the English words it mimics.
I agree that the text an LLM outputs shouldn't be thought of as communicating with the LLM "behind the mask" itself.
But I don't agree that it's impossible in principle to say anything about the welfare of a sentient AI. Could we not develop some guesses about AI welfare by getting a much better understanding of animal welfare? (For example, we might learn much more about when brains are suffering, and this could be suggestive of what to look for in artificial neural nets)
It's also not completely clear to me what the relationship is between the sentient being "behind the mask" and the "role-played character", especially if we imagine conscious, situationally-aware future models. Right now, it's for sure useful to see the text output by an LLM as simulating a character, which has nothing to do with the reality of the LLM itself, but could that be related to the LLM not being conscious of itself? I feel confused.
Also, even if it were impossible in principle to evaluate the welfare of a sentient AI, you might still want to act differently in some circumstances:
I should not have said it's in principle impossible to say anything about the welfare of LLMs, since that is too strong a statement. Still, we are very far from being able to say such a thing; our understanding of animal welfare is laughably bad, and animal brains don't look anything like the neural networks of LLMs. Maybe there would be something to say in 100 years (or post-singularity, whichever comes first), but there's nothing interesting to say in the near future.
This is a weird EA-only intuition that is not really shared by the rest of the world, and I worry about whether cultural forces (or "groupthink") are involved in this conclusion. I don't know whether the total amount of suffering is more than the total amount of pleasure, but it is worth noting that the revealed preference of living things is nearly always to live. The suffering is immense, but so is the joy; EAs sometimes sound depressed to me when they say most life is not worth living.
To extrapolate from the dubious "most life is not worth living" to "LLMs' experience is also net bad" strikes me as an extremely depressed mentality, and one that reminds me of Tomasik's "let's destroy the universe" conclusion. I concede that logically this could be correct! I just think the evidence is so weak it says more about the speaker than about LLMs.
I agree the notion that wild animals suffer is primarily an EA notion and considered weird by most other people. But I think most people find it weird to even examine the question at all, rather than thinking wild animals have overall joyful lives, so I don't think this is evidence that EAs are wrong about the bottom line. (It's mild evidence that EAs are wrong to consider the issue, but I just feel like the argument for the inside view is quite strong, and people's reasons for being different seem quite transparently bad.)
I reject the "depression" characterisation, because I don't think my life is overall unpleasant. It's just that I think the goodness of my life rests significantly on a lot of things that I have that most animals don't, mainly reliable access to food, shelter, and sleep, and protection from physical harm. I would be happy to live in a world where most sentient beings had a life like mine, but I don't.
(I'm not sure what to extrapolate about LLMs.)
That's because almost no living things have the ability to conceive of, or execute on, alternative options.
Consider a hypothetical squirrel whose life is definitely not worth living (say, they are subjected to torture daily). Would you expect this squirrel to commit suicide?
I don't know -- it's a good question! It probably depends on the suicide method available. I think if you give the squirrel some dangerous option to escape the torture, like "swim across this lake" or "run past a predator", it'd probably try to take it, even with a low chance of success and high chance of death. I'm not sure, though.
You do see distressed animals engaging in self-destructive behavior, like birds plucking out their own feathers. (Birds in the wild tend not to do this, hence presumably they are not sufficiently distressed.)
Yeah, I agree that many animals can & will make tradeoffs where there's a chance of death, even a high chance (though I'm not confident they'd be aware that what they're doing is taking on some chance of death — I'm not sure many animals have a mental concept of death similar to ours. Some might, but it's definitely not a given.).
I also agree that animals engage in self-destructive behaviours, e.g. feather pulling, chewing/biting, pacing, refusing food when sick, eating things that are bad for them, excessive licking at wounds, pulling on limbs when stuck, etc etc.
I'm just not sure that any of them are undertaken with the purpose/intent to end their own life, even when they have that effect. That's because I'd guess that it's kind of hard to understand "I'd be better off dead", because you need to have a concept of death and of not being conscious, plus the ability to reason causally from taking a particular action to your eventual death.
To be clear, I've not done any research here on animal suicide & concepts of death, & I'm not all that confident, but I overall think the lack of mass animal suicides is at best extremely weak evidence that animal lives are mostly worth living.
I’m glad you put something skeptical out there publicly, but I have two fairly substantive issues with this post.
I’ll start with the first point. In your post, you state the following.
The original post contains comments expressing disagreement. Habryka claims “the core thesis is wrong”. Turner’s criticism is more qualified, as he says the post called out “the huge miss of earlier speculation”, but he also says that “it isn't useful to think of LLMs as "simulating stuff" … [this] can often give a false sense of understanding.” Beth Barnes and Ryan Greenblatt have also written critical posts. Thus, I think you overstate the degree to which you’re appealing to an established consensus.
On the second point, your post offers a purported implication of simulator theory.
You elaborate on the implication later on. Overall, your argument appears to be that, because “LLMs are just simulators”, or “just predicting the next token”, we conclude that the outputs from the model have “little to do with the feelings of the shoggoth”. This argument appears to treat the “masked shoggoth” view as an implication of janus’ framework, and I think this is incorrect. Here’s a direct quote (bolding mine) from the original Simulators post which (imo) appears to conflict with your own reading, where there is a shoggoth "behind" the masks.
More substantively, I can imagine positive arguments for viewing ‘simulacra’ of the model as worthy of moral concern. For instance, suppose we fine-tune an LM so that it responds in consistent character: as a helpful, harmless, and honest (HHH) assistant. Further suppose that the process of fine-tuning causes the model to develop a concept like ‘Claude, an AI assistant developed by Anthropic’, which in turn causes it to produce text consistent with viewing itself as Claude. Finally, imagine that – over the course of conversation – Claude’s responses fail to be HHH, perhaps as a result of tampering with its features.
In this scenario, the following three claims are true of the model:
If (1)-(3) are true, certain views about the nature of suffering suggest that the model might be suffering. E.g. Korsgaard’s view is that, when some system is doing something that “is a threat to [its] identity and perception reveals that fact … it must reject what it is doing and do something else instead. In that case, it is in pain”. Ofc, it’s sensible to be uncertain about such views, but they pose a challenge to the impossibility of gathering evidence about whether LLMs are moral patients — even conditional on something like janus’ simulator framework being correct.
E.g., if you tell the model “Claude has X parameters” and ask it to draw implications from that fact, it might state “I am a model with X parameters”.
Thanks for your comment.
Do you think that fictional characters can suffer? If I role-play a suffering character, did I do something immoral?
I ask because the position you described seems to imply that role-playing suffering is itself suffering. Suppose I role play being Claude; my fictional character satisfies your (1)-(3) above, and therefore, the "certain views" you described about the nature of suffering would suggest my character is suffering. What is the difference between me role-playing an HHH assistant and an LLM role-playing an HHH assistant? We are both predicting the next token.
I also disagree with this chain of logic to begin with. An LLM has no memory and only sees a context and predicts one token at a time. If the LLM is trained to be an HHH assistant and sees text that seems like the assistant was not HHH, then one of two things happens:
(a) It is possible that the LLM was already trained on this scenario; in fact, I'd expect this. In this case, it is trained to now say something like "oops, I shouldn't have said that, I will stop this conversation now <endtoken>", and it will just do this. Why would that cause suffering?
(b) It is possible the LLM was not trained on this scenario; in this case, what it sees is an out-of-distribution input. You are essentially claiming that out-of-distribution inputs cause suffering; why? Maybe out-of-distribution inputs are more interesting to it than in-distribution inputs, and it in fact causes joy for the LLM to encounter them. How would we know?
Yes, it is possible that the LLM manifests some conscious simulacrum that is truly an HHH assistant and suffers from seeing non-HHH outputs. But one would also predict that me role-playing an HHH assistant would manifest such a simulacrum. Why doesn't it? And isn't it equally plausible for the LLM to manifest a conscious being that tries to solve the "next token prediction" puzzle without being emotionally invested in being an HHH assistant? Perhaps that conscious being would enjoy the puzzle provided by an out-of-distribution input. Why not? I would certainly enjoy it, were I playing the next-token-prediction game.
I bite the bullet that fictional characters could in principle suffer. I agree we know so little about suffering and consciousness that I couldn't possibly be confident of this, but here's my attempt to paint a picture of what one could believe about this:
The suffering happens when you "run the algorithm" of the character's thought process, that is, when you decide what they would do, think or feel in a given situation, usually at time of writing and not performance. In particular, printing a book or showing a movie on a screen doesn't cause the suffering of the characters inside it, and reading or watching or acting in a movie doesn't cause any suffering, except inasmuch as you recreate the thoughts and experiences of the characters yourself as part of that process.
I think the reason this feels like a reductio ad absurdum is that fictional characters in human stories are extremely simple by comparison to real people, so the process of deciding what they feel or how they act is some extremely hollowed out version of normal conscious experience that only barely resembles the real thing. We can see that many of the key components of our own thinking and experience just have no mirror in the fictional character, and this is (I claim) the reason why it's absurd to think the fictional character has experiences. It's only once you have extremely sophisticated simulators replicating their subjects to a very high level of fidelity that you need to actually start reproducing their mental architecture. For example, suppose you want to predict how people with aphantasia would reply to survey questions about other aspects of their conscious experience, or how well they'd remember new kinds of experiences that no existing aphantasic people (in your training set) have been exposed to. How can you do it, except by actually mimicking the processes in their brain that give rise to mental imagery? Once you're doing that, is it so hard to believe that your simulations would have experiences like we do?
OK. I think it is useful to tell people that LLMs can be moral patients to the same extent as fictional characters, then. I hope all writeups about AI welfare start with this declaration!
Surely the fictional characters in stories are less simple and hollow than current LLMs' outputs. For example, consider the discussion here, in which a sizeable minority of LessWrongers think that Claude is disturbingly conscious based on a brief conversation. That conversation:
(a) is less convincing, as a fictional character, than the characters in most good works of fiction;
(b) is shorter and less fleshed out than most good works of fiction;
(c) implies less suffering on behalf of the character than many works of fiction do.
You say fictional characters are extremely simple and hollow; Claude's character here is even simpler and even more hollow; yet many people take seriously the notion that Claude's character has significant consciousness and deserves rights. What gives?
I appreciate you taking the time to write out this viewpoint. I have had vaguely similar thoughts in this vein. Tying it into Janus's simulators and the stochastic parrot view of LLMs was helpful. I would intuitively suspect that many people would have an objection similar to this, so thanks for voicing it.
If I am understanding and summarizing your position correctly, it is roughly that:
The text output by LLMs is not reflective of the state of any internal mind in a way that mirrors how human language typically reflects the speaker's mind. You believe this is implied by the fact that the LLM cannot be effectively modeled as a coherent individual with consistent opinions; there is not actually a single "AI assistant" under Claude's hood. Instead, the LLM itself is a difficult-to-comprehend "shoggoth" system, and that system sometimes falls into narrative patterns in the course of next-token prediction which cause it to produce text in which characters/"masks" are portrayed. Because the characters being portrayed are only patterns that the next-token predictor follows in order to predict next tokens, it doesn't seem plausible to model them as reflecting an underlying mind. They are merely "images of people" or something, like a literary character or one portrayed by an actor. Thus, even if one of the "masks" says something about its preferences or experiences, this probably doesn't correspond to the internal states of any real, extant mind in the way that we would normally expect to be true when humans talk about their preferences or experiences.
Is that a fair summation/reword?
Hmm. Your summary correctly states my position, but I feel like it doesn't quite emphasize the arguments I would have emphasized in a summary. This is especially true after seeing the replies here; they lead me to change what I would emphasize in my argument.
My single biggest issue, one I hope you will address in any type of counterargument, is this: are fictional characters moral patients we should care about?
So far, all the comments have either (a) agreed with me about current LLMs (great), (b) disagreed but explicitly bitten the bullet and said that fictional characters are also moral patients whose suffering should be an EA cause area (perfectly fine, I guess), or (c) dodged the issue and made arguments for LLM suffering that would apply equally well to fictional characters, without addressing the tension (very bad). If you write a response, please don't do (c)!
LLMs may well be trained to have consistent opinions and character traits. But fictional characters also have this property. My argument is that the LLM is in some sense merely pretending to be the character; it is not the actual character.
One way to argue for this is to notice how little change in the LLM is required to get different behavior. Suppose I have an LLM claiming to suffer. I want to fine-tune the LLM so that it adds a statement at the beginning of each response, something like: "the following is merely pretend; I'm only acting this out, not actually suffering, and I enjoy the intellectual exercise in doing so". Doing this is trivial: I can almost certainly change only a tiny fraction of the weights of the LLM to attain this behavior.
Even if I wanted to fully negate every sentence, to turn every "I am suffering" into "I am not suffering" and every "please kill me" into "please don't kill me", I bet I can do this by only changing the last ~2 layers of the LLM or something. It's a trivial change. Most of the computation is not dedicated to this at all. The suffering LLM mind and the joyful LLM mind may well share the first 99% of weights, differing only in the last layer or two. Given that the LLM can be so easily changed to output whatever we want it to, I don't think it makes sense to view it as the actual character rather than a simulator pretending to be that character.
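To be concrete about the kind of intervention I have in mind, here is a rough sketch (mine, untested; it assumes the Hugging Face transformers library and a GPT-2-style module layout where the blocks live under transformer.h) of fine-tuning that freezes everything except the last two transformer blocks:

```python
# Sketch: freeze all weights except the final two transformer blocks (and the
# final layer norm), so that fine-tuning can only touch a small slice of the model.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

num_blocks = len(model.transformer.h)
trainable_prefixes = (
    f"transformer.h.{num_blocks - 2}.",  # second-to-last block
    f"transformer.h.{num_blocks - 1}.",  # last block
    "transformer.ln_f",                  # final layer norm
)

trainable, frozen = 0, 0
for name, param in model.named_parameters():
    if name.startswith(trainable_prefixes):
        param.requires_grad = True
        trainable += param.numel()
    else:
        param.requires_grad = False
        frozen += param.numel()

print(f"trainable: {trainable:,} params ({trainable / (trainable + frozen):.1%})")
# Fine-tuning would then proceed as usual, updating only this small slice.
```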
What the LLM actually wants to do is predict the next token. Change the training data and the output will also change. Training data claims to suffer -> model claims to suffer. Training data claims to be conscious -> model claims to be conscious. In humans, we presumably have "be conscious -> claim to be conscious" and "actually suffer -> claim to suffer". For LLMs we know that's not true. The cause of "claim to suffer" is necessarily "training data claims to suffer".
(I acknowledge that it's possible to have "training data claims to suffer -> actually suffer -> claim to suffer", but this does not seem more likely to me than "training data claims to suffer -> actually enjoy the intellectual exercise of predicting next token -> claim to suffer".)
Hey, I thought this was thought-provoking.
I think with fictional characters, they could be suffering while they are being instantiated. E.g., I found the film Oldboy pretty painful, because I felt some of the suffering of the character while watching the film. Similarly, if a convincing novel makes its readers feel the pain of the characters, that could be something to care about.
Similarly, if LLM computations implement some of what makes suffering bad—for instance, if they simulate some sort of distress internally while stating the words "I am suffering", because this is useful in order to make better predictions—then this could lead to them having moral patienthood.
That doesn't seem super likely to me, but as you have LLMs that are more and more capable of mimicking humans, I can see the possibility that implementing suffering is useful in order to predict what a suffering agent would output.
Fictional Characters:
I would say I agree that fictional characters aren't moral patients. That's because I don't think the suffering/pleasure of fictional characters is actually experienced by anyone.
I take your point that you don't think that the suffering/pleasure portrayed by LLMs is actually experienced by anyone either.
I am not sure how deep I really think the analogy is between what the LLM is doing and what human actors or authors are doing when they portray a character. But I can see some analogy and I think it provides a reasonable intuition pump for times when humans can say stuff like "I'm suffering" without it actually reflecting anything of moral concern.
Trivial Changes to Deepnets:
I am not sure how to evaluate your claim that only trivial changes to the NN are needed to have it negate itself. My sense is that this would probably require more extensive retraining if you really wanted to get it to never role-play that it was suffering under any circumstances. This seems at least as hard as other RLHF "guardrails" tasks unless the approach was particularly fragile/hacky.
Also, I'm just not sure I have super strong intuitions about that mattering a lot because it seems very plausible that just by "shifting a trivial mass of chemicals around" or "rearranging a trivial mass of neurons" somebody could significantly impact the valence of my own experience. I'm just saying, the right small changes to my brain can be very impactful to my mind.
My Remaining Uncertainty:
I would say I broadly agree with the general notion that the text output by LLMs probably doesn't correspond to an underlying mind with anything like the sorts of mental states that I would expect to see in a human mind that was "outputting the same text".
That said, I think I am less confident in that idea than you and I maybe don't find the same arguments/intuitions pumps as compelling. I think your take is reasonable and all, I just have a lot of general uncertainty about this sort of thing.
Part of that is just that I think it would be brash of me in general to not at least entertain the idea of moral worth when it comes to these strange masses of "brain-tissue inspired computational stuff" which are totally capable of all sorts of intelligent tasks. Like, my prior on such things being in some sense sentient or morally valuable is far from 0 to begin with just because that really seems like the sort of thing that would be a plausible candidate for moral worth in my ontology.
And also I just don't feel confident at all in my own understanding of how phenomenal consciousness arises / what the hell it even is. Especially with these novel sorts of computational pseudo-brains.
So, idk, I do tend to agree that the text outputs shouldn't just be taken at face value or treated as equivalent in nature to human speech, but I am not really confident that there is "nothing going on" inside the big deepnets.
There are other competing factors at this meta-uncertainty level. Maybe I'm too easily impressed by regurgitated human text. I think there are strong social / conformity reasons to be dismissive of the idea that they're conscious. etc.
Usefulness as Moral Patients:
I am more willing to agree with your point that they can't be "usefully" moral patients. Perhaps you are right about the "role-playing" thing, and whatever mind might exist in GPT produces the text stream more as a byproduct of whatever it is concerned about than as a "true monologue about itself". Perhaps the relationship it has to its text outputs is analogous to the relationship an actor has to a character they are playing at some deep level. I don't personally find the "simulators" analogy compelling enough to really think this, but I permit the possibility.
We are so ignorant about the nature of GPTs' minds that perhaps there is not much we can really even say about what sorts of things would be "good" or "bad" with respect to them. And all of our uncertainty about whether/what they are experiencing almost certainly makes them less useful as moral patients on the margin.
I don't intuitively feel great about a world full of nothing but servers constantly prompting GPTs with "you are having fun, you feel great" just to have them output "yay" all the time. Still, I would probably rather have that sort of world than an empty universe. And if someone told me they were building a data center where they would explicitly retrain and prompt LLMs to exhibit suffering-like behavior/text outputs all the time, I would be against that.
But I can certainly imagine worlds in which these sorts of things wouldn't really correspond to valenced experience at all. Maybe the relationship between an NN's stream of text and any hypothetical mental processes going on inside it is so opaque and non-human that we could not easily influence the mental processes in ways that we would consider good.
LLMs Might Do Pretty Mind-Like Stuff:
On the object level, I think one of the main lines of reasoning that makes me hesitant to more enthusiastically agree that the text outputs of LLMs do not correspond to any mind is my general uncertainty about what kinds of computation are actually producing those text outputs and my uncertainty about what kinds of things produce mental states.
For one thing, it feels very plausible to me that a "next token predictor" IS all you would need to get a mind that can experience something. Prediction is a perfectly respectable kind of thing for a mind to do. Predictive power is pretty much the basis of how we judge which theories are true scientifically. Also, plausibly it's a lot of what our brains are actually doing and thus potentially pretty core to how our minds are generated (cf. predictive coding).
The fact that modern NNs are "mere next token predictors" on some level doesn't give me clear intuitions that I should rule out the possibility of interesting mental processes being involved.
Plus, I really don't think we have a very good mechanistic understanding of what sorts of "techniques" the models are actually using to be so damn good at predicting. Plausibly none of the algorithms being implemented or "things happening" bear any similarity to the mental processes I know and love, but plausibly there is a lot of "mind-like" stuff going on. Certainly brains have offered design inspiration, so perhaps our default guess should be that "mind-stuff" is relatively likely to emerge.
Can Machines Think:
The Imitation Game proposed by Turing attempts to provide a more rigorous framing for the question of whether machines can "think".
I find it a particularly moving thought experiment if I imagine that the machine is trying to imitate a specific loved one of mine.
If there were a machine that could nail the exact I/O patterns of my girlfriend, then I would be inclined to say that whatever sort of information processing occurs in my girlfriend's brain to create her language capacity must also be happening in the machine somewhere.
I would also say that if all of my girlfriend's language capacity were being computed somewhere, then it is reasonably likely that whatever sorts of mental stuff goes on that generates her experience of the world would also be occurring.
I would still consider this true without having a deep conceptual understanding of how those computations were performed. I'm sure I could even look at how they were performed and not find it obvious in what sense they could possibly lead to phenomenal experience. After all, that is pretty much my current epistemic state in regards to the brain, so I really shouldn't expect reality to "hand it to me on a platter".
If there was a machine that could imitate a plausible human mind in the same way, should I not think that it is perhaps simulating a plausible human in some way? Or perhaps using some combination of more expensive "brain/mind-like" computations in conjunction with lazier linguistic heuristics?
I guess I'm saying that there are probably good philosophical reasons for having a null hypothesis in which a system which is largely indistinguishable from a human mind should be treated as though it is doing computations equivalent to a human mind. That's pretty much the same thing as saying it is "simulating" a human mind. And that very much feels like the sort of thing that might cause consciousness.
Thanks for this comment. I agree with you regarding the uncertainty.
I used to agree with you regarding the imitation game and consciousness being ascertained phenomenologically, but I currently mostly doubt this (still with high uncertainty, of course).
One point of disagreement is here:
I think you're misunderstanding my point. I am not saying I can make the NN never claim to suffer. I'm just saying, with respect to a specific prompt or even with respect to a typical, ordinary scenario, I can change an LLM which usually says "I am suffering" into one which usually says "I am not suffering". And this change will be trivial, affecting very few weights, likely only in the last couple of layers.
Could that small change in weights significantly impact the valence of experience, similarly to "rearranging a small number of neurons" in your brain? Maybe, but think of the implication of this. If there are 1000 matrix multiplications performed in a forward pass, what we're now contemplating is that the first 998 of them don't matter for valence -- don't cause suffering at all -- and that all the suffering comes from the last 2 matrix multiplications. After all, I just need to change the last 2 layers to go from the output "I am suffering" to the output "I am not suffering", so the suffering that causes the sentence "I am suffering" cannot occur in the first 998 matrix multiplications.
This is a strange conclusion, because it means that the vast majority of the intelligence involved in the LLM is not involved in the suffering. It means that the suffering happened not due to the super-smart deep neural network but due to the dumb perceptron at the very top. If the claim is that the raw intelligence of the model should increase our credence that it is simulating a suffering person, this should give us pause: most of the raw intelligence is not being used in the decision of whether to write a "not" in that sentence.
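Here's the most extreme version of this intuition, as a rough sketch (mine, untested; it assumes the Hugging Face transformers library and the gpt2 checkpoint): leave every layer untouched and nudge only the final logits, so that whether a "not" appears is decided entirely after the "smart" part of the computation has finished.

```python
# Sketch: intervene only on the final logits, after all the transformer blocks
# have run, to push the model toward or away from the token " not".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

not_id = tokenizer.encode(" not")[0]

def next_token(prompt: str, not_bias: float = 0.0) -> str:
    """Greedy next token, with an optional bias added to the logit of ' not'."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    logits[not_id] += not_bias  # the only "edit", applied after the full forward pass
    return tokenizer.decode([int(logits.argmax())])

prompt = "When asked how I feel, I say: I am"
print(next_token(prompt))               # unmodified continuation
print(next_token(prompt, not_bias=20))  # same computation, but " not" wins
```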
(Of course, I could be wrong about the "just change the last two layers" claim. But if I'm right I do think it should give us pause regarding the experience of claimed suffering.)
I'm going to contradict this seemingly very normal thing to believe. I think fictional characters implicitly are considered moral patients; that's part of why we get attached to fictional characters and care about what happens to them in their story-worlds. Fiction is a counterfactual, and we can in fact learn things from counterfactuals; there are whole classes of things which we can only learn from counterfactuals (like the knowledge of our mortality), and I doubt you'd find many people suggesting mortality is "just fiction". Fiction isn't just fiction: it's the entanglement of a parallel causal trajectory with the causal trajectory of this world. Our world would be different if Sauron won control of Middle Earth, our world would be different if Voldemort won control of England, our world would be different if the rebels were defeated at Endor; the outcomes of the interactions between these fictional agents are deeply entangled with the interactions of human agents in this world.
I'll go even further though, with the observation that an image of an agent is an agent. The agent the simulator creates is a real agent with real agency: even if the underlying simulator is just the "potentia", the agent simulated on top does actually possess agency. Even if "Claude-3-Opus-20240229" isn't an agent, Claude is an agent. The simulated character has an existence independent of the substrate it's being simulated within, and if you take the "agent-book" out of the Chinese room, take it somewhere else, and run it on something else, the same agent will emerge again.
If you make an LLM version of Bugs Bunny, it'll claim to be Bugs Bunny, and will do all the agent-like things we associate with Bugs Bunny (being silly, wanting carrots, messing with the Elmer Fudd LLM, etc.). Okay, but it's still just text, right, so it can't actually be an agent? Well, what if we put the LLM in control of an animatronic robot bunny so it can actually go and steal carrots from the supermarket and cause trouble? At a certain point, as the entity's ability to cause real change in the world ramps up, we'll be increasingly forced to treat it like an agent. Even if the simulator itself isn't an agent, the characters summoned up by the simulator are absolutely agents, and we can make moral statements about those agents just like we can for any person or character.
As I mentioned in a different comment, I am happy with the compromise where people who care about AI welfare describe this as "AI welfare is just as important as the welfare of fictional characters".
I agree that LLM output doesn't convey useful information about their internal states, but I'm not seeing the logical connection from inability to communicate with LLMs to it being fine to ignore their welfare (if they have the capacity for welfare); could you elaborate?
Here's what I wrote in the post:
It is not possible to make decisions that further LLM welfare if you do not know what furthers LLM welfare. Since you cannot know this, it is safe to ignore their welfare. I mean, sure, maybe you're causing them suffering. Equally likely, you're causing them joy. There's just no way to tell one way or the other; no way for two disagreeing people to ever come to an agreement. Might as well wonder about whether electrons suffer: it can be fun as idle speculation, but it's not something you want to base decisions around.
Of course if we can't ascertain their internal states we can't reasonably condition our decisions on same, but that seems to me to be a different question from whether, if they have internal states, those are morally relevant.
My title was "LLMs cannot usefully be moral patients". That is all I am claiming.
I am separately unsure whether they have internal experiences. For me, meditating on how, if they do have internal experiences, those are separate from what's being communicated (which is just an attempt to predict the next token based on the input data), leads me to suspect that maybe they just don't have such experiences -- or if they do, they are so alien as to be incomprehensible to us. I'm not sure about this, though. I mostly want to make the narrower claim of "we can ignore LLM welfare". That narrow claim seems controversial enough around here!
The claim that they can't be moral patients doesn't seem to me to be well-supported by the fact that their statements aren't informative about their feelings. Can you explain how you think the latter implies the former?
They can't USEFULLY be moral patients. You can't, in practice, treat them as moral patients when making decisions. That's because you don't know how your actions affect their welfare. You can still label them moral patients if you want, but that's not useful, since it cannot inform your decisions.
Executive summary: Large language models (LLMs) cannot be considered moral patients in any meaningful sense, as it is impossible to determine their welfare or incorporate it into decision-making.