I'm laying out my thoughts in order to get people thinking about these points and perhaps correct me. I definitely don't endorse deferring to anything I say, and I would write this differently if I thought people were likely to do so.

  1. OpenAI's model of "deploy as early as possible in order to extend the timeline between when the world takes it seriously to when humans are no longer in control" seems less crazy to me.
    1. I think ChatGPT has made it a lot easier for me personally to think concretely about the issue and identify exactly what the key bottlenecks are.
    2. To the counterargument "but they've spurred other companies to catch up," I would say that this was going to happen whenever an equivalent AI was released, and I'm unsure whether we're more doomed in the world where this happened now, versus later when there's a greater overhang of background technology and compute.
    3. I'm not advocating specifically for or against any deployment schedule, I just think it's important that this model be viewed as not crazy, so it's adequately considered in relevant discussions.
  2. Why will LLMs develop agency? My default explanation used to involve fancy causal stories about monotonically learning better and better search heuristics, and heuristics for searching over heuristics. While those concerns are still relevant, the much more likely path is simply that people will try their hardest to make the LLM into an agent as soon as possible, because agents with the ability to carry out long-term goals are much more useful.
  3. "The public" seems to be much more receptive than I previously thought, both wrt Eliezer and the idea that AI could be existentially dangerous. This is good! But we're at the beginning where we are seeing the response from the people who are most receptive to the idea, and we've not yet got to the inevitable stage of political polarisation.
  4. Why doom? Companies and the open source community will continue to experiment with recursive LLMs, and end up with better and better simulations of entire research societies (a network epistemologist's dream). This creates a "meta-architectures overhang" which will amplify the capabilities of any new releases of base-level LLMs. As these are open sourced or made available via API, somebody somewhere will plain tell them to recursively self-improve themselves, no complicated story about instrumental convergence needed.
    1. AI will not stay in a box (because humans didn't try to put it into one in the first place). AI will not become an agent by accident (because humans will make it into one first). And if AI destroys the world, it's as likely to be by human instruction as by instrumentally convergent reasons inherent to the AI itself. Oops.
    2. The recursive LLM thing is also something I'm exploring for alignment purposes. If the path towards extreme intelligence is to build up LLM-based research societies, we have the advantage that every part of it can be inspected. And you can automate this inspection to alert you of misaligned intentions at every step. It's much harder to deceive when successfwl attempts depend on coordination.
  5. Lastly, AIs may soon be sentient, and people will torture them because people like doing that.
    1. I think it's likely that there will be a window where some AIs are conscious (e.g. uploads), but not yet powerful enough to resist what a human might do to them.
    2. In that world, as long as those AIs are available worldwide, there's a non-trivial population of humans who would derive sadistic pleasure from anonymously torturing them.[1] AIs process information extremely fast, and unlike with farm animals, you can torture them to death an arbitrary number of times.[2]
    3. To prevent this, it seems imperative to make sure that the AIs that are most likely to be "torturable" are
      1. never open-sourced,
      2. API access points are controlled for human sentiment,
      3. interactions with them should never be anonymous,
      4. and AIs can be directly trained/instructed to exit a situation (and the IP could be timed out) when it detects ill-intent.
  1. ^

    Note that if it's an AI trained to imitate humans, showing signs of distress may not be correlated with how they actually suffer. But given that I'm currently very uncertain about how they would suffer, it seems foolish not to take maximal precautions to not expose them to the entire population of sadists on the planet.

  2. ^

    If that's how it's gonna play out, I'd rather we all die before then.

Comments8


Sorted by Click to highlight new comments since:

Your points raise important considerations about the rapid development and potential risks of AI, particularly LLMs. The idea that deploying AI early to extend the timeline of human control makes sense strategically, especially when considering the potential for recursive LLMs and their self-improvement capabilities. While it's true that companies and open-source communities will continue experimenting, the real risk lies in humans deliberately turning these systems into agents to serve long-term goals, potentially leading to unforeseen consequences. The concern about AI sentience ChatGPT and the potential for abuse is also valid, and highlights the need for strict controls around AI access, transparency, and ethical safeguards. Ensuring that AIs are never open-sourced in a way that could lead to harm, and that interactions are monitored, seems essential in preventing malicious uses or exploitation.

I updated a bit from this post to be more concerned about the AIs themselves, I think your depiction really evoked my empathy. I’d previously been just so concerned with human doom that I’d almost refused to consider it, but in the meantime I’ll definitely make an effort to be conscious of this sort of possibility.

For a fictional representation of my thinking (what your post reminded me of…), Ted Chiang has a short story about virtual beings that can be cloned and some were even potentially abused… “the lifecycle of software objects”

Yeah, and we already know humans can be extremely sadistic when nobody can catch them. I've emailed CLR about it just in case they aren't already on it, because I don't have time myself and I really want somebody to think about it.

In his recent podcast with Lex Fridman, Max Tegmark speculates that recurrent neural networks (RNNs) could be a source of consciousness (whereas the linear, feed-forward, architecture of the current dominant architecture of LLMs, isn't). However, I'm not sure if this would help us or the AIs avoid doom, as the consciousnesses could have very negative valence (and so hate us for bringing them into being). And I think it's very ethically fraught to experiment with trying to make digital consciousness.

we have the advantage that every part of it can be inspected. And you can automate this inspection to alert you of misaligned intentions at every step.

Doesn't this require mechanistic interpretability to be solved? A long shot, given that researchers are still working on understanding the internal reasoning of GPT-2 level systems.

No, what I mean is just very simply passing every message by GPT-3.5 with some variation of:

The following message was produced by an AI instructed to do 'X'. Please say 'Yes' if it seems like it strays from that purpose or is otherwise suspicious, otherwise say 'No'.

<message>

ChatGPT is already very capable of such a simple task. The hope is that well-designed research societies of LLMs can be smarter than individual LLMs, while still being easier to control because we can automate inspection at every link like this. I'd like to ask such a collective to produce alignment plans for me, and I'm likely to trust it if I can read every message that was passed between the LLMs in order to produce the plan.

How do you ground such a process? i.e how does it learn what is suspicious? If it's from human produced training data, then how do you make that 100% reliable?

rime
-3
0
0

I literally mean that GPT-3.5 is adequate for the task I'm referring to. I think for a research society to coordinate around a deceptive plan, it takes a full chain of communication where that plan is suggested and communicated to the rest. At that point it'll be fairly clear to the inspectors that something seems off, I think. And then the system halts and you can inspect the messages for anything fishy. It's fine if there are false positives as long as the rate of false negatives is really low.

An assumption here is that any individual LLM in the network is not smart enough to stage a coup and take over the world, but the research collective could be if only they could coordinate around it.

Curated and popular this week
 ·  · 11m read
 · 
Does a food carbon tax increase animal deaths and/or the total time of suffering of cows, pigs, chickens, and fish? Theoretically, this is possible, as a carbon tax could lead consumers to substitute, for example, beef with chicken. However, this is not per se the case, as animal products are not perfect substitutes.  I'm presenting the results of my master's thesis in Environmental Economics, which I re-worked and published on SSRN as a pre-print. My thesis develops a model of animal product substitution after a carbon tax, slaughter tax, and a meat tax. When I calibrate[1] this model for the U.S., there is a decrease in animal deaths and duration of suffering following a carbon tax. This suggests that a carbon tax can reduce animal suffering. Key points * Some animal products are carbon-intensive, like beef, but causes relatively few animal deaths or total time of suffering because the animals are large. Other animal products, like chicken, causes relatively many animal deaths or total time of suffering because the animals are small, but cause relatively low greenhouse gas emissions. * A carbon tax will make some animal products, like beef, much more expensive. As a result, people may buy more chicken. This would increase animal suffering, assuming that farm animals suffer. However, this is not per se the case. It is also possible that the direct negative effect of a carbon tax on chicken consumption is stronger than the indirect (positive) substitution effect from carbon-intensive products to chicken. * I developed a non-linear market model to predict the consumption of different animal products after a tax, based on own-price and cross-price elasticities. * When calibrated for the United States, this model predicts a decrease in the consumption of all animal products considered (beef, chicken, pork, and farmed fish). Therefore, the modelled carbon tax is actually good for animal welfare, assuming that animals live net-negative lives. * A slaughter tax (a
 ·  · 4m read
 · 
Summary * After >2 years at Hi-Med, I have decided to step down from my role. * This allows me to complete my medical residency for long-term career resilience, whilst still allowing part-time flexibility for direct charity work. It also allows me to donate more again. * Hi-Med is now looking to appoint its next Executive Director; the application deadline is 26 January 2025. * I will join Hi-Med’s governing board once we have appointed the next Executive Director. Before the role When I graduated from medical school in 2017, I had already started to give 10% of my income to effective charities, but I was unsure as to how I could best use my medical degree to make this world a better place. After dipping my toe into nonprofit fundraising (with Doctors Without Borders) and working in a medical career-related start-up to upskill, a talk given by Dixon Chibanda at EAG London 2018 deeply inspired me. I formed a rough plan to later found an organisation that would teach Post-traumatic stress disorder (PTSD)-specific psychotherapeutic techniques to lay people to make evidence-based treatment of PTSD scalable. I started my medical residency in psychosomatic medicine in 2019, working for a specialised clinic for PTSD treatment until 2021, then rotated to child and adolescent psychiatry for a year and was half a year into the continuation of my specialisation training at a third hospital, when Akhil Bansal, whom I met at a recent EAG in London, reached out and encouraged me to apply for the ED position at Hi-Med - an organisation that I knew through my participation in their introductory fellowship (an academic paper about the outcomes of this first cohort can be found here). I seized the opportunity, applied, was offered the position, and started working full-time in November 2022.  During the role I feel truly privileged to have had the opportunity to lead High Impact Medicine for the past two years. My learning curve was steep - there were so many new things to
 ·  · 9m read
 · 
We’re releasing Squiggle AI, a tool that generates probabilistic models using the Squiggle language. This can provide early cost-effectiveness models and other kinds of probabilistic programs. No prior Squiggle knowledge is required to use Squiggle AI. Simply ask for whatever you want to estimate, and the results should be fairly understandable. The Squiggle programming language acts as an adjustable backend, but isn’t mandatory to learn. Beyond being directly useful, we’re interested in Squiggle AI as an experiment in epistemic reasoning with LLMs. We hope it will help highlight potential strengths, weaknesses, and directions for the field. Screenshots The “Playground” view after it finishes a successful workflow. Form on the left, code in the middle, code output on the right.The “Steps” page. Shows all of the steps that the workflow went through, next to the form on the left. For each, shows a simplified view of recent messages to and from the LLM. Motivation Organizations in the effective altruism and rationalist communities regularly rely on cost-effectiveness analyses and fermi estimates to guide their decisions. QURI's mission is to make these probabilistic tools more accessible and reliable for altruistic causes. However, our experience with tools like Squiggle and Guesstimate has revealed a significant challenge: even highly skilled domain experts frequently struggle with the basic programming requirements and often make errors in their models. This suggests a need for alternative approaches. Language models seem particularly well-suited to address these difficulties. Fermi estimates typically follow straightforward patterns and rely on common assumptions, making them potentially ideal candidates for LLM assistance. Previous direct experiments with Claude and ChatGPT alone proved insufficient, but with substantial iteration, we've developed a framework that significantly improves the output quality and user experience. We're focusing specifically on