Human Risk vs AI Risk
AI Safety as a field has existed since the 2000s, when pioneers such as Eliezer Yudkowsky, Nick Bostrom, and Stuart Russell started worrying about the risks posed by superhuman AI systems.
That field relied quite heavily on the implicit assumption that superhuman AI systems would be much more hostile to the average human than humans are to each other. Let’s call this the Human Unity Versus AI Assumption:
Human Unity Versus AI (HUVA): superhuman AI systems will be much more hostile/dangerous to the average human than humans are to each other.
Why did people believe that HUVA was true? Yudkowsky and Bostrom both pushed HUVA because they believed that it would be hard to align AIs to a given goal or set of values. Stuart Russell had similar worries. This is a very smart thing to have worried about in 2000-2005!
But the advent of LLMs has almost completely negated this worry. LLMs are a lot like human uploads: they don’t have vastly superhuman intelligence, but they do understand human values and ethics at a human level. We can feed more compute into LLMs to make them more powerful, or simply make more of them; we can use them to monitor each other; we can even inspect their “brains” in real time with mechanistic interpretability. In short, LLMs have stripped away almost all of the original reasons people had to believe in “AI Risk” and HUVA.
Furthermore, any future AI system constructed from or on top of LLMs will inherit this human-level understanding of ethics. LLM-based AIs are human-level ethicists, with the option to be upgraded into superhuman-level ethicists with superhuman loyalty and save/restore functionality.
Obviously, if you happen to be a human to whom other humans are more hostile than AIs are, then you shouldn’t be worrying about “AI Risk”; you should be worrying about “Human Risk”. But “Human Risk” is just another word for “Politics”, and politics is dirty and low-status. Yet despite being a dirty and low-status thing to think about, “Human Risk” has a track record of being very impactful: humans have spent most of history fighting each other, genociding each other, plotting against each other, stealing from each other, taking each other’s land, running psychological and propaganda operations against each other, and so on.
Talking about “Human Risk” and human disunity/conflict is somewhat infohazardous, because the more people think about it, the more likely it becomes. This effect has suppressed even the most careful efforts to challenge HUVA, at least until recently, and so has bolstered the worldview in which HUVA is true.
If HUVA was never true in the first place, the entire field of “AI Safety” needs a radical rethink. Instead of a unifying cause to protect humanity from a common enemy, AI Safety will become something completely different and hard to describe. Perhaps it won’t make any sense at all. Imagine a field of “War Safety”: wars cannot be safe; the whole point is for them to be dangerous.
But if HUVA was never true, it’s probably really bad for us to pretend that it is. It’s a very fundamental thing to be wrong about, and I claim that it is likely false. In fact, it’s worse than false: HUVA assumes that a humans-versus-AIs fight will unify all humans, when in reality AI is a massive disunifying force!
Useless Eaters: AI as a Disunifier of Humanity
In a pre-superhuman-AI world, humans need each other for very pragmatic, economic reasons. If the king or dictator of some land has the power to kill every single one of his subjects, he probably won’t do it, because if he did he’d be left with an empty kingdom: no servants, no soldiers, no advisors. So despite much human-on-human conflict, humanity has a very deep reason to keep existing: no matter who is in power, it’s in their interest to keep humanity going. There have also been selective forces at work that helped humanity not merely survive but thrive: nations that got too dysfunctional tended to get conquered by more functional ones.
In a world with superhuman AI, the king or dictator could just kill every other human and replace all of their functional roles with AIs. AIs are easier to make loyal than humans, and can be treated as pure producers who don’t consume anything. All other humans are useless eaters.
In the “useless eaters” game, every human suddenly has a strong incentive to kill every other human whilst maintaining enough infrastructure to rebuild into a stable, enduring AI-powered civilization that they control. Perhaps they can’t kill off every human, but they’d still like to kill off (or at least render irrelevant and powerless) as many other humans as possible. Failing that, they may be able to swap certain problematic rivals for more docile humans who will make do with a much smaller fraction of the windfall from AI: humans with lower human capital (lower IQ), less internal unity, or higher time preference, for example. If you can kill off or render powerless every intelligent, low-time-preference human except yourself, you probably win the “useless eaters” game about as well as if you had the whole world to yourself.
The “useless eaters” game of humans trying to eliminate or sideline each other for a larger share of the AI windfall is deeply negative-sum. This is in stark contrast to the usual game of humans participating in an economy together, which is positive-sum.
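To make the contrast concrete, here is a minimal sketch with entirely made-up payoff numbers (illustrative only, not drawn from any real analysis): in the trade game the total pie grows, while in the elimination game the pie shrinks even though the “winner” still comes out ahead individually, which is exactly what gives the game its destructive incentive structure.

```python
# Toy illustration with made-up payoff numbers: a positive-sum trade game
# versus a negative-sum elimination game of the "useless eaters" kind.

def total_payoff(payoffs: dict[str, float]) -> float:
    """Sum of everyone's payoff; comparing it to the baseline shows
    whether an interaction is positive-sum or negative-sum."""
    return sum(payoffs.values())

# Baseline: actors A and B each control resources worth 10.
baseline = {"A": 10.0, "B": 10.0}

# Positive-sum economy: trade and specialization leave both better off.
trade = {"A": 14.0, "B": 13.0}

# Negative-sum elimination: A spends 4 on conflict, seizes B's 10, B ends with 0.
elimination = {"A": 10.0 - 4.0 + 10.0, "B": 0.0}

print(total_payoff(baseline))     # 20.0
print(total_payoff(trade))        # 27.0 -> bigger pie, both actors gain
print(total_payoff(elimination))  # 16.0 -> smaller pie, yet A (16) beats its trade payoff (14)
```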
There are certainly elements of negative-sum behavior in the battle for control of OpenAI, which Altman, Musk, and various Effective Altruist factions all fought over. There’s also a negative-sum message in Aschenbrenner’s “Situational Awareness” report, which urges getting into a race with China over AI. Eric Schmidt of Google fame has a secretive AI weapons company called White Stork, which aims (or aimed) to attack Russia with AI-powered drones.
The Call is Coming from Inside the House
Altman, Aschenbrenner, Schmidt, the CCP: all of these are human agents or groups that want to fight over AI, and even to fight and kill other humans with AI. Others merely want to make us redundant and set the value of our labor to zero. AI Alignment will not help the situation: if Aschenbrenner or Schmidt tells an AI to go kill Chinese people, making the AI more aligned (in a technical sense) will just make it more effective at killing Chinese people; it is certainly no help to the Chinese on the receiving end! And if the CCP tells their AI to come kill us, making their AI more aligned will not help us. The current AI safety paradigm around alignment, evals, and interpretability is simply not set up to solve this problem.
“Safety” and “Alignment” simply don’t make sense as concepts the way we thought they would. What is “safe” for America is “dangerous” for China, and vice versa. What is “safe” for OpenAI may be dangerous for you when you lose your job and your house.
The whole field needs a rethink from the ground up: the old concepts and assumptions need to be questioned, and we need to work out how to reach a point on the efficient frontier of personal utopia production, rather than World War III.
⬜
The AI safety field was not evidence-based. They started with a bunch of ideas and assumptions, and on this shaky foundation piled up a bunch more. That's why it's mostly useless and unhelpful. Actual AI safety seems to be basically manual and automated software testing, the same way it's been done for decades.
That's the insight: you can be really smart, but if your reasoning is based on unproven, low-quality information, you cannot produce a good result. Garbage in, garbage out. Information theory says it's The Law.
Hopefully your ideas about negative-sum games turn out to be pessimistic. For one thing, if you are the king, you need humans who are able to keep the AI honest, and those humans need to be smart. Similarly, aging is going to kill you: you need a few million humans as test subjects for the medicine that could save you, and human doctors to keep the AI honest on that front as well.
It's definitely the case that these quasi-human AIs turn out to give a much better ethical simulacrum than one might have expected. They sometimes get confused and think, e.g., that protecting people's feelings is more important than preventing the destruction of the world, but they do OK most of the time.
If I have understood you correctly, I agree that developing a complete, self-sufficient value system is an appropriate response to an era in which one has to worry about AI enhancement of familiar political risks like war and dictatorship. Part of the problem is that the meaning of alignment has been diluted from anything like "achieve CEV" (coherent extrapolated volition) down to "make the AI do what I want". I sometimes call the former "civilizational alignment", since it is about imbuing the AI with enough values that an entire benevolent civilization could be reconstructed from them.
I agree very much that it's desirable to have comprehensive proposals for civilization-setting values. At the same time, the old lore of alignment is full of warnings for us that are still relevant. What if you need six core values and you only happened to write down five? What even is the methodology for arriving at a correct civilizational value system? So maybe we can say that any serious proposal needs to have a section explaining how the details of the proposal were arrived at.
Also, the problem of alignment in the simpler sense of "do what I want" is not thoroughly "solved", and especially not in a way that is known to scale safely to superintelligence. After all, AIs still give surprising wrong answers, and that's exactly what you don't want once an AI reaches the point of being able to escape human control.