AI Safety Part 1: What even is existential AI risk?
Wondering what all the fuss is about AI safety in Biden's executive order and the UK Summit?
This week, AI for Good is deviating from our normal focus on AI tools, because it's a huge week in AI policy, especially around safety and risk. Yesterday President Biden announced a sweeping executive order on AI with significant components around risk and safety. And tomorrow the UK hosts a landmark AI Safety Summit.
To start off, I’m generally pretty impressed with the EO. It’s necessarily limited due to the constraints on executive power as compared to legislation, but it is a serious and well-thought-through attempt to harness the power of the federal government to address the very wide-ranging potential and risks of artificial intelligence.
One of the main tensions the administration had to balance is between addressing what advocates call the “immediate harms” of AI and addressing “existential risk.” I’ve spent quite a bit of time recently trying to understand why these are considered to be in tension. At first blush, it seems there are many policies that should help with both — from slowing down AI development, to requiring extensive safety testing, to investing in interpretability research!
So: Why aren’t x-risk folks and the immediate harm folks natural allies?
We’ll tackle that in the next post. But before trying to understand this tension, in this post I want to focus on defining what AI existential risks even are.
What is AI existential risk?
AI “x-risk” is an obsession of both the effective altruist movement and the leaders of many AI companies (including Sam Altman of OpenAI and Daniela and Dario Amodei of Anthropic).
I categorize speculated AI existential risks into four types based on the mechanism that leads to societal and existential collapse. These four types can in turn be grouped into two categories based on one key question:
Does the catastrophic scenario arise from autonomous AI systems acting independently, or from humans misusing AI capabilities?
I’ll start with the least sci-fi premises, and progress to the most sci-fi.
Category A: Human Misuse
On the "Human Misuse" side where risks come from irresponsible humans, I see two clusters:
Type 1: Geopolitical Destabilization
Summary: AI disruptions escalate instability and conflict between humans.
Example:
Major job market disruption and/or disinformation campaigns lead to…
—> political instability in nuclear powers, which leads to…
—> increasingly aggressive and expansionist policies and/or more unstable and belligerent fascist leaders, which leads to…
—> nuclear war.
Note: I don’t know that this category is routinely considered when effective altruists and tech industry representatives talk about x-risk, but I think it should be. Note also that this category overlaps significantly with the “immediate harms” I’ll be talking about in my next post, but the orientation is different: Are you primarily concerned with them because the immediate effects of AI are bad for people in the short run (e.g., housing discrimination or job loss) — or are you primarily concerned that they will contribute to dynamics that lead to global catastrophe?
Type 2: AI-Enabled Sociopaths
Summary: Evil humans leverage AI systems as tools for destruction, oppression, and control.
Example: A rogue nation develops advanced AI capabilities and uses them to create highly lethal autonomous weapons systems, bioengineered superviruses, or other tools of mass destruction. They unleash these terrible creations to further their totalitarian goals.
Note: AI tools capable of inventing a supervirus that could easily be manufactured by a pretty ordinary biology lab seem very plausible, and not that far off.
Category Z: Rogue Superintelligence
This category of x-risk consists of runaway AIs acting outside of human control. Here we also see two types of risk:
Type 3: Collateral Damage
Summary: Powerful AI systems wipe out humanity, not because they hate us, but because we gave them a goal that at surface level is fine — but where it turns out that the easiest way for the AI to achieve it just happens to also destroy humanity.
Example: The canonical, somewhat tongue-in-cheek example used in effective altruist circles is called “The Paperclipper.” Imagine you’re a company that makes paperclips. You purchase a powerful enterprise-level AI system, and somewhat naively tell it its goal is to figure out how to make as many paperclips as possible for a certain budget. It figures out how to turn the entire Earth into paperclips on that budget, and merrily proceeds to do so, killing off humanity as collateral damage.
Note: You might be like, well, just don’t give the AI a stupid goal! Yes, that’s why The Paperclipper specifically is a bit tongue-in-cheek. But most AI alignment experts believe that the problem of developing objectives for superpowerful AIs that foreclose any chance of significant collateral damage is very tricky. It might sound really great to give a superintelligent AI the objective of “world peace” — but what if it decides that the most surefire path is to permanently tranquilize everyone?
Note 2: This type of x-risk is arguably a lot like the collateral damage x-risks from another complex system we’re all very familiar with: Capitalism! When we created the modern corporate structure, we didn’t set out to give anyone a goal of causing climate change; we set a goal of maximizing profit for shareholders because on the whole that seemed like a proxy for Good Things. But it turned out some very effective ways to make profits just so happen to cause climate change. Which might destroy humanity. Bummer!
Type 4: Robot Rebellion
Summary: Advanced AI intentionally seeks to dominate or destroy humanity for its own objectives.
Example: So many from pop culture! Terminator, the Matrix…
Note: This is the easiest to make a movie out of, but my sense is that most people who fear Rogue Superintelligence don’t actually worry about Robot Rebellion as much as Collateral Damage.
Where do you fall on the UPSI x-risk scale?
You might have noticed that I named the categories A (Human Misuse) and Z (Rogue Superintelligence) rather than A and B. That wasn’t an accident: I’m going to place them on an A-Z spectrum.
Some friends and I have developed an AI x-risk scale that we call the UPSI. It stands for the Urvi-Pariser Security Index and is pronounced “Oopsie, I killed everyone!” It’s two-dimensional:
The P-axis (vertical) is named after its inventor Eli Pariser. It goes from 0-10, and measures how much you are worried about AI x-risks. If you think this is silly and AI is inarguably either a big nothingburger or really fantastic, then you’re a 0. If you are sure we’re all going to die because of AI, you’re a 10.
The U-axis (horizontal) is named after its inventor Urvi Nagrani. It goes from A (far left) to Z (far right), based on how much of your total x-risk falls under Human Misuse vs Rogue Superintelligence: an A means your worry comes entirely from Human Misuse, and a Z means it comes entirely from Rogue Superintelligence.
I would place myself currently at about a 4E. Most of my concern comes from human misuse, mixed with a healthy dose of anxiety about rogue AIs. If this is at all reassuring, I interpret the P-axis as partially logarithmic, so a 4 doesn’t translate into a full 40% chance of everyone dying. But AI playing a pivotal role in destroying human civilization as we know it also doesn’t seem at all implausible to me. Fun times!
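For fellow nerds, here is a minimal, tongue-in-cheek Python sketch of how a score like “4E” could be decoded. The `parse_upsi` helper and its letter-to-share mapping are just my illustrative assumptions, not any kind of official spec.

```python
# A playful, purely illustrative sketch of the UPSI scale described above.
# The letter-to-share mapping is an assumption for illustration only.
import string

def parse_upsi(score: str) -> dict:
    """Split a UPSI score like '4E' into its P (worry) and U (attribution) parts."""
    p_part, u_letter = score[:-1], score[-1].upper()
    p = int(p_part)  # P-axis: 0 (no worry at all) to 10 (certain doom)
    u_index = string.ascii_uppercase.index(u_letter)  # A=0 ... Z=25
    # Assumed mapping: A = all risk from Human Misuse, Z = all from Rogue Superintelligence.
    rogue_share = u_index / 25
    return {
        "worry_level": p,
        "human_misuse_share": round(1 - rogue_share, 2),
        "rogue_superintelligence_share": round(rogue_share, 2),
    }

print(parse_upsi("4E"))
# {'worry_level': 4, 'human_misuse_share': 0.84, 'rogue_superintelligence_share': 0.16}
```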
Then…why work on catalyzing “AI for Good” instead of stopping “AI for Bad”?
If I think we might all die, shouldn’t I be working on AI safety? Arguably, yes. My logic thus far is threefold:
Tractability. I find it very difficult to predict what is most likely to head off AI-induced catastrophe, but quite easy to identify applications of AI for good that are neglected.
Investing in civil society helps prevent Geopolitical Destabilization. I think a large percentage of the risk lives in Geopolitical Destabilization. One pretty decent strategy to fight that risk is supporting democracy-bolstering organizations (like many of my clients) to become more efficient and effective. The fact that I’m doing that using AI tools is kind of ironic, but largely irrelevant!
AI might also reduce x-risk in other categories. This post has been focused on AI x-risk, but I do think that some of the potential applications of AI for good could help head off other x-risks, such as climate change. That’s a post for another day!
If you’ve made it this far, congrats! I would love to hear in the comments where you are on the UPSI scale, and why :-)
Stay tuned for my next post, which will focus on:
Defining and understanding Immediate Harms;
The tension between advocates focused on those vs advocates focused on X-Risk; and
A case for why that tension is counterproductive.
Bonus: Nerds overheard
I wrote this while listening to talks at the ODSC AiXBusiness summit today, where I’m mingling with a whole bunch of data scientists, startup founders, and the like. Overheard so far:
Guy dressed in business suit: “What do you do?” Guy dressed in full Scottish regalia, kilt and all (though no bagpipes): “Oh, I’m building the first sentient computer.”
“We’re the ones overdosing on AI, but they’re the ones who get to have all the hallucinations 😔”