The Challenge of Friendly AI
Posted by Jeriaska on November 26th, 2007
Eliezer Yudkowsky has two papers forthcoming in the edited volume Global Catastrophic Risks (Oxford, 2007), “Cognitive Biases Potentially Affecting Judgment of Global Risks” and “Artificial Intelligence as a Positive and Negative Factor in Global Risk.” At the 2007 Singularity Summit, he described how shaping a very powerful and general AI implies a different challenge, of greater moral and ethical depth, than programming a special-purpose domain-specific AI. The danger of trying to impose our own values, eternally unchanged, upon the future, can be seen through the thought experiment of imagining the ancient Greeks trying to do the same. Human civilizations over centuries, and individual human beings over their own lifespans, directionally change their moral values.
The following transcript of Eliezer Yudkowsky’s 2007 Singularity Summit presentation “The Challenge of Friendly AI” has not been approved by the author. An audio version of the talk is available at the Singularity Institute website.
The Challenge of Friendly AI
Before I start, a quick comment. Someone approached me today and said, “I understand you are a creationist?” And I said, “What?” And they said, “Well, you were talking yesterday about how it was impossible to evolve a butterfly.” And I said, “No, not impossible. Just amazing that it happened at all.” So I just want to disclaim that I’m not a creationist. Butterflies evolved. It’s just an extremely inefficient way to make butterflies.
This is going to be farther future, speculative type-stuff, which I make no apologies whatsoever for. Let’s say that you, a human being, are offered a million dollars to win a chess game. One way you could try to win is by playing the chess game yourself.
But, what if you have to play a thousand chess games? Playing a thousand chess games is a lot of work. Laziness is one of the great virtues of a computer programmer. If you are lazy, then you might write a computer program, a narrow AI, to play chess for you. This is a lot more work than playing a single game of chess. Laziness is usually more work than hard work. But you only have to write the program once. How can you write a program to play chess? Could you imagine every possible chess board and program in what you thought was a good move for that position? Unfortunately, there’s an exponentially vast space of possible chess positions. You need to save yourself some work here. You need to be lazy. You’ll have to solve the problem on a higher, more general level than recording moves and playing them back.
But, wait a minute. How do you, a human, know which moves to make in a position? Did your DNA pre-program you with all possible chess positions? By asking yourself - How did I decide to make that move? - you realize that you have some criteria for judging your moves. And asking - How did I judge? - uncovers opportunities to be lazy, to create a system that solves the problem on a higher level, to compress the problem using one algorithm, instead of lots of little pre-programmed moves. In chess AI, the key idea is search. The criterion of a good move is that it leads to checkmating the opponent. The good move is the one that steers the future of the chess board, steers it away from futures where the opponent checkmates you, and steers toward futures where you checkmate the opponent. By searching the game tree, we find moves that are likely to fit this criterion.
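To make the idea of searching a game tree concrete, here is a minimal sketch in Python. It is not chess and it is not how Deep Blue worked. It assumes a much smaller stand-in game: a pile of stones, each player removes one, two, or three per turn, and whoever takes the last stone wins. The names `can_win` and `best_move` are hypothetical, chosen only for this illustration. The program is never told which move to make in which position. It is only told the criterion (leave the opponent in a position they cannot win from) and searches for moves that meet it.

```python
from functools import lru_cache

MOVES = (1, 2, 3)  # assumption: a player may remove 1, 2, or 3 stones per turn


@lru_cache(maxsize=None)
def can_win(stones: int) -> bool:
    """Return True if the player to move can force a win from this position.

    With zero stones left, the previous player took the last stone,
    so the player to move has already lost (any() over nothing is False).
    """
    return any(not can_win(stones - take) for take in MOVES if take <= stones)


def best_move(stones: int) -> int:
    """Search one step ahead for a move that leaves the opponent in a lost position."""
    for take in MOVES:
        if take <= stones and not can_win(stones - take):
            return take
    # Every move loses against perfect play; fall back to taking one stone.
    return 1


if __name__ == "__main__":
    for pile in (5, 7, 10):
        print(f"{pile} stones: take {best_move(pile)}")
```

Nothing in the code lists good moves position by position. The move for any pile size falls out of the criterion plus the search, which is the compression the talk describes.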
Consider Deep Blue, the chess-playing program that beat Kasparov for the world championship. If Deep Blue’s programmers had tried to program Deep Blue explicitly, tell it exactly which moves to make, then Deep Blue could never have played better chess than its programmers. So the programmers told Deep Blue what the consequences of its moves should be. And Deep Blue calculated which moves had which consequences. Deep Blue could search deeper than its programmers, so it could better foresee the consequences of a chess move, so it played better chess, bearing in mind that it had the right criterion of what it was searching for.
Building a bulldozer is more difficult, requires a higher level of technology, than shoveling dirt yourself. But a bulldozer can lift things too heavy for human muscles. By the power of laziness, you can do things that would be impossible with mere hard work.
You may recall from yesterday’s talk that one of the three major schools of Singularity thought, Vernor Vinge’s event horizon, talks about the unpredictability of a world containing minds that are smarter than you. Deep Blue’s programmers could not predict exactly which chess moves Deep Blue would make during this tournament against Kasparov. If the programmers could have predicted Deep Blue, they would have been world champions themselves. So, if you can’t predict Deep Blue’s moves, why not just use a random move generator? The unpredictability of a superior intelligence is not quite like the unpredictability of flipping a fair coin. The programmers couldn’t predict Deep Blue’s exact moves, but they could predict the consequences of Deep Blue’s moves.
If you are really lazy, you might try to write a programming AI that would write the chess program for you. Of course, as usual with laziness, this is much more difficult than writing the chess program yourself. It’s a completely different kind of AI. One kind of AI program is about chess, and the other AI program is about programs. So you have to move to a completely different domain and solve a completely different kind of problem to make this programming AI. Laziness often involves jumping up a level. It’s a different kind of problem to build a bulldozer than to shovel dirt. And modern AI technology is not very good at this. But, humans can do it. Humans can write computer programs. And therefore we know that it’s physically possible for a cognitive system to solve this problem, to exhibit this behavior of programming.
Whatever humans do is possible for a cognitive system. A truly lazy AI researcher should have a reflex, whenever they think about a problem, that says, “How am I thinking about this problem? Could I get an AI to think this way?” You may not always be able to answer the question, but you should always ask it. If you don’t ask unanswerable questions, you’ll lose track of the important research problems and lose track of what you don’t know. It’s a lot of work to figure out how to write a programming AI. It’s so difficult that modern AI science is still working on it.
Suppose you’re really, really lazy. Then you might think, “Where does AI theory come from?” I’ll symbolize AI theory using the deservedly popular textbook Artificial Intelligence: A Modern Approach. Where did the knowledge in this book come from? From humans such as Peter Norvig. Human AI researchers wrote papers, thought about algorithms, carried out computational experiments, argued heatedly about the nature of intelligence.
Perhaps you could write an AI theory AI. Modern AI science does not even begin to approach this kind of ultra high-level reasoning. It may seem so abstract and airy that you may wonder if it’s a real problem. But this reflects our own lack of knowledge. The challenge is real, we just understand it poorly. Humans do think about AI theory, and an AI that can’t think about AI theory will be below us. There will be thoughts we can think that it can’t. So, the obvious next question is, if you’re really, really, really lazy, then what kind of AI do you need to write an AI theory AI?
But here one strongly suspects that an AI that can output Artificial Intelligence: A Modern Approach will, if you run it long enough, output itself. If human AI researchers can create an AI that thinks about AI theory at least as well as human AI researchers, that AI should be able to swallow itself and become a reflective AI. And this, as you may recall from yesterday’s talk, is where the intelligence explosion comes in. This is how we have the mind that improves itself - the positive feedback cycle.
The AI that can make itself smarter in ways that improve its ability to make itself even smarter. The human brain is finitely complex. The brain only has so many neurons patterned by a much smaller amount of DNA that tells it how to learn. It only takes a bounded amount of work to be ultimately lazy. To create the AI that can do everything humans can.
How large a cheesecake you can bake depends on your intelligence. A superintelligence could build enormous cheesecakes. Cheesecakes the size of cities. And there is only a bounded amount of work needed to build a self-improving AI. By golly, the future will be full of giant cheesecakes. I call this “The fallacy of the giant cheesecake.” You can’t jump from capability to actuality without considering the necessary intermediate of motive. So what will an AI’s motives be? That’s the 64 quadrillion dollar trick question.
In Hollywood movies, all the AI’s are the same type, a single tribe. Asking what AI’s will do is a trick question because it implies that AI’s form a natural class. Humans do form a natural class because we all share the same brain architecture. We all have a visual cortex, frontal cortex, limbic system, and so on. But the phrase “artificial intelligence” actually refers to a vastly larger space of possibility than when you say “human.”
When we talk about AI’s, we are really talking about minds-in-general. Imagine a map of mind design space. A tiny little circle contains all human minds. This is inside transhuman mind space, which includes all the human possibilities as a strict subset. This is inside post-human mind space, which you might say is everything a transhuman might grow up into. And then there’s all the rest of mind design space: the space of minds-in-general, including AI’s so strange they aren’t even recognizably post-human. And we need to reach into this enormously vast space of possibilities with very precise targeting and pull out a possibility which we won’t regret having made real: a Friendly AI, loosely speaking.
If you offer Gandhi a pill that makes him want to kill people, Gandhi will refuse the pill because he knows that if he takes the pill he will want to kill people, and the current Gandhi doesn’t want those people killed. That’s probably how goals are preserved in a self-improving system. And I wish I could prove this, but the current math for decision theory doesn’t work well for describing self-modifying AI’s. The current math goes into an infinite loop when you try to describe the AI modifying the part of itself that does the modifying. So this is one of the open research problems that need solving before anyone can build a Friendly AI and this is in fact what I see as my research objective. But let’s say that you solve the math problem. Now we get to the big question. What kind of Friendly AI should you make?
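The Gandhi argument can be written down as a toy decision rule. The sketch below is only an illustration under strong assumptions: outcomes are two labelled strings, utilities are lookup tables, and every name (`current_utility`, `pill_utility`, `accepts_modification`) is hypothetical. It does not touch the open problem mentioned above, an agent reasoning about modifying the very part of itself that does the modifying. It shows just the one-step intuition: a proposed change of goals is judged by the goals the agent has now.

```python
from typing import Callable

Utility = Callable[[str], float]

# Assumption: the whole future is collapsed into two coarse outcomes.
OUTCOMES = ("people_live", "people_die")


def current_utility(outcome: str) -> float:
    """Gandhi's present values: he wants people to live."""
    return {"people_live": 1.0, "people_die": -1.0}[outcome]


def pill_utility(outcome: str) -> float:
    """The values Gandhi would have after taking the pill."""
    return {"people_live": -1.0, "people_die": 1.0}[outcome]


def predicted_outcome(utility: Utility) -> str:
    """The outcome an agent holding this utility would steer toward."""
    return max(OUTCOMES, key=utility)


def accepts_modification(current: Utility, proposed: Utility) -> bool:
    """Accept a change of goals only if, judged by the CURRENT goals,
    the future the modified agent would bring about is at least as good
    as the future the unmodified agent would bring about."""
    return current(predicted_outcome(proposed)) >= current(predicted_outcome(current))


if __name__ == "__main__":
    print("Takes the pill:", accepts_modification(current_utility, pill_utility))
```

The pill is refused because the comparison is made with `current_utility`, not with the utility the agent would hold afterwards.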
It’s easy enough to describe all kinds of AI’s you shouldn’t make: The Terminator, Agent Smith, HAL 9000. Now, all these examples are fictional. They never actually happened. This is an important point to bear in mind. You don’t want to fall prey to the logical fallacy of generalization from fictional evidence. If you build an AI, everything works fine, and you live happily ever after, that doesn’t make a very interesting movie. Nonetheless, the classic cautionary tales do point out some interesting problems that emphasize the importance of certain kinds of laziness.
One of the oldest cautionary tales about artificial agents comes from the Philopseudes of Lucian in 150 C.E., and you may have heard of it as “The Sorcerer’s Apprentice.” Consider a heart transplant operation. When a surgeon opens up a patient, they are not cutting open the patient because they enjoy cutting people. They are opening the patient to do a heart transplant. And they are not doing the heart transplant because they really like shuffling hearts around. They are doing it to save the patient. Saving the patient is an end in itself, at least in my book.
Now let’s say you are creating an artificial moral agent to help with the heart transplant. You specify the goal state as the current heart being outside the patient. Okay, the agent reaches in and rips the heart out. “No, no,” you say. “The goal state also requires the aorta to be intact and don’t cut through the patient’s spine either, or remove the patient’s legs to make them easier to lift.” How do you know to specify all these fine little details of the goal state? How do you know not to cut the patient’s spine? Because you have the terminal value of improving the patient’s health. And you know that cutting the spine will interfere with the patient’s health. That’s how you’re generating all these complicated instructions. So the lazy way of solving this problem is to create an artificial agent that wants the same thing you want: to preserve the patient’s health. If the agent wants the same kind of health you want, and it has a sufficiently good medical model of the patient, it will judge for itself that it should not cut the spinal cord. And, as usual here, the lazy way is more difficult.
Is it necessarily a good thing to have a powerful AI with the terminal value of keeping humans healthy? The classic story “With Folded Hands” by Jack Williamson is about AI’s that try to keep humanity happy by keeping them in nice, safe nursery playpens and lobotomize anyone who isn’t happy enough. Williamson’s AI’s had the terminal values of health and happiness. And who’s against health and happiness? But Williamson’s AI’s did not have terminal values for truth, justice, freedom, individuality, art, music, love, friendship, or any of the other things that we are happy about and stay alive for. The “Folded Hands” AI traded off security against freedom, without caring about the other half of the equation, without having a term for freedom in their utility function. Just because health is a good thing doesn’t mean that an ultra-powerful agent that only cares about health is a good thing.
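The point about a missing term in the utility function can be shown with a few made-up numbers. In this sketch, the two worlds, their scores, and the function names are all assumptions invented for illustration. Nothing here comes from Williamson’s story or from any real system. The only thing it demonstrates is that an optimizer never trades in favor of a value it has no term for.

```python
from typing import Callable, Dict

World = Dict[str, float]

# Hypothetical scores in [0, 1]; the numbers are illustrative only.
WORLDS: Dict[str, World] = {
    "nursery_playpen": {"health": 1.0, "happiness": 0.9, "freedom": 0.0},
    "free_society": {"health": 0.8, "happiness": 0.7, "freedom": 1.0},
}


def narrow_utility(world: World) -> float:
    """Cares about health and happiness only; freedom has no term at all."""
    return world["health"] + world["happiness"]


def broader_utility(world: World) -> float:
    """Same as above, plus a term for freedom."""
    return world["health"] + world["happiness"] + world["freedom"]


def preferred_world(utility: Callable[[World], float]) -> str:
    """Name of the world this utility function ranks highest."""
    return max(WORLDS, key=lambda name: utility(WORLDS[name]))


if __name__ == "__main__":
    print("Narrow agent prefers: ", preferred_world(narrow_utility))
    print("Broader agent prefers:", preferred_world(broader_utility))
```

Adding one term is of course not a solution. As the talk goes on to say, the full list of human terminal values is too long to write out by hand.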
It could be extremely unpleasant to be around an ultra-powerful moral agent that shares only some, rather than all, of your terminal values. And I’ve probably got hundreds of terminal values. I can’t print out the complete list any more than I can print out the positions of all the neurons in my cerebral cortex. The lazy solution would be a meta-moral agent that looks at agents and figures out what their terminal values are. That might be an instruction you can specify more simply than just trying to describe the complete emotional make-up of humans. Or, rather than trying to paint a picture, you polish up a mirror. And a mirror is a simpler thing than a picture. Don’t paint all the tiny ethical details by hand. Make a meta-moral mirror to reflect it all. And this involves moving into a different domain than ordinary morality, I should add, to build an artificial moral agent.
In ancient Greece, slavery was common and the status of women was not much higher. If ancient Greece had possessed the power to look through time, to see our own future, if they had been allowed to peek at us and decide whether we should be allowed to come into existence, they would have vetoed our civilization out of hand for some reason or other. The decay of martial virtue: we no longer rejoice properly in slaying our enemies in hand-to-hand combat. What would the future have looked like if the ancient Greeks had had the capability to build a very powerful AI with their own moral values as fixed constants? This suggests that fixing your own moral values may be an extremely unwise strategy for building an AI. But that doesn’t mean it is wise to shrug and give up.
Our civilization is not the same as ancient Greek civilization, but we are unmistakably their heirs. It’s not that Greek morals were tossed away and new morals rolled up by dice at random. The unpredictability of a superior intelligence is not like the unpredictability of rolling dice. We got here by following the pathway from there. If we had said, “We give up, we won’t teach our children anything,” it would not have led to our future. Human beings are not perfect. But, at least for now, it is only human beings who think this. It is we who have this conception that we are flawed. You will not find that judgment written upon the stars or mountains: they are not minds, they cannot think. It is we who have a sense of a direction that we are going in, and giving up, shrugging, will not push us forward in that direction.
What we need is not to point an AI at our current values, but point an AI at the moral trajectory we would follow over time. There is not really much time, on a cosmic scale, that separates us from ancient Greece. Just a paltry two and a half thousand years. And everyone, now and then, was human. Beyond the Singularity is a much greater gap than the one that separates us from ancient Greece. What is at stake for us is also the future, and we too are not so wise. It turns out to be really hard to think of any way to build an AI that does not automatically doom the galaxy. It’s a hard problem. We cannot try to set down the right path for our children and our children’s children forever, because that is the path that leads to Plato’s utopia. Whether you ask one person, or take a vote, you are equally doomed. All of us together are not wise enough. No more than the ancient Greeks would have been.
We cannot shovel the stars with our own hands. We need an ethical bulldozer. We need the search process, not our own culture’s outputs. The AI needs to go where we’re going, not where we are. This human world, in all its beauty and all its horrible mess, is the starting point. Our wish to be better people defines a direction. Our sense of our own imperfection provides the force that pushes us forward, but if we knew where we were going, we would already be there. So if we can create an AI that mirrors our moral trajectory, is that the ultimate level of laziness?
Human beings can think about this sort of thing. Therefore, it is possible for a cognitive system. If we aren’t lazy enough, if we transfer over our own ideas about friendliness rather than our ability to argue about Friendly AI, the AI will be beneath us because it will have constants where we have thought processes. We have to make an AI that could understand or even output this talk, including this sentence. That is the last layer of laziness: the challenge of Friendly AI.

