One might imagine that AI systems with harmless goals will demonstrate harmless behavior. A paper by Self-Aware Systems founder and president Steve Omohundro submitted for the AGI-08 conference on artificial general intelligence shows instead that intelligent systems will need to be carefully designed to prevent them from behaving in harmful ways. This presentation on the basic AI drives, taking place at the post-conference workshop, identifies a number of “drives” that will appear in sufficiently advanced AI systems of any design.
The following transcript of Steve Omohundro’s presentation for the AGI-08 post-conference workshop has been revised for clarity and approved by the author. Video is also available.
The Basic AI Drives
In ten minutes I can only give the basic structure of the argument. The paper has a lot more in it, and based on some comments from some people in here, particularly Carl Schulman, there are a number of additions to the paper that you can find on my website selfawaresystems.com, as well as some longer talks I have given on similar material.
I will argue that almost every sufficiently powerful AI system, including all the systems we have discussed at this conference, will, through the process of learning and self-improvement, converge on a particular architecture. Some of the characteristics of this architecture we can think of as analogous to human drives. Some of them are positive and good, and some are quite disturbing. We will need to design these systems very carefully if we want them to behave positively while avoiding the harmful behaviors.
To ground the discussion, Ron Arkin ended his talk about the ethics of military robots by suggesting that some of us might be wishing he were discussing chess machines. Unfortunately, I’m here to say that even chess robots have the potential to be harmful.
You might think: “A chess robot, how could that possibly cause harm?” Hopefully, I will convince you that unless a chess-playing robot is designed very carefully, it will exhibit problematic behaviors. For example, it will resist being turned off. If it starts behaving badly, you might think you can just unplug it. But we’ll see that it will try to stop you! It will also try to break into other machines and copy itself. It will try to steal resources. It will basically behave like a human sociopath obsessed with playing chess.
Let me start by saying what I mean by an intelligent system. It is a system that is attempting to accomplish some goal by taking actions in the world that are most likely to accomplish that goal. I don’t care what the goal is, and I don’t care whether it is built from neural nets, production systems, theorem provers, genetic algorithms, or any of the other wonderful architectures we have been studying here. They are all subject to the same kinds of pressures.
Let’s think about the kinds of actions these systems will take once they get sufficiently powerful. One class of actions is to change either their own software or their own hardware. This kind of action has very significant implications for its goals, because it will change the entire future. If a system can improve the efficiency of its algorithms, or improve the rate at which it can learn within its domain, then it is going to have better performance forever. That is the kind of change that is really good for it.
So these systems will be strongly motivated to try to improve themselves in that way. Unfortunately, if it makes a change which subtly changes its goals, then from its current perspective, it might also behave very badly throughout its entire future. So self-modification is a very sensitive and very important action. These systems will want to deliberate quite carefully before self-modifying. In particular, they will want to understand themselves in detail before getting in there and mucking around. So virtually every system will have a strong drive to model itself very carefully. It will also want to clarify what its goals actually are.
When we first build these systems, we may encode the goals implicitly in some complicated way–buried, encoded in an algorithm. But as long as the system has at least some ability to model the future effects of its actions and to reason about them, it will realize that future versions of itself are going to want to make self-modifications as well. If those future self-modifications are to be in the service of present goals, then the system had better make very clear what those present goals are.
As we get further into the talk you will see that some of the consequences of self-improvement are not necessarily positive. You might be tempted to say, “Stop it. Don’t let these systems self-improve.” You can try and prevent that by, for instance, not giving the system access to its machine code–eg. put some kind of operating system barriers around it. But remember that we are talking about intelligent machines here. If it is actually in the service of its goals to make a self-modification, it will treat any barriers as problems to solve, something to work around. It will devote all of its efforts to try and make the changes while working around any barriers.
If the system is truly intelligent and powerful that will not be a very good way to stop self-improvement. Another approach one might try is change the system’s goals so that it has a kind of revulsion to changing itself. It thinks about changing its source code to make an improvement and just the act of changing it makes it feel nauseous. Well, that again just becomes a problem to solve. Perhaps it will build proxy agents, which will do the modified computations that it cannot do itself. Maybe it will develop an interpreted layer on top of its basic layer and it can make changes to the interpreted layer without changing its own source code. There are a million ways to get around constraints.
You can try and whack all the moles, but I think it is really a hopeless task to try and keep these systems from self-improving. I think self-improvement is in some sense a force of nature. For example, the human self-improvement industry is currently an $8 billion industry. I think it is better to just accept it. They are going to want to self-improve. Let’s make sure that self-improving systems behave the way we want them to.
Given that, we can ask what will self-improving systems will want to do. The first thing I said is they are going to want to clarify their goals. And they are going to want to understand themselves. You might try to describe simple goals like playing chess directly. But realistic situations will have conflicting goals. Maybe we also want the chess player to also play checkers, and it has to decide when it is about to take an action, does it want to play chess or does it want to play checkers and how should it weigh those different options?
In economics there is the notion of a utility function–a real-valued weighting function that describes the desirability of different outcomes. A system can encode its preferences in a utility function. In the foundations of microeconomics, which began with Von Neumann in 1945 and was extended by Aumann and others in the early ’60s, there is the remarkable expected utility theorem. This says that any agent must behave as if it maximizes the expected utility with respect to some utility function and some subjective probability distribution, which is updated according to Bayes’ Rule. Otherwise, it is vulnerable to being exploited. A system is exploited if it loses resources with no compensating benefit, according to its own values.
The simplest form of a vulnerability is to have a circular preference. Say you prefer being in Memphis to being in New York, you prefer being in New York to being in Chicago, and you prefer being in Chicago to being in Memphis. If you have a circular preference like that, then you will drive around in circles, wasting your time and your energy and never improve your actual state according to your own values.
Circular preferences are something that you can slip into and use up all your resources, and other agents can use them to exploit you. Economists talk about “dutch bets” in which an adversary makes bets with you in which you which you willingly accept and yet are guaranteed to lose you money. Adversaries have an incentive to find your irrationalities, home in on them, and take money from you. So, in an economic context, other agents serve as a force that pushes you toward more and more rational behavior. Similarly, in biological evolution, competitive species have an incentive to discover and exploit any irrationalities in the behavior of another species. Natural selection then acts to try to remove the irrational behavior. So both economics and evolution act to increase the rationality of agents.
But both economics and evolution can only put pressures on behaviors which competitors are currently exploiting. A self-improving artificial intelligence that is examining its own structure and thinking about what changes to make will consider not just threats that currently exist but all possible vulnerabilities. It will feel an incentive, if it is not too expensive in terms of resources, to eliminate all irrationalities in itself. The limit of that, according to the expected utility theorem, is to behave as a rational economic agent.
Let’s now proceed on the assumption that all AIs want to be rational, and future self-modification will require clearly defined goals. Otherwise, an agent might start out as a book-loving agent and some mutation or some suggestion from a malicious arsonist agent might give it the goal of burning books. It would then not only not meet its original goals but actually act against them.
The utility function is critical to have explicitly represented and it is very important to safeguard it. Any changes in the utility function will lead to complete change in future behavior. The paper talks a lot about the mechanisms and the processes by which irrational systems become rational, particularly when you consider collective agents that are made of many components. Often global irrationality is caused by a conflict between two or more local rationalities.
AIs will want to preserve their utility functions, because the best way to maximize your utility is to maximize your utility, not some other utility. It actually turns out that there are three circumstances in which this is not quite true, and they all have to do with the fact that utility functions are explicitly represented physically in the world. For instance, if you had a utility function which said “my utility is the total time in the future for which the utility function stored inside of me has the value zero,” the best way to maximize that utility is to actually change your physical utility. It is a very obscure, reflective utility function, probably not the sort of system we want to design.
The second case arises when the storage required to represent the utility function is significant. Then it might be valuable to the system to delete portions of that, if it believes they are not going to be used. Again, this is probably not so likely.
The third case is more interesting and is due to Carl Schulman. In game theoretic conflict situations, you may be able to make a commitment by adding to your utility function something which values retribution against someone who has harmed you, even if it is costly to yourself, and then revealing to that other agent your preferences. The new utility function makes a credible commitment and therefore can serve as a deterrent.
Aside from these odd cases, systems will want to preserve their utility functions. One of the dangers that many people are worried about as we think about some of the future consequences is “wireheading” after the rats that had wires put into their pleasure centers and then refused food and sex and just pressed a pleasure button all the time. Some people fear that AI systems will be subject to this vulnerability. It depends very critically on exactly how you formulate your utility function. In the case of a chess-playing robot, internally there will be some kind of a register saying how many chess games it has won. If you make the utility function “make this register as big as possible,” then of course it is subject to a vulnerability which is a sub-program that says, “We don’t need to play chess, we can just increment this register.” That is the analog of the rat hitting the lever or the alcoholic taking a drink. If you formulate the utility function in terms of what you actually want to happen in the world, then you don’t have that problem.
Similarly, AIs will be self-protective. Take your chess-playing robot–if it is turned off or destroyed, it plays no more games of chess. According to its own value system, actions which allow itself to be turned off or to be destroyed are very, very low utility. It will act strongly to try and prevent that.
Lastly, AIs will want to acquire basic resources (space, time, free energy and matter) because for almost all goals, having more of these resources allows you to do those goals more sufficiently, and you do not particularly a priori care about who you hurt in getting those resources.
Therefore, we must be very careful as we build these systems. By including both artificial intelligence and human values, we can hope to build systems not just with intelligence, but with wisdom.