A Short Introduction to Coherent Extrapolated Volition (CEV)
Tuesday, Dec 15 2009, 4:03 am
friendly ai
In 2004, Eliezer Yudkowsky of the Singularity Institute presented Coherent Extrapolated Volition (CEV) as a solution to the AI Friendliness Problem. The basic idea is to extrapolate the preferences of all humanity in such a way that we obtain an output that satisfices those preferences; the CEV then shuts down, its role finished. CEV is currently the most promising theory for building a Friendly AI.
A point I haven’t seen advanced before outside this document, though it seems pretty obvious, is that any AI, to be of any use to humans whatsoever, must use some variation of volition to fulfill human directives. Volition is introduced as follows: there are two boxes, box A and box B. One of the boxes has a diamond. Fred wants the diamond, and asks us for box A. We want him to have the diamond. One problem: the diamond is in box B. The document points out the problem with handing Fred box A:
But I do not simply say: “Well, Fred chose box A, and he got box A, so I fail to see why there is a problem.” There are several ways of stating my perceived problem:
1. Fred was disappointed on opening box A, and would have been happier on opening box B.
2. It is possible to predict that if Fred chooses box A, Fred will look back and wish he had chosen box B instead; while if Fred chooses box B, Fred will be satisfied with his choice.
3. Fred wanted “the box containing the diamond”, not “box A”, and chose box A only because he guessed that box A contained the diamond.
4. If Fred had known the correct answer to the question of simple fact, “Which box contains the diamond?”, Fred would have chosen box B.
Hence my intuitive sense that giving Fred box A, as he literally requested, is not actually helping Fred.
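To make the decision/volition distinction concrete, here is a toy sketch of the two-box story in Python. None of it comes from the CEV document: the names Request, literal_genie and volition_genie are made up for illustration, and the sketch simply assumes the genie already knows Fred's underlying goal, which is of course the hard part.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    goal: str        # what Fred actually wants, e.g. "diamond" (assumed to be known)
    chosen_box: str  # what Fred literally asked for, e.g. "A"


def literal_genie(request: Request, boxes: dict[str, str]) -> str:
    """Honor the decision: hand over whichever box was named."""
    return request.chosen_box


def volition_genie(request: Request, boxes: dict[str, str]) -> str:
    """Honor the volition: hand over the box that contains the goal, if any."""
    for label, contents in boxes.items():
        if contents == request.goal:
            return label
    # No box satisfies the goal, so fall back to the literal request.
    return request.chosen_box


boxes = {"A": "nothing", "B": "diamond"}
fred = Request(goal="diamond", chosen_box="A")

print(literal_genie(fred, boxes))   # the box Fred named
print(volition_genie(fred, boxes))  # the box Fred would have named had he known
```

The two functions disagree only when Fred's belief about the boxes is wrong, which is exactly the case where the literal genie stops being helpful.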
If you find a genie bottle that gives you three wishes, it’s probably a good idea to seal the genie bottle in a locked safety box under your bed, unless the genie pays attention to your volition, not just your decision.
A powerful AI, or genie (big difference), must follow our volition, not just our direct decisions, or it would be dangerous. It is easy to imagine even worse failures based on interpreting the letter rather than the spirit of our requests — for instance, a robot chauffeur designed to take one’s children to school would be viewed as an idiotic or evil entity if it took children to school even if the school were on fire, or covered in two feet of snow. Just decisions are never enough — an AI needs an interpretation of volition. I see some connection between the idea of volition and revealed preference — people often say one thing, for social signaling purposes (often subconsciously), when they actually mean something else, which can sometimes be inferred from how they act, not what they say.
To me, the question is not whether we’ll use some form of extrapolated volition to pilot and direct AGI, but what kind we choose to use. In his paper, Eliezer proposed the following:
In poetic terms, our coherent extrapolated volition is our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted.
Phew! That’s a mouthful. What does he mean by “cohere”? What about “growing up farther together”? (I think that should read “further” — “farther” refers to physical distance.) How can we model growing up further together without actually modeling all 6 billion humans interacting socially? Not all these questions are answered in the document. (Some are.) I still regard it as a good starting point. It’s superior to the prior idea that Eliezer had, which was to create an AI that is a “normative altruist” and uses various “anchors and shapers” to craft a “normative morality”. CEV “cheats” by sucking the metamoral content out of the entire human race, like a gigantic infovorous vacuum machine.
The alternatives to these sorts of extrapolation schemes all involve a programmer directly dictating the goal content of the AI in one way or another, which leaves you wide open to programmer-biased goal systems. Since the goal system of the first self-improving AI could quite plausibly dictate the fate of the universe from that point on, this is probably a bad idea. Other alternatives, like the one proposed by Bill Hibbard, involve direct feedback where humans essentially push buttons for what they like and the AI is eventually supposed to figure out moral philosophy. (Presumably.) The problem with this is that human metamorality is extremely complex and a system that absorbs the surface features without an eye for deep structure is destined to fail in stupid ways.
Humans can learn more or less what moral behavior is from other humans because much of our metamoral framework is already programmed in from an early age. When a child steals a plate full of cookies that are meant for after dinner, and a parent says, “don’t do that!”, unless the child is extremely young, he or she will generally know what they did wrong and why the adult has a problem with it. A poorly programmed AI, on the other hand, would have no metamoral framework. Was it wrong because cookies are inherently evil? Because the AI did not bake the cookies itself? Because AIs are not meant to have cookies? An AI might know all the facts about cookies and their historical context, but that still won’t give it the background it needs to find out why taking the cookies was “wrong”, and to what extent it was “wrong”. If eating the cookies saved a life, it might not be wrong. What if eating a cookie saved a billion lives a billion years in the future? A purely utilitarian AI might exterminate the human race today if it thought that doing so would create the greatest utility in the long term. AIs with hand-coded value systems may not have the “moral common sense” that humans do. Moral facts do not follow from physical facts. In some cases, we are morally biased for meaningless reasons like small changes in the wording of a hypothetical moral dilemma, or other framing effects. How is an AI supposed to make heads or tails of “right” and “wrong”? Giving up is not a choice: we need an AI we can trust with nuclear weapons or worse. More sophisticated extrapolations of revealed preferences seem to be the most sensible pathway.
The moral realists suppose that a sufficiently intelligent AI will figure out “right” and “wrong” because they are self-evident. This is suicide. Right and wrong are not objective things-in-the-world, but human constructions. Murder is not wrong because it’s objectively wrong, but because human moral development over the course of thousands of years has decided that it is wrong most of the time. People worry about this interpretation of morality because they believe it’s a slippery slope, but Joshua Greene’s PhD thesis goes over all the reasons we might be afraid of moral anti-realism and shows that none of them are really compelling. Whether or not we consider moral anti-realism to be good for society, evolutionary psychology and cognitive science show us that it is true, whether we like it or not.
There is a lot of confusion around the idea of Coherent Extrapolated Volition, which I attribute mostly just to people commenting on the concept without reading the easily digestible 28-page document. People will comment on a concept for years without reading a short document actually explaining it. The way this works is that you read the first page, or less, then fill in the gaps with your imagination.
To dispel some of the worst misconceptions about CEV, here is a short list of “6 points about Coherent Extrapolated Volition” that was posted to the SL4 mailing list in July 2005:
1. Coherent Extrapolated Volition is not a majority vote. No human being is asked to actually decide anything.
2. The key word in “Coherent Extrapolated Volition” is “extrapolated”. CEV does not use judgments produced by the sort of human beings that exist today.
3. The CEV writes an AI. This AI may or may not work in any way remotely resembling a volition-extrapolator.
4. The CEV returns one coherent answer. The AI it returns may or may not display any given sort of coherence in how it treats different people, or create any given sort of coherent world.
5. The CEV runs for five minutes before producing an output. It is not meant to govern for centuries.
6. The CEV by itself does not mess around with your life. The CEV just decides which AI to replace itself with.
For a jumping-off point into one discussion about CEV, see this SL4 thread from Oct. 2008: “Just how coherent does CEV have to be?”, which began with a question proposed by Alex Bokov. Kaj Sotala points out that the initial question is answered in the CEV document itself.
If you have any burning questions, check out the PAQ (Previously Asked Questions) portion of the CEV document first. For another very short summary of the CEV concept, see its Wikipedia entry.

December 15th, 2009 at 2:57 pm
By focusing on excessively challenging engineering projects it seems possible that those interested in creating a positive future might actually create future problems - by delaying their projects to the point where less scrupulous rivals beat them to the prize - e.g. see:
Tim Tyler: The risks of caution
http://www.youtube.com/watch?v=Uoj2os5Naw4
December 16th, 2009 at 2:28 am
What about the solution of making the AGI understand the concept of hedons and maximize them?
A CEV-based AGI would indeed be inferior to a hedon-maximizing AGI, because a civilization’s volition would probably concern only itself, whereas a hedon-based solution would take into account the potential happiness of any kind of being (existing and non-existing).
December 16th, 2009 at 8:36 am
Grog,
That’s magic: the word “hedons” can’t be executed on a computer. The purpose of CEV is to make one little step in the direction of spelling out explicitly what we want.
December 16th, 2009 at 2:00 pm
If “hedons” are so important (which presumably “they” are, though “they” are not currently a well-defined concept) they will emerge in the coherence between our volitions.
December 16th, 2009 at 11:45 pm
What if the best possible future involves our extinction and replacement by beings that would be a thousand times happier than us? I don’t think our CEV could allow that.
December 17th, 2009 at 3:02 pm
Grog, there is no such thing as the objective “best possible future”, only the “best possible future” according to ourselves or ourselves if we had more information. So you are making a distinction between two things that are not separate.
December 18th, 2009 at 2:38 am
Ok I see.
Out of curiosity: if an AGI convinced you, using CEV arguments, that humanity had to die in order to maximize happiness on earth (by the creation of new beings), and if you had the power to shut down the AGI or let it go, what would you do?
December 18th, 2009 at 12:12 pm
Shut it down. A kludge that is sometimes mentioned in association with CEV theory is the possibility of peeking at the output in advance (or maybe a model of that output) and issuing a veto on it if it looks bad. Though that gets us into a lot of thorny issues, I think I’d feel safer with that in place.
This ties in with the issue of how much we should extrapolate humanity’s volition. A too-far extrapolation might plop us down in a foreign world, though it’s hard to tell. Also, what if coherence doesn’t emerge? It seems plausible that it would, but it’s also hard to tell to what degree. If the possibility of hardware-level self-modification were included as part of the extrapolation, the cognitive diversity of the “coherent” state could be many times greater than our current state — would we want that?
Hopefully, we will gain better knowledge of what we’re doing with better theory and empirical feedback from toy systems.