General Summary of FAI Theory

(Originally posted to the SL4 mailing list.)

1). An AGI will not act like an “enslaved” human, a resentful human, an emotionally repressed human, or any other kind of human. See CFAI: Beyond Anthropomorphism.

2). Friendliness content is the morality stuff that says “X is good, Y is bad”. Friendliness content is what we think the FAI *should* do. FAI theory describes *how* to get an FAI to do what you want it to do. See Eliezer’s Coherent Extrapolated Volition.

3). FAI theory is damn hard; it is much harder than Friendliness content. So far as I know, nobody knows how to make sure that some AGI design reliably produces paperclips, which is much simpler than ensuring reliable Friendliness. Keep in mind that the Friendliness content must be maintained during recursive self-improvement, or the FAI may wind up destroying us all on programming iteration #1,576,169,123.

4). CEV is a way of deriving Friendliness content from humanity’s collective cognitive architecture. CEV is a morality-constructor, not a morality in and of itself; if you speak programming, think of CEV as a function that takes the human race as an argument and returns a morality.

5). Goal systems naturally maintain themselves (under most conditions). If the AGI has a supergoal of X, changing to a supergoal of X’ will mean that less effort is put towards accomplishing X. Because the AGI *currently* has a supergoal of X, the switch will therefore be seen as undesirable. It’s not like you have to point a gun at the AGI’s head and say, “Do X or else!”; no external coercion is necessary. See CFAI: External reference semantics.

6). An AGI has the goals we give it. It does not have human-like goals such as “reproduce”, “survive”, “be nice”, “get revenge”, “avoid external manipulation”, etc., unless we insert them or they turn out to be useful for the fulfillment of supergoals. See CFAI: Observer-biased beliefs.

7). The vast, vast majority of goal systems assign a positive utility to the destruction of the human species. The species’ actual destruction would be more complicated than this, but essentially, more energy, matter, computing power, etc. are almost always desirable, and so the AGI won’t stop consuming the planet for its own use until it runs out of matter.

8). Just because the AGI can do something doesn’t mean it will. This is what Eli calls the Giant Cheesecake Fallacy: “A superintelligent AGI could make huge cheesecakes, cheesecakes larger than any ever made before; wow, the future will be full of giant cheesecakes!” Some examples of this in action:

“The AGI, being superintelligent, has all the computational power it needs to understand natural language. Therefore, it will start analyzing natural language, instead of analyzing the nearest random quark.”

“The AGI will be powerful enough to figure out exactly what humans mean when they give an instruction. Therefore, the AGI will choose to obey the intended meanings of human instructions, rather than obey the commands of the nearest lemur.”

9). In general, it is much easier to work with simple examples than complicated examples. If you can’t do the simple stuff, you can’t do the complicated stuff. If you can’t prove that an AGI will flood the universe with paperclips and not iron crystals, you can’t prove that an AGI will be Friendly.

10). The meta-FAI rule: Any AGI which, when run, produces an FAI is Friendly. In fact, any superintelligent AGI which, when run, does anything at all besides cause large amounts of random destruction is Friendly.

Representation and Utility

The Old Testament God’s moral system is fairly well understood, at least among the intelligentsia: the most important rule is that everyone must worship Jehovah or else, and if thousands of innocent people get in the way, well, tough luck for them. Few nowadays would claim to follow such a moral system, if they understood what they were following. Even fewer would actually implement the utility function of this system and commit murder when “God” commands it. Yet a lot of people, including many mainstream Christians, are fully aware that the Old Testament was really nasty.

Most of us have a representation of the Old Testament’s utility function, although it may be imprecise or inaccurate. This does not mean that we use this utility function when making moral decisions. Children usually learn this concept around the age of three or four; it is possible for other minds to have different moralities and belief systems, and so you might wind up disagreeing with someone else on whether X is good or bad. “I think X is good”, “Person A thinks X is good” and “Person B thinks X is good” are all logically distinct statements.

This seems to get lost in translation during discussions of FAI theory. The tricky part isn’t getting the AGI to represent our moral system; it would be very easy to program a rabid-data-miner AGI to store every neuronal pulse and every scrap of human writing in some huge databank. The problem is, such an AGI would have storing lots of data as a supergoal, rather than anything resembling Friendliness, and would continue growing the databank until it consumed the solar system. So much for the glorious future of all humankind.

Once the AGI has the human moral system encoded, there’s no particular reason for it to actually follow the human moral system. To most AGIs, we’re just semi-random agglomerations of CHON atoms; why should the AGI listen to us? Getting the AGI to listen to us requires solving the wishing problem, but building a simple, represent-the-human-moral-system AGI doesn’t even get that far. It just kills us all, not because it misinterpreted what we wanted, but because it never cared what we wanted, any more than a rock cares when it crushes your head.

Conservation of Expected Utility

Suppose that there are n mutually exclusive possibilities qi in Q, each with some probability p(qi) (the p(qi) sum to 1) and some utility U(qi). If some new Bayesian evidence A is expected to come in, using the generalized version of Bayes’ Theorem, you can calculate the updated probabilities for the qi: p(Q|A) = (p(A|Q) o p(Q)) / (p(A|Q) . p(Q)), where p(A|Q) is the vector of likelihoods p(A|qi), o is the elementwise product and . is the dot product. But you also have to take into account the possibility that the counter-evidence, ~A, will show up. The total expected probability, p’(Q), can be stated as p(Q|A)*p(A) + p(Q|~A)*p(~A); note that this is a vector of probabilities, where each element is p(qi|A)*p(A) + p(qi|~A)*p(~A) for some given qi in Q. By Conservation of Expected Evidence, before the evidence actually shows up, p’(Q) must equal p(Q).

The utility of Q, U(Q), is found by taking the sum over p(qi)*U(qi). Computing the expected utility after A/~A doesn’t alter the utility term, but the expected posterior probability is substituted for the original probability, to give EU(qi) = (p(qi|A)*p(A) + p(qi|~A)*p(~A)) * U(qi). But by CoEE, the term in parentheses must equal p(qi), so EU(qi) must equal p(qi)*U(qi), and summing over the qi gives EU(Q) = U(Q). This can be readily generalized to more complicated systems, such as Bayesian networks. In nontechnical terms, every expectation of good news (information that increases the future expected utility of the universe, from the agent’s point of view) must be counterbalanced by an equal and opposite expectation of bad news. No, you cannot rationalize a reason why outcome A would be surprisingly good and outcome ~A would also be surprisingly good; that’s cheating.
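The conservation argument can be checked numerically. The sketch below is only an illustration: the three possibilities, their priors, likelihoods and utilities are made-up numbers, and exact fractions are used so that the equalities hold without rounding error.

```python
from fractions import Fraction as F

# Hypothetical numbers: three mutually exclusive possibilities q0, q1, q2.
prior = [F(1, 2), F(3, 10), F(1, 5)]        # p(qi), sums to 1
likelihood = [F(9, 10), F(1, 2), F(1, 10)]  # p(A | qi)
utility = [F(10), F(-4), F(3)]              # U(qi)

def posterior(prior, likelihood):
    """Bayes' Theorem in vector form: elementwise product over dot product."""
    joint = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(joint)                   # p(A), the dot product
    return [j / evidence for j in joint], evidence

def expected_utility(probs):
    """Sum over p(qi) * U(qi)."""
    return sum(p * u for p, u in zip(probs, utility))

# Update on A, and on the counter-evidence ~A (likelihoods 1 - p(A | qi)).
post_a, p_a = posterior(prior, likelihood)
post_not_a, p_not_a = posterior(prior, [1 - l for l in likelihood])

# Expected posterior, computed before the evidence actually shows up.
expected_post = [a * p_a + b * p_not_a for a, b in zip(post_a, post_not_a)]

eu_before = expected_utility(prior)
eu_after = (expected_utility(post_a) * p_a
            + expected_utility(post_not_a) * p_not_a)

print(expected_post == prior)   # Conservation of Expected Evidence
print(eu_before == eu_after)    # Conservation of Expected Utility
```

Changing the likelihoods changes how far each individual posterior moves, but the weighted average of the two posteriors always lands back on the prior.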


Some handy numbers on how much various organizations have raised. All data on nonprofits is from, amounts are for the United States only.

World Transhumanist Association: Less than $25,000, not required to file

Lifeboat Foundation: $8,910 (2006)

Immortality Institute: $21,940 (2006)

American Cryonics Society: $28,240 (2005)

Singularity Institute: $101,700 (2005)

Foresight Institute: $916,400 (2005)

National Space Society: $1,029,000 (2006)

Greenpeace Inc.: $15,829,000 (2006)

Amnesty International: $43,160,000 (2006)

Self-Referential Agents

(General note: Here lowercase letters denote individual variables, while uppercase letters denote a set of those variables. E.g., you can define a state s, and then the set of all states S.)

Consider a standard universal Turing machine. For any state s and input i, the behavior of the machine is defined by the partial function d: S x I -> S x I x {L, R}, where x is the Cartesian product. Suppose that you had many possible functions (d0… dn) in D, and that in addition to specifying the state of the machine and the input/output data for the tape, the functions di also specified which function was to be used for the next calculation. Allowing functions to reference themselves creates an infinite recursion, so add a new state variable a to the Turing machine, which specifies which function is to be used for the current computation; assign each of the functions a tag ai in A, so that the functions are of the form di: S x I -> S x I x {L, R} x A. Note that |D| <= |A| < |(S x I x S x I x {L, R} x A)|; the number of functions which can be specified using S, I and A is always greater than the number of tags in A, so the Turing machine can never specify every function which it is capable of constructing.

Self-referential Turing machines and standard Turing machines are fully equivalent, in the one-can-emulate-the-other sense. Emulation of a TM by an SRTM is extremely simple; just set the cardinality of A and D to 1 and make the element d0 equal to the function d of the Turing machine to be emulated. Going the other way, construct a Turing machine with S’ = S x A, and specify a single function d: S’ x I -> S’ x I x {L, R}. Every mapping in the di of the SRTM can be duplicated in d by setting the input and output components s’ equal to the (s, ai)s of the old function.
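The second direction of the emulation can be sketched in code. Everything here is an illustrative assumption rather than part of the original construction: the dictionary encoding of the partial functions, the helper names run_srtm, flatten and run_tm, the blank symbol 0, and the tiny two-function machine used as test data.

```python
def run_srtm(funcs, tape, state, tag, steps):
    """Step an SRTM: the current tag a picks which function di applies."""
    tape, head = dict(enumerate(tape)), 0
    for _ in range(steps):
        key = (state, tape.get(head, 0))    # unwritten cells read as 0
        if key not in funcs[tag]:           # partial function: undefined = halt
            break
        state, symbol, move, tag = funcs[tag][key]
        tape[head] = symbol
        head += 1 if move == "R" else -1
    return state, tag, tape

def flatten(funcs):
    """Build the single function d over S' = S x A from the di."""
    flat = {}
    for tag, func in funcs.items():
        for (state, symbol), (s2, i2, move, a2) in func.items():
            flat[((state, tag), symbol)] = ((s2, a2), i2, move)
    return flat

def run_tm(func, tape, state, steps):
    """Step an ordinary Turing machine with one transition function."""
    tape, head = dict(enumerate(tape)), 0
    for _ in range(steps):
        key = (state, tape.get(head, 0))
        if key not in func:
            break
        state, symbol, move = func[key]
        tape[head] = symbol
        head += 1 if move == "R" else -1
    return state, tape

# Hypothetical SRTM with two tagged functions, d0 and d1.
funcs = {
    0: {("s0", 0): ("s1", 1, "R", 1), ("s0", 1): ("s0", 0, "R", 0)},
    1: {("s1", 0): ("s0", 1, "L", 0), ("s1", 1): ("s1", 1, "R", 1)},
}
tape = [1, 0, 0, 1]

srtm_state, srtm_tag, srtm_tape = run_srtm(funcs, tape, "s0", 0, 20)
tm_state, tm_tape = run_tm(flatten(funcs), tape, ("s0", 0), 20)

print(tm_state == (srtm_state, srtm_tag), tm_tape == srtm_tape)
```

The flattened machine carries the tag inside its state, so the two runs stay in lockstep: the same cells are written and the same moves are made at every step.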

To handle additional complexity, consider a probabilistic SRTM, whose functions di map each S x I state to a probability distribution over a set of results Q = (S x I x {L, R} x A). Although this does not add algorithmic power (the data generated by the probabilistic SRTM can be replicated by constructing a different deterministic SRTM for every q in Q), it helps enormously when calculating total expected utility. A probabilistic SRTM, when computed out to infinity, gives a probability distribution over a countably infinite number of states. However, constructing a deterministic SRTM for every q requires a total of |Q|^n machines for n computations; this results in an uncountable infinity of machines for an infinite number of computations, which cannot be summed over.

Consider an SRTM which functions as an agent (in the AIXI sense), receiving inputs xi from the environment at one point along the tape and outputting yi to the environment at another point. A classical agent functions to maximize the expected utility over some utility function U. An SRTM, however, may seek to maximize many utility functions (u1… un), with one utility function for each ai. Now, for the Quadrillion-Dollar Question: What probability distributions over A for the partial functions D will maximize the expected utilities for their utility functions U? AIXI has already been proven to be an optimal agent; however, we are concerned with agents which optimally self-modify, even if every possible modification is sub-optimal. An absolutely optimal agent, of course, would have no need to self-modify.

This question turns out to be fairly easy to answer, but only for a drastically simplified SRTM. The first simplification is to ignore the yi, and only evaluate the U(yi), which can be assumed to be the outputs of an arbitrary algorithm E(X). The second simplification is to assume a linear or time-invariant SRTM. The behavior of the SRTM is assumed to be fixed, regardless of the time or any variable which may depend on the time, such as s or the yi. A linear SRTM will always have the same probability distribution over U(yi) and ai, although both can still vary with respect to ai.

A simplified agent SRTM can represent itself and its future behavior by computing a probability distribution over the U(yi) and ai, for every partial function in D. By Rice’s Theorem, there is no general procedure for deciding whether an arbitrary partial function has any given, nontrivial property. However, this does not mean that partial functions are forever dark and unknowable: a probability distribution over possible behaviors can be computed using standard induction and Bayesian reasoning. An element b (in the probability distribution over B, calculated for a single di) can be represented as a vector summing to one for the probability distribution over A and an arbitrary scalar for the U(yi). Please note that this is a meta-probability distribution, which must be kept distinct from the probability distribution over A. Saying that a given di has a 30% probability of always going to a1 and a 70% probability of always going to a2 is quite different from saying that it will always have a 30% probability of going to a1 and a 70% probability of going to a2 (during any given computation).
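The difference between the two statements shows up as soon as more than one computation is considered. In this sketch the function name prob_all_a1 and the encoding of a meta-distribution as (weight, p_a1) pairs are assumptions made for illustration; the 30%/70% figures are the ones from the paragraph above.

```python
from fractions import Fraction as F

def prob_all_a1(meta_distribution, n):
    """Probability that the function goes to a1 on each of n computations.

    meta_distribution is a list of (weight, p_a1) pairs: with probability
    `weight`, the function's fixed per-computation chance of a1 is `p_a1`.
    """
    return sum(w * p_a1 ** n for w, p_a1 in meta_distribution)

# 30% chance of "always a1", 70% chance of "always a2".
always_one_or_other = [(F(3, 10), F(1)), (F(7, 10), F(0))]

# Certainly a function that goes to a1 with probability 30% each time.
fixed_coin = [(F(1), F(3, 10))]

for n in (1, 2, 3):
    print(n, prob_all_a1(always_one_or_other, n), prob_all_a1(fixed_coin, n))
```

For a single computation the two descriptions agree at 3/10. Over two computations the first still gives 3/10 for "a1 both times", while the second gives 9/100.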

The probability distribution for the entire SRTM’s behavior can be calculated by taking the Cartesian product of the individual probability distributions over the di. This probability distribution is over (M, EU), where M is the set of all the tuples of possible vectors and EU is the set of all the tuples of U(yi). Each m can be represented as a standard Markov matrix, with each vector in the tuple as a column. Given a probability distribution over ai, the probability distribution one computation later can be calculated by taking the matrix product m * (p.d.). A probability distribution which is invariant when transformed by m (one where m * (p.d.) = (p.d.)) is an eigenvector of m for the eigenvalue 1. This eigenvector represents an equilibrium state, because it is unaffected by further computations and may continue indefinitely into the future.
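A small worked instance of the matrix product and the equilibrium condition follows. The 2x2 matrix is a made-up example, and the helper name step is an assumption; each column holds the distribution over the next tag given the current tag, matching the column convention above.

```python
from fractions import Fraction as F

def step(m, p):
    """One computation: the matrix product m * p for a column-stochastic m."""
    return [sum(m[i][j] * p[j] for j in range(len(p))) for i in range(len(m))]

# Hypothetical Markov matrix over two tags a0 and a1.
# Column 0: from a0, stay with 9/10, switch with 1/10.
# Column 1: from a1, switch with 1/2, stay with 1/2.
m = [[F(9, 10), F(1, 2)],
     [F(1, 10), F(1, 2)]]

start = [F(1), F(0)]                 # certainly in a0
after_one = step(m, start)

# Solving m * p = p by hand for this matrix gives p = (5/6, 1/6).
equilibrium = [F(5, 6), F(1, 6)]

print(after_one)
print(step(m, equilibrium) == equilibrium)
```

Applying step to the equilibrium vector returns it unchanged, which is exactly the statement that it is an eigenvector of m for the eigenvalue 1.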

It turns out that, for any initial probability distribution, the SRTM will usually tend towards the eigenvector’s probability distribution as more computations are done. The probability distribution can be represented as a vector, which in turn can be represented as a linear combination of all eigenvectors (for all eigenvalues). The probability distribution after one computation can then be calculated by multiplying each eigenvector’s coefficient by its corresponding eigenvalue, and summing; because a Markov matrix cannot have eigenvalues of magnitude greater than one, the coefficient of every eigenvector whose eigenvalue has magnitude less than one will go to zero, and the eigenvector for an eigenvalue of one will equal the future probability distribution. (This assumes that one is the only eigenvalue of magnitude one and that it is not repeated, which holds, for example, whenever every entry of m is positive.)
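The convergence can be watched directly by repeated multiplication. This is a sketch on a made-up 2x2 matrix whose entries are all positive; its eigenvalues are 1 and 0.4 (the trace minus one), and its equilibrium, found by solving m * p = p by hand, is (5/6, 1/6).

```python
def step(m, p):
    """One computation: the matrix product m * p for a column-stochastic m."""
    return [sum(m[i][j] * p[j] for j in range(len(p))) for i in range(len(m))]

# Hypothetical Markov matrix with eigenvalues 1 and 0.4.
m = [[0.9, 0.5],
     [0.1, 0.5]]
equilibrium = [5 / 6, 1 / 6]

p = [0.0, 1.0]                       # start certainly in the second tag
deviations = []
for _ in range(60):
    # Largest componentwise distance from the equilibrium distribution.
    deviations.append(max(abs(x - e) for x, e in zip(p, equilibrium)))
    p = step(m, p)

print(p)                             # close to the equilibrium
print(deviations[1] / deviations[0]) # shrink factor per computation
```

Each computation multiplies the deviation from equilibrium by the second eigenvalue, so the distance shrinks geometrically.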

For every m, the eigenvectors can be calculated using standard linear algebra (for everyone who’s been hopelessly confused, this is what my Python program does). The utility of the SRTM can be computed using the classical expected utility equation; for every state of the SRTM, you take the expected utility (U(yi)) of a computation in that state, multiply by the expected number of computations, and sum. This corresponds to the dot product of the matrix m and utility vector eu. The expected utility equation is then brought into play again, by taking the E.U. of the entire probability distribution over (M, EU). For any given switch between one probability distribution over A and another, the E.U. can also be calculated, because the expected deviation from the equilibrium state forms a geometric series and so will sum to a finite number. The actual calculation is done by computing the linear combination of both probability distributions over the eigenvectors, taking the sum over each of the eigenvectors’ components out to an arbitrarily large number of computations (forming a geometric series), and then taking the dot product with the utility vector eu.
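The geometric-series step can be illustrated on a two-tag example. This is not the program mentioned above, only a sketch under stated assumptions: the matrix, the utility vector eu and the starting distribution are made up, and the helper names step, dot and excess_utility are hypothetical. With two tags, any deviation from equilibrium lies along the single remaining eigenvector, so the series has ratio equal to the second eigenvalue.

```python
from fractions import Fraction as F

def step(m, p):
    """One computation: the matrix product m * p for a column-stochastic m."""
    return [sum(m[i][j] * p[j] for j in range(len(p))) for i in range(len(m))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

m = [[F(9, 10), F(1, 2)],
     [F(1, 10), F(1, 2)]]
eu = [F(1), F(4)]                    # hypothetical U(yi) for each tag
equilibrium = [F(5, 6), F(1, 6)]     # solves m * p = p for this m
second_eigenvalue = F(2, 5)          # trace(m) - 1 for a 2x2 Markov matrix

def excess_utility(p0, steps):
    """Utility gained over the equilibrium rate, summed over `steps` computations."""
    total, p = F(0), p0
    for _ in range(steps):
        total += dot(p, eu) - dot(equilibrium, eu)
        p = step(m, p)
    return total

start = [F(0), F(1)]

# Geometric series: first term / (1 - ratio).
closed_form = (dot(start, eu) - dot(equilibrium, eu)) / (1 - second_eigenvalue)

print(closed_form, float(excess_utility(start, 100)))
```

The truncated sum approaches the closed form as the number of computations grows, which is why the expected utility of a switch stays finite even over an unbounded future.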

It is trivial to show that the probability distribution over A with the highest expected utility must have a probability of 1 for the ai with the highest U(yi), and a probability of zero for all other a, regardless of m. For a simple maximum utility calculation, this effectively eliminates much of the complexity in the probability distribution; the distribution over M is irrelevant, and only the distribution over EU need be considered. However, this complexity may be necessary if the probability distribution is constrained, e.g., it cannot have a probability of exactly 1 due to real-world uncertainties.

Fun with Proven Invariants

Something which I thought of a while ago, and have recently remembered:

Conjecture: Any optimization process which is mathematically proven to always maintain an abstract invariant can never be perfectly Friendly.

Proof: Suppose that the invariant-maintaining optimization process is captured by a mad scientist in the dead of night. The mad scientist has rigged the entire planet with Mark 3000 Annihilators, programmed to go off the instant he gives the signal. The mad scientist then places the optimization process in a sealed box made of unobtainium, through which the mad scientist can observe the optimization process in utmost detail, but the process cannot directly alter any events outside. The mad scientist then gives the process a decision: modify verself so as to destroy the abstract invariant, or watch the world be destroyed. Obviously, the Friendly decision is self-modification; even if the optimization process will go horribly wrong and start killing everyone, it can’t (by magic assumption) kill everyone faster than the mad scientist. Yet, we have already proven mathematically that the process *cannot* self-modify in such a way as to destroy the invariant, in any situation. Hence, it must make the unFriendly decision, and so it can never be perfectly Friendly. QED.

Prediction Markets

Compared to forecasting presidential elections and bird flu outbreaks, predicting crude oil prices is easy: demand is usually fairly predictable, and supply increases/decreases can be forecast decades in advance. Yet, the historical data shows that both prediction markets and “expert analysis” are often wrong by a factor of two or more. We should have learned by now that we cannot trust greed and economic wisdom to accurately predict the future. Long Term Capital Management tried it, using a model designed by Nobel Prize-winning economists, and look what happened to them.

Utility Function Computation Program v. 2

I’ve added a new multiplication algorithm, and it should now run much faster (tenfold speedup on test data). I’ve also programmed it to output the partial derivative of the utility function, with respect to each one of the variables. Be warned that the runtime and the data output appear to grow exponentially with the number of system states; the five-state test input, included in the download, produces about 1.8 MB of output data, or around five hundred printed pages. By decree, I am not responsible if your computer gets overloaded and crashes, as has happened to me several times; use at your own risk.

 Download Program + Data Files