(General note: Here lowercase letters denote individual variables, while uppercase denote a set of those variables. Eg, you can define a state s, and then the set of all states S.)
Consider a standard universal Turing machine. For any state s and input i, the behavior of the machine is defined by the partial function d: S x I -> S x I x {L, R}, where x is the Cartesian product. Suppose that you had many possible functions (d0… dn) in D, and that in addition to specifying the state of the machine and the input/output data for the tape, the functions di also specified which function was to be used for the next calculation. Allowing functions to reference themselves creates an infinite recursion, so add a new state variable a to the Turing machine, which specifies which function is to be used for the current computation; assign each of the functions a tag ai in A, so that the functions are of the form di: S x I -> S x I x {L, R} x A. Note that |D| <= |A| < |(S x I x S x I x {L, R} x A)|- the number of functions which can be specified using S, I and A is always greater than the number of functions in A, so the Turing machine can never specify every function which it is capable of constructing.
Self-referential Turing machines and standard Turing machines are fully equivalent, in the one-can-emulate-the-other sense. Emulation of a TM by an SRTM is extremely simple; just set the cardinality of A and D to 1 and make the element d0 equal to the function d of the Turing machine to be emulated. Going the other way, construct a Turing machine with S’ = S x A, and specify a single function d: S’ x I -> S’ x I x {L, R}. Every mapping in the di of the SRTM can be duplicated in d by setting the input and output components s’ equal to the (s, ai)s of the old function.
To handle additional complexity, consider a probabilistic SRTM, whose functions di map each S x I state to a probability distribution over a set of results Q = (S x I x {L, R} x A). Although this does not add algorithmic power (the data generated by the probabilistic SRTM can be replicated by constructing a different deterministic SRTM for every q in Q), it helps enormously when calculating total expected utility. A probabilistic SRTM, when computed out to infinity, gives a probability distribution over a countably infinite number of states. However, constructing a deterministic SRTM for every q requires a total of |Q|n machines for n computations; this results in an uncountable infinity of machines for an infinite number of computations, which cannot be summed over.
Consider an SRTM which functions as an agent (in the AIXI sense), receiving inputs xi from the environment at one point along the tape and outputting yi to the environment at another point. A classical agent functions to maximize the expected utility over some utility function U. An SRTM, however, may seek to maximize many utility functions (u1… un), with one utility function for each ai. Now, for the Quadrillion-Dollar Question: What probability distributions over A for the partial functions D will maximize the expected utilities for their utility functions U? AIXI has already been proven to be an optimal agent; however, we are concerned with agents which optimally self-modify, even if every possible modification is sub-optimal. An absolutely optimal agent, of course, would have no need to self-modify.
This question turns out to be fairly easy to answer, but only for a drastically simplified SRTM. The first simplification is to ignore the yi, and only evaluate the U(yi), which can be assumed to be the outputs of an arbitrary algorithm E(X). The second simplification is to assume a linear or time-invariant SRTM. The behavior of the SRTM is assumed to be fixed, regardless of the time or any variable which may depend on the time, such as s or the yi. A linear SRTM will always have the same probability distribution over U(yi) and ai, although both can still vary with respect to ai.
A simplified agent SRTM can represent itself and its future behavior by computing a probability distribution over the U(yi) and ai, for every partial function in D. By Rice’s Theorem, it is impossible to prove that an arbitrary partial function has any given, nontrivial property. However, this does not mean that partial functions are forever dark and unknowable: a probability distribution over possible behaviors can be computed using standard induction and Bayesian reasoning. An element b (in the probability distribution over B, calculated for a single di) can be represented as a vector summing to one for the probability distribution over A and an arbitrary scalar for the U(yi). Please note that this is a meta-probability distribution, which must be kept distinct from the probability distribution over A. Saying that a given di has a 30% probability of always going to a1 and a 70% probability of always going to a2 is quite different from saying that it will always have a 30% probability of going to a1 and a 70% probability of going to a2 (during any given computation).
The probability distribution for the entire SRTM’s behavior can be calculated by taking the Cartesian product of the individual probability distributions over the di. This probability distribution is over (M, EU), where M is the set of all the tuples of possible vectors and EU is the set of all the tuples of U(yi). Each m can be represented as a standard Markov matrix, with each vector in the tuple as a column. Given a probability distribution over ai, the probability distribution one computation later can be calculated by taking the matrix product m * (p.d.). A probability distribution which is invariant when transformed by m- one where m* (p.d.) = (p.d.)- is an eigenvector of m for the eigenvalue 1. This eigenvector represents an equilibrium state, because it is unaffected by further computations and may continue indefinitely into the future.
It turns out that, for any initial probability distribution, the SRTM will tend towards the eigenvector’s probability distribution as more computations are done. The probability distribution can be represented as a vector, which in turn can be represented as a linear combination of all eigenvectors (for all eigenvalues). The probability distribution after one computation can then be calculated by multiplying each eigenvector’s coefficient by its corresponding eigenvalue, and summing; because a Markov matrix cannot have eigenvalues greater than one, all other coefficients will go to zero, and the eigenvector for an eigenvalue of one will equal the future probability distribution.
For every m, the eigenvectors can be calculated using standard linear algebra (for everyone who’s been hopelessly confused, this is what my Python program does). The utility of the SRTM can be computed using the classical expected utility equation; for every state of the SRTM, you take the expected utility (U(yi)) of a computation in that state, multiply by the expected number of computations, and sum. This corresponds to the dot product of the matrix m and utility vector eu. The expected utility equation is then brought into play again, by taking the E.U. of the entire probability distribution over (M, EU). For any given switch between one probability distribution over A and another, the E.U. can also be calculated, because the expected deviation from the equilibrium state forms a geometric series and so will sum to a finite number. The actual calculation is done by computing the linear combination of both probability distributions over the eigenvectors, taking the sum over each of the eigenvectors’ components out to an arbitrarily large number of computations (forming a geometric series), and then taking the dot product with the utility vector eu.
It is trivial to show that the probability distribution over A with the highest expected utility must have a probability of 1 for the ai with the highest U(yi), and a probability for zero for all other a, regardless of m. For a simple maximum utility calculation, this effectively eliminates much of the complexity in the probability distribution; the distribution over M is irrelevant, and only the distribution over EU need be considered. However, this complexity may be necessary if the probability distribution is constrained, eg., it cannot have a probability of exactly 1 due to real-world uncertainties.