Newcomb’s Paradox

For simplicity, I will avoid formally introducing Newcomb’s paradox, or any of the various philosophical issues surrounding it. I will also shamelessly sidestep the issue of a perfect predictor; a perfect predictor of Turing machines in general seems to require a halting oracle, and the paradox should still work with an ordinary human who has really good psychological knowledge and can therefore predict accurately 90% of the time.

The heart of Newcomb’s paradox is what Scott Aaronson calls first-order rationality: a case where utility attaches to beliefs directly, rather than only to the actions which flow from those beliefs. As an extremely simple example of first-order rationality: suppose you have a letter in a sealed envelope which you strongly believe to be both accurate and surprising, and someone points a gun at you and tells you they’ll shoot you if you open it. You probably shouldn’t open it, even though you’ll predictably end up with less accurate beliefs. It seems that you can’t get a human to genuinely disbelieve something they already know to be true without introducing other… issues, which leads to a great deal of confusion, but any human with reasonably unimpaired cognition should have little trouble avoiding a bullet in this scenario.

If a human is deciding whether to put $1M in the box, the obvious thing to do is to try to influence that human in some way: shine a light in their eyes, inject them with morphine, weave wonderful tales about what you would do with the $1M, etc. By an act of magic, none of these work; the only thing the human considers is the predicted result of your cognitive algorithm for how many boxes to take. The only way to influence the future is through its dependence on your currently held ideas. Which currently held ideas would be best?

It seems that a reasonably good solution is to evaluate the problem using a meta-algorithm: evaluate the potential cognitive algorithms available and see which one produces the best result. A mind with a one-box algorithm will predictably receive $1M, while a mind with a two-box algorithm will predictably not receive $1M. The direct consequences, the ones which depend on external actions directly, are $1K in favor of two-boxing. But the indirect consequences, which depend directly on which algorithm you use, are $1M in favor of one-boxing, far outweighing the $1K even with some uncertainty added.
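The comparison can be made concrete with a back-of-the-envelope expected-value calculation, assuming the merely-human 90%-accurate predictor from above and the conventional $1K/$1M box contents (the specific function and constant names here are just illustrative):

```python
ACCURACY = 0.9  # the "really good" human predictor assumed above

def expected_value(one_box: bool) -> float:
    """Expected winnings, given that the predictor correctly guesses
    which algorithm you are running with probability ACCURACY."""
    if one_box:
        # Predicted correctly: the opaque box holds $1M.
        # Predicted incorrectly: the opaque box is empty.
        return ACCURACY * 1_000_000 + (1 - ACCURACY) * 0
    # Predicted correctly: only the transparent $1K.
    # Predicted incorrectly: $1K plus the mistakenly filled $1M.
    return ACCURACY * 1_000 + (1 - ACCURACY) * 1_001_000

print(expected_value(one_box=True))   # roughly 900,000
print(expected_value(one_box=False))  # roughly 101,000
```

One-boxing wins by nearly an order of magnitude; the predictor would have to be barely better than chance (below about 50.05% accuracy, where the two expected values cross) before two-boxing pays.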

A meta-algorithm should be able to beat any paradox which directly rewards arbitrary cognitive content, e.g., Kavka’s toxin puzzle. The other problem with that puzzle is that it would be difficult for a human (although not an AI) to implement an algorithm which actually results in them drinking the toxin, rather than a pseudo-algorithm under which they “intend” to drink it (for various definitions of intent) but actually won’t. This is easily fixable in principle, e.g., by rigging a time bomb to the toxin which will explode and kill you if you don’t consume it. The ideal meta-algorithm is fully self-consistent over time, selecting an algorithm which prefers X and then actually doing X, so it should be able to handle even a perfect predictor by avoiding deliberate deception.

First-order rationality is also applicable to game theory, e.g., the Prisoner’s Dilemma, or even the True Prisoner’s Dilemma. Assuming that the two players know something about each other, selecting an algorithm which cooperates always has the direct effect of losing points, but it may also have the indirect effect of gaining points by increasing the probability that the other player will cooperate. Since the goodness comes from the indirect effects, which are real but dependent on the other player’s algorithm, I dispute Eliezer’s assertion that one can always find a way to cooperate: if the other player is simply a rock which will fall off a shelf and land on the DEFECT button, it would be criminal stupidity not to “defect” as well.
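This trade-off can be sketched numerically. The sketch below assumes the conventional Prisoner’s Dilemma payoffs (3 for mutual cooperation, 5 and 0 for defecting against a cooperator, 1 for mutual defection; the post itself fixes no numbers) and compresses the indirect effect into a single probability that the other player’s move matches yours:

```python
# Conventional payoff table (an assumption; any payoffs with
# T > R > P > S illustrate the same point). Keys are
# (my_move, their_move); values are my payoff.
PAYOFF = {
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def expected_payoff(my_move: str, p_match: float) -> float:
    """My expected payoff if the other player's move matches mine
    with probability p_match (the 'indirect effect' of my choice
    of algorithm) and is the opposite move otherwise."""
    other = "D" if my_move == "C" else "C"
    return (p_match * PAYOFF[(my_move, my_move)]
            + (1 - p_match) * PAYOFF[(my_move, other)])

# Against a highly correlated player (a near-copy of you), cooperating
# has the higher expected payoff; against an uncorrelated player, the
# direct effect dominates and defection wins.
print(expected_payoff("C", 0.9), expected_payoff("D", 0.9))
print(expected_payoff("C", 0.5), expected_payoff("D", 0.5))
```

With these payoffs, cooperation wins exactly when p_match exceeds 5/7. A rock that always defects gives a cooperator p_match of zero, which is why matching its defection is the only sane play.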

3 thoughts on “Newcomb’s Paradox”

  1. Tom, there is no claim that you should cooperate with a defecting rock. You choose the action that determines the best outcome, and this determination can happen in many different forms, sometimes defying the intuitive notion of actions leading to consequences. The simplest example is synchronized deterministic algorithms playing the prisoner’s dilemma: since they always give the same answer, cooperation is clearly the better action, even though it doesn’t cause the other player to cooperate in the naive sense. Look at your model of the other player as a probability distribution. If that distribution says that your cooperation, which is entangled with all of your causal history, predicts other player’s cooperation, then you cooperate. There are no direct or indirect effects, there is only actual effect, and it is a whole thing.

  2. “If that distribution says that your cooperation, which is entangled with all of your causal history, predicts other player’s cooperation, then you cooperate. There are no direct or indirect effects, there is only actual effect, and it is a whole thing.”

    In this case, the direct effect of cooperating is always bad (it is always better to defect, assuming the other player’s behavior is fixed). If your cooperation influences the other player to cooperate strongly enough, this may outweigh the direct effect of you cooperating, as you gain points when the other player cooperates.

  3. This “direct” effect that can be opposite to predictable actual effect exists only in the model you use in your cognitive algorithm, and I don’t think it’s a good way to describe unnatural situations like this.
