For simplicity, I will avoid introducing Newcomb's paradox formally, along with the various philosophical issues surrounding it. I will also shamelessly sidestep the issue of a perfect predictor; any perfect predictor of Turing machines in general seems to require a halting oracle, and the paradox should still work if you just use an ordinary human with good enough psychological knowledge to predict accurately 90% of the time.
The heart of Newcomb’s paradox is what Scott Aaronson calls first-order rationality: a case where utility attaches to beliefs directly, rather than only to the actions which flow from beliefs. As an extremely simple example of first-order rationality: if you have a letter in a sealed envelope which you strongly believe to be both accurate and surprising, and someone points a gun at you and tells you they’ll shoot you if you open it, you probably shouldn’t open it, even though you’ll predictably end up with less accurate beliefs. It seems that you can’t get a human to genuinely disbelieve something they already know to be true without introducing other… issues, which leads to a great deal of confusion; but any human with reasonably unimpaired cognition should have little trouble avoiding a bullet in the previous scenario.
If a human is deciding whether to put $1M in the box, the obvious thing to do is to try to influence the human in some way: to shine a light in their eyes, or inject them with morphine, or weave wonderful tales about what you would do with the $1M, etc., etc. By some act of magic, none of these work, and the only thing the human considers is the predicted result of your cognitive algorithm for how many boxes to take; the only way to influence the future is through its dependence on your currently held ideas. Which currently held ideas would be best?
It seems that a reasonably good solution is to evaluate the problem using a meta-algorithm: evaluate the potential cognitive algorithms available and see which one produces the best result. A mind with a one-box algorithm will predictably receive $1M, while a mind with a two-box algorithm will predictably not. The direct consequences, the ones which depend on external actions alone, favor two-boxing by $1K. But the indirect consequences, which depend on which algorithm you use, favor one-boxing by $1M, far outweighing the $1K even with some uncertainty added.
A meta-algorithm should be able to beat any paradox which directly rewards arbitrary cognitive content, e.g., Kavka’s toxin puzzle. The other problem with that puzzle is that it would be difficult for a human (although not for an AI) to implement an algorithm which actually results in drinking the toxin, rather than a pseudo-algorithm under which they “intend” to drink it (for various definitions of intent) but actually won’t. This is easily fixable in principle, e.g., by rigging a time bomb to the toxin which will explode and kill you if you don’t consume it. The ideal meta-algorithm is fully self-consistent over time: it selects an algorithm which prefers X and then actually does X, so it should be able to handle even a perfect predictor by avoiding deliberate deception.
First-order rationality is also applicable to game theory, e.g., the Prisoner’s Dilemma, or even the True Prisoner’s Dilemma. Assuming that the two players know something about each other, selecting an algorithm which cooperates always has the direct effect of losing points, but it may also have the indirect effect of gaining points by increasing the probability that the other player will cooperate. Since the goodness comes from the indirect effects, which are real but dependent on the other player’s algorithm, I dispute Eliezer’s assertion that one can always find a way to cooperate: if the other player is simply a rock which will fall off a shelf and land on the DEFECT button, it would be criminal stupidity not to “defect” as well.
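The dependence on the other player's algorithm can be made concrete with a small sketch. This is a hypothetical model of my own construction, using standard Prisoner's Dilemma payoffs (3 for mutual cooperation, 5 for exploiting, 1 for mutual defection, 0 for being exploited): with probability `p_mirror` the other player's algorithm ends up matching yours, and otherwise they defect unconditionally, which with `p_mirror = 0` is exactly the rock on the DEFECT button.

```python
# Hypothetical sketch: my expected payoff as a function of how strongly
# the other player's move is correlated with my algorithm. Standard
# (assumed) Prisoner's Dilemma payoffs, keyed by (my move, their move).
PAYOFF = {
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def expected_payoff(my_move: str, p_mirror: float) -> float:
    # With probability p_mirror the other player mirrors my move;
    # otherwise they defect regardless. p_mirror = 0 is the rock.
    return (p_mirror * PAYOFF[(my_move, my_move)]
            + (1 - p_mirror) * PAYOFF[(my_move, "D")])

for p in (0.0, 0.5, 0.9):
    best = max(("C", "D"), key=lambda m: expected_payoff(m, p))
    print(f"p_mirror={p}: best move is {best}")
```

In this toy model, defecting always yields 1 point, while cooperating yields 3 × p_mirror, so cooperation only pays once the correlation exceeds 1/3. Against the rock (p_mirror = 0), cooperating scores 0 and defecting scores 1, which is the point of the objection: the indirect benefit of a cooperating algorithm exists only insofar as the other player's algorithm actually responds to it.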