The Aumann Game
Posted by steven on 11 Jan 2008 at 01:36 pm | Tagged as: Rationality, Probability
Aumann’s agreement theorem says that Bayesian agents cannot “agree to disagree” — their subjective probabilities must be identical if they are common knowledge. This is true regardless of differences in private knowledge. When agents take turns stating their estimates, updating each time based on the information contained in the other’s estimate, private knowledge will “leak out” and the probabilities will converge to an equilibrium.
This theorem makes some big assumptions. One is common knowledge of honesty. Another is common priors. Another is common knowledge of Bayesianity. However, Robin Hanson has shown that uncommon priors require origin disputes, and has discussed agents who are “Bayesian wannabes” but not Bayesians.
It may be interesting to see how this process plays out with real humans in a simplified test bed. Below are 25 statements.
To play, for each statement, you have to say your honest subjective probability that it’s true. Make sure to take into account the estimates of previous commenters. You are strongly encouraged to post estimates multiple times, showing how the estimates of others have caused yours to change. We will then see whether, as the theorem suggests, everyone’s estimates converge to the same equilibrium over time, and whether that equilibrium is any good.
I’ve divided the statements into a few categories. For the “statistics” category I used NationMaster and StateMaster. For “history” I used Wikipedia. For “future”, please answer all questions conditional on no disruptive technologies like molecular nanotechnology and artificial general intelligence being invented. This makes the questions rather vague, so I’m not really happy with this category. For “counterfactual”, please answer conditional on the many worlds interpretation of quantum mechanics being true; even if it isn’t, it’s still a well-defined model, so the question is meaningful either way. For “internet”, I always included quote marks.
The answers in the “statistics”, “history”, and “internet” categories are easy to look up, but that would defeat the point. So no peeking allowed. Looking up any relevant information is peeking.
Discussion of the statements other than through stating probabilities is also against the spirit of the game. Feel free to ask for clarifications, though.
To reward honest estimates, in the end I may score people on the answers, using the rule where your number of points is the logarithm of the probability you assigned to the right answer.
(update: this was tried again less messily and with more suitable questions here, here, and here)
***
Statistics
1. Oregon has more inhabitants than Slovakia.
2. Ghana has a greater GDP (PPP) than Luxembourg.
3. In 1900, Denmark had a greater GDP per capita than Spain.
4. Ohio emits more CO2 than Poland.
5. Afghanistan has more land area than Alaska.
6. Croatia has a greater GDP per capita than Mexico.
History
7. George Orwell was born before 1900.
8. Vladimir Putin was born before 1955.
9. The tenth emperor of Rome wore a beard.
10. More than 5000 Americans died in the attack on Pearl Harbor.
Future
11. If the USA has a president in 2067, it will be a woman.
12. A 1000-qubit quantum computer will exist in 2020.
13. A nuclear (fusion or fission) weapon will be used in an attack before 2010.
14. Switzerland will join NATO before 2100.
15. Proof of life on Mars (past or present, not originating on Earth) will be found before 2050.
Counterfactual
16. In a randomly selected parallel Everett world splitting from ours on 1 Jan 1940, Hitler invades England before 1950.
17. IARSPEWSFOO 1 Jan 1940, Hitler invades the USA before 1950.
18. IARSPEWSFOO 1 Jan 1, a technological singularity happens before 1500.
19. IARSPEWSFOO 1 Jan 1, nuclear war kills at least ten million people in any five year period before 2000.
20. IARSPEWSFOO 1 Jan 1900, nuclear war kills at least ten million people in any five year period before 2000.
Internet
21. “brain” gets more google results than “heart”.
22. “Ray Kurzweil” gets more google results than “Sonic the Hedgehog”.
23. “John Paul II” gets more google results than “Ron Paul”.
24. “Iraq” gets more google results than “Italy”.
25. “death” gets more google results than “purple”.
Let me take a first stab at the ones I can participate in:
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11. 30%
12. 25%
13. 5%
14. 10%
15. 20%
16. 10%
17. 5%
18. 20%
19. 50%
20. 30%
21.
22.
23.
24.
25.
@18 - what definition of technological singularity are you using?
Anything that results in the creation of much-greater-than-human intelligence, I guess.
1. 25
2. 45
3. 85
4. 15
5. 15
6. 65
7. 35
8. 35
9. 25
10. 20
11. 17
12. 19
13. 4
14. 25
15. 7
16. .01
17. .001
18. .03
19. .06
20. .03
21. 80
22. 35
23. 55
24. 65
25. 98
1. pop OR Slovakia 35
2. GDP Ghana Lux 0.01
3. 1900 GDP Denmark Spain 92
4. more carbon Ohio Poland 15
5. area Afghanistan Alaska 50
6. Croatia Mexico 70
7. Orwell before1900 35
8. Putin before1955 70
9. beard emperor 15
10. 5000dead at Pearl 15
11. 2067 female president 22
12. 2020 1000-qubit 19
13. 2010 nuke 7
14. 2100 Swiss NATO 15
15. 2050 Martians 0.25
16. invades England 0.01
17. invades US 0.001
18. 1500 singularity 1.5
19. yr1 10 megadeaths 35
20. 1900 10 megadeaths 25
21. brain heart 90
22. Kurzweil SonicHedgehog 15
23. JohnPaulII RonPaul 35
24. Iraq Italy 65
25. death purple 95
life on Mars
Please make this phrase more specific, e.g., please change it to the following disjunction:
Either Earth-life started out on Mars and then migrated to Earth or Earth-life started out on Earth and Mars-life started out on Mars (and Mars-life exists or existed).
1. 25
2. 15
3. 85
4. 15
5. 20
6. 65
7. .001
8. 55
9. 20
10. 5
11. 30
12. 30
13. 2
14. 15
15. 15
16. .05
17. .001
18. .1
19. 10
20. 8
21. 80
22. 10
23. 70
24. 65
25. 95
1. 50
2. 10
3. 60
4. 30
5. 20
6. 40
7. 35
8. 75
9. 5
10. 10
11. 33
12. 40
13. 5
14. 10
15. 65
16. 5
17. .001
18. .001
19. 5.001
20. 5
21. 40
22. 3
23. 50
24. 60
25. 90
Regarding the statement that a nuclear weapon will be used before 2010:
Since it is in the future category, I would take that to mean that Hiroshima and Nagasaki don’t count.
Also, the definition of “used” would be relevant.
An exploded test weapon could be counted as used.
Usually the implied definition of “used” would be an exploded bomb with human casualties.
Then there is the matter of how a nuclear weapon is defined.
Usually: A nuclear weapon is a type of explosive weapon that derives its destructive force from nuclear reactions of fusion or fission.
But some include dirty radiological devices.
Conceivably, new devices for generating a lot of neutrons or particle beams could be weaponized and have a nuclear component.
(not looking at other responses first)
1. 50
2. 10
3. 45
4. 70
5. 40
6. 60
7. 0.01
8. 80
9. 10
10. 20
11. 10
12. 50
13. 8
14. 5
15. 10
16. 10
17. 2
18. 0.001
19. 5
20. 2.5
21. 40
22. 15
23. 60
24. 80
25. 75
1. pop OR Slovakia 35
2. GDP Ghana Lux 0.01
3. 1900 GDP Denmark Spain 92
4. more carbon Ohio Poland 15
5. area Afghanistan Alaska 40
6. Croatia Mexico 70
7. Orwell before1900 0.1
8. Putin before1955 60
9. beard emperor 15
10. 5000dead at Pearl 7
11. 2067 female president 22
12. 2020 1000-qubit 19
13. 2010 nuke 7
14. 2100 Swiss NATO 15
15. 2050 Martians 0.25
16. invades England 0.5
17. invades US 0.001
18. 1500 singularity 0.5
19. yr1 10 megadeaths 30
20. 1900 10 megadeaths 20
21. brain heart 90
22. Kurzweil SonicHedgehog 9
23. JohnPaulII RonPaul 30
24. Iraq Italy 65
25. death purple 95
I notice Mr. Hollerith’s dissent on 23. Steven, are you asking if John Paul II has received more hits than Ron Paul for the entire time that google has been online? If so, I would argue that there are many more Catholics than people interested in Ron Paul. If you are asking if John Paul II has received more hits recently, I will adjust my estimate downwards. You may also want to specify how recently. By the way, I enjoy reading your blog.
Thanks, everyone.
I meant search results (the number in the top right when you type the phrase into google). I realized just now that many people probably interpreted it as “number of people searching”. I’ve also changed the nukes and Mars questions.
OK - this looks like fun - not looking at anyone’s answers..
1 20
2 0
3 60
4 60
5 70
6 50
7 25
8 40
9 80
10 30
11 55
12 90
13 20
14 5
15 30
16 40
17 1
18 1
19 60
20 60
21 30
22 20
23 70
24 70
25 40
23. John Paul II Ron Paul 70
11. 2067 female president 20
12. 2020 1000-qubit 20
13. 2010 nuke 4
14. 2100 Swiss NATO 20
15. 2050 Martians 10
16. invades England 3
17. invades US 0.1
18. 1500 singularity 3
19. yr1 10 megadeaths 25
20. 1900 10 megadeaths 30
Note that question 2 asks about total GDP, not GDP per capita.
Also note that looking at other people’s answers is a good thing in this game.
Steven: Subjective probabilities are not the same thing as quantum measures. I will write about this on overcoming bias. Would you expect a superintelligence to have the amounts of uncertainty that you are attributing to quantum measures?
I agree that quantum uncertainty is only one of multiple sources of subjective uncertainty, and that this is still true for superintelligences (who still have to deal with incomplete knowledge, chaos, etc etc). The way I asked the question you have to take all of them into account; note that I’m not asking for your most likely estimate for the fraction of Everett worlds in which the event happens, but rather your expectation value for this fraction. I can see how you could read it differently though.
1. 25
2. 14
3. 79
4. 33
5. 20
6. 63
7. 9
8. 61
9. 16
10. 9
11. 17
12. 21
13. 4
14. 20
15. 4
16. .8
17. .06
18. 6
19. 12
20. 8
21. 70
22. 12
23. 58
24. 74
25. 98
In other words, when I put a 3% probability that in a randomly selected Everett world event X happens, that does not mean I’m claiming it happens in 3% of Everett worlds; just that if I had a machine that picked a random Everett world, I would put a 3% probability on it picking a world with event X, because 3% is my subjective expectation value for the fraction of worlds in which event X happens. Does that make sense?
Looking back I agree I asked the question in an ambiguous way, and these are bad questions to use without some way to rephrase them in a way that makes all this clear.
1. 25
2. 15
3. 85
4. 20
5. 20
6. 67
7. .001
8. 60
9. 15
10. 8
11. 20
12. 25
13. 4
14. 15
15. 10
16. 1
17. .05
18. 2
19. 10
20. 8
21. 75
22. 10
23. 68
24. 67
25. 95
Update my question 7 (Orwell) to 3%
1. 25 (OR pop > Slovakia)
2. 11 (Ghana GDP > Lux)
3. 85 (1900 DK GDP/c > ES)
4. 25 (Ohio CO2 > PL)
5. 33 (Af area > AK)
6. 65 (Cro GDP/c > MX)
7. 1 (Orwell born 500)
11. 29 (woman president 2067)
12. 23 (1000 qubits 2020)
13. 4 (nuclear attack by 2010)
14. 20 (Swiss NATO by 2100)
15. 10 (Mars life by 2050)
16. 1.5 (Hitler England by 1950)
17. .05 (Hitler US by 1950)
(*** Not including worlds where Hitler is dead/out of power but Nazi Germany invades)
18. 6 (Singularity by 1500)
19. 40 (nuke war before 2000, splitting 1)
20. 25 (same splitting 1900)
21. 65 (brain > heart)
22. 9 (Ray Kurzweil > Sonic the Hedgehog)
23. 60 (John Paul II > Ron Paul)
24. 70 (Iraq > Italy)
25. 95 (death > purple)
On second thought, #20 should have been 35.
1. 25
2. 14
3. 79
4. 33
5. 20
6. 63
7. 4
8. 61
9. 16
10. 9
11. 17
12. 21
13. 4
14. 20
15. 4
16. .8
17. .06
18. 6
19. 25
20. 15 (Note, I think that in most fairly distant possible worlds diverging in 1900 nukes were probably not developed till much later in the 20th century)
21. 70
22. 9
23. 58
24. 74
25. 97
Unless anyone thinks it’s a bad idea, I’ll reveal the answers soon.
These are my final estimates.
1. 25
2. 13
3. 85
4. 22
5. 22
6. 65
7. .001
8. 60
9. 15
10. 7
11. 22
12. 25
13. 4
14. 20
15. 10
16. 1
17. .05
18. 5
19. 22
20. 14
21. 72
22. 9
23. 68
24. 72
25. 96
These disappeared because of a rogue less-than sign:
8. 65
9. 15
10. 5
OK, here goes.
1. Oregon has more inhabitants than Slovakia.
No, it’s 3.6 vs 5.4 million.
2. Ghana has a greater GDP (PPP) than Luxembourg.
Yes, $49B vs $31B. Nominal would have been $11B vs $36B. (I’m not 100% sure I didn’t let the choice of PPP/nominal depend on the outcome.)
3. In 1900, Denmark had a greater GDP per capita than Spain.
Yes, $2.9k vs $2.0k.
4. Ohio emits more CO2 than Poland.
Turns out the StateMaster statistic is just CO2 from power generation; from some googling it looks like the total is something like 250M vs 300M metric tons. Guess I’ll score this as a “no”.
5. Afghanistan has more land area than Alaska.
No, 650k sq km vs ~1500k sq km
6. Croatia has a greater GDP per capita than Mexico.
Yes, apparently unless you use constant 2000 dollars. I’ll score this as a “yes”; sorry for the ambiguity.
7. George Orwell was born before 1900.
No, 1903
8. Vladimir Putin was born before 1955.
Yes, 1952
9. The tenth emperor of Rome wore a beard.
No
10. More than 5000 Americans died in the attack on Pearl Harbor.
No, 2388.
21. “brain” gets more google results than “heart”.
No, 250M vs 750M
22. “Ray Kurzweil” gets more google results than “Sonic the Hedgehog”.
No, 0.3M vs 6M
23. “John Paul II” gets more google results than “Ron Paul”.
Yes, 7M vs 2M
24. “Iraq” gets more google results than “Italy”.
No, 270M vs 720M
25. “death” gets more google results than “purple”.
Yes, 600M vs 250M
So that’s:
1. No
2. Yes
3. Yes
4. No (?)
5. No
6. Yes (?)
7. No
8. Yes
9. No
10. No
21. No
22. No
23. Yes
24. No
25. Yes
First link should have been http://www.nationmaster.com/graph/eco_gdp_percap-economy-gdp-per-capita
Damn! I jumped to the conclusion that question 2 (GDP Ghana Lux) concerned per capita GDP, and I did not read steven’s clarification before the answers were revealed. The 0.01 probability I assigned to question 2 drastically lowers my score. Speaking of score, I will calculate it.
Here are the scores as promised. Jaynes has advised the use of decibels rather than straight probabilities, and that is what I have done here. In particular, a person’s score on each question is 10 * log_10 p, where p is the probability the person assigned to the actual outcome, and the person’s total score is the sum over all scored questions. The best possible score is 0. The worst, negative infinity.
Fifty Fifty is a fictional contestant who answered 50% to each question without even bothering to read the question. If you have not worked with decibels before, one hint is that a multiplicative decrease of exactly one half translates into an additive decrease of almost exactly 3 decibels. So, for example Mr Fifty Fifty’s score goes down almost exactly 3 decibels every time a new question is added to the exam.
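The decibel rule described above can be sketched in a few lines of Python. This is only an illustration, not the Emacs Lisp code used for the scores in this thread: the function name `decibel_score` and the `ANSWERS` list are hypothetical, with the answers transcribed from the reveal earlier in the thread (questions 1-10 and 21-25).

```python
import math

def decibel_score(guesses_pct, outcomes):
    """Total of 10*log10(p) over all questions, where p is the probability
    assigned to what actually happened. guesses_pct holds percentages that
    each statement is true; outcomes holds the matching booleans."""
    total = 0.0
    for guess, truth in zip(guesses_pct, outcomes):
        p = guess / 100.0 if truth else 1.0 - guess / 100.0
        if p <= 0.0:
            # A probability of zero on the actual outcome is unrecoverable.
            return float("-inf")
        total += 10.0 * math.log10(p)
    return total

# Assumption: the 15 scored answers as revealed in the thread
# (questions 1-10, then 21-25).
ANSWERS = [False, True, True, False, False, True, False, True, False, False,
           False, False, True, False, True]

# Mr Fifty Fifty loses about 3 decibels per question.
fifty_fifty = decibel_score([50] * len(ANSWERS), ANSWERS)
print(round(fifty_fifty, 2))
```

With 15 scored questions, Fifty Fifty’s total comes out near -45 decibels, in line with the list of scores below.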
Now if I had had more ambition I would have reported a score that is independent of the number of questions on the exam. That is, I would have divided the scores below by the number of questions, and under that way of reporting, again, Mr Fifty Fifty would have scored almost exactly -3 decibels (to be exact, 10 times (/ (log 0.5) (log 10)), where that Lisp expression equals -0.30102999566398114, so about -3.0103 decibels) and the best contestant would have scored -2 point something decibels. The Emacs Lisp code I wrote to compute the scores is at http://dl4.jottit.com/0014
Matt Duing -31 decibels
Nick Tarleton -32 decibels
Michael Vassar -33 decibels
Sam Freund -42 decibels
Fifty Fifty -45 decibels
Richard Hollerith -67 decibels
Trond Nilsen -1052 decibels
I set the zero answer that Trond gave to 1e-99 so that my code would not have to handle zero. Trond’s actual score was negative infinity. (Sorry, Trond.) There were two questions (4 and 6) that it turns out we do not confidently know the answers to. Ignoring those questions does not change the rankings significantly:
Matt Duing -28 decibels
Nick Tarleton -29 decibels
Michael Vassar -29 decibels
Sam Freund -34 decibels
Fifty Fifty -39 decibels
Richard Hollerith -64 decibels
Trond Nilsen -1045 decibels
Ignoring question two, the one that killed Trond and me, also does not change the relative rankings significantly:
Nick Tarleton -22 decibels
Matt Duing -23 decibels
Michael Vassar -24 decibels
Richard Hollerith -27 decibels
Sam Freund -32 decibels
Trond Nilsen -42 decibels
Fifty Fifty -42 decibels
Let us remember that the reason steven ran this contest was not to see who was the best guesser but rather to investigate Aumann’s agreement theorem. My report of course does not address Aumann.
I overlooked a contestant. Please insert into the first of the three lists of scores above the following:
Chris K. Haley -35 decibels
Richard, thanks for doing the calculations. When I have time I’ll compare “before” and “after” for those of you who submitted multiple lists.
Richard, thanks for the scores.
Quite welcome.
I now think that my decision to use decibels was a mistake: better to use percentages because people already have intuitions about raw exam scores that range on a scale of 0 to 100. For example people have an intuition that if you guess randomly on a test of true and false questions you will probably obtain a raw score around 50 where “raw score” means the score before any adjustment to take into account how the other students in the class did. (Do teachers still “grade on a curve”?) So, I recommend the following scoring rule. People here gave answers as percentages. Convert to fractions. Compute the probability the contestant assigned to the right answer. E.g., if the right answer was False and the guess was 60%, then the probability the contestant assigned to the right answer is 0.4. Multiply these probabilities together. To make the score independent of the number of questions in the contest, take the nth root of the product. Finally, convert that back into a percentage.
Note that the log of the nth root of the product of the probabilities is the same as the mean of the logs of the probabilities. That little observation helped me gain confidence that taking the nth root of the product was the right way to make the score independent of the number of questions. (So futzing with decibels had a benefit after all.)
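As a rough sketch of the percentage rule just described, the following Python computes the nth root of the product by way of the mean of the logs, which is the identity noted above. The function names `percentage_score` and `percentage_from_decibels` are made up for this illustration.

```python
import math

def percentage_score(guesses_pct, outcomes):
    """Geometric mean of the probabilities assigned to the right answers,
    reported as a percentage."""
    probs = [g / 100.0 if truth else 1.0 - g / 100.0
             for g, truth in zip(guesses_pct, outcomes)]
    if min(probs) <= 0.0:
        return 0.0  # a single zero wipes out the whole product
    mean_log = sum(math.log(p) for p in probs) / len(probs)
    return 100.0 * math.exp(mean_log)

def percentage_from_decibels(total_db, n_questions):
    """Convert a total decibel score over n questions to the same percentage."""
    return 100.0 * 10.0 ** (total_db / (10.0 * n_questions))

# A 60% guess on a statement that turned out false counts as 0.4.
print(percentage_score([60], [False]))
# Answering 50% everywhere scores 50 regardless of the number of questions.
print(percentage_score([50, 50, 50], [True, False, True]))
```

The second helper only restates the relationship between the two reporting styles: dividing a decibel total by the number of questions and undoing the logarithm gives the percentage score.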
Hi. These are the scores you thought you would get. Richard Hollerith helped with this.
Name, Score, Expected score
Sam Freund, -41.9220198906102, -33.96893112451401
Richard Hollerith, -66.69080905047066, -26.385717743807255
Matt Duing, -31.4028751412668, -29.121340250372107
Nick Tarleton, -31.882810736933315, -30.107761006904667
Michael Vassar, -33.00420801219202, -31.07457444132983
Trond Nilsen, -1052.448909087748, -36.92831070598267
Fifty Fifty, -45.15449934959718, -45.15449934959718
Chris K. Haley, -35.32494502430252, -33.4749664347078
Everybody was overconfident.
Same as above, but this time in percentage form, as suggested by Richard.
Name, Score, “Expected” Score
Sam Freund, 67.96916277, 73.13483325
Richard Hollerith, 54.10492057, 78.42546981
Matt Duing, 74.88390772, 76.47414656
Nick Tarleton, 74.55362366, 75.78250485
Michael Vassar, 73.78756316, 75.11068146
Trond Nilsen, 0.006168852, 71.16832708
Fifty Fifty, 65.97539554, 65.97539554
Chris K. Haley, 72.227107, 73.46832428
One interpretation would be that when Sam Freund says he is 73.1% sure of something, you should hear it as “67.9%”, and so on.
Hm, the above calculations must be flawed because Fifty Fifty should have a percentage score of 50.
The corrected percentage scores. The problem was that I forgot that only 15 questions were scored, not 25.
Sam Freund, 52.54360514, 59.36636517
Richard Hollerith, 35.92482046, 66.69529764
Matt Duing, 61.75149824, 63.95253023
Nick Tarleton, 61.29822916, 62.99144823
Michael Vassar, 60.25206648, 62.0634867
Trond Nilsen, 9.63106E-06, 56.72980124
Fifty Fifty, 50, 50
Chris K. Haley, 58.1433923, 59.81822984
Thanks, that’s pretty interesting. So can we now conclude things like (in this small sample) Matt Duing was slightly more knowledgeable than Chris K. Haley, but also slightly less well-calibrated?
Let me summarize briefly something I learned. We used two different ways of reporting scores in this thread: decibels and percentages. Note that if you use percentages, then “the math” requires you to multiply scores together to get composite scores. (To form a composite decibel score, you add scores together.) So for example if a teacher uses the kind of percentage scores we used in this thread, then if there are four tests in a class, to get the final grade in the class, the teacher multiplies together the four test scores. (I would suggest taking the fourth root of the product, but “the math” does not require that.) Also note that if a student scores zero percent on even a single question, then the student’s final grade in the class must be zero. Moreover, if GPAs are calculated the same way, then if a student ever scores zero on a single question, the student’s GPA goes to zero and stays there for the rest of his academic career! Obviously, if schools actually used this method of computing GPAs then students would have to be taught from kindergarten never, ever to assign a probability of zero to a choice on an exam. Using decibel scores is the same way because again, as soon as a student assigns a probability of zero to a choice that turns out to be the correct answer, the student’s GPA goes immediately to negative-infinity decibels and stays there.
Let me explain what I meant above by “the math requires”. If we do the naive thing that most of my schoolteachers did (correct answers score 100%, incorrect answers 0%) then if the student is 60% sure the answer to a question is A, the student maximizes his score (or grade) by assigning a 100% “probability” to A. In this thread, we have been using proper scoring rules, which have the property that it maximizes a student’s score, grade or GPA for the student honestly to report his probabilities. See Technical Explanation for a longer explanation of this point.
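A small numerical check can make the point about proper scoring rules concrete. The sketch below assumes a student who is 60% sure of A and searches a grid of possible reports for the one with the highest expected score, under a linear rule (a stand-in for the naive all-or-nothing grading) and under the logarithmic rule. The helper names are hypothetical.

```python
import math

def expected_linear(belief, report):
    """Expected score when the score is the probability reported
    for whichever answer turns out to be right."""
    return belief * report + (1 - belief) * (1 - report)

def expected_log(belief, report):
    """Expected score when the score is the log of that probability."""
    return belief * math.log(report) + (1 - belief) * math.log(1 - report)

def best_report(expected, belief, grid):
    """Report on the grid that maximizes the expected score."""
    return max(grid, key=lambda report: expected(belief, report))

grid = [i / 100 for i in range(1, 100)]  # 0.01 .. 0.99, avoiding log(0)
belief = 0.6

# Linear rule: the best report is pushed to the edge of the grid.
print(best_report(expected_linear, belief, grid))
# Log rule: the best report is the student's actual belief.
print(best_report(expected_log, belief, grid))
```

Under the linear rule exaggeration pays, while under the log rule the honest 0.6 is the best available report, which is the property that makes the rule proper.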
Technical Explanation has material on the expected score that Peter introduced into the conversation. The part of the material that helped me the most is as follows:
Suppose you had probabilities red:25%, blue:50%, green:25%. Let’s think in base 2 for a moment, to make things simpler. Your expected score is:
red: scores -2 bits, flashes 25% of the time
blue: scores -1 bit, flashes 50% of the time
green: scores -2 bits, flashes 25% of the time
expected score: -1.50 bits
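The arithmetic in that example can be reproduced directly. This is a minimal sketch, with `expected_score_bits` being a name chosen here for illustration:

```python
import math

def expected_score_bits(distribution):
    """Expected log2 score when the stated probabilities are also the true
    frequencies: the sum of p * log2(p) over all outcomes."""
    return sum(p * math.log2(p) for p in distribution.values() if p > 0)

lights = {"red": 0.25, "blue": 0.5, "green": 0.25}
print(expected_score_bits(lights))  # -1.5
```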
So can we now conclude things like . . . Matt Duing was slightly more knowledgeable than Chris K. Haley, but also slightly less well-calibrated?
I just learned expected score, so if Peter says different, listen to him, but I think that is correct: Matt’s discrimination was slightly better but Chris’s calibration was slightly better.
Since I just learned the notion of the expected score, now is a good time for me to try to explain it to others. So here goes.
Suppose that you answer 90% on ten questions and that it turns out that 9 of your answers are right and one of them is wrong, which makes you perfectly calibrated. Under the (percentage, not decibel) method of scoring that I described yesterday, your score is not 90% but rather 72.24%: the tenth root of 90% * 90% * . . . * 90% * 10% == 72.24%. (My scoring method takes the nth root where n is the number of questions because that way, adding another question to the test will tend to keep the score the same if the new question is neither easier nor harder than the other questions. The math requires us to multiply because that is what you do to probabilities of events to get the probability of the conjunction of the events.) A shorter way to write the same thing is, 72.24% == the tenth root of 90% ^ 9 * 10% ^ 1, which in turn is the same thing as 72.24% == 90% ^ 90% * 10% ^ 10%. In general, if p is a test taker’s answer or guess, then the test taker’s expected score == p ^ p * (1 - p) ^ (1 - p). (Under the scoring rule used by most writers on probability including Eliezer, the expected score is the log of that last quantity. It is okay for our scoring rule to be different as long as we remember to multiply scores to form a composite score rather than add scores.) Finally, here is a table of guesses and expected scores. The score is just the guess if the correct answer is T and 100% minus the guess if the correct answer is F.
Guess, Expected score
10.0%, 72.24674055842077%
20.0%, 60.62866266041592%
30.0%, 54.288145268982525%
40.0%, 51.016980025031636%
50.0%, 50.0%
60.0%, 51.016980025031636%
70.0%, 54.28814526898253%
80.0%, 60.628662660415934%
90.0%, 72.24674055842077%
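The table can be regenerated from the formula p ^ p * (1 - p) ^ (1 - p). The following is a sketch in Python rather than the code originally used, and `expected_percentage` is a name invented here; the last printed digits may differ slightly from the table because of floating-point rounding.

```python
def expected_percentage(p):
    """Expected (geometric-mean) score, as a percentage, for a perfectly
    calibrated guesser who always answers with probability p."""
    return 100.0 * p ** p * (1.0 - p) ** (1.0 - p)

print("Guess, Expected score")
for tenths in range(1, 10):
    p = tenths / 10
    print(f"{100 * p:.1f}%, {expected_percentage(p)}%")
```

The function is symmetric around 50%, which is why the rows for 10% and 90%, 20% and 80%, and so on match.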
Good post by Richard. I’m hesitant to apply the term “expected” to the percentage scores, since the word “expected” (when used in probability) usually indicates an arithmetic mean, and here we are using a geometric mean.
If you’re interested in expected scores, check out this page:
http://en.wikipedia.org/wiki/Information_entropy
I should also note that Nick Hay was the first person that I know of to suggest “expected score - score” as a measure of calibration (you want it to be zero; positive is overconfident, negative is underconfident). This was during summer 2006 at SIAI.
Thanks — this is all interesting. I guess a problem with expected score minus score as a measure of calibration (I assume this is after taking logarithms?) is that you can be overconfident at some probabilities and underconfident at others so it cancels out even though you’re not well-calibrated.
Your actual probability of getting it right is a function f of your stated probability; if f is the identity function so f(p) = p for all p then you’re perfectly calibrated; so another approach to creating a calibration measure is to get an estimate for the distance (in some metric on the space of functions) between f and the identity function.
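One way to sketch the idea of a distance between f and the identity function is to group answers by stated probability and compare each group’s hit rate with what was stated. The metric below (a weighted mean absolute gap) and the name `calibration_distance` are assumptions made for this illustration; other metrics on the space of functions would do as well, and with only a handful of questions per group the estimate is very noisy.

```python
from collections import defaultdict

def calibration_distance(stated_probs, outcomes):
    """Weighted mean absolute gap between each stated probability and the
    observed frequency of being right among answers given at that level."""
    groups = defaultdict(list)
    for p, hit in zip(stated_probs, outcomes):
        groups[p].append(1.0 if hit else 0.0)
    n = len(stated_probs)
    return sum(len(hits) / n * abs(sum(hits) / len(hits) - p)
               for p, hits in groups.items())

# Ten answers at 90% with nine right: perfectly calibrated.
print(calibration_distance([0.9] * 10, [True] * 9 + [False]))

# Overconfident at 90% (7 of 10 right) and underconfident at 60%
# (8 of 10 right). The mean stated probability and the overall hit rate
# are both 0.75, so an aggregate check sees nothing, but the per-level
# gaps do not cancel here.
stated = [0.9] * 10 + [0.6] * 10
hits = [True] * 7 + [False] * 3 + [True] * 8 + [False] * 2
print(calibration_distance(stated, hits))
```

The second case is a toy version of the cancellation concern raised above, using a simple aggregate comparison as a stand-in rather than the expected-score-minus-score measure itself.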