
The €50’000 Prize for Compressing Human Knowledge was just announced via KurzweilAI.net. The motivation is as follows:
This compression contest is motivated by the fact that being able to compress well is closely related to acting intelligently. In order to compress data, one has to find regularities in them, which is intrinsically difficult (many researchers live from analyzing data and finding compact models). So compressors beating the current “dumb” compressors need to be smart(er). Since the prize wants to stimulate developing “universally” smart compressors, we need a “universal” corpus of data. Arguably the online lexicon Wikipedia is a good snapshot of the Human World Knowledge. So the ultimate compressor of it should “understand” all human knowledge, i.e. be really smart. enwik8 is a hopefully representative 100MB extract from Wikipedia.
This test is so much more meaningful than the Turing Test. It is quantitative, and amenable to incremental advances. It further emphasizes the relationship between general intelligence and ability to compress data. The Hutter Prize is a more concrete version of Jim Bowery’s proposed C-Prize. A more detailed rationale is on the site here.
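The link between regularity and compressibility is easy to see with an off-the-shelf compressor. A minimal sketch, assuming Python's standard `zlib` module as a stand-in for one of the “dumb” compressors; the sample data and the `compressed_ratio` helper are made up for illustration and have nothing to do with the contest's own tooling:

```python
import os
import zlib


def compressed_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller means more regularity was found)."""
    return len(zlib.compress(data, 9)) / len(data)


# Highly regular input: one sentence repeated many times.
regular = b"the quick brown fox jumps over the lazy dog. " * 2000

# Input with no regularities to exploit: bytes from the OS random source.
random_bytes = os.urandom(len(regular))

print(f"regular text: {compressed_ratio(regular):.4f}")
print(f"random bytes: {compressed_ratio(random_bytes):.4f}")
```

The repeated sentence shrinks to a small fraction of its size, while the random bytes do not shrink at all. Wikipedia text sits somewhere in between, and the prize rewards finding the regularities a general-purpose compressor misses.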
I think the second requirement, i.e., to restore the original data exactly as it was, is somewhat counterproductive. Natural language is just one way of representing knowledge and it surely isn’t the most efficient one. Hence, what one would want is a good natural language processing algorithm, which extracts the “knowledge” out of the wiki pages (or whatever source you want to use). Representing that knowledge using, say, ontologies paired with first-order predicate logic gets rid of the redundancies of natural language, so it can be compressed a lot more efficiently. On the “downside”, a complete restoration of the original pages becomes impossible, just as with lossy image compression, except that the actual quality of the information doesn’t suffer.
I have to agree with Herwig Moser. Good teachers aren’t interested in their students repeating their text book, word for word, only in them learning from it and being able to represent the data differently.
However, the prize founders may have another approach in mind.
Luke: The human-repeating-a-textbook analogy doesn’t work. Making a machine that can regurgitate information is a solved problem: the text you’re reading at this instant was prepared by machines that memorize and repeat my keyboard inputs with perfect accuracy. If a student can repeat a chapter from an economics textbook verbatim from memory, we have no way of knowing how much space it occupies in her brain, or what methods she’s using to remember it. If the student can reproduce (or at least approximate) information that cannot be memorized, then she has found a way to compress it. One can’t memorize a list of optimal moves in chess for every possible board position, which is why it’s so remarkable when people play chess well. Regurgitating textbooks only sounds unintelligent because we assume that the regurgitator is applying the obvious method: just recording the information without compressing it.
Suppose I were to recite to you the first 100 digits of the decimal expansion of pi. This isn’t all that impressive. It only tells you I have enough free time to memorize a string of 100 digits. But what if I can recite 10,000 digits without error? How about a million? At some point, memorization ceases to be an option. I must be generating the digits on-the-fly by some sort of algorithm. Finding those algorithms is the part that requires intelligence.
In Hutter’s contest, we know that the program is not simply “memorizing” like a student that can repeat a textbook, because, unlike a student, we know exactly how much space the generating program requires.
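The pi example can be made concrete. A minimal sketch, assuming Machin's formula (pi/4 = 4 arctan(1/5) - arctan(1/239)) evaluated with integer arithmetic; the names `arctan_inv` and `pi_digits` and the choice of ten guard digits are illustrative assumptions, not part of the contest:

```python
def arctan_inv(x: int, unity: int) -> int:
    """Return arctan(1/x) scaled by `unity`, using only integer arithmetic."""
    power = unity // x          # unity / x**(2k+1), starting at k = 0
    total = power
    x_squared = x * x
    divisor = 3
    sign = -1
    while power:                # stop once the terms underflow to zero
        power //= x_squared
        total += sign * (power // divisor)
        sign = -sign
        divisor += 2
    return total


def pi_digits(n: int, guard: int = 10) -> str:
    """Return '3' followed by the first n decimal digits of pi, as a string."""
    unity = 10 ** (n + guard)   # extra guard digits absorb rounding error
    pi_scaled = 4 * (4 * arctan_inv(5, unity) - arctan_inv(239, unity))
    return str(pi_scaled // 10 ** guard)


print(pi_digits(50))
```

The source above is a few hundred bytes, yet `pi_digits(10_000)` produces ten thousand digits from it. The program's size is known exactly, which is the point made above about the contest: the measured size leaves no room for hidden memorization.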
IMHO there are basically 3 components to a compressed corpus:
1) The presentation-invariant knowledge.
2) The calculus of presentation (vocabulary, syntax, grammar, visual markup, etc.).
3) The noise.
Taken together, 1 and 2 are the “model” part and 3 is the “noise” part of what I believe is called algorithmic statistics or minimum description length of the string. However, dividing the model into the presentation calculus (I’m making these terms up as I go along — there are probably legitimate academic terms) and presentation-invariant knowledge may make it clearer what people are actually talking about.
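Written as code lengths, this split might look as follows. This is only a sketch in the spirit of a two-part minimum description length code; the symbols are labels assumed here (x for the corpus, K for the presentation-invariant knowledge, P for the presentation calculus, L(.) for length in bits), and it assumes K and P are encoded separately:

```latex
% Assumed notation: x = corpus, K = presentation-invariant knowledge,
% P = presentation calculus, L(.) = code length in bits.
L(x) \;=\; \underbrace{L(K) + L(P)}_{\text{model}}
      \;+\; \underbrace{L(x \mid K, P)}_{\text{noise}}
```

A lossless compressor has to pay for all three terms; a lossy “knowledge extractor” would keep only the first.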
Clearly the presentation calculus has its own value in rendering for human interface. The presentation-invariant knowledge is what people frequently think of as “human knowledge” but of course human knowledge encompasses more than that.
I expect that the enwik8 level of the Hutter Prize will go some distance toward discovering things about the presentation calculus of an English wiki, but that presentation-invariant knowledge may need a larger corpus to push compression to higher levels.
Gutzeit, I was agreeing with:
“Natural language is just one way of representing knowledge and it surely isn’t the most efficient one. Hence, what one would want is a good natural language processing algorithm, which extracts the “knowledge” out of the wiki pages (or whatever source you want to use).”
When I said:
“Good teachers aren’t interested in their students repeating their text book, word for word, only in them learning from it and being able to represent the data differently.”
It is unreasonable to assume students have time to memorise even key parts of a text book, word-for-word. But if they can show that they know the material, that will satisfy most good teachers. What I think the Hutter Prize should encourage is this latter form of information retrieval (as opposed, perhaps, to ‘data’ retrieval), where, as Herwig Moser says, the words of the text book aren’t spat out word-for-word, but the information from it is still retained and exportable (or, to continue with the human analogy, explainable). Reconstructing the noisy knowledge as an additional layer will take some time to code, and it isn’t necessary to achieve something closer to a human’s mental compression.
Perhaps another prize that awards advances in mimicking this kind of ontology system would be best.
Another similarity that this could approximate: when people see something happen, they don’t remember even most of the information originally present. They break it down, compressing it in several different ways before it reaches long-term memory. In fact, as optical illusions readily demonstrate, even our short-term recall of imagery is incredibly lossy (try finding your blind spot). If our eyes used pixels, we could safely say that our recall of visual information uses a lossy compression system.
Another similarity that this could approximate: when humans read, if a word is incorrectly spelt, the very fact that we know what the word is *meant* to be shows understanding. If an advanced compression algorithm removes these spelling mistakes, it wouldn’t be a problem in most real terms. For most purposes, I don’t mind using a database or file system that just happens to automatically correct spelling errors, although if it doesn’t recognise new words, such as Skype, Google (as opposed to Googol) or Flickr (as opposed to flicker), or somebody’s password, there would be problems.
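The trade-off between silently correcting text and restoring it exactly can be sketched in a few lines. This is a toy illustration, assuming a hypothetical lookup table `CORRECTIONS` in place of a real spelling corrector; `normalize` and `restore` are made-up names:

```python
# Hypothetical table of known misspellings; a real corrector would be far larger.
CORRECTIONS = {"differntly": "differently", "recieve": "receive"}


def normalize(words):
    """Lossy step: fix known misspellings, and record what was changed."""
    cleaned, patches = [], []
    for index, word in enumerate(words):
        fixed = CORRECTIONS.get(word, word)   # unknown words pass through untouched
        if fixed != word:
            patches.append((index, word))     # remember the original spelling
        cleaned.append(fixed)
    return cleaned, patches


def restore(cleaned, patches):
    """Undo the corrections, giving back the text exactly as it was written."""
    words = list(cleaned)
    for index, original in patches:
        words[index] = original
    return words


text = "Skype users represent the data differntly".split()
cleaned, patches = normalize(text)
print(cleaned)
print(patches)
print(restore(cleaned, patches) == text)
```

Throwing `patches` away gives the lossy, “corrected” storage described above; keeping it is the extra layer that makes the scheme lossless. Because only listed misspellings are rewritten, a new word such as Skype is left alone in this sketch.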
A system aimed at emulating human intelligence that has the same flaws a human intelligence can sometimes have is probably on the right lines, as a precursor to something better.
This will not qualify for the competition but points to an answer to the problem.
Hope it helps the dialogue.
http://firstdiscipline.com/2010/09/26/juggling-with-ps-3/