Compression is one of the most powerful concepts in computing. For the normal computer user, compression is associated with making files smaller, so you can store more of them or send them faster over the Internet, but there is much more to it.

An optimal compressor can (or could, if it existed) be used for prediction of the future of a sequence of events (weather, sports, stocks, political events, etc), for example by trying all possible continuations and examine how well it compresses given the history. Conversely, an optimal predictor that gives the correct probability of each possible next symbol can be used for optimal compression by using arithmetic coding.

Compression and prediction

Ray Solomonoff has shown [PDF] that if we let Sk be the infinite set of all programs for a machine M, such that M(Sk) gives an output with X as prefix (i.e the first bits of the output is X), then the probability of X becomes the sum of the probabilities of all of its programs, where the probability of a program is 2 ** (-|Sk|) if |Sk| is the length of the program in bits and “**” means “to the power of”. As X gets longer, the error of the predictions approach zero, if the error is calculated as the total squared probability difference.

A technicality is that only those programs count, that does not still produce X, when the last bit of the program is removed.

To give a slightly more concrete example, say that you have a sequence of events - a history - and encode those as a sequence of symbols, X. Let us further say that you have a machine, M, that can read a program S and output a sequence of symbols. If you have no further information on your sequence of events, then the best estimate for the probability of a symbol Z to occur next (i.e the best prediction) is given by the set of all programs that output your history X followed by Z. Programs which output X+Z and are short are weighted higher (the 2 ** (-|Sk|) part).

Obviously you cannot find all these programs by just trying every possible program, because 1) they are infinitely many and 2) given the halting problem you cannot in general know if a running program is in an infinite loop or if it will eventually output X.

There is an area of probability theory called Minimum description length, where the language is chosen to be so simple (not Turing complete) so that you can actually find the shortest program or “description”. Calculating probabilities this way is very similar to Bayesian probability, but more general.

Solomonoff in (almost) practice

Although the point of the theorem is not to apply it directly in practice, for short sequences X we can actually try. We can avoid problem 1 above, there are infinitely many programs, by generating random programs and see if they produce X. If they do, we count them. This way we can produce an approximation of what the true sum0 and sum1 are. If we set up our random generation such that shorter programs are more likely, then we don’t have to bother with the 2 ** (-program length) part and may just have a running count for each sum. If the sequence is too long, this method will be impractical since almost no randomly generated programs will actually output X.

Problem 2 above, when we test programs they may not halt, is harder, but Levin has proposed a way around it. If we in addition to program length use running time (number of instructions executed) as a measure of the probability of our program, we can start by generating all the programs that we intend to test and then run them all in parallel. As our execution moves forward, we will get an increasingly accurate approximation of the true sum0 and sum1, without getting stuck on infinite loops.

If we want to get even more practical, it can be shown that the shortest program that produces X will generally dominate the others and thus it will predict the most likely next symbol. That way you can just search for programs that output X and the currently shortest program will be your best guess. Since we no longer care about the relative probability of the next symbol, but only which is most likely, the search does not have to be random. Thus we can use any method we like for finding a short program. If you search for programs that produce X and find one that almost does, you can construct a “true” solution from that one, by constructing a prefix part that hard codes the places where your original program is wrong. This will produce a longer program, where the length, and thus the “score”, is the length of the prefix + the length of your faulty solution. The size of the shortest program that outputs X is called the Kolmogorov complexity of X. The size of the shortest program that outputs X, measured in program size + log(running time), is called the Levin complexity of X.

One way to find these programs is to use genetic programming, just take care that you don’t think that you can count the number of programs that produce X and get relative probabilities, because your search will now be skewed towards the solution (and thus its prediction) that you find first.

A small problem is that depending on what machine you choose, i.e which instructions your programs can use and the length of these instructions, you will get different results. The method has a built in bias, since there is no one correct Turing complete language. This difference will however be smaller as X gets longer. One way to understand that is to note that any Turing complete language can emulate any other Turing complete language and that the size of such an emulator is finite. This is called the compiler theorem.

Compression is understanding and the Hutter Prize

When we understand something, we can describe it succinctly. If I have an image of a red perfect circle, the size will be much larger if I describe the individual pixels rather than just say “a red circle of diameter d and thickness t”. When I understand what the image is depicting, I can describe it shorter. Sometimes a lossy compression of observed data will actually express the truth better than the exact data. If I take a photo of a red circle, the photo will probably not be perfect, but if I notice what the photo is showing, I can compress it as “a red circle” and some noise which I throw away, and suddenly my lossy compression is a better depiction of the truth.

This equivalence between compression and general intelligence led Marcus Hutter to announce the Hutter Prize, where money is awarded for the best compression of 100 megabyte of English Wikipedia articles. So far the compression algorithms have been impressive (compressing the text to about 15%), but not shown much intelligence or understanding of the articles. When they do start to exhibit some understanding, I think that if they are allowed to compress the data in a slightly lossy way, the first thing that will happen is that some spelling and layout mistakes will be corrected, because these will be surprising to the compressor and thus demand an unusually long representation.

Matt Mahoney has written a good rationale on the Hutter Prize here.

Compression in practice - the juicy stuff

Compression is a powerful tool to measure success and avoid overfitting in a variety of common AI problems. The methods I laid out here are interesting mostly from a theoretical perspective, because of their prohibitively long running times. In my next post, I will expand on my thoughts on how you can use these results to get actual, practical algorithms for common AI problems.

An optimal compressor can (or could, if it existed) be used for prediction of the future of a sequence of events (weather, sports, stocks, political events, etc), for example by trying all possible continuations and examine how well it compresses given the history. Conversely, an optimal predictor that gives the correct probability of each possible next symbol can be used for optimal compression by using arithmetic coding.

Compression and prediction

*This section on background theory contains possibly scary math and dense prose, but should be understandable for most programmers. Maybe re-read the sentences a couple of times.*Ray Solomonoff has shown [PDF] that if we let Sk be the infinite set of all programs for a machine M, such that M(Sk) gives an output with X as prefix (i.e the first bits of the output is X), then the probability of X becomes the sum of the probabilities of all of its programs, where the probability of a program is 2 ** (-|Sk|) if |Sk| is the length of the program in bits and “**” means “to the power of”. As X gets longer, the error of the predictions approach zero, if the error is calculated as the total squared probability difference.

A technicality is that only those programs count, that does not still produce X, when the last bit of the program is removed.

To give a slightly more concrete example, say that you have a sequence of events - a history - and encode those as a sequence of symbols, X. Let us further say that you have a machine, M, that can read a program S and output a sequence of symbols. If you have no further information on your sequence of events, then the best estimate for the probability of a symbol Z to occur next (i.e the best prediction) is given by the set of all programs that output your history X followed by Z. Programs which output X+Z and are short are weighted higher (the 2 ** (-|Sk|) part).

Even more concretely, given the binary sequence 101010101, you wonder what the probability is that the next bit will be 0 given that you know nothing else of this sequence. Sum 2 ** (-program length) for all programs that output 1010101010 vs those that output 1010101011 as their first bits (they are allowed to continue outputting stuff). If we call these sums sum0 and sum1 respectively, then the probability of 0 coming next is sum0 / (sum0 + sum1) and the probability of 1 coming next is sum1 / (sum0 + sum1).

Obviously you cannot find all these programs by just trying every possible program, because 1) they are infinitely many and 2) given the halting problem you cannot in general know if a running program is in an infinite loop or if it will eventually output X.

There is an area of probability theory called Minimum description length, where the language is chosen to be so simple (not Turing complete) so that you can actually find the shortest program or “description”. Calculating probabilities this way is very similar to Bayesian probability, but more general.

Solomonoff in (almost) practice

Although the point of the theorem is not to apply it directly in practice, for short sequences X we can actually try. We can avoid problem 1 above, there are infinitely many programs, by generating random programs and see if they produce X. If they do, we count them. This way we can produce an approximation of what the true sum0 and sum1 are. If we set up our random generation such that shorter programs are more likely, then we don’t have to bother with the 2 ** (-program length) part and may just have a running count for each sum. If the sequence is too long, this method will be impractical since almost no randomly generated programs will actually output X.

Problem 2 above, when we test programs they may not halt, is harder, but Levin has proposed a way around it. If we in addition to program length use running time (number of instructions executed) as a measure of the probability of our program, we can start by generating all the programs that we intend to test and then run them all in parallel. As our execution moves forward, we will get an increasingly accurate approximation of the true sum0 and sum1, without getting stuck on infinite loops.

If we want to get even more practical, it can be shown that the shortest program that produces X will generally dominate the others and thus it will predict the most likely next symbol. That way you can just search for programs that output X and the currently shortest program will be your best guess. Since we no longer care about the relative probability of the next symbol, but only which is most likely, the search does not have to be random. Thus we can use any method we like for finding a short program. If you search for programs that produce X and find one that almost does, you can construct a “true” solution from that one, by constructing a prefix part that hard codes the places where your original program is wrong. This will produce a longer program, where the length, and thus the “score”, is the length of the prefix + the length of your faulty solution. The size of the shortest program that outputs X is called the Kolmogorov complexity of X. The size of the shortest program that outputs X, measured in program size + log(running time), is called the Levin complexity of X.

One way to find these programs is to use genetic programming, just take care that you don’t think that you can count the number of programs that produce X and get relative probabilities, because your search will now be skewed towards the solution (and thus its prediction) that you find first.

A small problem is that depending on what machine you choose, i.e which instructions your programs can use and the length of these instructions, you will get different results. The method has a built in bias, since there is no one correct Turing complete language. This difference will however be smaller as X gets longer. One way to understand that is to note that any Turing complete language can emulate any other Turing complete language and that the size of such an emulator is finite. This is called the compiler theorem.

Compression is understanding and the Hutter Prize

When we understand something, we can describe it succinctly. If I have an image of a red perfect circle, the size will be much larger if I describe the individual pixels rather than just say “a red circle of diameter d and thickness t”. When I understand what the image is depicting, I can describe it shorter. Sometimes a lossy compression of observed data will actually express the truth better than the exact data. If I take a photo of a red circle, the photo will probably not be perfect, but if I notice what the photo is showing, I can compress it as “a red circle” and some noise which I throw away, and suddenly my lossy compression is a better depiction of the truth.

This equivalence between compression and general intelligence led Marcus Hutter to announce the Hutter Prize, where money is awarded for the best compression of 100 megabyte of English Wikipedia articles. So far the compression algorithms have been impressive (compressing the text to about 15%), but not shown much intelligence or understanding of the articles. When they do start to exhibit some understanding, I think that if they are allowed to compress the data in a slightly lossy way, the first thing that will happen is that some spelling and layout mistakes will be corrected, because these will be surprising to the compressor and thus demand an unusually long representation.

Matt Mahoney has written a good rationale on the Hutter Prize here.

Compression in practice - the juicy stuff

Compression is a powerful tool to measure success and avoid overfitting in a variety of common AI problems. The methods I laid out here are interesting mostly from a theoretical perspective, because of their prohibitively long running times. In my next post, I will expand on my thoughts on how you can use these results to get actual, practical algorithms for common AI problems.