A Numerical Method for the Evaluation of Kolmogorov Complexity

                                Hector Zenil
                                Amphithéâtre Alan M. Turing
                       Laboratoire d'Informatique Fondamentale de Lille
                                      (UMR CNRS 8022)




Foundational Axis


As pointed out by Greg Chaitin (in his report on H. Zenil's thesis):

    The theory of algorithmic complexity is of course now widely
    accepted, but was initially rejected by many because of the fact
    that algorithmic complexity is on the one hand uncomputable
    and on the other hand dependent on the choice of universal
    Turing machine.


This last drawback is especially restrictive for real-world applications,
because the dependency is especially acute for short strings; a solution
to this problem is at the core of this work.




Foundational Axis (cont.)




The foundational point of departure of the thesis is an apparent
contradiction, pointed out by Greg Chaitin (same thesis report):

    ... the fact that algorithmic complexity is extremely, dare I say
    violently, uncomputable, but nevertheless often irresistible to
    apply ...




Algorithmic Complexity


Foundational Notion
                   A string is random if it is hard to describe.
                  A string is not random if it is easy to describe.



Main Idea
   The theory of computation replaces descriptions with programs. It
         constitutes the framework of algorithmic complexity:
                          description ⇐⇒ computer program




Algorithmic Complexity (cont.)


Definition
[Kolmogorov(1965), Chaitin(1966)]

                          K(s) = min{|p| : U(p) = s}

The algorithmic complexity K(s) of a string s is the length of the shortest
program p that produces s running on a universal Turing machine U.
The formula conveys the following idea: a string with low algorithmic
complexity is highly compressible, as the information that it contains can
be encoded in a program much shorter in length than the length of the
string itself.




Algorithmic Randomness

Example
The string 010101010101010101 has low algorithmic complexity because it
can be described as 9 times 01, and no matter how long it grows in
length, if the pattern repeats, the description (k times 01) increases only
by about log(k), remaining much shorter than the length of the string.
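
To make the log(k) growth concrete, here is a minimal sketch (Python; illustrative, not from the slides) comparing the length of the repeated string with the bits needed to encode the repetition count k:

from math import log2

for k in (9, 900, 90000):
    s = "01" * k
    # bits needed to write k in binary: the only part of the
    # description "k times 01" that grows with the string
    k_bits = int(log2(k)) + 1
    print(len(s), k_bits)  # 18 vs 4, 1800 vs 10, 180000 vs 17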



Example
The string 010010110110001010 has high algorithmic complexity because
it doesn’t seem to allow a (much) shorter description other than the string
itself, so a shorter description may not exist.



Example of an evaluation of K
The string 010101...01 can be produced by the following program:

Program A:
1: n:= 0
2: Print n
3: n:= n+1 mod 2
4: Goto 2

The length of A (in bits) is an upper bound of K(010101...01).

Connections to predictability: program A trivially allows a shortcut to
the value of an arbitrary digit through the following function f(n):
                   if n = 2m (i.e. n even) then f(n) = 1; otherwise f(n) = 0.
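
As a small sketch (Python; a hypothetical rendering, not part of the slides), the shortcut function and the string it predicts:

def f(n):
    # digit at (1-indexed) position n of 0101...: 1 when n is even, else 0
    return 1 if n % 2 == 0 else 0

print("".join(str(f(n)) for n in range(1, 19)))  # prints 010101010101010101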

Predictability characterization (Schnorr) [Downey(2010)]
simple ⇐⇒ predictable
random ⇐⇒ unpredictable
Noncomputability of K

The main drawback of K is that it is not computable and thus can only be
approximated in practice.


Important
No algorithm can tell whether a program p generating s is the shortest
(due to the undecidability of the halting problem of Turing machines).


No absolute notion of randomness
It is impossible to prove that a program p generating s is the shortest
possible; this also implies that if a program is about the length of the
original string, one cannot tell whether a shorter program producing s exists.
Hence, there is no way to declare a string truly algorithmically random.


Structure vs. randomness



Formal notion of structure
One can, however, exhibit a program generating s that is (much) shorter
than s itself. So even though one cannot tell whether a string is random,
one can declare s not random if a program generating s is (much) shorter
than the length of s.


As a result, one can only find upper bounds of K(s): s cannot be more
complex than the length of the shortest known program producing it.
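
In practice, such an upper bound is often estimated with a lossless compressor (a standard proxy, not part of this talk's method; a minimal sketch in Python):

import zlib

s = b"01" * 500
# the size of the compressed data (plus a fixed-size decompressor)
# upper-bounds the complexity of s
print(len(s), len(zlib.compress(s)))  # e.g. 1000 vs. ~20 bytes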




Most strings have maximal algorithmic complexity

Even if one cannot tell when a string is truly random, it is known that most
strings cannot have much shorter generating programs, by a simple
combinatorial argument:
    There are exactly 2^n bit strings of length n,
    but there are only 2^0 + 2^1 + 2^2 + . . . + 2^(n−1) = 2^n − 1 bit strings of
    fewer bits (in fact, there is at least one string that cannot be compressed
    even by a single bit).
    Hence, there are considerably fewer short programs than long programs.


Basic notion
One can't pair up all n-length strings with programs of much shorter length
(there simply aren't enough short programs to encode all the longer strings).
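
A two-line check of the counting argument (illustrative Python):

n = 20
shorter = sum(2**i for i in range(n))  # number of bit strings of length < n
print(2**n, shorter)                   # 1048576 vs 1048575: one string too few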


The choice of U matters
A major criticism brought forward against K is its dependence on the
universal Turing machine U. From the definition:

                          K(s) = min{|p| : U(p) = s}

It may turn out that:

     K_U1(s) ≠ K_U2(s) when evaluated using U1 and U2 respectively.


Basic notion
This dependency is particularly troubling for short strings, for example
strings shorter than the length of the universal Turing machine on which
K of the string is evaluated (typically on the order of hundreds of bits,
as originally suggested by Kolmogorov himself).


The Invariance theorem
A theorem guarantees that, in the long term, different evaluations of
algorithmic complexity converge to the same values (up to an additive
constant) as the length of the strings grows.

Theorem (Invariance theorem)
If U1 and U2 are two (universal) Turing machines, and K_U1(s) and K_U2(s)
are the algorithmic complexities of a binary string s when U1 or U2 are
used respectively, then there exists a constant c such that for all binary
strings s:

                             |K_U1(s) − K_U2(s)| < c
           (think of a compiler between 2 programming languages)

Yet, the additive constant can be arbitrarily large, making it unstable (if
not impossible) to evaluate K(s) for short strings.

Theoretical holes


  1   Finding a stable framework for calculating the complexity of short
      strings (one wants short strings like 000...0 to always be among the
      least algorithmically random, regardless of the choice of machine).
  2   Pathological cases: the theory says that a single bit has maximal
      complexity because the greatest possible compression is evidently the
      bit itself (paradoxically, it is the only finite string for which one can be
      sure it cannot be compressed further), yet one would intuitively say
      that a single bit is among the simplest strings.

We try to fill these holes by introducing the concept of algorithmic
probability as an alternative tool for evaluating K(s).




Algorithmic Probability


There is a measure that describes the expected output of a random
program running on a universal Turing machine.


Definition
[Levin(1977)]
m(s) = Σ_{p:U(p)=s} 1/2^|p|, i.e. the sum over all programs p for which U (a
prefix-free universal Turing machine) outputs the string s and halts.


m is traditionally called Levin's semi-measure, Solomonoff-Levin's
semi-measure, or the Universal Distribution [Kirchherr and Li(1997)].
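
A toy numeric illustration of how the sum behaves (made-up program lengths, purely hypothetical):

lengths = [3, 5, 8]                      # hypothetical |p| of halting programs for s
m_lower = sum(2.0**-l for l in lengths)  # each program contributes 1/2^|p|
print(m_lower)                           # 0.16015625, a lower bound on m(s)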




The motivation for Solomonoff-Levin’s m(s)


Borel's typewriting monkey metaphor [1] is useful to explain the intuition
behind m(s):

If you were going to produce the digits of a mathematical constant like π
by throwing digits at random, you would have to produce every digit of its
infinite irrational decimal expansion.

If you place a monkey at a typewriter (with, say, 50 keys), the
probability of the monkey typing an initial segment of 2400 digits of π by
chance is 1/50^2400.




   [1] Émile Borel (1913), "Mécanique Statistique et Irréversibilité", and
(1914), "Le hasard".
The motivation for Solomonoff-Levin’s m(s) (cont.)

But if instead the monkey is placed at a computer, the chances of
producing a program generating the digits of π are only 1/50^158,
because it would take the monkey only 158 characters to produce the first
2400 digits of π using, for example, this C language code:

     int a = 10000, b, c = 8400, d, e, f[8401], g; main(){for(; b-c; )
f[b++] = a/5; for(; d = 0, g = c*2; c -= 14, printf("%.4d", e + d/a),
  e = d%a) for(b = c; d += f[b]*a, f[b] = d%--g, d /= g--, --b; d *= b);}

Implementations, in any programming language, of any of the many known
formulae for π are shorter than the expansion of π itself, and therefore
have a greater chance of being produced by chance than the digits of π
typed one by one.
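
Working the slide's numbers out (illustrative Python; both probabilities underflow ordinary floats, so the comparison is done in log scale):

from math import log10

digits, program = 2400, 158
advantage = (digits - program) * log10(50)
print(round(advantage))  # ~3809: the program is ~10^3809 times more probable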


More formally said

Randomly picking a binary string s of length k among all (uniformly
distributed) strings of the same length has probability 1/2^k.
But the probability of finding a binary program p producing s (upon halting),
among binary programs running on a Turing machine U, is at least 1/2^|p|,
where U(p) = s (we know that such a program exists because U is a
universal Turing machine).
Because |p| ≤ k (e.g. the example for π described before), a string s with
a short generating program has a greater chance of having been produced
by p than by writing down all k bits of s one by one.
The less random a string, the more likely it is to be produced by a short
program.



Towards a semi-measure
However, there is an infinite number of programs producing s, so the
probability of picking a program producing s among all possible programs
is Σ_{p:U(p)=s} 1/2^|p|, the sum over all programs producing s running on the
universal Turing machine U.
Nevertheless, for a measure to be a probability measure, the sum over all
possible events should add up to 1. So Σ_{p:U(p)=s} 1/2^|p| cannot be a
probability measure as it stands, given that there is an infinite number of
programs contributing to the overall sum. For example, the following two
programs 1 and 2 both produce the string 0.
1: Print 0
and:
1: Print 0
2: Print 1
3: Erase the previous 1
and there are (countably) infinitely many more.
Towards a semi-measure (cont.)

So for m(s) to be a probability measure, the universal Turing machine U
has to be a prefix-free Turing machine, that is, a machine that does not
accept as a valid program one that begins with another valid program;
e.g. program 2 starts with program 1, so if program 1 is a valid
program then program 2 cannot be a valid one.

The set of valid programs is said to form a prefix-free set, that is, no
element is a prefix of any other, a property necessary to keep
0 < m(s) < 1. For more details see Kraft's inequality [Calude(2002)].
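
A small sketch of the prefix-free condition and the Kraft sum (illustrative Python, not from the slides):

def is_prefix_free(programs):
    # no program is an initial segment of another
    return not any(p != q and q.startswith(p) for p in programs for q in programs)

progs = ["0", "10", "110", "111"]         # a prefix-free set of code words
print(is_prefix_free(progs))              # True
print(sum(2.0**-len(p) for p in progs))   # Kraft sum of 2^-|p| = 1.0, hence <= 1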

However, some programs halt and some others don't (actually, most do not
halt), so one can only run U and watch which programs produce s,
contributing to the sum. It is said, then, that m(s) is semi-computable
from below, and it is therefore considered a probability semi-measure (as
opposed to a full measure).


Some properties of m(s)



Solomonoff and Levin proved that, in the absence of any other information,
m(s) dominates any other semi-measure and is therefore optimal in this
sense (hence also the adjective "universal").

On the other hand, the greatest contributor to the sum
Σ_{p:U(p)=s} 1/2^|p| is the shortest program p, given that this is where the
denominator 2^|p| reaches its smallest value, and therefore 1/2^|p| its
greatest. The length of the shortest program p producing s is nothing but
K(s), the algorithmic complexity of s.




The coding theorem

As noted on the previous slide, the greatest contributor to the sum
Σ_{p:U(p)=s} 1/2^|p| is the shortest program p producing s, whose length is
K(s). The coding theorem [Levin(1977), Calude(2002)] describes this
connection between m(s) and K(s):

Theorem
                           K(s) = −log2(m(s)) + c

Notice that the coding theorem reintroduces an additive constant! One
may not get rid of it, but the choices involved in m(s) are much less
arbitrary than picking a universal Turing machine directly for K(s).
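
This is how the method turns an output distribution into complexity values. A sketch (Python) using a few frequencies from the D(2) table shown later, with the additive constant c dropped:

import math

D2 = {"0": 0.328, "1": 0.328, "00": 0.0834, "000": 0.00065}  # from the D(2) table

def K_hat(s, dist):
    # coding-theorem estimate: K(s) ~ -log2 m(s), up to the constant c
    return -math.log2(dist[s])

print(round(K_hat("0", D2), 2))    # 1.61 bits
print(round(K_hat("000", D2), 2))  # 10.59 bits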



An additive constant in exchange for a massive
computation
The trade-off, however, is that the calculation of m(s) requires an
extraordinary amount of computation.

As pointed out by J.-P. Delahaye concerning our method (Pour La
Science, No. 405 July 2011 issue):

     Like very small durations or lengths, low complexities are delicate
     to evaluate. Paradoxically, the methods of evaluation demand
     colossal computations.

The first description of our approach was published in Greg Chaitin's
festschrift volume for his 60th birthday: J-P. Delahaye & H. Zenil,
"On the Kolmogorov-Chaitin complexity for short sequences," Randomness and
Complexity: From Leibniz to Chaitin, edited by C.S. Calude, World Scientific,
2007.

Calculating an experimental m
Main idea
To evaluate K(s) one can calculate m(s). m(s) is more stable than K(s)
because one makes fewer arbitrary choices about the Turing machine U.

Definition
D(n) = the function that assigns to every finite binary string s the
quotient:
(# of times that a machine in (n,2) produces s) / (# of machines in (n,2)).

D(n) is the probability distribution of the strings produced by all n-state
2-symbol Turing machines (denoted by (n,2)).
Examples for n = 1, n = 2 (normalized by the # of machines that
halt)
                            D(1) = 0 → 0.5; 1 → 0.5
                   D(2) = 0 → 0.328; 1 → 0.328; 00 → 0.0834 . . .
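
As a minimal sketch of how such a D(n) can be computed by brute force (illustrative Python, not the code used for the reported results; the machine formalism, the output convention, which takes the tape segment the head worked on upon halting, and the step bound are all assumptions):

from itertools import product
from collections import Counter

def machines(n):
    # all (n,2) machines: one instruction (write, move, next_state) per
    # (state, symbol) pair; next_state == 0 is taken here to mean "halt"
    instr = [(w, m, q) for w in (0, 1) for m in (-1, 1) for q in range(n + 1)]
    keys = [(s, b) for s in range(1, n + 1) for b in (0, 1)]
    for rules in product(instr, repeat=len(keys)):
        yield dict(zip(keys, rules))

def output(tm, max_steps):
    # run on a blank (all-0) tape; on halting, return the visited segment
    tape, pos, state, lo, hi = {}, 0, 1, 0, 0
    for _ in range(max_steps):
        lo, hi = min(lo, pos), max(hi, pos)
        w, m, state = tm[(state, tape.get(pos, 0))]
        tape[pos] = w
        pos += m
        if state == 0:
            return "".join(str(tape.get(i, 0)) for i in range(lo, hi + 1))
    return None  # did not halt within the bound

def D(n, max_steps):
    counts = Counter()
    for tm in machines(n):
        s = output(tm, max_steps)
        if s is not None:
            counts[s] += 1
    total = sum(counts.values())  # normalize over halting machines only
    return {s: c / total for s, c in counts.items()}

print(D(1, 2))  # with this convention: {'0': 0.5, '1': 0.5}, as above

In practice the step bound is taken from the known Busy Beaver values S(n), so that any machine exceeding it is certified never to halt.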
Calculating an experimental m (cont.)

Definition
[T. Radó (1962)]
A busy beaver is an n-state, 2-symbol Turing machine which writes a
maximum number of 1s, or performs a maximum number of steps, before
halting, when started on an initially blank tape.

Given that the Busy Beaver function values are known for n-state 2-symbol
Turing machines for n = 2, 3, 4, we could compute D(n) for n = 2, 3, 4.
We ran all 22 039 921 152 two-way tape Turing machines, starting with a
tape filled with 0s and 1s, in order to calculate D(4). [2]

Theorem
D(n) is noncomputable (by reduction to Radó's Busy Beaver problem).

  [2] A 9-day calculation on a single 2.26 GHz Intel Core Duo CPU.
Complexity Tables

Table: The 22 bit-strings in D(2) from 6 088 (2,2)-Turing machines that halt.
[Delahaye and Zenil(2011)]
                           0 → .328              010 → .00065
                           1 → .328              101 → .00065
                           00 → .0834            111 → .00065
                           01 → .0834            0000 → .00032
                           10 → .0834            0010 → .00032
                           11 → .0834            0100 → .00032
                           001 → .00098          0110 → .00032
                           011 → .00098          1001 → .00032
                           100 → .00098          1011 → .00032
                           110 → .00098          1101 → .00032
                           000 → .00065          1111 → .00032


Solving degenerate cases
“0” is the simplest string (together with “1”) according to D.
Partial D(4) (top strings)

[Table of the top-ranked strings in D(4) not reproduced here.]
From a Prior to an Empirical Distribution
We see algorithmic complexity emerging:
  1   The classification accords with our intuition of what complexity
      should be.
  2   Strings are almost always classified by length, except in cases where
      intuition justifies they should not be. For example, even though 0101010
      is of length 7, it ranked better than some strings of length shorter
      than 7. One sees the low algorithmic complexity of 010101... emerge:
      it behaves as a simple string.


From m to D
Unlike m, D is an empirical distribution and no longer a prior. D
experimentally confirms the intuition behind Solomonoff and Levin’s
measure.


Full tables are available online: www.algorithmicnature.org
Miscellaneous facts from D(3) and D(4)
   There are 5 970 768 960 machines that halt among the 22 039 921 152
   in (4,2); that is, a fraction of 0.27 halt.
   Among the least random looking strings from D(4) there are:
   0, 00, 000..., 01, 010, 0101, etc.
   Among the most random looking strings one can find:
   1101010101010101, 1101010100010101, 1010101010101011 and
   1010100010101011, each with frequency 5.4447 × 10^−10.
   As in D(3), where we reported that one string group (0101010 and its
   reversal) climbed positions, in D(4) 399 strings climbed to the top
   and were not sorted among their length groups.
   In D(4) string length was no longer a classification determinant. For
   example, between positions 780 and 790, the string lengths are: 11, 10,
   10, 11, 9, 10, 9, 9, 9, 10 and 9 bits.
   D(4) preserves the string order of D(3) except in 17 places out of the
   128 strings of D(3), ordered from highest to lowest string frequency.
Connecting D back to m



To get m we replaced a uniform distribution over the bits composing strings
with a uniform distribution over the bits composing programs. Imagine that
your (Turing-complete) programming language allows a monkey to produce
rules of Turing machines at random; every time the monkey types a
valid program, it is executed.


At the limit, the monkey (which is just a random source of programs) will
end up covering a sample of the space of all possible Turing machine rules.




Connecting D back to m


On the other hand, D(n) for a fixed n is the result of running all n-state
2-symbol Turing machines according to an enumeration.

An enumeration is just a thorough sample of the space of all n-state
2-symbol Turing machines each with fixed probability
1/(# of Turing machines in (n,2)) (by definition of enumeration).

D(n) is therefore a legitimate programmer-monkey experiment. The
additional advantage of performing a thorough sample of Turing machines
by following an enumeration is that the order in which the machines are
traversed is irrelevant, as long as one covers all the elements of an
(n,2) space.




Connecting D back to m (cont.)


One may ask why shorter programs are favored.

The answer, in analogy to the monkey experiment, is based on the uniform
random distribution of keystrokes: programs cannot be very long without
eventually containing the program-ending keystroke. One can still imagine
imposing a different distribution on the program instructions, for example
by changing the keyboard distribution, repeating certain keys.

Choices other than the uniform distribution are more arbitrary than simply
assuming no additional information (a keyboard with two or more "a" keys
rather than the usual single one seems more arbitrary than having one key
per letter).




Connecting D back to m (cont.)

Every D(n) is a sample of D(n + 1), because (n + 1, 2) contains all
machines in (n, 2). We have empirically verified that strings sorted by
frequency in D(4) preserve the order of D(3), which preserves the order of
D(2), meaning that longer programs do not produce completely different
classifications. One can think of the sequence D(1), D(2), D(3), D(4), . . .
as samples whose values are approximations of m.
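
The order-preservation check can be sketched as follows (toy frequencies; the real comparison used the full D(3) and D(4) tables):

def ordering(dist):
    # strings sorted by decreasing frequency
    return sorted(dist, key=dist.get, reverse=True)

d3 = {"0": 0.30, "1": 0.30, "00": 0.08, "01": 0.07}               # made up
d4 = {"0": 0.29, "1": 0.29, "00": 0.09, "01": 0.06, "000": 0.01}  # made up
shared = [s for s in ordering(d4) if s in d3]
print(shared == ordering(d3))  # True when D(4) preserves the D(3) order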


One may also ask how we can know whether a monkey provided with a
different programming language would produce a completely different D,
and therefore yet another experimental version of m. That may be the
case, but we have also shown that reasonable programming languages
(e.g. based on cellular automata and Post tag systems) produce reasonably
similar (correlated) distributions.


Connecting D back to m (cont.)

[Figure not reproduced here.]
m(s) provides a formalization for Occam’s razor




The immediate consequence of algorithmic probability is simple but
powerful (and surprising):

Basic notion
                    Type-writing monkeys (Borel): garbage in → garbage out
                    Programmer monkeys (Bennett, Chaitin): garbage in → structure out



What m(s) may tell us about the physical world


Basic notion
m(s) tells us that it is unlikely that a Rube Goldberg machine produces a
string if the string can be produced by a much simpler process.


Physical hypothesis
m(s) would tell us that, if processes in the world are computer-like, it is
unlikely that structures are the result of the computation of a Rube
Goldberg machine. Instead, they would rather be the result of the shortest
programs producing those structures, and patterns would follow the
distribution suggested by m(s).




On the algorithmic nature of the world

Could it be that m(s) tells us how structure in the world has come to be
and how it is distributed all around? Could m(s) reveal the machinery
behind it?
What happens in the world is often the result of an ongoing (mechanical)
process (e.g. the Sun rising due to the mechanical celestial dynamics of
the solar system).
Can m(s) tell us something about the distribution of patterns in the world?
We decided to find out, so we took some empirical datasets from the physical
world and compared them against data produced by pure computation,
which by definition should follow m(s).

The results were published in H. Zenil & J-P. Delahaye, “On the
Algorithmic Nature of the World”, in G. Dodig-Crnkovic and M. Burgin (eds),
Information and Computation, World Scientific, 2010.


On the algorithmic nature of the world

[Figure not reproduced here.]
Conclusions


Our method aimed to show that reasonable choices of formalisms for
evaluating the complexity of short strings through m(s) give consistent
measures of algorithmic complexity.

    [Greg Chaitin (w.r.t our method)] ...the dreaded theoretical hole
    in the foundations of algorithmic complexity turns out, in
    practice, not to be as serious as was previously assumed.


Our method also seems notable in that it is an experimental approach that
comes to the rescue where the theory leaves apparent holes.




Bibliography
   C.S. Calude, Information and Randomness: An Algorithmic
   Perspective (Texts in Theoretical Computer Science. An EATCS
   Series), Springer, 2nd. edition, 2002.
   G. J. Chaitin. On the length of programs for computing finite binary
   sequences. Journal of the ACM, 13(4):547–569, 1966.
   G. Chaitin, Meta Math!, Pantheon, 2005.
   R.G. Downey and D. Hirschfeldt, Algorithmic Randomness and
   Complexity, Springer Verlag, to appear, 2010.
   J.P. Delahaye and H. Zenil, On the Kolmogorov-Chaitin complexity for
   short sequences, in Cristian Calude (eds) Complexity and Randomness:
   From Leibniz to Chaitin. World Scientific, 2007.
   J.P. Delahaye and H. Zenil, Numerical Evaluation of Algorithmic
   Complexity for Short Strings: A Glance into the Innermost Structure
   of Randomness, arXiv:1101.4795v4 [cs.IT].
C.S. Calude, M.A. Stay, Most Programs Stop Quickly or Never Halt,
2007.
W. Kirchherr and M. Li, The miraculous universal distribution,
Mathematical Intelligencer , 1997.
A. N. Kolmogorov. Three approaches to the quantitative definition of
information. Problems of Information and Transmission, 1(1):1–7,
1965.
P. Martin-Löf. The definition of random sequences. Information and
Control, 9:602–619, 1966.
L. Levin, On a concrete method of Assigning Complexity Measures,
Doklady Akademii nauk SSSR, vol.18(3), pp. 727-731, 1977.
L. Levin, Universal Search Problems, 9(3):265-266, 1973.
(submitted: 1972, reported in talks: 1971). English translation in:
B.A.Trakhtenbrot. A Survey of Russian Approaches to Perebor
(Brute-force Search) Algorithms. Annals of the History of Computing
6(4):384-400, 1984.

M. Li, P. Vitányi, An Introduction to Kolmogorov Complexity and Its
Applications, Springer, 3rd revised edition, 2008.
S. Lloyd, Programming the Universe: A Quantum Computer Scientist
Takes On the Cosmos, Knopf Publishing Group, 2006.
T. Radó, On non-computable functions, Bell System Technical
Journal, Vol. 41, No. 3, 1962.
R. J. Solomonoff. A formal theory of inductive inference: Parts 1 and
2. Information and Control, 7:1–22 and 224–254, 1964.
H. Zenil and J.P. Delahaye, On the Algorithmic Nature of the World,
in G. Dodig-Crnkovic and M. Burgin (eds), Information and
Computation, World Scientific, 2010.
S. Wolfram, A New Kind of Science, Wolfram Media, 2002.




