Big Programming, Small Programming

Sep 3rd, 2013 | Comments

Programming is complicated. Different programs have different abstraction levels, domains, platforms, longevity, team sizes, etc ad infinitum. There is something fundamentally different between the detailed instructions that goes into, say, computing a checksum and the abstractions when defining the flow of data in any medium-sized system.

I think that the divide between coding the details and describing the flow of a program is so large that a programming language could benefit immensely from keeping them conceptually separate. This belief has led me to design a new programming language - Glow - that has this separation at its core.

A Better Soylent - Designing a Simple, “Optimal” Nutrition Shake

Apr 9th, 2013 | Comments

This post was inspired by Rob’s Soylent experiment. He recently got a lot of press for constructing an almost completely artificial liquid diet containing everything a human needs and then consuming this goo exclusively for a month.

Yes, of course it has been done before, but I thought the idea was interesting and wanted to try it myself. I am very interested in health and longevity, so I thought I’d incorporate what I consider the most important and well-founded dietary findings in my recipe. The final recipe (let’s call it “Goop”, like the food in The Matrix) will be quite cheap compared to regular food and include only natural vegetarian ingredients and a multi-vitamin (mostly for the vitamin D).

Nutrition is complicated. The most difficult part of any new dietary idea is that humans are extremely long lived, so it takes forever to verify the long-term effects. Also, most of us are not keen to be put in cages and fed on a rigorous schedule, so you never know quite what people really eat. You can probably tell a lot from mouse studies and you can monitor known health markers in humans, but there are no guarantees. I have not even tried this diet myself yet (though I will in just a few days). If you decide to give it a try, you do it completely at your own risk.

Even if you are not interested in my final recipe for a complete nutrition shake (or shake-like substance), you might find the general discussion on what to put in your body interesting.

A Better Algorithm for Backups and Rolling Logs

Feb 22nd, 2013 | Comments

In your /var/log/ you will most probably have logs that have grown too large and rolled over. Per default your system logger gzips and stores a few of the older ones and finally when they get too numerous, it just deletes them. Same thing for log handlers in most languages, for example Python’s RotatingFileHandler. Backups are usually also handled the same way, when you don’t want to store every backup.

That is almost never what I want. If I store only N backups, I don’t want them to be only the very latest. If they are close in time they will be more similar and in some sense contain less information than backups that are spread out over time. Of course the newer ones are probably on average more interesting to me, so I don’t want them evenly spread over time. The following simple algorithm is my suggestion for a better way to handle rotating logs and backups.

Pre-programming Mental Silence Meditation With Entrainment

Jan 24th, 2013 | Comments

I am not a dirty hippie. No, honestly, I’m not. I am, in fact, a grumpy, skeptical, philosophically materialist atheist. Yet, on and off for the last 15 years I have practiced meditation.

It feels good, helps me relax and it makes my mind feel clear, perceptive and sharp afterwards. It was only recently that I started doing it in conjunction with programming. The match was so good, that I felt the proselytic urge to share what I had found. My example is programming, but it can obviously be any intensive, possibly creative, mental task, such as writing, studying or playing Yahtzee.

Compile Time Loops in C++11 With Trampolines and Exponential Recursion

Nov 22nd, 2012 | Comments

Nowadays with constexpr, we can make pure functions that are calculated compile time. The restriction on these functions is that they may only consist of one statement (though additional lines with typedefs, static_assert, etc are OK), and this statement may only call other constexpr functions.

If-statements are easy - just use the ternary ?: operator or, if you want to show off, template specialization. But the only way to loop with these restrictions is by using recursion. Furthermore, neither clang 3.1 nor GCC 4.7 (nor the current build of GCC 4.8) support tail call elimination in these constexprs, so normal linear loops will still eat stack space if we loop for a while. Also, the standard recommends that the default maximum recursion depth should be 512, which means that if we want to do something silly/fun/interesting compile time, we have to mess with proprietary compiler switches to get the program to compile. No fun.

In this post, I show one way of working around those limitations.

C++11 and Boost - Succinct Like Python

Oct 31st, 2012 | Comments

C++11 is the new standard of C++ that was released last year. Yes, I know that is now 2012, but compilers are just now starting to catch up and implement everything, though AFAIK there is not yet a fully compliant compiler.

With a combination of C++11 and the Boost library, I think that it is possible to write code in a style that is almost as painless as in a modern dynamic language like Python. I also think that is not so well known how much C++ has changed for the better, outside the C++-community. Hence this post.

As an example, I have taken the first interesting excercise from Dive into Python, fileinfo.py, and converted it to C++, trying to remain as faithful as possible to the original code.

No, Really. Use Zsh.

Sep 28th, 2012 | Comments

Zsh is the new hotness. Well newer and hotter than Bash anyway, since the first version of Bash was released in June 1989, while the young and peppy Zsh was released in December 1990. In large parts thanks to the configuration “skin” oh-my-zsh, Zsh has gained a lot of popularity during the last year or so. I have used it for a few months myself and could not be happier, unless it produced chocolate ice cream ← note to shell developers.

This is a guide on why you need it and how you install, configure and use it. Sometimes just with links to the relevant sites.

Octopress and Github as a Blogging Platform

Sep 27th, 2012 | Comments

I have switched from Blogspot to Octopress. Any self respecting coder should realize that 1) blogging is text and 2) text should be in revision control. Also 3) blogging is public text, so the revision control can be on a public server, like Github.

Guerilla - My Attempt to Build a Strong AI

Nov 10th, 2011

Having given the topic of artificial general intelligence (or AGI) a fair amount of thought over the last decade, a plausible path forward, based on something like an encyclopedia for algorithms, has slowly congealed in my head. I need to write it down before I start coding. Preferably in the form of a blog post, to get feedback. This is that post.

Background

When I write about strong AI or AGI, I mean an algorithm with general problem solving skills. Not necessarily a mind inhabiting a robot, running around talking to people and passing the Turing test (though eventually a successful AGI could be taken in that direction), but rather something that can be applied to a wide variety of problems. For example: playing chess and poker, picking stocks, solving puzzles, proving math theorems, analyzing and writing computer code, speech and image recognition, improve itself by learning and self modification, etc.

There is an alluring approach to AGI, where you begin with a “simple” seed program, which will learn and self improve and eventually evolve to human intelligence and beyond. I think that in practice, the problem is that it has taken lots of people lots of time to invent all those algorithms that can be useful in general problem solving. Humans are pretty good at general problem solving - certainly much better than the best software/hardware combination we have today. Constructing algorithms to solve specific problems is in itself a kind of problem solving, and we humans have certainly invented many different algorithms for a wide variety of purposes. One might suspect that the computational depth of inventing/discovering algorithms is very large. So unless the seed program is actually very advanced, more like a full grown Sequoia than a seed, it might take too much time for it to invent all those algorithms and heuristics that we, as a civilization, already have.

Topics that are potentially useful to understand include:

Statistics (Bayes rule, distributions, Markov chains, running an experiment, etc).
Algorithms for optimizing parameters (genetic algorithms, simulated annealing, steepest descent, linear programming, random testing and purely analytical methods). In some situations it could take days or more to try a single parameter configuration. In other situations evaluating the fitness of a parameter configuration is just a couple of CPU instructions. Approaching these different tasks require a variety of methods
Logic
Basic mathematics (calculus, algebra, geometry, etc)
Code analysis (lambda calculus, etc)
Formal proof methods (knowledge of the methods listed here: http://en.wikipedia.org/wiki/Mathematical_proof) and formal reasoning
Tree and graph searching (depth-first, breadth-first, A*, beam, minimax, alpha-beta, Dijkstra)
Bayesian belief networks
Pattern recognition
Compression
Monte Carlo method
Clustering and classification
Fourier transforms, wavelets
Function approximation (analytical or with neural networks or genetic programming)
Inverting functions (in other words, given a program function and its output tell me what the input was - this turn out to be a very general way of posing questions)

…and of course many more.

I consider statistics and parameter optimization to be the most important areas for intelligence, since you need them to learn. Pattern recognition (perhaps implemented with statistics and optimization) and various forms of tree searching are also vital.

An encyclopedia of algorithms

My approach is based on implementing an encyclopedia of useful algorithms that:

Know to which tasks they can be applied
Can give a rough, initially often ridiculously rough, estimate of what the probability is that they solve the task after a certain time or, in the case of an open-ended task such as optimization, can give a rough estimate of how well the task is solved after a certain time.
Can continuously update the estimate as the task is solved

It is important to stress that it is not enough to just implement a library of algorithms that can work on the same datastructures. The important thing is that you need metadata, describing when an algorithm can be used and the algorithmic complexity in time and memory. With time, you want to automatically build up more knowledge of the algorithms, gradually improving the time and success estimations as well as improving your knowledge of which algorithms are suitable in which situations.

The algorithms should be broken up into as many natural subtasks as possible, so that when new algorithms are added to the system, they can try to solve these subtasks as well, thus creating new hybrid algorithms.

A task is basically a function call together with its arguments. An algorithm that can solve a task implements the corresponding function and a time/success estimator. Similarly to function overloading in C++, the function header might state specializations, additional properties, of the function arguments that must be true for the algorithm to be a contender to solve it. It is important that the Scheduler (see below) immediately knows which algorithms are suitable for a certain task, so the mentioned “additional argument properties” must be immediately available. If a certain property requires work to find out - “is the list in the first argument sorted?” - and an algorithm still needs it, a new algorithm can be constructed that first checks if the list is sorted and then either fails or asks the task again with the new property set. This new algorithm would have higher estimates of running time and lower estimates of success than the original algorithm.

The Scheduler

When a task is added to the task pool it always has a price attached to it. The Scheduler runs those algorithms that currently promise best expected price per time unit. Algorithms that need subtasks solved has to assign a price to those too, before adding them to the task pool. That price should reasonably reflect how much of the overall time the subtask is expected to take. If it turns out that a subtask consistently take a smaller or larger fraction of the estimated total time, there should be algorithms that modify the price for these subtasks and correspondingly the total time estimate (also, see Self Improvement below).

Open-ended tasks where something should be optimized cannot have just one value attached to them. Instead they need to have a function from achieved performance to price, or at least a rough mapping from some performance values to price. This mapping stops the system from optimizing for too long on a relatively unimportant subsubsubtask somewhere.

Algorithms that can either fail or succeed on a task need a similar mapping, where they give probability of success as a function of time.

One can also imagine that the algorithms could give a confidence interval or standard deviation on their estimates to tell the Scheduler how sure they are of their estimates, but I am not quite sure how this should be used, so for now they won’t.

For my first try, the Scheduler will use a simple heuristic. The algorithm that claims to have the best price / time ratio for any task currently in the task pool will get to run it. For one thread this will be optimal in some sense. It gets more complex when you have many algorithms running in parallel on multiple cores or even clusters. For example, you want to slightly punish two algorithms trying to complete the same task in parallel, since the first one to succeed will always make the other algorithms work moot. On the other hand sometimes it makes sense to attack an important problem from several angles, so you don’t want to forbid it entirely either.

In a later design, the Scheduler should be able to use the task pool to think about how it should Schedule. Obviously this must not end in an infinite Scheduling loop or general inefficiency, since normally the Scheduler must work very quickly.

The Scheduler’s work and indeed that of the whole system will not be especially interesting when there are only a few algorithms implemented. The first interesting moment will be when new hybrid algorithms emerge, where subtasks are sometimes handled by unanticipated algorithms. I am not sure how many algorithms needs to be implemented for the system to show interesting emergent behaviour. Probably more than ten, but less than a hundred, depending, of course, on which algorithms and what you count as an individual algorithm.

Self improvement

From the above, you can see that the system will not be self improving at first. However, by adding self improvement tasks, it will start doing things like improving the time estimates of the algorithms, learn to what degree one algorithm’s failure to solve a task should also reflect on the estimates of other algorithms, learn which situations are suitable for which algorithms; for example which algorithms perform well on the subtasks posted from a certain algorithm. It can also have an algorithm that constructs new lower-priced training tasks from real tasks, for example generalizations or specializations of a problem, just out of “curiosity”.

Producing new/improved code and algorithms, either for self improvement or as the solution for a puzzle or some other task, is among the most advanced tasks the system can try. It will not be able to do much of interest in this area until it is really strong, but it could start out by trying simple modifications of existing algorithms or trying them on similar tasks, a bit like in genetic programming.

The system is also inherently self improving from a sort of network effect, since for each algorithm added, the existing algorithms get potentially better.

What now?

When I have implemented the base system, I will start by applying the AGI to function inversion. Trivial stuff at first, of course, but I hope to eventually make it solve real puzzles like Towers of Hanoi by a combination of searching and deduction. Also, it would be fun to try some games and an NP-complete problem like 3-SAT.

It would be beautiful if the algorithms were written in the same simplified, purely functional (thus easier to analyze), LISP that I plan to write the problem definitions in. Alas, good AI needs to be fast and a 100x slower system just because the algorithms run in my own immature poorly interpreted language instead of C is not so fun. However, a good JIT compiler is a very good test for an AGI. You continuously have to weigh optimization time against running what you have. If the AGI in some distant future JITs its own code, effectively running and optimizing itself, I will consider the entire project a grand success :).

I forget why I called the project Guerilla. It was probably terribly clever. Nevertheless, here is the link to the Github repository: https://github.com/gurgeh/Guerilla. It does not contain much yet.

Comments

sm4096

Ai building efforts start at definitions: Ai that can
specify goals and weight them , acquire combine breakdown and refine strategy.

A strategy specifies goals, their desirability and at what likelihoods to take what actions on what (set of) conditions.

Devising strategies can be broken down into:
creating and assessing conditions for actions,
weight of goals, estimates of cost for actions,
estimates of effectiveness of actions, finding related strategies,
taking strategies apart,
combining strategies,
covering contingencies,
evaluating strategies

Jiri Jelinek

I would be interested to see the input from which this AI (when implemented) would be able to learn how to play the 5-in-a-row game.

Jiri Jelinek

This comment has been removed by the author.

Daivd

@acetoline No, the project is not abandoned, but thanks for asking :). I tend to post infrequent, overambitiously long posts, so a few weeks silence is normal.

The reason the github activity is low is more silly. I am currently in something between the design and implementation stage, writing Python code with a few pseudocode elements and a lot of prose. For some reason, I have not considered this semi-code "commit-worthy".

I promise a github update this week.

acetoline

Hi, I noticed there hasn't been any activity on your blog or github lately. I hope you haven't abandoned the project.

Jiri Jelinek

Doesn't sound like a well scalable solution. Don't get overexcited/misled after some early luck in well defined toy worlds. With teaching by manual algorithm entry by techies, you aren't gonna get very far.

Daivd

Hopefully, yes, it should be able to solve general problems using more specialized algorithms working together. It will not, however, take a set of specialized algorithms (let's say playing chess, checkers, poker and backgammon) and produce a general game playing algorithm. That is not how it achieves generality.

It is geared towards very technical users. It takes input tasks as snippets of code and gives a set of inputs that makes the function output true. This is called function inversion and is a fairly simple way of describing puzzles and technical problems.

If it turns out to be a useful system for solving these types of tasks (a big IF - no one has really been able to achieve that). It would be a very good base on which to build something that can communicate with non-technical users and interact with our fuzzy world. That is not it's primary purpose, though.

Jiri Jelinek

Can this 'AGI' generate general algorithms from a set of relevant non-general algorithms? Will non-technical users be able to teach this AI by describing specific (/non-general) scenarios?

Daivd

@Jiri Swedes, Norwegians, Danes and many Finns can read Swedish. That makes up a good 0.3% of the earth population :).

Actually, I will remove that. That source code is not for human consumption yet. It is just test cases for analyzing source code, written in an odd Lisp dialect. No actual code relating to implementing either any of the algorithms I write about or the Scheduler.

Mentifex

One way to build a strong AI is outlined in the http://mind.sourceforge.net/aisteps.html and develops into a simple but gradually expandable AI Mind.

Jiri Jelinek

Don't use Swedish in the source, man! 'Nobody' can read that ;-)

Compression, Prediction and Artificial Intelligence

Dec 28th, 2010

Compression is one of the most powerful concepts in computing. For the normal computer user, compression is associated with making files smaller, so you can store more of them or send them faster over the Internet, but there is much more to it.

An optimal compressor can (or could, if it existed) be used for prediction of the future of a sequence of events (weather, sports, stocks, political events, etc), for example by trying all possible continuations and examine how well it compresses given the history. Conversely, an optimal predictor that gives the correct probability of each possible next symbol can be used for optimal compression by using arithmetic coding.

Compression and prediction

This section on background theory contains possibly scary math and dense prose, but should be understandable for most programmers. Maybe re-read the sentences a couple of times.

Ray Solomonoff has shown [PDF] that if we let Sk be the infinite set of all programs for a machine M, such that M(Sk) gives an output with X as prefix (i.e the first bits of the output is X), then the probability of X becomes the sum of the probabilities of all of its programs, where the probability of a program is 2 ** (-|Sk|) if |Sk| is the length of the program in bits and “**” means “to the power of”. As X gets longer, the error of the predictions approach zero, if the error is calculated as the total squared probability difference.

A technicality is that only those programs count, that does not still produce X, when the last bit of the program is removed.

To give a slightly more concrete example, say that you have a sequence of events - a history - and encode those as a sequence of symbols, X. Let us further say that you have a machine, M, that can read a program S and output a sequence of symbols. If you have no further information on your sequence of events, then the best estimate for the probability of a symbol Z to occur next (i.e the best prediction) is given by the set of all programs that output your history X followed by Z. Programs which output X+Z and are short are weighted higher (the 2 ** (-|Sk|) part).

Even more concretely, given the binary sequence 101010101, you wonder what the probability is that the next bit will be 0 given that you know nothing else of this sequence. Sum 2 ** (-program length) for all programs that output 1010101010 vs those that output 1010101011 as their first bits (they are allowed to continue outputting stuff). If we call these sums sum0 and sum1 respectively, then the probability of 0 coming next is sum0 / (sum0 + sum1) and the probability of 1 coming next is sum1 / (sum0 + sum1).

Obviously you cannot find all these programs by just trying every possible program, because 1) they are infinitely many and 2) given the halting problem you cannot in general know if a running program is in an infinite loop or if it will eventually output X.

There is an area of probability theory called Minimum description length, where the language is chosen to be so simple (not Turing complete) so that you can actually find the shortest program or “description”. Calculating probabilities this way is very similar to Bayesian probability, but more general.

Solomonoff in (almost) practice

Although the point of the theorem is not to apply it directly in practice, for short sequences X we can actually try. We can avoid problem 1 above, there are infinitely many programs, by generating random programs and see if they produce X. If they do, we count them. This way we can produce an approximation of what the true sum0 and sum1 are. If we set up our random generation such that shorter programs are more likely, then we don’t have to bother with the 2 ** (-program length) part and may just have a running count for each sum. If the sequence is too long, this method will be impractical since almost no randomly generated programs will actually output X.

Problem 2 above, when we test programs they may not halt, is harder, but Levin has proposed a way around it. If we in addition to program length use running time (number of instructions executed) as a measure of the probability of our program, we can start by generating all the programs that we intend to test and then run them all in parallel. As our execution moves forward, we will get an increasingly accurate approximation of the true sum0 and sum1, without getting stuck on infinite loops.

If we want to get even more practical, it can be shown that the shortest program that produces X will generally dominate the others and thus it will predict the most likely next symbol. That way you can just search for programs that output X and the currently shortest program will be your best guess. Since we no longer care about the relative probability of the next symbol, but only which is most likely, the search does not have to be random. Thus we can use any method we like for finding a short program. If you search for programs that produce X and find one that almost does, you can construct a “true” solution from that one, by constructing a prefix part that hard codes the places where your original program is wrong. This will produce a longer program, where the length, and thus the “score”, is the length of the prefix + the length of your faulty solution. The size of the shortest program that outputs X is called the Kolmogorov complexity of X. The size of the shortest program that outputs X, measured in program size + log(running time), is called the Levin complexity of X.

One way to find these programs is to use genetic programming, just take care that you don’t think that you can count the number of programs that produce X and get relative probabilities, because your search will now be skewed towards the solution (and thus its prediction) that you find first.

A small problem is that depending on what machine you choose, i.e which instructions your programs can use and the length of these instructions, you will get different results. The method has a built in bias, since there is no one correct Turing complete language. This difference will however be smaller as X gets longer. One way to understand that is to note that any Turing complete language can emulate any other Turing complete language and that the size of such an emulator is finite. This is called the compiler theorem.

Compression is understanding and the Hutter Prize

When we understand something, we can describe it succinctly. If I have an image of a red perfect circle, the size will be much larger if I describe the individual pixels rather than just say “a red circle of diameter d and thickness t”. When I understand what the image is depicting, I can describe it shorter. Sometimes a lossy compression of observed data will actually express the truth better than the exact data. If I take a photo of a red circle, the photo will probably not be perfect, but if I notice what the photo is showing, I can compress it as “a red circle” and some noise which I throw away, and suddenly my lossy compression is a better depiction of the truth.

This equivalence between compression and general intelligence led Marcus Hutter to announce the Hutter Prize, where money is awarded for the best compression of 100 megabyte of English Wikipedia articles. So far the compression algorithms have been impressive (compressing the text to about 15%), but not shown much intelligence or understanding of the articles. When they do start to exhibit some understanding, I think that if they are allowed to compress the data in a slightly lossy way, the first thing that will happen is that some spelling and layout mistakes will be corrected, because these will be surprising to the compressor and thus demand an unusually long representation.

Matt Mahoney has written a good rationale on the Hutter Prize here.

Compression in practice - the juicy stuff

Compression is a powerful tool to measure success and avoid overfitting in a variety of common AI problems. The methods I laid out here are interesting mostly from a theoretical perspective, because of their prohibitively long running times. In my next post, I will expand on my thoughts on how you can use these results to get actual, practical algorithms for common AI problems.

← Older Blog Archives

IFHO

In Fendrich's Humble Opinion

Big Programming, Small Programming

A Better Soylent - Designing a Simple, “Optimal” Nutrition Shake

A Better Algorithm for Backups and Rolling Logs

Pre-programming Mental Silence Meditation With Entrainment

Compile Time Loops in C++11 With Trampolines and Exponential Recursion

C++11 and Boost - Succinct Like Python

No, Really. Use Zsh.

Octopress and Github as a Blogging Platform

Guerilla - My Attempt to Build a Strong AI

Comments

Compression, Prediction and Artificial Intelligence