“Automating Newton”: Symbolic regression

The Heliocentric Revolution is arguably one of the greatest developments in the history of science. It exemplifies the three key elements that lead to scientific discoveries: data acquisition (Brahe), data analysis (Kepler), and derivation from first principles (Newton).

The plummeting cost of sensors, computing resources, and data storage has made an immense amount of data available. This has paved the way to “automating Kepler”. Indeed, big data analytics follows the same brute-force approach used by Kepler, the only difference being that it is much faster. Machine learning currently provides the tools of choice for “automating Newton”. This powerful approach permits one to extract patterns and gain insights from seemingly random data sets.

Symbolic regression is a machine learning technique from the field of evolutionary computation which is especially interesting in that it can extract a mathematical model from a given dataset. In conventional regression techniques, one optimizes the parameters of a model that is provided as a starting point to the algorithm; in symbolic regression, no such a priori assumption is required. Both the functional form of the model and the model’s parameters are optimized at the same time. The basic idea is quite simple. First, one identifies a mathematical expression space containing candidate function building blocks, e.g. mathematical operators, analytic functions, and so forth. Then, symbolic regression searches through the space spanned by these building blocks to find the function that best fits the data. The capability of distilling free-form natural laws from experimental data is very well described in this paper.
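The two steps above (choose building blocks, then search the space they span) can be illustrated with a toy sketch. Note the assumptions: real symbolic regression typically uses an evolutionary search and also tunes numerical constants, whereas this sketch simply enumerates every small expression exhaustively and allows only the constant 1. The building blocks, the helper names, and the synthetic data are all made up for illustration.

```python
import itertools

# Building blocks (an illustrative choice): three binary operators, two terminals.
BINARY_OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}
TERMINALS = ["x", "1"]


def expressions(depth):
    """Yield every expression tree up to `depth`, as nested tuples (op, left, right)."""
    if depth == 0:
        yield from TERMINALS
        return
    smaller = list(expressions(depth - 1))
    yield from smaller
    for op in BINARY_OPS:
        for left, right in itertools.product(smaller, repeat=2):
            yield (op, left, right)


def evaluate(expr, x):
    """Evaluate an expression tree at the point x."""
    if expr == "x":
        return x
    if expr == "1":
        return 1.0
    op, left, right = expr
    return BINARY_OPS[op](evaluate(left, x), evaluate(right, x))


def size(expr):
    """Number of nodes in the tree, used to prefer simpler models."""
    if isinstance(expr, str):
        return 1
    return 1 + size(expr[1]) + size(expr[2])


def mse(expr, xs, ys):
    """Mean squared error of the expression on the dataset."""
    return sum((evaluate(expr, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def to_string(expr):
    """Render an expression tree as a fully parenthesized formula."""
    if isinstance(expr, str):
        return expr
    op, left, right = expr
    return f"({to_string(left)} {op} {to_string(right)})"


def symbolic_regression(xs, ys, max_depth=2):
    """Exhaustive search (not evolutionary): lowest error first, then smallest tree."""
    return min(expressions(max_depth), key=lambda e: (mse(e, xs, ys), size(e)))


# Synthetic "measurements" generated from the hidden law y = x^2 + x.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [x * x + x for x in xs]

best = symbolic_regression(xs, ys)
print(to_string(best), "with error", mse(best, xs, ys))
```

The search is never told that the data follow a quadratic law; it only receives the operators and terminals. Exhaustive enumeration grows explosively with tree depth, which is one reason practical tools rely on evolutionary search instead.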

As amazing as these techniques are, one should be well aware of their potential limits. These are discussed in detail in the paper pointed out in the recent post by Shiwani. The formidable story of Deep Thought from “The Hitchhiker’s Guide to the Galaxy” offers an alternative, more humorous illustration …