Easy regular expressions

In the bad old days of computer programming you had no choice but to write programs in the native language of whatever machine you had handy. Writing software this way was problematic: early programmers spent most of their creative energies worrying about the physical implementation of the machine and its limits, instead of addressing the problem they were trying to solve. This  code was hard to read and understand; code that describes in great detail how to, for example, add two numbers, might make it hard to understand what we are doing with these numbers. The reader was forced to tease from idiom and convention what it is you acutally wanted done when you asked the computer to do what it did.   The programmer couldn’t get started without intimate knowledge of the machine’s physical implementation.

Lastly, code written in the machine’s native language was frightfully difficult to run elsewhere; in the 1940’s and 1950’s computer engineers were preoccupited with getting the computer to work at all that they didn’t give much thought to interoperability.

In 1953 the development of Fortran changed this; we suddenly had a well specified language, performant language to decribe what you wanted done, instead of how to do it. In the decades that followed programming languages have evolved and grown to the rich collection of ecosystems we enjoy today.

Except for regular expressions: for some reason when we sit down to write programs to match text, the facilities offered by most programming languages are throw backs to the 40’s. Instead of writing our regular expression in a high level language that affords us compostability and compartmentalization, we instead write sequences of characters to feed into a regular expression engine’s byte code interpreter. We’re basically writing machine code for a software implemented machine, and not all environments support exactly the same instruction sequence.

The irony here is a regular expression implements a simpiler class of machine: regular expressions are imoplementable in a “finite state autonoma”; the “machine” you describe in code has a limited number of states. In contrast, conventional programs implement a more powerful class of computation device: the turing machine. The difference lies in the ability to request more memory; a conventional computer program can request arbitrarily large amounts of memory upto the physical capacity of the computer. A regular expression can’t, its memory capacity is strictly defined at startup and never changes.

You’d think that because regular expressions implement a simpler class of computational device, we’d have a rich and easy to use language to describe them. Turns out the opposite is true: because we can almost, sort of, barely, implement regular expressions that are good enough in assembly, we never quite reached that point where we collectively decided we needed something better.

Well, I disagree. As regular expressions are commonly written today, many things are difficult. We can’t easily reuse components between regular expressions. We can’t unit-test subsections of a regular expression. And writing regular expressions at the “machine code” level makes higher level, more abstract, optimizations really hard.

While working for Peloton I created easyre to make regular expression construction, manipulation, composure and testing a whole lot easier.  At Peloton we found it useful, and I hope you do too.