How I implemented a parser combinator library in Lisp

8 May 2023 , by Henry Steere

Parsers make sense of raw input by turning it into the data structures of your domain, but parsing input can be hard. Fortunately, parser combinators make it easier. After enjoying the parser combinators in Haskell, I decided to implement my own parser combinator library in Lisp. Here’s how I did it.


What are parser combinators and how do they help?

Parsers are helpful for various tasks such as:

  • Making sense of user input
  • Deserialising data
  • Implementing domain-specific languages

When writing a parser from scratch, you need to read and buffer input, handle parse errors, and backtrack. This is low-level, error-prone work that requires a lot of boilerplate code. It distracts you from the structure of the data you are parsing and can be a challenge to get right. Using a parser combinator library is much easier because it abstracts away the process of reading data from an input source and lets you focus on how combinations of tokens in the stream are interpreted.

Parser combinator libraries do this by providing a collection of built-in parsers and combinators, where:

  • The parsers are small generic components for matching input.
  • The combinators are functions for combining parsers to make more complicated parsers suited to a particular use case.

You can use the built-in parsers and combinators to implement your own parser and then use the library to run your parser on the input.
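To make this division of labour concrete, here is a minimal sketch in Common Lisp. It is not taken from the author's library: it simply models a parser as a function from a string and a start index to a result and the next index, with failure signalled by returning no next index.

```lisp
;; A simplified model of how parsers and combinators fit together -- an
;; illustration, not the author's library. A parser is a function that
;; takes a string and a start index and returns (values result next-index)
;; on success, or (values nil nil) on failure.

(defun char-parser (c)
  "Built-in parser: match the single character C."
  (lambda (input pos)
    (if (and (< pos (length input)) (char= (char input pos) c))
        (values c (1+ pos))
        (values nil nil))))

(defun then (p q)
  "Combinator: run P, then run Q on the remaining input; return Q's result."
  (lambda (input pos)
    (multiple-value-bind (r1 pos1) (funcall p input pos)
      (declare (ignore r1))
      (if pos1
          (funcall q input pos1)
          (values nil nil)))))

;; Example: a parser for the two-character sequence "ab".
(funcall (then (char-parser #\a) (char-parser #\b)) "abc" 0)
;; => #\b, 2
```

The combinator never touches the input itself; it only wires together the two smaller parsers, which is what makes this style compose so well.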

Why I implemented my own parser combinator library in Lisp

I wrote a combinator parser library in Lisp because I wanted to understand how these libraries work. I’ve written parsers from scratch before, and the process can be tedious. I like how parser combinator libraries hide the low-level details, but because they are so high-level and abstract, their internal workings are not obvious.

Since writing my own library, I’ve gained a better appreciation for what goes into a combinator parser library. Also, because a combinator parser library is a domain-specific language designed for creating parsers, I got practice with Lisp’s macro system, which is useful for implementing domain-specific languages.

While there were already combinator parser libraries written in Lisp, I added some features to my library that I didn’t find in the others. For example, my library lets me use map and flatmap to create new parsers from existing ones. I think this makes it particularly convenient and modular.

Examples of parsers and combinators from my library

The built-in parsers and combinators in my Lisp library are inspired by the Parsec library from Haskell. An example of a built-in parser is many. The many function uses a predicate to create a parser for a sequence of zero or more characters which match the predicate. This parser matches zero or more numeric characters:

(many #'digit-char-p)
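Using the same simplified representation as before (a parser maps a string and an index to a result and a next index), a MANY-style parser could be sketched like this. This is an illustration of the idea, not the library's actual code:

```lisp
;; Sketch of a MANY-style built-in parser: match zero or more characters
;; satisfying PREDICATE and return them as a string. This is a simplified
;; model, not the library's implementation.

(defun many (predicate)
  "Return a parser matching zero or more characters satisfying PREDICATE."
  (lambda (input pos)
    (let ((end pos))
      (loop while (and (< end (length input))
                       (funcall predicate (char input end)))
            do (incf end))
      (values (subseq input pos end) end))))

(funcall (many #'digit-char-p) "123abc" 0)  ; => "123", 3
```

Note that matching zero characters still counts as success, which is what lets many-based parsers compose without consuming anything when nothing matches.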

My library also contains combinators for making new parsers by combining existing ones. An example of this is repeated, which creates a parser that applies another parser zero or more times to the input. In other words, if you had a parser that could handle a single row in a CSV file, then this parser would be able to handle all of the rows.

(repeated csv-row)

In addition to parsers and combinators, there are two important control structures in my library called orp and sequential.

  • The orp control structure allows branching with backtracking.
  • The sequential control structure lets you apply parsers in order and bind the results of incremental parsing.
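The branching-with-backtracking behaviour of an orp-style construct can be modelled in a few lines under the same simplified parser representation. This is an illustration of the semantics, not the library's implementation (which uses macros and a real input stream):

```lisp
;; Sketch of ORP-style branching with backtracking: each alternative is
;; tried from the same saved position, and the first success wins. An
;; illustration only, not the library's code.

(defun lit (c)
  "Demo helper: a parser matching the single character C."
  (lambda (input pos)
    (if (and (< pos (length input)) (char= (char input pos) c))
        (values c (1+ pos))
        (values nil nil))))

(defun orp* (&rest parsers)
  "Try PARSERS in order at the same position; return the first success."
  (lambda (input pos)
    (dolist (p parsers (values nil nil))
      (multiple-value-bind (result next) (funcall p input pos)
        (when next
          (return (values result next)))))))

;; The first branch fails, so the parser backtracks and tries the second.
(funcall (orp* (lit #\x) (lit #\a)) "abc" 0)  ; => #\a, 1
```

Because every alternative starts from the same position, a failed branch leaves no trace; that is the essence of backtracking. A sequential-style construct is the complementary operation: it threads the next-index from one parser into the following one, binding intermediate results along the way.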

Application of my library: Declarative JSON parsing

As an application of my library, I decided to write a parser for JSON. Because the details of processing a stream of characters are abstracted, defining the parser is high-level and declarative. For example, this is how I define the parser for a JSON value:

(defun json ()
  (sequential (_ (ignore-whitespace))
              (result (orp (json-string)
                           (json-number)
                           (json-object)
                           (json-array)
                           (json-boolean)
                           (json-null)))
              result))

This can be read as:

  • First, ignore any leading whitespace
  • Then try each of the possible cases for a JSON value and return the first match as the result

I defined separate parsers for each possible case of a JSON value in the JSON spec. Here is what the parser for a JSON object looks like:

(defun json-object ()
  (sequential (_ (ignore-whitespace))
              (_ (char1 #\{))
              (_ (ignore-whitespace))
              (keys-and-values (sep-by (json-key-value) (char1 #\,)))
              (_ (ignore-whitespace))
              (_ (char1 #\}))
              (alist-to-hash-table keys-and-values)))

It ignores leading whitespace, reads an opening curly bracket, ignores more whitespace, parses keys and values in the body of the JSON object, ignores any more whitespace and then parses a closing curly bracket. The keys and values are converted into a Lisp hash table.
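The alist-to-hash-table helper appears in the parser above but is not shown in the article; a plausible definition (this exact code is an assumption, not the author's) is just a fold over the association list:

```lisp
;; Plausible sketch of the ALIST-TO-HASH-TABLE helper used by JSON-OBJECT.
;; The name comes from the article; this definition is an assumption.

(defun alist-to-hash-table (alist)
  "Build a hash table from an association list of (key . value) pairs."
  (let ((table (make-hash-table :test #'equal)))
    (dolist (pair alist table)
      (setf (gethash (car pair) table) (cdr pair)))))

(gethash "name" (alist-to-hash-table '(("name" . "Ada"))))  ; => "Ada", T
```

An equal hash table is needed here because JSON keys are strings, which eql-based tables would not compare by content.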

The combinators and control structures give me an expressive parsing language. This lets me define how a parser should match characters but doesn’t specify how it is executed. To execute a parser on input and get the results, I needed to add a runtime that could process a stream of characters.

Parsing the stream and tweaking performance

The stream processor reads and buffers an input stream and applies a parser to the characters in the stream. When a branch in an orp expression fails, the parser backtracks and tries the next branch.

First implementation using a lazy list of characters

My first implementation of the stream processor used a lazy list. In this approach:

  1. When the parser tries to consume input, it checks if there are any characters available in the list.
  2. If there aren’t, it reads from the stream and adds a character to the end of the list.
  3. When the parser tries a branch in an orp it remembers where it is in the list.
  4. If it fails, then it backtracks by returning to that point in the list and trying again with another branch.
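The steps above can be sketched as a thunk-based lazy list. This is a simplified model of the approach (reading from an ordinary character stream), not the library's actual implementation:

```lisp
;; Sketch of the lazy-list stream processor: each cell holds one character
;; and a thunk that reads the next cell on demand. Forced cells are cached,
;; so holding a reference to a cell is enough to backtrack to it.
;; Illustrative only, not the library's code.

(defstruct lazy-cell char next-thunk next)

(defun lazy-chars (stream)
  "Return the first lazy cell over STREAM's characters, or NIL at EOF."
  (let ((c (read-char stream nil nil)))
    (when c
      (make-lazy-cell :char c
                      :next-thunk (lambda () (lazy-chars stream))))))

(defun cell-next (cell)
  "Force the next cell, caching it so backtracking can revisit it."
  (or (lazy-cell-next cell)
      (setf (lazy-cell-next cell)
            (funcall (lazy-cell-next-thunk cell)))))

(let ((head (lazy-chars (make-string-input-stream "ab"))))
  (list (lazy-cell-char head)
        (lazy-cell-char (cell-next head))))
;; => (#\a #\b)
```

Backtracking falls out for free: the parser keeps a reference to the cell where a branch started, and on failure it simply resumes from that cell.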

The lazy list was easy to implement, but it used a lot of memory because each character needs a pointer to the next element in the list. For ASCII text, a 64-bit pointer is eight times bigger than the single byte each character occupies in UTF-8.

All this allocation also meant a lot of time spent in garbage collection, which slowed down the parser.

Switching to a list of buffers

To improve the speed of the parser, I switched from a simple linked list to a list of character buffers. In this approach, there are lots of characters in each buffer, so the overhead of pointers is small. A contiguous buffer is also cache-friendly.
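The chunked approach can be sketched by swapping the one-character cells for buffer-sized cells. Again this is an illustration of the idea, not the library's code; the chunk size here is an arbitrary choice:

```lisp
;; Sketch of the list-of-buffers stream processor: characters are read in
;; blocks, so the per-cell pointer overhead is amortised over thousands of
;; characters, and each buffer is contiguous and cache-friendly.
;; Illustrative only; the 4096 chunk size is an arbitrary assumption.

(defstruct chunk text next-thunk next)

(defun chunked-chars (stream &optional (size 4096))
  "Return the first chunk of STREAM's characters, or NIL at EOF."
  (let* ((buffer (make-string size))
         (n (read-sequence buffer stream)))
    (when (plusp n)
      (make-chunk :text (subseq buffer 0 n)
                  :next-thunk (lambda () (chunked-chars stream size))))))

(chunk-text (chunked-chars (make-string-input-stream "hello")))  ; => "hello"
```

A backtracking position is now a (chunk, index-within-chunk) pair rather than a list cell, but the caching trick from the lazy list carries over unchanged.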

Modifying orp and sequential

Another thing that hurt performance in my first implementation was constructing parsers, because functions that create component parsers were being repeatedly invoked in the sequential and orp blocks. This took time and also resulted in memory overhead because the parser functions create closures which are allocated on the heap at runtime.

To solve this problem, I modified orp and sequential to cache parsers after the first time they were constructed. Lisp’s macro system allowed me to make these modifications without affecting how I used orp and sequential.
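The caching idea can be sketched with a small macro: the expansion closes over a cache cell, so the parser-building form runs only on first use. This is an illustration of the technique, not the author's orp/sequential macros, which do considerably more:

```lisp
;; Sketch of macro-based parser caching: BUILD-FORM is evaluated once, on
;; first use, and the constructed parser is reused afterwards. Illustrative
;; only; the real ORP/SEQUENTIAL macros are more involved.

(defmacro cached-parser (build-form)
  "Evaluate BUILD-FORM once, on first use, and reuse the result after that."
  (let ((cache (gensym "CACHE")))
    `(let ((,cache nil))
       (lambda ()
         (or ,cache (setf ,cache ,build-form))))))

(defvar *builds* 0)
(defvar *get-parser*
  (cached-parser (progn (incf *builds*) (lambda (input) input))))

(funcall *get-parser*)   ; first call constructs the parser
(funcall *get-parser*)   ; second call returns the cached one
*builds*                 ; => 1
```

Because the change lives entirely inside the macro expansion, call sites are untouched, which is exactly why the author could make this optimisation without affecting how orp and sequential are used.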

Final result: benchmarking my JSON parser

After these improvements, I decided to benchmark my JSON parser against the JSON parser that comes built into Ruby. I ran both parsers on a large JSON file and found that my parser ran about three times slower than the Ruby parser. I’m happy with this result because the Ruby parser is handwritten, carefully optimised C and very fast, while I wrote a straightforward declarative parser in about a hundred lines of Lisp.

To improve performance, I could provide the option to modify the backtracking strategy. In some cases, it doesn’t make sense to backtrack because only one case in a branch can match. I could modify orp to disable backtracking in these situations. This could also help with memory usage.

A deeper understanding of parser combinators with help from Lisp

Implementing a parser combinator library gave me a deeper understanding of how these libraries work and an even greater appreciation for how they make parsing easy and declarative. It has also made it clearer how to use them and what I can do with them.

As for Lisp, its macro system let me easily add control structures like orp and sequential, and it enabled me to optimise the implementation without compromising the ergonomics of my library. Optimising the library also gave me practice with Lisp’s profiling tools, which I used to investigate how the runtime was allocating memory and to see where I could improve performance.


Henry is a Scala engineer at GrapheneDB where he uses functional programming to manage graph databases in the cloud. He loves functional programming because it makes code robust and easy to reason about.
