Lua Pattern Matching for Go

Introduction

In October 2009, Google released a new programming language called Go. This post isn't meant to be an introduction to the language, since those can be found on all corners of the internet. Suffice it to say that it's a statically compiled systems programming language with primitive support for message passing concurrency, which is what drew me to it for my research. Over the course of the past few months, I've implemented a number of programs in Go and I've found it to be a really interesting language. I definitely enjoy being challenged to think of problems differently, and I've found that the solutions are very elegant and simple to understand.

What's missing

If you work with Go for any extended period of time you may find that there is a lack of text processing, specifically full regular expression support. Since I very rarely use regular expressions, I never though this would be a very large issue. Then it came time to implement a data scraper using Go, using the concurrency to fetch webpages in bulk and process them as they arrive. The parser/scraper that I already have is written in Lua and it's fantastic for this sort of work, since the pattern matching language is fast, small and powerful. Unfortunately doing anything multi-threaded in Lua is a bit of an exercise is masochism. It took me less than 15 minutes to get the general structure of my new Go scraper up and running, fetching webpages and processing them as they arrive, and joining at the end to ensure all requests finish. But when it came time to do the actual text processing, I hit a wall.

I've been destroyed by Lua pattern matching, especially the support for non-greedy matches. Unlike regular expressions, where you're normally limited to + and * for repetition, in Lua you can use - to specify 0 or more repetitions non-greedily. This makes things like parsing HTML and Javascript very very easy.

Read the go-nuts mailing list and search for non-greedy matching. You'll see time and time again that people suggest the 'real' solution to the problem is to write an HTML parser and then use that. Unfortunately MUCH of the code that I need to parse in this situation is actually a mix between Javascript and HTML and a parser just wouldn't cut it for me. I've got a quick dirty easily maintained parser already written and I really want to be able to use the bulk of that logic in the Go version of the same.

Initial implementation of go-luapatterns

Over the course of two days without reliable internet access I started converting the Lua pattern matching code (under 800 lines of C code) to Go. Unfortunately the pattern matching package makes heavy use of pointer arithmetic, something we simply don't have access to in Go. Since I didn't quite understand how the code was working, I wanted to keep as close to a line-by-line port as possible, so I took the same approach as the Kahlua project and implemented a string pointer type that I could them use. Even with this I had to make some changes, since pointers in C aren't mutable when passed to different functions, whereas my string pointer type I implemented was.

This first version didn't take me terribly long, and much of that time was spent debugging issues relating to my translation and machinery around the library. I was able to use the test suite from the Lua distribution to test the compliance of my implementation and it seemed to work quite well. There are a lot of differences between Lua pattern matching in Lua and in Go, such as the returns from string.Find, but the pattern language itself worked. Unfortunately it was slow, as in about 57 times slower than an equivalent Lua program. Given that Lua is interpreted and Go is compiled to native code, this seemed to be too big of a difference. I knew that at some point performance was going to bite me, especially when using the non-greedy features that caused me to write the library in the first place, so I tried to identify the issues I was experiencing.

Every time I needed to move to the next character in a string, I needed to invoke a method on my string pointer type, which incremeneted an index variable and then returned the new index. Every operation I wanted to do required either a field index or a method call, and these were happening very frequently. After the initial release of the package, someone asked why I didn't use byte slices, since they're native and seem to be the right choice. Obviously that would have been great, but at the time my understanding of the package wasn't high enough to do a higher-level translation like that.

An idiomatic version of go-luapatterns

Over the past four days I have been able to successfully rewrite the pattern matching package to use byte slices natively. In a few places this is a bit awkward, as we need to step backwards through a string, but in each case I was able to get the right behavior without needing to resort to any hackery. The hardest part of the port was figuring out precisely what the pointer arithmetic was doing throughout the library. Very frequently two char pointers were compared to ensure one didn't run into the other. In clearer terms, what the code was really trying to do is ensure that one string was longer than the empty string. The particular details of each of these comparisons was a major source of problems, but I believe I've gotten most of them right.

The result is a very simple pattern matching for Go that uses the Lua pattern matching language to support text processing. The interface is fairly simple:

  • func Match(s, p string) (bool, []string)
  • func MatchBytes(s, p []byte) (bool, [][]byte)
  • func Find(s, p string, plain bool) (bool, int, int, []string)
  • func FindBytes(s, p []byte, plain bool) (bool, int, int, [][]byte)
  • func Gmatch(s, p string) chan []string
  • func GmatchBytes(s, p []byte) chan [][]byte
  • func Replace(s, p, repl string, max int) (string, int)
  • func ReplaceBytes(s, p, repl []byte, max int) ([]byte, int)

The match functions simply take a string and a pattern and return whether or not the string matches the pattern, and a list of captures. If no captures are specified in the pattern, the entire portion of the source string that matches the pattern is returned as a single capture. The find functions do the same thing, only they return the numeric indices in the source string where the match is found. These returns are such that if you take a slice from the start index to the end index of the source array, this will be precisely the text that was matched. The gmatch function returns a channel that can be used as an iterator over the matches of p in s. For example you can use the following to print each word in a source string:

   for word := range pattern.Gmatch("This is a string", "%w+") {
        fmt.Printf("%s", word)
    }

The replace functions are implementations of gsub in this new style, allowing you to make replacements in a source string according to some pattern, and allowing for referencing captures in the replacement string.

Comparisons

I've done some basic performance comparisons and in what I believe to be the worst case for the built-in regexp package, and the luapatterns package is roughly 15 times faster than the regexp package. It is still about 3 times slower than the Lua interpreter for the same task, but I don't have any good leads on where I might increase performance. I'm certainloy open to suggestions if anyone has them.

The source for go-luapatterns can be found on Github

 

Projects

About

James N. Whitehead II is an author, computer scientist who is currently studying for his DPhil/PhD at the Oxford University Computing Laboratory. This blog is a collection of his thoughts, projects, snippets, photos and any other bits that come along.