background

Regular Expressions For Non-Text

In the last few days, I’ve returned to a long running annoyance of mine. And it’s this: why does there seem to be a complete absence of libraries that provide the power of regular expressions for non-text? The way I see it, regular expressions have the following key properties:

  1. The input data is a linear, 1 dimensional list
  2. A part of the list that matches some pattern needs to be found
  3. The pattern itself is input data, and does not need to be hard coded into the program
  4. The pattern can be complex, with embedded structure (sub-matches), alternatives, capturing and backreferences

The most common problem in this area is finding substrings, which is why there are lots of libraries for text matching. And there is some value in specialising for this case, because specialising can improve performance. But there seems to be a real dearth of any general purpose, non-specialised libraries for this.

Of course, my desire for this comes partly from the specific problems which I encountered while writing my sound change applier in Python. The solution I ended up adopting was to translate the phones into a textual representation and use traditional regular expressions, but this solution is far from ideal. The resulting, automatically generated regular expressions are a nightmare to read, and because the phone objects have more complex internal structures than characters it is difficult to achieve the level of flexibility I would like with this approach.

So I decided to write my own set of functions in Haskell that matches patterns in a standard format against a list of arbitrary data. The first question is then, what data is specific and needs to be provided by the user?

  1. a type that represents an individual ‘character’ in the string or list
  2. a type that contains the data required to match against an individual list element
  3. a function that checks if a ‘character’ meets a set of match criteria

For example, in the case of text the individual character type would be Char, the match type might be a set or list of valid Chars, and the function would check if the character exists in the set.

Now what needs to be common?

  1. the format of the ‘regular expression’. I’ve declared a Pattern type which forms a tree containing repetitions nodes, capture nodes, matching nodes and so on. The type is parametrised so that it can contain matching criteria of arbitrary (but consistent) type.
  2. the functions which walk the tree and match the pattern, binding captures in the process. These functions sit inside a type class.

The end result of this is that you can match arbitrary data by creating two datatypes and writing one function inside a type class instance defining what counts as a match. The only downside is a parse tree for the pattern is obviously not a format that the user wants to enter directly, so an additional parsing/translation layer is necessary to translate from something like ”.*s” to a pattern that my mini-library understands.

I’m sure my implementation is probably slow compared to a more sophisticated traditional regex engine. It is a very naive backtracker, which has associated costs, and its lack of specialisation probably also counts against it when it comes to performance. But the sequences I’m trying to match are reasonably short, and it has the advantage that matching of non-textual data looks clean and is easy to read.

I should probably also say how clean and simple the solution looks in Haskell. The combination of pattern matching, a high level functional style with maps, filters and folds, and lazy lists is quite powerful in this specific problem domain. The program really amounts to just specifying how to build a list of all possible solutions then just popping the first one off the top.

I guess the final question is: have I missed a better way? I have investigated the Haskell regex libraries, and while they do allow new types to be matched again, they still display a strong bias towards text processing. Even the naming betrays this, since they sit in Text.Regex. The other alternative would be something like Parsec, but again, I don’t believe this is quite suitable. One key requirement for a regular expression engine is that it can match patterns that aren’t known until runtime, but Parsec is more suited towards matching grammars which are known at compile time. Which isn’t to say that Parsec is useless, since I’ve found it to be pretty amazing in its own domain. It’s just that it is more comparable tools like lexx and yacc than it is to a traditional regular expression.

Posted by chrisdb on 2010-07-30

Comments

Honestly that was really exciting!!! Those resources really helped me… So once again thank you so much…

Posted by Custom Essay Writing Service on 2011-09-28 07:27:19


pozycjonowanie

Posted by BicSairlfoolf on 2012-02-29 07:47:03


liceum wieczorowe warszawa program do fakturowania Ulotki Warszawa

Posted by creawayVecy on 2012-03-07 12:03:34


pozycjonowanie

Posted by SandatRooess on 2012-03-21 16:05:47


pozycjonowanie

Posted by Stremegborn on 2012-03-23 04:11:18


Generic Fluoxetine should help from trouble

Posted by buy zyban on 2012-05-05 17:16:01