Tim Hatch


Linting regular expressions in Pygments

09 Mar, 2012

Regular expressions are easy to get wrong.

And I help maintain Pygments, a Python library that’s full of them. It uses them, along with a small state machine (more on that later), to tokenize source code in many languages.
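To make "regular expressions plus a small state machine" concrete, here is a minimal sketch of the idea using only the standard library. It is not Pygments' actual code: the RULES table, the state names, the token labels, and the tokenize function are all invented for illustration.

```python
import re

# Hypothetical rule table: state name -> list of (pattern, label, action).
# action is None (stay in this state), a state name (push it), or "#pop"
# (return to the previous state).
RULES = {
    "root": [
        (r"\s+", "Whitespace", None),
        (r"/\*", "Comment", "comment"),
        (r"[A-Za-z_]\w*", "Name", None),
    ],
    "comment": [
        (r"[^*]+", "Comment", None),
        (r"\*/", "Comment", "#pop"),
        (r"\*", "Comment", None),
    ],
}


def tokenize(text):
    """Split text into (label, value) pairs using the rules of the current state."""
    stack = ["root"]
    pos = 0
    tokens = []
    while pos < len(text):
        # Try each rule of the state on top of the stack, in order.
        for pattern, label, action in RULES[stack[-1]]:
            match = re.compile(pattern).match(text, pos)
            if match:
                tokens.append((label, match.group()))
                pos = match.end()
                if action == "#pop":
                    stack.pop()
                elif action is not None:
                    stack.append(action)
                break
        else:
            # No rule matched: emit one character as an error and move on.
            tokens.append(("Error", text[pos]))
            pos += 1
    return tokens


if __name__ == "__main__":
    for label, value in tokenize("foo /* a*b */ bar"):
        print(label, repr(value))
```

Every rule in a table like this is a hand-written regex, and a subtly wrong one still runs and just mis-tokenizes some inputs, which is why mistakes are easy to miss.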

Because they’re so easy to get wrong, I’m a bit surprised nobody has written a linter for them before. Thus, I present regexlint.

When run against the Pygments tree, it will warn on suspicious constructs, such as [xx] (duplicate character in class) or (x|y||z) (duplicated alternation pipe), along with a lot of higher-level issues.
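As a rough illustration of what those two checks involve, here is a small hand-rolled scanner. It is only a sketch under simplifying assumptions and not how regexlint itself is implemented: the lint_regex name and the warning messages are made up, and it ignores verbose-mode comments and treats escapes inside a class as opaque.

```python
def lint_regex(pattern):
    """Return (offset, message) warnings for two suspicious constructs."""
    warnings = []
    i = 0
    n = len(pattern)
    while i < n:
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip the escaped character, e.g. \| or \[
        elif ch == "[":
            i += 1
            if i < n and pattern[i] == "^":
                i += 1  # negation marker is not a member of the class
            seen = set()
            first = True  # a "]" in first position is a literal
            while i < n and (first or pattern[i] != "]"):
                first = False
                if pattern[i] == "\\":
                    i += 2  # escapes such as \d or \] are not compared
                    continue
                is_range = (
                    i + 2 < n
                    and pattern[i + 1] == "-"
                    and pattern[i + 2] not in "]\\"
                )
                if is_range:
                    i += 3  # ranges such as a-z are skipped by this sketch
                    continue
                if pattern[i] in seen:
                    warnings.append(
                        (i, "duplicate %r in character class" % pattern[i])
                    )
                seen.add(pattern[i])
                i += 1
            i += 1  # step over the closing "]"
        elif ch == "|" and i + 1 < n and pattern[i + 1] == "|":
            warnings.append((i + 1, "empty alternative (doubled '|')"))
            i += 1
        else:
            i += 1
    return warnings


if __name__ == "__main__":
    for regex in ["[xx]", "(x|y||z)", r"[a-zA-Z_]\w*"]:
        print(regex, lint_regex(regex))
```

Both constructs are valid regex syntax, so the regex engine never complains about them; only a separate pass like this one will flag them.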

How many bugs has it found? Here’s the diffstat:

 pygments/lexers/agile.py        |   26 ++++++++++-----------
 pygments/lexers/compiled.py     |   12 +++++-----
 pygments/lexers/dotnet.py       |    8 +++----
 pygments/lexers/functional.py   |    9 +++++---
 pygments/lexers/hdl.py          |   13 +++++------
 pygments/lexers/jvm.py          |   16 ++++++-------
 pygments/lexers/math.py         |    8 +++----
 pygments/lexers/other.py        |   40 ++++++++++++++++----------------
 pygments/lexers/parsers.py      |    2 +-
 pygments/lexers/shell.py        |    6 ++---
 pygments/lexers/sql.py          |    8 +++----
 pygments/lexers/templates.py    |   33 ++++++++++++++-------------
 pygments/lexers/text.py         |   41 +++++++++++++++++----------------
 pygments/lexers/web.py          |   48 +++++++++++++++++++--------------------
 tests/examplefiles/antlr_throws |    1 +
 tests/examplefiles/function.mu  |    1 +