Linting regular expressions in Pygments
09 Mar, 2012
Regular expressions are easy to get wrong.
And I help maintain Pygments, a Python library that's full of them: it uses them, along with a small state machine (more on that later), to tokenize source code in many languages.
Because they’re so easy to get wrong, I’m a bit surprised nobody wrote a linter before. Thus, I present regexlint.
When run against the Pygments tree, it will warn on suspicious constructs, such as [xx] (duplicate character in class) or (x|y||z) (duplicated alternation pipe), along with a lot of higher-level issues.
How many bugs has it found? Here’s the diffstat:
pygments/lexers/agile.py        | 26 ++++++++++-----------
pygments/lexers/compiled.py     | 12 +++++-----
pygments/lexers/dotnet.py       |  8 +++----
pygments/lexers/functional.py   |  9 +++++---
pygments/lexers/hdl.py          | 13 +++++------
pygments/lexers/jvm.py          | 16 ++++++-------
pygments/lexers/math.py         |  8 +++----
pygments/lexers/other.py        | 40 ++++++++++++++++----------------
pygments/lexers/parsers.py      |  2 +-
pygments/lexers/shell.py        |  6 ++---
pygments/lexers/sql.py          |  8 +++----
pygments/lexers/templates.py    | 33 ++++++++++++++-------------
pygments/lexers/text.py         | 41 +++++++++++++++++----------------
pygments/lexers/web.py          | 48 +++++++++++++++++++--------------------
tests/examplefiles/antlr_throws |  1 +
tests/examplefiles/function.mu  |  1 +