Go ahead, browse the archives.
Regular expressions are easy to get wrong.
And I help maintain a Python library that's full of them — Pygments — which uses them to tokenize many languages’ source code (along with a small state machine, more on that later).
Because they’re so easy to get wrong, I’m a bit surprised nobody wrote a linter before. Thus, I present
regexlint.
When run against the Pygments tree, it will warn on suspicious constructs, such as [xx] (duplicate character in class) or (x|y||z) (duplicated alternation pipe), along with a lot of higher-level issues.
How many bugs has it found? Here’s the diffstat:
pygments/lexers/agile.py | 26 ++++++++++-----------
pygments/lexers/compiled.py | 12 +++++-----
pygments/lexers/dotnet.py | 8 +++----
pygments/lexers/functional.py | 9 +++++---
pygments/lexers/hdl.py | 13 +++++------
pygments/lexers/jvm.py | 16 ++++++-------
pygments/lexers/math.py | 8 +++----
pygments/lexers/other.py | 40 ++++++++++++++++----------------
pygments/lexers/parsers.py | 2 +-
pygments/lexers/shell.py | 6 ++---
pygments/lexers/sql.py | 8 +++----
pygments/lexers/templates.py | 33 ++++++++++++++-------------
pygments/lexers/text.py | 41 +++++++++++++++++----------------
pygments/lexers/web.py | 48 +++++++++++++++++++--------------------
tests/examplefiles/antlr_throws | 1 +
tests/examplefiles/function.mu | 1 +
My best friend Loriana decided to join me in California in 2009, and
we’re excited to be married this October. I’ve put her through
two moves in as many years, but we’re settling in to the bay area now.
If you’re nearby, my email and phone numbers are on the About page.
She doesn’t have a website right now, but we’re working on that.

I ran into a funny issue in the Chrome dev builds this week. There’s a bug that has (as of this writing) 12 duplicates filed against it about how the xkcd images don't render right. Dupes keep getting reported because it’s closed, but not available in Dev channel yet and the default search filter is for open issues only.
The code for lcs_sse is now available, taking two uint32t[4] pointers. Make sure you use the correct -march=core2, -march=native, or -march=opteron -msse3 to get the fastest byteswapping.
Available at lcs_sse.c (tested on Core 2 OS X with GCC 4.2.1, Core 2 Arch Linux with GCC 4.5 and ICC 11.1, and an older distro using sse2 fallback on an Opteron 250. No runtime detection occurs, it's entirely compile-time.)
This post is a followup to md5game in 2009, and makes more
sense if you read that first.
After doing a bit more research using Intel manuals and some excellent
resources by Agner Fog. Because
in 2009 we didn’t get to do any intense benchmarking (only overall speed),
the first thing was to figure out whether the hotspots were where we thought
them were. I’d assumed that the best spot for optimization was in the lcs
code. This was only partially true. The entire program is basically three
components, and the actual time spent on each is about:
- 450 clock cycles computing an md5
- 400 clock cycles in
lcs_32_rough (28000 false positives per 8M)
- 4000 clock cycles if
lcs_32_rough matched
So there were actually four places I could optimize (those three, plus
decrease the false positives). Given that, I picked an old idea I had for
lcs_32_rough, that had never worked competitively in gcc as my first
assembly project, using the following rough algorithm:
Treat the resulting md5 string as binary, avoiding the hex conversion, so it
cleanly fits in an xmm register — or, on 64-bit platforms that don’t have
sse2, as two 64-bit registers. Take the initial position of both, and check
for matching bytes (a hand-wave at the time, but it turns out there are
instructions for doing just that — PCMPEQB for example, or other tricks you
can read about in Hacker’s Delight). Then (another hand-wave) check to see
if three bytes consecutuvely match. So you end up with something like
top:
xmm0 = abcdefedcba
xmm1 = ffCDEFfffff
t = 00111100000
if count(t) > 8 return true;
; shift one over and check again, there are 13 positions shifting either
; right or left
xmm0 <<= 4
j top
Overall I spent about 5 nights reading up on sse instructions, two nights
coding up the initial comparison routine, and another two (sleepless) nights
trying to find the secret to doing the consecutive-match part of it. Turns
out it's really easy, something akin to a & (a << 1) & (a << 2) in
c. End result, about 3350K/s, a good chunk of which is due to reducing the
false positives from 28000 per group of 8M to about 175 per 8M).
Having finished this, my advice to anyone attempting to do this is:
- Read about GCC's extended ASM. Then read it again. Make a bookmark, it's
very arcane syntax that you'll need to get around in order to make sure
that your code doesn't get optimized away ("r"(ptr) might be what
you need to load your string into a register, but "r"(*ptr) is
necessary so GCC will actually put the data there!)
- Download all of Agner Fog's pdfs, particularly instruction_tables.pdf,
print them out and write the column headers in yourself (your scrolling
hand will thank you)
- Benchmark, benchmark, benchmark. Use RDTSC on recent chips, or raw
gettimeofday(2) if you have to. But above all, know where your hotspots are,
and don't waste time on the parts the compiler is already making fast for
you (for example, I had a complex shift operation that I rewrote in
assembly, only to find it slower than the original -- turns out that GCC
knew about LEA which lets you do simple multiplications by constants
quickly, and I didn't).
- Share your results quickly with others (it forces you to write good code),
try your code on 32-bit platforms (it forces you to write good code, that
doesn't waste registers), and be sure to build in sanity checks that you
can enable (it forces you to write correct code). In my case, sanity
checks are a
PEDANTIC define that causes it to run both the rough and full
lcs, and complain if/when they differ in a way that matters.
- When you're using a network protocol, keep in mind both the latency to
make the connection, and the extra overhead for making a database
connection. Although shared hosting kept up with the database quite well
(2.4GB and counting, including indices), it couldn't keep up with hundreds
of incoming http requests. Just sending 10 results at once (reducing it
to tens) helped immensely.
So far, I've spent many hours on this impossible project, and I'm no closer
to finding the answer. I do have the 17.3 ratio, which is very interesting
(and to my knowledge, not noticed by anyone else yet). I also have a single length
14 match, one of two known in the universe. Other than that, I just have a marginally higher electric bill to
show for it. So make sure you're learning as you go along.
P.S. Assuming the 17.3 ratio keeps going for higher match lengths, and
assuming my 1 length 8 match per 8M, which was just off the top of my head,
that means a total of:
>>> 16**32 / 8000000 # length 8
42535295865117307932921825928971L
>>> 16**32 / 8000000 / (17.3**3) # length 11
8.2150677345087227e+27
>>> 16**32 / 8000000 / (17.3**24) # length 32
82.346090058467112
That gives an individual candidate a chance of about 2.4e-37 of being a full
32-character match. With 16 cores running the latest code, that’s still
2.47e21 years searching for each one.

The graph above is the number of matches that my code found for an obscure
programming project called md5game, per day since I started writing it.
Let me start off with a little background. I’m a computer programmer by
trade, and keep my more esoteric skills sharp by doing programming
challenges (like Project Euler).
They let me play with some features that I wouldn’t use otherwise at $JOB,
say red-black trees or finding the millionth prime number.
About a year ago, May 2009, someone posed a question on reddit, asking if
there’s essentially a fixed point in md5, where md5(x) == x, so
you could run a string through the algorithm and get back the same string.
Since rough estimates confirm that as taking well beyond the age of the
universe to determine using current computers, a
lesser challenge was posted — find the longest common
substring with more specific declaration of the rules (lowercase, input
is a string, output considered hex).
This problem is interesting to me for a few reasons. One, I recruited some
collaborators, and I like writing code when there’s at least one user
besides me.
Two, it’s an infinite problem, which means that it’s easy to distribute
without complicated workunit tracking (like SETI@Home for example).
And three, I keep finding ways to make it faster. This post is part 1,
covering optimiations that Eli and I
found in 2009.
The first version I wrote in python, and went something like this:
while 1:
src = ''.join([random.choice(hex) for i in range(32)])
dst = md5.new(src).hexdigest()
if lcs_32(src, dst) > 8:
print src, dst
This could evaluate about 2.5K candidates per second, and didn’t really give
a lot of results. I found out as the project progressed, that there’s only
a match of length 8 or greater about once per 8M candidates. I wasn’t the
only one writing code like this — slow, and written in an interpreted
language, and clearly I could do better.
So off to C. The first version that worked well was mostly contributed by
Eli, and ran at 247K/s. That’s almost exactly two orders of magnitude and I
was happy. We ran that for a while starting on May 7, 2009.
May 9, I wrote my own leaderboard to keep track of all the results coming
in, and also started ignoring all matches of less than 8. Soon after it
became clear that there’s about a 17:1 ratio between length 8 and 9. It
turns out that the ratio is roughly constant around 17.5:1 for 8, 9, 10,
and 11.
A few patches poured in — I included md5.c from BSD instead of linking with
openssl which gave a boost, and Eli made several incremental improvements to
the table-driven dynamic programming lcs function I’d originally written,
which took the speed up to 336K/s on May 12. The single biggest change from
this point on would be the move to using a first rough guess at whether the
lcs would be long enough to care about. This lcs_32_rough took the speed up
to 1090K/s.
Further tweaks to lcs, and making everything global and static along with
compiling using -fwhole-program and using a 64-bit OS got us up to
1965K/s on May 25, 2009. The source itself didn’t get any further changes
in 2009, because shortly after that my apartment was burgled, so I no longer
had a G4 machine to test big-endian code on (writing endian-portable code is
tough without a big-endian machine around). The longest match I found was a length 14, which ties the top on the official leaderboard.
After this, I wrote a script to exhaustively try compiler options, which
were pretty similar to the ones we’d picked out by hand: inlining everything
is fast, because of fewer jumps, unrolling loops also removes jumps, and no
amount of compiler options will help gcc auto-vectorize code with strange
loop counts and dependencies between iterations. That’s basically where it
stayed in terms of performance, about 2800K/s on a Core 2 Q6600 (65nm) per
core which used mostly integer operations.
The next post is about picking it up again in August 2010, and applying new
knowledge in x86 SIMD operations to speed up lcs_32_rough in
assembly.
Note to self: the correct linux kernel module for my TX3 IDE controller is not sata_promise even though it says TX2/TX4 in the name and the devices show up as /dev/sd*... it's pata_pdc2027x now.
My printer at home is a Brother MFC-7440N, which has difficult-to-use Linux support. I finally got everything working under Linux, and figured it’s worth a quick walkthrough. I use Arch Linux for which there were no proper packages available.
- Grab brother-mfc7440n and brscan3, and build/install them. You will need various dependencies, all of which should be in the main repos.
- If you’re using it networked, great! The commands you need to run to set up the printer (in cups) and the scanner (in sane) will be displayed on install.
- Install scanadf, which is part of the
sane-frontends package, along with imagemagick.
- Make a handy little script like this:
#!/bin/bash
T=`mktemp -d`
scanadf --resolution 200 $T/page%d.pnm
convert $T/page*.pnm ~/Documents/`date '+%Y-%m-%d %H:%I:%S'`.pdf
rm -r $T
I spent half of yesterday tracking down a particularly pesky bug that’s existed since before I started working here. There was a big workaround that I added a year ago that fixed the symptom because we didn’t see root issue, and that workaround broke yesterday. In the end it was a one-line fix, and the bug was caused by f calling g(a1, a2, a3, a4) which then did a4.append(...) and f had a local variable that referenced a4[-1]. It felt a little bit like the proverbial piano tuner (f is 200 lines and g is 170).
Fix totals section (and others) on multi-customer receipts
CHANGELOG | 2 +-
src/parser.py | 24 ++----------------------
src/rparse.py | 1 +
That sort of diffstat makes me happy.
As promised yesterday, I’m experimenting with a new kind of ball-chain pulley. It’s based around the center hub of thermal receipt paper rolls — a throwaway part that I have an excess of at work (and assume is standard world-wide). It has 12 sections, made out of a thin plastic, and and 4.4mm chain fits well if you make notches for the balls to sink into (see photos). I also got some smaller chain to fit, but it makes much weaker contact points on the end grain.
The plastic comes in black, white, and neutral but I don’t yet know what it’s made out of. It’s 79.5mm long, with an outer diameter of 21.6mm and inner diameter of 11.75mm. It’s easily cut with hacksaw and shaved with a hobby knife, and drills somewhat self-center (they avoid the section barriers).

I’ve tried deforming (making it oblong) by hand, and it springs back; if you put a chair leg on it, it can get permanently out-of-round though. To test out whether the pitch worked out right, I made this jig with hand-drilled holes. The chain slips off occasionally so I’d need to make a hole-drilling jig to get them perfectly aligned along the length before an actual test. I’m still looking for a good way to mount a 5mm-or-so motor shaft on it as well. The wing-nut with some thread locker works well enough on threaded rod.
