Tim Hatch

Weblog | Photos | Projects | Panoramas | About

Go ahead, browse the archives.

Linting regular expressions in Pygments 09 Mar, 2012

Regular expressions are easy to get wrong.

And I help maintain a Python library that's full of them — Pygments — which uses them to tokenize many languages’ source code (along with a small state machine, more on that later).

Because they’re so easy to get wrong, I’m a bit surprised nobody wrote a linter before. Thus, I present regexlint.

When run against the Pygments tree, it will warn on suspicious constructs, such as [xx] (duplicate character in class) or (x|y||z) (duplicated alternation pipe), along with a lot of higher-level issues.

How many bugs has it found? Here’s the diffstat:

 pygments/lexers/agile.py        |   26 ++++++++++-----------
 pygments/lexers/compiled.py     |   12 +++++-----
 pygments/lexers/dotnet.py       |    8 +++----
 pygments/lexers/functional.py   |    9 +++++---
 pygments/lexers/hdl.py          |   13 +++++------
 pygments/lexers/jvm.py          |   16 ++++++-------
 pygments/lexers/math.py         |    8 +++----
 pygments/lexers/other.py        |   40 ++++++++++++++++----------------
 pygments/lexers/parsers.py      |    2 +-
 pygments/lexers/shell.py        |    6 ++---
 pygments/lexers/sql.py          |    8 +++----
 pygments/lexers/templates.py    |   33 ++++++++++++++-------------
 pygments/lexers/text.py         |   41 +++++++++++++++++----------------
 pygments/lexers/web.py          |   48 +++++++++++++++++++--------------------
 tests/examplefiles/antlr_throws |    1 +
 tests/examplefiles/function.mu  |    1 +

Belated announcements 23 May, 2011

My best friend Loriana decided to join me in California in 2009, and we’re excited to be married this October. I’ve put her through two moves in as many years, but we’re settling in to the bay area now. If you’re nearby, my email and phone numbers are on the About page.

She doesn’t have a website right now, but we’re working on that.

Chromium Bug 12 Nov, 2010

I ran into a funny issue in the Chrome dev builds this week. There’s a bug that has (as of this writing) 12 duplicates filed against it about how the xkcd images don't render right. Dupes keep getting reported because it’s closed, but not available in Dev channel yet and the default search filter is for open issues only.

lcs_32 in sse3 22 Sep, 2010

The code for lcs_sse is now available, taking two uint32t[4] pointers. Make sure you use the correct -march=core2, -march=native, or -march=opteron -msse3 to get the fastest byteswapping.

Available at lcs_sse.c (tested on Core 2 OS X with GCC 4.2.1, Core 2 Arch Linux with GCC 4.5 and ICC 11.1, and an older distro using sse2 fallback on an Opteron 250. No runtime detection occurs, it's entirely compile-time.)

MD5game in 2010 22 Sep, 2010

This post is a followup to md5game in 2009, and makes more sense if you read that first.

After doing a bit more research using Intel manuals and some excellent resources by Agner Fog. Because in 2009 we didn’t get to do any intense benchmarking (only overall speed), the first thing was to figure out whether the hotspots were where we thought them were. I’d assumed that the best spot for optimization was in the lcs code. This was only partially true. The entire program is basically three components, and the actual time spent on each is about:

  • 450 clock cycles computing an md5
  • 400 clock cycles in lcs_32_rough (28000 false positives per 8M)
  • 4000 clock cycles if lcs_32_rough matched

So there were actually four places I could optimize (those three, plus decrease the false positives). Given that, I picked an old idea I had for lcs_32_rough, that had never worked competitively in gcc as my first assembly project, using the following rough algorithm:

Treat the resulting md5 string as binary, avoiding the hex conversion, so it cleanly fits in an xmm register — or, on 64-bit platforms that don’t have sse2, as two 64-bit registers. Take the initial position of both, and check for matching bytes (a hand-wave at the time, but it turns out there are instructions for doing just that — PCMPEQB for example, or other tricks you can read about in Hacker’s Delight). Then (another hand-wave) check to see if three bytes consecutuvely match. So you end up with something like
top:
xmm0 = abcdefedcba
xmm1 = ffCDEFfffff
t    = 00111100000
if count(t) > 8 return true;

; shift one over and check again, there are 13 positions shifting either
; right or left
xmm0 <<= 4
j top

Overall I spent about 5 nights reading up on sse instructions, two nights coding up the initial comparison routine, and another two (sleepless) nights trying to find the secret to doing the consecutive-match part of it. Turns out it's really easy, something akin to a & (a << 1) & (a << 2) in c. End result, about 3350K/s, a good chunk of which is due to reducing the false positives from 28000 per group of 8M to about 175 per 8M).

Having finished this, my advice to anyone attempting to do this is:

  • Read about GCC's extended ASM. Then read it again. Make a bookmark, it's very arcane syntax that you'll need to get around in order to make sure that your code doesn't get optimized away ("r"(ptr) might be what you need to load your string into a register, but "r"(*ptr) is necessary so GCC will actually put the data there!)
  • Download all of Agner Fog's pdfs, particularly instruction_tables.pdf, print them out and write the column headers in yourself (your scrolling hand will thank you)
  • Benchmark, benchmark, benchmark. Use RDTSC on recent chips, or raw gettimeofday(2) if you have to. But above all, know where your hotspots are, and don't waste time on the parts the compiler is already making fast for you (for example, I had a complex shift operation that I rewrote in assembly, only to find it slower than the original -- turns out that GCC knew about LEA which lets you do simple multiplications by constants quickly, and I didn't).
  • Share your results quickly with others (it forces you to write good code), try your code on 32-bit platforms (it forces you to write good code, that doesn't waste registers), and be sure to build in sanity checks that you can enable (it forces you to write correct code). In my case, sanity checks are a PEDANTIC define that causes it to run both the rough and full lcs, and complain if/when they differ in a way that matters.
  • When you're using a network protocol, keep in mind both the latency to make the connection, and the extra overhead for making a database connection. Although shared hosting kept up with the database quite well (2.4GB and counting, including indices), it couldn't keep up with hundreds of incoming http requests. Just sending 10 results at once (reducing it to tens) helped immensely.

So far, I've spent many hours on this impossible project, and I'm no closer to finding the answer. I do have the 17.3 ratio, which is very interesting (and to my knowledge, not noticed by anyone else yet). I also have a single length 14 match, one of two known in the universe. Other than that, I just have a marginally higher electric bill to show for it. So make sure you're learning as you go along.

P.S. Assuming the 17.3 ratio keeps going for higher match lengths, and assuming my 1 length 8 match per 8M, which was just off the top of my head, that means a total of:

>>> 16**32 / 8000000              # length 8
42535295865117307932921825928971L
>>> 16**32 / 8000000 / (17.3**3)  # length 11
8.2150677345087227e+27
>>> 16**32 / 8000000 / (17.3**24) # length 32
82.346090058467112

That gives an individual candidate a chance of about 2.4e-37 of being a full 32-character match. With 16 cores running the latest code, that’s still 2.47e21 years searching for each one.

MD5game in 2009 21 Sep, 2010

Graph of results

The graph above is the number of matches that my code found for an obscure programming project called md5game, per day since I started writing it.

Let me start off with a little background. I’m a computer programmer by trade, and keep my more esoteric skills sharp by doing programming challenges (like Project Euler). They let me play with some features that I wouldn’t use otherwise at $JOB, say red-black trees or finding the millionth prime number.

About a year ago, May 2009, someone posed a question on reddit, asking if there’s essentially a fixed point in md5, where md5(x) == x, so you could run a string through the algorithm and get back the same string. Since rough estimates confirm that as taking well beyond the age of the universe to determine using current computers, a lesser challenge was posted — find the longest common substring with more specific declaration of the rules (lowercase, input is a string, output considered hex).

This problem is interesting to me for a few reasons. One, I recruited some collaborators, and I like writing code when there’s at least one user besides me. Two, it’s an infinite problem, which means that it’s easy to distribute without complicated workunit tracking (like SETI@Home for example). And three, I keep finding ways to make it faster. This post is part 1, covering optimiations that Eli and I found in 2009.

The first version I wrote in python, and went something like this:

while 1:
    src = ''.join([random.choice(hex) for i in range(32)])
    dst = md5.new(src).hexdigest()
    if lcs_32(src, dst) > 8:
        print src, dst

This could evaluate about 2.5K candidates per second, and didn’t really give a lot of results. I found out as the project progressed, that there’s only a match of length 8 or greater about once per 8M candidates. I wasn’t the only one writing code like this — slow, and written in an interpreted language, and clearly I could do better.

So off to C. The first version that worked well was mostly contributed by Eli, and ran at 247K/s. That’s almost exactly two orders of magnitude and I was happy. We ran that for a while starting on May 7, 2009.

May 9, I wrote my own leaderboard to keep track of all the results coming in, and also started ignoring all matches of less than 8. Soon after it became clear that there’s about a 17:1 ratio between length 8 and 9. It turns out that the ratio is roughly constant around 17.5:1 for 8, 9, 10, and 11.

A few patches poured in — I included md5.c from BSD instead of linking with openssl which gave a boost, and Eli made several incremental improvements to the table-driven dynamic programming lcs function I’d originally written, which took the speed up to 336K/s on May 12. The single biggest change from this point on would be the move to using a first rough guess at whether the lcs would be long enough to care about. This lcs_32_rough took the speed up to 1090K/s.

Further tweaks to lcs, and making everything global and static along with compiling using -fwhole-program and using a 64-bit OS got us up to 1965K/s on May 25, 2009. The source itself didn’t get any further changes in 2009, because shortly after that my apartment was burgled, so I no longer had a G4 machine to test big-endian code on (writing endian-portable code is tough without a big-endian machine around). The longest match I found was a length 14, which ties the top on the official leaderboard.

After this, I wrote a script to exhaustively try compiler options, which were pretty similar to the ones we’d picked out by hand: inlining everything is fast, because of fewer jumps, unrolling loops also removes jumps, and no amount of compiler options will help gcc auto-vectorize code with strange loop counts and dependencies between iterations. That’s basically where it stayed in terms of performance, about 2800K/s on a Core 2 Q6600 (65nm) per core which used mostly integer operations.

The next post is about picking it up again in August 2010, and applying new knowledge in x86 SIMD operations to speed up lcs_32_rough in assembly.

Promise TX2/3 IDE Controller 21 Sep, 2010

Note to self: the correct linux kernel module for my TX3 IDE controller is not sata_promise even though it says TX2/TX4 in the name and the devices show up as /dev/sd*... it's pata_pdc2027x now.

Brother MFC-7440N on Arch Linux 16 Aug, 2010

My printer at home is a Brother MFC-7440N, which has difficult-to-use Linux support. I finally got everything working under Linux, and figured it’s worth a quick walkthrough. I use Arch Linux for which there were no proper packages available.

  1. Grab brother-mfc7440n and brscan3, and build/install them. You will need various dependencies, all of which should be in the main repos.
  2. If you’re using it networked, great! The commands you need to run to set up the printer (in cups) and the scanner (in sane) will be displayed on install.
  3. Install scanadf, which is part of the sane-frontends package, along with imagemagick.
  4. Make a handy little script like this:
  5. #!/bin/bash
    T=`mktemp -d`
    scanadf --resolution 200 $T/page%d.pnm
    convert $T/page*.pnm ~/Documents/`date '+%Y-%m-%d %H:%I:%S'`.pdf
    rm -r $T

Squashing a bug 17 Jul, 2010

I spent half of yesterday tracking down a particularly pesky bug that’s existed since before I started working here. There was a big workaround that I added a year ago that fixed the symptom because we didn’t see root issue, and that workaround broke yesterday. In the end it was a one-line fix, and the bug was caused by f calling g(a1, a2, a3, a4) which then did a4.append(...) and f had a local variable that referenced a4[-1]. It felt a little bit like the proverbial piano tuner (f is 200 lines and g is 170).

Fix totals section (and others) on multi-customer receipts

 CHANGELOG         |    2 +-
 src/parser.py     |   24 ++----------------------
 src/rparse.py     |    1 +

That sort of diffstat makes me happy.

A potential ball chain pulley 17 Jul, 2010

As promised yesterday, I’m experimenting with a new kind of ball-chain pulley. It’s based around the center hub of thermal receipt paper rolls — a throwaway part that I have an excess of at work (and assume is standard world-wide). It has 12 sections, made out of a thin plastic, and and 4.4mm chain fits well if you make notches for the balls to sink into (see photos). I also got some smaller chain to fit, but it makes much weaker contact points on the end grain.

The plastic comes in black, white, and neutral but I don’t yet know what it’s made out of. It’s 79.5mm long, with an outer diameter of 21.6mm and inner diameter of 11.75mm. It’s easily cut with hacksaw and shaved with a hobby knife, and drills somewhat self-center (they avoid the section barriers).

End view of the spool

I’ve tried deforming (making it oblong) by hand, and it springs back; if you put a chair leg on it, it can get permanently out-of-round though. To test out whether the pitch worked out right, I made this jig with hand-drilled holes. The chain slips off occasionally so I’d need to make a hole-drilling jig to get them perfectly aligned along the length before an actual test. I’m still looking for a good way to mount a 5mm-or-so motor shaft on it as well. The wing-nut with some thread locker works well enough on threaded rod.

Chain wrapped around drilled holes