Tim Hatch

Weblog | Photos | Projects | Panoramas | About

Go ahead, browse the archives.

Belated announcements 23 May, 2011

My best friend Loriana decided to join me in California in 2009, and we’re excited to be married this October. I’ve put her through two moves in as many years, but we’re settling in to the bay area now. If you’re nearby, my email and phone numbers are on the About page.

She doesn’t have a website right now, but we’re working on that.

Chromium Bug 12 Nov, 2010

I ran into a funny issue in the Chrome dev builds this week. There’s a bug that has (as of this writing) 12 duplicates filed against it about how the xkcd images don't render right. Dupes keep getting reported because it’s closed, but not available in Dev channel yet and the default search filter is for open issues only.

lcs_32 in sse3 22 Sep, 2010

The code for lcs_sse is now available, taking two uint32t[4] pointers. Make sure you use the correct -march=core2, -march=native, or -march=opteron -msse3 to get the fastest byteswapping.

Available at lcs_sse.c (tested on Core 2 OS X with GCC 4.2.1, Core 2 Arch Linux with GCC 4.5 and ICC 11.1, and an older distro using sse2 fallback on an Opteron 250. No runtime detection occurs, it's entirely compile-time.)

MD5game in 2010 22 Sep, 2010

This post is a followup to md5game in 2009, and makes more sense if you read that first.

After doing a bit more research using Intel manuals and some excellent resources by Agner Fog. Because in 2009 we didn’t get to do any intense benchmarking (only overall speed), the first thing was to figure out whether the hotspots were where we thought them were. I’d assumed that the best spot for optimization was in the lcs code. This was only partially true. The entire program is basically three components, and the actual time spent on each is about:

  • 450 clock cycles computing an md5
  • 400 clock cycles in lcs_32_rough (28000 false positives per 8M)
  • 4000 clock cycles if lcs_32_rough matched

So there were actually four places I could optimize (those three, plus decrease the false positives). Given that, I picked an old idea I had for lcs_32_rough, that had never worked competitively in gcc as my first assembly project, using the following rough algorithm:

Treat the resulting md5 string as binary, avoiding the hex conversion, so it cleanly fits in an xmm register — or, on 64-bit platforms that don’t have sse2, as two 64-bit registers. Take the initial position of both, and check for matching bytes (a hand-wave at the time, but it turns out there are instructions for doing just that — PCMPEQB for example, or other tricks you can read about in Hacker’s Delight). Then (another hand-wave) check to see if three bytes consecutuvely match. So you end up with something like
top:
xmm0 = abcdefedcba
xmm1 = ffCDEFfffff
t    = 00111100000
if count(t) > 8 return true;

; shift one over and check again, there are 13 positions shifting either
; right or left
xmm0 <<= 4
j top

Overall I spent about 5 nights reading up on sse instructions, two nights coding up the initial comparison routine, and another two (sleepless) nights trying to find the secret to doing the consecutive-match part of it. Turns out it's really easy, something akin to a & (a << 1) & (a << 2) in c. End result, about 3350K/s, a good chunk of which is due to reducing the false positives from 28000 per group of 8M to about 175 per 8M).

Having finished this, my advice to anyone attempting to do this is:

  • Read about GCC's extended ASM. Then read it again. Make a bookmark, it's very arcane syntax that you'll need to get around in order to make sure that your code doesn't get optimized away ("r"(ptr) might be what you need to load your string into a register, but "r"(*ptr) is necessary so GCC will actually put the data there!)
  • Download all of Agner Fog's pdfs, particularly instruction_tables.pdf, print them out and write the column headers in yourself (your scrolling hand will thank you)
  • Benchmark, benchmark, benchmark. Use RDTSC on recent chips, or raw gettimeofday(2) if you have to. But above all, know where your hotspots are, and don't waste time on the parts the compiler is already making fast for you (for example, I had a complex shift operation that I rewrote in assembly, only to find it slower than the original -- turns out that GCC knew about LEA which lets you do simple multiplications by constants quickly, and I didn't).
  • Share your results quickly with others (it forces you to write good code), try your code on 32-bit platforms (it forces you to write good code, that doesn't waste registers), and be sure to build in sanity checks that you can enable (it forces you to write correct code). In my case, sanity checks are a PEDANTIC define that causes it to run both the rough and full lcs, and complain if/when they differ in a way that matters.
  • When you're using a network protocol, keep in mind both the latency to make the connection, and the extra overhead for making a database connection. Although shared hosting kept up with the database quite well (2.4GB and counting, including indices), it couldn't keep up with hundreds of incoming http requests. Just sending 10 results at once (reducing it to tens) helped immensely.

So far, I've spent many hours on this impossible project, and I'm no closer to finding the answer. I do have the 17.3 ratio, which is very interesting (and to my knowledge, not noticed by anyone else yet). I also have a single length 14 match, one of two known in the universe. Other than that, I just have a marginally higher electric bill to show for it. So make sure you're learning as you go along.

P.S. Assuming the 17.3 ratio keeps going for higher match lengths, and assuming my 1 length 8 match per 8M, which was just off the top of my head, that means a total of:

>>> 16**32 / 8000000              # length 8
42535295865117307932921825928971L
>>> 16**32 / 8000000 / (17.3**3)  # length 11
8.2150677345087227e+27
>>> 16**32 / 8000000 / (17.3**24) # length 32
82.346090058467112

That gives an individual candidate a chance of about 2.4e-37 of being a full 32-character match. With 16 cores running the latest code, that’s still 2.47e21 years searching for each one.

MD5game in 2009 21 Sep, 2010

Graph of results

The graph above is the number of matches that my code found for an obscure programming project called md5game, per day since I started writing it.

Let me start off with a little background. I’m a computer programmer by trade, and keep my more esoteric skills sharp by doing programming challenges (like Project Euler). They let me play with some features that I wouldn’t use otherwise at $JOB, say red-black trees or finding the millionth prime number.

About a year ago, May 2009, someone posed a question on reddit, asking if there’s essentially a fixed point in md5, where md5(x) == x, so you could run a string through the algorithm and get back the same string. Since rough estimates confirm that as taking well beyond the age of the universe to determine using current computers, a lesser challenge was posted — find the longest common substring with more specific declaration of the rules (lowercase, input is a string, output considered hex).

This problem is interesting to me for a few reasons. One, I recruited some collaborators, and I like writing code when there’s at least one user besides me. Two, it’s an infinite problem, which means that it’s easy to distribute without complicated workunit tracking (like SETI@Home for example). And three, I keep finding ways to make it faster. This post is part 1, covering optimiations that Eli and I found in 2009.

The first version I wrote in python, and went something like this:

while 1:
    src = ''.join([random.choice(hex) for i in range(32)])
    dst = md5.new(src).hexdigest()
    if lcs_32(src, dst) > 8:
        print src, dst

This could evaluate about 2.5K candidates per second, and didn’t really give a lot of results. I found out as the project progressed, that there’s only a match of length 8 or greater about once per 8M candidates. I wasn’t the only one writing code like this — slow, and written in an interpreted language, and clearly I could do better.

So off to C. The first version that worked well was mostly contributed by Eli, and ran at 247K/s. That’s almost exactly two orders of magnitude and I was happy. We ran that for a while starting on May 7, 2009.

May 9, I wrote my own leaderboard to keep track of all the results coming in, and also started ignoring all matches of less than 8. Soon after it became clear that there’s about a 17:1 ratio between length 8 and 9. It turns out that the ratio is roughly constant around 17.5:1 for 8, 9, 10, and 11.

A few patches poured in — I included md5.c from BSD instead of linking with openssl which gave a boost, and Eli made several incremental improvements to the table-driven dynamic programming lcs function I’d originally written, which took the speed up to 336K/s on May 12. The single biggest change from this point on would be the move to using a first rough guess at whether the lcs would be long enough to care about. This lcs_32_rough took the speed up to 1090K/s.

Further tweaks to lcs, and making everything global and static along with compiling using -fwhole-program and using a 64-bit OS got us up to 1965K/s on May 25, 2009. The source itself didn’t get any further changes in 2009, because shortly after that my apartment was burgled, so I no longer had a G4 machine to test big-endian code on (writing endian-portable code is tough without a big-endian machine around). The longest match I found was a length 14, which ties the top on the official leaderboard.

After this, I wrote a script to exhaustively try compiler options, which were pretty similar to the ones we’d picked out by hand: inlining everything is fast, because of fewer jumps, unrolling loops also removes jumps, and no amount of compiler options will help gcc auto-vectorize code with strange loop counts and dependencies between iterations. That’s basically where it stayed in terms of performance, about 2800K/s on a Core 2 Q6600 (65nm) per core which used mostly integer operations.

The next post is about picking it up again in August 2010, and applying new knowledge in x86 SIMD operations to speed up lcs_32_rough in assembly.

Promise TX2/3 IDE Controller 21 Sep, 2010

Note to self: the correct linux kernel module for my TX3 IDE controller is not sata_promise even though it says TX2/TX4 in the name and the devices show up as /dev/sd*... it's pata_pdc2027x now.

Brother MFC-7440N on Arch Linux 16 Aug, 2010

My printer at home is a Brother MFC-7440N, which has difficult-to-use Linux support. I finally got everything working under Linux, and figured it’s worth a quick walkthrough. I use Arch Linux for which there were no proper packages available.

  1. Grab brother-mfc7440n and brscan3, and build/install them. You will need various dependencies, all of which should be in the main repos.
  2. If you’re using it networked, great! The commands you need to run to set up the printer (in cups) and the scanner (in sane) will be displayed on install.
  3. Install scanadf, which is part of the sane-frontends package, along with imagemagick.
  4. Make a handy little script like this:
  5. #!/bin/bash
    T=`mktemp -d`
    scanadf --resolution 200 $T/page%d.pnm
    convert $T/page*.pnm ~/Documents/`date '+%Y-%m-%d %H:%I:%S'`.pdf
    rm -r $T

Squashing a bug 17 Jul, 2010

I spent half of yesterday tracking down a particularly pesky bug that’s existed since before I started working here. There was a big workaround that I added a year ago that fixed the symptom because we didn’t see root issue, and that workaround broke yesterday. In the end it was a one-line fix, and the bug was caused by f calling g(a1, a2, a3, a4) which then did a4.append(...) and f had a local variable that referenced a4[-1]. It felt a little bit like the proverbial piano tuner (f is 200 lines and g is 170).

Fix totals section (and others) on multi-customer receipts

 CHANGELOG         |    2 +-
 src/parser.py     |   24 ++----------------------
 src/rparse.py     |    1 +

That sort of diffstat makes me happy.

A potential ball chain pulley 17 Jul, 2010

As promised yesterday, I’m experimenting with a new kind of ball-chain pulley. It’s based around the center hub of thermal receipt paper rolls — a throwaway part that I have an excess of at work (and assume is standard world-wide). It has 12 sections, made out of a thin plastic, and and 4.4mm chain fits well if you make notches for the balls to sink into (see photos). I also got some smaller chain to fit, but it makes much weaker contact points on the end grain.

The plastic comes in black, white, and neutral but I don’t yet know what it’s made out of. It’s 79.5mm long, with an outer diameter of 21.6mm and inner diameter of 11.75mm. It’s easily cut with hacksaw and shaved with a hobby knife, and drills somewhat self-center (they avoid the section barriers).

End view of the spool

I’ve tried deforming (making it oblong) by hand, and it springs back; if you put a chair leg on it, it can get permanently out-of-round though. To test out whether the pitch worked out right, I made this jig with hand-drilled holes. The chain slips off occasionally so I’d need to make a hole-drilling jig to get them perfectly aligned along the length before an actual test. I’m still looking for a good way to mount a 5mm-or-so motor shaft on it as well. The wing-nut with some thread locker works well enough on threaded rod.

Chain wrapped around drilled holes

Cartesian Robot Progress 16 Jul, 2010

I’ve started once again to work on my goal of building a RepRap which started almost three years ago when I was still in Texas. When I was there, I built half a McWire bot with Jesse’s help — the frame, and X/Y axes were finished when I moved and Palmer contributed parts from a BfB extruder.

McWire with Z axis built

Last weekend, I finished out the Z axis, with some minor changes. I went with locally-available U-channel aluminum which was thinner (1/16” vs 1/8”) than the rails ordered from McMaster. During cutting, I also managed to introduce some scuffing on the running surface which I mostly sanded out. I hope these won’t affect the friction of the PFTE bearings (but if it does, I found the 1/8” U-channel in the meantime).

The hardware mounting the Z-axis is a mix of imperial and metric, with lots of washers adapting for different size parts — mainly because I ran out of metric parts from my initial order, but also because it’s difficult to find 5/16” machine screws with a bevel head suitable for countersinking. Ace Hardware doesn’t carry any 5/16” with a bevel here, Home Depot has them but only in specific sizes, OSH only carries them >2” long, but I managed to find the best selection at the tiny San Andres Hardware on Micheltorena (they’re slotted, not phillips like the rest, but oh well).

I have a completed set of early RepRap electronics so I’ve run some quick tests and all the stages move as they’re supposed to. Excessive torque on the Z-axis will knock the nut out of the pipe clamp, which makes me wonder if it will also move over time and require re-homing even with lower torque. Next on the list is to finish some extender cables so I can put all the stepper controller boards in one place. I’m using 18-ga sprinkler wire for the stationary X and Z axes, but some super-flexible 16-ga speaker wire that Berto had leftover from a project for the Y axis (which is the only motor that moves during operation).

I’m using the stock leadscrews for now (1/4 × 20 stainless allthread) but intend to move to an alternate drive system after I get the electronics hooked up and a couple of test millings. The primary ones I’m considering are:

  • ball-chain, which I’ll cover in an upcoming post
  • wooden auger, although I’ll need to unbox my drill press
  • larger (3/8” or so) allthread, since it’s coarser (16 tpi vs 20)