Tim Hatch

Weblog | Photos | Projects | Panoramas | About

lcs_32 in sse3 22 Sep, 2010

The code for lcs_sse is now available, taking two uint32_t[4] pointers. Make sure you use the correct -march=core2, -march=native, or -march=opteron -msse3 to get the fastest byteswapping.

Available at lcs_sse.c (tested on Core 2 OS X with GCC 4.2.1, Core 2 Arch Linux with GCC 4.5 and ICC 11.1, and an older distro using sse2 fallback on an Opteron 250. No runtime detection occurs, it's entirely compile-time.)

MD5game in 2010 22 Sep, 2010

This post is a followup to md5game in 2009, and makes more sense if you read that first.

I started by doing a bit more research, using Intel manuals and some excellent resources by Agner Fog. Because in 2009 we didn't get to do any intense benchmarking (only overall speed), the first thing was to figure out whether the hotspots were where we thought they were. I'd assumed that the best spot for optimization was in the lcs code. This was only partially true. The entire program is basically three components, and the actual time spent on each is about:

  • 450 clock cycles computing an md5
  • 400 clock cycles in lcs_32_rough (28000 false positives per 8M)
  • 4000 clock cycles if lcs_32_rough matched

So there were actually four places I could optimize: those three, plus decreasing the false positives. Given that, I picked an old idea I had for lcs_32_rough, one that had never been competitive when written in gcc, as my first assembly project, using the following rough algorithm:

Treat the resulting md5 string as binary, avoiding the hex conversion, so it cleanly fits in an xmm register (or, on 64-bit platforms that don't have sse2, in two 64-bit registers). Take the initial position of both, and check for matching bytes (a hand-wave at the time, but it turns out there are instructions for doing just that, PCMPEQB for example, plus other tricks you can read about in Hacker's Delight). Then (another hand-wave) check to see if three bytes consecutively match. So you end up with something like:
xmm0 = abcdefedcba
xmm1 = ffCDEFfffff
t    = 00111100000
if count(t) > 8 return true;

; shift one over and check again, there are 13 positions shifting either
; right or left
xmm0 <<= 4
j top

Overall I spent about 5 nights reading up on sse instructions, two nights coding up the initial comparison routine, and another two (sleepless) nights trying to find the secret to doing the consecutive-match part of it. Turns out it's really easy, something akin to a & (a << 1) & (a << 2) in c. End result: about 3350K/s, a good chunk of which is due to reducing the false positives from 28000 per group of 8M to about 175 per 8M.

Having finished this, my advice to anyone attempting to do this is:

  • Read about GCC's extended ASM. Then read it again. Make a bookmark; the syntax is very arcane, and you'll need to get it right to make sure that your code doesn't get optimized away ("r"(ptr) might be what you need to load your string into a register, but "r"(*ptr) is necessary so GCC will actually put the data there!)
  • Download all of Agner Fog's pdfs, particularly instruction_tables.pdf, print them out and write the column headers in yourself (your scrolling hand will thank you)
  • Benchmark, benchmark, benchmark. Use RDTSC on recent chips, or raw gettimeofday(2) if you have to. But above all, know where your hotspots are, and don't waste time on the parts the compiler is already making fast for you (for example, I had a complex shift operation that I rewrote in assembly, only to find it slower than the original -- turns out that GCC knew about LEA which lets you do simple multiplications by constants quickly, and I didn't).
  • Share your results quickly with others (it forces you to write good code), try your code on 32-bit platforms (it forces you to write good code, that doesn't waste registers), and be sure to build in sanity checks that you can enable (it forces you to write correct code). In my case, sanity checks are a PEDANTIC define that causes it to run both the rough and full lcs, and complain if/when they differ in a way that matters.
  • When you're using a network protocol, keep in mind both the latency to make the connection and the extra overhead of making a database connection. Although shared hosting kept up with the database quite well (2.4GB and counting, including indices), it couldn't keep up with hundreds of incoming http requests. Just batching 10 results per request (reducing hundreds of requests to tens) helped immensely.

So far, I've spent many hours on this impossible project, and I'm no closer to finding the answer. I do have the 17.3 ratio, which is very interesting (and to my knowledge, not noticed by anyone else yet). I also have a single length 14 match, one of two known in the universe. Other than that, I just have a marginally higher electric bill to show for it. So make sure you're learning as you go along.

P.S. Assuming the 17.3 ratio keeps going for higher match lengths, and assuming my figure of one length-8 match per 8M (which was just off the top of my head), that means totals of:

>>> 16**32 / 8000000              # length 8:  ~4.25e31 matches in the space
>>> 16**32 / 8000000 / (17.3**3)  # length 11: ~8.2e27
>>> 16**32 / 8000000 / (17.3**24) # length 32: ~82

That gives an individual candidate a chance of about 2.4e-37 of being a full 32-character match. With 16 cores running the latest code, that’s still 2.47e21 years searching for each one.

MD5game in 2009 21 Sep, 2010

Graph of results

The graph above is the number of matches that my code found for an obscure programming project called md5game, per day since I started writing it.

Let me start off with a little background. I’m a computer programmer by trade, and keep my more esoteric skills sharp by doing programming challenges (like Project Euler). They let me play with some features that I wouldn’t use otherwise at $JOB, say red-black trees or finding the millionth prime number.

About a year ago, May 2009, someone posed a question on reddit, asking if there is essentially a fixed point in md5, where md5(x) == x, so you could run a string through the algorithm and get back the same string. Since rough estimates put determining that at well beyond the age of the universe on current computers, a lesser challenge was posted: find the longest common substring between input and output, with a more specific declaration of the rules (lowercase, input is a string, output considered as hex).

This problem is interesting to me for a few reasons. One, I recruited some collaborators, and I like writing code when there's at least one user besides me. Two, it's an infinite problem, which means that it's easy to distribute without complicated workunit tracking (like SETI@Home for example). And three, I keep finding ways to make it faster. This post is part 1, covering optimizations that Eli and I found in 2009.

The first version I wrote in python, and it went something like this:

import random
import md5  # hashlib.md5 in modern Python

hexdigits = '0123456789abcdef'
while 1:
    src = ''.join([random.choice(hexdigits) for i in range(32)])
    dst = md5.new(src).hexdigest()
    if lcs_32(src, dst) > 8:  # lcs_32: longest-common-substring helper, defined elsewhere
        print src, dst

This could evaluate about 2.5K candidates per second, and didn't really give a lot of results. I found out as the project progressed that there's only a match of length 8 or greater about once per 8M candidates. I wasn't the only one writing code like this: slow, and written in an interpreted language. Clearly I could do better.

So off to C. The first version that worked well was mostly contributed by Eli, and ran at 247K/s. That’s almost exactly two orders of magnitude and I was happy. We ran that for a while starting on May 7, 2009.

May 9, I wrote my own leaderboard to keep track of all the results coming in, and also started ignoring all matches of less than 8. Soon after it became clear that there’s about a 17:1 ratio between length 8 and 9. It turns out that the ratio is roughly constant around 17.5:1 for 8, 9, 10, and 11.

A few patches poured in — I included md5.c from BSD instead of linking with openssl which gave a boost, and Eli made several incremental improvements to the table-driven dynamic programming lcs function I’d originally written, which took the speed up to 336K/s on May 12. The single biggest change from this point on would be the move to using a first rough guess at whether the lcs would be long enough to care about. This lcs_32_rough took the speed up to 1090K/s.

Further tweaks to lcs, and making everything global and static along with compiling using -fwhole-program and using a 64-bit OS got us up to 1965K/s on May 25, 2009. The source itself didn’t get any further changes in 2009, because shortly after that my apartment was burgled, so I no longer had a G4 machine to test big-endian code on (writing endian-portable code is tough without a big-endian machine around). The longest match I found was a length 14, which ties the top on the official leaderboard.

After this, I wrote a script to exhaustively try compiler options, which were pretty similar to the ones we’d picked out by hand: inlining everything is fast, because of fewer jumps, unrolling loops also removes jumps, and no amount of compiler options will help gcc auto-vectorize code with strange loop counts and dependencies between iterations. That’s basically where it stayed in terms of performance, about 2800K/s on a Core 2 Q6600 (65nm) per core which used mostly integer operations.

The next post is about picking it up again in August 2010, and applying new knowledge in x86 SIMD operations to speed up lcs_32_rough in assembly.

Promise TX2/3 IDE Controller 21 Sep, 2010

Note to self: the correct linux kernel module for my TX3 IDE controller is not sata_promise even though it says TX2/TX4 in the name and the devices show up as /dev/sd*... it's pata_pdc2027x now.