ShaderOp.com

ShaderOp.com

Hi there. I have moved to a new website at Mhmmd.org. I'm no longer updating this one, but I'm keeping it around so the Internet wouldn't break. See you there.

Bad Assumptions: Hashing Algorithms

I’m doing some prep work for a little personal project I’m working on. For reasons that will hopefully become clear in future posts, I want to do the following:

Given the fully-qualified name of a type (whether it’s a class, struct, or function), generate a  unique identifier for it.

The identifier has to have the following properties:

  1. It should fit into 32 bits.
  2. It should be unique only within the scope of a given program.
  3. Speed isn’t a big factor since the results can be cached, but the faster it can be generated the better.

Since the scope of uniqueness I care for is limited, I theorized that the first four bytes of a SHA-1 hash of the fully qualified name of the type should be unique enough in a given program space. It’s an assumption that sounds reasonable, but I had to test it before basing any code on it.

And while I was at it, I decided to test MD5 as well because it’s supposedly faster. I also threw in CRC32 just for laughs, because I reasoned while it would be the fastest of the three it would fail miserably at generating unique ID’s with a large enough set of types. Or so I thought.

Generating SHA-1 and MD5 using .NET is trivial thanks to the classes in the System.Security.Cryptography namespace, but I had to look elsewhere for CRC32. I finally found an excellent and robust implementation on David Anson’s blog.

To test all three I wrote a little C# console application that went through all the types in the System assembly, computed all three hashes, and took the first four bytes of the hash and stuffed them into a UInt32. The program ends by printing out the total number of types found and the total number of unique IDs generated.

Here’s the first run with just the types in the System assembly:

Number of types         = 2779
Number of sha1 hashes   = 2779
Number of md5 hashes    = 2779
Number of crc32 hashes  = 2779

That looks reasonable enough. CRC32 was doing pretty well. But I was sure it won’t take long to break it.

Same test again, but this time with System, System.Xml, and System.Xml.Linq

Number of types         = 3740
Number of sha1 hashes   = 3740
Number of md5 hashes    = 3740
Number of crc32 hashes  = 3740

I kept adding assemblies and rerunning the test, and still all hashing algorithms managed to produce unique ID’s. The final test I ran was this:

Number of types         = 10183
Number of sha1 hashes   = 10183
Number of md5 hashes    = 10183
Number of crc32 hashes  = 10183

10,000 types and still CRC32 was holding its ground. At that point I began to suspect that something was wrong with my code. More drastic testing was required.

So I downloaded Moby Dick, modified the program to run through the file and store every line in a list while removing duplicates and trimming whitespaces, and then ran the hashing algorithms on each line in that list:

Number of unique lines  = 18847
Number of sha1 hashes   = 18847
Number of md5 hashes    = 18847
Number of crc32 hashes  = 18847

And again with War and Peace:

Number of unique lines  = 50605
Number of sha1 hashes   = 50604
Number of md5 hashes    = 50605
Number of crc32 hashes  = 50605

Finally one algorithm produced a duplicate, but it’s SHA-1. CRC32 is still tugging along happily.

I wasn’t going to give up until the others broke, so I tried again with The Bible:

Number of unique lines  = 98377
Number of sha1 hashes   = 98377
Number of md5 hashes    = 98376
Number of crc32 hashes  = 98377

The word of God did manage to shake MD5, but not the humble CRC32. Underestimating the meek is never a good idea.

All three books combined together now in one giant 10-megabyte file:

Number of unique lines  = 167309
Number of sha1 hashes   = 167307
Number of md5 hashes    = 167305
Number of crc32 hashes  = 167307

Finally CRC32 cracks, but not before finishing head to head with SHA-1.

Conclusion

I don’t know much about statistics and cryptography, so I don’t know what conclusions someone with better knowledge of the subject would draw from all of this. But in my case I’m now more than comfortable to assume the CRC32 will cover my needs quite adequately. I will probably still add some debug-only checks to my code to ensure that no duplicate ID’s are generated in my program, but otherwise, and contrary to my initial assumption, CRC32 are unique enough.

2 Comments

# By sha1 hash on August 24, 2010, 10:53 pm

Now that’s quite an interesting experiment. Would be interesting to see how it trended as you expanded identifier size beyond 32 bits.

# By Mohammad on August 24, 2010, 11:03 pm

Thank you!

I’m sure (although I might be proved wrong again) that SHA-1 at least would perform a lot better for a 64-bit identifier. But since I’m mainly targeting Visual C++ 2010, where INT is still 32-bit even for 64-bit builds, it’s not a pressing need for me at the moment.

But It’s definitely something I might want to look into in the future.

Powered by: