I’m doing some prep work for a little personal project I’m working on. For reasons that will hopefully become clear in future posts, I want to do the following:
Given the fully-qualified name of a type (whether it’s a class, struct, or function), generate a unique identifier for it.
The identifier has to have the following properties:
- It should fit into 32 bits.
- It should be unique only within the scope of a given program.
- Speed isn’t a big factor since the results can be cached, but the faster it can be generated the better.
Since the scope of uniqueness I care for is limited, I theorized that the first four bytes of a SHA-1 hash of the fully qualified name of the type should be unique enough in a given program space. It’s an assumption that sounds reasonable, but I had to test it before basing any code on it.
And while I was at it, I decided to test MD5 as well because it’s supposedly faster. I also threw in CRC32 just for laughs, because I reasoned while it would be the fastest of the three it would fail miserably at generating unique ID’s with a large enough set of types. Or so I thought.
Generating SHA-1 and MD5 using .NET is trivial thanks to the classes in the System.Security.Cryptography namespace, but I had to look elsewhere for CRC32. I finally found an excellent and robust implementation on David Anson’s blog.
To test all three I wrote a little C# console application that went through all the types in the System assembly, computed all three hashes, and took the first four bytes of the hash and stuffed them into a UInt32. The program ends by printing out the total number of types found and the total number of unique IDs generated.
Here’s the first run with just the types in the System assembly:
Number of types = 2779 Number of sha1 hashes = 2779 Number of md5 hashes = 2779 Number of crc32 hashes = 2779
That looks reasonable enough. CRC32 was doing pretty well. But I was sure it won’t take long to break it.
Same test again, but this time with System, System.Xml, and System.Xml.Linq
Number of types = 3740 Number of sha1 hashes = 3740 Number of md5 hashes = 3740 Number of crc32 hashes = 3740
I kept adding assemblies and rerunning the test, and still all hashing algorithms managed to produce unique ID’s. The final test I ran was this:
Number of types = 10183 Number of sha1 hashes = 10183 Number of md5 hashes = 10183 Number of crc32 hashes = 10183
10,000 types and still CRC32 was holding its ground. At that point I began to suspect that something was wrong with my code. More drastic testing was required.
So I downloaded Moby Dick, modified the program to run through the file and store every line in a list while removing duplicates and trimming whitespaces, and then ran the hashing algorithms on each line in that list:
Number of unique lines = 18847 Number of sha1 hashes = 18847 Number of md5 hashes = 18847 Number of crc32 hashes = 18847
And again with War and Peace:
Number of unique lines = 50605 Number of sha1 hashes = 50604 Number of md5 hashes = 50605 Number of crc32 hashes = 50605
Finally one algorithm produced a duplicate, but it’s SHA-1. CRC32 is still tugging along happily.
I wasn’t going to give up until the others broke, so I tried again with The Bible:
Number of unique lines = 98377 Number of sha1 hashes = 98377 Number of md5 hashes = 98376 Number of crc32 hashes = 98377
The word of God did manage to shake MD5, but not the humble CRC32. Underestimating the meek is never a good idea.
All three books combined together now in one giant 10-megabyte file:
Number of unique lines = 167309 Number of sha1 hashes = 167307 Number of md5 hashes = 167305 Number of crc32 hashes = 167307
Finally CRC32 cracks, but not before finishing head to head with SHA-1.
I don’t know much about statistics and cryptography, so I don’t know what conclusions someone with better knowledge of the subject would draw from all of this. But in my case I’m now more than comfortable to assume the CRC32 will cover my needs quite adequately. I will probably still add some debug-only checks to my code to ensure that no duplicate ID’s are generated in my program, but otherwise, and contrary to my initial assumption, CRC32 are unique enough.