This topic is one that I enjoy quite a bit, but unfortunately there is a lot of problems out there.
I'm sure everyone has heard of hashing before, but many are not quite sure what it is, how it works, why it's used, or the various uses for it, and why are there so many hashing algorithms.
The hashing idea started when programmers determined a need to various situations to generate a "key" or a "unique identifier" for chunks of data. Say you had 20 strings, each one being 300 characters long, to find out if a new string is in the list (and if so, which one) you'd need to go through 20 entries, and depending on how similar they are, you'd have to do up to 300 comparisons on each one. Needless to say, finding such a needle in a haystack is slow. It also means that to compare, one needs an exact copy stored in the database. So for example, if I was passing you a file, and we wanted to know if it transfered correctly, it would be annoying (to say the least) for me to send you a second copy for comparison. Furthermore, there is no guarantee that any transfers would work right.
For all these reasons above (and lesser reasons as well), hashing was invented. The idea in hashing is that we apply a truncating transform to the data in question. Imagine we have a number 0 to 99999999999999999999, we could have a truncation where we add up all the digits over and over till we get a single digit. So if our number was 123456 we could add that up to 21 and that to 3. Although converting any arbitrary number size to a single digit isn't very unique, as we'll only have 10 possibilities, this follows the basic idea of hashing. A good hashing algorithm will have drastically different results for similar numbers, and wouldn't have a situation where 123456 and 654321 and 123465 all converted to the same hash. What we need to realize though is three key points in hashing.
1) There is truncation. One can't determine the original number from a single hash alone.
2) There is a limited size to the hash itself, and we can't uniquely identify all data with a smaller key. We only have as much uniqueness as the amount of possibilities of the hash itself. Meaning for our hash above, we'd get unique hashes if the input was a single digit, once we're past that point, we have "collisions". If the hash is 4 bytes for example, the uniqueness is only guaranteed on inputs block of up to 4 bytes.
3) Hashing algorithms are completely arbitrary mathematical transforms. Normally, when you read a good programming novel (source code), you like to see some sense in the code, like having objects that model real world objects, or control flow that matches what the user sees. When you look at hashing code, you normally just see a bunch of shifts and rotates, xor'ing data with all sorts of seemingly random values, and producing results that look like A4B31AAD49879FE which doesn't mean anything to anybody. While good hashing algorithms try to focus on transformations that make the hash/key end up looking drastically different for similar inputs, the code to do so really is just gibberish from any standpoint. One shouldn't ask, but why is he adding up all the digits? Why not flip every other bit here? Why not rotate every 3rd bit 4 over?
Now that we have a basic understanding what hashing is, let's look at various hash uses more in depth.
One of the most common uses in programming is the Hash Table. Say you have a programming language with exactly 64 keywords. You want to write a compiler for this language, as you parse the input source code, you'd have to compare every word to one of your keywords. You can store your keywords in a table in lexicographical order, and binary search each word to see if the word is in there, but this generally means you'll bump into on average 5 keywords which you'll have to do string comparisons on before you find for certain if the word you have in question is a keyword or not, and if so, what to do with it. While 5 isn't much, string comparisons are slow as they involve many comparisons internally, and the number is multiplied by the amount of words in the source code. Now while our example has 64, you may find other examples where the amount is 1024 bumping up the amount of failure checks to 10, put together, this ads up to a slowness in your program where there doesn't have to be.
The trick in this case would be to devise a hashing algorithm for your table that produces 64 unique possibilities, with each key mapping to exactly one keyword. Then instead of binary searching with many string comparisons, one can hash the word in question, jump directly into the array and then do the last comparison to see if it was a keyword or not. If the hash algorithm was 3 operations, and the average string comparison was 10 operations, you just cut down keyword check overhead from 50 operations to 3. Of course some of you are thinking that finding a hash algorithm to map so perfectly for this particular situation is probably non existent. However with hashing, we have many tricks up our sleeve to approach this ideal situation. If say our hashing algorithm produced 80 possible key results but otherwise gave each keyword a unique key, we'd still be able to reach our goal, as we'll just have 16 holes in the array that if jumped into, we know immediately that the word in question was not a keyword. There's also techniques available where you could have more than one key occupying a slot in the array, as it's really an array of arrays, and hashing turns into a kind of more optimal binary search, where instead of cutting the array in half, you eliminate 90% or more of the array each time. One could also setup a situation where we instead store both the hash and the strings in an array, and then sort the arrays by the hash, and then one only need to binary search for the hash, thus turning lengthy string comparisons into single integer comparisons (provided that the hash result is an integer of a fixed width).
Using the idea of hashing to save against lengthy string comparisons also works well with all kinds of data structures. You could for example change the key in a binary tree from being a string into a hash of the string, to speed up tree traversal. Hash Tables are also very useful if you need to keep track of data that is processed on the fly, using an array of array method touched on above.
That is your basic data structure use of hashes, now I'd like to focus more on the "signing" aspect of it. Today, there is a multitude of "cryptographic hash functions" which are designed mathematically that it should be very time consuming for a program to just determine how to generate a series of data to match the hash, forcing one to resort to brute force. Once brute force is required, and the possible data set size is well over the trillions, it wouldn't be feasible (using conventional methods at least) to determine a matching set of data for the hash, creating the illusion of uniqueness of the key, making it a way of signing a set of data so one can validate its integrity after transfer. That's why you quite often see websites listing the "MD5" or "SHA-1" of a file you want to download.
This idea can also be expanded to databases, and provide a way to identify data. Say you want to setup a used DVD buy and sell store. When someone comes into your store with a DVD in some case, you want to make sure it's a real DVD and not some copy (or if it is a copy, at least true to the original), that it's not scratched to the extent that some of it can't be read, and you need to know which edition and for which region the movie is in. How does one setup his used DVD store in a way that he can verify a DVD is good before he buys it, and know which edition it is to label it properly?
To have an archive of each DVD somewhere isn't practical, and if it has to be compared over a network, the transfer time would just be too long. If one were to setup a network where some hashes of the data alongside some data on the DVD such as track name, language lists, etc..., a used DVD store could make use of this database to identify and validate DVDs. The people running these stores can all use a program which they pass new DVDs they buy through. The program would collect 2+ cryptographically secure hashes from the full data of the DVD, and rip all other info that can be obtained, using a program such as MPlayer. Then the user can enter in additional info such as what name is on the box, and which edition the box says it is. This data is then placed on a table on a network (website). If one wants to identify a DVD, after hashes are built off of it, these hashes are then scanned for in the table. The primary key of the table should be the hash which has the smallest length, and sorted by it, so the correct section of the table can be binary searched for very quickly. Then the other stats such as extra hashes and track titles can be compared against to make sure this is the right DVD, and not one which happened to match a single hash.
Once such a setup is in place, someone running a DVD store just has to stick a DVD someone is trying to sell them in a machine, and in a few minutes, they can know if it's real, can be read fully, and which edition it is. If it's not in the database, they don't have to make a mistaken purchase, or spend time having to watch the whole thing to make sure it's complete and doesn't stop in the middle anywhere. And with the information pulled from the web, they know if it has to be labeled wide screen edition or deluxe edition or whatever (which isn't always written on the DVD itself, just the box which may not be part of the sale, or may even be the wrong box with the intention to fool the buyer).
Some key points to remember listed above is multiple hashes, sorted by the smallest (even perhaps weakest) hash, and have auxiliary data to verify against, and use cryptographic hashes. If hashes were chosen which could be replicated quickly, people who want to beat the system could make a DVD which just has some matching track lables or whatever data, along with random data to match the hashes in question. Using 2+ hashes also makes it much harder to generate false matching data, as methods to crack hashes are very different from each other, and increase the computational complexity exponentially.
There are various hashes today with varying levels of security, complexity, and length which are good for many of these purposes mentioned above and others. One popular family is the CRC-16, CRC-32, CRC-64, with the numbers referring to how many bits are in hash. CRC is quick and easy to compute (as well as to crack), but since it fits into native integer sizes on CPUs, it's very good for short tables you keep in RAM that you want to deal with quickly and repeatedly. However it's not good for verifying data, as how easy it is to crack, not being cryptographic at all. Focusing on cryptographic hashes, next up which is nice and popular MD-5. MD-5 was an old favorite which has been considered broken for a while now, yet is still in wide use for signing data. MD-5 is 128 bit, and is a decent choice to use along side other keys for extra protection, but shouldn't be used by itself. Another very popular choice today is the SHA series, which offers truncated and non truncated flavors. Personally, I find truncating a hash to be pointless, so we'll just focus on SHA-1 (160 bit), SHA-2 32 bit words (256 bit), and SHA-2 64 bit words (512 bit). SHA-1 was considered a step up on MD-5 and 'the thing to use', but it isn't that much more secure, and while not busted wide open as of the time of this writing, it has been weakened significantly and shouldn't be used by itself either. SHA-2 is significantly more secure, as well as giving long bit sizes to provide more uniqueness of keys. If speed is a consideration, using the 32-bit or 64-bit variants might factor into which CPUs your hashing use is focused on, although of course either can be computed with any kind of CPU, it's just a matter of speed. SHA-512 is regarded as one of the strongest hashes today, but it's hardly the only game out there. There's also RIPEMD (160 bit) as an alternative to SHA-1 being the same size, and built along the same lines as MD-5, which also has its problems, but is a favorite in certain circles. Looking for more unusual but strong hashes, there is Tiger which is 192 bit, and quite popular in certain P2P applications. Tiger being less popular, it hasn't seen the scrutiny some of the other hashes have, so it might not be as good as the SHA series, or it might be better, hard to say, but to my knowledge, doesn't have any significant attacks on it. Tiger2 is also in the works which is supposed to tighten any issues discovered in the first edition, and is supposed to be out in the not so distant future. Lastly, there is Whirlpool which like the 64 bit SHA-2 is 512 bits, and uses heavy transformations making it an excellent choice for hashing, and may be the best popular one currently available, although like Tiger it hasn't seen as much scrutiny as the SHAs. Although Whirlpool is currently at v3, which strengthened any possible issues pointed out since its invention.
Once you decide which hashing algorithm to use, you may wonder where exactly should you get the code from? Since hashing code isn't really supposed to make sense to someone looking at it, clarity isn't what one should be looking for when trying to find a good library. One should focus more on correctness, completeness, portability, usability, and speed.
If you're looking for BSD licensed version of MD-5, RIPEMD, SHA-1 and 2, you can grab good accurate source right out of OpenBSD's CVS. Unfortunately though, the source has some BSDisms in it which will have to be removed if you want to use it elsewhere. It also doesn't make use of the C99 types which can be used to get integers of exact sizes, which is very important when you want portable accurate code with hashing.
There's Botan which has good accurate hashing as well, of every hash we discussed here, and even some assembly optimized versions, all under the BSD license, but it's tangled up with other code making it a bit annoying if you just want to grab some source for a single hash algorithm to use. The portable code is also in C++, which isn't good for C only applications.
Mhash is another library one can use under LGPL which covers all the hashes. I notice though that its output of Tiger is backwards, which also makes me worry about endian issues in general with it. It also suffers of other issues mentioned above.
Crypto++ is a popular C++ library for hashing (and encryption) placed in the public domain. This library is rather excellent, although annoying to use if you only want to grab one source file for a single hash. It also has some assembly optimized versions available, but doesn't use the function pointer methods explained in my previous blog post, and instead checks each time whether your CPU has the instructions needed to compute the hash before computing it. Also in a test I ran, the SSE2 optimized version of SHA-512 crashed under Mac OS X (Intel of course), which may be due to unaligned memory accesses.
If you just want SHA hashing, Brian Gladman has a popular library to keep you covered. It's written in C and dual licensed as BSD and GPL. But it doesn't have any of the type safety of C99 mentioned above.
If you're looking for Whirlpool, be careful, if you look on Wikipedia, it links to source which seems to work at first, but results drastically changed based on what size chunks the data is passed in. I noticed a buffer size of 1-8191 bytes produces different results than hashing the data in 8192 byte chunks.
LibTomCrypt, another implementation under C and for the public domain seems very nice on the surface. But on further testing, I found that if the length of data passed to it for Tiger or Whirlpool hashing is not a multiple of 64, it gives improper results. Problems could be in other situations which I did not test. I really liked this though, as the code was pretty clean, and many implementations were rather fast.
OpenSSL which is very popular and in C with an Apache style license is a good choice if you want the popular hashes, as it doesn't cover Tiger or Whirlpool (but has MD-5, RIPE, and SHA). It also has assembly implementations in certain cases, but the assembly code is really a nightmare, and it's hard to integrate just part of OpenSSL into a project.
If you're looking for another implementation of SHA in C under BSD, and written rather well (but perhaps not as fast as Brian's, look no further than Aaron Gifford's implementation. If I'm not mistaken, it's also his SHA-2 in OpenBSD (but messied up a bit).
A good source which covers our favorite hashes properly is the md5deep project, only thing missing is RIPEMD code. I rather like this one and think it's a good place to start.
Of course, there's also official implementations, for Tiger, Whirlpool, and RIPEMD. Although the official Tiger code seems to be one pass only, no simple hash_init(), hash_update(), hash_finalize() methods. And the official Whirlpool and RIPEMD implementations are rather slow. Some of the other implementations of Whirlpool pointed out above were 25% faster.
It'd also be nice to have some kind of consistency when it comes to which functions to use. Before choosing, make sure you test well on various machines of different endian, bit width, and on various data sizes and chunk sizes. You may be surprised how many libraries just aren't written right and produce wrong data.
I'm personally working on C99 implementations of these, which I'm testing on Win32, Linux x86-32, Linux x86-64, Mac OS X PPC, Mac OS X Intel, in a large variety of situations to certify correctness and portability. Of course I also plan on having a consistent interface for all of these, and each being contained to a single .h/.c pair so it can be easily dropped in anywhere. Right now I'm currently trying to optimize my Tiger implementation more, and adding more assembly variations using the function pointer method documented previously to give the best speed, while providing portability and usability. Let me know if you're interested in using it when I'm done.
While it is ridiculous for yet another person to make yet another hash library, it is rather appalling how many out there are broken or have very bad bugs in them, or act very differently on another system. Don't make the mistake of using a library which hasn't been tested heavily in a variety of situations.