Today we have another short post, focused on two different approaches to similarity analysis (also known as fuzzy hashing). While MD5, SHA1, and other cryptographic hashes are great for validation and integrity checks, any modification to the source file completely changes the output. And that is by design - these hashes are supposed to help us identify distinct files. For this reason, we have to reach for other tools to assist in similarity analysis; one of the most popular utilities is ssdeep, developed by Jesse Kornblum (project GitHub here).

The ssdeep utility leverages the spamsum algorithm, developed by Dr. Andrew Tridgell to help identify spam messages in email. This algorithm, found at samba.org, implements what is known as context triggered piecewise hashing ("CTPH"), as further discussed in this DFRWS paper. Before we dive into why CTPH is useful here, let's evaluate a few other strategies for similarity analysis.

As you may already know, similarity analysis does not work well on compressed or encrypted data. This is because compression and encryption can produce notably different outputs from slightly modified inputs (a topic for another post). This means that docx files, for example, do not play well with the strategies we will discuss; instead, we would want to extract the content of the document (i.e. the raw text or other media) and generate a signature of that content rather than of the binary file itself (once again, a topic for another post).

Similarity Analysis Strategies

It may go without saying that our most accurate approach to similarity matching would be to compare the byte content of two files, side by side, and look for differences. While we may be able to accomplish this using a mix of xxd and diff, this only really works at a small scale. Once we move from comparing two small files to comparing many small files, or a few medium-sized files, we need a more efficient approach. This is where signature generation comes into play.

To generate a signature, we must have a few things figured out:

  • What alphabet we want to use for our signature
  • How we want to segment the file into "summarizable" blocks
  • The technique for converting our "block summary" into a character from our alphabet

While the alphabet is an optional component, it allows us humans to more easily review and understand the data. We could always store the signature as integers and save a tiny bit of computational resources.

In this blog post, we will use base64 as our alphabet - as do spamsum and ssdeep. For the second and third items above, let's discuss a few techniques for slicing up our file and generating our hash value. For this example (and to keep things simple), let's use the lowercase letters a-z followed by the numbers 0 and 1 as the content of our file.
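
To make the rest of the walkthrough concrete, here is a minimal Python sketch (our own illustration, not ssdeep's actual code) of that signature alphabet, along with a hypothetical signature_char helper that maps an integer block summary onto a single character:

```python
import string

# The standard base64 alphabet, used here as our 64-character signature alphabet.
BASE64_ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def signature_char(block_summary: int) -> str:
    """Map an integer summary of a block onto one of the 64 signature characters."""
    return BASE64_ALPHABET[block_summary % 64]

print(signature_char(0), signature_char(63))  # 'A' and '/' -- the first and last characters
```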

Our first approach is to slice the file into equal-sized blocks. The below example shows our file content - the letters a-z followed by the numbers 0 and 1 - which we have decided to split into 4-byte blocks. The first row is each byte of the file, with its ASCII value on the second row.

You can find the ASCII and Base64 reference tables here.

We summarize each of these 4-byte blocks by summing the ASCII values of its 4 characters, as shown on the third row of the table. We then convert each block sum into our base64 representation; for the first block, 394 modulo 64 (394 % 64) is 10, or K in the base64 alphabet.

Initial example of summarizing blocks of a file

The letter K becomes our summarization of the first block, a of the second, and so on until we have our complete file signature of Kaq6KaU.
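
A rough Python sketch of this fixed-block approach, reusing the signature_char helper from above (the function name fixed_block_signature is our own, used only for illustration):

```python
def fixed_block_signature(data: bytes, block_size: int = 4) -> str:
    signature = ""
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]   # the final block may be shorter than block_size
        signature += signature_char(sum(block))  # sum the block's ASCII/byte values, map to base64
    return signature

print(fixed_block_signature(b"abcdefghijklmnopqrstuvwxyz01"))  # Kaq6KaU
```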

The next image below shows a slightly modified version of our original file: someone replaced jklmn with hello. We can now run our hashing algorithm against this file to get a sense of how much has changed between the two versions.

Example of file modification with fixed-block sized

Using the same technique, we calculate a new hash value of Kai6KaU. If we wanted to compare the similarity of our two files, we should be able to use our signatures to facilitate that comparison - right? In this case, there is only one letter of difference between the two signatures, suggesting our two file streams are largely similar!
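
A naive way to express that comparison in code is to count the characters that differ between the two signatures (ssdeep actually scores similarity with an edit-distance-based calculation; this is only to illustrate the idea):

```python
original = fixed_block_signature(b"abcdefghijklmnopqrstuvwxyz01")   # Kaq6KaU
modified = fixed_block_signature(b"abcdefghihelloopqrstuvwxyz01")   # Kai6KaU

# Count positions where the two signatures disagree.
differences = sum(1 for a, b in zip(original, modified) if a != b)
print(differences)  # 1 -- only one character differs
```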

As you may have spotted, there is an issue here; we have found a hash collision in our algorithm. In the prior example, the fourth block of each file is different - the first is mnop and the second is loop - yet both blocks produce the same signature character. Since we are simply summing our file content to determine the signature value, we are bound to get an unhealthy number of collisions. These collisions may cause us to think files are more similar than they actually are, and they are an unfortunate product of summarizing file content without a cryptographic hash algorithm. For this reason, we have to find a better balance between summarizing file content and avoiding hash collisions.
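
You can see the collision directly by summing the byte values of the two blocks:

```python
# Both 4-byte blocks sum to 442, so they map to the same signature character.
print(sum(b"mnop"), sum(b"loop"))  # 442 442
print(signature_char(442))         # '6' for both blocks
```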

Our next example demonstrates what happens when an insertion occurs. As you can see in the below image, the letter h was inserted after mn, adding 1 byte to the file and causing all subsequent content to shift right by one. In this instance, our last 'block' will just contain the number 1, though some implementations may handle this differently.

Example of insertion of file content with fixed-block sizes 

Using our same formula, we calculate a hash of KaqyGWbx. This hash is largely different from Kaq6KaU. In fact, once we reach the block containing the change, every remaining character of the hash is different, even though we have similar content in the latter half of the file.
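
Running the fixed_block_signature sketch against the file with the inserted h shows the shift (assuming the same 4-byte blocks described above):

```python
# Every block after the insertion point shifts by one byte, so the tail of
# the signature no longer matches the original Kaq6KaU.
print(fixed_block_signature(b"abcdefghijklmnhopqrstuvwxyz01"))  # KaqyGWbx
```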

This is one of the main reasons that fixed blocks are not the best approach when it comes to similarity analysis. Any shift in content moves data across the boundaries and will cause us to calculate completely different hashes for similar content. To address that, we need to be able to set these boundaries in another way.

Context triggered piecewise hashing

As you probably guessed, this is where CTPH comes into play. Essentially, we are aiming to calculate reset points with this technique. Reset points, in this case, are similar to the 4-byte boundaries we used in the prior example, in that they determine how much of the file we summarize at a time. The notable difference is that we pick the boundaries based on file content rather than fixed offsets. To do this, we use a rolling hash, as employed by ssdeep and spamsum, to calculate a value at each byte of the file and look for a specific value. When this specific value is found, a boundary line is drawn and the content since the prior boundary is summarized. In the example below, we are using a simplified calculation to determine whether we have reached a reset point.

While both spamsum and ssdeep calculate the reset point number for each file, for our example, we will use 7. As an additional note, this technique is meant for files with more than 28 bytes, so our hashes here will be really short and, therefore, less useful outside of illustrative purposes.

Before jumping into the example, let's talk through what a rolling hash is. Once again, we will use the same a-z01 file content from the prior examples. We then use a rolling hash to calculate a value for each byte of the file. A rolling hash works by calculating a hash value over all of the characters within a certain window of the file. In our case, we will use a window size of three. The window's movement across the first four iterations of our file would look like this:

  1. ['a', '', ''] = [97, 0, 0]
  2. ['a', 'b', ''] = [97, 98, 0]
  3. ['a', 'b', 'c'] = [97, 98, 99]
  4. ['b', 'c', 'd'] = [98, 99, 100]

As you can see, this rolling window continues through the file, adding a new byte each iteration and removing the oldest byte, FIFO style. To generate a hash of this window, we then perform some further series of calculations against the values in the window.
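
Here is one way to sketch that rolling window in Python; the rolling_window_sums name and the simple summation are our own simplification, not the actual spamsum rolling hash:

```python
from collections import deque

def rolling_window_sums(data: bytes, window_size: int = 3):
    """Slide a fixed-size window across the data, yielding the sum of the bytes in the window."""
    window = deque([0] * window_size, maxlen=window_size)
    for byte in data:
        window.append(byte)  # newest byte in, oldest byte out (FIFO)
        yield sum(window)

sums = list(rolling_window_sums(b"abcdefghijklmnopqrstuvwxyz01"))
print(sums[:4])  # [97, 195, 294, 297] -- the four windows shown above
```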

The ssdeep and spamsum algorithms handle this rolling window calculation differently, though the product of the calculation is used in the same manner. We have simplified the calculation to make this process easier to discuss.

For this example, as you likely guessed, we will simply sum the ASCII values within the window. This sum is shown in the first row of the below example. To keep the numbers smaller, we then take our summed ASCII values (S) modulo 8 (S % 8) and use this value to look for our boundaries in the file content; this number is found in the second row of the below image. If S % 8 == 7, we have reached a reset point and can create a summarization of the prior block.

Since our reset point is 7, as previously selected, we will close out a chunk of the file any time our rolling hash calculation returns a 7. This is represented visually in the below image with horizontal lines showing the blocks we've set within the file.

For each block, we calculate our signature character in the same way as before: summing the ASCII integer values of the content within the entire block (as shown in the third row) and applying modulo 64 to get the character for the signature (as seen in the last row).

Please remember that the only relationship between rows 2 and 3 in this example is that row 2 tells us when to set a reset point and calculate the number shown in row 3. These two hashes are algorithmically independent of one another by design. Row 3 is still the summation of the ASCII values of a + b + c + d + e + f, not the summation of our rolling hash output.

This produces the signature VUUD. While much shorter than our fixed-block signature, we now have context triggered hashes.
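
Putting the pieces together, a toy version of this context triggered approach (again, our own simplification built on rolling_window_sums and signature_char, not the actual spamsum algorithm) could look like this:

```python
def toy_ctph(data: bytes, reset_point: int = 7, modulus: int = 8) -> str:
    signature = ""
    block_sum = 0
    for byte, window_sum in zip(data, rolling_window_sums(data)):
        block_sum += byte
        if window_sum % modulus == reset_point:
            signature += signature_char(block_sum)  # summarize the block we just closed
            block_sum = 0                           # start a new block
    if block_sum:
        signature += signature_char(block_sum)      # summarize any trailing bytes
    return signature

print(toy_ctph(b"abcdefghijklmnopqrstuvwxyz01"))  # VUUD
```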

For our final example, let's revisit what happens when we perform the same insertion of the letter h. Using our rolling hash to calculate our context-based blocks (as shown in the first row), we can summarize each block using the same algorithm and generate the signature VUH1D.
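
Running the toy implementation against both versions of the file reproduces these signatures:

```python
# Only the characters around the inserted 'h' differ; the blocks before and
# after the change still hash to the same characters.
print(toy_ctph(b"abcdefghijklmnopqrstuvwxyz01"))   # VUUD
print(toy_ctph(b"abcdefghijklmnhopqrstuvwxyz01"))  # VUH1D
```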

As you can see, this technique is more resilient to insertions and allows us to compare files more accurately than fixed blocks do. In this case, our signatures make the two files look more different than they actually are, but the technique is still an improvement over the fixed-block calculation because it recognizes that the tail of the file is the same between the two versions.

Obviously, this technique requires files larger than 28 bytes to produce accurate results, though hopefully this simplified walkthrough helps explain how these fuzzy hashes are formed.

Additional resources

For more details about how ssdeep and spamsum work, please check out the below resources:

While we primarily focused on ssdeep/spamsum in this post, please check out sdhash and the research that Vassil Roussev has conducted in the field of similarity matching.