Registry hives are among the most artifact-dense files found on the Windows operating system. Their key-value storage of system and software configurations, device usage, and so much more makes this artifact attractive to us in DFIR. Whether we use the key and value names to gather information about connected devices or application usage, we can leverage the last written timestamp to append these events to our timeline of activity.

This timestamp is referenced as the last connected time for USB devices and networks, the time a document was opened in an MRU list, and more. As I am sure others in the field have noticed, some hives on some systems some of the time appear to have unreliable last written timestamps associated with a key. This usually stands out pretty clearly when the timestamp is substantially different from the anticipated activity timeframe, though it requires additional scrutiny to identify our confidence level in the time value.

Investigating registry "stomping"

I have run some testing in the past to attempt to identify what process(es) cause the timestamp "stomping" to occur, without much luck. (Though "stomping" isn't quite the word I'm looking for, we'll use it to talk about this - open to suggestions for a more fitting term.) A few things I noticed are:

  • The stomping is generally present in multiple locations within a hive
  • The last written timestamp is usually stomped for one or more subkeys of the initial parent key
  • Software installation (including Windows updates) seems to play a role
  • This information appears in the registry after reboot (likely caused by shutting down the machine before grabbing hives during my testing)

Note: I am planning on documenting the testing and results in a future post, though I welcome any insights from the community on what you've seen in casework or thoughts on what items should be added to the testing.

To help us identify this stomping activity, I thought it may be useful to rip out all of these timestamps, with a bit of context, and start with 'frequency analysis' to get a sense of what timestamps we should be wary of. There are a few caveats with this technique, but we will address those in a bit.

As some of you may know, one of my forensic tools of choice is Python - so naturally, I built a script. Though the code is ~100 lines with nice formatting and documentation, I figured I'd include it inline below to step through my thought process a bit. I will upload the code file soon and provide a link, though I didn't want to delay the blog post.

Registry timestamp extraction

The design goals are fairly simple:

  1. Open a registry hive;
  2. Iterate through all the keys in the hive;
  3. Grab the last written time and key path (including root);
  4. Write this information to a file where we can easily interact with the data.

So let's dive in! I am using the python-registry==1.0.4 library in Python 3.6.5 (though there are other libraries that will work for our purpose):

import json
import sys
from Registry import Registry

# Path to your registry hive and the output destination
# You can use `sys.argv` or `argparse` to get these details as well
hive_path = ""
output_file = ""
open_output = open(output_file, 'w')

# Quickly test that it is a registry hive using the `regf` magic
with open(hive_path, 'rb') as temp_open:
    if temp_open.read(4) != b'regf':
        print("The provided file is not a registry hive.")
        sys.exit()

# Open up our hive and get the `root()` key object
hive_root = Registry.Registry(hive_path).root()

Now that we have the root key from our open registry hive, let's build a small function to iterate over the keys.

def gather_details(hive_key, file_out):
    """Read key date information and write as JSON to the output file
    before recursively calling on the next key.
    
    We will write 1 JSON object per line for easy parsing.
    """
    # Not the fastest method, but simple enough for our purposes
    file_out.write(
        json.dumps({'last_written': hive_key.timestamp().isoformat(),
                    'key': hive_key.path()})+"\n")
    file_out.flush()
    
    if hive_key.subkeys_number():
        for subkey in hive_key.subkeys():
            gather_details(subkey, file_out)
            
# Now let's call our function to generate the output file
gather_details(hive_root, open_output)
open_output.close()

And that's it! Add in your file paths above (or implement `sys.argv` or `argparse`) and you should be able to generate the JSON-lines output.
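
For reference, here is a minimal sketch of the argparse route (the argument names are my own choice, not part of the script above):

import argparse

parser = argparse.ArgumentParser(
    description="Dump registry key paths and last written times as JSON lines.")
parser.add_argument("hive", help="Path to the registry hive to parse")
parser.add_argument("output", help="Path for the JSON-lines output file")
args = parser.parse_args()

hive_path = args.hive
output_file = args.output

Drop that in near the top of the script in place of the hard-coded paths.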

Making sense of the data

We now have more data...


So the next step is making it useful. Since you're still reading at this point, you're probably also thinking of bringing it into CSV/Excel and making a pivot table. I agree - it is a great way to quickly see which values stand out, though it also leaves more to be desired. If we start with the pivot table today, I'll promise to do something more interesting with it and make a post about that. Deal?

We are now going to flip our JSON lines into CSV. There are a bunch of tools to handle this, though my favorite is jq. The tool is capable of far more than this conversion, but today we will run the below:

jq -r '[.last_written, .key] | @csv' output.json > output.csv

See the jq manual for details on what the tool can do and what the above actually means (the -r flag gives us raw output instead of JSON-encoded strings, which is what we want for a CSV).
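
If you would rather stay in Python, a rough standard-library equivalent (assuming the same output.json and output.csv names) might look like:

import csv
import json

with open("output.json") as json_in, open("output.csv", "w", newline="") as csv_out:
    writer = csv.writer(csv_out)
    writer.writerow(["last_written", "key"])  # Header row for Excel
    for line in json_in:
        record = json.loads(line)
        writer.writerow([record["last_written"], record["key"]])

The header row is optional, but it makes the pivot table setup in Excel a little easier.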

With our CSV, let's make a quick pivot table and sort on the count of keys per last_written time. As shown below, we can quickly identify which timestamps require some extra scrutiny in our registry hive:

[Pivot table: count of registry keys per last_written timestamp]

Only the top 20 values shown here. Cutoff picked by size of my Excel window at the time.

It is not very exciting as a pivot table, and I can already see that, with a larger dataset than shown here, this format will not let me efficiently find the information I need to provide context to my other timestamps. With that in mind, I will work on finding ways to present this information in a more examiner-friendly format. Feel free to send along any ideas!
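
As a quick starting point, the same frequency counts can be pulled straight from the JSON-lines file without Excel; here's a minimal sketch using collections.Counter (again assuming the output.json name from earlier):

import json
from collections import Counter

timestamp_counts = Counter()
with open("output.json") as json_in:
    for line in json_in:
        timestamp_counts[json.loads(line)["last_written"]] += 1

# Show the 20 most frequent last written times, mirroring the pivot table
for timestamp, count in timestamp_counts.most_common(20):
    print(f"{count:>6}  {timestamp}")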

Shortcomings

As promised, let's talk through a few shortcomings with this type of analysis:

  • 'Frequency analysis' is just that - as shown in the prior screenshot, we have a large number of hits for a single date. This timestamp could be system installation, the last major OS update, etc. Just because a timestamp is frequent doesn't mean it is less valuable or inaccurate. This is why context is important.
  • Speaking of context - our CSV pivot table lacks it in a major way. When we see ~8k of the total ~10k keys share the same timestamp, we may think the whole hive has been "stomped". This sets off an unnecessary alarm bell, and we need to work to identify what context can efficiently distinguish between normal and notable. (More on this in another post, where we can play with this data further and group activity by subkey, etc.)
  • More on context - Let's say we do identify that one of these timestamps has stomped on an artifact we are concerned about. Since we (presumably) have the hard drive of the machine, let's pivot out to the rest of the system and see what activity exists at that point in time. While I keep blaming Windows updates (since it is easy to do), let's also consider user actions and activities in our pivot out to the disk.
  • Excel may not be the best option here, as we will want to use very precise timestamps to identify this behavior. Rounding to the nearest second (or any rounding) can cause false positives - see the sketch after this list for matching on the full-precision values.
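
To make that last point concrete, here is a minimal sketch that matches keys on the exact ISO-format string written by our extraction script - the suspect value below is a made-up example, so substitute one from your own output:

import json

# Hypothetical suspect timestamp - copy one verbatim from your output file
suspect = "2018-04-12T01:14:59.125897"

with open("output.json") as json_in:
    for line in json_in:
        record = json.loads(line)
        # Compare the full-precision ISO strings exactly - no rounding involved
        if record["last_written"] == suspect:
            print(record["key"])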

Future work

This is a topic I want to spend many more hours on - working to identify some of the causes of this activity, meaningfully reduce the dataset, and improve the usefulness of the output in our investigations. I will continue to share via this blog and GitHub, and I am open to thoughts/suggestions/questions/collaboration with the community.

Posted: 2018-06-16