Part 2

To follow up on the prior post discussing bash tips & tricks, here are a few more commands that are useful for analyzing text files, log files in particular.

Intro to using sed

In this example, we will use sed to help normalize the data we are processing. sed is a stream editor, useful for manipulating data as it passes through a pipeline; here, we will use it to find and replace text strings.

We have a squid proxy access log and want to extract a unique list of domains from it. Using a similar process as last time, we will iteratively build out our command to further refine our results. To start, let's take a look at the format of the log:

$ head -n 3 access.log
10.53.56.81 - - [26/Nov/2017:14:20:26 +0000] "CONNECT aus5.mozilla.org:443 HTTP/1.1" 200 0 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0" TAG_NONE:HIER_DIRECT
10.53.56.81 - - [26/Nov/2017:14:20:26 +0000] "GET https://aus5.mozilla.org/update/6/Firefox/47.0.2/20161031133903/WINNT_x86-msvc-x64/en-US/release/Windows_NT%2010.0.0.0%20(x64)/SSE3/default/default/update.xml HTTP/1.1" 200 946 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0" TCP_MISS:HIER_DIRECT
10.53.56.81 - - [26/Nov/2017:14:20:27 +0000] "GET http://download.mozilla.org/?product=firefox-56.0-complete-bz2&os=win&lang=en-US HTTP/1.1" 302 585 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0" TCP_MISS:HIER_DIRECT

As seen above, the URL is in the 7th column. Using awk, as we did in the prior post, we can isolate this field by selecting the 7th space-delimited column:

$ head -n 3 access.log | awk '{ print $7 }'
aus5.mozilla.org:443
https://aus5.mozilla.org/update/6/Firefox/47.0.2/20161031133903/WINNT_x86-msvc-x64/en-US/release/Windows_NT%2010.0.0.0%20(x64)/SSE3/default/default/update.xml
http://download.mozilla.org/?product=firefox-56.0-complete-bz2&os=win&lang=en-US
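It may not be obvious why the URL is field 7, since the timestamp and the quoted request both contain spaces and are therefore split across several fields. A minimal sketch that numbers the fields of one sample line makes this visible. The line is the third entry from the excerpt above, truncated after the request for brevity; the variable names are only for illustration.

```shell
# Third log entry from the excerpt above, truncated after the quoted request.
line='10.53.56.81 - - [26/Nov/2017:14:20:27 +0000] "GET http://download.mozilla.org/?product=firefox-56.0-complete-bz2&os=win&lang=en-US HTTP/1.1"'

# Print the first eight whitespace-delimited fields, one per line, prefixed with their index.
numbered_fields=$(printf '%s\n' "$line" | awk '{ for (i = 1; i <= 8; i++) print i ": " $i }')
printf '%s\n' "$numbered_fields"
```

The timestamp occupies fields 4 and 5, the request method (with its leading quote) is field 6, and the URL lands in field 7.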

These first three entries are all the same domain, though we need to clean up our URLs to extract just the domains. To start, let's remove the query string data. Query strings contain parameters passed between pages on a web site and follow the ? character found in the URL. To remove this, let's use awk again, setting the delimiter to ? and displaying the field prior to the character:

$ head -n 3 access.log | awk '{ print $7 }' | awk -F '?' '{ print $1 }'
aus5.mozilla.org:443
https://aus5.mozilla.org/update/6/Firefox/47.0.2/20161031133903/WINNT_x86-msvc-x64/en-US/release/Windows_NT%2010.0.0.0%20(x64)/SSE3/default/default/update.xml
http://download.mozilla.org/

Now that we've addressed query strings, let's remove the protocol specifier. In this log that means http:// and https://, though other logs may contain other protocols such as ftp://. Since we only see HTTP and HTTPS here, we will specify those two explicitly. There are other ways to do this, but let's use sed:

$ head -n 3 access.log | awk '{ print $7 }' | awk -F '?' '{ print $1 }' | sed 's|http\(s\)*://||'
aus5.mozilla.org:443
aus5.mozilla.org/update/6/Firefox/47.0.2/20161031133903/WINNT_x86-msvc-x64/en-US/release/Windows_NT%2010.0.0.0%20(x64)/SSE3/default/default/update.xml
download.mozilla.org/

In the above, we have specified a sed substitution expression. The initial s in the sed command executes the substitute function. The syntax following this function is described below (and further in the man page):

  • | - This character is the delimiter between the sections of the substitute function. The delimiter is usually a / character, but since our pattern itself contains / characters, we would have to escape each of them. It is easier to pick a different delimiter, and | is a common alternative.
  • The first section is the pattern sed will look for. In our case, we are specifying that we want to look for anything that matches the regex http\(s\)*:// which will match both the http:// and https:// protocol specifiers.
  • The second section is empty here because we are using sed to remove the string. In other situations, we could put a replacement string between the last two pipe characters.
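The three parts of the substitute expression can be seen side by side in a short sketch. The sample URLs are assumptions for illustration (the ftp:// entry and the example.org host do not appear in the log), and the third expression is one possible way to strip any protocol rather than only HTTP and HTTPS.

```shell
# Hypothetical sample URLs; only the http and https forms appear in the log above.
urls='http://download.mozilla.org/
https://aus5.mozilla.org/update/
ftp://ftp.example.org/pub/'

# 1. Empty replacement: delete an http:// or https:// prefix (same expression as in the post).
stripped=$(printf '%s\n' "$urls" | sed 's|http\(s\)*://||')

# 2. Non-empty replacement: rewrite http:// to https://, anchored to the start of the line.
upgraded=$(printf '%s\n' "$urls" | sed 's|^http://|https://|')

# 3. Any protocol: a letter followed by letters, digits, +, . or -, then ://
any_scheme=$(printf '%s\n' "$urls" | sed 's|^[A-Za-z][A-Za-z0-9+.-]*://||')

printf '%s\n\n%s\n\n%s\n' "$stripped" "$upgraded" "$any_scheme"
```

The first expression leaves the ftp:// line untouched because the pattern names only http and https; the third removes the prefix from all three lines.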

Our URLs still include paths and port numbers, which we can remove using awk. We will use two awk statements with different delimiters, / for the path and : for the port, and keep only the first field each time:

$ head -n 3 access.log | awk '{ print $7 }' | awk -F '?' '{ print $1 }' | sed 's|http\(s\)*://||' | awk -F '/' '{ print $1 }' | awk -F ':' '{ print $1}'
aus5.mozilla.org
aus5.mozilla.org
download.mozilla.org
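The chain above starts a separate process for each step, which is convenient while building the command up iteratively. Once the steps are settled, they could be folded into a single awk program. The following is only a sketch of that idea, not the approach used in this post: the function name extract_domains is hypothetical, and unlike the sed expression above it anchors the protocol match to the start of the URL.

```shell
# Hypothetical helper: apply the same four clean-up steps to field 7 in one awk call.
extract_domains() {
  awk '{
    url = $7
    sub(/\?.*/, "", url)          # drop the query string
    sub(/^https?:\/\//, "", url)  # drop the http:// or https:// prefix
    sub(/\/.*/, "", url)          # drop the path
    sub(/:.*/, "", url)           # drop the port number
    print url
  }' "$@"
}

# The three entries from the excerpt above, truncated after the status code.
sample='10.53.56.81 - - [26/Nov/2017:14:20:26 +0000] "CONNECT aus5.mozilla.org:443 HTTP/1.1" 200
10.53.56.81 - - [26/Nov/2017:14:20:26 +0000] "GET https://aus5.mozilla.org/update/6/Firefox/47.0.2/20161031133903/WINNT_x86-msvc-x64/en-US/release/Windows_NT%2010.0.0.0%20(x64)/SSE3/default/default/update.xml HTTP/1.1" 200
10.53.56.81 - - [26/Nov/2017:14:20:27 +0000] "GET http://download.mozilla.org/?product=firefox-56.0-complete-bz2&os=win&lang=en-US HTTP/1.1" 302'

domains=$(printf '%s\n' "$sample" | extract_domains)
printf '%s\n' "$domains"
```

The function reads standard input when called without arguments, or a file when given one, for example extract_domains access.log.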

Great! We are down to subdomains, which we can generate stats on using the usort alias we created in the prior post. To demonstrate this, let's replace the head -n 3 statement with cat to process the full file:

$ cat access.log | awk '{ print $7 }' | awk -F '?' '{ print $1 }' | sed 's|http\(s\)*://||' | awk -F '/' '{ print $1 }' | awk -F ':' '{ print $1}' | usort
   1 36cc206a.akstat.io
   1 a.scorecardresearch.com
   1 a.visualrevenue.com
   1 a1.vdna-assets.com
   1 ads.lfstmedia.com
 [...]
 146 cdn.images.express.co.uk
 149 px.moatads.com
 186 pagead2.googlesyndication.com
 205 dt.adsafeprotected.com
 380 www.google.com
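The usort alias is defined in the prior post and is not reproduced here. Judging only from the output above (a count per unique line, sorted in ascending order), it likely behaves like the pipeline below. This is an assumption, and the function name usort_sketch is a hypothetical stand-in rather than the actual alias definition.

```shell
# Hypothetical stand-in for the usort alias (assumed behavior, not its real definition):
# group identical lines, count them, then sort numerically by count.
usort_sketch() {
  sort | uniq -c | sort -n
}

# Three hostnames taken from the output above, with one repeated.
counts=$(printf '%s\n' www.google.com px.moatads.com www.google.com | usort_sketch)
printf '%s\n' "$counts"
```

The first sort is needed because uniq only collapses adjacent duplicate lines.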

Now that we have this list, we can further process this data in a number of ways:

  • removing subdomains to get a better count of the number of domains
  • feeding the domains into an API to gather intelligence (whois, ASN, GeoIP, reputation, etc.)
  • searching for activity specific to one or more of the domains (e.g., what searches were made with google.com)
  • adding back the port numbers and/or protocol specifiers to look into the frequency of those data points in our access log
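As a rough sketch of the first idea, the subdomains could be collapsed by keeping only the last two dot-separated labels of each hostname. This is a naive approach under a stated assumption: that every domain has a single-label suffix such as .com. It gives wrong results for multi-part suffixes such as co.uk and for bare IP addresses, where a lookup against something like the Public Suffix List would be needed instead. The function name base_domains is hypothetical.

```shell
# Naive sketch: keep the last two dot-separated labels of each hostname.
# Caveat: wrong for multi-part suffixes (co.uk) and for IP addresses.
base_domains() {
  awk -F '.' 'NF >= 2 { print $(NF - 1) "." $NF; next } { print }'
}

# Hostnames taken from the output above.
base=$(printf '%s\n' www.google.com pagead2.googlesyndication.com cdn.images.express.co.uk | base_domains)
printf '%s\n' "$base"
```

The last sample shows the limitation: cdn.images.express.co.uk is reduced to co.uk rather than express.co.uk.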