Over the last few days I’ve had to build a few somewhat complex (in my opinion) regular expressions to index certain fields in Splunk. A good friend of mine pointed me to a regular expression testing tool a while ago, and it has proved extremely useful. The tool, RegExr, gives a good overview of examples and special characters, and even offers community-submitted regular expressions for you to use. Most importantly, it lets you test your regular expression against sample text that you supply.
This is a great tool for Splunk. All you have to do is copy an event that contains the custom field you want to capture, paste it into the tool, then work on the regular expression until it captures the data you need. One example of a regular expression that I built is this monster:
(http|https)://(([A-Za-z0-9\.\-]*)?\.)?(?<domain_name>[A-Za-z0-9\-]{3}[A-Za-z0-9\-]*\.([A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2,3})|[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})
I’ll explain what this one does in a second, but you can probably guess just by looking at it. With Splunk I am indexing all data that goes through the HTTP proxy on the firewall. Each event that Splunk indexes from the proxy includes the address that was processed, right down to the file name. However, I’m more interested in pooling all of the events by domain name to get the total number of requests, time spent, etc. per domain. So, I needed a regular expression to extract the domain name; enter the mess of characters from above.
The regular expression above will break down a URL and capture its domain name. For example, the regex will capture example.com from the following URL:
http://www.example.com/some_folder/filename.php?id=34&name=something
This is simple enough to capture on its own, but it became more complex once I considered the following:
- Subdomains (whatever.example.com)
- Odd domain extensions (ab.ca, co.uk)
- IP Address domains (64.75.34.12)
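To see how the full expression behaves on each of these cases outside of Splunk, here is a minimal Python sketch. It assumes one change to the expression: Python's re module spells named groups as (?P<name>...), so (?<domain_name>...) is rewritten as (?P<domain_name>...). The helper name extract_domain and the sample URLs are made up for illustration.

```python
import re

# The expression from this post, with the named group written the way
# Python's re module expects it: (?P<domain_name>...).
DOMAIN_RE = re.compile(
    r"(http|https)://(([A-Za-z0-9\.\-]*)?\.)?"
    r"(?P<domain_name>[A-Za-z0-9\-]{3}[A-Za-z0-9\-]*\."
    r"([A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2,3})"
    r"|[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})"
)


def extract_domain(url):
    """Return the captured domain name or IP address, or None if no match."""
    match = DOMAIN_RE.search(url)
    return match.group("domain_name") if match else None


# Hypothetical sample URLs, one per case in the list above.
samples = [
    "http://www.example.com/some_folder/filename.php?id=34&name=something",
    "http://whatever.example.com/",
    "https://www.example.co.uk/index.html",
    "http://64.75.34.12/status",
]

for url in samples:
    print(extract_domain(url), "<-", url)
```

The first two URLs should both come back as example.com, the third as example.co.uk, and the last as the bare IP address.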
Here is a breakdown of the regular expression mentioned above.
Capture Group 1:
(http|https)://
Pretty straightforward: the URL must begin with http or https. This is an HTTP/S proxy, so I know that it will begin with one of these two values. If you were to use this for an FTP proxy, you could easily add ftp. The http/s is always followed by ://.
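As a quick illustration of this first group, the Python sketch below tests it on its own. The ftp variant is only my guess at what "putting in ftp" would look like, and the names scheme_of and SCHEME_WITH_FTP_RE are made up for the example.

```python
import re

# Capture group 1 from the post: the scheme followed by ://
SCHEME_RE = re.compile(r"(http|https)://")

# Hypothetical variant for an FTP proxy: ftp added as a third alternative.
SCHEME_WITH_FTP_RE = re.compile(r"(http|https|ftp)://")


def scheme_of(url, pattern=SCHEME_RE):
    """Return the scheme captured at the start of url, or None."""
    match = pattern.match(url)
    return match.group(1) if match else None


print(scheme_of("http://example.com/"))
print(scheme_of("https://example.com/"))
print(scheme_of("ftp://example.com/"))                      # not accepted by the original group
print(scheme_of("ftp://example.com/", SCHEME_WITH_FTP_RE))  # accepted by the variant
```

Note that (http|https) still matches https: the engine first tries http, fails on the :// that should follow, and then backtracks to the https alternative.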
Capture Group 2:
([A-Za-z0-9\.\-]*\.)?
This part of the expression captures the subdomain. It looks for any number of characters in the set A-Z, a-z, 0-9, ., and -, followed by a . (dot). The question mark at the end makes this part optional, since not all URLs have subdomains. Now that we have the subdomain, the next thing to process is the domain name.
Capture Group 3 (this one’s a doozy):
([A-Za-z0-9\-]{3}[A-Za-z0-9\-]*\.([A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2,4})|[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})
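One thing worth seeing for yourself is that this group is greedy: left alone, it grabs everything up to the last dot, and it only settles on the real subdomain because the rest of the expression still has to match. The Python sketch below shows both behaviours. The domain pattern appended to it is a simplified stand-in I made up for the demonstration, not the one from this post, and split_host is a hypothetical helper.

```python
import re

# Capture group 2 from the post: an optional subdomain ending in a dot.
SUBDOMAIN = r"([A-Za-z0-9\.\-]*\.)?"

# On its own, the greedy group swallows everything up to the last dot.
greedy_alone = re.match(SUBDOMAIN, "www.example.com").group(1)
print(greedy_alone)

# Followed by a (simplified, assumed) domain pattern, the group has to
# give characters back until the remainder can match.
HOST_RE = re.compile(SUBDOMAIN + r"([A-Za-z0-9\-]+\.[A-Za-z]{2,3})$")


def split_host(host):
    """Return (subdomain, domain); subdomain is None when it is absent."""
    match = HOST_RE.match(host)
    return match.groups() if match else None


print(split_host("www.example.com"))
print(split_host("a.b.example.com"))
print(split_host("example.com"))  # the optional group does not participate
```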
This one is so long because it looks for a domain name, or an IP address. Here is the part that captures a domain name:
([A-Za-z0-9\-]{2}[A-Za-z0-9\-]*\.([A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2,6}))
A domain name must be at least 2 characters; the first part takes care of that: [A-Za-z0-9\-]{2}.
The expression that follows that captures the remaining characters in the domain up until the “.” before the domain extension: [A-Za-z0-9\-]*\.
The final part captures the domain extension. This can be a province/state code followed by a country code (ab.ca or fl.us), which is [A-Za-z]{2}\.[A-Za-z]{2}, or (|) an extension from two characters (e.g. .ca) to six characters (e.g. .museum), which is [A-Za-z]{2,6}.
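Here is a small Python sketch of just the domain name part, using the two-character minimum and the two-to-six character extension described above. I have wrapped the two extension alternatives in their own group so that the | applies only to the extension; the helper name is_domain and the sample values are made up for illustration.

```python
import re

# Domain name part: at least two name characters, a dot, then either a
# two-letter code plus country code (ab.ca) or a 2-6 letter extension.
DOMAIN_PART_RE = re.compile(
    r"[A-Za-z0-9\-]{2}[A-Za-z0-9\-]*\."
    r"([A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2,6})"
)


def is_domain(text):
    """True when the whole string has the shape of a domain name."""
    return DOMAIN_PART_RE.fullmatch(text) is not None


for candidate in ["example.com", "example.museum", "example.ab.ca",
                  "x.com", "example.c"]:
    print(candidate, is_domain(candidate))
```

The last two candidates are rejected: x.com has a one-character name, and .c is too short to be an extension.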
I also wanted to capture the IP address if the http request used an IP instead of a host name. That’s what the last part captures:
[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}
This one is fairly easy as well. An IPv4 address is just a series of 1 to 3 digits, a dot, 1 to 3 digits, a dot, 1 to 3 digits, a dot, then 1 to 3 digits. That is what the above captures.
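Keep in mind that this pattern only checks the shape of the address, not whether each number is 255 or less. That is fine for pulling addresses out of proxy events, but if you ever need a strict check, something like the following Python sketch could be used. It leans on the standard ipaddress module; the helper names are made up for the example.

```python
import ipaddress
import re

# The IP address part from the post: four dot-separated runs of 1-3 digits.
IPV4_SHAPE_RE = re.compile(r"[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}")


def looks_like_ipv4(text):
    """True when the whole string has the digits-and-dots shape."""
    return IPV4_SHAPE_RE.fullmatch(text) is not None


def is_valid_ipv4(text):
    """Stricter check: let the standard library validate each octet."""
    try:
        ipaddress.IPv4Address(text)
        return True
    except ValueError:
        return False


for candidate in ["64.75.34.12", "64.75.34", "999.999.999.999"]:
    print(candidate, looks_like_ipv4(candidate), is_valid_ipv4(candidate))
```

The last candidate passes the regular expression but fails the strict check, which is the trade-off for keeping the expression simple.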
To reiterate, the full regular expression is:
(http|https)://(([A-Za-z0-9\.\-]*)?\.)?(?<domain_name>[A-Za-z0-9\-]{3}[A-Za-z0-9\-]*\.([A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2,3})|[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})
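Since the whole point was to pool events by domain name, here is a rough Python sketch of that idea outside of Splunk. The event lines are invented and do not reflect the real proxy log format, count_by_domain is a hypothetical helper, and the named group is written as (?P<domain_name>...) because that is the spelling Python's re module requires.

```python
import re
from collections import Counter

# Full expression from the post, with Python's named-group spelling.
DOMAIN_RE = re.compile(
    r"(http|https)://(([A-Za-z0-9\.\-]*)?\.)?"
    r"(?P<domain_name>[A-Za-z0-9\-]{3}[A-Za-z0-9\-]*\."
    r"([A-Za-z]{2}\.[A-Za-z]{2}|[A-Za-z]{2,3})"
    r"|[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})"
)


def count_by_domain(lines):
    """Count how many lines mention each captured domain name."""
    counts = Counter()
    for line in lines:
        match = DOMAIN_RE.search(line)
        if match:  # lines without a URL are skipped
            counts[match.group("domain_name")] += 1
    return counts


# Made-up events; a real proxy log will look different.
events = [
    "GET http://www.example.com/index.html 200",
    "GET http://img.example.com/logo.png 200",
    "GET https://www.example.co.uk/ 200",
    "GET http://64.75.34.12/status 200",
    "no url in this line",
]

for domain, total in count_by_domain(events).most_common():
    print(domain, total)
```

Both example.com requests end up in the same bucket even though they came from different subdomains, which is exactly the grouping I was after.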
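You may be wondering what the ?<domain_name> is for. It turns the group into a named capture group, and Splunk uses that name, domain_name, as the name of the extracted field.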
I feel that learning and becoming comfortable with regular expressions is extremely important if you do any kind of programming. At first they may seem a little daunting, but once you get the patterns down they are actually quite simple to write. They allow you to parse almost any kind of data and nearly every programming language has some kind of implementation for them. Using a tool like RegExr is a great way to learn to write regular expressions and also test them out once you get the hang of it. You can also find a large library of regular expressions at regexlib.com.
Regular Expression Links:
- RegExr – Regular Expression Testing Tool
- RegExLib.com – A collection of regular expressions
- An Introduction to Regular Expressions
Have you tried the interactive field extractor in Splunk? You give it examples of values you want to extract and it generates the regex. It works for a lot of common regex cases, and you can provide counterexamples when it gets things wrong. In your example, which is somewhat difficult, it didn’t do a great job: (?i)/www\.(?[^\.]*)(?=\.)
But that same tool allows you to edit the regex right there and fix it up, so you can see how it extracted in your data on hundreds of examples, or tens of thousands if you hit the ‘test’ button.
Hi David,
I did use the field extractor in Splunk 3, but haven’t tried it in Splunk 4 yet. In Splunk 3 it was easier for me to just write the regex because there were instances where I couldn’t get it to pick up a specific value in my Windows 2008 events. I’ll have to try it out in version 4; it will be nice if I can manually modify and test the regex it generates. I agree that the regex in my example is somewhat difficult. I need it to capture IP addresses and ignore prefixes (like www and subdomain names) but still capture domain suffixes (like .com and .ab.ca), so that is where most of the complexity comes in. I’m positive there are other, simpler regexes that will do the same job. As the old saying goes… there’s more than one way to skin a cat.
Thanks for the tip! I’m going to go check out the field extractor.
Thank you so much for this post, I dig the way you explain stuff. Before this, regex was a mystery to me! Looking forward to seeing more tutorials from you!