Beyond “Regular” Regular
Expressions
Cary Petterborg | Splunk Architect | LDS Church
August 8, 2017
Boilerplate
During the course of this presentation, we may make forward-looking statements regarding future events or
the expected performance of the company. We caution you that such statements reflect our current
expectations and estimates based on factors currently known to us and that actual events or results could
differ materially. For important factors that may cause actual results to differ from those contained in our
forward-looking statements, please review our filings with the SEC.
The forward-looking statements made in this presentation are being made as of the time and date of its live
presentation. If reviewed after its live presentation, this presentation may not contain current or accurate
information. We do not assume any obligation to update any forward looking statements we may make. In
addition, any information about our roadmap outlines our general product direction and is subject to change
at any time without notice. It is for informational purposes only and shall not be incorporated into any contract
or other commitment. Splunk undertakes no obligation either to develop the features or functionality
described or to include any such feature or functionality in a future release.
Splunk, Splunk>, Listen to Your Data, The Engine for Machine Data, Splunk Cloud, Splunk Light and SPL are trademarks and registered trademarks of Splunk Inc. in
the United States and other countries. All other brand names, product names, or trademarks belong to their respective owners. © 2017 Splunk Inc. All rights reserved.
Forward-Looking Statements
My Disclaimer
During the course of this presentation, I may make references to my employer, The
Church of Jesus Christ of Latter-day Saints. This should not be taken as an
endorsement of Splunk or Splunk products by the LDS Church.
About Me
▶ Splunk user and administrator for 5.5 years
▶ Monitoring Engineer for 10 years
▶ Web developer for 23 years
▶ Software engineer for 37 years
▶ Many languages from assembly to Ruby
▶ Application development including Flight Sim, DB systems, and Web
▶ Works for the LDS Church in Salt Lake City
▶ Speaker at .conf 201[567]
▶ SplunkTrust Member 2018
Who is Cary Petterborg
Purpose
▶ Help you control your data instead of letting your data control you
▶ Regular expressions give you that control
My purpose here today is to…
Where’s Waldo?
Can you easily pick out all the female Waldos?
Picture courtesy of Albanpix.com
▶ Find the distinctions within similar data
▶ Isolate the value properties
from the noise
▶ Find the needles in the haystacks
▶ Break data into usable, constituent parts
Regular Expressions help you…
▶ Using Regular Expressions since the mid 80’s
▶ Started using regex with lex/yacc/sed/grep for software development
▶ Realized the power of regex quickly
▶ Taught classes on regex
▶ Love working with regex stuff in Splunk and other utilities
▶ Regex is an important skill, and I want to share my knowledge
▶ Have Rex – Will Conquer
Why Do I Like Regular Expressions?
One day, you too…
Splunk and Regular
Expressions
▶ Field extractions
▶ The rex and regex search commands
▶ In props.conf, transforms.conf and other .conf files
▶ Data feeds (probably external to Splunk itself)
▶ Note: Splunk regular expressions are PCRE (Perl Compatible Regular
Expressions) and use the PCRE C library.
Where do you use regular expression in Splunk?
Splunk Field
Extraction Tool
▶ GUI tool in the web UI of Splunk
▶ Simple to use, and you can visually see the results of the regex on the events
▶ Pretty good for a start, but not always good for a final result
▶ Not able to optimize or do anything complex
▶ Makes mistakes if you don’t have regular data
Field Extraction Tool
Example of why you
might want to use
your own regex for
field extraction
Or, how to be smarter than the Field Extraction Tool
Multi-format Security Data in…
This is what the FET gives you:
Simple FET extraction of Port
Intelligent extraction of Port
▶ (?P<name>…) === (?<name>...)
▶ The P is optional (came from Python), but it is usually considered more correct
▶ Splunk FET will use (?P<name>...), so why not make things similar?
BUT
▶ Do it the way you feel most comfortable
Notes on Named Capture Groups
Goal: user from all entries using one regex
There is no user automatically extracted
FET Failed
Gets wrong values from some events
FET Failed After More Lines Added
Trying to add an additional line and extracting user doesn’t work
Let’s look at the generated REGEX
Not pretty, not easy to change/fix, not efficient. So, let’s fix it.
My tool of choice: regex101.com
Add the data and a regex
Refine the regex – better matches, but not all
Refine the regex again – almost there
And FINALLY – we got them all
Four different formats – all four user field types found!
Back to the FET – use our regex
Paste this regex into the field that the ugly, single-format regex was in:
…and the results are MUCH better
How’d He Do That?
Tricks for getting it right
▶ (for invalid user (?P<user>\S+))|(for (?P<user>\S+))
▶ Capture group names must be unique:
One named capture group with a single name
More than one instance of the same name will fail
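The uniqueness rule can be checked outside Splunk. The sketch below uses Python's `re` module, which is not the PCRE library Splunk uses but also rejects a repeated group name. The `SINGLE` pattern is a hypothetical rewrite for illustration, not a regex from the slides.

```python
import re

# The two-alternative pattern from the slide, written with its \S escapes.
DUPLICATE = r"(for invalid user (?P<user>\S+))|(for (?P<user>\S+))"

# Hypothetical rewrite: one group name, with the differing prefix made optional.
SINGLE = r"for (?:invalid user )?(?P<user>\S+)"


def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(compiles(DUPLICATE))  # False: the name 'user' is defined twice
print(compiles(SINGLE))     # True
```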
▶ Start with one format
▶ Try to find similarities and differences between the formats
▶ Add a new format to your data and check your updated regex
▶ Keep a copy of the last one that worked!!
▶ Add additional formats and check ALL matches for ALL examples
How do you eat an elephant?
Bite by bite is better than trying to stuff the whole elephant in your mouth at once
▶ This can be more difficult with simpler regexes
▶ Can be easier for more complex regexes
▶ Combine two of the regexes that are similar
▶ Try to keep the things that are the same in both, making the changes to the
original only where there is a difference
▶ Remember – ONE instance of a name per regex
Alternate: Do one for each format
Then you can try combining
((for ((invalid user )|(user ))?)|(sudo: ))(?P<user>\S+) (from|by)?
▶ Using parentheses for clarity is helpful:
▶ They make it possible to see the separate parts and their relationships with each
other
▶ Don’t overdo the parentheses
Use Parentheses!
Make it easier to come back to later
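As a rough check of the combined regex, the sketch below runs it with Python's `re` module against four sample lines. The sample events are hypothetical stand-ins for the four formats shown on the slides (the original screenshots are not reproduced here), and Python's engine is only an approximation of Splunk's PCRE.

```python
import re

# The combined regex from the slide, written with its \S escape.
USER_RE = re.compile(
    r"((for ((invalid user )|(user ))?)|(sudo: ))(?P<user>\S+) (from|by)?"
)

# Hypothetical events, one per format.
SAMPLES = [
    "Failed password for invalid user admin from 10.1.2.3 port 22 ssh2",
    "Accepted password for bob from 10.1.2.3 port 22 ssh2",
    "pam_unix(sshd:session): session opened for user carol by (uid=0)",
    "sudo: alice : TTY=pts/0 ; USER=root ; COMMAND=/bin/ls",
]


def extract_user(event: str):
    """Return the user field, or None when the regex does not match."""
    match = USER_RE.search(event)
    return match.group("user") if match else None


users = [extract_user(event) for event in SAMPLES]
print(users)  # ['admin', 'bob', 'carol', 'alice']
```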
▶ Sometimes a field regex must be able to match data that hasn’t been seen in the
data yet, so in this case be as general as possible
▶ In the previous example, \S is the best choice: the delimiter is \s (a space
in this case), and you want to catch any potential case that you don’t see in
the data yet.
\S+
▶ If you have a delimiter that you can count on, use something like this to match
the field value (in this case be specific about what it is NOT):
[^,]+
Use the Best Character Class
Use the right tool for the job
Just because your sample data doesn’t have a particular character in it, that
doesn’t mean it never will. Examples:
▶ Usernames – alphanumerics + what?
• Dash – Underscore - Other characters - Are you sure?
▶ Filenames – could you have a space?
• C:\Program Files\My Application
Examples:
▶ Data: "contents of quoted string"
• Use: "(?P<contents>[^"]+)"
▶ Data: User:carypetterborg Dept:ICS
• Use: User:(?P<user>\S+)\s
▶ Data: Salt Lake City, Utah 84117-6403
• Use: ^(?P<city>[^,]+),\s+(?P<state>.+)\s+(?P<zip>[-\d]*)$
If you have a definite delimiter, take advantage
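The three delimiter-based patterns can be tried against the slide's own sample data. This is a minimal sketch using Python's `re` module as a stand-in for Splunk's PCRE engine; the variable names are illustrative only.

```python
import re

# Patterns from the slide, written with their backslash escapes.
QUOTED = re.compile(r'"(?P<contents>[^"]+)"')
USER = re.compile(r"User:(?P<user>\S+)\s")
ADDRESS = re.compile(r"^(?P<city>[^,]+),\s+(?P<state>.+)\s+(?P<zip>[-\d]*)$")

quoted = QUOTED.search('"contents of quoted string"').groupdict()
user = USER.search("User:carypetterborg Dept:ICS").groupdict()
address = ADDRESS.search("Salt Lake City, Utah 84117-6403").groupdict()

print(quoted)   # {'contents': 'contents of quoted string'}
print(user)     # {'user': 'carypetterborg'}
print(address)  # {'city': 'Salt Lake City', 'state': 'Utah', 'zip': '84117-6403'}
```

In the address pattern the city stops at the comma because `[^,]+` cannot cross the delimiter, which is the point of naming what the field is not.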
REX and REGEX
Commands
The most common use for regular expressions is in SPL
with rex and regex
When you:
▶don’t always want to extract the data
▶want to extract data from a field that is already extracted
▶don’t have access to field extractions (permissions, etc.)
▶require doing multiple, disparate regular expressions
▶are in a hurry or you are doing a proof-of-concept
When to Use REX
REX Example
index=voice sourcetype=voice* Description | rex "Description=(?P<description>[^^]+)"
| rex field=description "From (?P<start>.+) to (?P<end>.+?):\s"
Aug 2 20:32:28 l13772 CPCMl13772: %local7-2-ALARM: 16$Description= Number of
AuthenticationFailed events exceeds configured threshold during configured interval of time 1
within 3 minutes on cluster StandAloneCluster. There are 2 AuthenticationFailed events (up to
30) received during the monitoring interval From Wed Aug 03 10:25:00 PHT 2016 to Wed Aug
03 10:28:00 PHT 2016: TimeStamp : 8/3/16 10:26 AM LoginFrom : 172.12.34.40 Interface :
VMREST UserID : JacobMD AppID : Cisco Tomcat ClusterID : NodeID : APPHMANAOVM001
TimeStamp : Wed Aug 03 10:26:13 PHT 2016
TimeStam::Status=2,cleared^Severity=minor^Acknowledged=no^CUSTOMER=Cisco Prime
Collaboration^Private IP Address=172.16.17.24^Default Alarm
Name=AuthenticationFailed^Managed Object=10.160.17.24^Managed Object Type=Unity
Connection^MODE=2;Alarm ID=343815480^Component=10.160.17.24x00000
▶index=voice sourcetype=voice* Description
| rex "Description=(?P<description>[^^]+)"
| rex field=description "From (?P<start>.+) to (?P<end>.+?):\s"
▶SYNTAX:
| rex [field=fieldname] "regex"
▶Also available:
| rex mode=sed
REX Commands
It takes two:
First rex – get the description
Second rex – get the Start and End
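The two-step extraction can be sketched in Python to see what each rex contributes. The event below is abbreviated from the slide's sample (an assumption made to keep the example short), and `extract_window` is a hypothetical helper name. The first pattern plays the role of the first rex on the raw event; the second runs only on the extracted description, like `rex field=description`.

```python
import re

# Abbreviated version of the sample event from the slide.
EVENT = (
    "Aug 2 20:32:28 l13772 CPCMl13772: %local7-2-ALARM: 16$Description= "
    "There are 2 AuthenticationFailed events (up to 30) received during the "
    "monitoring interval From Wed Aug 03 10:25:00 PHT 2016 to Wed Aug 03 "
    "10:28:00 PHT 2016: TimeStamp : 8/3/16 10:26 AM LoginFrom : 172.12.34.40"
    "^Severity=minor^Acknowledged=no"
)

DESCRIPTION_RE = re.compile(r"Description=(?P<description>[^^]+)")
WINDOW_RE = re.compile(r"From (?P<start>.+) to (?P<end>.+?):\s")


def extract_window(event: str) -> dict:
    """Step 1: isolate the description. Step 2: pull start and end from it."""
    description = DESCRIPTION_RE.search(event).group("description")
    return WINDOW_RE.search(description).groupdict()


fields = extract_window(EVENT)
print(fields["start"])  # Wed Aug 03 10:25:00 PHT 2016
print(fields["end"])    # Wed Aug 03 10:28:00 PHT 2016
```

The lazy `.+?` in the end group stops at the first colon that is followed by whitespace, so the colons inside `10:28:00` do not end the match.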
▶To filter out events/data that you don’t want included in the pipeline
▶This is like search on steroids, but doesn’t replace search
▶Only used as a filter
When to use REGEX
Regex example
Only get events with internal addresses
▶Search:
sourcetype=linux_secure | regex "10\.\d+\.\d+\.\d+"
▶Only internal (10.*) IP addresses make it through the regex filter
▶Search produces events, regex then limits those results passed on through the
pipeline by a fancy regular expression
▶Yes, there are other ways to do this, but this is a regex example
Breakdown
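The filtering behavior can be imitated in Python: keep an event only if the pattern matches somewhere in it. The events below are hypothetical, and the list comprehension is only a stand-in for what the regex command does to the pipeline.

```python
import re

# The filter pattern from the slide, written with its backslash escapes.
INTERNAL = re.compile(r"10\.\d+\.\d+\.\d+")

# Hypothetical events.
EVENTS = [
    "Failed password for root from 10.20.30.40 port 22 ssh2",
    "Failed password for root from 192.168.1.5 port 22 ssh2",
    "Accepted password for bob from 10.0.0.7 port 22 ssh2",
]

# Like "| regex": events that do not match are dropped, nothing is extracted.
kept = [event for event in EVENTS if INTERNAL.search(event)]
for event in kept:
    print(event)
```

Because the pattern is unanchored, it would also match inside an address such as 110.20.30.40. A leading `\b` is one possible way to tighten it.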
▶Use rex to extract fields
▶Use regex to limit results
▶Yes, you can use them in the same search:
sourcetype=linux_secure | rex "from (?P<src_ip>\d+\.\d+\.\d+\.\d+)" | regex
src_ip="(?<!10)\.\d+\.\d+\.\d+"
Rex vs Regex
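To see what the lookbehind filter in the combined search keeps, the sketch below applies it to a few hypothetical src_ip values in Python. It only shows what the pattern does; the slides do not spell out the intended result. `STRICT` is a hypothetical alternative added for comparison, not something from the presentation.

```python
import re

# Filter from the slide: a dotted tail that is NOT preceded by the text "10".
NOT_AFTER_10 = re.compile(r"(?<!10)\.\d+\.\d+\.\d+")

# Hypothetical variant: reject only addresses whose first octet is exactly 10.
STRICT = re.compile(r"^(?!10\.)\d+\.\d+\.\d+\.\d+$")

IPS = ["10.1.2.3", "192.168.1.5", "110.4.5.6"]

kept_slide = [ip for ip in IPS if NOT_AFTER_10.search(ip)]
kept_strict = [ip for ip in IPS if STRICT.search(ip)]

print(kept_slide)   # ['192.168.1.5']
print(kept_strict)  # ['192.168.1.5', '110.4.5.6']
```

The lookbehind inspects only the two characters before the first dot, so 110.4.5.6 is dropped along with 10.1.2.3. Whether that matters depends on the data.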
Index Time Regular
Expression Usage
▶ You can’t index Social Security Numbers
▶ How do you distinguish a Social Security Number from other numbers?
▶ Obfuscate ONLY SSNs, but leave other things alone.
The Problem
▶ SSN: 123-45-6789
• \d{3}-\d{2}-\d{4}
▶ Phone #: 800-123-4567
• \d{3}-\d{3}-\d{4}
SSN vs Phone #
Regex distinctions
▶ You could use something simple like:
\d+-\d+-\d+
▶ But it will mistake a phone number for a SSN:
Be as specific in your matches as possible
▶ New regex:
\d+-\d\d-\d+
▶ Gets rid of Phone #’s, but what about other data?
A Better Match
▶ This is exactly what a properly formatted SSN looks like:
\d{3}-\d{2}-\d{4}
▶ This defines a SSN, but it matches other things, too:
We’re now so close
▶ Best definition:
(?<!\d)(?P<ssn>\d{3}-\d{2}-\d{4})(?!\d)
▶ The SSN match can be found anywhere in the event, and only the SSN:
Let’s get REAL specific
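The progression from the loose pattern to the lookaround version can be tabulated. The sketch below uses Python's `re` module with the two sample values from the slides plus two hypothetical ones (a date and a longer digit string) standing in for "other data".

```python
import re

# The four patterns from the slides, in order of increasing specificity.
PATTERNS = {
    "loose": r"\d+-\d+-\d+",
    "two_digit_middle": r"\d+-\d\d-\d+",
    "exact_lengths": r"\d{3}-\d{2}-\d{4}",
    "with_lookarounds": r"(?<!\d)(?P<ssn>\d{3}-\d{2}-\d{4})(?!\d)",
}

SAMPLES = {
    "ssn": "123-45-6789",      # from the slides
    "phone": "800-123-4567",   # from the slides
    "date": "2017-08-08",      # hypothetical "other data"
    "long_id": "9123-45-67890" # hypothetical: SSN-shaped run inside more digits
}

# For each pattern, list the sample labels it matches.
hits = {
    name: [label for label, text in SAMPLES.items() if re.search(pattern, text)]
    for name, pattern in PATTERNS.items()
}

for name, labels in hits.items():
    print(f"{name}: {labels}")
# loose: ['ssn', 'phone', 'date', 'long_id']
# two_digit_middle: ['ssn', 'date', 'long_id']
# exact_lengths: ['ssn', 'long_id']
# with_lookarounds: ['ssn']
```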
So let’s make it work in transforms.conf
▶ Grab the beginning and ending text of the event:
(.*)(?<!\d)(\d{3}-\d{2}-\d{4})(?!\d)(.*)
▶ This regex can’t be done with SEDCMD in props.conf alone
• The regex uses regex features not found in SED format
▶ Using a simple custom sourcetype, but it can be made a general transform
▶ Must capture all parts of the event.
▶ Will obfuscate only one SSN per event.
Index-time conversion
Conditions and Limitations
[testssn]
TRANSFORM-ssn = nossn
DATETIME_CONFIG = CURRENT
NO_BINARY_CHECK = true
SHOULD_LINEMERGE = false
disabled = false
The props.conf
[nossn]
REGEX=(?m)(.*)(?<!\d)(\d{3}-\d{2}-\d{4})(?!\d)(.*)
FORMAT = $1###-##-####$3
DEST_KEY = _raw
The transforms.conf
▶ (?m) – multi-line mode for multi-line events: ^ and $ match at every line boundary
▶ (?<!\d) – Negative Lookbehind – not preceded by a digit – zero-width, consumes no characters
▶ (?!\d) – Negative Lookahead – not followed by a digit – zero-width, consumes no characters
New regex features shown
▶ FORMAT = $1###-##-####$3
▶ $1 and $3 are capture group matches – from (.*) at beginning and end
▶ $2 is not used in the FORMAT, but it’s the capture group for the SSN – from:
(\d{3}-\d{2}-\d{4})
How the FORMAT works
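The REGEX/FORMAT pair can be approximated in Python to see why only one SSN per event is masked. This is a sketch under assumptions: `re.sub` with `count=1` stands in for the way the transform rewrites _raw, `\1` and `\3` stand in for `$1` and `$3`, and the sample events are hypothetical.

```python
import re

# REGEX from transforms.conf, written with its backslash escapes.
NOSSN = re.compile(r"(?m)(.*)(?<!\d)(\d{3}-\d{2}-\d{4})(?!\d)(.*)")

# Python spelling of FORMAT = $1###-##-####$3
FORMAT = r"\1###-##-####\3"


def mask_ssn(raw: str) -> str:
    """Rebuild the event from groups 1 and 3, replacing group 2 with a mask."""
    return NOSSN.sub(FORMAT, raw, count=1)


print(mask_ssn("name=Bob ssn=123-45-6789 phone=800-123-4567"))
# name=Bob ssn=###-##-#### phone=800-123-4567

print(mask_ssn("a=123-45-6789 b=987-65-4321"))
# a=123-45-6789 b=###-##-####
```

In the second event the leading `(.*)` is greedy, so it swallows the first SSN and only the last one lands in group 2. That is the "one SSN per event" limitation in practice.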
After bringing in the data:
It even will obfuscate the data import preview:
Greedy vs. Lazy
Matches
Greedy vs Lazy
▶ Greedy – Grab as much as you can
▶ Lazy – Grab as little as you can
▶ The lazy match will continue only as far as it needs to, no further
• Given <12345><67890>, <.+?> matches only <12345>, while
• <.+> matches the whole <12345><67890>
SYNTAX: place a ? after a * or +
The lazy match only goes to the first instance of a match following a multiple match
What is the difference?
Subtle difference, but big effect
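The difference is easy to reproduce. A minimal sketch with Python's `re` module, using the bracketed sample from the slide:

```python
import re

TEXT = "<12345><67890>"

lazy = re.search(r"<.+?>", TEXT).group()   # stops at the first closing >
greedy = re.search(r"<.+>", TEXT).group()  # runs to the last closing >

print(lazy)    # <12345>
print(greedy)  # <12345><67890>

# Applied repeatedly, the lazy form yields each bracketed item separately.
print(re.findall(r"<.+?>", TEXT))  # ['<12345>', '<67890>']
```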
Greedy Example
Lazy Example
Second Look - Greedy
Second Look - Lazy
▶ Greedy may cross long segments
▶ Lazy may stop prematurely
▶ Try it on various data sets to make sure it will do what you want
Choose Wisely
Embedding
▶ Problem:
• Extract two different fields from the exact same piece of data
• Only want to use one regex – for efficiency if nothing else
• Need both the Domain only and the whole URL from an access log
Multiple field extractions from one piece of data
▶ Data:
• 1501408932.060 16922 108.65.113.83 TCP_REFRESH_HIT/200 474 GET
http://damtare.by.ru/id.txt myuan@buttercupgames.com DIRECT/damtare.by.ru text/html
DEFAULT_CASE-DefaultGroup-Demo_Clients-NONE-NONE-DefaultRouting <IW_scty,-6.9,0,-,-
,-,-,0,-,-,-,-,-,-,-,IW_scty,-> - -
▶ Desired Fields:
• Domain: damtare.by.ru
• URL: http://damtare.by.ru/id.txt
Source and Results
Regex
▶ Slashes need escaping in regex101, but not in Splunk:
(?P<URL>http:\/\/(?P<domain>[^\/]+)\S+)
vs
(?P<URL>http://(?P<domain>[^/]+)\S+)
Regex101.com vs Splunk
(?P<URL>http://(?P<domain>[^/]+)\S+)
^ ^ ^ ^
| |
What? … Oh, now I see.
Results in Splunk
You can’t do this in the FET without doing your own regex!
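The embedded capture group can be tried on the sample access-log line. The sketch below uses Python's `re` module, where (as in Splunk) the forward slashes need no escaping. The event is shortened to the part that matters.

```python
import re

# Outer group captures the whole URL; the inner group captures only the domain.
URL_RE = re.compile(r"(?P<URL>http://(?P<domain>[^/]+)\S+)")

# Shortened sample event from the slide.
EVENT = (
    "1501408932.060 16922 108.65.113.83 TCP_REFRESH_HIT/200 474 GET "
    "http://damtare.by.ru/id.txt myuan@buttercupgames.com "
    "DIRECT/damtare.by.ru text/html"
)

fields = URL_RE.search(EVENT).groupdict()
print(fields["URL"])     # http://damtare.by.ru/id.txt
print(fields["domain"])  # damtare.by.ru
```

One pass over the event yields both fields because the domain group sits inside the URL group rather than beside it.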
Other Notes
▶ Some complex field extractions can be costly
▶ Some complex regular expressions can be costly
▶ Use the Job Inspector to see if there is a difference in doing one complex field
extraction vs many simple field extractions ( | rex | vs | rex | rex | rex )
▶ Sometimes the readability is more important than the performance
Performance Considerations
▶ The complex field extractions (for example, one that extracts 6 fields at once)
may be easier to maintain than multiple simple extractions (where you would
have 6 different fields extracted by 6 different regexes)
▶ Your own field extractions will probably be easier to maintain than those created
by the Field Extraction Tool – just write your own regexes better than the FET
▶ Regexes can save you a lot of headaches compared to using non-regex
field extractions (one user was trying to extract data using 200 non-regex evals
compared to 6 regexes that accomplished the same thing!)
Maintenance Considerations
Tools
Regex101 Web Page
http://regex101.com
Regexr Web Page
http://regexr.com - Doesn’t do PCRE!!
▶https://www.loggly.com/blog/five-invaluable-techniques-to-improve-regex-performance/
Improve Regex Performance
▶ Regex Golf
• https://alf.nu/RegexGolf
• https://www.oreilly.com/learning/regex-golf-with-peter-norvig
▶ Regex Crosswords
• https://regexcrossword.com
• https://mariolurig.com/crossword/
Just for fun:
Try your regex prowess
Learn from others – ask questions – get answers
▶ http://answers.splunk.com/
Splunk Documentation
▶ https://docs.splunk.com/Documentation/Splunk/6.4.3/Knowledge/AboutSplunkregularexpressions
▶ Splunk regex Slack channel
Splunk Answers and Docs
▶ Lisa Guinn
• Inspiration and guidance in preparing
• Data set to use in examples
Acknowledgements
Questions?
