Getting Started with Machine Learning for Incident Detection
August 2016 | Target. Hunt. Disrupt.
Chris McCubbin, Director of Data Science, Sqrrl
David J. Bianco, Security Technologist, Sqrrl
© 2016 Sqrrl Data, Inc. All rights reserved.

A story we all know: Regular expressions

“Good theory leads to good programs”
Who here has implemented and optimized a Nondeterministic Finite Automata compiler?
You probably use one every day
Regex: Grep, perl
You don’t care how it works inside
But you might need to know some quirks
Regex can’t count (google up “regex HTML” on stackoverflow)
Grep has no ‘bad cases’
Perl is more powerful (lazy, backreferences)
But it is helpful to know what it’s good for, how to use it, etc.
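
To make the "lazy" and "backreferences" quirks concrete, here is a minimal sketch using Python's standard `re` module (Python rather than Perl is an assumption made to match the rest of the deck; the sample strings are made up).

```python
import re

html = "<b>one</b> and <b>two</b>"

# Greedy: .* grabs as much as possible, so the match spans both tags.
greedy = re.findall(r"<b>(.*)</b>", html)

# Lazy: .*? stops at the first closing tag it can.
lazy = re.findall(r"<b>(.*?)</b>", html)

# Backreference: \1 must repeat whatever group 1 matched (a doubled word).
doubled = re.search(r"\b(\w+) \1\b", "review the the logs")

print(greedy)            # ['one</b> and <b>two']
print(lazy)              # ['one', 'two']
print(doubled.group(1))  # the
```

Neither feature is something you need to implement yourself, but knowing they exist changes which patterns you reach for.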
  
  
Agenda

What is Machine Learning (ML) good at?
How does ML work? What are the quirks of useful Machine Learning techniques?
Can I use Machine Learning easily?
How can you customize & improve our examples?
  
	
  
  
When’s the last time you heard…?
“It’s a Best Practice to review your logs every day.”
  
  
Machine-Assisted Analysis
Practical Cyborgism for Security Operations

COMPUTERS
●  Bad at context and understanding
●  Good at repetition and drudgery
●  Algorithms work cheap!

PEOPLE
●  Contextual analysis experts who love patterns
●  Possess curiosity & intuition
●  Business knowledge

EMPOWERED ANALYSTS
●  Good results from massive amounts of data
●  Agile investigations
●  Quickly turn questions into insight
  
  
Problem Statement: HTTP Proxy Logs
  
  
Our solution: Clearcut!
  
  
Two different types of machine learning

Supervised
  Have labeled training data?
  Classification algorithms
  Random Forests
Unsupervised
  No labeled training data
  Assume attacks are rare
  Outlier Detection
  Isolation Forests
  Clustering
  
  
Supervised: Binary Classification

Given a population of two types of “things”, can I find a function that separates them into two classes?

Maybe it’s a line, maybe it’s not.

Nothing’s perfect, but how close can we get?

If we derive a function that does reasonably well at separating the two classes, that’s our binary classifier!

Fortunately, Python has pantsloads of libraries that can do this for us. The machine can learn the function given enough samples of each class.
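
As a rough illustration of "learning the function" from samples, here is a minimal nearest-centroid sketch in plain Python. It is not the method the deck uses (that is Random Forests), and the two features and all sample values are made up. In this case the learned boundary happens to be a line.

```python
def centroid(points):
    """Mean of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def train(class0, class1):
    """'Learn' a separating function from labeled samples of each class."""
    c0, c1 = centroid(class0), centroid(class1)

    def classify(x):
        # Assign x to whichever class centroid is closer.
        d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
        d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
        return 0 if d0 <= d1 else 1

    return classify

# Toy, made-up samples: (user-agent length, dots in domain name)
normal = [(110, 1), (95, 2), (120, 1), (105, 2)]
malicious = [(12, 5), (20, 4), (8, 6), (15, 5)]

classify = train(normal, malicious)
print(classify((100, 1)))  # 0: close to the normal samples
print(classify((10, 5)))   # 1: close to the malicious samples
```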
  
  
Classification With Random Forests

1.  Identify positive and negative sample datasets
2.  Clean & normalize the data
3.  Partition the data into training & testing datasets
4.  Select & compute some interesting features
5.  Train a model
6.  Test the model
7.  Evaluate the results
  
8.  .	
  
  
Generating synthetic abnormal data

Perhaps we don’t have any malware data, but we have normal data.

If we could make some synthetic abnormal data, we could still use the same methods

One-class classification

How should we create the data?

One option: ‘Noise-contrastive estimation’: Generate noise data that looks real-ish, but has no real structure and contrast that to the normal data
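
One simple way to get noise that "looks real-ish but has no real structure" is to draw every column independently from the values seen in the normal data. The sketch below assumes that approach; it is not necessarily how Clearcut generates its noise, and the records are made up.

```python
import random

def make_noise(rows, n, seed=0):
    """Build n synthetic rows by drawing each column independently.

    Every value really occurs in that column, so rows look real-ish,
    but the relationships between columns are destroyed.
    """
    rng = random.Random(seed)
    columns = list(zip(*rows))  # transpose: one tuple per column
    return [tuple(rng.choice(col) for col in columns) for _ in range(n)]

# Toy "normal" records: (method, status code, TLD). Made up for illustration.
normal = [
    ("GET", 200, "com"),
    ("GET", 304, "com"),
    ("POST", 200, "org"),
    ("HEAD", 404, "net"),
]

noise = make_noise(normal, n=6)

# Label normal rows 0 and noise rows 1, then train any binary classifier.
labeled = [(row, 0) for row in normal] + [(row, 1) for row in noise]
for row, label in labeled:
    print(label, row)
```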
  
  
Decision Trees

Greedily grow tree by choosing feature that explains the class the most

Split the training set into two sets, repeat

Form a classifier by “walking down the tree”

Issue: overfitting
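
The greedy step can be sketched as: try every feature and threshold, and keep the split that leaves the two halves purest. The sketch below uses Gini impurity as the purity measure, which is one common choice and an assumption here; the data is made up.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means perfectly pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows, labels):
    """Return the (feature index, threshold) with the lowest weighted impurity."""
    best = (None, None, float("inf"))
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # not a real split
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best[2]:
                best = (f, t, score)
    return best[0], best[1]

# Toy rows: (user-agent length, dots in domain name); label 1 = malicious.
rows = [(110, 1), (95, 2), (120, 1), (12, 5), (20, 4), (8, 6)]
labels = [0, 0, 0, 1, 1, 1]

feature, threshold = best_split(rows, labels)
print(feature, threshold)  # 0 20
```

A full tree repeats this on each half until the leaves are pure, which is exactly why a single tree tends to overfit.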
  
  
Random Forests

Sample training set with replacement

Fit a decision tree to the sample

Repeat n times

Form a classifier by averaging the n decision trees

http://www.rhaensch.de/vrf.html
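
The recipe above can be sketched in a few lines. To keep the example short, each "tree" here is a hypothetical one-threshold stump on a single made-up feature, and the trees are combined by majority vote, which is one way of averaging their answers.

```python
import random
from collections import Counter

def fit_stump(values, labels):
    """Tiny stand-in for a decision tree: threshold halfway between class means."""
    mean0 = sum(v for v, y in zip(values, labels) if y == 0) / labels.count(0)
    mean1 = sum(v for v, y in zip(values, labels) if y == 1) / labels.count(1)
    threshold = (mean0 + mean1) / 2
    high_class = 1 if mean1 > mean0 else 0
    return lambda x: high_class if x > threshold else 1 - high_class

def fit_forest(values, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    trees = []
    while len(trees) < n_trees:
        # Sample the training set with replacement (a bootstrap sample).
        idx = [rng.randrange(len(values)) for _ in values]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # this toy stump needs both classes in the sample
        trees.append(fit_stump([values[i] for i in idx], ys))

    def predict(x):
        votes = Counter(tree(x) for tree in trees)
        return votes.most_common(1)[0][0]

    return predict

# Toy feature: user-agent length; label 1 = malicious. Made-up values.
values = [110, 95, 120, 105, 12, 20, 8, 15]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

predict = fit_forest(values, labels)
print(predict(100), predict(10))  # 0 1
```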
  
  
Unsupervised: Outlier Detection

Given a population of “things”, can I find a function that tells me which ones look weird?

Can also pretend to be a classifier (class 0 = normal, class 1 = weird)

Loads of ways to accomplish this: distance to your neighbors, angle-based methods, isolation-based methods
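
As an example of the "distance to your neighbors" idea, the sketch below scores each point by its mean distance to its k nearest neighbors. The points and the choice of k are made up; `math.dist` needs Python 3.8 or later.

```python
import math

def knn_scores(points, k=2):
    """Outlier score = mean distance to the k nearest other points."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0), (8.0, 9.0)]
scores = knn_scores(points)

# The point with the highest score is the "weirdest".
weirdest = max(range(len(points)), key=scores.__getitem__)
print(points[weirdest])  # (8.0, 9.0)
```

Thresholding the score turns this into the pretend classifier: below the threshold is class 0, above it is class 1.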
  
	
  
	
  
	
  
  
Isolation Forests [Liu, Ting, Zhou]
http://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf

Pick a dimension at random. Pick a value at random.

Make a tree by splitting the set into two sets, repeat. Stop when the set is a single point.

Do this for many trees.

Form an outlier detector by the average depth that a point is isolated in each tree (deeper is more inlier-y)

Issue: enumerated types
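
The procedure above fits in a short pure-Python sketch. It is simplified on purpose: instead of building whole trees it only follows the branch containing the point of interest, and it skips the subsampling used by the published algorithm. The data is made up.

```python
import random

def path_length(point, data, rng, depth=0):
    """Depth at which `point` (one of the rows in `data`) ends up alone."""
    if len(data) <= 1:
        return depth
    # Only dimensions where the remaining rows still differ can be split.
    dims = [d for d in range(len(point))
            if min(r[d] for r in data) < max(r[d] for r in data)]
    if not dims:
        return depth
    dim = rng.choice(dims)                          # pick a dimension at random
    lo = min(r[dim] for r in data)
    hi = max(r[dim] for r in data)
    cut = rng.uniform(lo, hi)                       # pick a value at random
    # Keep only the rows that fall on the same side of the cut as the point.
    same_side = [r for r in data if (r[dim] < cut) == (point[dim] < cut)]
    return path_length(point, same_side, rng, depth + 1)

def average_depth(point, data, n_trees=100, seed=0):
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees

data = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0), (1.0, 1.2), (8.0, 9.0)]
inlier_depth = average_depth(data[0], data)
outlier_depth = average_depth(data[-1], data)
print(inlier_depth > outlier_depth)  # True: the far-away point is isolated sooner
```

The random cut needs an ordered numeric range, which hints at why enumerated (categorical) columns are awkward for this method.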
  
  
A quick note about parameters

Choosing parameters can be important

Can use expert knowledge or ad-hoc methods

Dimitar Karev (MIT RSI Intern) tested a range of parameters for Clearcut iforests using exhaustive search (for forest params) and a genetic algorithm (for features)

Result was a huge improvement in F1 (see ROC curves)
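
Exhaustive search just means scoring every combination of candidate values and keeping the best. The sketch below shows the shape of it; the parameter names, the grid, and the scoring function are all hypothetical stand-ins, not the ones used in the Clearcut experiments.

```python
from itertools import product

def exhaustive_search(grid, score):
    """Try every combination in `grid`; return the best (params, score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        s = score(params)
        if s > best_score:
            best_params, best_score = params, s
    return best_params, best_score

# Hypothetical grid; the parameter names are illustrative only.
grid = {"n_trees": [50, 100, 200], "sample_size": [64, 128, 256]}

# Stand-in for "train with these params and measure F1 on held-out data".
def toy_score(params):
    return -abs(params["n_trees"] - 100) - abs(params["sample_size"] - 128)

best_params, best_score = exhaustive_search(grid, toy_score)
print(best_params)  # {'n_trees': 100, 'sample_size': 128}
```

The cost grows with the product of the grid sizes, which is one reason to switch to something like a genetic algorithm when the search space is large.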
  
  
Classification With Isolation Forests

1.  Identify positive and negative sample datasets
2.  Clean & normalize the data
3.  Partition the data into training & testing datasets
4.  Select & compute some interesting features
5.  Train a model
6.  Test the model
7.  Evaluate the results
  
8.  	
  	
  
9.  Notice similarities
  
  
The beauty of scikit-learn & python

Gists to perform many types of learning are simple and consistent
  Take same data as input (supervised requires an extra column)
  Signatures of methods are the same
Example: RF’s vs iForests
  Changed a few lines of code for training
  Classes are a bit different (0/1 vs 1/-1)
  Can re-use the analysis script with nearly no change
  
	
  
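from sklearn.ensemble import IsolationForest, RandomForestClassifier
import pandas as pd

# RF: supervised, so fit() takes the features plus the class labels
clf = RandomForestClassifier(n_jobs=4, n_estimators=opts.numtrees, oob_score=True)
y, _ = pd.factorize(train['class'])
clf.fit(train.drop('class', axis=1), y)
test['prediction'] = clf.predict(testnoclass)

# iF: unsupervised, so fit() takes only the features
clf = IsolationForest(n_estimators=opts.numtrees)
clf.fit(train.drop('class', axis=1))
test['prediction'] = clf.predict(testnoclass)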
  
  
Identifying Training & Test Data

[Diagram labels: Malicious Data; Label = normal; All Labeled Data; Training Data; Test Data]
  
  
Feature extraction

Many classifiers want to work with numeric features.
We use a ‘flow enhancing’ step to add some convenience columns to the data

Some columns are already numeric

Some columns have easy-to-extract numeric info: number of dots in URL, entropy in TLD

Categorical columns can be converted to “Bag of words” (BOW): N binary features, one for each category

Text-y columns can be converted to BOW or Bag-of-Ngrams (BON)

Use TF-IDF to determine which features to keep
  
The quick brown fox….
The q ck br
The q daofj wrgwg ck br wrgwr gwrgg
1 0 0 1 0 0
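
A few of these features are easy to sketch in plain Python: counting dots, character entropy, and a bag of character n-grams turned into binary features. The helper names, the n-gram length, and the vocabulary below are illustrative assumptions, not Clearcut's actual featurizer.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Bits per character, from the character frequencies in `text`."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def char_ngrams(text, n):
    """All overlapping substrings of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def bag_of_ngrams(text, vocabulary, n):
    """One binary feature per vocabulary entry: 1 if that n-gram occurs."""
    present = set(char_ngrams(text, n))
    return [1 if gram in present else 0 for gram in vocabulary]

domain = "download.virtualbox.org"
print(domain.count("."))        # 2
print(shannon_entropy("aabb"))  # 1.0

vocabulary = ["virt", "box.", "evil"]  # hypothetical kept n-grams
print(bag_of_ngrams(domain, vocabulary, n=4))  # [1, 1, 0]
```

In practice the vocabulary is built from the training data, and a weighting such as TF-IDF decides which n-grams are worth keeping.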
  
Training, Testing & Evaluating a Model
  
% ./train_flows_rf.py -o data/http-malware.log http-training.log
Reading normal training data
Reading malicious training data
Building Vectorizers
Training
Predicting (class 0 is normal, class 1 is malicious)
class  prediction
0      0             12428
       1                15
1      0                19
       1              9563
dtype: int64
F1 = 0.998225469729
Read the Bro data files into a Pandas data frame.

Each row is labeled either ‘benign’ or ‘malicious’.
  
Random Forest requires numeric data, so we have to convert strings.

Primarily two methods:
●  Bag of Words (method, status code)
●  Bag of N-Grams (domain, user agent)
  
Split all the labeled data into ‘training’ (80%) and ‘test’ (20%) datasets.

Now feed all the training data through the Random Forest to produce a trained model.

At this point, we do nothing with the test data.
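
The 80/20 split itself is only a shuffle and a cut. A minimal sketch, assuming the labeled rows are held in a plain Python list (the function name and the toy rows are made up):

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of `rows`, then cut it into (training, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy labeled rows: (record id, class label)
rows = [("row%d" % i, i % 2) for i in range(10)]

train, test = split_train_test(rows)
print(len(train), len(test))  # 8 2
```

Shuffling first matters: log files are usually ordered by time, and cutting them unshuffled would test on a different period than the one used for training.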
  
Now we run the ‘test’ data through the trained model. It’s still labeled, so we know what the answer should be.

We compare the expected results with the actual prediction and create a little table.

We don’t expect perfect results, but we’d like to see most of the data in the 0/0 and 1/1 rows.
  
It’s hard to compare two tables to see how different models compare (due to different datasets or feature choices).

The F1 value is a useful single-number measure for comparison, combining precision and recall (both computed from the TP, FP and FN counts).

Anything over about 0.9 is considered good, but beware very high values (“overfitting”)!
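
The F1 value in the training output can be reproduced from the class/prediction table, treating class 1 (malicious) as the positive class. A short sketch of the arithmetic:

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Counts from the table: class 1 predicted 1, class 0 predicted 1, class 1 predicted 0.
tp, fp, fn = 9563, 15, 19
f1 = f1_score(tp, fp, fn)
print(round(f1, 6))  # 0.998225
```

The 12428 true negatives never enter the formula, so F1 is not inflated by a large pile of correctly ignored normal traffic.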
  
  
Bonus: Most Influential Features with ‘-v’
  
Feature ranking:
1. feature user_agent.mac os (0.047058)
2. feature user_agent. os x 1 (0.044084)
3. feature user_agent.; intel (0.042387)
4. feature user_agent.ac os x (0.037192)
5. feature user_agent.os x 10 (0.031616)
[...]
46. feature userAgentEntropy (0.009144)
47. feature subdomainEntropy (0.007699)
48. feature browser_string.browser (0.007263)
49. feature response_body_len (0.006410)
50. feature request_body_len (0.005506)
51. feature domainNameDots (0.005054)
  
Analyzing Log Files

[Callout: Percentage of original file left to review (the 0.17% figure in the output).]

% ./analyze_flows.py http-production-2016-05-02.log
Loading HTTP data
Loading trained model
Calculating features
Analyzing
detected 298 anomalies out of 180520 total rows (0.17%)
-----------------------------------------
line 2393
Co7qtw35sGLX6RiG79,80,HEAD,download.virtualbox.org,/virtualbox/5.0.20/
Oracle_VM_VirtualBox_Extension_Pack-5.0.20.vbox-extpack,-,Mozilla/5.0 (AgnosticOS;
Blend) IPRT/64.42,0,0,200,80,Unknown Browser,,,download,virtualbox
-----------------------------------------
line 2394
ChpL1u2Ia64utWrd9j,80,GET,download.virtualbox.org,/virtualbox/5.0.20/
Oracle_VM_VirtualBox_Extension_Pack-5.0.20.vbox-extpack,-,Mozilla/5.0 (AgnosticOS;
Blend) IPRT/64.42,0,16421439,200,80,Unknown Browser,,,download,virtualbox
  
Bonus: Classifier Explanations with ‘-v’
  
line 431
C9WQArVvgv1BjvJG7,80,GET,apt.spideroak.com,/spideroak_one_rpm/stable/
repodata/repomd.xml,-,PackageKit-hawkey,0,2969,200,80,Unknown
Browser,,,apt,spideroak
Top feature contributions to class 1:
userAgentLength 0.0831734141875
response_body_len 0.0719766424091
domainNameLength 0.056790435921
user_agent.mac os 0.0272829846513
user_agent. os x 1 0.0252803447682
user_agent.os x 10 0.0251306287983
user_agent.ac os x 0.0244848247673
user_agent.; intel 0.0241743906069
user_agent. intel 0.0236921809876
tld.apple 0.020090459858
  
Ideas for improvement

More diverse malware samples

Better filtering for connectivity checks in the malware data

Incrementally retraining the forest (‘warm start’)

Log type “plugins”

K-class classifier
  
	
  
	
  
	
  
  
Adapting to other log sources

Change log input: clearcut_utils.load_brofile
  Import your data into a pandas data frame

Change flow enhancer: flowenhancer.enhance_flow
  Add any columns that might make featurizing easier

Change feature generator: featurizer.build_vectorizers
  Make any BOW and BON vectorizers that you want
  Use featurizers to make BOW/BON features
  Add any other features you think might be important

http://www.orwellloghomes.com/greybg.jpg
  
  
Takeaways

Pandas and scikit-learn are highly active python projects that are bringing data science and machine learning tools to the masses
Security technologists can (should?) leverage these tools as black or grey boxes
Today, implementing ‘standard’ ML algorithms is not the long pole in the tent
Snag Clearcut for an example
  
	
  
  
The Sqrrl Threat Hunting Platform

[Diagram labels: SECURITY DATA; NETWORK DATA; ENDPOINT/IDENTITY DATA; Firewall / IDS; Threat Intel; Processes; HR; Bro; SIEM Alerts; Netflow; Proxy; Authentication]

How To Learn More?

Go to sqrrl.com to…
  Download Sqrrl’s Threat Hunting eBook
  Download the Sqrrl White Paper on Threat Hunting Platforms
  Request a Sqrrl Test Drive VM
  Download Sqrrl’s Product Paper
  Reach out to us at info@sqrrl.com
  
  
More Info

Chris McCubbin
Director of Data Science
@_SecretStache_
chris@sqrrl.com

David J. Bianco
Security Technologist
@DavidJBianco
dbianco@sqrrl.com

Clearcut
Machine Learning for Log Review
https://github.com/DavidJBianco/Clearcut
(iforest branch for iforests)
  


Machine Learning for Incident Detection: Getting Started

  • 1. Getting Started with Machine Learning for Incident Detection August 2016 | Target. Hunt. Disrupt. Chris McCubbin, Director of Data Science, Sqrrl David J. Bianco, Security Technologist, Sqrrl
  • 2. ©  2016  Sqrrl  Data,  Inc.  All  rights  reserved.     2   A  story  we  all  know:  Regular  expressions     “Good  theory  leads  to  good  programs”     Who  here  has  implemented  and  optimized  a  Nondeterministic   Finite  Automata  compiler?     You  probably  use  one  every  day     Regex:  Grep,  perl     You  don’t  care  how  it  works  inside     But  you  might  need  to  know  some  quirks     Regex  can’t  count  (google  up  “regex  HTML”  on  stackoverflow)   Grep  has  no  ‘bad  cases’     Perl  is  more  powerful  (lazy,  backreferences)     But  it  is  helpful  to  know  what  it’s  good  for,  how  to  use  it,  etc.  
  • 3. ©  2016  Sqrrl  Data,  Inc.  All  rights  reserved.     3   Agenda     What  is  Machine  Learning  (ML)  good  at?   How  does  ML  work?  What  are  the  quirks  of  useful  Machine  Learning  techniques?     Can  I  use  Machine  Learning  easily?     How  can  you  customize  &  improve  our  examples?    
  • 4. ©  2016  Sqrrl  Data,  Inc.  All  rights  reserved.     4   When’s  the  last  time  you  heard…?   “It’s  a  Best  Practice  to  review  your  logs  every   day.”  
  • 5. ©  2016  Sqrrl  Data,  Inc.  All  rights  reserved.     5   Machine-­‐Assisted  Analysis   Practical  Cyborgism  for  Security  Operations   ●  Bad  at  context  and   understanding   ●  Good  at  repetition   and  drudgery   ●  Algorithms  work   cheap!   ●  Contextual  analysis   experts  who  love   patterns   ●  Possess  curiosity  &   intuition   ●  Business  knowledge   ●  Good  results  from   massive  amounts  of   data   ●  Agile  investigations   ●  Quickly  turn   questions  into  insight   COMPUTERS   EMPOWERED   ANALYSTS   PEOPLE  
  • 6. ©  2016  Sqrrl  Data,  Inc.  All  rights  reserved.     6   Problem  Statement:  HTTP  Proxy  Logs  
  • 7. ©  2016  Sqrrl  Data,  Inc.  All  rights  reserved.     7   Our  solution:  Clearcut!  
  • 8. ©  2016  Sqrrl  Data,  Inc.  All  rights  reserved.     8   Two  different  types  of  machine  learning     Supervised     Have  labeled  training  data?     Classification  algorithms     Random  Forests     Unsupervised     No  labeled  training  data     Assume  attacks  are  rare     Outlier  Detection     Isolation  Forests     Clustering  
  • 9. ©  2016  Sqrrl  Data,  Inc.  All  rights  reserved.     9   Supervised:  Binary  Classification   Given  a  population  of  two  types  of  “things”,  can  I  find  a   function  that  separates  them  into  two  classes?     Maybe  it’s  a  line,  maybe  it’s  not.     Nothing’s  perfect,  but  how  close  can  we  get?     If  we  derive  a  function  that  does  reasonably  well  at   separating  the  two  classes,  that’s  our  binary  classifier!     Fortunately,  Python  has  pantsloads  of  libraries  that  can   do  this  for  us.    The  machine  can  learn  the  function   given  enough  samples  of  each  class.  
  • 10. ©  2016  Sqrrl  Data,  Inc.  All  rights  reserved.     10   Classification  With  Random  Forests   1.  Identify  positive  and  negative  sample  datasets   2.  Clean  &  normalize  the  data   3.  Partition  the  data  into  training  &  testing  datasets   4.  Select  &  compute  some  interesting  features   5.  Train  a  model   6.  Test  the  model   7.  Evaluate  the  results   8.  .  
  • 11. Generating synthetic abnormal data
      ● Perhaps we don’t have any malware data, but we have normal data.
      ● If we could make some synthetic abnormal data, we could still use the same methods.
      ● One-class classification
      ● How should we create the data?
      ● One option: ‘Noise-contrastive estimation’. Generate noise data that looks real-ish but has no real structure, and contrast that with the normal data.
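One simple way to get noise that "looks real-ish but has no real structure" is to sample every column independently from the normal data: each column keeps its own value distribution, but the combinations across columns are random. The sketch below illustrates that idea only; the slides do not say how Clearcut generates its noise, and `make_noise_rows` and the sample rows are hypothetical.

```python
import random

def make_noise_rows(normal_rows, n, seed=0):
    """Build n synthetic rows by sampling every column independently.

    Each column keeps the values seen in the normal data, but the
    relationships between columns are destroyed.
    """
    rng = random.Random(seed)
    columns = list(zip(*normal_rows))  # transpose rows into columns
    return [tuple(rng.choice(col) for col in columns) for _ in range(n)]

# Hypothetical (method, status, host) rows standing in for proxy-log records.
normal = [
    ("GET", 200, "example.com"),
    ("GET", 200, "example.org"),
    ("POST", 302, "login.example.com"),
    ("HEAD", 404, "cdn.example.net"),
]
noise = make_noise_rows(normal, n=6)

# Label normal rows 0 and noise rows 1 so a binary classifier can be trained.
labeled = [(row, 0) for row in normal] + [(row, 1) for row in noise]
```

Some noise rows may happen to coincide with real rows; with enough columns this becomes rare, but it is one reason to treat this as a baseline rather than a finished generator.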
  • 12. Decision Trees
      ● Greedily grow the tree by choosing the feature that explains the class the most
      ● Split the training set into two sets, repeat
      ● Form a classifier by “walking down the tree”
      ● Issue: overfitting
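The greedy step can be sketched as a search over every feature and threshold for the split that leaves the purest two subsets. The example uses Gini impurity as the measure of how well a split "explains the class"; that choice, the function names, and the toy data are assumptions for illustration, not something the slides specify.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means perfectly pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows, labels):
    """Return the (feature_index, threshold) with the lowest weighted impurity."""
    best, best_score = None, float("inf")
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # a split must put something on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best_score, best = score, (f, t)
    return best

# Toy data: feature 0 is uninformative, feature 1 separates the classes.
rows = [(1, 10), (2, 80), (3, 20), (4, 90)]
labels = [0, 1, 0, 1]
split = best_split(rows, labels)  # (1, 20): split on feature 1 at <= 20
```

A full tree applies `best_split` recursively to each side until the subsets are pure, which is exactly where the overfitting risk comes from.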
  • 13. Random Forests
      ● Sample the training set with replacement
      ● Fit a decision tree to the sample
      ● Repeat n times
      ● Form a classifier by averaging the n decision trees
      http://www.rhaensch.de/vrf.html
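The four steps above can be written out by hand with scikit-learn's `DecisionTreeClassifier`. This is a minimal sketch on made-up data, assuming NumPy and scikit-learn are installed; a real `RandomForestClassifier` also restricts each split to a random subset of features, which this sketch omits.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Toy data: two well-separated 2-D blobs (hypothetical, for illustration only).
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

trees = []
for _ in range(25):                          # "repeat n times"
    idx = rng.integers(0, len(X), len(X))    # sample with replacement
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def forest_predict(points):
    """Average the trees' 0/1 votes and threshold at 0.5."""
    votes = np.mean([t.predict(points) for t in trees], axis=0)
    return (votes >= 0.5).astype(int)

pred = forest_predict(np.array([[0.0, 0.0], [4.0, 4.0]]))
```

Averaging many trees that each saw a slightly different sample is what tames the overfitting of any single tree.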
  • 14. Unsupervised: Outlier Detection
      ● Given a population of “things”, can I find a function that tells me which ones look weird?
      ● Can also pretend to be a classifier (class 0 = normal, class 1 = weird)
      ● Loads of ways to accomplish this: distance to your neighbors, angle-based methods, isolation-based methods
  • 15. Isolation Forests [Liu, Ting, Zhou]
      http://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf
      ● Pick a dimension at random. Pick a value at random.
      ● Make a tree by splitting the set into two sets, repeat. Stop when the set is a single point.
      ● Do this for many trees.
      ● Form an outlier detector from the average depth at which a point is isolated in each tree (deeper is more inlier-y)
      ● Issue: enumerated types
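The procedure above is short enough to sketch in plain Python. This simplified version follows only the branch that contains the point of interest instead of building whole trees, and it skips the subsampling and score normalization from the paper; the function names and toy data are hypothetical.

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=50):
    """Depth at which `point` is isolated by random axis-aligned splits."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    # Only dimensions where the remaining points differ can be split.
    dims = [d for d in range(len(point))
            if min(r[d] for r in data) < max(r[d] for r in data)]
    if not dims:
        return depth
    dim = rng.choice(dims)                                   # random dimension
    lo, hi = min(r[dim] for r in data), max(r[dim] for r in data)
    split = rng.uniform(lo, hi)                              # random value
    # Keep only the side of the split that contains the point.
    same_side = [r for r in data if (r[dim] < split) == (point[dim] < split)]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)

def average_depth(point, data, n_trees=200, seed=0):
    """Average isolation depth over many random trees."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees

rng = random.Random(1)
cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
data = cluster + [outlier]
inlier = min(cluster, key=lambda p: p[0] ** 2 + p[1] ** 2)  # nearest the center

outlier_depth = average_depth(outlier, data)
inlier_depth = average_depth(inlier, data)
```

The far-away point is cut off after only a few random splits, while the central point needs many more, so a low average depth marks a point as an outlier.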
  • 16. A quick note about parameters
      ● Choosing parameters can be important
      ● Can use expert knowledge or ad-hoc methods
      ● Dimitar Karev (MIT RSI Intern) tested a range of parameters for Clearcut iforests using exhaustive search (for forest params) and a genetic algorithm (for features)
      ● Result was a huge improvement in F1 (see ROC curves)
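An exhaustive search over forest parameters amounts to trying every combination and keeping the one with the best F1. The sketch below does this for scikit-learn's `IsolationForest` on made-up data; the grid values and the data are assumptions, not the parameters tested for Clearcut, and it scores on the training data only to stay short.

```python
from itertools import product

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
# Hypothetical data: 300 normal points plus 15 far-away anomalies.
X = np.vstack([rng.normal(0, 1, (300, 2)), rng.uniform(6, 10, (15, 2))])
y = np.array([0] * 300 + [1] * 15)

grid = {
    "n_estimators": [25, 100],
    "max_samples": [64, 256],
    "contamination": [0.01, 0.05, 0.1],
}

results = []
for n, m, c in product(*grid.values()):
    clf = IsolationForest(n_estimators=n, max_samples=m,
                          contamination=c, random_state=0).fit(X)
    pred = (clf.predict(X) == -1).astype(int)   # -1 means outlier
    params = {"n_estimators": n, "max_samples": m, "contamination": c}
    results.append((f1_score(y, pred), params))

best_f1, best_params = max(results, key=lambda r: r[0])
```

In practice the score should come from held-out labeled data, otherwise the search itself can overfit.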
  • 17. Classification With Isolation Forests
      1. Identify positive and negative sample datasets
      2. Clean & normalize the data
      3. Partition the data into training & testing datasets
      4. Select & compute some interesting features
      5. Train a model
      6. Test the model
      7. Evaluate the results
      8.
      9. Notice similarities
  • 18. The beauty of scikit-learn & Python
      Gists to perform many types of learning are simple and consistent
      ● Take the same data as input (supervised requires an extra column)
      ● Signatures of methods are the same
      Example: RFs vs. iForests
      ● Changed a few lines of code for training
      ● Classes are a bit different (0/1 vs. 1/-1)
      ● Can re-use the analysis script with nearly no change

      # RF
      clf = RandomForestClassifier(n_jobs=4, n_estimators=opts.numtrees, oob_score=True)
      y, _ = pd.factorize(train['class'])
      clf.fit(train.drop('class', axis=1), y)
      test['prediction'] = clf.predict(testnoclass)

      # iF
      clf = IsolationForest(n_estimators=opts.numtrees)
      clf.fit(train.drop('class', axis=1))
      test['prediction'] = clf.predict(testnoclass)
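The "0/1 vs. 1/-1" difference is the one thing an analysis script has to account for: scikit-learn's `IsolationForest.predict` returns 1 for inliers and -1 for outliers. A small remapping step, sketched here on made-up data with hypothetical column names, puts both models on the same 0 = normal, 1 = weird convention.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest, RandomForestClassifier

rng = np.random.default_rng(0)
# Hypothetical two-feature data: 200 normal rows (class 0), 10 odd rows (class 1).
train = pd.DataFrame(
    np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(7, 0.5, (10, 2))]),
    columns=["f1", "f2"],
)
train["class"] = [0] * 200 + [1] * 10
features = train.drop("class", axis=1)

# Supervised: needs the label column.
rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(features, train["class"])
# Unsupervised: same features, no labels.
iforest = IsolationForest(n_estimators=50, random_state=0).fit(features)

rf_pred = rf.predict(features)             # already 0/1
if_raw = iforest.predict(features)         # 1 = inlier, -1 = outlier
if_pred = np.where(if_raw == -1, 1, 0)     # remap to 0 = normal, 1 = weird
```

After the remap, the same confusion table and F1 code can be used for either model.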
  • 19. Identifying Training & Test Data
      [Diagram with boxes labeled: Malicious Data; Label = normal; All Labeled Data; Training Data; Test Data]
  • 20. Feature extraction
      Many classifiers want to work with numeric features. We use a ‘flow enhancing’ step to add some convenience columns to the data.
      ● Some columns are already numeric
      ● Some columns have easy-to-extract numeric info: number of dots in URL, entropy in TLD
      ● Categorical columns can be converted to “Bag of Words” (BOW): N binary features, one for each category
      ● Text-y columns can be converted to BOW or Bag-of-N-grams (BON)
      ● Use TF-IDF to determine which features to keep
      [Figure: example strings (“The quick brown fox…” and a random-looking string) broken into n-grams and encoded as binary feature vectors]
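The numeric and bag-of-n-grams features described above can be sketched in a few lines of standard-library Python. `domainNameDots` and `domainNameLength` are feature names that appear in the Clearcut output on later slides; the helper functions, the `domainEntropy` name, and the tiny vocabulary are hypothetical, and TF-IDF selection is left out.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Shannon entropy in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    return -sum((c / len(text)) * math.log2(c / len(text)) for c in counts.values())

def char_ngrams(text, n=3):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def bag_of_ngrams(text, vocabulary, n=3):
    """One binary feature per vocabulary n-gram: 1 if present, else 0."""
    present = set(char_ngrams(text, n))
    return [1 if gram in present else 0 for gram in vocabulary]

domain = "download.virtualbox.org"
features = {
    "domainNameDots": domain.count("."),
    "domainNameLength": len(domain),
    "domainEntropy": shannon_entropy(domain),
}

vocab = ["dow", "box", "xyz"]            # hypothetical vocabulary
vector = bag_of_ngrams(domain, vocab)    # [1, 1, 0]
```

Random-looking strings tend to have higher entropy and hit few vocabulary n-grams, which is what makes these features useful for spotting generated domains or odd user agents.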
  • 21. Training, Testing & Evaluating a Model
      % ./train_flows_rf.py -o data/http-malware.log http-training.log
      Reading normal training data
      Reading malicious training data
      Building Vectorizers
      Training
      Predicting
      (class 0 is normal, class 1 is malicious)
      class  prediction
      0      0             12428
             1                15
      1      0                19
             1              9563
      dtype: int64
      F1 = 0.998225469729
  • 22. Training, Testing & Evaluating a Model (same command and output as slide 21)
      Read the Bro data files into a pandas data frame.
      Each row is labeled either ‘benign’ or ‘malicious’.
  • 23. Training, Testing & Evaluating a Model (same command and output as slide 21)
      Random Forest requires numeric data, so we have to convert strings. Primarily two methods:
      ● Bag of Words (method, status code)
      ● Bag of N-Grams (domain, user agent)
  • 24. Training, Testing & Evaluating a Model (same command and output as slide 21)
      Split all the labeled data into ‘training’ (80%) and ‘test’ (20%) datasets.
      Now feed all the training data through the Random Forest to produce a trained model.
      At this point, we do nothing with the test data.
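An 80/20 split is just a shuffle followed by a cut. The sketch below uses only the standard library; the function name and the toy rows are hypothetical, and Clearcut's own splitting code may differ.

```python
import random

def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and cut it into (training, test) lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labeled rows: (row id, class).
rows = [(i, i % 2) for i in range(100)]
train, test = split_rows(rows)
```

Shuffling before cutting matters when the input is ordered, for example all normal rows followed by all malicious rows; otherwise the test set could contain only one class.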
  • 25. Training, Testing & Evaluating a Model (same command and output as slide 21)
      Now we run the ‘test’ data through the trained model. It’s still labeled, so we know what the answer should be.
      We compare the expected results with the actual prediction and create a little table.
      We don’t expect perfect results, but we’d like to see most of the data in the 0/0 and 1/1 rows.
  • 26. Training, Testing & Evaluating a Model (same command and output as slide 21)
      It’s hard to compare two tables to see how different models compare (due to different datasets or feature choices).
      The F1 value is a useful single-number measure for comparison: it combines precision and recall, so it reflects both false positives and false negatives.
      Anything over about 0.9 is considered good, but beware very high values (“overfitting”)!
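F1 is the harmonic mean of precision and recall, and the value printed by the script can be reproduced from the four counts in the table (class 1, malicious, is the positive class):

```python
# Counts from the class/prediction table.
tn = 12428   # class 0, predicted 0
fp = 15      # class 0, predicted 1
fn = 19      # class 1, predicted 0
tp = 9563    # class 1, predicted 1

precision = tp / (tp + fp)   # how many flagged rows were really malicious
recall = tp / (tp + fn)      # how many malicious rows were flagged
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 6))  # 0.998225
```

The same value falls out of the shortcut 2·TP / (2·TP + FP + FN) = 19126 / 19160.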
  • 27. Bonus: Most Influential Features with ‘-v’
      Feature ranking:
      1. feature user_agent.mac os (0.047058)
      2. feature user_agent. os x 1 (0.044084)
      3. feature user_agent.; intel (0.042387)
      4. feature user_agent.ac os x (0.037192)
      5. feature user_agent.os x 10 (0.031616)
      [...]
      46. feature userAgentEntropy (0.009144)
      47. feature subdomainEntropy (0.007699)
      48. feature browser_string.browser (0.007263)
      49. feature response_body_len (0.006410)
      50. feature request_body_len (0.005506)
      51. feature domainNameDots (0.005054)
  • 28. Analyzing Log Files
      % ./analyze_flows.py http-production-2016-05-02.log
      Loading HTTP data
      Loading trained model
      Calculating features
      Analyzing
      detected 298 anomalies out of 180520 total rows (0.17%)
      -----------------------------------------
      line 2393
      Co7qtw35sGLX6RiG79,80,HEAD,download.virtualbox.org,/virtualbox/5.0.20/Oracle_VM_VirtualBox_Extension_Pack-5.0.20.vbox-extpack,-,Mozilla/5.0 (AgnosticOS; Blend) IPRT/64.42,0,0,200,80,Unknown Browser,,,download,virtualbox
      -----------------------------------------
      line 2394
      ChpL1u2Ia64utWrd9j,80,GET,download.virtualbox.org,/virtualbox/5.0.20/Oracle_VM_VirtualBox_Extension_Pack-5.0.20.vbox-extpack,-,Mozilla/5.0 (AgnosticOS; Blend) IPRT/64.42,0,16421439,200,80,Unknown Browser,,,download,virtualbox
      Slide annotation on “(0.17%)”: Percentage of original file left to review.
  • 29. Bonus: Classifier Explanations with ‘-v’
      line 431
      C9WQArVvgv1BjvJG7,80,GET,apt.spideroak.com,/spideroak_one_rpm/stable/repodata/repomd.xml,-,PackageKit-hawkey,0,2969,200,80,Unknown Browser,,,apt,spideroak
      Top feature contributions to class 1:
      userAgentLength      0.0831734141875
      response_body_len    0.0719766424091
      domainNameLength     0.056790435921
      user_agent.mac os    0.0272829846513
      user_agent. os x 1   0.0252803447682
      user_agent.os x 10   0.0251306287983
      user_agent.ac os x   0.0244848247673
      user_agent.; intel   0.0241743906069
      user_agent. intel    0.0236921809876
      tld.apple            0.020090459858
  • 30. Ideas for improvement
      ● More diverse malware samples
      ● Better filtering for connectivity checks in the malware data
      ● Incrementally retraining the forest (‘warm start’)
      ● Log type “plugins”
      ● K-class classifier
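scikit-learn's `RandomForestClassifier` has a `warm_start` option that keeps the trees already fitted and only trains the additional ones when `n_estimators` is raised. The sketch below shows that mechanism on made-up batches; whether this is the right retraining strategy for Clearcut is an open question, and `make_batch` is a hypothetical helper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def make_batch(n):
    """Hypothetical batch of labeled data: n normal rows and n malicious rows."""
    X = np.vstack([rng.normal(0, 1, (n, 3)), rng.normal(3, 1, (n, 3))])
    y = np.array([0] * n + [1] * n)
    return X, y

clf = RandomForestClassifier(n_estimators=20, warm_start=True, random_state=0)

X1, y1 = make_batch(100)
clf.fit(X1, y1)                  # first 20 trees, trained on the first batch

X2, y2 = make_batch(100)
clf.n_estimators += 20           # ask for 20 more trees
clf.fit(X2, y2)                  # old trees are kept; new ones see only batch 2

n_trees = len(clf.estimators_)
```

The older trees are never updated, so each batch should contain every class, and a forest grown this way slowly accumulates trees that reflect stale data.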
  • 31. Adapting to other log sources
      ● Change log input: clearcut_utils.load_brofile
          Import your data into a pandas data frame
      ● Change flow enhancer: flowenhancer.enhance_flow
          Add any columns that might make featurizing easier
      ● Change feature generator: featurizer.build_vectorizers
          Make any BOW and BON vectorizers that you want
          Use featurizers to make BOW/BON features
          Add any other features you think might be important
      (Image: http://www.orwellloghomes.com/greybg.jpg)
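The first two steps boil down to "get the log into a pandas data frame, then add helper columns". The sketch below shows the shape of such replacements for a CSV log. `load_csv_log`, `enhance`, the column names, and the sample rows are all hypothetical; they are not the signatures of `clearcut_utils.load_brofile` or `flowenhancer.enhance_flow`, which should be checked in the Clearcut source.

```python
import io

import pandas as pd

def load_csv_log(handle):
    """Hypothetical loader: any log source works if it ends up as a DataFrame."""
    return pd.read_csv(handle)

def enhance(df):
    """Hypothetical enhancer: add convenience columns for later featurizing."""
    out = df.copy()
    out["domainNameDots"] = out["host"].str.count(r"\.")
    out["uriLength"] = out["uri"].str.len()
    return out

# Stand-in for a real log file.
sample = io.StringIO(
    "method,host,uri,status\n"
    "GET,apt.spideroak.com,/repodata/repomd.xml,200\n"
    "HEAD,download.virtualbox.org,/virtualbox/,200\n"
)
flows = enhance(load_csv_log(sample))
```

Keeping the loader and the enhancer as separate functions mirrors the three change points listed on the slide, so a new log type only has to replace the pieces that differ.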
  • 32. Takeaways
      ● Pandas and scikit-learn are highly active Python projects that are bringing data science and machine learning tools to the masses
      ● Security technologists can (should?) leverage these tools as black or grey boxes
      ● Today, implementing ‘standard’ ML algorithms is not the long pole in the tent
      ● Snag Clearcut for an example
  • 33. The Sqrrl Threat Hunting Platform
      [Diagram: SECURITY DATA, NETWORK DATA, and ENDPOINT/IDENTITY DATA sources, including Firewall / IDS, Threat Intel, Processes, HR, Bro, SIEM Alerts, Netflow, Proxy, Authentication]
      How To Learn More? Go to sqrrl.com to…
      ● Download Sqrrl’s Threat Hunting eBook
      ● Download the Sqrrl White Paper on Threat Hunting Platforms
      ● Request a Sqrrl Test Drive VM
      ● Download Sqrrl’s Product Paper
      ● Reach out to us at info@sqrrl.com
  • 34. More Info
      Chris McCubbin, Director of Data Science, @_SecretStache_, chris@sqrrl.com
      David J. Bianco, Security Technologist, @DavidJBianco, dbianco@sqrrl.com
      Clearcut: Machine Learning for Log Review
      https://github.com/DavidJBianco/Clearcut (iforest branch for iforests)