Applying Machine Learning to Network Security Monitoring
Alexandre Pinto
Chief Data Scientist | MLSec Project
@alexcpsec
@MLSecProject

WARNING!
•  This is a talk about BUILDING, not breaking
    –  NO systems were harmed in the development of this talk.
    –  This is NOT about 1337 Android malware
•  The only thing we are likely to break here is the time limit on the talk
•  This talk includes more MATH than the daily recommended intake by the FDA.
•  All stunts described in this talk were performed by trained professionals.

Who's Alex?
•  13 years in Information Security, done a little bit of everything.
•  Past 7 or so years leading security consultancy and monitoring teams in Brazil, London and the US.
    –  If there is any way a SIEM can hurt you, it did to me.
•  Researching machine learning and data science in general for the past year or so, and presenting on their intersection with InfoSec throughout the year.
•  Created MLSec Project in July 2013 to give structure to the research being done.
  
Agenda
•  Definitions
•  Big Data
•  Data Science
•  Machine Learning
•  Y U DO DIS?
•  Network Security Monitoring
•  PoC || GTFO
•  Feature Intuition
•  How to get started?
  
 

Big Data + Machine Learning + Data Science
 

Big Data
	
  
(Security) Data Scientist
•  “Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.” -- Josh Wills, Cloudera
[Figure: Data Science Venn Diagram by Drew Conway]

Enter Machine Learning
•  “Machine learning systems automatically learn programs from data” (*)
•  You don’t really code the program; it is inferred from data.
•  The intuition is to mimic the way the brain learns: that's where terms like artificial intelligence come from.

(*) CACM 55(10) - A Few Useful Things to Know about Machine Learning (Domingos 2012)
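
To make "the program is inferred from data" concrete, here is a minimal sketch in Python. The sample values, the feature (blocked connections per hour) and the function names are all invented for illustration; nothing here comes from the talk's own code.

```python
# Hypothetical labeled data: (blocked connections per hour, is_malicious).
# The numbers are invented purely for illustration.
samples = [(2, False), (3, False), (5, False), (8, False),
           (40, True), (55, True), (70, True), (120, True)]


def learn_threshold(data):
    """Pick the cut-off that misclassifies the fewest training samples."""
    values = sorted(value for value, _ in data)
    # Candidate thresholds: midpoints between consecutive observed values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_threshold, best_errors = None, len(data) + 1
    for threshold in candidates:
        errors = sum((value > threshold) != label for value, label in data)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


threshold = learn_threshold(samples)


def predict(value):
    """The 'program' is just the learned threshold applied to new input."""
    return value > threshold


print(threshold, predict(4), predict(90))
```

Nobody typed the cut-off into the code: it falls out of the labeled samples, and it would move if the samples changed.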
  
Kinds of Machine Learning
•  Supervised Learning:
    –  Classification (NN, SVM, Naïve Bayes)
    –  Regression (linear, logistic)
•  Unsupervised Learning:
    –  Clustering (k-means)
    –  Decomposition (PCA, SVD)
Source – scikit-learn.github.io/scikit-learn-tutorial/general_concepts.html
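
The difference between the two families can be sketched on the same toy data. This is a pure-Python illustration under stated assumptions: the data is invented, the supervised part is a nearest-centroid classifier (chosen for brevity, not one of the algorithms named on the slide), and the unsupervised part is a one-dimensional k-means that needs k >= 2.

```python
def fit_centroids(points, labels):
    """Supervised: labels are given, so learn one mean value per class."""
    sums, counts = {}, {}
    for point, label in zip(points, labels):
        sums[label] = sums.get(label, 0.0) + point
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def classify(point, centroids):
    """Assign the label of the closest class mean."""
    return min(centroids, key=lambda label: abs(point - centroids[label]))


def kmeans_1d(points, k=2, iterations=10):
    """Unsupervised: no labels, so find k group centres from the data alone."""
    ordered = sorted(points)
    # Deterministic start: spread the initial centres across the sorted data.
    centres = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: abs(point - centres[i]))
            clusters[nearest].append(point)
        # Keep the old centre if a cluster ends up empty.
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return sorted(centres)


# Invented 1-D feature values (e.g. some per-host score).
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["benign", "benign", "benign", "malicious", "malicious", "malicious"]

centroids = fit_centroids(points, labels)
print(classify(2.5, centroids), classify(9.0, centroids))
print(kmeans_1d(points))
```

Both approaches end up with the same two centres here, but only the supervised one knows which centre means "malicious"; the clustering just reports that two groups exist.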
  
Classification Example
[Figure: two example images compared, labeled "VS"]

Regression Example
	
  
Considerations on Data Gathering
•  Models will (generally) get better with more data
    –  But we always have to consider bias and variance as we select our data points
    –  Also adversaries: we may be force-fed “bad data”, find signal in weird noise, or design bad (or exploitable) features
•  “I’ve got 99 problems, but data ain’t one”
Domingos, 2012
Abu-Mostafa, Caltech, 2012
  
Applications of Machine Learning
•  Sales
•  Trading
•  Image and Voice Recognition
  
Y U DO DIS?
•  Common reactions from Security Professionals:
    •  “Eh, cool…” *blank stare* *walks away*
    •  “Are you high, bro?”
    •  “Why aren’t you doing some cool research like Android Malware?”
  
Math is HARD
	
  
Security Applications of ML
•  Fraud detection systems:
    –  Is what he just did consistent with past behavior?
•  Network anomaly detection (?):
    –  More like bad statistical analysis
    –  Did not advance a lot, IMO
•  Predicting likelihood of attack actors
    –  Create different predictive models and chain them to gain more confidence in each step.
•  SPAM filters
  
Considerations on Data Gathering
•  Adversaries - Exploiting the learning process
•  Understand the model, understand the machine, and you can circumvent it
•  Something the InfoSec community knows very well
•  Any predictive model in InfoSec will be pushed to the limit
•  Again, think back on the way SPAM engines evolved.

Network Security Monitoring
	
  
Correlation Rules: A Primer
•  Rules in a SIEM solution invariably are:
    –  “Something” has happened “x” times;
    –  “Something” has happened and another “something2” has happened, with some relationship (time, same fields, etc.) between them.
•  Configuring SIEM = iterate on combinations until:
    –  Customer or management is foole.. I mean satisfied;
    –  Consulting money runs out
•  Behavioral rules (anomaly detection) help a bit with the “x”s, but it is still very laborious and time consuming.

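The first rule shape ("something" has happened "x" times) can be sketched as a sliding-window counter. This is a generic illustration, not the rule language of any particular SIEM; the function name, event tuples and addresses are hypothetical.

```python
from collections import defaultdict, deque


def threshold_rule(events, x, window_seconds):
    """Yield (timestamp, key) whenever `key` has been seen `x` times
    within the last `window_seconds`.

    `events` is an iterable of (timestamp_seconds, key) sorted by time.
    """
    recent = defaultdict(deque)
    for timestamp, key in events:
        seen = recent[key]
        seen.append(timestamp)
        # Drop sightings that fell out of the sliding window.
        while seen and timestamp - seen[0] > window_seconds:
            seen.popleft()
        if len(seen) >= x:
            yield timestamp, key
            seen.clear()  # reset so one burst raises one alert


# Invented events: (seconds since start, source IP that failed a login).
events = [(0, "10.0.0.5"), (10, "10.0.0.5"), (15, "10.0.0.9"),
          (20, "10.0.0.5"), (500, "10.0.0.9"), (505, "10.0.0.9")]

alerts = list(threshold_rule(events, x=3, window_seconds=60))
print(alerts)
```

Both `x` and `window_seconds` have to be picked by hand for every rule, which is exactly the tuning effort the slide complains about.
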
Kinds of Network Security Monitoring
•  Alert-based:
    –  “Traditional” log management
    –  SIEM
    –  Using “Threat Intelligence” (i.e. blacklists) for about a year or so
    –  Lack of context
    –  Low effectiveness
    –  You get the results handed over to you
•  Exploration-based:
    –  Network Forensics tools (2 to 3 years ago)
    –  Elasticsearch-based log management (LM) systems
    –  High effectiveness
    –  Lots of people necessary
    –  Lots of HIGHLY trained people
•  Big Data Security Analytics (BDSA):
    –  Run exploration-based monitoring on Hadoop
    –  More like Big Data Security Monitoring (BDSM)
  
Alert-based + Exploration-based
	
  
A wild army of robots appears
	
  
Using robots to catch bad guys
	
  
PoC || GTFO
•  We developed a set of algorithms to detect malicious behavior from log entries of firewall blocks
•  Over 6 months of data from SANS DShield (thanks, guys!)
•  After a lot of statistics-based math (true positive ratio, true negative ratio, odds likelihood), it could pinpoint actors that would be 13x-18x more likely to attack you.
•  Today it is more like 30x on the SANS data, and finding around 80% of “badness” in participant deployments.

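One common way to turn a true positive ratio and a true negative ratio into an "N times more likely" figure is the positive likelihood ratio, LR+ = TPR / (1 - TNR). Whether this is exactly the calculation behind the 13x-18x and 30x numbers is an assumption; the sketch below only shows the arithmetic, with invented counts.

```python
def rates(tp, fp, tn, fn):
    """True positive ratio (sensitivity) and true negative ratio (specificity)."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return tpr, tnr


def positive_likelihood_ratio(tpr, tnr):
    """LR+ = TPR / (1 - TNR): how much more often a true attacker is
    flagged than a benign actor is."""
    return tpr / (1.0 - tnr)


# Invented confusion-matrix counts, not results from the talk.
tpr, tnr = rates(tp=80, fp=50, tn=950, fn=20)
lr = positive_likelihood_ratio(tpr, tnr)
print(f"TPR={tpr:.2f} TNR={tnr:.2f} LR+={lr:.1f}")
```

Multiplying the prior odds that an actor is malicious by LR+ gives the odds after the model has flagged it, which is where a phrase like "16 times more likely" would come from.
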
Feature Intuition: IP Proximity
•  Assumptions to aggregate the data
•  Correlation / proximity / similarity BY BEHAVIOR
•  “Bad Neighborhoods” concept:
    –  Spamhaus x CyberBunker
    –  Google Report (June 2013)
    –  Moura 2013
•  Group by Geolocation
•  Group by Netblock (/16, /24)
•  Group by ASN
    –  (thanks, Team Cymru)

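Grouping by netblock needs nothing beyond Python's standard `ipaddress` module, so it is the easiest of the three groupings to sketch. The addresses below are invented (taken from the reserved documentation ranges) and the helper names are hypothetical; grouping by ASN or geolocation would need an external lookup source that is not shown here.

```python
import ipaddress
from collections import Counter


def netblock(ip, prefix_len):
    """Return the enclosing IPv4 network of `ip` with the given prefix length."""
    return ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)


def count_by_netblock(ips, prefix_len):
    """Count how many offending addresses fall in each netblock."""
    return Counter(str(netblock(ip, prefix_len)) for ip in ips)


# Invented source addresses of blocked connections (documentation ranges).
blocked = ["198.51.100.7", "198.51.100.200", "198.51.100.9",
           "203.0.113.5", "192.0.2.44"]

print(count_by_netblock(blocked, 24))
print(count_by_netblock(blocked, 16))
```

The per-netblock counts can then be used as a feature for every address in that block, which is the "bad neighborhood" idea: an address inherits some suspicion from its neighbors' behavior.
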
Map of the Internet
•  (Hilbert Curve)
•  Block port 22
•  2013-07-20
[Figure: Hilbert-curve map of the IPv4 address space. Labels on the figure: 0, 10, 127, "MULTICAST AND FRIENDS", "You are here!", "CN, BR, TH", "CN", "RU".]
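
A Hilbert curve keeps numerically adjacent address blocks next to each other on a 2-D grid, which is why it is popular for maps like this one. The sketch below uses the standard distance-to-coordinates conversion and assumes one tile per /8 (first octet) on a 16 x 16 grid; the exact layout and orientation of the original figure are not known, so treat this as an illustration of the technique only.

```python
def hilbert_d2xy(side, d):
    """Convert distance `d` along a Hilbert curve into (x, y) on a
    `side` x `side` grid, where `side` is a power of two."""
    x = y = 0
    t = d
    s = 1
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # Flip the quadrant before rotating it.
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x  # rotate by swapping the axes
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


SIDE = 16  # 16 x 16 = 256 tiles, one per /8 (assumed layout)

grid = [[None] * SIDE for _ in range(SIDE)]
for first_octet in range(256):
    x, y = hilbert_d2xy(SIDE, first_octet)
    grid[y][x] = first_octet

# Print the top four rows of the map as first-octet numbers.
for row in grid[:4]:
    print(" ".join(f"{octet:3d}" for octet in row))
```

With this orientation, 0/8 lands in the top-left corner and consecutive /8 blocks always share an edge, so a hot region on the map corresponds to a contiguous range of addresses.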
  
Feature Intuition: Temporal Decay
•  Even bad neighborhoods renovate:
    –  Attackers may change ISPs/proxies
    –  Botnets may be shut down / relocate
    –  A little paranoia is OK, but not EVERYONE is out to get you (at least not all at once)
•  As days pass, let's forget, bit by bit, who attacked
•  Last time I saw this actor, and how often did I see them

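"Forgetting bit by bit" can be sketched as an exponentially decayed count of sightings. The exponential form and the 7-day half-life are assumptions made for this example; the talk does not say which decay function was used.

```python
import math


def decayed_score(sighting_days, today, half_life_days=7.0):
    """Sum of past sightings, each weighted by exponential decay.

    A sighting `half_life_days` ago counts half as much as one today, so
    both recency ("last time I saw this actor") and frequency ("how often")
    feed the score.
    """
    decay_rate = math.log(2) / half_life_days
    return sum(math.exp(-decay_rate * (today - day))
               for day in sighting_days if day <= today)


# Invented sighting histories, as day numbers.
recent_actor = [28, 29, 30]       # seen three times in the last few days
stale_actor = [1, 2, 3, 4, 5]     # seen more often, but weeks ago

print(decayed_score(recent_actor, today=30))
print(decayed_score(stale_actor, today=30))
```

The actor seen three times this week outscores the one seen five times a month ago, which matches the intuition on the slide: old misbehavior fades unless it is renewed.
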
MLSec Project
•  Behavior: block on port 22
•  Trial inference on 100k IP addresses per Class A subnet
•  Logarithmic scale: brightest tiles are 10 to 1000 times more likely to attack.
  
Feature Intuition: DNS features
•  Who resolves to this IP address?
•  Number of domains that resolve to the IP address
•  Distribution of their lifetime
•  Entropy, size, ccTLDs
•  Registrar information
•  Reverse DNS information…
•  History of DNS registration…
•  (Thanks, DNSDB!)
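
A few of these features can be computed from nothing but the list of domain names that resolve to an address. The sketch below is an assumption about what "entropy" and "size" could mean (character entropy and length of the leftmost label); the domain names are invented, and lifetime, registrar and history features would need a passive DNS source that is not modeled here.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def dns_features(domains):
    """A few per-IP features from the list of domains resolving to it."""
    labels = [d.split(".")[0] for d in domains]
    return {
        "domain_count": len(domains),
        "mean_label_length": sum(map(len, labels)) / len(labels),
        "mean_label_entropy": sum(map(shannon_entropy, labels)) / len(labels),
        "tlds": sorted({d.rsplit(".", 1)[-1] for d in domains}),
    }


# Invented domain names on reserved example TLDs.
features = dns_features(["mail.example", "x7f3kq9zt2.example", "shop.test"])
print(features)
```

Random-looking labels such as the second one push the entropy up, which is one reason entropy is a popular signal for machine-generated domain names.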
  
Training the Model
•  YAY! We have a bunch of numbers per IP address/domain!
•  How do you define what is malicious or not?
•  “Advanced expertise in both information security and data science will be a necessary ingredient in enabling accurate discrimination between malicious and benign activity.” - Anton Chuvakin, Gartner
•  Kinda easy for security tools (if you trust them)
•  Web application logs need deeper statistical analysis
•  Not a normal / standard deviation thing

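One reading of the last bullet is that log-derived counts are rarely normally distributed, so a "mean plus k standard deviations" cut-off can mislead. The sketch below contrasts it with a median / MAD rule on invented request counts; the choice of MAD as the alternative is an assumption made for illustration, not something the talk prescribes.

```python
import statistics


def stddev_outliers(values, k=3.0):
    """Flag values more than k standard deviations above the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if v > mean + k * stdev]


def mad_outliers(values, k=3.0):
    """Flag values more than k scaled MADs above the median.

    The median absolute deviation is barely moved by a few extreme values,
    unlike the standard deviation.
    """
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    scaled_mad = 1.4826 * mad  # makes MAD comparable to a std. deviation
    return [v for v in values if v > median + k * scaled_mad]


# Invented requests-per-minute counts for one client, with two bursts.
requests = [3, 4, 2, 5, 3, 4, 3, 2, 4, 3, 900, 1200]

print(stddev_outliers(requests))
print(mad_outliers(requests))
```

On this data the two bursts inflate the standard deviation so much that the first rule flags nothing, while the median-based rule flags both bursts.
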
How do I get started on this?
•  Programming is a must (Python / R)
•  Statistical knowledge keeps you from making dumb mistakes
•  Specific machine learning courses and books:
    –  Coursera (ML / Data Analysis / Data Science)
•  Practice, Practice, Practice:
    –  Explore your data! (Security Onion)
    –  Kaggle
    –  KDD, VAST, VizSec

MLSec Project
•  Sign up, send logs, receive reports generated by machine learning models!
•  Working with several companies on trying out these models in their environments with their data
•  We are hiring (KINDA)
•  Visit https://www.mlsecproject.org, message @MLSecProject or just e-mail me.

MLSec Project - Current Research
•  Inbound attacks on exposed services (DEFCON/BH 2013):
    –  Information from inbound connections on firewalls, IPS, WAFs
    –  Feature extraction and supervised learning
•  Malware Distribution and Botnets:
    –  Information from outbound connections on firewalls, DNS and Web Proxy
    –  Initial labeling provided by intelligence feeds and AV/anti-malware
    –  Semi-supervised learning involved
•  Kill-chain Ensemble Models:
    –  Increased precision by composing different behaviors
    –  Web server path -> go through Firewall, then IPS, then WAF
    –  Early confirmation of attack failure or success

Thanks!
•  Q&A?
•  Feedback?
Alexandre Pinto
@alexcpsec
@MLSecProject
https://www.mlsecproject.org/
"Essentially, all models are wrong, but some are useful." - George E. P. Box
  	
  


Applying Machine Learning to Network Security Monitoring - BayThreat 2013
