Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring
(#SecureBecauseMath)
Alex Pinto
Chief Data Scientist | MLSec Project
@alexcpsec
@MLSecProject
whoami
Alex Pinto
•  Chief Data Scientist at MLSec Project
•  Machine Learning Researcher and Trainer
•  Network security and incident response aficionado
•  Tortured by SIEMs as a child
•  Hacker Spirit Animal™: CAFFEINATED CAPYBARA
(https://secure.flickr.com/photos/kobashi_san/)
  
 
Agenda
•  Security Singularity
•  Some History
•  TLA
•  ML Marketing Patterns
•  Anomaly Detection
•  Classification
•  Buyer’s Guide
•  MLSec Project
  
Security Singularity Approaches
(Side Note)
First hit on Google Images for “Network Security Solved” is a picture of Jack Daniel
Security Singularity Approaches
•  “Machine learning / math / algorithms… these terms are used interchangeably quite frequently.”
•  “Is behavioral baselining and anomaly detection part of this?”
•  “What about Big Data Security Analytics?”
(http://bigdatapix.tumblr.com/)
  
Are we even trying?
•  “Hyper-dimensional security analytics”
•  “3rd generation Artificial Intelligence”
•  “Secure because Math”
•  Lack of ability to differentiate hurts buyers, investors.
•  Are we even funding the right things?
  
Is this a communication issue?
  
Guess the Year!
•  “(…) behavior analysis system that enhances your network intelligence and security by auditing network flow data from existing infrastructure devices”
•  “Mathematical models (…) that determine baseline behavior across users and machines, detecting (…) anomalous and risky activities (…)”
•  “(…) maintains historical profiles of usage per user and raises an alarm when observed activity departs from established patterns of usage for an individual.”
  	
  
A little history
•  Dorothy E. Denning (professor at the Department of Defense Analysis at the Naval Postgraduate School)
•  1986 (SRI) - First research that led to IDS
•  Intrusion Detection Expert System (IDES)
•  Already had statistical anomaly detection built-in
•  1993: Her colleagues release the Next Generation (!) IDES
  
Three Letter Acronyms - KDD
•  After the release of Bro (1998) and Snort (1999), DARPA thought we were covered for this signature thing
•  DARPA released datasets for user anomaly detection in 1998 and 1999
•  And then came the KDD-99 dataset, with over 6200 citations on Google Scholar
  
Three Letter Acronyms

Three Letter Acronyms - KDD

Trolling, maybe?

Not here to bash academia
  
A Probable Outcome
GRAD SCHOOL FRESHMAN
ZOMG RESULTS !!11!1!
ZOMG! RESULTS???
MATH, STAHP!
MATH IS HARD, LET’S GO SHOPPING
  
ML Marketing Patterns
•  The “Has-beens”
  •  Name is a bit harsh, but hey, you hardly use ML anymore, let us try it
•  The “Machine Learning ¯\_(ツ)_/¯”
  •  Hey, that sounds cool, let’s put that in our brochure
•  The “Sweet Spot”
  •  People that actually are trying to do something
  •  Anomaly Detection vs. Classification
  
Anomaly Detection
•  Works wonders for well-defined “industrial-like” processes.
•  Looking at single, consistently measured variables
•  Historical usage in financial fraud prevention.
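As a rough illustration of why a single, consistently measured variable is the easy case, here is a minimal sketch of a z-score check. The function name, the threshold of 3 standard deviations, and the sample numbers are all assumptions made for this example; the talk does not prescribe a specific method.

```python
import statistics


def zscore_alerts(history, new_values, threshold=3.0):
    """Flag new observations more than `threshold` standard deviations
    away from the historical mean of one variable."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # A perfectly constant history gives no scale to compare against.
        raise ValueError("history has zero variance")
    alerts = []
    for value in new_values:
        z = (value - mean) / stdev
        if abs(z) > threshold:
            alerts.append((value, z))
    return alerts


# Hypothetical daily counts from one stable, "industrial-like" process.
history = [100, 102, 98, 101, 99, 103, 97, 100, 101, 99]
for value, z in zscore_alerts(history, [101, 96, 140]):
    print(f"anomalous value {value} (z = {z:.1f})")
```

This only works because "normal" for one stable variable is easy to estimate from history; the slides that follow explain why the same idea degrades on network and user data.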
  
Anomaly Detection
•  What fits this mold?
  •  Network/NetFlow behavior analysis
  •  User behavior analysis
•  What are the challenges?
  •  Curse of Dimensionality
  •  Lack of ground truth and normality poisoning
  •  Hanlon’s Razor
  
AD: Curse of Dimensionality
•  We need “distances” to measure the features/variables
  •  Usually Manhattan or Euclidean
•  For high-dimensional data, the distribution of distances between all pairwise points in the space becomes concentrated around an average distance.
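The concentration effect can be observed with a small experiment. The sketch below is an illustration only: the uniform random data, the point count, and the use of the coefficient of variation (standard deviation divided by mean) as the "spread" measure are assumptions, not something taken from the talk.

```python
import itertools
import math
import random
import statistics


def distance_spread(dim, n_points=50, seed=0):
    """Coefficient of variation of all pairwise Euclidean distances
    between uniform random points in the unit cube [0, 1]^dim."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [math.dist(a, b) for a, b in itertools.combinations(points, 2)]
    return statistics.stdev(dists) / statistics.mean(dists)


for dim in (2, 10, 100, 1000):
    print(f"dim={dim:5d}  stdev/mean of pairwise distances = {distance_spread(dim):.3f}")
```

The printed ratio shrinks as the dimension grows, which is the sense in which every point ends up at roughly the same distance from every other point.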
  
AD: Curse of Dimensionality
•  The volume of the high dimensional sphere becomes negligible in relation to the volume of the high dimensional cube.
•  The practical result is that everything just seems too far away, and at similar distances.
(http://www.datasciencecentral.com/m/blogpost?id=6448529%3ABlogPost%3A175670)
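The sphere-versus-cube claim can be checked numerically with the standard closed form for the volume of a d-dimensional ball of radius 1, V = π^(d/2) / Γ(d/2 + 1), compared against the enclosing cube of side 2. The function name below is made up for this sketch.

```python
import math


def ball_to_cube_ratio(dim):
    """Volume of the unit-radius ball divided by the volume of the
    enclosing cube of side 2, in `dim` dimensions."""
    ball = math.pi ** (dim / 2) / math.gamma(dim / 2 + 1)
    cube = 2.0 ** dim
    return ball / cube


for dim in (2, 3, 5, 10, 20, 50):
    print(f"dim={dim:3d}  ball/cube = {ball_to_cube_ratio(dim):.3e}")
```

In two dimensions the inscribed circle covers π/4 of the square; by twenty dimensions the ball occupies a vanishing fraction of the cube, so almost all of the volume sits in the corners.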
  
A Practical example
•  NetFlow data, company with n internal nodes.
  •  2(n^2 - n) communication directions
  •  2*2*2*65535(n^2 - n) measures of network activity
•  1000 nodes -> Half a trillion possible dimensions
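The arithmetic behind "half a trillion" can be reproduced directly from the two formulas on the slide. The slide does not say what each factor stands for, so the sketch below keeps them as plain multipliers; reading 65535 as a port range is only a guess.

```python
def netflow_dimensions(n):
    """Evaluate the slide's formulas for a network with n internal nodes."""
    # 2(n^2 - n) communication directions
    directions = 2 * (n * n - n)
    # 2*2*2*65535(n^2 - n) measures, i.e. 2*2*65535 per direction
    measures = 2 * 2 * 65535 * directions
    return directions, measures


directions, measures = netflow_dimensions(1000)
print(f"{directions:,} communication directions")
print(f"{measures:,} possible measures")
```

For n = 1000 this gives 1,998,000 directions and 523,755,720,000 measures, which is where the half-trillion figure comes from.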
  
Breaking the Curse
•  Different / creative distance metrics
•  Organizing the space into sub-manifolds where Euclidean distances make more sense.
•  Aggressive feature removal
•  A few interesting results available
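One simple form that "aggressive feature removal" could take is dropping columns that barely vary. This is only one possible approach, chosen here for illustration; the function name, the variance cut-off, and the sample rows are assumptions and not the method used in the talk.

```python
import statistics


def drop_low_variance(rows, min_variance=1e-3):
    """Return the indices of the columns whose population variance exceeds
    `min_variance`, plus the rows reduced to those columns."""
    columns = list(zip(*rows))
    keep = [
        i for i, column in enumerate(columns)
        if statistics.pvariance(column) > min_variance
    ]
    reduced = [[row[i] for i in keep] for row in rows]
    return keep, reduced


# Hypothetical feature rows: column 1 is constant, column 2 is nearly so.
rows = [
    [1.0, 0.0, 5.0],
    [2.0, 0.0, 5.0],
    [3.0, 0.0, 5.0001],
]
keep, reduced = drop_low_variance(rows)
print("kept columns:", keep)
print("reduced rows:", reduced)
```

Fewer dimensions means distances regain some contrast, at the cost of discarding whatever signal the removed features carried.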
  
AD: Normality-poisoning attacks
•  Ground Truth (labels) >> Features >> Algorithms
•  There is no (or next to no) Ground Truth in AD
  •  What is “normal” in your environment?
•  Problem asymmetry
  •  Solutions are biased to the prevalent class
  •  Very hard to fine-tune, becomes prone to a lot of false negatives or false positives
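The problem asymmetry is easy to quantify with made-up numbers. The sketch below assumes one million events of which 100 are malicious, and a detector with a 99% true-positive rate and a 1% false-positive rate; all four figures are hypothetical and chosen only to show the effect.

```python
def triage_stats(n_events, n_malicious, tpr, fpr):
    """Return (precision of the detector, accuracy of always saying 'normal')."""
    n_benign = n_events - n_malicious
    true_pos = tpr * n_malicious
    false_pos = fpr * n_benign
    precision = true_pos / (true_pos + false_pos)
    # A model that labels everything as the prevalent class still scores well.
    baseline_accuracy = n_benign / n_events
    return precision, baseline_accuracy


precision, baseline = triage_stats(1_000_000, 100, tpr=0.99, fpr=0.01)
print(f"share of alerts that are real: {precision:.2%}")
print(f"accuracy of always answering 'normal': {baseline:.2%}")
```

Under these assumptions fewer than 1 in 100 alerts is a real incident, while a model that never alerts is 99.99% accurate, which is why accuracy alone rewards bias toward the prevalent class.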
  
AD: Hanlon’s Razor
Never attribute to malice that which is adequately explained by stupidity.
AD: Hanlon’s Razor
Evil Hacker vs. Hipster Developer (a.k.a. Matt Johansen)
What about User Behavior?
•  Surprise, it kinda works! (as supervised, that is)
  •  As specific implementations for specific solutions
  •  Good stuff from Square, AirBnB
  •  Well defined scope and labeling.
•  Can it be general enough?
  •  File exfiltration example (roles/info classification are mandatory?)
  •  Can I “average out” user behaviors in different applications?
  
Classification!
Lots of Malware Activity
•  Lots of available academic research around this
  •  Classification and clustering of malware samples
  •  More success in classifying artifacts you already know to be malware than in actually detecting it. (Lineage)
•  State of the art? My guess is AV companies!
  •  All of them have an absurd amount of samples
  •  Have been researching and consolidating data on them for decades.
  
Lots of Malware Activity
•  Can we do better than “AV Heuristics”?
•  Lots and lots of available data that has been made public
•  Some of the papers also suffer from potentially bad ground truth.
  
Everyone makes mistakes!
  
How is it going then, Alex?
•  Private Beta of our Threat Intelligence-based models:
  •  Some use TI indicator feeds as blocklists
  •  More mature companies use the feeds to learn about the threats (Trained professionals only)
•  Our models extrapolate the knowledge of existing threat intelligence feeds as those experienced analysts would.
  •  Supervised model w/ the same data the analyst has
  •  Seeded labeling from TI feeds
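To make "seeded labeling" concrete, here is a minimal sketch of how indicators seen in logs might receive initial labels before a supervised model is trained. It is a guess at the general idea, not the MLSec Project implementation: the function name, the label encoding (1, 0, None), the tie-break rule, and the example domains are all hypothetical.

```python
def seed_labels(observed, malicious_indicators, trusted_seeds):
    """Assign 1 (malicious) to anything listed in a TI feed, 0 (benign) to
    anything on a trusted seed list, and None to everything else."""
    labels = {}
    for item in observed:
        if item in malicious_indicators:
            # Assumption: a TI feed hit wins if an item is on both lists.
            labels[item] = 1
        elif item in trusted_seeds:
            labels[item] = 0
        else:
            labels[item] = None  # unlabeled: left for the model to score
    return labels


# Hypothetical data using reserved example domains.
ti_feed = {"bad.example.net", "c2.example.org"}
trusted = {"www.example.com"}
seen_in_logs = ["www.example.com", "c2.example.org", "unknown.example.edu"]

for domain, label in seed_labels(seen_in_logs, ti_feed, trusted).items():
    print(domain, label)
```

The labeled items form the training set; the unlabeled ones are what a trained model would then be asked to rank for the analyst.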
  
Yeah, but why should I care?
•  Very effective first triage for SOCs and Incident Responders
  •  Send us: log data from firewalls, DNS, web proxies
  •  Receive: Report with a short list of potentially compromised machines
•  Would you rather download all the feeds and integrate them yourself?
  •  MLSecProject/Combine
  •  MLSecProject/TIQ-test
  
What about the Ground Truth (labels)?
•  Huge amounts of TI feeds available now (open/commercial)
•  Non-malicious samples are still challenging, but we have expanded to a lot of collection techniques from different sources.
  •  Very high-ranked Alexa / Quantcast / OpenDNS Random domains as seeds for search of trust
  •  Helped by the customer logs as well in a semi-supervised fashion
  
But what about data tampering?
•  Vast majority of features are derived from structural/intrinsic data:
  •  GeoIP, ASN information, BGP Prefixes
  •  pDNS information for the IP addresses, hostnames
  •  WHOIS information
•  Attacker can’t change those things without cost.
•  Log data from the customer can be tampered with, of course. But this does not make the model worse off than a human specialist.
  
And what about false positives?
•  False positives / false negatives are an intrinsic part of ML.
•  “False positives are very good, and would have fooled our human analysts at first.”
  •  Their feedback helps us improve the models for everyone.
•  Remember it is about initial triage. A Tier-2/Tier-3 analyst must investigate and provide feedback to the model.
  
Buyer’s Guide
•  1) What are you trying to achieve with adding Machine Learning to the solution?
•  2) What are the sources of Ground Truth for your models?
•  3) How can you protect the features / ground truth from adversaries?
•  4) How does the solution/processes around it handle false positives?
#NotAllAlgorithms
  
MLSec Project
•  Don’t take my word for it! Try it out!!
•  Help us test and improve the models!
•  Looking for participants and data sharing agreements
•  Limited capacity at the moment, so be patient. :)
•  Visit https://www.mlsecproject.org, message @MLSecProject or just e-mail me.
Thanks!
•  Q&A?
•  Don’t forget the feedback!
Alex Pinto
@alexcpsec
@MLSecProject
“We are drowning in information and starved for knowledge”
- John Naisbitt
  	
  

