Applying Machine Learning to Network Security Monitoring - BayThreat 2013

Applying Machine Learning to Network Security Monitoring

Alexandre Pinto
Chief Data Scientist | MLSec Project
@alexcpsec
@MLSecProject

WARNING!

•  This is a talk about BUILDING, not breaking
–  NO systems were harmed in the development of this talk.
–  This is NOT about 1337 Android Malware
•  The only thing we are likely to break here is the time limit on the talk
•  This talk includes more MATH than the daily recommended intake by the FDA.
•  All stunts described in this talk were performed by trained professionals.

Who's Alex?

•  13 years in Information Security, done a little bit of everything.
•  Past 7 or so years leading security consultancy and monitoring teams in Brazil, London and the US.
–  If there is any way a SIEM can hurt you, it did to me.
•  Researching machine learning and data science in general for the past year or so and presenting about the intersection of it and Infosec throughout the year.
•  Created MLSec Project in July 2013 to give structure to the research being done.

Agenda

•  Definitions
•  Big Data
•  Data Science
•  Machine Learning
•  Y U DO DIS?
•  Network Security Monitoring
•  PoC || GTFO
•  Feature Intuition
•  How to get started?

Big Data + Machine Learning + Data Science
(Security) Data Scientist

•  “Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician.”
-- Josh Wills, Cloudera

Data Science Venn Diagram by Drew Conway

Enter Machine Learning

•  “Machine learning systems automatically learn programs from data” (*)
•  You don’t really code the program, but it is inferred from data.
•  Intuition of trying to mimic the way the brain learns: that's where terms like artificial intelligence come from.

(*) CACM 55(10) - A Few Useful Things to Know about Machine Learning (Domingos 2012)

Kinds of Machine Learning

•  Supervised Learning:
–  Classification (NN, SVM, Naïve Bayes)
–  Regression (linear, logistic)
•  Unsupervised Learning:
–  Clustering (k-means)
–  Decomposition (PCA, SVD)

Source – scikit-learn.github.io/scikit-learn-tutorial/general_concepts.html
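To make the supervised / unsupervised split concrete, here is a minimal sketch in plain Python. It is not from the talk: the one-dimensional data, the label names and the nearest-centroid classifier are all illustrative assumptions, chosen only because they fit in a few lines. The supervised half is handed labels and learns from them; the unsupervised half (a bare-bones k-means) has to discover the groups on its own.

```python
from statistics import mean

def train_centroids(points, labels):
    """Supervised: labels are given, so learn one centroid per label."""
    return {
        lab: mean([p for p, l in zip(points, labels) if l == lab])
        for lab in set(labels)
    }

def classify(centroids, x):
    """Predict the label whose centroid is closest to x."""
    return min(centroids, key=lambda lab: abs(centroids[lab] - x))

def kmeans_1d(points, k, iterations=20):
    """Unsupervised: no labels, so discover k group centers from the data alone."""
    # Naive initialisation: spread the starting centers over the sorted data.
    centers = sorted(points)[:: max(1, len(points) // k)][:k]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(centers[i] - p))
            groups[nearest].append(p)
        # Keep the old center if a group ended up empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return sorted(centers)

# Hypothetical feature values (e.g. some per-host count), made up for illustration.
points = [1, 2, 3, 50, 51, 52]
labels = ["benign"] * 3 + ["malicious"] * 3

centroids = train_centroids(points, labels)
prediction = classify(centroids, 40)
clusters = kmeans_1d(points, k=2)
```

In practice a library such as scikit-learn would replace both halves; the point here is only who supplies the labels.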

Classification Example

VS

Considerations on Data Gathering

•  Models will (generally) get better with more data
–  But we always have to consider bias and variance as we select our data points
–  Also adversaries: we may be force fed “bad data”, find signal in weird noise or design bad (or exploitable) features
•  “I’ve got 99 problems, but data ain’t one”

Domingos, 2012
Abu-Mostafa, Caltech, 2012

Applications of Machine Learning

•  Sales
•  Trading
•  Image and Voice Recognition

Y U DO DIS?

•  Common reactions from Security Professionals:
•  “Eh, cool…” *blank stare* *walks away*
•  “Are you high, bro?”
•  “Why aren’t you doing some cool research like Android Malware?”

Security Applications of ML

•  Fraud detection systems:
–  Is what he just did consistent with past behavior?
•  Network anomaly detection (?):
–  More like bad statistical analysis
–  Did not advance a lot, IMO
•  Predicting likelihood of attack actors
–  Create different predictive models and chain them to gain more confidence in each step.
•  SPAM filters

Considerations on Data Gathering

•  Adversaries - Exploiting the learning process
•  Understand the model, understand the machine, and you can circumvent it
•  Something the InfoSec community knows very well
•  Any predictive model in InfoSec will be pushed to the limit
•  Again, think back on the way SPAM engines evolved.

Network Security Monitoring

Correlation Rules: A Primer

•  Rules in a SIEM solution invariably are:
–  “Something” has happened “x” times;
–  “Something” has happened and other “something2” has happened, with some relationship (time, same fields, etc) between them.
•  Configuring SIEM = iterate on combinations until:
–  Customer or management is foole.. I mean satisfied;
–  Consulting money runs out
•  Behavioral rules (anomaly detection) help a bit with the “x”s, but still, very laborious and time consuming.
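The first rule shape above ("something" has happened "x" times) can be sketched in a few lines of Python. This is an illustration only, not any SIEM product's rule engine: the event format, the function name `threshold_rule` and the choice to reset the counter after each alert are all assumptions.

```python
from collections import defaultdict, deque

def threshold_rule(events, x, window_seconds):
    """Return (timestamp, key) for each time `key` is seen `x` times in the window.

    `events` is an iterable of (timestamp_seconds, key), sorted by timestamp.
    """
    recent = defaultdict(deque)  # key -> timestamps still inside the window
    alerts = []
    for ts, key in events:
        q = recent[key]
        q.append(ts)
        # Drop sightings that fell out of the sliding window.
        while q and ts - q[0] > window_seconds:
            q.popleft()
        if len(q) >= x:
            alerts.append((ts, key))
            q.clear()  # assumption: one burst raises one alert
    return alerts

# Hypothetical events: (seconds, source identifier).
events = [(0, "a"), (10, "a"), (20, "a"), (25, "b"), (500, "a"), (510, "a")]
alerts = threshold_rule(events, x=3, window_seconds=60)
```

Note that both `x` and `window_seconds` still have to be picked by hand, which is exactly the iteration the slide complains about.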

Kinds of Network Security Monitoring

•  Alert-based:
–  “Traditional” log management
–  SIEM
–  Using “Threat Intelligence” (i.e. blacklists) for about a year or so
–  Lack of context
–  Low effectiveness
–  You get the results handed over to you
•  Exploration-based:
–  Network Forensics tools (2/3 years ago)
–  Elastic Search based LM systems
–  High effectiveness
–  Lots of people necessary
–  Lots of HIGHLY trained people
•  Big Data Security Analytics (BDSA):
–  Run exploration-based monitoring on Hadoop
–  More like Big Data Security Monitoring (BDSM)

Alert-based + Exploration-based

A wild army of robots appears

Using robots to catch bad guys

PoC || GTFO

•  We developed a set of algorithms to detect malicious behavior from log entries of firewall blocks
•  Over 6 months of data from SANS DShield (thanks, guys!)
•  After a lot of statistical-based math (true positive ratio, true negative ratio, odds likelihood), it could pinpoint actors that would be 13x-18x more likely to attack you.
•  Today more like 30x on the SANS data, and finding around 80% of “badness” in participant deployments.
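One common way to turn a true positive ratio and a true negative ratio into an "N times more likely" figure is the positive likelihood ratio, LR+ = TPR / (1 - TNR). The sketch below assumes that this is roughly what "odds likelihood" refers to; the talk does not spell out the formula, and the confusion-matrix counts are invented for illustration, not taken from the DShield data.

```python
def positive_likelihood_ratio(tp, fn, fp, tn):
    """LR+ = TPR / (1 - TNR): how much more likely a flagged actor is to be bad."""
    tpr = tp / (tp + fn)  # true positive ratio (sensitivity)
    tnr = tn / (tn + fp)  # true negative ratio (specificity)
    if tnr == 1.0:
        return float("inf")  # no false positives at all
    return tpr / (1 - tnr)

# Hypothetical counts: 100 known-bad actors, 100 known-good ones.
lr = positive_likelihood_ratio(tp=80, fn=20, fp=5, tn=95)
```

With these made-up counts the model catches 80% of the bad actors while flagging 5% of the good ones, so an actor it flags is about 16 times more likely to attack than one it does not.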

Feature Intuition: IP Proximity

•  Assumptions to aggregate the data
•  Correlation / proximity / similarity BY BEHAVIOR
•  “Bad Neighborhoods” concept:
–  Spamhaus x CyberBunker
–  Google Report (June 2013)
–  Moura 2013
•  Group by Geolocation
•  Group by Netblock (/16, /24)
•  Group by ASN
–  (thanks, Team Cymru)
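Grouping by netblock needs nothing beyond the Python standard library, so it makes a convenient sketch of the aggregation idea. The helper names and the sample addresses (taken from the documentation ranges) are assumptions for illustration; grouping by ASN or geolocation would need an external lookup service and is not shown.

```python
import ipaddress
from collections import Counter

def netblock(ip, prefix):
    """Return the enclosing network of `ip` at the given prefix length, e.g. /24."""
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))

def blocks_per_netblock(blocked_ips, prefix):
    """Count firewall blocks per netblock, so neighbours share their reputation."""
    return Counter(netblock(ip, prefix) for ip in blocked_ips)

# Hypothetical source addresses seen in firewall block logs.
blocked = ["198.51.100.7", "198.51.100.200", "198.51.7.1", "203.0.113.5"]

by_24 = blocks_per_netblock(blocked, 24)
by_16 = blocks_per_netblock(blocked, 16)
```

The same list yields three /24 neighbourhoods but only two /16 ones, which shows how the chosen prefix controls how far "badness" spreads to neighbours.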

Map of the Internet

•  (Hilbert Curve)
•  Block port 22
•  2013-07-20

[Figure: Hilbert curve map of the IPv4 address space. Labels on the map: 0, 10, 127, "MULTICAST AND FRIENDS", "You are here!", CN, BR, TH, RU.]

Feature Intuition: Temporal Decay

•  Even bad neighborhoods renovate:
–  Attackers may change ISPs/proxies
–  Botnets may be shut down / relocate
–  A little paranoia is OK, but not EVERYONE is out to get you (at least not all at once)
•  As days pass, let's forget, bit by bit, who attacked
•  Last time I saw this actor, and how often did I see them
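"Forgetting bit by bit" can be sketched as a decayed count of sightings. The talk does not say which decay function was used; the exponential form and the 7-day half-life below are assumptions picked for illustration. The score captures both ideas on the slide: recent sightings weigh more, and frequent sightings add up.

```python
import math

def decayed_score(sightings_days_ago, half_life_days=7.0):
    """Sum of sightings, each weighted by 0.5 ** (age / half_life)."""
    rate = math.log(2) / half_life_days
    return sum(math.exp(-rate * age) for age in sightings_days_ago)

# Hypothetical actors: ages (in days) of each time they were blocked.
seen_today_only = decayed_score([0])
seen_last_week_only = decayed_score([7])
seen_repeatedly = decayed_score([0, 7, 14])
```

An actor blocked once a week ago scores half of one blocked today, and an actor that keeps coming back accumulates a higher score than either.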

MLSec Project

•  Behavior: block on port 22
•  Trial inference on 100k IP addresses per Class A subnet
•  Logarithm scale: brightest tiles are 10 to 1000 times more likely to attack.

Feature Intuition: DNS features

•  Who resolves to this IP address?
•  Number of domains that resolve to the IP address
•  Distribution of their lifetime
•  Entropy, size, ccTLDs
•  Registrar information
•  Reverse DNS information…
•  History of DNS registration…
•  (Thanks, DNSDB!)
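A few of the features listed above (number of domains, entropy, size, TLDs) can be computed from nothing more than the list of names that resolve to an address. The sketch below is an assumption about what such features might look like, not the project's actual feature set; the function names and sample domains are hypothetical, and lifetime, registrar and history features would need passive DNS data that is not modelled here.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Shannon entropy in bits per character; random-looking names score higher."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())

def dns_features(domains):
    """Summarise the domains resolving to one IP address as numeric features."""
    names = sorted({d.lower().rstrip(".") for d in domains})  # normalise, dedupe
    if not names:
        return {"domain_count": 0, "mean_entropy": 0.0, "mean_length": 0.0, "tlds": []}
    first_labels = [n.split(".")[0] for n in names]
    return {
        "domain_count": len(names),
        "mean_entropy": sum(shannon_entropy(l) for l in first_labels) / len(names),
        "mean_length": sum(len(n) for n in names) / len(names),
        "tlds": sorted({n.rsplit(".", 1)[-1] for n in names}),
    }

# Hypothetical names observed resolving to the same IP address.
features = dns_features(["example.com", "Example.com.", "xkq7zt.example.br"])
```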

Training the Model

•  YAY! We have a bunch of numbers per IP address/domain!
•  How do you define what is malicious or not?
•  “Advanced expertise in both information security and data science will be a necessary ingredient in enabling accurate discrimination between malicious and benign activity.”
- Anton Chuvakin, Gartner
•  Kinda easy for security tools (if you trust them)
•  Web application logs need deeper statistical analysis
•  Not normal / standard deviation thing

How do I get started on this?

•  Programming is a must (Python / R)
•  Statistical knowledge keeps you from making dumb mistakes
•  Specific machine learning courses and books:
–  Coursera (ML / Data Analysis / Data Science)
•  Practice, Practice, Practice:
–  Explore your data! (Security Onion)
–  Kaggle
–  KDD, VAST, VizSec

MLSec Project

•  Sign up, send logs, receive reports generated by machine learning models!
•  Working with several companies on trying out these models on their environment with their data
•  We are hiring (KINDA)
•  Visit https://www.mlsecproject.org, message @MLSecProject or just e-mail me.

MLSec Project - Current Research

•  Inbound attacks on exposed services (DEFCON/BH 2013):
–  Information from inbound connections on firewalls, IPS, WAFs
–  Feature extraction and supervised learning
•  Malware Distribution and Botnets:
–  Information from outbound connections on firewalls, DNS and Web Proxy
–  Initial labeling provided by intelligence feeds and AV/anti-malware
–  Semi-supervised learning involved
•  Kill-chain Ensemble Models:
–  Increased precision by composing different behaviors
–  Web server path -> go through Firewall, then IPS, then WAF
–  Early confirmation of attack failure or success
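One simple way to picture "composing different behaviors" along the Firewall -> IPS -> WAF path is to let each stage contribute a likelihood ratio and multiply them into the prior odds. This is a naive Bayes style sketch that assumes the stages are independent; the talk does not describe how the ensemble is actually built, and the prior and ratios below are invented for illustration.

```python
def chain_evidence(prior, likelihood_ratios):
    """Update a prior probability with one likelihood ratio per kill-chain stage."""
    odds = prior / (1 - prior)
    for lr in likelihood_ratios:
        odds *= lr  # assumption: stages are conditionally independent
    return odds / (1 + odds)

# Hypothetical: 1% prior that a source is attacking, then two stages agree.
after_firewall = chain_evidence(0.01, [10])
after_firewall_and_ips = chain_evidence(0.01, [10, 5])
```

Each stage that agrees raises the posterior, which is the sense in which chaining models gains confidence at every step; a stage with a ratio below 1 would lower it instead.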

Thanks!

•  Q&A?
•  Feedback?

Alexandre Pinto
@alexcpsec
@MLSecProject
https://www.mlsecproject.org/

"Essentially, all models are wrong, but some are useful."
- George E. P. Box
