Secure Because Math: A Deep-Dive on Machine Learning-Based Monitoring (#SecureBecauseMath)


Alex Pinto
Chief Data Scientist | MLSec Project
@alexcpsec
@MLSecProject

Alex Pinto

•  Chief Data Scientist at MLSec Project
•  Machine Learning Researcher and Trainer
•  Network security and incident response aficionado
•  Tortured by SIEMs as a child
•  Hacker Spirit Animal™: CAFFEINATED CAPYBARA!

whoami

(https://secure.flickr.com/photos/kobashi_san/)

•  Security Singularity
•  Some History
•  TLA
•  ML Marketing Patterns
•  Anomaly Detection
•  Classification
•  Buyer’s Guide
•  MLSec Project

Agenda

Security Singularity Approaches

(Side Note)

First hit on Google images for “Network Security Solved” is a picture of Jack Daniel!

Security Singularity Approaches

•  “Machine learning / math / algorithms… these terms are used interchangeably quite frequently.”
•  “Is behavioral baselining and anomaly detection part of this?”
•  “What about Big Data Security Analytics?”

(http://bigdatapix.tumblr.com/)

Are we even trying?

•  “Hyper-dimensional security analytics”
•  “3rd generation Artificial Intelligence”
•  “Secure because Math”
•  Lack of ability to differentiate hurts buyers, investors.
•  Are we even funding the right things?

Is this a communication issue?

Guess the Year!

•  “(…) behavior analysis system that enhances your network intelligence and security by auditing network flow data from existing infrastructure devices”
•  “Mathematical models (…) that determine baseline behavior across users and machines, detecting (…) anomalous and risky activities (…)”
•  “(…) maintains historical profiles of usage per user and raises an alarm when observed activity departs from established patterns of usage for an individual.”

A little history

•  Dorothy E. Denning (professor at the Department of Defense Analysis at the Naval Postgraduate School)
•  1986 (SRI) - First research that led to IDS
•  Intrusion Detection Expert System (IDES)
•  Already had statistical anomaly detection built-in
•  1993: Her colleagues release the Next Generation (!) IDES

Three Letter Acronyms - KDD

•  After the release of Bro (1998) and Snort (1999), DARPA thought we were covered for this signature thing
•  DARPA released datasets for user anomaly detection in 1998 and 1999
•  And then came the KDD-99 dataset: over 6200 citations on Google Scholar


Three Letter Acronyms - KDD

Not here to bash academia

A Probable Outcome

GRAD SCHOOL
FRESHMAN

ZOMG RESULTS !!11!1!
ZOMG! RESULTS???
MATH, STAHP!
MATH IS HARD, LET’S GO SHOPPING

ML Marketing Patterns

•  The “Has-beens”
•  Name is a bit harsh, but hey, you hardly use ML anymore, let us try it
•  The “Machine Learning ¯\_(ツ)_/¯”
•  Hey, that sounds cool, let’s put that in our brochure
•  The “Sweet Spot”
•  People that actually are trying to do something
•  Anomaly Detection vs. Classification

Anomaly Detection

•  Works wonders for well defined “industrial-like” processes.
•  Looking at single, consistently measured variables
•  Historical usage in financial fraud prevention.
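
As a rough illustration of why a single, consistently measured variable is the easy case, here is a minimal sketch assuming a plain z-score rule. The baseline readings, the threshold of 3 standard deviations, and the function name `is_anomalous` are illustrative assumptions, not something taken from the talk.

```python
import statistics


def is_anomalous(value, baseline, threshold=3.0):
    """Flag `value` if it lies more than `threshold` sample standard
    deviations away from the mean of the historical baseline.

    Assumes the baseline has at least two points and is not constant.
    """
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    return abs(value - mean) / stdev > threshold


# Hypothetical history of one well-behaved measurement (units are arbitrary).
baseline = [100, 102, 98, 101, 99, 100, 103, 97]

print(is_anomalous(150, baseline))  # far outside the historical range
print(is_anomalous(101, baseline))  # well within the historical range
```

With one stable variable, "normal" is cheap to estimate and the rule is easy to tune. That stops being true once many loosely related variables are monitored together.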

Anomaly Detection

•  What fits this mold?
•  Network/Netflow behavior analysis
•  User behavior analysis
•  What are the challenges?
•  Curse of Dimensionality
•  Lack of ground truth and normality poisoning
•  Hanlon’s Razor

AD: Curse of Dimensionality

•  We need “distances” to measure the features/variables
•  Usually Manhattan or Euclidean
•  For high-dimensional data, the distribution of distances between all pairwise points in the space becomes concentrated around an average distance.
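
The concentration effect can be observed with a small experiment. The sketch below assumes points drawn uniformly at random from the unit hypercube, which is an illustrative choice and not data from the talk. It measures the spread of all pairwise Euclidean distances relative to their mean; the function name and sample sizes are assumptions.

```python
import itertools
import math
import random
import statistics


def relative_spread(dim, n_points=100, seed=0):
    """Return std/mean of all pairwise Euclidean distances between
    `n_points` random points in the `dim`-dimensional unit hypercube."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [math.dist(a, b) for a, b in itertools.combinations(points, 2)]
    return statistics.pstdev(dists) / statistics.mean(dists)


for dim in (2, 10, 100, 1000):
    # The ratio shrinks as the dimension grows: distances bunch up
    # around the average, so "near" and "far" become hard to tell apart.
    print(dim, round(relative_spread(dim), 3))
```

A detector that relies on "this point is unusually far from the others" has less and less to work with as that ratio shrinks.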

AD: Curse of Dimensionality

•  The volume of the high dimensional sphere becomes negligible in relation to the volume of the high dimensional cube.
•  The practical result is that everything just seems too far away, and at similar distances.

(http://www.datasciencecentral.com/m/blogpost?id=6448529%3ABlogPost%3A175670)
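
The sphere-versus-cube claim can be checked numerically. This sketch uses the standard closed form for the volume of a unit-radius ball in d dimensions, pi^(d/2) / Gamma(d/2 + 1), and compares it with the cube of side 2 that encloses it. The function name is an assumption made for the example.

```python
import math


def ball_to_cube_ratio(dim):
    """Volume of the unit-radius ball divided by the volume of the
    enclosing cube (side length 2) in `dim` dimensions."""
    ball = math.pi ** (dim / 2) / math.gamma(dim / 2 + 1)
    cube = 2.0 ** dim
    return ball / cube


for dim in (2, 3, 10, 100):
    print(dim, ball_to_cube_ratio(dim))
```

In 2 dimensions the inscribed circle covers about 79% of the square; by 10 dimensions the ball covers well under 1% of the cube, and almost all of the volume sits in the corners.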

A Practical example

•  NetFlow data, company with n internal nodes.
•  2(n^2 - n) communication directions
•  2*2*2*65535(n^2 - n) measures of network activity
•  1000 nodes -> Half a trillion possible dimensions
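
The arithmetic on this slide is easy to reproduce. The sketch below simply evaluates the two formulas as stated; the function names are assumptions, and the factors are used exactly as given on the slide without interpreting them.

```python
def communication_directions(n):
    """2(n^2 - n), as stated on the slide."""
    return 2 * (n ** 2 - n)


def activity_measures(n):
    """2*2*2*65535(n^2 - n), as stated on the slide."""
    return 2 * 2 * 2 * 65535 * (n ** 2 - n)


n = 1000
print(f"{communication_directions(n):,}")  # 1,998,000
print(f"{activity_measures(n):,}")         # 523,755,720,000
```

For 1000 nodes that is roughly 0.52 trillion, which matches the "half a trillion possible dimensions" figure.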

Breaking the Curse

•  Different / creative distance metrics
•  Organizing the space into sub-manifolds where Euclidean distances make more sense.
•  Aggressive feature removal
•  A few interesting results available
AD:
Normality-‐poisoning
aGacks

•  Ground
Truth
(labels)
>>
Features
>>
Algorithms

•  There
is
no
(or
next
to
none)
Ground
Truth
in
AD

•  What
is
“normal”
in
your
environment?

•  Problem
asymmetry

•  Solu2ons
are
biased
to
the
prevalent
class

•  Very
hard
to
ﬁne-‐tune,
becomes
prone
to
a
lot
of
false

nega2ves
or
false
posi2ves
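
The asymmetry point can be made concrete with a short base-rate calculation. The numbers below (0.1% of events malicious, a detector with a 99% true positive rate and a 1% false positive rate) are illustrative assumptions chosen for the example, not figures from the talk.

```python
def alert_precision(prevalence, tpr, fpr):
    """Fraction of raised alerts that are true positives (Bayes' rule)."""
    true_alerts = prevalence * tpr
    false_alerts = (1 - prevalence) * fpr
    return true_alerts / (true_alerts + false_alerts)


def always_normal_accuracy(prevalence):
    """Accuracy of a useless model that labels everything as normal."""
    return 1 - prevalence


prevalence = 0.001  # assumed: 1 in 1000 events is malicious
print(round(alert_precision(prevalence, tpr=0.99, fpr=0.01), 3))
print(always_normal_accuracy(prevalence))
```

Under these assumptions only about 9% of alerts are real, while a model that never alerts still scores 99.9% accuracy. This is why a bias toward the prevalent class is easy to fall into and hard to tune away.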

AD: Normality-poisoning attacks

AD: Hanlon’s Razor

Never attribute to malice that which is adequately explained by stupidity.

AD: Hanlon’s Razor

Evil Hacker vs. Hipster Developer (a.k.a. Matt Johansen)

What about User Behavior?

•  Surprise, it kinda works! (as supervised, that is)
•  As specific implementations for specific solutions
•  Good stuff from Square, AirBnB
•  Well defined scope and labeling.
•  Can it be general enough?
•  File exfiltration example (roles/info classification are mandatory?)
•  Can I “average out” user behaviors in different applications?

•  Lots of available academic research around this
•  Classification and clustering of malware samples
•  More success in classifying artifacts you already know to be malware than in actually detecting it. (Lineage)
•  State of the art? My guess is AV companies!
•  All of them have an absurd amount of samples
•  Have been researching and consolidating data on them for decades.

Lots of Malware Activity

•  Can we do better than “AV Heuristics”?
•  Lots and lots of available data that has been made public
•  Some of the papers also suffer from potentially bad ground truth.

Lots of Malware Activity

VS

Lots of Malware Activity

VS

Everyone makes mistakes!

•  Private Beta of our Threat Intelligence-based models:
•  Some use TI indicator feeds as blocklists
•  More mature companies use the feeds to learn about the threats (Trained professionals only)
•  Our models extrapolate the knowledge of existing threat intelligence feeds as those experienced analysts would.
•  Supervised model w/ same data analyst has
•  Seeded labeling from TI feeds

How is it going then, Alex?

•  Very effective first triage for SOCs and Incident Responders
•  Send us: log data from firewalls, DNS, web proxies
•  Receive: Report with a short list of potentially compromised machines
•  Would you rather download all the feeds and integrate them yourself?
•  MLSecProject/Combine
•  MLSecProject/TIQ-test

Yeah, but why should I care?

•  Huge amounts of TI feeds available now (open/commercial)
•  Non-malicious samples still challenging, but we have expanded to a lot of collection techniques from different sources.
•  Very high-ranked Alexa / Quantcast / OpenDNS Random domains as seeds for search of trust
•  Helped by the customer logs as well in a semi-supervised fashion

What about the Ground Truth (labels)?
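
To illustrate the general idea of seeded labeling, here is a minimal sketch. It is not the MLSec Project implementation: the function name, the label strings, and the example domains are all hypothetical. Indicators found in a threat intelligence feed seed the malicious class, entries from a high-ranked domain list seed the benign class, and everything else stays unlabeled for a model (or an analyst) to resolve later.

```python
def seed_labels(observed, ti_indicators, trusted):
    """Assign seed labels to observed domains.

    Returns a dict mapping each observed domain to "malicious", "benign",
    or None (unlabeled). TI matches take priority over the trusted list.
    """
    labels = {}
    for domain in observed:
        if domain in ti_indicators:
            labels[domain] = "malicious"
        elif domain in trusted:
            labels[domain] = "benign"
        else:
            labels[domain] = None
    return labels


# Hypothetical inputs using reserved example domains.
ti_indicators = {"bad.example", "c2.example"}
trusted = {"popular.example", "search.example"}
observed = ["bad.example", "popular.example", "unknown.example"]

print(seed_labels(observed, ti_indicators, trusted))
```

The unlabeled remainder is where the semi-supervised step mentioned above would come in. The quality of both seed lists bounds the quality of any model trained on them.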

•  Vast majority of features are derived from structural/intrinsic data:
•  GeoIP, ASN information, BGP Prefixes
•  pDNS information for the IP addresses, hostnames
•  WHOIS information
•  Attacker can’t change those things without cost.
•  Log data from the customer can, of course. But this does not make it worse than a human specialist.

But what about data tampering?

•  False positives / false negatives are an intrinsic part of ML.
•  “False positives are very good, and would have fooled our human analysts at first.”
•  Their feedback helps us improve the models for everyone.
•  Remember it is about initial triage. A Tier-2/Tier-3 analyst must investigate and provide feedback to the model.

And what about false positives?

•  1) What are you trying to achieve with adding Machine Learning to the solution?
•  2) What are the sources of Ground Truth for your models?
•  3) How can you protect the features / ground truth from adversaries?
•  4) How does the solution/processes around it handle false positives?

Buyer’s Guide

#NotAllAlgorithms

Buyer’s Guide

MLSec Project

•  Don’t take my word for it! Try it out!!
•  Help us test and improve the models!
•  Looking for participants and data sharing agreements
•  Limited capacity at the moment, so be patient. :)
•  Visit https://www.mlsecproject.org, message @MLSecProject or just e-mail me.

Thanks!

•  Q&A?
•  Don’t forget the feedback!

Alex Pinto
@alexcpsec
@MLSecProject

“We are drowning in information and starved for knowledge”
- John Naisbitt
