Machine Learning

Geoff Bernard
Solution Architect, Elastic
Machine Learning & Graph
Automated Anomaly Detection with the Elastic Stack

Overview
• The Elastic Stack
• Machine Learning
• Graph

Elastic Stack
• Store, Search, & Analyze: Elasticsearch
• Visualize & Manage: Kibana
• Ingest: Beats, Logstash
• Solutions: Metrics, Logging, APM, Site Search, Application Search, Business Analytics, Enterprise Search, Security Analytics, Future Solutions
• Deployment: SaaS (Elastic Cloud); Self Managed (Elastic Cloud Enterprise, Standalone)

Anomaly Detection
1) When an entity's behavior changes significantly and suddenly
2) When an entity is drastically different from others within a population

Example 1) Anomalies in temporal pattern
Single (univariate) time series
Example: Is there unusual traffic on the website?
[Chart: metric plotted over time]
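
The temporal case can be sketched in a few lines. This is a minimal illustration using a rolling z-score, not the model that Elastic machine learning builds; the function name, window, threshold, and traffic values are all hypothetical.

```python
from statistics import mean, stdev

def rolling_zscore_anomalies(values, window=12, threshold=3.0):
    """Return indices whose value deviates strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            continue  # flat history: z-score is undefined, so skip this point
        z = (values[i] - mu) / sigma
        if abs(z) > threshold:
            anomalies.append(i)
    return anomalies

# Hypothetical requests-per-minute series with one sudden spike at index 12.
traffic = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98, 101, 99, 400, 100, 101]
print(rolling_zscore_anomalies(traffic))
```

A fixed window and threshold are the simplest possible choice; they ignore seasonality and trend, which is exactly what a learned model is meant to handle.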

Example 2) Outliers in population
Detect an unusual population member
Example:
Which IP address is not like the others?
(indication of a bot / attacker)
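
The population case compares each member against its peers rather than against its own history. The sketch below uses a median-based (modified z-score) test as a stand-in; it is an assumption for illustration, not the Elastic population analysis, and the IP addresses and counts are made up.

```python
from statistics import median

def population_outliers(counts, threshold=3.5):
    """Flag members whose value is far from the population median."""
    values = list(counts.values())
    med = median(values)
    # Median absolute deviation: a spread estimate that one outlier cannot inflate.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread in the population, nothing to compare against
    return sorted(
        key for key, v in counts.items()
        if 0.6745 * abs(v - med) / mad > threshold
    )

# Hypothetical request counts per source IP (documentation address range).
requests_per_ip = {
    "192.0.2.1": 40, "192.0.2.2": 42, "192.0.2.3": 38,
    "192.0.2.4": 41, "192.0.2.5": 39, "192.0.2.6": 900,
}
print(population_outliers(requests_per_ip))
```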

Why is Automated Anomaly Detection Needed?
“I’m not actively watching my data”
“There’s a lot of data I’m just ignoring”
“I only want to know if something is weird or if it changes”
“I don’t know if any machine has been compromised”

Learn, Predict, Operationalize
Expected value @ 15:05 = 1859

Anomalies in your data could indicate trouble
IT Operational Analytics
• Spike of errors in logs
• Failing Device
• Incorrect Config Change
Security Analytics
• Unusual Network Activity
• Malware Exfiltrating Data
• Rogue Insider
Business Analytics
• Sudden Dip in Revenue
• Operational Issue
• Payment Processor Problem

Top Use Cases

IT Ops
Unusual log volume by “type”
• spike of error counts in logs
• unusual or rare log messages
Metric/KPI analysis
• host/system metrics
• APM data
• KPI analysis (orders, transactions, etc.)
Holistic/360-degree monitoring
• anomaly detection across disparate, but related data sets
• easier root cause analysis

Cyber Security
Unusual Access/Usage/Authentication
• unusual login activity
• unusual process invocation
• abuse, spamming, snooping
Covert Communication / Exfiltration
• DNS tunneling
• port scanning
• unusual bytes outbound to rare destinations
Intrusion Detection Filtering
• finding unusual patterns in IDS events

What is Graph?
• Discover relationships in your Elasticsearch data
• Combines graph algorithms and search
• Explore data using relevancy
• Detect fraud
• Recommendation engine
• Kibana UI to interact with graphs
• Select relationships to explore
• Find new or existing connections in the data
• Visualize results
• API alternative available

Why Graph?
• Uses existing Elasticsearch indexes
• No need to reindex
• No need to change data models
• Start exploring data today!
• Simple architecture
• Scales with Elasticsearch cluster
• Combine the power of search, relevancy and graph databases
• Relevancy allows you to find "uncommonly common" graph relationships between data

Graphs
Graphs are a set of "vertices" and the "connections" between them.
Example from Stack Overflow:
- JavaScript tag and related tags
- Java tag and related tags

Graphs
Example from Stack Overflow: JavaScript cluster from prior slide and connected title words
Graphs are a set of "vertices" and the "connections" between them.

Graphs
• An indexed value is a potential vertex
• single value fields such as an email address which can link to other fields
• array fields such as tags which can link to themselves and other fields
• A relationship forms a connection
• Not persisted in Elasticsearch
• Uses aggregation framework to search and connect the vertices
• Graph Traversal
• Graph traversal algorithms prioritize finding meaningful connections in the data…

How does Graph find / infer relationships?
By analyzing co-occurrence of terms in your documents

Example: Inferring relationships from co-occurrence
Music Recommendation
Users that liked Vivaldi also liked ???
{ "user_id" : "1", "liked" : ["vivaldi", "brahms", "schubert"] }
{ "user_id" : "2", "liked" : ["vivaldi", "brahms", "bach"] }
...
{ "user_id" : "200000", ... }

Example: Inferring relationships from co-occurrence
Cyber threat hunting
Web Access Logs: Sample Document
{
...
"timestamp" : "April 4th 2016, 02:22:09" ,
"request_url" : "/blog/wp-admin" ,
"source_ip" : "165.98.197.10" ,
"status" : 404 ,
"bytes" : 20 ,
...
}
Strong correlation between this IP address and a particular attack vector

Example: Inferring relationships from co-occurrence
Fraud Detection
Card Transactions Dataset: Sample Document
{
"timestamp" : "April 4th 2016, 02:22:09" ,
"trans_id" : "abc1089787" ,
"vendor" : "Sam Deli" ,
"isFraud" : true ,
"location" : "San Francisco" ,
"amount" : 5.71 ,
...
}
Strong correlation between fraudulent charges and the vendor "Sam Deli"

Example: Inferring relationships from co-occurrence
Music Recommendation
Users that liked Vivaldi also liked ???
...
{ "user_id" : "200000", "liked" : ["vivaldi", "brahms", "beatles"] }
[Graph: Vivaldi, Beatles, Brahms, Bach]

Example: Inferring relationships from co-occurrence
Music Recommendation
Users that liked Metallica also liked ???
...
{ "user_id" : "200000", "liked" : ["metallica", "AC/DC", "beatles"] }
[Graph: Metallica, Beatles, AC/DC, Iron Maiden, Megadeth]
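
Raw co-occurrence counting, the starting point for the music examples above, can be sketched as follows. The `also_liked` helper and user IDs "3" and "4" are hypothetical; the `liked` values follow the sample documents on the slides. Note that Beatles shows up for both Vivaldi and Metallica, which is why raw counts alone are not enough.

```python
from collections import Counter

def also_liked(docs, artist):
    """Count how often other artists appear in the same document as `artist`."""
    counts = Counter()
    for doc in docs:
        liked = set(doc["liked"])
        if artist in liked:
            counts.update(liked - {artist})
    return counts

docs = [
    {"user_id": "1", "liked": ["vivaldi", "brahms", "schubert"]},
    {"user_id": "2", "liked": ["vivaldi", "brahms", "bach"]},
    {"user_id": "3", "liked": ["vivaldi", "brahms", "beatles"]},    # hypothetical ID
    {"user_id": "4", "liked": ["metallica", "AC/DC", "beatles"]},   # hypothetical ID
]

print(also_liked(docs, "vivaldi").most_common())
print(also_liked(docs, "metallica").most_common())
```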

How does Graph inject relevance?
Using math and search features to subtract background signal and surface only the meaningful relationships
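
One way to picture "subtracting background signal" is to compare how often a term appears in the matching documents (foreground) with how often it appears in the whole index (background). The sketch below follows the shape of the JLH heuristic documented for the Elasticsearch significant terms aggregation; treating it as the exact scoring Graph applies is an assumption, and all counts are invented.

```python
def jlh_score(fg_count, fg_total, bg_count, bg_total):
    """Score a term by absolute change times relative change in frequency."""
    fg = fg_count / fg_total  # frequency among documents matching the query
    bg = bg_count / bg_total  # frequency across the whole index
    if bg == 0 or fg <= bg:
        return 0.0  # no more common than the background: not significant
    return (fg - bg) * (fg / bg)

# Hypothetical counts: 1,000 Vivaldi fans out of 200,000 users.
# Beatles: liked by half of everyone, and by half of Vivaldi fans.
beatles = jlh_score(500, 1000, 100_000, 200_000)
# Bach: liked by 1% of everyone, but by 30% of Vivaldi fans.
bach = jlh_score(300, 1000, 2_000, 200_000)
print(beatles, bach)
```

A popular term such as Beatles scores zero because it is just as common in the background, while Bach is "uncommonly common" among Vivaldi fans and scores high.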

What does meaningful mean?
• Super nodes (Super Connectors)
• Traditional graphs get distorted by "super nodes"
• They frequently include these heavily connected vertices during exploration which can
distort finding relevant connections
• When storing connections instead of computing them on-the-fly this becomes a major
issue
• Wisdom of crowds
• Use sampling and diversity settings to choose which signals we want to summarize
• Provide a more personalized form of recommendation…

Personalized recommendations
• Many approaches store edges so they can retrieve an answer for
questions like:
• "people who searched for X tend to click on product Y"
• This provides only a single interpretation of these events
• Elastic Graph can find the answer to these questions by searching:
• "people who searched for X (and ideally were females in London with an age range of 25-40 with interests in product Z), tend to click on product Y"
• This is computed using aggregations and search on top of proven graph algorithms
• Change the criteria and potentially get a different answer

Click data example
• Dataset where every search and the product a user clicked on is recorded
as a document:
• Can we use this to recommend products to users who search for similar
terms?
• "Oh, you're searching for 'midi' and you're this type of user. Other users similar to you who searched for 'midi' showed a significant correlation with these other products…"
{
  "_id": "AU9ADwcEN-SyHOpO8lp4",
  "category": "pcmcat152100050032",
  "query_time": "2011-08-20 15:05:49",
  "product": "1005145",
  "ids": [
    "P[1005145]",
    "Q[midi]",
    "C[pcmcat152100050032]"
  ],
  "user": "362d844db99baefa4e51d630b9724bc3e767c4aa",
  "query": "midi"
}

Basic use
• REST interface accepts user graph-exploration criteria as JSON:
• Find product codes that are significantly associated with searches for "midi" and further, show
other queries that led people to these products
• Internally a number of searches with aggregations are then made to build the graph
POST clicklogs/_graph/explore
{
"query": {
"match": {
"query.raw": "midi"
}
},
"vertices": [
{
"field": "product"
}
],
"connections": {
"vertices": [
{
"field": "query.raw"
}
]
}
}
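
When the request is built from application code, the same body can be assembled programmatically. This is a small sketch only: `build_explore_request` is a hypothetical helper, and sending the body to the `clicklogs/_graph/explore` endpoint shown above is left to whichever HTTP client you use.

```python
import json

def build_explore_request(query_field, query_text, seed_field, connection_field):
    """Assemble a graph-explore body: seed vertices plus one hop of connections."""
    return {
        "query": {"match": {query_field: query_text}},
        "vertices": [{"field": seed_field}],
        "connections": {"vertices": [{"field": connection_field}]},
    }

# Same criteria as the slide: products associated with "midi" searches,
# then other queries that led to those products.
body = build_explore_request("query.raw", "midi", "product", "query.raw")
print(json.dumps(body, indent=2))
```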

Basic use
• Potential JSON response:
{
"vertices": [
{
"field": "query.raw",
"term": "midi cable",
"weight": 0.08745858139552132,
"depth": 1
},
{
"field": "product",
"term": "8567446",
"weight": 0.13247784285434397,
"depth": 0
},
{
"field": "product",
"term": "1112375",
"weight": 0.018600718471158982,
"depth": 0
},
{
"field": "query.raw",
"term": "midi keyboard",
"weight": 0.04802242866755111,
"depth": 1
}
],
"connections": [
{
"source": 0,
"target": 1,
"weight": 0.04802242866755111,
"doc_count": 13
},
{
"source": 2,
"target": 3,
"weight": 0.08120623870976627,
"doc_count": 23
}
]
}
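
In a response like this, each connection's `source` and `target` are positions in the `vertices` array. A minimal sketch of turning them back into readable term pairs, using a trimmed copy of the response above (`resolve_connections` is a hypothetical helper name):

```python
def resolve_connections(response):
    """Translate source/target positions into (term, term, doc_count) tuples."""
    vertices = response["vertices"]
    edges = []
    for conn in response["connections"]:
        source = vertices[conn["source"]]
        target = vertices[conn["target"]]
        edges.append((source["term"], target["term"], conn["doc_count"]))
    return edges

# Trimmed copy of the sample response (weights omitted for brevity).
response = {
    "vertices": [
        {"field": "query.raw", "term": "midi cable", "depth": 1},
        {"field": "product", "term": "8567446", "depth": 0},
        {"field": "product", "term": "1112375", "depth": 0},
        {"field": "query.raw", "term": "midi keyboard", "depth": 1},
    ],
    "connections": [
        {"source": 0, "target": 1, "doc_count": 13},
        {"source": 2, "target": 3, "doc_count": 23},
    ],
}

for edge in resolve_connections(response):
    print(edge)
```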

Expressed visually
• Kibana plug-in makes manually exploring graph data simple:
Note: Graph expanded from
previous query

Query controls
• Add controls to the query to tune graph results and performance:
Control: Description
use_significance: Filters associated terms to only those that are significantly associated with the query (defaults to true)
sample_size: Each "hop" considers a sample of the best-matching documents on each shard (default is 100 documents). Using samples has the dual benefit of keeping exploration focused on meaningfully connected terms and improving the speed of execution.
timeout: Time in milliseconds after which exploration is halted and the results gathered so far are returned
sample_diversity: To avoid the sample of top-matching documents being dominated by a single source of results, it can sometimes be necessary to request diversity in the sample

Diversity setting tip
• Best to use on high cardinality fields
– A boolean field is a poor choice for the diversity setting
• Make sure the diversity setting isn’t limiting the sample size to just a few
documents
– A max docs per field of 5 on a boolean field would result in only 10 documents per shard being
examined to generate the graph.
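
The arithmetic behind this tip can be written down directly: the diversity setting caps the sample at the number of distinct field values times the maximum documents per value. A small sketch, assuming that simple upper bound (the function name is hypothetical):

```python
def max_docs_sampled_per_shard(sample_size, distinct_values, max_docs_per_value):
    """Upper bound on documents examined per shard under a diversity setting."""
    return min(sample_size, distinct_values * max_docs_per_value)

# Boolean field (2 distinct values), max 5 docs per value, default sample of 100.
print(max_docs_sampled_per_shard(100, 2, 5))

# A field with many distinct values leaves the sample size as the limit.
print(max_docs_sampled_per_shard(2000, 40, 500))
```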

Vertices controls
• You can control each vertex's settings too:
Control: Description
size: Number of vertex terms returned for each field (defaults to 5)
min_doc_count: Acts as a certainty threshold. How many documents have to contain a pair of terms before we consider this to be a useful connection? (default is 3)
shard_min_doc_count: Advanced setting. How many documents on a shard have to contain a pair of terms before we return this for global consideration? (default is 2)

Excluding terms
• You can request that certain terms not be part of the graph:
{
"vertices": [
{
"field": "product",
"include": [ "1854873" ]
}
],
"connections": {
"vertices": [
{
"field": "query.raw",
"exclude": [
"midi keyboard",
"midi",
"synth"
]
}
]
}
}

Connecting Existing terms
• You can pass known vertices to the API to search for connections between them
{
"vertices": [
{
"field": "product",
"include": [
{
"term": "1854873"
}
]
}
],
"connections": {
"vertices": [
{
"field": "query.raw",
"include": [
"midi keyboard",
"midi",
"synth"
]
}
]
},
"query": {
"bool": {
"minimum_should_match": 2,
"should": [
{ "term": { "product": "1854873" } },
{ "term": { "query.raw": "midi keyboard" } },
{ "term": { "query.raw": "midi" } },
{ "term": { "query.raw": "synth" } }
]
}
}
}

Results for connecting existing terms
• Potential JSON response:
{
"took": 201,
"timed_out": false,
"failures": [],
"vertices": [
{
"field": "product",
"term": "1854873",
"weight": 1,
"depth": 0
},
{
"field": "query.raw",
"term": "midi keyboard",
"weight": 0.22873449033036267,
"depth": 1
},
{
"field": "query.raw",
"term": "midi",
"weight": 0.0630400960897268,
"depth": 1
},
{
"field": "query.raw",
"term": "synth",
"weight": 0.6582254135799106,
"depth": 1
}
],
"connections": [
{
"source": 0,
"target": 1,
"weight": 0.22873449033036267,
"doc_count": 7
},
{
"source": 0,
"target": 2,
"weight": 0.0630400960897268,
"doc_count": 5
},
{
"source": 0,
"target": 3,
"weight": 0.6582254135799106,
"doc_count": 3
}
]
}

Connection filtering
• Filter which connections are allowed:
…
"connections": {
"query": {
"bool": {
"filter": [
{
"range": {
"query_time": {
"gte": "2015-10-01 00:00:00"
}
}
}
]
}
},
"vertices": [
{
"field": "query.raw",
"size": 5,
"min_doc_count": 10,
"shard_min_doc_count": 3
}
]
}
}

Putting it together
• Connect query.raw to product since 2015-10-01 where midi is relevant:
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "query.raw": "midi"
        }
      },
      "filter": [
        {
          "range": {
            "query_time": {
              "gte": "2015-10-01 00:00:00"
            }
          }
        }
      ]
    }
  },
  "controls": {
    "use_significance": true,
    "sample_size": 2000,
    "timeout": 2000,
    "sample_diversity": {
      "field": "category.raw",
      "max_docs_per_value": 500
    }
  },
  "vertices": [
    {
      "field": "product",
      "size": 5
    }
  ],
  "connections": {
    "query": {
      "bool": {
        "filter": [
          {
            "range": {
              "query_time": {
                "gte": "2015-10-01 00:00:00"
              }
            }
          }
        ]
      }
    },
    "vertices": [
      {
        "field": "query.raw",
        "size": 5
      }
    ]
  }
}

Manual exploration
• The graph APIs can be used to programmatically find meaningful relations
in your application. But, sometimes you want to explore your data
interactively
• Graph plug-in for Kibana allows you to quickly query and will display the
results visually
• Easy to work with your indices visually and begin to find new insights
• Choose colors and icons for vertices
• Save and Load workspaces
• Undo and Redo
• Advanced settings and groupings

Graph UI
Controls: Index, Field, Query, Expand, Settings
Works well on a single array field to find
connections between values

Settings
• Sampler aggregation size
• Should significant terms be used?
• Ensure sample has diversity

Request & Response/Blacklist/Drill-downs

Add fields
1: Add a field
2: Select and repeat to add more than one
3: Configure field
4: Shift-click to turn on or off
click to configure field

Selection expansion
Hold shift to select multiple fields
Click to expand selected fields

Find links between terms/vertices
New links
Click to add links

Connections
Click on a connection to see the strength between vertices
