Geoff Bernard
Solution Architect, Elastic
Machine Learning & Graph
Automated Anomaly Detection with the Elastic Stack
Overview
•The Elastic Stack
•Machine Learning
•Graph
The Elastic Stack
Elastic Stack
Elasticsearch: Store, Search, & Analyze
Kibana: Visualize & Manage
Beats, Logstash: Ingest
Metrics
Logging
APM
Site Search
Application Search
Business Analytics
Enterprise Search
Security Analytics
Future Solutions
Deployment
SaaS: Elastic Cloud
Self Managed: Elastic Cloud Enterprise, Standalone
Machine Learning
Machine Learning
Anomaly Detection
1) When an entity’s
behavior changes
significantly and suddenly
2) When an entity is
drastically different from
others within a population
Example 1) Anomalies in temporal pattern
Single (univariate) time series
Example: Is there unusual traffic on the website?
(Chart: Metric over Time)
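A minimal sketch of the idea behind Example 1, assuming a simple rolling z-score as a stand-in for the modeling that Elastic Machine Learning actually performs. The function name, window size, threshold, and traffic numbers are all hypothetical.

```python
from statistics import mean, stdev

def flag_anomalies(series, window=5, threshold=3.0):
    """Return indices whose value deviates strongly from the preceding window."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # A perfectly flat window: any change at all counts as unusual.
            if series[i] != mu:
                flagged.append(i)
            continue
        if abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# Hypothetical requests-per-minute values with one sudden spike.
requests_per_minute = [100, 102, 98, 101, 99, 103, 100, 450, 101, 99]
print(flag_anomalies(requests_per_minute))
```

The sketch only compares each point with its recent past. It does not learn seasonality or trends, which a production detector would need.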
Example 2) Outliers in population
Detect an unusual population member
Example:
Which IP address is not like the others?
(indication of a bot / attacker)
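A rough sketch of Example 2, assuming that "not like the others" can be approximated by comparing each member with the population median. This is an illustrative simplification, not the population analysis Elastic Machine Learning performs, and the request counts and the factor of 10 are made up.

```python
from statistics import median

def population_outliers(counts_by_member, factor=10.0):
    """Return members whose count exceeds `factor` times the population median."""
    typical = median(counts_by_member.values())
    return sorted(
        member for member, count in counts_by_member.items()
        if count > factor * typical
    )

# Hypothetical request counts per source IP over the same time span.
requests_by_ip = {
    "10.0.0.1": 12,
    "10.0.0.2": 15,
    "10.0.0.3": 9,
    "10.0.0.4": 14,
    "165.98.197.10": 4200,
}
print(population_outliers(requests_by_ip))
```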
Why is Automated Anomaly
Detection Needed?
“I’m not actively watching my data”
“There’s a lot of data I’m just ignoring”
“I only want to know if something is weird or if it changes”
“I don’t know if any machine has been compromised”
Predict
Expected value @ 15:05 = 1859
Learn Operationalize
Anomalies in your data could indicate trouble
IT Operational Analytics
• Spike of errors in logs
• Failing Device
• Incorrect Config Change
Security Analytics
• Unusual Network Activity
• Malware Exfiltrating Data
• Rogue Insider
Business Analytics
• Sudden Dip in Revenue
• Operational Issue
• Payment Processor Problem
ML Demo
Top Use Cases
IT Ops
Unusual log volume by “type”
• spike of error counts in logs
• unusual or rare log messages
Metric/KPI analysis
• host/system metrics
• APM data
• KPI analysis (orders, transactions, etc.)
Holistic/360-degree monitoring
• anomaly detection across disparate, but related data sets
• easier root cause analysis
Cyber Security
Unusual Access/Usage/Authentication
• unusual login activity
• unusual process invocation
• abuse, spamming, snooping
Covert Communication / Exfiltration
• DNS tunneling
• port scanning
• unusual bytes outbound to rare destinations
Intrusion Detection Filtering
• finding unusual patterns in IDS events
Graph
What is Graph?
• Discover relationships in your Elasticsearch data
• Combines graph algorithms and search
• Explore data using relevancy
• Detect fraud
• Recommendation engine
• Kibana UI to interact with graphs
• Select relationships to explore
• Find new or existing connections in the data
• Visualize results
• API alternative available
Why Graph?
• Uses existing Elasticsearch indexes
• No need to reindex
• No need to change data models
• Start exploring data today!
• Simple architecture
• Scales with Elasticsearch cluster
• Combine the power of search, relevancy and graph databases
• Relevancy allows you to find "uncommonly common" graph relationships between data
Graphs
Graphs are a set of "vertices"
and the "connections"
between them.
Example from Stack Overflow
- JavaScript tag and related tags
- Java tag and related tags
Graphs
Example from Stack Overflow
JavaScript cluster from prior slide
and connected title words
Graphs are a set of "vertices"
and the "connections"
between them.
Graphs
• An indexed value is a potential vertex
• single value fields such as an email address which can link to other fields
• array fields such as tags which can link to themselves and other fields
• A relationship forms a connection
• Not persisted in Elasticsearch
• Uses aggregation framework to search and connect the vertices
• Graph Traversal
• Graph traversal algorithms prioritize finding meaningful connections in the data…
How does Graph find / infer relationships?
By analyzing co-occurrence of terms in your documents
Example: Inferring relationships from co-occurrence
Music Recommendation
Users that liked Vivaldi also liked ???
{ "user_id": "1", "liked": ["vivaldi", "brahms", "schubert"] }
{ "user_id": "2", "liked": ["vivaldi", "brahms", "bach"] }
{ "user_id": "2", "liked": ["vivaldi", "brahms", "bach"] }
{ "user_id": "2", "liked": ["vivaldi", "brahms", "bach"] }
{ "user_id": "2", "liked": ["vivaldi", "brahms", "bach"] }
{ "user_id": "200000", "liked": ["vivaldi", "brahms", "bach"] }
{ "user_id": "1", "liked": ["vivaldi", "brahms", "schubert"] }
{ "user_id": "2", "liked": ["vivaldi", "brahms", "bach"] }
{ "user_id": "2", "liked": ["vivaldi", "brahms", "bach"] }
{ "user_id": "2", "liked": ["vivaldi", "brahms", "bach"] }
{ "user_id": "2", "liked": ["vivaldi", "brahms", "bach"] }
{ "user_id": "200000", "liked": ["vivaldi", "brahms", "bach"] }
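Counting co-occurrence for the music example can be sketched as follows. This is a plain in-memory illustration of the idea, not how Elasticsearch computes it, and the helper name `also_liked` is hypothetical. The three sample documents follow the shape of the ones on the slide.

```python
from collections import Counter

def also_liked(docs, seed):
    """Count how often other artists appear in documents that contain `seed`."""
    counts = Counter()
    for doc in docs:
        liked = set(doc["liked"])
        if seed in liked:
            counts.update(liked - {seed})
    return counts

docs = [
    {"user_id": "1", "liked": ["vivaldi", "brahms", "schubert"]},
    {"user_id": "2", "liked": ["vivaldi", "brahms", "bach"]},
    {"user_id": "200000", "liked": ["vivaldi", "brahms", "bach"]},
]
print(also_liked(docs, "vivaldi").most_common())
```

Raw counts like these favor whatever is popular overall, which is why the later slides add relevance on top of co-occurrence.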
Example: Inferring relationships from co-occurrence
{
...
"timestamp" : "April 4th 2016, 02:22:09" ,
"request_url" : "/blog/wp-admin" ,
"source_ip" : "165.98.197.10" ,
"status" : 404 ,
"bytes" : 20 ,
...
}
Cyber threat hunting
Strong correlation between this IP address
and particular attack vector
Web Access Logs
Sample Document
Example: Inferring relationships from co-occurrence
{
"timestamp" : "April 4th 2016, 02:22:09" ,
"trans_id" : "abc1089787" ,
"vendor" : "Sam Deli" ,
"isFraud" : true ,
"location" : "San Francisco" ,
"amount" : 5.71 ,
...
}
Fraud Detection
Strong correlation between fraudulent
charges and “Sam’s deli”
Card Transactions Dataset
Sample Document
Understanding relevance
Users that liked Vivaldi also liked ???
Example: Inferring relationships from co-occurrence
Music Recommendation
{ "user_id": "1", "liked": ["vivaldi", "brahms", "schubert"] }
{ "user_id": "2", "liked": ["vivaldi", "brahms", "bach"] }
{ "user_id": "2", "liked": ["vivaldi", "brahms", "bach"] }
{ "user_id": "2", "liked": ["vivaldi", "brahms", "bach"] }
{ "user_id": "2", "liked": ["vivaldi", "brahms", "bach"] }
{ "user_id": "200000", "liked": ["vivaldi", "brahms", "bach"] }
{ "user_id": "1", "liked": ["vivaldi", "brahms", "schubert"] }
{ "user_id": "2", "liked": ["vivaldi", "brahms", "bach"] }
{ "user_id": "2", "liked": ["vivaldi", "brahms", "bach"] }
{ "user_id": "2", "liked": ["vivaldi", "brahms", "bach"] }
{ "user_id": "2", "liked": ["vivaldi", "brahms", "bach"] }
{ "user_id": "200000", "liked": ["vivaldi", "brahms", "beatles"] }
Vivaldi
Beatles
Brahms
Bach
Users that liked Metallica also liked ???
Example: Inferring relationships from co-occurrence
Music Recommendation
{ "user_id": "1", "liked": ["vivaldi", "brahms", "schubert"] }
{ "user_id": "2", "liked": ["vivaldi", "brahms", "bach"] }
{ "user_id": "2", "liked": ["vivaldi", "brahms", "bach"] }
{ "user_id": "2", "liked": ["vivaldi", "brahms", "bach"] }
{ "user_id": "2", "liked": ["vivaldi", "brahms", "bach"] }
{ "user_id": "200000", "liked": ["vivaldi", "brahms", "bach"] }
{ "user_id": "1", "liked": ["vivaldi", "brahms", "schubert"] }
{ "user_id": "2", "liked": ["vivaldi", "brahms", "bach"] }
{ "user_id": "2", "liked": ["vivaldi", "brahms", "bach"] }
{ "user_id": "2", "liked": ["vivaldi", "brahms", "bach"] }
{ "user_id": "2", "liked": ["vivaldi", "brahms", "bach"] }
{ "user_id": "200000", "liked": ["metallica", "AC/DC", "beatles"] }
Metallica
Beatles
AC/DC
Iron Maiden
Megadeth
How does Graph inject relevance?
Using math and search features to subtract background signal to surface only the
meaningful relationships
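One way to picture "subtracting background signal" is to compare how common a term is among the matching documents (foreground) with how common it is in the whole index (background). The sketch below uses a formula modeled on the JLH heuristic described for the Elasticsearch significant terms aggregation. That Graph scores terms exactly this way is an assumption, and all counts are invented for illustration.

```python
def significance(fg_count, fg_total, bg_count, bg_total):
    """Score a term by absolute change times relative change (JLH-style)."""
    fg = fg_count / fg_total   # share of foreground docs containing the term
    bg = bg_count / bg_total   # share of all docs containing the term
    if bg == 0 or fg <= bg:
        return 0.0             # no more common than the background: not interesting
    return (fg - bg) * (fg / bg)

# Hypothetical numbers: 1,000 Metallica fans out of 100,000 users.
beatles = significance(fg_count=600, fg_total=1000, bg_count=50000, bg_total=100000)
megadeth = significance(fg_count=400, fg_total=1000, bg_count=1000, bg_total=100000)
print(f"beatles={beatles:.2f} megadeth={megadeth:.2f}")
```

Even though more Metallica fans like the Beatles than Megadeth in this made-up data, the Beatles are popular with everyone, so their score stays low while Megadeth stands out as "uncommonly common".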
What does meaningful mean?
• Super nodes (Super Connectors)
• Traditional graphs get distorted by "super nodes"
• They frequently include these heavily connected vertices during exploration which can
distort finding relevant connections
• When storing connections instead of computing them on-the-fly this becomes a major
issue
• Wisdom of crowds
• Use sampling and diversity settings to choose which signals we want to summarize
• Provide a more personalized form of recommendation…
Personalized recommendations
• Many approaches store edges so they can retrieve an answer for
questions like:
• "people who searched for X tend to click on product Y"
• This provides only a single interpretation of these events
• Elastic Graph can find the answer to these questions by searching:
• "people who searched for X (and ideally were females in London with an age range of 25-40 with interests in product Z), tend to click on product Y"
• This is computed using aggregations and the search on top of proven graph algorithms
• Change the criteria and potentially get a different answer
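The point that changing the criteria can change the answer can be sketched with a few in-memory click events. The event fields (`query`, `product`, `city`) and the helper `top_product` are hypothetical, and this plain filter-and-count stands in for the aggregations Graph runs.

```python
from collections import Counter

def top_product(events, query, **criteria):
    """Most-clicked product for `query`, optionally narrowed by extra field values."""
    counts = Counter(
        event["product"]
        for event in events
        if event["query"] == query
        and all(event.get(field) == value for field, value in criteria.items())
    )
    return counts.most_common(1)[0][0] if counts else None

# Hypothetical click events: product Y wins overall, product W wins in London.
events = [
    {"query": "X", "product": "Y", "city": "New York"},
    {"query": "X", "product": "Y", "city": "New York"},
    {"query": "X", "product": "Y", "city": "Paris"},
    {"query": "X", "product": "W", "city": "London"},
    {"query": "X", "product": "W", "city": "London"},
]
print(top_product(events, "X"))                 # everyone who searched for X
print(top_product(events, "X", city="London"))  # the same question, narrower criteria
```

Because the connections are computed at query time rather than stored, the narrower question needs no separate set of precomputed edges.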
Graph Demo
Oil Demo
Thank You
Click data example
• Dataset where every search and the product a user clicked on is recorded
as a document:
• Can we use this to recommend products to users who search for similar
terms?
• "Oh, you're searching for 'midi' and you're this type of user - other users similar to you who searched for
'midi' had a significantly related correlation with these other products…"
{
"_id": "AU9ADwcEN-SyHOpO8lp4",
"category": "pcmcat152100050032",
"query_time": "2011-08-20 15:05:49",
"product": "1005145",
"ids": [
"P[1005145]",
"Q[midi]",
"C[pcmcat152100050032]"
],
"user":
"362d844db99baefa4e51d630b9724bc3e767c4aa",
"query": "midi"
}
Basic use
• REST interface accepts user graph-exploration criteria as JSON:
• Find product codes that are significantly associated with searches for "midi" and further, show
other queries that led people to these products
• Internally a number of searches with aggregations are then made to build the graph
POST clicklogs/_graph/explore
{
"query": {
"match": {
"query.raw": "midi"
}
},
"vertices": [
{
"field": "product"
}
],
"connections": {
"vertices": [
{
"field": "query.raw"
}
]
}
}
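The request body above can also be assembled programmatically. The sketch below only builds and prints the JSON and does not contact a cluster. The helper name `build_explore_request` is hypothetical, and the body would be sent with POST to clicklogs/_graph/explore as on the slide.

```python
import json

def build_explore_request(seed_field, seed_text, vertex_field, connection_field):
    """Build a graph-explore body: seed query, first-hop vertices, next-hop vertices."""
    return {
        "query": {"match": {seed_field: seed_text}},
        "vertices": [{"field": vertex_field}],
        "connections": {"vertices": [{"field": connection_field}]},
    }

body = build_explore_request("query.raw", "midi", "product", "query.raw")
print(json.dumps(body, indent=2))
```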
Basic use
• Potential JSON response:
{
"vertices": [
{
"field": "query.raw",
"term": "midi cable",
"weight": 0.08745858139552132,
"depth": 1
},
{
"field": "product",
"term": "8567446",
"weight": 0.13247784285434397,
"depth": 0
},
{
"field": "product",
"term": "1112375",
"weight": 0.018600718471158982,
"depth": 0
},
{
"field": "query.raw",
"term": "midi keyboard",
"weight": 0.04802242866755111,
"depth": 1
}
],
"connections": [
{
"source": 0,
"target": 1,
"weight": 0.04802242866755111,
"doc_count": 13
},
{
"source": 2,
"target": 3,
"weight": 0.08120623870976627,
"doc_count": 23
}
]
}
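In a response like this, each connection's "source" and "target" are positions in the "vertices" array. A small sketch of turning them back into readable pairs follows. The helper name is hypothetical, and the sample is a trimmed copy of the response above with the weights dropped.

```python
def describe_connections(response):
    """Resolve each connection's source/target indices into field:term labels."""
    vertices = response["vertices"]
    lines = []
    for conn in response["connections"]:
        src = vertices[conn["source"]]
        dst = vertices[conn["target"]]
        lines.append(
            f'{src["field"]}:{src["term"]} -> {dst["field"]}:{dst["term"]} '
            f'(docs={conn["doc_count"]})'
        )
    return lines

response = {
    "vertices": [
        {"field": "query.raw", "term": "midi cable", "depth": 1},
        {"field": "product", "term": "8567446", "depth": 0},
        {"field": "product", "term": "1112375", "depth": 0},
        {"field": "query.raw", "term": "midi keyboard", "depth": 1},
    ],
    "connections": [
        {"source": 0, "target": 1, "doc_count": 13},
        {"source": 2, "target": 3, "doc_count": 23},
    ],
}
for line in describe_connections(response):
    print(line)
```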
Expressed visually
• Kibana plug-in makes manually exploring graph data simple:
Note: Graph expanded from
previous query
Query controls
• Add controls to the request to tune the results and performance of the graph query:
Control | Description
use_significance | Filters associated terms to only those that are significantly associated with the query (defaults to true)
sample_size | Each "hop" considers a sample of the best-matching documents on each shard (default is 100 documents). Sampling has the dual benefit of keeping exploration focused on meaningfully connected terms and improving the speed of execution.
timeout | Time in milliseconds after which exploration is halted and the results gathered so far are returned
sample_diversity | Requests diversity in the sample, so that the top-matching documents are not dominated by a single source of results
Diversity setting tip
• Best to use on high cardinality fields
– A boolean field is a poor choice for the diversity setting
• Make sure the diversity setting isn’t limiting the sample size to just a few
documents
– A max docs per field of 5 on a boolean field would result in only 10 documents per shard being
examined to generate the graph.
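The arithmetic behind this tip, as a small sketch: the diversity setting caps the sample at (distinct values of the field) × (max docs per value), and the sample size caps it again. The function name is hypothetical, and the 40 distinct categories in the second call are an invented figure.

```python
def max_sampled_docs_per_shard(sample_size, distinct_values, max_docs_per_value):
    """Upper bound on documents examined per shard under a diversity setting."""
    return min(sample_size, distinct_values * max_docs_per_value)

# Boolean field (2 distinct values), 5 docs per value, default sample size of 100.
print(max_sampled_docs_per_shard(100, 2, 5))

# A higher-cardinality field leaves the sample size as the effective limit.
print(max_sampled_docs_per_shard(2000, 40, 500))
```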
Vertices controls
• You can control each vertex's settings too:
Control | Description
size | Number of vertex terms returned for each field (defaults to 5)
min_doc_count | Acts as a certainty threshold: how many documents have to contain a pair of terms before we consider this to be a useful connection? (default is 3)
shard_min_doc_count | Advanced setting: how many documents on a shard have to contain a pair of terms before we return this for global consideration? (default is 2)
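As an illustration, a vertex definition can be built by starting from the defaults listed here and overriding only what differs. The helper `vertex_definition` is hypothetical. Writing the defaults out explicitly is a choice made for readability, since the API applies them anyway when they are omitted.

```python
# Defaults as stated on the slide.
VERTEX_DEFAULTS = {"size": 5, "min_doc_count": 3, "shard_min_doc_count": 2}

def vertex_definition(field, **overrides):
    """Return a vertex definition for `field` with defaults filled in."""
    unknown = set(overrides) - set(VERTEX_DEFAULTS)
    if unknown:
        raise ValueError(f"unsupported vertex controls: {sorted(unknown)}")
    return {"field": field, **VERTEX_DEFAULTS, **overrides}

print(vertex_definition("product"))
print(vertex_definition("query.raw", min_doc_count=10, shard_min_doc_count=3))
```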
Excluding terms
• You can request that certain terms not be part of the graph:
POST clicklogs/_graph/explore
{
"vertices": [
{
"field": "product",
"include": [ "1854873" ]
}
],
"connections": {
"vertices": [
{
"field": "query.raw",
"exclude": [
"midi keyboard",
"midi",
"synth"
]
}
]
}
}
Connecting existing terms
• You can pass known vertices to the API to search for connections between them
POST clicklogs/_graph/explore
{
"vertices": [
{
"field": "product",
"include": [
{
"term": "1854873"
}
]
}
],
"connections": {
"vertices": [
{
"field": "query.raw",
"include": [
"midi keyboard",
"midi",
"synth"
]
}
]
},
"query": {
"bool": {
"minimum_should_match":
2,
"should": [
{
"term":
{"product": "1854873"}
},
{
"term":
{"query.raw": "midi keyboard"}
},
{
"term":
{"query.raw": "midi"}
},
{
"term":
{"query.raw": "synth"}
}
]
}
}
}
Results for connecting existing terms
• Potential JSON response:
{
"took": 201,
"timed_out": false,
"failures": [],
"vertices": [
{
"field": "product",
"term": "1854873",
"weight": 1,
"depth": 0
},
{
"field": "query.raw",
"term": "midi keyboard",
"weight": 0.22873449033036267,
"depth": 1
},
{
"field": "query.raw",
"term": "midi",
"weight": 0.0630400960897268,
"depth": 1
},
{
"field": "query.raw",
"term": "synth",
"weight": 0.6582254135799106,
"depth": 1
}
],
"connections": [
{
"source": 0,
"target": 1,
"weight": 0.22873449033036267,
"doc_count": 7
},
{
"source": 0,
"target": 2,
"weight": 0.0630400960897268,
"doc_count": 5
},
{
"source": 0,
"target": 3,
"weight": 0.6582254135799106,
"doc_count": 3
}
]
}
Connection filtering
• Filter which connections are allowed:
POST clicklogs/_graph/explore
…
"connections": {
"query": {
"bool": {
"filter": [
{
"range": {
"query_time": {
"gte": "2015-10-01 00:00:00"
}
}
}
]
}
},
"vertices": [
{
"field": "query.raw",
"size": 5,
"min_doc_count": 10,
"shard_min_doc_count": 3
}
]
}
}
Putting it together
• Connect query.raw to product since 2015-10-01 where midi is relevant:
POST clicklogs/_graph/explore
{
"query": {
"bool": {
"must": {
"match": {
"query.raw": "midi"
}
},
"filter": [
{
"range": {
"query_time": {
"gte": "2015-10-01 00:00:00"
}
}
}
]
}
},
"controls": {
"use_significance": true,
"sample_size": 2000,
"timeout": 2000,
"sample_diversity": {
"field": "category.raw",
"max_docs_per_value": 500
}
},
"vertices": [
{
"field": "product",
"size": 5,
"min_doc_count": 10,
"shard_min_doc_count": 3
}
],
"connections": {
"query": {
"bool": {
"filter": [
{
"range": {
"query_time": {
"gte": "2015-10-01 00:00:00"
}
}
}
]
}
},
"vertices": [
{
"field": "query.raw",
"size": 5,
"min_doc_count": 10,
"shard_min_doc_count": 3
}
]
}
}
Manual exploration
• The graph APIs can be used to programmatically find meaningful relations
in your application. But, sometimes you want to explore your data
interactively
• The Graph plug-in for Kibana lets you query quickly and displays the
results visually
• Easy to work with your indices visually and begin to find new insights
• Choose colors and icons for vertices
• Save and Load workspaces
• Undo and Redo
• Advanced settings and groupings
Graph UI
Index Field Query Expand
Settings
Works well on a single array field to find
connections between values
Settings
Sampler Aggregation size
Should Significant Terms Be Used?
Ensure sample has diversity
Request & Response/Blacklist/Drill-downs
Add fields
1: Add a field
2: Select and repeat to add more than one
3: Configure field
4: Shift-click to turn on or off
click to configure field
Selection expansion
Hold shift to select
multiple fields
Click to expand
selected fields
Expanded fields
Find links between terms/vertices
New links
Click to add links
Connections
Click on a connection to
see the strength between vertices

Machine Learning

  • 1. Geoff Bernard Solution Architect, Elastic Machine Learning & Graph Automated Anomaly Detection with the Elastic Stack
  • 4. Elastic Stack Store, Search, & AnalyzeElasticsearch Visualize & ManageKibana IngestBeats Logstash Metrics Logging APM Site Search Application Search Business Analytics Enterprise Search Security Analytics Future Solutions SaaS Elastic Cloud Self Managed Elastic Cloud Enterprise Standalone Deployment
  • 7. Anomaly Detection 1) When an entities’ behavior changes significantly and suddenly 2) When an entity is drastically different than others within a population
  • 8. Example 1) Anomalies in temporal pattern Single (univariate) time series Example: Is there unusual traffic on website ? 8 Time Metric
  • 9. Example 2) Outliers in population Detect an unusual population member Example: Which IP address is not like the others? (indication of a bot / attacker) 9
  • 10. Why is Automated Anomaly Detection Needed? “I’m not actively watching my data” “There’s a lot of data I’m just ignoring” “I only want to know if something is weird or if it changes” “I don’t know if any machine has been compromised”
  • 11. Predict Expect ed val ue @ 15: 05 = 1859 Learn Operationalize
  • 12. Spike of errors in logs Failing Device Incorrect Config Change IT Operational Analytics Security Analytics Business Analytics Unusual Network Activity Malware Exfiltrating Data Rogue Insider Sudden Dip in Revenue Operational Issue Payment Processor Problem Anomalies in your data could indicate trouble
  • 14. Top Use Cases IT Ops Cyber Security Unusual log volume by “type” • spike of error counts in logs • unusual or rare log messages Unusual Access/Usage/Authentication • unusual login activity • unusual process invocation • abuse, spamming, snooping Metric/KPI analysis • host/system metrics • APM data • KPI analysis (orders, transactions, etc.) Covert Communication / Exfiltration • DNS Tunneling • port Scanning • unusual bytes outbound to rare destinations Holistic/360-degree monitoring • anomaly detection across disparate, but related data sets • easier root cause analysis Intrusion Detection Filtering • finding unusual patterns in IDS events
  • 15. Graph
  • 16. 1 6 What is Graph? • Discover relationships in your Elasticsearch data • Combines graph algorithms and search • Explore data using relevancy • Detect fraud • Recommendation engine • Kibana UI to interact with graphs • Select relationships to explore • Find new or existing connections in the data • Visualize results • API alternative available
  • 17. 1 7 Why Graph? • Uses existing Elasticsearch indexes • No need to reindex • No need to change data models • Start exploring data today! • Simple architecture • Scales with Elasticsearch cluster • Combine the power of search, relevancy and graph databases • Relevancy allows you to find "uncommonly common" graph relationships between data
  • 18. 1 8 Graphs Graphs are a set of "vertices” and the “connections" between them. Example from stackoverflow - Javascript tag and related tags - Java tag and related tags
  • 19. 1 9 Graphs Example from stackoverflow Javascript cluster from prior slide and connected title words Graphs are a set of "vertices” and the “connections" between them.
  • 20. 2 0 Graphs • An indexed value is a potential vertex • single value fields such as an email address which can link to other fields • array fields such as tags which can link to itself and other fields • A relationship forms an connection • Not persisted in Elasticsearch • Uses aggregation framework to search and connect the vertices • Graph Traversal • Graph traversal algorithms prioritize finding meaningful connections in the data…
  • 21. How does Graph find / infer relationships ? By analyzing co-occurence of terms in your documents
  • 22. Example: Inferring relationships from co- occurence Users that liked Vivaldi also liked ???Music Recommendation { "user_id" : "1", "liked" : { vivaldi, brahms, schubert} } { "user_id" : "2", "liked" : { vivaldi, brahms, bach } } { "user_id" : "2", "liked" : { vivaldi, brahms, bach } } { "user_id" : "2", "liked" : { vivaldi, brahms, bach } } { "user_id" : "2", "liked" : { vivaldi, brahms, bach } } { "user_id" : "200000", "liked" : { vivaldi, brahms, bach } } { "user_id" : "1", "liked" : { vivaldi, brahms, schubert} } { "user_id" : "2", "liked" : { vivaldi, brahms, bach } } { "user_id" : "2", "liked" : { vivaldi, brahms, bach } } { "user_id" : "2", "liked" : { vivaldi, brahms, bach } } { "user_id" : "2", "liked" : { vivaldi, brahms, bach } } { "user_id" : "200000", "liked" : { vivaldi, brahms, bach } }
  • 23. Example: Inferring relationships from co- occurence { ... "timestamp" : "April 4th 2016, 02:22:09" , "request_url" : "/blog/wp-admin" , "source_ip" : "165.98.197.10" , "status" : 404 , "bytes" : 20 , ... } Cyber threat hunting Strong correlation between this IP address and particular attack vector Web Access Logs Sample Document
  • 24. Example: Inferring relationships from co- occurence { "timestamp" : "April 4th 2016, 02:22:09" , "trans_id" : "abc1089787" , "vendor" : "Sam Deli" , "isFraud" : true , "location" : "San Francisco" , "amount" : 5.71 , ... } Fraud Detection Strong correlation between fraudulent charges and “Sam’s deli” Card Transactions Dataset Sample Document
  • 26. Users that liked Vivaldi also liked ??? Example: Inferring relationships from co- occurence Music Recommendation { "user_id" : "1", "liked" : { vivaldi, brahms, schubert} } { "user_id" : "2", "liked" : { vivaldi, brahms, bach } } { "user_id" : "2", "liked" : { vivaldi, brahms, bach } } { "user_id" : "2", "liked" : { vivaldi, brahms, bach } } { "user_id" : "2", "liked" : { vivaldi, brahms, bach } } { "user_id" : "200000", "liked" : { vivaldi, brahms, bach } } { "user_id" : "1", "liked" : { vivaldi, brahms, schubert} } { "user_id" : "2", "liked" : { vivaldi, brahms, bach } } { "user_id" : "2", "liked" : { vivaldi, brahms, bach } } { "user_id" : "2", "liked" : { vivaldi, brahms, bach } } { "user_id" : "2", "liked" : { vivaldi, brahms, bach } } { "user_id" : "200000", "liked" : { vivaldi, brahms, beatles} } Vivaldi Beatles Brahms Bach
  • 27. Users that liked Metallica also liked ??? Example: Inferring relationships from co- occurence Music Recommendation { "user_id" : "1", "liked" : { vivaldi, brahms, schubert} } { "user_id" : "2", "liked" : { vivaldi, brahms, bach } } { "user_id" : "2", "liked" : { vivaldi, brahms, bach } } { "user_id" : "2", "liked" : { vivaldi, brahms, bach } } { "user_id" : "2", "liked" : { vivaldi, brahms, bach } } { "user_id" : "200000", "liked" : { vivaldi, brahms, bach } } { "user_id" : "1", "liked" : { vivaldi, brahms, schubert} } { "user_id" : "2", "liked" : { vivaldi, brahms, bach } } { "user_id" : "2", "liked" : { vivaldi, brahms, bach } } { "user_id" : "2", "liked" : { vivaldi, brahms, bach } } { "user_id" : "2", "liked" : { vivaldi, brahms, bach } } { "user_id" : "200000", "liked" : { metallica, AC/DC, beatles} } Metallica Beatles AC/DC Iron Maiden Megadeth
  • 28. How does Graph inject relevance ? Using math and search features to subtract background signal to surface only the meaningful relationships
  • 29. 2 9 What does meaningful mean? • Super nodes (Super Connectors) • Traditional graphs get distorted by "super nodes" • They frequently include these heavily connected vertices during exploration which can distort finding relevant connections • When storing connections instead of computing them on-the-fly this becomes a major issue • Wisdom of crowds • Use sampling and diversity settings to choose which signals we want to summarize • Provide a more personalized form of recommendation…
  • 30. 3 0 Personalized recommendations • Many approaches store edges so they can retrieve an answer for questions like: • "people who searched for X tend to click on product Y" • This provides only a single interpretation of this events • Elastic Graph can find the answer to these questions by searching: • "people who searched for X (and ideally were females in London with an age range of 25- 40 with interests in product Z), tend to click on product Y" • This is computed using aggregations and the search on top of proven graph algorithms • Change the criteria and potentially get a different answer
  • 34. 3 4 Click data example • Dataset where every search and the product a user clicked on is recorded as a document: • Can we use this to recommend products to users who search for similar terms? • "Oh, you're searching for 'midi' and you're this type of user - other users similar to you who searched for 'midi' had a significantly related correlation with these other products…" { "_id": "AU9ADwcEN-SyHOpO8lp4" "category": "pcmcat152100050032", "query_time": "2011-08-20 15:05:49", "product": "1005145", "ids": [ "P[1005145]", "Q[midi]", "C[pcmcat152100050032]" ], "user": "362d844db99baefa4e51d630b9724bc3e767c4aa", "query": "midi" }
  • 35. 3 5 Basic use • REST interface accepts user graph-exploration criteria as JSON: • Find product codes that are significantly associated with searches for "midi" and further, show other queries that led people to these products • Internally a number of searches with aggregations are then made to build the graph POST clicklogs/_graph/explore { "query": { "match": { "query.raw": "midi" } }, "vertices": [ { "field": "product" } ], "connections": { "vertices": [ { "field": "query.raw" } ] } }
  • 36. 3 6 Basic use • Potential JSON response: { "vertices": [ { "field": "query.raw", "term": "midi cable", "weight": 0.08745858139552132, "depth": 1 }, { "field": "product", "term": "8567446", "weight": 0.13247784285434397, "depth": 0 }, { "field": "product", "term": "1112375", "weight": 0.018600718471158982, "depth": 0 }, {. "field": "query.raw", "term": "midi keyboard", "weight": 0.04802242866755111, "depth": 1 } ], "connections": [ { "source": 0, "target": 1, "weight": 0.04802242866755111, "doc_count": 13 }, { "source": 2, "target": 3, "weight": 0.08120623870976627, "doc_count": 23 } ] }
  • 37. 3 7 Expressed visually • Kibana plug-in makes manually exploring graph data simple: Note: Graph expanded from previous query
  • 38. 3 8 Query controls • Configure controls to the query to tune the graph query results and performance: Control Description use_significance Used to filter associated terms to only those that are significantly associated with our query, defaults true sample_size Each "hop" considers a sample of the best-matching documents on each shard (default is 100 documents). Using samples has the dual benefit of keeping exploration focused on meaningfully-connected terms and improving the speed of execution. timeout Time in milliseconds exploration will be halted and results gathered so far are returned sample_diversity To avoid the top-matching documents sample being dominated by a single source of results sometimes it can prove necessary to request diversity in the sample
  • 39. 3 9 Diversity setting tip • Best to use on high cardinality fields – A boolean field is a poor choice for the diversity setting • Make sure the diversity setting isn’t limiting the sample size to just a few documents – A max docs per field of 5 on a boolean field would result in only 10 documents per shard being examined to generate the graph.
  • 40. Vertices controls • You can control each vertex's settings too (Control: Description): • size: Number of vertex terms returned for each field (defaults to 5). • min_doc_count: Acts as a certainty threshold: how many documents must contain a pair of terms before the connection is considered useful (default is 3). • shard_min_doc_count: Advanced setting: how many documents on a shard must contain a pair of terms before it is returned for global consideration (default is 2).
  • 41. Excluding terms • You can request that certain terms not be part of the graph: POST clicklogs/_graph/explore { "vertices": [ { "field": "product", "include": [ "1854873" ] } ], "connections": { "vertices": [ { "field": "query.raw", "exclude": [ "midi keyboard", "midi", "synth" ] } ] } }
  • 42. Connecting existing terms • You can pass known vertices to the API to search for connections between them: POST clicklogs/_graph/explore { "vertices": [ { "field": "product", "include": [ { "term": "1854873" } ] } ], "connections": { "vertices": [ { "field": "query.raw", "include": [ "midi keyboard", "midi", "synth" ] } ] }, "query": { "bool": { "minimum_should_match": 2, "should": [ { "term": {"product": "1854873"} }, { "term": {"query.raw": "midi keyboard"} }, { "term": {"query.raw": "midi"} }, { "term": {"query.raw": "synth"} } ] } } }
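The request on this slide follows a regular pattern: every known term appears both as an included vertex and as a `should` clause, and `minimum_should_match: 2` keeps only documents in which at least two of the terms co-occur. A minimal sketch of generating that body; the helper name `build_connect_request` and its parameters are hypothetical.

```python
def build_connect_request(seed_field, seed_term, other_field, other_terms):
    """Build a Graph explore body that looks for links between known terms."""
    # One term clause per known vertex; a document must match at least two
    # of them to count as evidence of a connection.
    should = [{"term": {seed_field: seed_term}}]
    should += [{"term": {other_field: term}} for term in other_terms]
    return {
        "vertices": [{"field": seed_field, "include": [{"term": seed_term}]}],
        "connections": {
            "vertices": [{"field": other_field, "include": list(other_terms)}]
        },
        "query": {"bool": {"minimum_should_match": 2, "should": should}},
    }


request_body = build_connect_request(
    "product", "1854873", "query.raw", ["midi keyboard", "midi", "synth"]
)
print(len(request_body["query"]["bool"]["should"]), "should clauses")
```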
  • 43. Results for connecting existing terms • Potential JSON response: { "took": 201, "timed_out": false, "failures": [], "vertices": [ { "field": "product", "term": "1854873", "weight": 1, "depth": 0 }, { "field": "query.raw", "term": "midi keyboard", "weight": 0.22873449033036267, "depth": 1 }, { "field": "query.raw", "term": "midi", "weight": 0.0630400960897268, "depth": 1 }, { "field": "query.raw", "term": "synth", "weight": 0.6582254135799106, "depth": 1 } ], "connections": [ { "source": 0, "target": 1, "weight": 0.22873449033036267, "doc_count": 7 }, { "source": 0, "target": 2, "weight": 0.0630400960897268, "doc_count": 5 }, { "source": 0, "target": 3, "weight": 0.6582254135799106, "doc_count": 3 } ] }
  • 44. Connection filtering • Filter which connections are allowed: POST clicklogs/_graph/explore … "connections": { "query": { "bool": { "filter": [ { "range": { "query_time": { "gte": "2015-10-01 00:00:00" } } } ] } }, "vertices": [ { "field": "query.raw", "size": 5, "min_doc_count": 10, "shard_min_doc_count": 3 } ] } }
  • 45. Putting it together • Connect query.raw to product since 2015-10-01 where midi is relevant: POST clicklogs/_graph/explore { "query": { "bool": { "must": { "match": { "query.raw": "midi" } }, "filter": [ { "range": { "query_time": { "gte": "2015-10-01 00:00:00" } } } ] } }, "controls": { "use_significance": true, "sample_size": 2000, "timeout": 2000, "sample_diversity": { "field": "category.raw", "max_docs_per_value": 500 } }, "vertices": [ { "field": "product", "size": 5, "min_doc_count": 10, "shard_min_doc_count": 3 } ], "connections": { "query": { "bool": { "filter": [ { "range": { "query_time": { "gte": "2015-10-01 00:00:00" } } } ] } }, "vertices": [ { "field": "query.raw", "size": 5, "min_doc_count": 10, "shard_min_doc_count": 3 } ] } }
  • 46. Manual exploration • The graph APIs can be used to find meaningful relations programmatically from your application, but sometimes you want to explore your data interactively • The Graph plug-in for Kibana lets you query quickly and displays the results visually • Easy to work with your indices visually and begin to find new insights • Choose colors and icons for vertices • Save and load workspaces • Undo and redo • Advanced settings and groupings
  • 47. Graph UI • Screenshot labels: Index, Field, Query, Expand, Settings • Works well on a single array field to find connections between values
  • 50. Add fields • 1: Add a field • 2: Select and repeat to add more than one • 3: Configure field • 4: Shift-click to turn a field on or off; click to configure the field
  • 51. Selection expansion • Hold shift to select multiple fields • Click to expand selected fields
  • 53. Find links between terms/vertices • New links • Click to add links
  • 54. Connections • Click on a connection to see the strength of the link between its vertices