Entity-Centric Indexing
Mark Harwood @elasticmark
4/6/2015
www.elastic.co
(or “when aggregations don’t cut it”)
Entity-centric indexes
Copyright Elastic 2015. Copying, publishing and/or distributing without written permission is strictly prohibited.
A typical “event-centric” deployment
Event stream -> Time-based event indexes
Problem: some aggregations are expensive
Using web server log data, answer the question:
"How long on average do customers spend on my site?"
We need to join all event-level data together at query-time.
How to cripple Elasticsearch with a bucket explosion:
1. Ask a question about values that need to be derived from multiple documents (e.g. deriving a web session’s duration)
2. Make the joining key a high-cardinality field, e.g. something like “IP address”
3. Extra points if you use no routing of your documents, so that related content is spray-gunned across multiple shards
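To make the cost concrete, the query-time join can be imitated in plain Python. This is only a sketch under stated assumptions: the field names `ip` and `timestamp` are hypothetical, and all events from one IP address are treated as a single session. The point is that one bucket is needed per distinct key before any average can be computed.

```python
from collections import defaultdict


def average_session_seconds(events):
    """Join events per key at query time: one bucket per distinct IP."""
    buckets = defaultdict(lambda: [None, None])  # ip -> [first_seen, last_seen]
    for event in events:
        first_last = buckets[event["ip"]]
        ts = event["timestamp"]
        if first_last[0] is None or ts < first_last[0]:
            first_last[0] = ts
        if first_last[1] is None or ts > first_last[1]:
            first_last[1] = ts
    durations = [last - first for first, last in buckets.values()]
    return sum(durations) / len(durations), len(buckets)


# Hypothetical events; timestamps are seconds.
events = [
    {"ip": "10.0.0.1", "timestamp": 100},
    {"ip": "10.0.0.2", "timestamp": 105},
    {"ip": "10.0.0.1", "timestamp": 160},
    {"ip": "10.0.0.2", "timestamp": 135},
]
avg_seconds, bucket_count = average_session_seconds(events)
```

With millions of distinct keys spread over several shards, the bucket map is what grows out of control.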
Solution
A “pay-as-you-go” model for the costs of fusing data
Solution: an “entity-centric” model
Usual stream of events -> Time-based event indexes
Time-based event indexes -> Periodic extracts sorted by entity ID and time -> Entity-based summary indexes
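The periodic extract can be sketched as follows. This is an illustration rather than the talk's code: the field names `entity_id` and `timestamp` are assumptions, and the sort is done in memory here, whereas a real extract would ask the event index for sorted results. Because the events arrive ordered by entity ID and then time, each entity's events form one contiguous bundle.

```python
from itertools import groupby
from operator import itemgetter


def bundle_events(events):
    """Yield (entity_id, [events in time order]) from an event extract."""
    ordered = sorted(events, key=itemgetter("entity_id", "timestamp"))
    for entity_id, group in groupby(ordered, key=itemgetter("entity_id")):
        yield entity_id, list(group)


# Hypothetical web log events.
events = [
    {"entity_id": "b", "timestamp": 3, "page": "/exit"},
    {"entity_id": "a", "timestamp": 2, "page": "/cart"},
    {"entity_id": "a", "timestamp": 1, "page": "/home"},
]
bundles = list(bundle_events(events))
```

Each bundle can then be sent as a single update to the matching entity document.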
Entity-centric queries
• Web sessions
  • “How long on average do my customers spend on my site?”
  • “Which users behave like bots?”
  • “What is the most common exit page?”
• Bank accounts
  • “Does this new payment match the typical spending behaviour of bank account X?”
Entity-centric queries
• Buyers
  • “What do the users who bought product X also buy?”
  • “Which buyers behave like ‘shills’ and who are they promoting?”
• Cars
  • “Which cars drove long distances after failing a roadworthiness test?”
Web log analytics
Use case
Use case: GFORCES
• Analyses website traffic for retailers and manufacturers in the automotive industry
• Summarises many behaviours over time, e.g.
  • unique numbers of visitors per month
  • engagement: average session durations
• Faced scaling issues producing some results from raw events
Results of moving to entity-centric indexing
• Data store contains 150m events generated by 26m user sessions
• Event-centric aggregations were taking ~25 seconds
• Equivalent entity-centric aggregations take <50ms
• Simplified queries for common entry pages, common exit pages etc.
Worked example
Amazon marketplace reviews: building profiles for reviewers
Play along! Code + data here: bit.ly/entcent
An “entity-centric” model
reviews.csv -> loadEvents.sh -> AmazonReviews (an event-centric index)
AmazonReviews -> buildEntities.sh -> AmazonReviewers (an entity-centric index)
Review event fields
• rating
• seller
• reviewer
• date
buildEntities.sh
• Drops and creates the reviewers index.
• Uses the Python client to query and scroll a list of reviews sorted by reviewerId and time.
• Python pushes _update requests, via the bulk indexing API, to ~400k “Reviewer” documents, each request containing a bundle of that reviewer's recent reviews.
• A shard-side Groovy script collapses the multiple reviews into a single reviewer JSON document summarising behaviour.
Reviewer entity fields
• positivity
• num sellers reviewed
• last 50 reviews
• profile (“newbie”, “fanboy” etc.)
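The bulk update step might look roughly like the sketch below, which only builds the newline-delimited request body and sends nothing. Assumptions to check against your own cluster: the request shape follows the bulk and update APIs of recent Elasticsearch releases as I understand them (the 2015 release used in the talk referenced file-based Groovy scripts differently), and the index name `reviewers`, the script id `reviewer_profile_updater` and the `events` parameter are hypothetical.

```python
import json


def bulk_update_lines(index, script_id, bundles):
    """Build a bulk body holding one scripted update per entity bundle."""
    lines = []
    for entity_id, events in bundles:
        # Action line: which entity document to update.
        lines.append(json.dumps({"update": {"_index": index, "_id": entity_id}}))
        # Payload line: run the stored script even if the document is new.
        lines.append(json.dumps({
            "scripted_upsert": True,
            "script": {"id": script_id, "params": {"events": events}},
            "upsert": {},
        }))
    return "\n".join(lines) + "\n"


# One hypothetical reviewer with a bundle of two reviews.
body = bulk_update_lines(
    "reviewers",
    "reviewer_profile_updater",
    [("reviewer-1", [{"seller": "187", "rating": 5}, {"seller": "42", "rating": 3}])],
)
body_lines = body.strip().split("\n")
```

Sending whole bundles keeps the number of update requests proportional to the number of entities touched, not the number of events.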
Anatomy of an entity indexing Groovy script
• Initialize if new document
• Loop to consolidate latest events
• Re-run risk profile logic
• Load stored state
Store the script in ES_HOME/config/scripts/foo.groovy
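The same four steps can be mimicked in Python to show the shape of the logic. This is an analogue of the shard-side script, not the Groovy script from the talk: the profile thresholds, the extra "regular" label and the definition of positivity as the mean rating are all assumptions made for illustration.

```python
def update_reviewer(stored, new_reviews, max_reviews=50):
    """Fold a bundle of new reviews into a reviewer summary document."""
    # Load stored state, or initialize if this is a new document.
    doc = stored if stored is not None else {"reviews": []}
    # Loop to consolidate the latest events, keeping only the most recent.
    for review in new_reviews:
        doc["reviews"].append(review)
    doc["reviews"] = doc["reviews"][-max_reviews:]
    # Re-run the profile logic over the consolidated state.
    ratings = [r["rating"] for r in doc["reviews"]]
    sellers = {r["seller"] for r in doc["reviews"]}
    doc["positivity"] = sum(ratings) / len(ratings) if ratings else 0.0
    doc["num_sellers_reviewed"] = len(sellers)  # counted over retained reviews only
    if len(ratings) < 3:                         # hypothetical threshold
        doc["profile"] = "newbie"
    elif len(sellers) == 1 and min(ratings) == 5:  # hypothetical rule
        doc["profile"] = "fanboy"
    else:
        doc["profile"] = "regular"
    return doc


first = update_reviewer(None, [{"seller": "187", "rating": 5}] * 2)
first_profile = first["profile"]
second = update_reviewer(first, [{"seller": "187", "rating": 5}] * 2)
```

Running the function twice shows the pay-as-you-go idea: the second call only processes the new bundle plus the stored summary.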
Insight: which sellers have a lot of fanboys?
Seller #187 has more than his fair share of “fanboy” reviewers…
Drilling down into seller #187’s fanboys
Suspiciously synchronised behaviour
UK 2013 car road worthiness tests
Worked example
Example background
• In the UK, all vehicles must pass an annual roadworthiness test called an MOT (named after the Ministry of Transport).
• It is illegal to drive a car that has failed an MOT (unless driving home from a test or to a repair centre).
• Taxis and other forms of public transport have to be tested more frequently: every 6 months.
• All data is freely available from data.gov.uk, but with anonymised vehicle IDs and inexact test locations.
Example background
mots.csv -> loadMOTs.sh -> MOTs index
MOTs index -> buildEntities.sh -> Cars index
loadMOTs.sh
• Drops and creates the mots index.
• Uses the Python client to bulk load all 37m roadworthiness test results for 2013 (data source: http://data.gov.uk/).
MOT event fields
• result (pass/fail)
• vehicle ID
• make + model + age
• mileage
• test date
• test location
buildEntities.sh
• Drops and creates the cars index.
• Registers CarProfileUpdater.groovy as a stored script.
• Uses the Python client to query and scroll a list of MOT test results sorted by vehicle ID and time.
• Python pushes _update requests, via the bulk indexing API, to ~27m “Car” documents, each request containing a bundle of related MOT test results.
• A shard-side Groovy script collapses the multiple tests into a single summary JSON document for a car, deriving summaries such as those listed under the car entity fields.
Car entity fields
• make + model + age
• last test result, date, location
• miles driven while failed
• days between fail and fix
• complete test history
• suspected bad mileometer readings
Data fusion logic
Car attributes derived from 3 test result documents
(Figure: test results 1, 2 and 3 plotted by test date and mileometer reading, annotated with the derived attributes daysForFix, milesDrivenAfterFailure, badReading? and mile-o-meterRewind)
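A hedged Python sketch of this kind of fusion logic follows. It is not CarProfileUpdater.groovy: the field names (`date`, `mileage`, `passed`) and the exact rules are assumptions. A failure is considered fixed at the next pass, miles driven after failure are measured between those two readings, and any reading lower than the previous one is flagged as a suspected bad reading.

```python
from datetime import date


def summarise_car(tests):
    """Derive car-level attributes from a car's MOT test results."""
    tests = sorted(tests, key=lambda t: t["date"])
    summary = {"days_for_fix": 0, "miles_driven_after_failure": 0, "bad_reading": False}
    failed_test = None  # earliest failure that has not yet been fixed
    previous = None
    for test in tests:
        # A mileometer that goes backwards suggests a rewind or a typo.
        if previous is not None and test["mileage"] < previous["mileage"]:
            summary["bad_reading"] = True
        if test["passed"]:
            if failed_test is not None:
                summary["days_for_fix"] += (test["date"] - failed_test["date"]).days
                summary["miles_driven_after_failure"] += max(
                    0, test["mileage"] - failed_test["mileage"]
                )
                failed_test = None
        elif failed_test is None:
            failed_test = test
        previous = test
    return summary


# Three hypothetical test results for one car.
car = summarise_car([
    {"date": date(2013, 1, 10), "mileage": 50000, "passed": False},
    {"date": date(2013, 1, 20), "mileage": 50300, "passed": True},
    {"date": date(2013, 7, 1), "mileage": 49000, "passed": True},
])
```

None of these attributes exists on any single event document, which is why they are cheap to query only once they are stored on the entity.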
Insight: who is driving failed vehicles?
Q: Why is there an unexpected peak in milesDrivenWithFailure around 6 months?
A: Taxis
Insight: taxis keep on trucking after failures
Recycling user behaviours
A user-centric index as a recommendation engine
Example background: MovieLens data
• A public dataset* of 10m movie ratings made by 71k users
• One Elasticsearch document per user with a list of their movie ratings
* http://files.grouplens.org/datasets/movielens/ml-10m-README.html
“Uncommonly common” user behaviours
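One way to picture “uncommonly common” is to compare how often a movie appears among users who liked a seed movie with how often it appears among all users. The sketch below is a plain-Python illustration with a simple ratio score chosen for this example; it is an assumption, not the scoring Elasticsearch uses, and the data is made up.

```python
from collections import Counter


def uncommonly_common(user_docs, seed_movie, min_count=2):
    """Rank movies that are over-represented among fans of seed_movie."""
    background = Counter(m for movies in user_docs for m in set(movies))
    fans = [set(movies) for movies in user_docs if seed_movie in movies]
    foreground = Counter(m for movies in fans for m in movies if m != seed_movie)
    scores = {}
    for movie, count in foreground.items():
        if count < min_count:
            continue  # ignore rare co-occurrences
        fg_rate = count / len(fans)
        bg_rate = background[movie] / len(user_docs)
        scores[movie] = fg_rate / bg_rate
    return sorted(scores, key=scores.get, reverse=True)


# One list of liked movies per user document (hypothetical data).
user_docs = [
    ["Alien", "Aliens", "Titanic"],
    ["Alien", "Aliens"],
    ["Titanic", "Notebook"],
    ["Titanic", "Alien", "Aliens"],
    ["Titanic"],
    ["Titanic", "Notebook"],
]
recommendations = uncommonly_common(user_docs, "Alien")
```

A title that is popular with everyone scores low even though fans of the seed movie like it too; it is common, but not uncommonly so.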
Conclusions
Entity-centric indexing: advantages
• Efficient and simple queries
• Advanced analytics/insights
• Can provide a cheaper data retention policy (daily->weekly->monthly roll-ups)
• Can reuse existing Elasticsearch APIs or build entity documents using external technologies
Entity-centric indexing: tips
• Avoid “fat entities”
  • Use forgetful collections: priority queues, circular buffers, HyperLogLog
• Avoid pointless updates
  • Use ctx.op = "none" to avoid writes of insignificant changes
• Consider options for reducing event volumes:
  • Use of aggregations in gathering events
  • Reduce related events in the event-gathering script that issues updates
• Parallelise the pull of event information
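Two of these tips can be illustrated together in Python. The sketch is an analogue of what an update script might do, not Elasticsearch code: a bounded `deque` stands in for a circular buffer, and returning the operation `"none"` stands in for setting ctx.op to "none". The field names and the `min_change` threshold are assumptions.

```python
from collections import deque


def apply_update(entity, new_events, max_events=50, min_change=0.01):
    """Return (op, doc): op is "none" when the change is not worth a write."""
    # Forgetful collection: only the most recent max_events values survive.
    recent = deque(entity.get("recent", []), maxlen=max_events)
    recent.extend(new_events)
    new_avg = sum(recent) / len(recent) if recent else 0.0
    old_avg = entity.get("average", 0.0)
    # Skip the write if the summary the queries rely on has barely moved.
    if not new_events or abs(new_avg - old_avg) < min_change:
        return "none", entity
    return "index", {"recent": list(recent), "average": new_avg}


entity = {"recent": [5, 5, 5], "average": 5.0}
op_same, doc_same = apply_update(entity, [5])
op_changed, doc_changed = apply_update(entity, [1], max_events=3)
```

Skipping a write also skips storing the new events themselves, so the threshold is a trade-off between write volume and the completeness of the retained history.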
Entity-centric indexing
• Incremental entity updates can be achieved by querying all events since the timestamp of the last run
• Data integrity: implement policies for:
  • handling any failures in performing entity updates
  • retiring old entities (use of TTL?)
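An incremental run could request only the events newer than the previous run, sorted so that each entity's events arrive together. The helper below just builds a search body using the standard range query and sort clauses; the field names `entity_id` and `timestamp` and the function itself are hypothetical, and how the last-run timestamp is persisted is left open.

```python
def events_since(last_run_iso, entity_field="entity_id", time_field="timestamp"):
    """Build a search body for events newer than the last run."""
    return {
        # Only events strictly after the previous run.
        "query": {"range": {time_field: {"gt": last_run_iso}}},
        # Sorted by entity and then time, ready to be bundled per entity.
        "sort": [{entity_field: "asc"}, {time_field: "asc"}],
    }


search_body = events_since("2015-06-04T00:00:00Z")
```

Record the new high-water mark only after the entity updates have succeeded, so that a failed run is retried rather than skipped.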
@elasticmark
Questions?


Mark Harwood - Building Entity Centric Indexes - NoSQL matters Dublin 2015
