© 2018 PURE STORAGE INC.1 #PUREACCELERATE
#NEWMEETSNOW
RUNNING ELASTICSEARCH ON PURE
BIG DATA AS A SERVICE
Brian Gold | Pure Storage
@briantgold
WHEN, WHY, AND HOW SHOULD I USE ELASTICSEARCH?
Searching a Large Document Corpus
Find the book containing “It was the best of times, it was the worst of times”
Data Con LA 2018 - Big Data as a Service: Running Elasticsearch on Pure by Brian Gold
Search Engines: Indexes in the Digital Age
Tokenizing, Filtering, Stemming, Normalizing
Inverted Index
Term        Document: Locations
voided      {'doc1': [221]}
toil        {'doc3': [12]}
house       {'doc2': [248]}
topics      {'doc1': [23, 206, 342]}
mandate     {'doc3': [143]}
edition     {'doc2': [178]}
job         {'doc1': [282]}
week        {'doc1': [22], 'doc2': [84]}
buildings   {'doc3': [832]}
THE REST IS JUST ENGINEERING ☺
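A minimal sketch of how the four text-processing steps can feed an inverted index shaped like the term table on this slide. The stop-word list, the suffix-stripping stemmer, the sample documents, and the order in which the steps run are all assumptions made for illustration; a real engine such as Lucene uses far more careful analyzers.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list used for the "filtering" step.
STOP_WORDS = {"a", "an", "and", "it", "of", "the", "was"}


def tokenize(text):
    """Split text into word tokens, pairing each token with its word position."""
    return list(enumerate(re.findall(r"[A-Za-z']+", text)))


def normalize(token):
    """Lowercase the token so 'Times' and 'times' map to the same term."""
    return token.lower()


def stem(token):
    """Very naive suffix stripping; real engines use proper stemmers."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, raw in tokenize(text):
            term = normalize(raw)
            if term in STOP_WORDS:
                continue  # filtering: drop very common words
            term = stem(term)
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


docs = {
    "doc1": "It was the best of times, it was the worst of times",
    "doc2": "The best edition of the week",
}
index = build_inverted_index(docs)
print(index["best"])
print(index["time"])
```

Looking up a term is then a single dictionary access that returns every document and position where the term occurs, which is what makes a search over a large corpus fast compared with scanning each document.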
Basic indexing workflow with Apache Lucene
Data shippers (e.g., crawler) => Documents => Library for preprocessing and indexing text
AS DATA VOLUMES GROW, WE MUST SCALE OUT
Scaling out with Elasticsearch
Data shippers (e.g., crawler) => Documents => Elasticsearch cluster frontend
POWERING SEARCH BACKEND FOR MAJOR WEBSITES (WIKIPEDIA)
What’s a document?
“Documents”
• Emails
• Web pages
• Records in a DBMS
• Syslog entries
• Security logs
• …
“Data shippers”
• Logstash
• Fluentd
• Apache Spark
• Filebeat
• Web crawlers
• …
Elasticsearch cluster frontend
USE ELASTICSEARCH TO INDEX ANY KIND OF TEXT
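As a rough illustration of what a data shipper hands to the cluster frontend, the sketch below serializes a couple of documents into the newline-delimited JSON layout used by the Elasticsearch bulk API (an action line followed by the document source). The index name `syslog-demo` and the field names are hypothetical, and nothing is sent over the network here. Depending on the Elasticsearch version, the action line may also need a type, as in the 6.x series.

```python
import json


def to_bulk_body(index_name, documents):
    """Serialize documents as NDJSON: one action line, then one source line."""
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


# Hypothetical syslog-style documents; field names are illustrative only.
documents = [
    {"host": "web-01", "severity": "error", "message": "disk latency high"},
    {"host": "web-02", "severity": "info", "message": "service started"},
]

body = to_bulk_body("syslog-demo", documents)
print(body)
```

Each document is just a JSON object, which is why emails, log lines, and database records can all be indexed the same way once a shipper has turned them into fields.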
Top use cases
Application search
Business analytics
Log analytics
Security analytics
Common theme: rapid insights into data
INFRASTRUCTURE REQUIREMENTS AND BEST PRACTICES
ELASTICSEARCH IN DEPTH
How does Elasticsearch store my data?
Elasticsearch: Ephemeral process
translog, segments: Persistent data structures
translog: Write ahead log
segments: Inverted indexes & associated metadata
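The split between an ephemeral process and persistent data structures can be sketched with a toy write-ahead log: every operation is appended and synced to disk before it is acknowledged, so a fresh process can rebuild its state by replaying the file. The class name `TinyTranslog` and the file layout are assumptions for illustration and do not reflect Elasticsearch's actual on-disk format.

```python
import json
import os
import tempfile


class TinyTranslog:
    """Append-only log of index operations, synced so it survives a crash."""

    def __init__(self, path):
        self.path = path

    def append(self, doc_id, source):
        with open(self.path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"id": doc_id, "source": source}) + "\n")
            log.flush()
            os.fsync(log.fileno())  # ask the OS to push the record to storage

    def replay(self):
        """Rebuild in-memory state by re-reading every logged operation."""
        state = {}
        if not os.path.exists(self.path):
            return state
        with open(self.path, encoding="utf-8") as log:
            for line in log:
                op = json.loads(line)
                state[op["id"]] = op["source"]
        return state


with tempfile.TemporaryDirectory() as data_dir:
    path = os.path.join(data_dir, "translog.ndjson")

    # First "process": accept two documents, then go away without cleanup.
    log = TinyTranslog(path)
    log.append("1", {"message": "first"})
    log.append("2", {"message": "second"})
    del log

    # Second "process": a fresh instance recovers everything from the log.
    recovered = TinyTranslog(path).replay()
    print(sorted(recovered))
```

Because each append waits on a sync, the latency of the underlying storage sits directly on the indexing path in this sketch.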
Elasticsearch makes scale-out simple
Elasticsearch cluster - data nodes
Elasticsearch: translog, segments
Elasticsearch: translog, segments
Elasticsearch: translog, segments
Mapping concepts to hardware implementation
Storage (SSDs or disks)
CPU & DRAM
Network Interface
Elasticsearch
Scale out with commodity servers
Elasticsearch Elasticsearch Elasticsearch
Cluster Interconnect
Pros
Excellent scalability
Cons
Deployment requires specific server configuration
Hard to change compute-to-storage ratios
What if we disaggregate?
Elasticsearch Elasticsearch Elasticsearch
Cluster Interconnect
Disaggregation enables Elasticsearch-as-a-service
ES ES ES
Cluster Interconnect
ES ES ES ES
Pros
Run nodes anywhere, anyhow
Scale storage and compute independently
Cons
Network bottlenecks?
Storage bottlenecks?
“Do not use remote-mounted storage... The latency introduced here is antithetical to performance.”
- Elastic.co – Indexing Performance Tips
CHALLENGE ACCEPTED
FLASHBLADE: PURPOSE-BUILT FOR MODERN ANALYTICS
BLADE: Powerful, Elastic Data Processing & Storage Unit
PURITY: Massively Distributed Software for Limitless Scale
SCALE-OUT FABRIC: Software-defined fabric that scales linearly with more data & clients
Why should an all-flash array perform well with Elasticsearch?
Diagram: incoming documents, translog, refresh, segments, merge (write paths ① ② ③)
Indexing needs high write throughput at consistent, low-enough latency
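To make the refresh and merge steps from this slide concrete, here is a toy model of a shard's write path: documents collect in a buffer, a refresh writes the buffer out as a new immutable segment, and a merge rewrites several segments into one. The class `TinyShard`, its methods, and the sample documents are hypothetical; the sketch only shows why indexing keeps producing new writes.

```python
class TinyShard:
    """Toy write path: buffer, then refresh into segments, then merge."""

    def __init__(self):
        self.buffer = []    # documents accepted but not yet searchable
        self.segments = []  # each segment is an immutable tuple of documents

    def index(self, doc):
        self.buffer.append(doc)

    def refresh(self):
        """Write buffered documents out as one new small segment."""
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def merge(self):
        """Rewrite all existing segments into a single larger segment."""
        merged = tuple(doc for segment in self.segments for doc in segment)
        self.segments = [merged] if merged else []

    def search(self, word):
        """Only documents that have reached a segment are visible."""
        return [d for seg in self.segments for d in seg if word in d.split()]


shard = TinyShard()
batches = (["taxi ride one", "taxi ride two"], ["bus ride"], ["taxi ride three"])
for batch in batches:
    for doc in batch:
        shard.index(doc)
    shard.refresh()

segments_before = len(shard.segments)
shard.merge()
segments_after = len(shard.segments)
hits = shard.search("taxi")
print(segments_before, segments_after, len(hits))
```

In this model every refresh and every merge writes data that was already written once to the log or to an older segment, which is one way to see why sustained write throughput matters for indexing.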
Indexing benchmark methodology
20 Elasticsearch nodes
• 40 vCPUs @ 2.6GHz
• 192 GB DRAM
• 256GB local SSD
• 2x10GbE network
• Elasticsearch 6.2.1 (2 per node in containers)
5 Rally load generators
• 40 vCPUs @ 2.6GHz
• 192 GB DRAM
• 2x10GbE network
• NYC Taxis dataset, 8x replicated => 595GB raw
FlashBlade
• 15 blades, 17TB each
• 162TB usable via NFS
• 8x40GbE network
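The figures on the methodology slide imply a few aggregate numbers that are handy when reading the results. The short calculation below derives them; it assumes that "2x10GbE" means 20 Gb/s per host, that "8x replicated" means eight copies of the dataset totalling 595GB, and it ignores protocol overhead entirely.

```python
# Figures taken from the methodology slide.
es_nodes = 20
instances_per_node = 2
nic_gbps_per_host = 2 * 10   # 2x10GbE
load_generators = 5
blades = 15
tb_per_blade = 17
usable_tb = 162
flashblade_gbps = 8 * 40     # 8x40GbE
dataset_gb = 595
dataset_copies = 8

es_instances = es_nodes * instances_per_node
es_network_gbps = es_nodes * nic_gbps_per_host
loadgen_network_gbps = load_generators * nic_gbps_per_host
raw_tb = blades * tb_per_blade
usable_fraction = usable_tb / raw_tb
gb_per_copy = dataset_gb / dataset_copies

print(f"Elasticsearch instances:        {es_instances}")
print(f"Elasticsearch NIC aggregate:    {es_network_gbps} Gb/s")
print(f"Load generator NIC aggregate:   {loadgen_network_gbps} Gb/s")
print(f"FlashBlade NIC aggregate:       {flashblade_gbps} Gb/s")
print(f"FlashBlade raw capacity:        {raw_tb} TB")
print(f"Usable / raw capacity:          {usable_fraction:.1%}")
print(f"Dataset size per copy:          {gb_per_copy:.1f} GB")
```

Under these assumptions the load generators have the smallest aggregate network budget of the three tiers, so ingest bandwidth from the generators is worth checking before attributing a ceiling to storage.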
Over 3M documents per second and as fast as local SSD
Chart: Indexing throughput (millions of documents per sec) vs. number of nodes (2x Elasticsearch instances per node), for Direct-attach SSD and FlashBlade
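A quick back-of-the-envelope reading of the headline number, assuming the "over 3M documents per second" figure corresponds to the full 20-node configuration with 2 Elasticsearch instances per node. Since the slide says "over", the results are lower bounds.

```python
# Assumption: the >3M docs/s headline is read at the 20-node point of the chart.
total_docs_per_sec = 3_000_000
nodes = 20
instances_per_node = 2

docs_per_node = total_docs_per_sec / nodes
docs_per_instance = docs_per_node / instances_per_node

print(f"{docs_per_node:,.0f} docs/s per node")
print(f"{docs_per_instance:,.0f} docs/s per instance")
```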
High throughput at consistently low latency
Real-time monitoring aids performance tuning and diagnostics
Lessons learned
① Don’t fear “remote-mounted storage”!
② Do extensive testing at scale
=> Elasticsearch needs high storage throughput at consistent, low-enough latency
③ Disaggregation enables service delivery and scaling models not possible with local drives.
THANK YOU!
bgold@purestorage.com
@briantgold