Scaling ElasticSearch

      SF Meetup
      2012.10.03


                       Sushant Shankar
                   sushant.shankar@33across.com
Agenda
•   Why we need a search engine
•   Monitoring
•   Index Building
•   Query Performance
Who is 33Across
>600,000 Publishers
Machine Learning and Graph algorithms to:
- Build advertising segments
- Extract insights out of social and interest data
- Target via high-performance distributed systems that
  integrate with our advertising partners




Why we really need a search engine
         Batch! Good for complicated tasks
         (Machine Learning, Graph Algorithms, etc.)




INDEX BUILDING
       1 WEEK → 3 HOURS
Mappers to build index
Build index using MR job and Bulk API

• 6 nodes, 24GB RAM
• 16GB for ES service
• 4 cores
• 3x 1.5TB drive

• >1TB/index (replicated)
• ~300M documents
• ~5KB / document
• ~3 hours
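
As a rough sanity check on the figures above, the arithmetic can be sketched as follows. This is a back-of-envelope estimate, not a measurement from the talk: it assumes decimal units (1KB = 1,000 bytes), an even spread of work across the 6 nodes, and treats the "~" values as exact.

```python
# Figures taken from the slides; all derived values are estimates.
DOCS = 300_000_000        # ~300M documents
DOC_BYTES = 5_000         # ~5KB per document (decimal units assumed)
BUILD_SECONDS = 3 * 3600  # ~3 hours
NODES = 6

raw_tb = DOCS * DOC_BYTES / 1e12              # raw source size in TB
docs_per_sec = DOCS / BUILD_SECONDS           # cluster-wide indexing rate
docs_per_sec_per_node = docs_per_sec / NODES  # assumes an even spread
mb_per_sec = docs_per_sec * DOC_BYTES / 1e6   # cluster-wide ingest rate
speedup = (7 * 24) / 3                        # 1 week -> 3 hours

print(f"raw data: {raw_tb:.1f} TB")
print(f"indexing rate: {docs_per_sec:,.0f} docs/s "
      f"({docs_per_sec_per_node:,.0f} per node, {mb_per_sec:.0f} MB/s)")
print(f"speed-up over the 1-week build: {speedup:.0f}x")
```

The raw size of about 1.5 TB is consistent with the ">1TB/index" figure on the slide.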
Monitoring: Zabbix
Monitoring: SPM
Parameter Optimization
[Figure labels: Amount bulk indexed, # Shards; Time taken, CPU util., Mem util., Disk I/O, Network]
Index Building: Learnings
• Bulk API
• No replicas
• 2 shards / CPU
• 10,000 documents (users) per indexing
  request
• Refresh off (index.refresh_interval = -1)
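
The learnings above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the talk: the helper names `batches` and `bulk_body` and the `(doc_id, source)` tuple shape are hypothetical, and actually sending the requests (an index-settings update followed by POSTs to the index's `_bulk` endpoint) is left out. The settings keys and the newline-delimited action/source layout follow the Elasticsearch Bulk API format; older releases also expect a `_type` in the action line.

```python
import json
from itertools import islice

BATCH_SIZE = 10_000  # documents per bulk request, from the slide

# Index settings for the build phase: no replicas, refresh disabled.
BUILD_SETTINGS = {"index": {"number_of_replicas": 0, "refresh_interval": "-1"}}


def batches(docs, size=BATCH_SIZE):
    """Yield lists of at most `size` items from any iterable."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def bulk_body(chunk):
    """Serialize (doc_id, source) pairs as a Bulk API request body.

    Each document becomes two JSON lines: an action line, then the source.
    The body must end with a newline.
    """
    lines = []
    for doc_id, source in chunk:
        lines.append(json.dumps({"index": {"_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"


# Usage: 25,000 hypothetical user documents become three bulk requests.
docs = ((str(i), {"user": i}) for i in range(25_000))
bodies = [bulk_body(chunk) for chunk in batches(docs)]
print(len(bodies))
```

Streaming the documents through a generator keeps memory bounded at one batch per mapper, which matters when each batch is roughly 50MB at ~5KB per document.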
QUERY PERFORMANCE
     5 MINUTES → 10 SECONDS
Query Performance: Learnings
•   1-2 Replicas (and for reliability)
•   Turn refresh on again (5s default)
•   Warm up effect (Index Warm up API 0.20+)
•   Optimize API
•   Simulate multiple users
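
One way to act on "simulate multiple users" is a small concurrent load harness. The sketch below is an assumption about how that could be done, not the setup from the talk: `run_query` is a hypothetical callable that sends one query to the cluster, and `QUERY_SETTINGS` simply restates the replica and refresh values from the slide as an index-settings body.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Index settings for the query phase, using the values on the slide.
QUERY_SETTINGS = {"index": {"number_of_replicas": 1, "refresh_interval": "5s"}}


def simulate_users(run_query, queries, n_users):
    """Run every query once per simulated user, with users in parallel.

    Returns a flat list of per-query latencies in seconds.
    """
    def one_user(_user_index):
        latencies = []
        for query in queries:
            start = time.perf_counter()
            run_query(query)
            latencies.append(time.perf_counter() - start)
        return latencies

    with ThreadPoolExecutor(max_workers=n_users) as pool:
        per_user = list(pool.map(one_user, range(n_users)))
    return [latency for user in per_user for latency in user]
```

Comparing the latency distribution at 1 user and at, say, 20 users shows whether single-query timings hold up under concurrency.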
Warm Up: load into memory and cache
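
The warm-up effect can be observed by timing the first run of a query against later runs of the same query. A minimal sketch, assuming a hypothetical `run_query` callable that executes one query against the cluster:

```python
import statistics
import time


def warm_up_effect(run_query, query, repeats=5):
    """Time one cold run, then the median of `repeats` warm runs."""
    def timed():
        start = time.perf_counter()
        run_query(query)
        return time.perf_counter() - start

    cold = timed()
    warm = statistics.median([timed() for _ in range(repeats)])
    return cold, warm
```

A large gap between the two numbers suggests that representative queries should be run before the index takes live traffic.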
Other cool features
• Custom Scoring functions
• Scripts – MVEL, Python
• Facets

•   Exploring:
•   Real-time indexing
•   Indexing images, files, etc.
•   Parent-child relationships
QUERIES?
Index Building over time

Editor's Notes

  • #5: Collect information on over 1B users internationally: text copied from over 600K publisher sites, images, searches, pages visited. Different slices of data, now!