Solr Power FTW: Powering NoSQL the World Over
    Alex Pinkin
     @apinkin
    #solrnosql
What Will I Cover?

● Who I am

● What Bazaarvoice does

● SOLR and NoSQL

● Can SOLR handle 20K queries per second?

● Lessons learned: large scale multi data center deployment

● Conclusion
Alex Pinkin

● Software Engineering Lead,
  Data Infrastructure team,
  Bazaarvoice

● Loves to play with SQL and NoSQL



                        @apinkin
Bazaarvoice

● Bazaarvoice is a software as a service
  company powering user generated content
  such as ratings and reviews
  on thousands of web sites


● 5 billion page views per month

● 230 billion impressions

● 75 million UGC
NoSQL ?
SQL vs NoSQL
NoSQL is Not Only SQL

● Departs from relational model

● No fixed schema

● No joins

● Eventual consistency is OK

● Scale horizontally
Types of NoSQL

● Key-value (Redis, Riak, Voldemort)

● Document (MongoDB, CouchDB)

● Graph (Neo4J, FlockDB)

● Column family (Cassandra, HBase)
SOLR as NoSQL

● Non-relational model - Check

● No fixed schema - Check (dynamic fields)

● No joins - Check (denormalization)

● Horizontal scaling - Check (with work)
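
A minimal sketch of what "no joins" means in practice: product attributes are copied onto each review document at index time. The record shapes and field names here are hypothetical, not Bazaarvoice's schema. The `_s`, `_i`, and `_t` suffixes follow the dynamicField patterns in Solr's example schema; a real schema may declare different patterns.

```python
def denormalize(review, product):
    """Flatten a review and its product into one Solr-style document.

    Field names are illustrative. The suffixes assume dynamicField
    patterns like *_s (string), *_i (int), and *_t (text) are declared.
    """
    doc = {"id": f"review-{review['id']}"}
    doc["rating_i"] = review["rating"]
    doc["text_t"] = review["text"]
    # Copy product attributes onto the review so no join is needed at query time.
    doc["product_id_s"] = product["id"]
    doc["product_name_s"] = product["name"]
    doc["product_category_s"] = product["category"]
    return doc


review = {"id": 42, "rating": 5, "text": "Works great"}
product = {"id": "p-7", "name": "Blender", "category": "kitchen"}
doc = denormalize(review, product)
```

The trade-off is that a change to a product has to be re-applied to every review document that carries a copy of it.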
SOLR stats - Bazaarvoice

     Documents            250 MM
     Index size           200 GB
     QPS, avg             2,350
     QPS, max             10,200
     Response time, avg   12 ms
     Servers              6+20
SOLR Case Study
Life Before SOLR

● Indexes for sorting and filtering

● Aggregate tables for stats

● Nightly jobs

● Bugs...
Enter SOLR

● Index content and product catalog
● De-normalization
● Filtering and sorting
● Index every 15 minutes (20 seconds NRT)
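
As an illustration of filtering and sorting, the sketch below builds the query string for a select request. `q`, `fq`, `sort`, and `rows` are standard Solr query parameters; the field names (`product_id_s`, `rating_i`, `submission_time_dt`) are assumptions made up for this example.

```python
from urllib.parse import urlencode


def build_select_params(product_id, min_rating, rows=10):
    """Build a query string that filters by product and rating, newest first.

    Field names are hypothetical; each fq clause is a separate filter.
    """
    params = [
        ("q", "*:*"),
        ("fq", f"product_id_s:{product_id}"),
        ("fq", f"rating_i:[{min_rating} TO *]"),
        ("sort", "submission_time_dt desc"),
        ("rows", str(rows)),
    ]
    return urlencode(params)


query_string = build_select_params("p-7", 4)
```

Keeping each restriction in its own `fq` lets Solr cache the filters independently of the main query.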
SOLR - Statistics

● COUNT, SUM, AVG, MIN, MAX
  (StatsComponent)

● Stored fields

● Whenever content changes,
  re-calc stats for all affected subjects
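
The re-calculation step can be sketched as one StatsComponent request per affected subject. `stats=true` and `stats.field` are the StatsComponent's request parameters; the field names and the shape of the change records are hypothetical.

```python
from urllib.parse import urlencode


def build_stats_params(subject_field, subject_id, stats_field):
    """Query string asking the StatsComponent for stats on one subject."""
    return urlencode([
        ("q", "*:*"),
        ("fq", f"{subject_field}:{subject_id}"),
        ("rows", "0"),  # only the stats section is needed, not the documents
        ("stats", "true"),
        ("stats.field", stats_field),
    ])


def recalc_params_for(changed_reviews):
    """One stats request per distinct subject touched by a batch of changes."""
    subjects = sorted({r["product_id"] for r in changed_reviews})
    return {s: build_stats_params("product_id_s", s, "rating_i") for s in subjects}


changed = [{"product_id": "p-7"}, {"product_id": "p-7"}, {"product_id": "p-9"}]
requests_by_subject = recalc_params_for(changed)
```

Deduplicating subjects first means a burst of edits to one product triggers a single re-calculation.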
Scaling reads - Replication
Replication - Multiple Data Centers

Chatty if using multiple cores

Relay
 ● Core auto-warming disabled
 ● Connection wait and read timeouts increased
 ● Replication poll interval increased (15 min)
 ● Compression enabled
...
<str name="httpConnTimeout">20000</str>
<str name="httpReadTimeout">65000</str>
<str name="pollInterval">00:15:00</str>
<str name="compression">internal</str>
...
SOLR Cloud - Bazaarvoice version

● Multiple cores (100+ per server)

● Re-balance indexes across cores and servers
   ○ Automatic
   ○ Manual

● Deployment map stored in MySQL
   ○ Host - Core - Partition
   ○ Statistics

● Partition lifecycle
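
A rough sketch of how a host-core-partition deployment map could drive re-balancing. The slides do not describe the actual algorithm; the rows, the use of document counts as the load statistic, and the single-move heuristic below are all assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical deployment map rows: (host, core, partition, doc_count).
DEPLOYMENT = [
    ("solr-01", "core-a", "partition-1", 900),
    ("solr-01", "core-b", "partition-2", 700),
    ("solr-01", "core-c", "partition-3", 400),
    ("solr-02", "core-a", "partition-4", 200),
]


def host_load(rows):
    """Total documents per host."""
    load = defaultdict(int)
    for host, _core, _partition, docs in rows:
        load[host] += docs
    return dict(load)


def propose_move(rows):
    """Suggest one partition move from the busiest host to the idlest one.

    Chooses the partition whose move leaves the two hosts closest to
    equal; returns None when no single move narrows the gap.
    """
    load = host_load(rows)
    busiest = max(load, key=load.get)
    idlest = min(load, key=load.get)
    gap = load[busiest] - load[idlest]
    candidates = [r for r in rows if r[0] == busiest and r[3] < gap]
    if not candidates:
        return None
    host, _core, partition, docs = min(candidates, key=lambda r: abs(gap - 2 * r[3]))
    return {"partition": partition, "from": host, "to": idlest, "docs": docs}


move = propose_move(DEPLOYMENT)
```

An automatic re-balancer could apply such proposals in a loop, while a manual one would present them to an operator.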
Schema Changes

Re-indexing is time consuming for large indexes

Process

    1. Full re-index off-line
    2. Incremental indexing prior to release
    3. Incremental indexing after the release

Bottleneck: reading from MySQL

Goal: Transparent re-indexing
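
The three-step process above can be sketched with an in-memory stand-in for the index. The row shape, the `modified` watermark, and the `new_schema` transform are hypothetical; in the deck the source rows come from MySQL.

```python
def full_reindex(source_rows, transform):
    """Step 1: build a fresh index off-line from every source row."""
    return {row["id"]: transform(row) for row in source_rows}


def incremental_index(index, source_rows, transform, since):
    """Steps 2 and 3: re-apply only rows modified after the `since` watermark.

    Returns the new watermark to use for the next incremental pass.
    """
    latest = since
    for row in source_rows:
        if row["modified"] > since:
            index[row["id"]] = transform(row)
            latest = max(latest, row["modified"])
    return latest


def new_schema(row):
    # Hypothetical transform into the new schema's field names.
    return {"id": row["id"], "rating_i": row["rating"]}


rows = [{"id": 1, "rating": 4, "modified": 10}, {"id": 2, "rating": 5, "modified": 20}]
index = full_reindex(rows, new_schema)
watermark = max(r["modified"] for r in rows)

rows.append({"id": 3, "rating": 3, "modified": 30})  # written during the rebuild
rows[0] = {"id": 1, "rating": 2, "modified": 35}      # edited before the release
watermark = incremental_index(index, rows, new_schema, watermark)
```

Because every pass reads from the source database, read throughput there bounds how quickly the new index can catch up.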
Performance Tuning

 ● Heap size
 ● Cache sizing
 ● Auto-warming
 ● Stored fields
 ● Merge factor
 ● Commit frequency
 ● Optimize frequency
Process: Simulate and measure
 ● Replay logs
 ● Analyze metrics
 ● Monitor GC
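
A minimal log-replay harness might look like the sketch below. The `send` callable is a stand-in for an HTTP request to a test Solr instance, and the nearest-rank 95th percentile is one arbitrary choice of metric; neither comes from the slides.

```python
import math
import statistics
import time


def replay(queries, send):
    """Send each logged query through `send` and record latency in milliseconds."""
    latencies = []
    for query in queries:
        start = time.perf_counter()
        send(query)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies_ms):
    """Count, mean, nearest-rank p95, and max for a non-empty latency list."""
    ordered = sorted(latencies_ms)
    p95_index = math.ceil(0.95 * len(ordered)) - 1
    return {
        "count": len(ordered),
        "avg_ms": statistics.mean(ordered),
        "p95_ms": ordered[p95_index],
        "max_ms": ordered[-1],
    }


def fake_send(query):
    # Placeholder for a real request; returns immediately.
    return len(query)


logged = ["q=*:*&fq=product_id_s:p-7", "q=*:*&rows=0&stats=true"]
report = summarize(replay(logged, fake_send))
```

Running the same replay before and after each configuration change keeps the comparison fair.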
Performance Tuning - GC

# Java memory usage settings
# Force the NewSize to be larger than the JVM typically allocates.
# In practice, the JVM has been allocating an extremely small Young generation,
# which causes objects to be prematurely promoted to the Tenured generation.
JAVA_MEM_OPTS="-Xms27g -Xmx27g -XX:NewRatio=8"

# -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps --> Turn on GC logging
# -XX:+UseConcMarkSweepGC        --> Use the concurrent collector
# -XX:+CMSIncrementalMode        --> Incremental mode for the concurrent collector
# -XX:+CMSIncrementalPacing      --> Let the JVM adjust the amount of incremental collection

JAVA_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=55 -XX:ParallelGCThreads=8 -XX:SurvivorRatio=4"
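
To see what these ratios imply, the arithmetic can be worked through as a short sketch. It assumes the usual meaning of the flags (NewRatio is old:young, SurvivorRatio is eden:one survivor space); the JVM aligns and rounds real sizes, so treat the numbers as approximate.

```python
def generation_sizes(heap_gb, new_ratio, survivor_ratio):
    """Approximate generation sizes implied by -XX:NewRatio and -XX:SurvivorRatio.

    NewRatio is old:young, so young = heap / (NewRatio + 1).
    SurvivorRatio is eden:one survivor space, and there are two survivor
    spaces, so survivor = young / (SurvivorRatio + 2).
    """
    young = heap_gb / (new_ratio + 1)
    survivor = young / (survivor_ratio + 2)
    eden = young - 2 * survivor
    return {
        "old_gb": heap_gb - young,
        "young_gb": young,
        "eden_gb": eden,
        "survivor_gb": survivor,
    }


sizes = generation_sizes(27, 8, 4)
```

For the 27 GB heap above this works out to roughly a 3 GB young generation (2 GB eden plus two 0.5 GB survivor spaces) and a 24 GB tenured generation.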
SOLR Performance - Summary

● SOLR loves RAM!

● Log replay → SOLR

● Same config, same hardware

● Get the most out of one instance
Conclusion - SOLR Strengths

● Lightning fast given enough RAM

● Good scale out support
  including multi-data center

● Great community
Conclusion - SOLR's Gaps

● Not fully elastic

● Real time takes work

● Secondary data store = sync overhead

● Schema changes
Questions




            @apinkin