SolrCloud: Searching Big Data
Shalin Shekhar Mangar
What is SolrCloud?
A subset of optional features in Solr that enable and simplify horizontally scaling a search index using sharding and replication.
Goals: performance, scalability, high availability, simplicity, and elasticity
Terminology
● ZooKeeper: Distributed coordination service that provides centralized configuration, cluster state management, and leader election
● Node: JVM process bound to a specific port on a machine; hosts the Solr web application
● Collection: Search index distributed across multiple nodes; each collection has a name, shard count, and replication factor
● Replication Factor: Number of copies of a document in a collection
• Shard: Logical slice of a collection; each shard has a name, hash range, leader, and replication factor. Documents are assigned to one and only one shard per collection using a hash-based document routing strategy
• Replica: Solr index that hosts a copy of a shard in a collection; behind the scenes, each replica is implemented as a Solr core
• Leader: Replica in a shard that assumes special duties needed to support distributed indexing in Solr; each shard has one and only one leader at any time, and leaders are elected using ZooKeeper
High-level Architecture
Collection == Distributed Index
A collection is a distributed index defined by:
• named configuration stored in ZooKeeper
• number of shards: documents are distributed across N partitions of the index
• document routing strategy: how documents get assigned to shards
• replication factor: how many copies of each document are kept in the collection
Collections API:
http://localhost:8983/solr/admin/collections?action=create&name=logstash4solr&replicationFactor=2&numShards=2&collection.configName=logs
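The create call above can also be assembled in code, which avoids escaping mistakes in the query string. This is a minimal sketch using only the Python standard library; the host, port, and parameter values are those from the slide, and the helper name `collections_api_url` is made up for illustration.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # host and port as shown on the slide


def collections_api_url(action, **params):
    """Build a Collections API URL (illustrative helper, not part of Solr)."""
    query = urlencode({"action": action, **params})
    return f"{SOLR_BASE}/admin/collections?{query}"


# Same parameters as the slide: 2 shards, 2 copies of each document.
url = collections_api_url(
    "create",
    name="logstash4solr",
    replicationFactor=2,
    numShards=2,
    **{"collection.configName": "logs"},  # dotted name, so passed via a dict
)
print(url)
```

The request itself is not sent here; the resulting URL can be issued with any HTTP client against a running SolrCloud node.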
Sharding
● Collection has a fixed number of shards
- existing shards can be split
● When to shard?
- Large number of docs
- Large document sizes
- Parallelization during indexing and queries
- Data partitioning (custom hashing)
Document Routing
● Each shard covers a hash range
● Default: hash the document ID into a 32-bit integer, then map it to a range
- leads to (roughly) balanced shards
● Custom hashing (example in a few slides)
● Tri-level: app!user!doc
● Implicit: no hash range set for shards
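The default routing idea (hash the ID to 32 bits, then find the shard whose range contains the hash) can be sketched as follows. This is a simplified model, not Solr's code: it uses CRC32 from the standard library as a stand-in hash and an unsigned 0..2^32-1 hash space, and the function names are hypothetical.

```python
import zlib

HASH_SPACE = 2 ** 32  # size of the 32-bit hash space


def shard_ranges(num_shards):
    """Split [0, 2^32) into num_shards contiguous, near-equal ranges."""
    step = HASH_SPACE // num_shards
    ranges = []
    for i in range(num_shards):
        start = i * step
        # The last shard absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == num_shards - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def route(doc_id, ranges):
    """Return the index of the shard whose range contains hash(doc_id)."""
    h = zlib.crc32(doc_id.encode("utf-8"))  # stand-in hash, unsigned 32-bit
    for shard_no, (start, end) in enumerate(ranges):
        if start <= h <= end:
            return shard_no
    raise ValueError("hash outside all ranges")


ranges = shard_ranges(2)
print(ranges)
print(route("doc-1", ranges))
```

Because the ranges are fixed when the collection is created, the same ID always routes to the same shard, and a reasonably uniform hash spreads documents roughly evenly.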
Replication
• Why replicate?
- High availability
- Load balancing
● How does it work in SolrCloud?
- Near-real-time, not master-slave
- Leader forwards to replicas in parallel, waits for response
- Error handling during indexing is tricky
Example: Indexing
Example: Querying
Distributed Indexing
1. Get cluster state from ZK
2. Route document directly to leader (hash on doc ID)
3. Persist document on durable storage (tlog)
4. Forward to healthy replicas
5. Acknowledge write success to client
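Steps 3 to 5 of the indexing flow can be modeled in memory to make the ordering concrete. This is only a toy sketch: the `Replica` class and `index_document` function are hypothetical, nothing here calls Solr, and forwarding is done sequentially rather than in parallel.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Toy stand-in for a replica: a name, a health flag, and a transaction log."""
    name: str
    healthy: bool = True
    tlog: list = field(default_factory=list)

    def persist(self, doc):
        self.tlog.append(doc)


def index_document(doc, leader, replicas):
    """Leader-side handling of one update, following steps 3-5."""
    leader.persist(doc)                  # step 3: write to the leader's tlog
    forwarded = []
    for replica in replicas:
        if replica.healthy:              # step 4: skip replicas that are down
            replica.persist(doc)
            forwarded.append(replica.name)
    return {"status": "ok", "forwarded_to": forwarded}  # step 5: acknowledge


leader = Replica("replica1")
others = [Replica("replica2"), Replica("replica3", healthy=False)]
result = index_document({"id": "httpd!2"}, leader, others)
print(result)
```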
Shard Leader
● Additional responsibilities during indexing only! Not a master node
● Leader is a replica (handles queries)
● Accepts update requests for the shard
● Increments the _version_ on the new or updated doc
● Sends updates (in parallel) to all replicas
Distributed Queries
1. Query client can be ZK-aware or just query via a load balancer
2. Client can send query to any node in the cluster
3. Controller node distributes the query to a replica for each shard to identify documents matching the query
4. Controller node sorts the results from step 3 and issues a second query for all fields for a page of results
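Steps 3 and 4 describe a two-phase scatter-gather. The sketch below models it with plain dictionaries: phase one collects only scores and IDs from each shard, the controller merges them into one page, and phase two fetches stored fields for just that page. The data layout and the function name `distributed_query` are assumptions made for illustration.

```python
import heapq


def distributed_query(shards, rows):
    """shards: list of {doc_id: {"score": float, "fields": dict}} of matching docs."""
    # Phase 1: every shard reports (score, id) for its own top `rows` hits.
    candidates = []
    for shard_no, shard in enumerate(shards):
        hits = [(doc["score"], doc_id, shard_no) for doc_id, doc in shard.items()]
        candidates.extend(heapq.nlargest(rows, hits))
    # The controller merges per-shard results into the global top page.
    page = heapq.nlargest(rows, candidates)
    # Phase 2: fetch stored fields only for documents on the page.
    return [
        {"id": doc_id, **shards[shard_no][doc_id]["fields"]}
        for _score, doc_id, shard_no in page
    ]


shards = [
    {"a": {"score": 0.9, "fields": {"level_s": "ERROR"}},
     "b": {"score": 0.2, "fields": {"level_s": "INFO"}}},
    {"c": {"score": 0.5, "fields": {"level_s": "WARN"}}},
]
print(distributed_query(shards, rows=2))
```

Fetching fields in a second pass keeps the first pass cheap: only small (score, id) tuples cross the network for documents that may not make the final page.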
Scalability / Stability Highlights
● All nodes in the cluster perform indexing and execute queries; no master node
● Distributed indexing: No SPoF, high throughput via direct updates to leaders, automated failover to a new leader
● Distributed queries: Add replicas to scale out QPS; parallelize complex query computations; fault tolerance
● Indexing / queries continue so long as there is 1 healthy replica per shard
SolrCloud and CAP
● A distributed system should be: Consistent, Available, and Partition tolerant
● CAP says pick 2 of the 3! (slightly more nuanced than that in reality)
● SolrCloud favors consistency over write-availability (CP)
● All replicas in a shard have the same data
● Active replica sets concept (writes accepted so long as a shard has at least one active replica available)
• No tools to detect or fix consistency issues in Solr
– Reads go to one replica; no concept of quorum
– Writes must fail if consistency cannot be guaranteed (SOLR-5468)
ZooKeeper
● Is a very good thing ... clusters are a zoo!
● Centralized configuration management
● Cluster state management
● Leader election (shard leader and overseer)
● Overseer distributed work queue
● Live nodes
– Ephemeral znodes used to signal a server is gone
● Needs 3 nodes for quorum in production
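The three-node recommendation follows from majority arithmetic: a ZooKeeper ensemble stays available only while a strict majority of its servers is up. A short sketch of that calculation (function names are illustrative):

```python
def quorum_size(ensemble_size):
    """Smallest strict majority of the ensemble."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can be lost while a majority remains."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (1, 2, 3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

Three servers is the smallest ensemble that survives one failure, and a fourth server adds no extra tolerance, which is why odd sizes are the usual choice.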
ZooKeeper: Centralized Configuration
● Store config files in ZooKeeper
● Solr nodes pull config during core initialization
● Config sets can be “shared” across collections
● Changes are uploaded to ZK and then collections should be reloaded
ZooKeeper: State Management
● Keep track of /live_nodes znode
● Ephemeral nodes
● ZooKeeper client timeout
● Collection metadata and replica state in /clusterstate.json
● Every core has watchers for /live_nodes and /clusterstate.json
● Leader election
● ZooKeeper sequence numbers on ephemeral znodes
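Election by sequence number can be illustrated without a ZooKeeper client. In the general ZooKeeper recipe, each candidate creates an ephemeral sequential znode and the candidate holding the lowest sequence number leads; when its session ends, its znode disappears and the next lowest takes over. The sketch below models only that rule, with made-up replica names and a plain dictionary standing in for the znodes.

```python
def elect_leader(candidates):
    """candidates maps a replica name to its ephemeral znode's sequence number."""
    if not candidates:
        return None
    # The lowest sequence number wins the election.
    return min(candidates, key=candidates.get)


znodes = {"replica1": 2, "replica2": 0, "replica3": 1}
first = elect_leader(znodes)
print(first)

# Simulate the leader's session expiring: its ephemeral znode is removed.
del znodes[first]
second = elect_leader(znodes)
print(second)
```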
Overseer
● What does it do?
– Persists collection state change events to ZooKeeper
– Controller for Collection API commands
– Ordered updates
– One per cluster (for all collections); elected using leader election
● How does it work?
– Asynchronous (pub/sub messaging)
– ZooKeeper as distributed queue recipe
– Automated failover to a healthy node
– Can be assigned to a dedicated node (SOLR-5476)
Custom Hashing
● Route documents to specific shards based on a shard key component in the document ID
● Send all log messages from the same system to the same shard
● Direct queries to specific shards: q=...&_route_=httpd
{
"id" : "httpd!2",
"level_s" : "ERROR",
"lang_s" : "en",
...
},
Hash: shardKey!docID
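One way to see why IDs sharing a shard key land together: Solr's composite ID routing takes the upper 16 bits of the hash from the shard key and the lower 16 bits from the rest of the ID, so all documents with the same key fall into one narrow slice of the hash space. The sketch below reproduces that bit layout with CRC32 as a stand-in hash; the function names are hypothetical and the hash values will not match Solr's.

```python
import zlib


def h32(text):
    """Stand-in 32-bit hash (CRC32), used here only for illustration."""
    return zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF


def composite_hash(doc_id):
    """Combine shard-key and doc-ID hashes for IDs shaped like shardKey!docID."""
    if "!" not in doc_id:
        return h32(doc_id)  # plain ID: hash the whole thing
    shard_key, rest = doc_id.split("!", 1)
    # Upper 16 bits come from the shard key, lower 16 bits from the remainder.
    return (h32(shard_key) & 0xFFFF0000) | (h32(rest) & 0x0000FFFF)


a = composite_hash("httpd!21")
b = composite_hash("httpd!33")
print(hex(a), hex(b))
print(a >> 16 == b >> 16)  # same shard key, same upper 16 bits
```

Since shard hash ranges are contiguous, two hashes that agree on their upper 16 bits almost always fall in the same shard's range.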
Custom Hashing Highlights
● Co-locate documents having a common property in the same shard
- e.g. docs having IDs httpd!21 and httpd!33 will be in the same shard
• Scale up the replicas for specific shards to address high query and/or indexing volume from specific apps
• Not as much control over the distribution of keys
- httpd, mysql, and collectd all in same shard
• Can split unbalanced shards when using custom hashing
Shard Splitting
• Can split shards into two sub-shards
• Live splitting! No downtime needed!
• Requests start being forwarded to sub-shards automatically
• Expensive operation: use as required during low traffic
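A split is requested through the Collections API with the SPLITSHARD action. The sketch below only builds the URL; the host and collection name are reused from the earlier create example on the slides, and `shard1` is an assumed shard name chosen for illustration.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # host and port as on the earlier slide


def split_shard_url(collection, shard):
    """Build a SPLITSHARD request URL (illustrative helper, not part of Solr)."""
    query = urlencode({"action": "SPLITSHARD", "collection": collection, "shard": shard})
    return f"{SOLR_BASE}/admin/collections?{query}"


split_url = split_shard_url("logstash4solr", "shard1")  # shard name is an assumption
print(split_url)
```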
Other features / highlights
• Near-Real-Time Search: Documents are visible within a second or so after being indexed
• Partial Document Update: Just update the fields you need to change on existing documents
• Optimistic Locking: Ensure updates are applied to the correct version of a document
• Transaction log: Better recoverability; peer-sync between nodes after hiccups
• HTTPS
• Use HDFS for storing indexes
• Use MapReduce for building index (SOLR-1301)
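Partial updates and optimistic locking combine naturally in one request body. In Solr's JSON update format, a field value of the form {"set": value} replaces just that field, and including _version_ asks Solr to apply the update only if the stored document still has that version. The sketch below builds such a body; the helper name and the version number are made up for illustration, and nothing is sent to Solr.

```python
import json


def partial_update(doc_id, changes, expected_version=None):
    """Build a JSON body that sets only the given fields on one document."""
    doc = {"id": doc_id}
    for field_name, value in changes.items():
        doc[field_name] = {"set": value}  # atomic "set" modifier
    if expected_version is not None:
        # Optimistic locking: the update should be rejected on a version mismatch.
        doc["_version_"] = expected_version
    return json.dumps([doc])


body = partial_update("httpd!2", {"level_s": "WARN"}, expected_version=12345)
print(body)
```

The body would be posted to the collection's update handler; if another writer changed the document in the meantime, the version check fails and the client can re-read and retry.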
More?
• Workshop: Apache Solr in Minutes tomorrow
• https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide
• shalin@apache.org
• http://twitter.com/shalinmangar
• http://shal.in
Attributions
• Tim Potter's slides on “Introduction to SolrCloud” at Lucene/Solr Exchange 2014
– http://twitter.com/thelabdude
• Erik Hatcher's slides on “Solr: Search at the speed of light” at JavaZone 2009
– http://twitter.com/ErikHatcher
GIDS2014: SolrCloud: Searching Big Data
