Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadhav, Bloomberg L.P.
Never Stop Exploring: 
Pushing the Limits of Solr 
Anirudha Jadhav 
©2014 Bloomberg L.P.
Who am I?
• Big Search and distributed database specialist
• Built a Search-as-a-Service platform
• Lead Search Architect @ Bloomberg Vault
• Credit Derivatives Analytics Engineer @ Bloomberg
• Master's @ Courant Institute of Mathematical Sciences, New York University
• Passionate about search, scuba diving, motorcycles and German Shepherds
bloomberg.com/company
Agenda
• Search at Bloomberg
• Goals and Objectives
• A little background
• Factors affecting indexing
• Our tests and benchmarks
• Design for a better NRT indexer
• Future work
• Q/A
Search at Bloomberg
• News Search
• Federated Search
• Complex re-ranking of search results
• Archival Search
• GeoSpatial Search
• Analytics and Statistics on Search
Objective 
Significantly increase Near Real Time (NRT) indexing throughput 
E.g., building a search application that receives market data
Indexing workflow
Indexing Data Flow in SolrCloud
Indexing Workflow
Consider the sentence: "We were talking about IBM during the fishing trip"

Tokenization
A tokenizer splits the stream of characters into a series of tokens.
[We] [were] [talking] [about] [IBM] [during] [the] [fishing] [trip]

Down Casing
Creates tokens by lowercasing all letters and dropping non-letters.
[we] [were] [talking] [about] [ibm] [during] [the] [fishing] [trip]

Stemming / Lemmatization
Stemming algorithms reduce the words "fishing", "fished", "fish", and "fisher" to the root word "fish".
Lemmatization relates a word to its inflected forms (i.e. fishing -> fished, fishes, fish, but not fisher).
[we] [were] [talking] [about] [ibm] [during] [the] [fishing] [trip]
[talk] [fish]

Stop Word Removal
Remove common stop words ("and", "or", etc.), which introduce noise in the search process.
[talking] [about] [ibm] [fishing] [trip]
[talk] [fish]

Synonym Expansion
Mapping of words based upon a thesaurus (synonyms, acronyms, hypernyms, business rules, etc.).
For example: talk -> chat, IBM -> "big blue", trip -> journey
[talk] [big] [blue] [fish] [journey]
[chat]
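The analysis stages on this slide can be sketched as a small pipeline. This is only a toy illustration, not how Solr or Lucene implement their analyzers: the stop-word list, the synonym map and the suffix-stripping `stem` function are all assumptions chosen to reproduce the slide's example. The sketch also applies stop-word removal before stemming (so the stop list can match unstemmed tokens) and replaces each token with its stem instead of keeping both.

```python
import re

# Hypothetical stop list and thesaurus, chosen to match the slide's example.
STOP_WORDS = {"we", "were", "during", "the", "and", "or"}
SYNONYMS = {"talk": ["chat"], "ibm": ["big", "blue"], "trip": ["journey"]}


def tokenize(text):
    """Split the character stream into tokens, dropping non-letters."""
    return re.findall(r"[A-Za-z]+", text)


def lowercase(tokens):
    """Down-case every token."""
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Toy suffix stripper (not the Porter algorithm)."""
    for suffix in ("ing", "ed", "es", "er", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def expand_synonyms(tokens):
    """Emit each token followed by any synonyms from the thesaurus."""
    expanded = []
    for t in tokens:
        expanded.append(t)
        expanded.extend(SYNONYMS.get(t, []))
    return expanded


def analyze(text):
    """Run the whole chain on one piece of text."""
    tokens = lowercase(tokenize(text))
    tokens = remove_stop_words(tokens)
    tokens = [stem(t) for t in tokens]
    return expand_synonyms(tokens)


if __name__ == "__main__":
    print(analyze("We were talking about IBM during the fishing trip"))
```

Every extra stage adds per-document work at index time, which is one reason the analysis chain matters for indexing throughput.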
Designing the Search Index
Designing a good search application also involves many aspects of user interaction that directly influence indexing design:
• Data type and data distribution
• Server-side parameters
• Networking
• Client-side parameters
• Query patterns
Factors Affecting Indexing
Data and Distribution of Tokens 
Common types of data that we index in a search index
• Textual data (human generated), e.g. messages, news, blogs
• Textual data (machine generated), e.g. logs, tickets
• Numerical data
• Geospatial data
How does this affect search index designs?
• Query speed and indexing speed depend on the size of an index
• Size is dependent on
  • Number of documents in the index
  • Average size of each document
  • Distribution of tokens
  • Index features, e.g. faceting, highlighting
Server-side Factors 
• Ratio of CPUs to the number of Solr cores running
  • 2 Solr indices per CPU or thread
• Disk space
  • Disk space for Solr index * 2 (headroom for merge cycles)
• Memory
  • JVM heap
  • Off heap
  • DocValues
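The two rules of thumb above (two indices per CPU thread, and twice the index size on disk) reduce to simple arithmetic. The sketch below is a hypothetical helper, not part of Solr. The function name, parameters and the example figures are assumptions for illustration only.

```python
def size_node(index_bytes, cpu_threads, indices_per_thread=2, merge_headroom=2):
    """Apply the slide's rules of thumb to one node.

    index_bytes: expected on-disk size of the Solr indices on this node.
    cpu_threads: hardware threads available to Solr on this node.
    """
    return {
        # Segment merges temporarily need extra space, hence the multiplier.
        "disk_bytes": index_bytes * merge_headroom,
        # Upper bound on Solr indices (cores) this node should host.
        "max_indices": cpu_threads * indices_per_thread,
    }


if __name__ == "__main__":
    plan = size_node(index_bytes=500 * 10**9, cpu_threads=16)
    print(plan)
```

Memory is left out on purpose: the slide names JVM heap, off-heap memory and DocValues as factors but gives no ratio for them.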
Networking 
Cluster design considerations
• Should a cluster span data centers?
• Latency between data centers
• Reliability and availability SLAs
• Where does your ZooKeeper ensemble live?
• How many election members
• Consider observers to scale ZooKeeper
• Dynamically promote an observer to election member
Manage concurrent connections on the server
Monitor network latencies for QoS guarantees
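The question of how many election members to run comes down to majority-quorum arithmetic: ZooKeeper needs more than half of its voting members to be available, and observers do not count toward that majority. A minimal sketch, with hypothetical function names:

```python
def quorum_size(voters):
    """Smallest majority of the voting (election) members."""
    return voters // 2 + 1


def tolerated_failures(voters):
    """Voting members that can fail while a majority survives."""
    return voters - quorum_size(voters)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, quorum_size(n), tolerated_failures(n))
```

A 4-member ensemble tolerates no more failures than a 3-member one, which is why ensembles are usually sized to odd numbers and read capacity is added with observers instead of extra voters.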
Client-side Factors 
• Managing and reusing connections
• Which format to use for indexing data
  • javabin
  • csv
  • json
  • xml
• How many simultaneous threads to use
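As one illustration of the format choice, the sketch below builds (but does not send) a JSON update request carrying several documents at once. It assumes Solr's JSON update handler at `/solr/<collection>/update`. The host, the collection name `demo` and the document fields are hypothetical, and JSON is used only because it is easy to show with the standard library, not because the slides recommend it.

```python
import json
import urllib.request


def build_update_request(base_url, collection, docs):
    """Build one HTTP POST that carries a whole batch of documents."""
    url = f"{base_url}/solr/{collection}/update"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_update_request(
        "http://localhost:8983",  # hypothetical Solr node
        "demo",                   # hypothetical collection
        [{"id": "1", "text": "first"}, {"id": "2", "text": "second"}],
    )
    print(req.get_method(), req.full_url, len(req.data), "bytes")
```

In a real client the request would be sent over a pooled, reused connection from a bounded number of worker threads, which covers the other two factors on this slide.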
Experiments with NRT Indexing 
Sending documents to Solr one at a time is not always efficient.
How do you decide how many documents to send?
Collector: a buffer that collects Solr update documents
• Time Triggers (T)
  • Time-based collector on the client side to batch document payloads to Solr
• Document Size Triggers (S)
  • Document-size-based collector on the client side to batch document payloads to Solr
• Document Number Triggers (N)
  • Document-count-based collector on the client side to batch document payloads to Solr
The collectors are all used simultaneously, in order of priority. The lower-priority collectors act as cut-off backups to safeguard against overflows.
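A collector with all three triggers could look roughly like the sketch below. It is not the implementation used in the talk: the class name, the default limits, the `flush` callback and the priority order (time first, then size, then count) are all assumptions. The clock is injectable so the behaviour can be checked without sleeping.

```python
import time


class Collector:
    """Client-side buffer that flushes a batch when a trigger fires."""

    def __init__(self, flush, max_age_s=0.5, max_bytes=1_000_000,
                 max_docs=1000, clock=time.monotonic):
        self._flush = flush          # callback: flush(batch, reason)
        self.max_age_s = max_age_s   # time trigger (T)
        self.max_bytes = max_bytes   # document size trigger (S)
        self.max_docs = max_docs     # document number trigger (N)
        self._clock = clock
        self._docs = []
        self._bytes = 0
        self._started = None

    def add(self, doc):
        """Buffer one serialized document and flush if a trigger fires."""
        if not self._docs:
            self._started = self._clock()  # window opens with the first doc
        self._docs.append(doc)
        self._bytes += len(doc.encode("utf-8"))
        self._maybe_flush()

    def tick(self):
        """Call periodically so the time trigger fires on a quiet stream."""
        self._maybe_flush()

    def _maybe_flush(self):
        if not self._docs:
            return
        # Assumed priority order: T, then S, then N.
        if self._clock() - self._started >= self.max_age_s:
            reason = "T"
        elif self._bytes >= self.max_bytes:
            reason = "S"
        elif len(self._docs) >= self.max_docs:
            reason = "N"
        else:
            return
        batch, self._docs, self._bytes = self._docs, [], 0
        self._flush(batch, reason)


if __name__ == "__main__":
    collector = Collector(lambda batch, reason: print(reason, len(batch)),
                          max_docs=3)
    for i in range(7):
        collector.add(f'{{"id": "{i}"}}')
```

With whichever trigger is primary, the other two act as cut-offs: a burst of large documents hits the size limit before the time window closes, and a trickle of documents is still sent when the window expires.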
Tests and Benchmarks
Benchmarking Setup
• Client application sending data to a 4-way replicated SolrCloud
• 5-node ZooKeeper ensemble
• All tests done with a similar dataset (machine-generated text)
• We synthesize a high-throughput ingest stream, which serves as our input
• Soft commits set at 1 sec
Benchmarking: Time Limit Tests
[Chart: throughput in docs/sec vs. time-trigger collection window in ms]
Benchmarking: Document Limit Tests
[Chart: throughput in docs/sec vs. document-number-trigger collection window in number of documents]
Benchmarking: Byte Limit Tests
[Chart: throughput in docs/sec vs. document-size-trigger collection window in bytes]
Observations 
• On average, we observed a 5x-7x increase in ingestion throughput
• Optimization parameters depend on constantly changing factors
• The tuning variables need to be constantly adjusted for best performance
• How can we use this now?
Design for a better NRT indexer
PID Controller 
Proportional term (P): present
Output proportional to the current error value
Integral term (I): past
Sum of the instantaneous error over time; gives the accumulated offset that should have been corrected previously
Derivative term (D): future
Calculated from the slope of the error over time (its rate of change), multiplied by the derivative gain
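For reference, the three terms combine in the standard textbook form of a PID controller. The symbols are not from the slides and are introduced here as assumptions: r(t) is the setpoint, y(t) the measured process variable, e(t) the error, u(t) the controller output, and K_p, K_i, K_d the tuning gains.

```latex
e(t) = r(t) - y(t)

u(t) = K_p \, e(t) \;+\; K_i \int_0^t e(\tau)\, d\tau \;+\; K_d \, \frac{d e(t)}{dt}
```

In a sampled implementation the integral becomes a running sum of e times the sampling interval, and the derivative becomes the difference between consecutive errors divided by that interval.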
PID implementation in the indexer
[Diagram: the client indexer process runs indexing threads that send documents to SolrCloud. A sampling thread reads the Solr responses to measure the process variable (docs/sec). The PID controller implementation adjusts the control variable, which is one of the triggers, e.g. time (T).]
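The control loop in the diagram can be sketched in discrete form as below. This is not the indexer from the talk. The gains, the target throughput and especially `toy_throughput` (a made-up linear relation between the time window and docs/sec, standing in for the sampling thread's measurements) are assumptions chosen so that the example converges. A real cluster would respond non-linearly and the gains would need tuning.

```python
class PID:
    """Discrete PID controller over a sampled process variable."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self._integral = 0.0
        self._prev_error = None

    def update(self, measured, dt=1.0):
        error = self.setpoint - measured            # present
        self._integral += error * dt                # past
        if self._prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self._prev_error) / dt  # trend
        self._prev_error = error
        return (self.kp * error
                + self.ki * self._integral
                + self.kd * derivative)


def toy_throughput(window_ms):
    """Hypothetical plant: docs/sec grows linearly with the time window."""
    return 40.0 * window_ms


def tune_window(target_docs_per_sec, steps=100, window_ms=50.0):
    """Adjust the time trigger (control variable) toward the target rate."""
    pid = PID(kp=0.005, ki=0.01, kd=0.001, setpoint=target_docs_per_sec)
    for _ in range(steps):
        measured = toy_throughput(window_ms)  # the sampling thread's job
        # Use the controller output as the new window, within sane bounds.
        window_ms = min(5000.0, max(1.0, pid.update(measured)))
    return window_ms


if __name__ == "__main__":
    window = tune_window(20_000.0)
    print(round(window, 2), round(toy_throughput(window), 1))
```

The point of closing the loop is the one made in the observations: the best window depends on constantly changing factors, so a controller that keeps re-measuring docs/sec can follow them where a fixed setting cannot.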
Future Work
• Perfect the PID indexer
• Add it to the YCSB benchmarking framework
• Add other server-side parameters to the PID indexer
• Use the PID indexer along with the YCSB framework to size hardware
QUESTIONS ?
