Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Solr on HDFS 
Past, Present, and Future 
Mark Miller, Cloudera
About Me 
Lucene Committer, Solr Committer. 
Works for Cloudera. 
A lot of work on Lucene, Solr, and SolrCloud.
Some Basics 
Solr 
A distributed, fault-tolerant search engine using Lucene as its core search library. 
HDFS 
A distributed, fault-tolerant filesystem that is part of the Hadoop project.
Solr on HDFS 
Wouldn’t it be nice if Solr could run on HDFS? 
If you are running other things on HDFS, it simplifies operations. 
If you are building indexes with MapReduce, merging them into your cluster becomes 
easy. 
You can do some other cool things when you are using a shared filesystem. 
Most attempts in the past have not really caught on.
Solr on HDFS in the Past. 
• Apache Blur is one of the more successful marriages of Lucene and HDFS. 
• We borrowed some code from them to seed Solr on HDFS. 
• Others have copied indexes between local filesystem and HDFS. 
• Most people felt that running Lucene or Solr straight on HDFS would be too slow.
How HDFS Writes Data 
[Diagram: a Solr write produces one local copy and several remote copies.] 
An attempt is made to make a local copy, and as many remote copies as necessary to 
satisfy the replication factor configuration.
Co-Located Solr and HDFS Data Nodes 
[Diagram: four nodes, each running both an HDFS data node and a Solr node.] 
We recommend that HDFS data nodes and Solr nodes are co-located 
so that the default case involves fast, local data.
Non Local Data 
• The BlockCache is the first line of defense, but it’s good to get local data again. 
• Optimize is a more painful option. 
• An HDFS affinity feature could be useful. 
• A tool that simply wrote out a copy of the index with no merging might be interesting.
HdfsDirectory 
• Fairly simple and straightforward implementation. 
• Full support required making the Directory interface a first class citizen in Solr. 
• The largest part was making Replication work with non-local filesystem directories. 
• With large enough ‘buffer’ sizes, works reasonably well as long as the data is local. 
• Really needs some kind of cache to be reasonable though.
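As a toy illustration of why the ‘buffer’ size matters (this is not Solr’s actual HdfsDirectory code; the `Input` class and its counters are invented for illustration), each buffer refill below stands in for one remote HDFS read:

```java
import java.util.Arrays;

// Toy model of a buffered Directory input: every fill of the read buffer
// stands in for one (expensive) remote HDFS read. With sequential reads,
// larger buffers mean proportionally fewer remote reads.
public class BufferedRead {
    static class Input {
        final byte[] data;      // stands in for a remote index file
        final byte[] buffer;
        int bufferStart = 0;    // file offset of buffer[0]
        int bufferLen = 0;      // valid bytes currently in the buffer
        int pos = 0;            // current read position in the file
        int remoteReads = 0;    // how many times we hit the "remote" store

        Input(byte[] data, int bufferSize) {
            this.data = data;
            this.buffer = new byte[bufferSize];
        }

        byte readByte() {
            if (pos < bufferStart || pos >= bufferStart + bufferLen) {
                // Buffer miss: one simulated remote read to refill.
                remoteReads++;
                bufferStart = pos;
                bufferLen = Math.min(buffer.length, data.length - pos);
                System.arraycopy(data, pos, buffer, 0, bufferLen);
            }
            return buffer[pos++ - bufferStart];
        }
    }

    public static void main(String[] args) {
        byte[] file = new byte[64 * 1024];
        Arrays.fill(file, (byte) 7);

        Input small = new Input(file, 1024);
        Input large = new Input(file, 16 * 1024);
        for (int i = 0; i < file.length; i++) {
            small.readByte();
            large.readByte();
        }
        // Reading 64 KB sequentially: 64 remote reads with a 1 KB buffer,
        // only 4 with a 16 KB buffer.
        System.out.println(small.remoteReads + " " + large.remoteReads);
    }
}
```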
“The Block Cache” 
A replacement for the OS filesystem cache, especially for the case when there is no 
local data. 
Even with local data, making it larger will reduce HDFS traffic in many cases. 
[Diagram: Solr reads from HDFS through the Block Cache.]
Inside the Block Cache. 
ConcurrentLinkedHashMap<BlockCacheKey,BlockCacheLocation> 
ByteBuffer[] banks 
int numberOfBlocksPerBank 
Each ByteBuffer of size ‘blockSize’. 
Used locations tracked by ‘lock’ bitset.
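The structure above can be sketched as follows. This is a simplified model, not the actual Solr classes: the real cache uses a ConcurrentLinkedHashMap and eviction, which are elided here, and the class and method names are invented for illustration.

```java
import java.nio.ByteBuffer;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the Block Cache layout: fixed-size blocks carved
// out of pre-allocated ByteBuffer "banks", a map from cache key to block
// location, and a per-bank bitset tracking which slots are in use.
public class MiniBlockCache {
    final int blockSize;
    final int blocksPerBank;
    final ByteBuffer[] banks;
    final BitSet[] locks;                 // used-slot tracking, one per bank
    final Map<String, int[]> locations = new HashMap<>(); // key -> {bank, slot}

    MiniBlockCache(int numberOfBanks, int blocksPerBank, int blockSize) {
        this.blockSize = blockSize;
        this.blocksPerBank = blocksPerBank;
        banks = new ByteBuffer[numberOfBanks];
        locks = new BitSet[numberOfBanks];
        for (int i = 0; i < numberOfBanks; i++) {
            banks[i] = ByteBuffer.allocate(blocksPerBank * blockSize);
            locks[i] = new BitSet(blocksPerBank);
        }
    }

    boolean store(String key, byte[] block) {
        for (int b = 0; b < banks.length; b++) {
            int slot = locks[b].nextClearBit(0);
            if (slot < blocksPerBank) {
                locks[b].set(slot);
                ByteBuffer dup = banks[b].duplicate();
                dup.position(slot * blockSize);
                dup.put(block, 0, Math.min(block.length, blockSize));
                locations.put(key, new int[] {b, slot});
                return true;
            }
        }
        return false; // cache full (the real cache evicts instead)
    }

    byte[] fetch(String key) {
        int[] loc = locations.get(key);
        if (loc == null) return null;
        byte[] out = new byte[blockSize];
        ByteBuffer dup = banks[loc[0]].duplicate();
        dup.position(loc[1] * blockSize);
        dup.get(out);
        return out;
    }
}
```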
The Global Block Cache 
The initial Block Cache implementation used a separate Block Cache for every unique 
index directory used by Solr in HDFS. 
There are many limitations around this strategy. It hinders capacity planning, it’s not 
very efficient, and it bites you at the worst times. 
The Global Block Cache is meant to be a single Block Cache used by all 
SolrCores for every directory. 
This makes sizing very simple - determine how much RAM you can spare for the Block 
Cache and size it that way once and forget it.
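The sizing arithmetic is simple: a slab is blocksPerBank × blockSize bytes, and the slab count is just the RAM budget divided by that. A small sketch (the 16384 blocks-per-bank figure is an assumption implied by the 128 MB slab and 8 KB block defaults mentioned on the next slide):

```java
// Sizing the global Block Cache: pick a RAM budget, divide by the slab
// size (blocksPerBank * blockSize), and that is your slab count.
public class CacheSizing {
    static long slabBytes(int blocksPerBank, int blockSize) {
        return (long) blocksPerBank * blockSize;
    }

    static int slabCountFor(long ramBudgetBytes, long slabBytes) {
        return (int) (ramBudgetBytes / slabBytes);
    }

    public static void main(String[] args) {
        long slab = slabBytes(16384, 8192);     // 16384 blocks of 8 KB = 128 MB
        long budget = 4L * 1024 * 1024 * 1024;  // spare 4 GB of RAM for the cache
        // 32 slabs of 128 MB fill a 4 GB budget
        System.out.println(slab + " " + slabCountFor(budget, slab));
    }
}
```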
Performance 
In many average cases, performance looks really good - very comparable to local 
filesystem performance, though usually somewhat slower. 
In other cases, adjusting various settings for the Block Cache can help with 
performance. 
We have recently found some changes to improve performance.
Tuning the Block Cache 
Sizing 
By default, each ‘slab’ is 128 MB. Raise the slab count to increase by 128 MB slabs. 
Block Size (8 KB default) 
Not originally configurable, but certain use cases appear to work better with 4 KB.
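In current Solr releases these settings are exposed through HdfsDirectoryFactory in solrconfig.xml. A sketch with illustrative values (the property names follow the Solr Reference Guide; the namenode URL and slab count are placeholders):

```xml
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <bool name="solr.hdfs.blockcache.global">true</bool>
  <!-- each slab is blocksperbank blocks of 8 KB = 128 MB; raise the count for more cache -->
  <int name="solr.hdfs.blockcache.slab.count">4</int>
  <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
</directoryFactory>
```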
HDFS Transaction Log 
We also moved the Transaction Log to HDFS. 
The implementation has held up okay; some improvements are needed, and a large replay 
performance issue has been improved. 
The HDFSDirectory and Block Cache have had a much larger impact. 
HDFS has no truncate support, so we work around it by replaying the whole log in some 
failed-recovery cases where the local filesystem implementation simply drops the log.
The autoAddReplicas Feature 
A new feature that is currently only available when using a shared filesystem like 
HDFS. 
The Overseer monitors the cluster state and, when a node goes down, fires off a SolrCore 
create command pointing to the existing data in HDFS.
The autoAddReplicas Feature (continued) 
[Diagram: the co-located cluster with one HDFS/Solr node crossed out, indicating a failed node.]
The Future 
At Cloudera, we are building an Enterprise Data Hub. 
In our vision, the more that runs on HDFS, the better. 
We will continue to improve and push forward HDFS support in SolrCloud.
Block Cache Improvements 
Apache Blur has a Block Cache V2. 
Uses variable sized blocks. 
Optionally uses Unsafe for direct memory management. 
The V1 Block Cache has some performance limitations. 
* Copying bytes from off heap to IndexInput buffer. 
* Concurrent access of the cache. 
* Sequential reads have to pull a lot of blocks from the cache. 
* Each DirectByteBuffer has some overhead, including a Cleaner object that can affect 
GC and add to RAM requirements.
HDFS Only Replication When Using Replicas 
Currently, if you want to use SolrCloud replicas, data is replicated both by HDFS and 
by Solr. 
HDFS replication factor = 1 is not a very good solution. 
autoAddReplicas is one possible solution. 
We will be working on another solution where only the leader writes to an index in 
HDFS while replicas read from it.
The End 
Mark Miller 
@heismark
