Integrating Hadoop & Solr
Who am I? 
Yann Yu 
Systems Engineer @ Lucidworks
Lucidworks is Search. 
Technology Retail Financial 
Healthcare Services Industrial
Why would you integrate Hadoop and Solr? 
(and how would you do that?)
Hadoop
• Open-source
• Enterprise support
• Cheap, scalable storage
• Distributed computation
• Farm animals for extensibility
Solr
• Open-source, Lucene-based
• Enterprise support
• Real-time queries
• Full-text search
• NoSQL capabilities
• Repeatedly proven in production environments at massive scale
I have Hadoop, why do I need Solr? 
Hadoop excels at storing and processing large amounts of data,
but has difficulty with frequent, random access to it
• NoSQL front-end to Hadoop: enable fast, ad-hoc search across
structured and unstructured big data
• Empower users of all technical abilities to interact with, and derive
value from, big data, all through a natural-language search interface
(no MapReduce, Pig, SQL, etc.)
• Preliminary data exploration and analysis
• Near real-time indexing and querying
• Thousands of simultaneous, parallel requests
• Share machine-learning insights created on Hadoop with a broad
audience through an interactive medium
I have Solr, why do I need Hadoop? 
As Solr indexes grow in size, the size and number of the machines hosting Solr 
must also grow, increasing index time and complexity 
• Least expensive storage solution on the market
• Leverage Hadoop processing power (MapReduce) to build
indexes or send document updates to Solr
• Store Solr indexes and transaction logs within HDFS
• Augment Solr data by storing additional information in Hadoop
for last-second retrieval
So what does this actually look like? 
The enterprise storage situation today 
Enterprise data deployment
Standard document storage and search
• Enterprise documents are stored in HDFS
• The Lucidworks HDFS connector processes documents and sends them to SolrCloud
• Users make ad-hoc, full-text queries across the full content of all documents in Solr
• Users retrieve source files directly from HDFS as necessary
Sink documents into HDFS
• Documents can be migrated from other file
storage systems via Flume or other scripts
• MapReduce allows for batch processing of
documents (e.g. OCR, NER, clustering, etc.)
Index document contents into Solr
• The Lucidworks Hadoop connector parses content from files
using many different tools (Tika, GrokIngest, CSV mapping, Pig, etc.)
• Content and data are added to fields in a Solr document
• The resulting document is sent to Solr for indexing
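The mapping step above can be sketched in a few lines of Python. This is only an illustration of turning parsed content into Solr documents, not the Lucidworks connector itself. The column names, the `FIELD_MAP` table, and the target field names (which follow the `_t` / `_i` dynamic-field suffixes found in Solr's example schemas) are all assumptions made for the example.

```python
import csv
import io
import json

# Hypothetical mapping from CSV column names to Solr field names.
FIELD_MAP = {
    "doc_id": "id",
    "title": "title_t",
    "body": "content_t",
    "pages": "pages_i",
}


def row_to_solr_doc(row):
    """Copy mapped CSV columns into a flat dict shaped like a Solr document."""
    doc = {}
    for column, field in FIELD_MAP.items():
        value = row.get(column)
        if value is None or value == "":
            continue  # skip missing values instead of indexing empty strings
        # Integer fields (by the assumed "_i" suffix) are converted to int.
        doc[field] = int(value) if field.endswith("_i") else value
    return doc


def csv_to_update_body(csv_text):
    """Return a JSON array of documents, the shape accepted by Solr's JSON update handler."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return json.dumps([row_to_solr_doc(row) for row in reader])


sample = "doc_id,title,body,pages\n42,Q3 report,Revenue grew in Q3.,12\n"
print(csv_to_update_body(sample))
```

The resulting JSON string would then be posted to the collection's `/update` endpoint. The sketch stops short of the network call so that it runs without a Solr instance.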
Enable users to search and access content
• Users are empowered with ad-hoc,
full-text search in Solr
• Solr provides standard search tools
such as autocomplete, more-like-this,
spellchecking, faceting, etc.
• Users only access HDFS as needed
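As a rough illustration of what an ad-hoc, faceted search request looks like, the sketch below builds a URL for Solr's `/select` handler using the standard `q`, `rows`, `wt`, `facet`, and `facet.field` parameters. The base URL, the collection name `documents`, and the facet field names are hypothetical.

```python
from urllib.parse import urlencode


def build_search_url(base_url, text, facet_fields=(), rows=10):
    """Build a Solr /select URL for a full-text query with optional field facets."""
    params = [("q", text), ("rows", rows), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        # facet.field may be repeated, once per field to facet on.
        params.extend(("facet.field", field) for field in facet_fields)
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical collection and field names.
url = build_search_url(
    "http://localhost:8983/solr/documents",
    "quarterly report",
    facet_fields=["author_s", "type_s"],
)
print(url)
```

Only the documents that a user actually opens need to be fetched from HDFS. The search itself is served entirely from the Solr index.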
Log record search
High-volume indexing of many small records
• Machine-generated log records are sent to Flume
• Flume forwards the raw log records to Hadoop for archiving
• Flume simultaneously parses the data in each record into a Solr document
and forwards the resulting document to Solr
• Lucidworks SiLK exposes real-time statistics and analytics to end users,
as well as full-text search
Flume archives data in HDFS 
• Flume performs minimal work on log 
files and sends them directly into 
HDFS for archival 
• Under optimal circumstances, the log 
files are sized to the block size of 
HDFS
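A quick back-of-the-envelope calculation shows why file sizing matters. Every file in HDFS costs NameNode metadata, so many tiny files are more expensive to track than a few block-sized ones. The sketch below assumes a 128 MB block size and 10 GB of logs per day. Both figures are assumptions chosen for illustration, not numbers from the slides.

```python
MB = 1024 * 1024
BLOCK_SIZE = 128 * MB          # assumed HDFS block size
DAILY_LOG_BYTES = 10 * 1024 * MB  # assumed daily log volume (10 GB)


def files_needed(total_bytes, roll_size_bytes):
    """Number of files written if a new file is started every roll_size_bytes."""
    return -(-total_bytes // roll_size_bytes)  # integer ceiling division


small = files_needed(DAILY_LOG_BYTES, 1 * MB)
block_sized = files_needed(DAILY_LOG_BYTES, BLOCK_SIZE)
print(small, block_sized)  # prints "10240 80"
```

Under these assumptions, rolling files at the block size writes 80 files per day instead of 10,240 for the same data.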
Flume submits records to Solr 
• Flume processes records, extracting 
strings, ints, dates, times, and other 
information into Solr fields 
• Once the Solr document is created, it 
is submitted to Solr for indexing 
• This process happens in real time,
allowing for near real-time search
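The field extraction described above can be sketched in Python. Flume does this work inside its own pipeline, so the code below only illustrates the transformation and does not use any Flume API. The log layout, the regular expression, the field names, and the assumption that timestamps are in UTC are all hypothetical.

```python
import re
from datetime import datetime, timezone

# Hypothetical log layout: "<date> <time> <LEVEL> <host> status=<code> <message>"
LOG_PATTERN = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<host>\S+) "
    r"status=(?P<status>\d{3}) "
    r"(?P<message>.*)"
)


def log_line_to_solr_doc(line):
    """Parse one log line into typed fields; return None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    # Timestamps are assumed to be UTC; Solr date fields expect ISO 8601 with a trailing "Z".
    ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return {
        "timestamp_dt": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "level_s": match["level"],
        "host_s": match["host"],
        "status_i": int(match["status"]),
        "message_t": match["message"],
    }


line = "2014-07-15 18:30:05 ERROR web-03 status=500 upstream timed out"
print(log_line_to_solr_doc(line))
```

Typing the fields at ingest time (integers, dates, strings) is what later allows range queries, sorting, and faceting on them.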
Real-time analytics dashboard 
• Lucidworks SiLK allows users to create 
simple dashboards through a GUI 
• The Banana dashboard will issue queries 
to Solr, rendering the received data in 
tables, graphs, and other plots 
• Users can also perform full-text search 
across the data, allowing for extremely 
fine granularity
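The slides do not show the requests Banana sends, so the following is only a sketch of the kind of Solr query that could back a time-histogram panel. It uses Solr's standard range-faceting parameters (`facet.range`, `facet.range.start`, `facet.range.end`, `facet.range.gap`). The field name `timestamp_dt` and the default window are assumptions.

```python
from urllib.parse import urlencode


def histogram_params(date_field, start="NOW-1DAY", gap="+1HOUR"):
    """Parameters for a Solr range-facet query that counts documents per time bucket."""
    return {
        "q": "*:*",           # match every document
        "rows": 0,            # only the facet counts are needed, not the documents
        "wt": "json",
        "facet": "true",
        "facet.range": date_field,
        "facet.range.start": start,
        "facet.range.end": "NOW",
        "facet.range.gap": gap,
    }


# Hypothetical date field name.
print(urlencode(histogram_params("timestamp_dt")))
```

Replacing `*:*` with a full-text query narrows the same histogram to matching records, which is how search and analytics combine on one dashboard.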
End 
Find me at: 
yann.yu@lucidworks.com 
@yawnyou 
Any questions?
