Search   Discover   Analyze




Large Scale Search, Discovery and
Analytics with Solr, Mahout and
Hadoop




Grant Ingersoll
Chief Scientist
Lucid Imagination


                                                  |   1
Search is Dead, Long Live Search



   Good keyword search is a commodity and easy to get up and running
   The Bar is Raised
     – Relevance is (always will be?) hard
   Holistic view of the data AND the users is critical

   [Diagram: Documents, Content Relationships, User Interaction, Access]




                                                                       |     2
Topics



   Quick Background and needs
   Architecture
     – Abstract
     – Practical
   SDA In Practice
     – Components
     – Challenges and Lessons Learned
   Wrap Up




                                        |   3
Why Search, Discovery and Analytics (SDA)?

   User Needs
     – Real-time, ad hoc access to content
     – Aggressive Prioritization based on Importance
     – Serendipity
     – Feedback/Learning from past
   Business Needs
     – Deeper insight into users
     – Leverage existing internal knowledge
     – Cost effective

   [Diagram: Search, Discovery, Analytics]




                                                                                |   4
What Do Developers Need for SDA?



   Fast, efficient, scalable search
     – Bulk and Near Real Time Indexing
     – Handle billions of records w/ sub-second search and faceting
   Large scale, cost effective storage and processing capabilities
     – Need whole data consumption and analysis
     – Experimentation/Sampling tools
     – Distributed In Memory where appropriate
   NLP and machine learning tools that scale to enhance discovery and
    analysis




                                                                      |   5
Abstract -> Practical SDA Architecture
   [Architecture diagram, layers from top to bottom]
   Access (API, UI, Visualization)
   Search, Discovery and Analytics
     – Stats Package, Machine Learning, Docs Access, User Modeling
       (Pig, Mahout, R, GATE, Others)
     – Experiment Mgmt
   Content Acquisition
   Computation and Storage
     – Search
     – DB, NoSQL, KV
     – Dist. Process
     – Underneath: Shards, Logs, DFS
   Glue: Admin, Service Mgmt, Data Mgmt
   Provisioning, Monitoring, Infrastructure


                                                                                        |   6
Computation and Storage


   Solr
     • Document Index
     • Document Storage?
     • SolrCloud makes sharding easy
   Hadoop
     • Stores logs, raw files, intermediate files, etc.
     • WebHDFS
     • Small files are an unnatural act
   HBase
     • Metric Storage
     • User Histories
     • Document Storage?

Challenges
     • Who is the authoritative store? Solr or HBase?
     • Real time vs. Batch
     • Where should analysis be done?
                                                                |   7
Search In Practice



   Three primary concerns
     – Performance/Scaling


     – Relevance


     – Operations: monitoring, failover, etc.


   Business typically cares more about relevance
   Devs more about performance (and then ops)




                                                    |   8
Search with Solr: Scaling and NRT



   SolrCloud takes care of distributed indexing and search needs
     – Transaction logs for recovery
     – Automatic leader election, so no more master/worker
     – Have to declare number of shards now, but splitting coming soon
     – Use CloudSolrServer in SolrJ
   NRT Config tips:
     – 1 second soft commits for NRT updates
     – 1 minute hard commits (no searcher reopen)
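
A minimal sketch of the NRT commit tips above, assuming they map onto the autoCommit and autoSoftCommit settings in the updateHandler section of solrconfig.xml. The helper name build_update_handler is hypothetical; it only generates the XML fragment so the two intervals sit side by side. Treat the values as starting points to tune, not as recommended settings for every index.

```python
import xml.etree.ElementTree as ET


def build_update_handler(soft_commit_ms=1000, hard_commit_ms=60000):
    """Build an <updateHandler> fragment with NRT-style commit settings.

    Hypothetical helper for illustration; it is not part of Solr or SolrJ.
    """
    handler = ET.Element("updateHandler", {"class": "solr.DirectUpdateHandler2"})

    # Hard commit: flush to stable storage every minute, without
    # reopening a searcher (visibility is left to soft commits).
    auto_commit = ET.SubElement(handler, "autoCommit")
    ET.SubElement(auto_commit, "maxTime").text = str(hard_commit_ms)
    ET.SubElement(auto_commit, "openSearcher").text = "false"

    # Soft commit: make newly indexed documents searchable every second.
    soft_commit = ET.SubElement(handler, "autoSoftCommit")
    ET.SubElement(soft_commit, "maxTime").text = str(soft_commit_ms)
    return handler


xml_text = ET.tostring(build_update_handler(), encoding="unicode")
print(xml_text)
```

The split keeps the expensive operation (the hard commit) infrequent while the cheap one (the soft commit) controls how quickly updates become visible.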




                                                                         |   9
Search: Relevance



   ABT – Always Be Testing
     – Experiment management is critical
     – Top X + Random Sampling of Long Tail
     – Click logs
   Track Everything!
     – Queries
     – Clicks
     – Displayed Documents
     – Mouse/Scroll tracking???
   Phrases are your friend
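
One way to read "Top X + Random Sampling of Long Tail" is as a recipe for building a relevance test set from the query log: take the most frequent queries, then add a random sample of the rest. The sketch below assumes the log is simply a list of query strings; build_test_queries and its parameters are hypothetical names used for illustration.

```python
import random
from collections import Counter


def build_test_queries(query_log, top_x=2, tail_sample=2, seed=42):
    """Return (head, tail_sample): the top_x most frequent queries plus a
    random sample of the remaining long-tail queries."""
    # Light normalization so "Solr" and "solr " count as the same query.
    counts = Counter(q.strip().lower() for q in query_log)
    ranked = [query for query, _ in counts.most_common()]

    head = ranked[:top_x]
    tail = ranked[top_x:]

    # A fixed seed keeps the sample stable between test runs.
    rng = random.Random(seed)
    sampled = rng.sample(tail, min(tail_sample, len(tail)))
    return head, sampled


log = ["solr", "solr", "solr", "hadoop", "hadoop",
       "mahout", "hbase", "pig", "hive"]
head, tail = build_test_queries(log)
print("head:", head)
print("tail sample:", tail)
```

Fixing the seed means the same tail queries are judged in every run, so relevance numbers stay comparable between experiments.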




                                              |   10
Discovery Components

   Serendipity
     • Trends
     • Topics
     • Recommendations
     • Related Items
     • More Like This
     • Did you mean?
     • Statistically Interesting Phrases
   Organization
     • Importance
     • Clustering
     • Classification
     • Named Entities
     • Time Factors
     • Faceting
   Data Quality
     • Document factor distributions (length, boosts)
     • Duplicates

Challenges
        • Many of these are computationally intense or iterative
        • Many are subjective and require a lot of experimentation


                                                                      |   11
Discovery with Mahout



   Mahout’s 3 “C”s provide tools for helping across many aspects of discovery
     – Collaborative Filtering
     – Classification
     – Clustering
   Also:
     – Collocations (Statistically Interesting Phrases)
     – SVD
     – Others
   Challenges:
     – Iterative machine learning algorithms are costly to run
     – Mahout is very command line oriented
     – Some areas less mature
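
To make "Statistically Interesting Phrases" concrete, here is a small sketch that scores bigrams with the log-likelihood ratio, a common measure for collocations. It is a plain in-memory illustration, not Mahout's code and not distributed; the function names are hypothetical.

```python
import math
from collections import Counter


def x_log_x(x):
    return x * math.log(x) if x > 0 else 0.0


def entropy(*counts):
    """Unnormalized entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table.

    k11: A and B together, k12: A without B,
    k21: B without A,      k22: neither A nor B.
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def score_bigrams(tokens):
    """Return bigrams sorted by descending LLR score."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    first = Counter()   # how often a word starts a bigram
    second = Counter()  # how often a word ends a bigram
    for (a, b), n in bigrams.items():
        first[a] += n
        second[b] += n
    total = sum(bigrams.values())

    scores = {}
    for (a, b), k11 in bigrams.items():
        k12 = first[a] - k11
        k21 = second[b] - k11
        k22 = total - k11 - k12 - k21
        scores[(a, b)] = llr(k11, k12, k21, k22)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


tokens = "new york is big and new york is old but the city is new".split()
for bigram, score in score_bigrams(tokens)[:3]:
    print(bigram, round(score, 2))
```

A score near zero means the two words co-occur about as often as chance would predict; larger scores flag pairs worth surfacing as phrases.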

                                                                             |   12
Aside: Experiment Management



   Plan for running experiments from the beginning across Search and
    Discovery components
     – Your analytics engine should help!
   Types of Experiments to consider
     – Indexing/Analysis
     – Query parsing
     – Scoring formulas
     – Machine Learning Models
     – Recommendations, many more
   Make it easy to do A/B testing across all experiments and compare and
    contrast the results
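
A minimal sketch of one piece of experiment management: assigning users to A/B variants deterministically, so the same user always sees the same variant of a given experiment. Hashing the experiment name together with the user id is an assumption made here, not something the slides prescribe; assign_variant is a hypothetical helper.

```python
import hashlib


def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Deterministically map a user to a variant of an experiment."""
    # Including the experiment name decorrelates assignments across
    # experiments (scoring formulas, query parsing, recommendations, ...).
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return variants[int(digest, 16) % len(variants)]


for user in ("u1", "u2", "u3"):
    print(user, assign_variant(user, "scoring-formula-v2"))
```

Because assignment needs no stored state, the search front end and the offline log analysis can both recompute which variant a user was in.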



                                                                            |   13
Analytics Components



   Commonly used components
     – Solr
     – R Stats
     – Hive
     – Pig
     – Commercial


   Starting with Search and Discovery metrics and analysis gives context into
    where to make investments for broader analytics




                                                                                 |   14
Analytics in Practice



   Simple Counts:
     – Facets
     – Term and Document frequencies
     – Clicks
   Search and Discovery example metrics
     – Relevance measures like Mean Reciprocal Rank
     – Histograms/Drilldowns around Number of Results
     – Log and navigation analysis


   Data cleanliness analysis is helpful for finding potential issues in content
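
As an illustration of the relevance measures above, the following sketch computes Mean Reciprocal Rank from displayed results and click data. It assumes a clicked document counts as relevant, which is a simplification; the data layout and function name are hypothetical.

```python
def mean_reciprocal_rank(results_by_query, relevant_by_query):
    """Average of 1/rank of the first relevant result per query.

    Queries with no relevant result in the list contribute 0.
    """
    if not results_by_query:
        return 0.0
    total = 0.0
    for query, ranked_ids in results_by_query.items():
        relevant = relevant_by_query.get(query, set())
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(results_by_query)


# Displayed documents per query, in ranked order.
results = {
    "solr": ["d1", "d2", "d3"],
    "mahout": ["d4", "d5", "d6"],
    "hbase": ["d7", "d8"],
}
# Documents treated as relevant (here: clicked) per query.
clicked = {"solr": {"d1"}, "mahout": {"d5"}, "hbase": set()}

mrr = mean_reciprocal_rank(results, clicked)
print(round(mrr, 3))  # (1/1 + 1/2 + 0) / 3 = 0.5
```

This only works if queries, displayed documents, and clicks are all logged, which is the point of "Track Everything!".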




                                                                                   |   15
Wrap



   Search, Discovery and Analytics, when combined into a single, coherent
    system, provide powerful insight into both your content and your users


   Solr + Hadoop + Mahout


   Design for the big picture when building search-based applications




                                                                             |   16
Find me



   http://www.lucidimagination.com


   grant@lucidimagination.com
   @gsingers




                                      |   17



Editor's Notes

  • #3: The bar is raised: when we first started Lucid, the problems were all around standing up Lucene or Solr or dealing with performance issues. Now the large majority of them are around taking search to the next level: better relevance, personalization, recommendations, etc.
  • #5: How do you gain insight? The search box is the UI for data these days. Feed improvements back into the system for users. Extract key metrics for business understanding.
  • #6: Make into images?
  • #7: All about ad hoc and bulk storage and computation. All about the analytics that drive your computation. Glue to make it all work together: data where it needs to be, when it needs to be there. All are examples of ways to do this. There are actually a fair number of viable alternatives for all of these pieces, all in open source. I tend to stick to Apache and “commercial” friendly licenses, where possible.
  • #8: Authoritative store: managing across, consistency, etc. Analysis should be done where it most makes sense given the location of the data and the type of analysis being done. Hadoop and HBase stuff are all pretty straightforward.
  • #10: Relevance – plan for relevance testing from day 1.
  • #16: Log and navigation: clicks, search trails, etc. Data cleanliness: never-viewed docs that are related to other documents.
  • #17: Big Picture: too often devs are stuck in the weeds