Search | Discover | Analyze

Large Scale Search, Discovery and Analytics with Solr, Mahout and Hadoop

Grant Ingersoll
Chief Scientist
Lucid Imagination
  	
  
Search is Dead, Long Live Search

•  Good keyword search is a commodity and easy to get up and running
•  The Bar is Raised
   –  Relevance is (always will be?) hard
•  Holistic view of the data AND the users is critical

[Diagram: Documents, Content Relationships, User Interaction, Access]
  	
  
Topics

•  Quick Background and needs
•  Architecture
   –  Abstract
   –  Practical
•  SDA In Practice
   –  Components
   –  Challenges and Lessons Learned
•  Wrap Up
  	
  
Why Search, Discovery and Analytics (SDA)?

•  User Needs
   –  Real-time, ad hoc access to content
   –  Aggressive Prioritization based on Importance
   –  Serendipity
   –  Feedback/Learning from past
•  Business Needs
   –  Deeper insight into users
   –  Leverage existing internal knowledge
   –  Cost effective

[Diagram: Search, Discovery, Analytics]
  	
  
What Do Developers Need for SDA?

•  Fast, efficient, scalable search
   –  Bulk and Near Real Time Indexing
   –  Handle billions of records w/ sub-second search and faceting
•  Large scale, cost effective storage and processing capabilities
   –  Need whole data consumption and analysis
   –  Experimentation/Sampling tools
   –  Distributed In Memory where appropriate
•  NLP and machine learning tools that scale to enhance discovery and analysis
  	
  
Abstract -> Practical SDA Architecture

[Architecture diagram; layers listed from top to bottom]
•  Access (API, UI, Visualization)
•  Search, Discovery and Analytics
   –  Stats Package, Machine Learning, Docs Access, User Modeling (Pig, Mahout, R, GATE, Others)
   –  Experiment Mgmt
•  Content Acquisition
•  Computation and Storage
   –  Search (Shards)
   –  DB, NoSQL, KV (Shards, Logs)
   –  Dist. Process (Shards, DFS)
•  Glue: Admin, Service Mgmt, Data Mgmt
•  Provisioning, Monitoring, Infrastructure
  	
  
Computation and Storage

Solr
•  Document Index
•  Document Storage?
•  SolrCloud makes sharding easy

Hadoop
•  Stores logs, raw files, intermediate files, etc.
•  WebHDFS
•  Small files are an unnatural act

HBase
•  Metric Storage
•  User Histories
•  Document Storage?

Challenges
•  Who is the authoritative store? Solr or HBase?
•  Real time vs. Batch
•  Where should analysis be done?
  	
  
Search In Practice

•  Three primary concerns
   –  Performance/Scaling
   –  Relevance
   –  Operations: monitoring, failover, etc.
•  Business typically cares more about relevance
•  Devs more about performance (and then ops)
  	
  
Search with Solr: Scaling and NRT

•  SolrCloud takes care of distributed indexing and search needs
   –  Transaction logs for recovery
   –  Automatic leader election, so no more master/worker
   –  Have to declare number of shards now, but splitting coming soon
   –  Use CloudSolrServer in SolrJ
•  NRT Config tips:
   –  1 second soft commits for NRT updates
   –  1 minute hard commits (no searcher reopen)
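
The two commit intervals in the NRT tips correspond to the `autoSoftCommit` and `autoCommit` settings in Solr's `solrconfig.xml`. As a minimal sketch, assuming the fragment belongs inside the `updateHandler` section, the Python below builds that fragment with the standard library so the values (in milliseconds) are explicit. The helper name `nrt_commit_config` is hypothetical; only the element names are Solr's.

```python
import xml.etree.ElementTree as ET


def nrt_commit_config(soft_commit_secs=1, hard_commit_secs=60):
    """Build the commit-related part of a solrconfig.xml <updateHandler>.

    Hypothetical helper for illustration only. The element names
    (autoCommit, autoSoftCommit, maxTime, openSearcher) are Solr's.
    """
    handler = ET.Element("updateHandler", {"class": "solr.DirectUpdateHandler2"})

    # Hard commit: flush to stable storage without reopening the searcher.
    hard = ET.SubElement(handler, "autoCommit")
    ET.SubElement(hard, "maxTime").text = str(hard_commit_secs * 1000)  # ms
    ET.SubElement(hard, "openSearcher").text = "false"

    # Soft commit: makes recent updates visible to searches (the NRT part).
    soft = ET.SubElement(handler, "autoSoftCommit")
    ET.SubElement(soft, "maxTime").text = str(soft_commit_secs * 1000)  # ms

    return handler


config = nrt_commit_config()
print(ET.tostring(config, encoding="unicode"))
```

Keeping the hard commit from opening a searcher leaves visibility entirely to the cheaper soft commits, which is the split the slide describes.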
  	
  
Search: Relevance

•  ABT – Always Be Testing
   –  Experiment management is critical
   –  Top X + Random Sampling of Long Tail
   –  Click logs
•  Track Everything!
   –  Queries
   –  Clicks
   –  Displayed Documents
   –  Mouse/Scroll tracking???
•  Phrases are your friend
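
One way to read "Top X + Random Sampling of Long Tail" is as a recipe for choosing which queries to judge when testing relevance: take the most frequent queries, then add a random sample of the rest. The sketch below assumes the query log is simply a list of query strings; the function name `pick_test_queries` and the sample data are made up for illustration.

```python
import random
from collections import Counter


def pick_test_queries(query_log, top_x, tail_sample_size, seed=0):
    """Return (head, tail_sample).

    head: the top_x most frequent distinct queries.
    tail_sample: a uniform random sample of the remaining distinct queries.
    """
    counts = Counter(query_log)
    ranked = [query for query, _ in counts.most_common()]
    head, tail = ranked[:top_x], ranked[top_x:]
    rng = random.Random(seed)  # fixed seed keeps the judged set reproducible
    tail_sample = rng.sample(tail, min(tail_sample_size, len(tail)))
    return head, tail_sample


# Hypothetical query log: two popular queries and four long-tail ones.
log = ["solr"] * 5 + ["hadoop"] * 3 + ["mahout svd", "hbase ttl", "pig udf", "r stats"]
head, tail_sample = pick_test_queries(log, top_x=2, tail_sample_size=2)
print(head, tail_sample)
```

Sampling the tail uniformly over distinct queries, rather than by frequency, keeps rare queries from being crowded out by the head.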
  	
  
Discovery Components

Serendipity
•  Trends
•  Topics
•  Recommendations
•  Related Items
•  More Like This
•  Did you mean?
•  Stat. Interesting Phrases

Organization
•  Importance
•  Clustering
•  Classification
   –  Named Entities
•  Time Factors
•  Faceting

Data Quality
•  Document factor Distributions
   –  Length
   –  Boosts
•  Duplicates

Challenges
•  Many of these are intense calculations or iterative
•  Many are subjective and require a lot of experimentation
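
Two of the Data Quality items, duplicates and document length distributions, are simple enough to sketch on a single machine. The example below assumes documents are available as an id-to-text mapping and only catches exact duplicates after whitespace and case normalization. Near-duplicate detection would need a different technique. All names and sample documents are hypothetical.

```python
import hashlib
import re
import statistics
from collections import defaultdict


def find_exact_duplicates(docs):
    """Group document ids whose normalized text is identical."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        normalized = re.sub(r"\s+", " ", text).strip().lower()
        fingerprint = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[fingerprint].append(doc_id)
    return [ids for ids in groups.values() if len(ids) > 1]


def length_summary(docs):
    """Summarize document length in whitespace-separated tokens."""
    lengths = [len(text.split()) for text in docs.values()]
    return {
        "min": min(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
    }


docs = {
    "d1": "Solr  scales well",
    "d2": "solr scales well ",
    "d3": "Mahout clusters documents at scale",
}
duplicates = find_exact_duplicates(docs)
summary = length_summary(docs)
print(duplicates, summary)
```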
  	
  
Discovery with Mahout

•  Mahout's 3 "C"s provide tools for helping across many aspects of discovery
   –  Collaborative Filtering
   –  Classification
   –  Clustering
•  Also:
   –  Collocations (Statistically Interesting Phrases)
   –  SVD
   –  Others
•  Challenges:
   –  High cost to iterative machine learning algorithms
   –  Mahout is very command line oriented
   –  Some areas less mature
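
To make "Statistically Interesting Phrases" concrete: a common way to score a candidate two-word phrase is the log-likelihood ratio (G^2) over a 2x2 table of bigram counts. The sketch below is a plain single-machine illustration of that scoring, not Mahout's distributed implementation; the function names and the toy sentence are assumptions.

```python
import math
from collections import Counter


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: count of A followed by B       k12: A followed by something else
    k21: something else followed by B   k22: neither A first nor B second
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    score = 0.0
    for observed, r, c in cells:
        if observed > 0:  # a zero cell contributes nothing to the sum
            expected = rows[r] * cols[c] / total
            score += observed * math.log(observed / expected)
    return 2.0 * score


def score_bigrams(tokens):
    """Score every adjacent token pair in a token list by LLR."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    firsts = Counter(a for a, _ in bigrams.elements())
    seconds = Counter(b for _, b in bigrams.elements())
    n = sum(bigrams.values())
    scores = {}
    for (a, b), k11 in bigrams.items():
        k12 = firsts[a] - k11
        k21 = seconds[b] - k11
        k22 = n - k11 - k12 - k21
        scores[(a, b)] = llr(k11, k12, k21, k22)
    return scores


scores = score_bigrams("new york is big and new york is old".split())
for pair, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(value, 3))
```

A pair whose words occur together no more often than chance predicts scores close to zero, so sorting by score pushes genuine phrases toward the top.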
  	
  
Aside: Experiment Management

•  Plan for running experiments from the beginning across Search and Discovery components
   –  Your analytics engine should help!
•  Types of Experiments to consider
   –  Indexing/Analysis
   –  Query parsing
   –  Scoring formulas
   –  Machine Learning Models
   –  Recommendations, many more
•  Make it easy to do A/B testing across all experiments and compare and contrast the results
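
A/B testing across many experiments needs a stable way to decide which variant each user sees. One possible approach, sketched here as an assumption rather than anything the slides prescribe, is to hash the experiment name together with the user id. The same user then always lands in the same variant of a given experiment, and assignments in different experiments are independent. The function name and the experiment name are hypothetical.

```python
import hashlib


def assign_variant(experiment, user_id, variants=("A", "B")):
    """Deterministically map a user to one variant of one experiment."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


# Rough balance check over hypothetical user ids.
counts = {"A": 0, "B": 0}
for i in range(10000):
    counts[assign_variant("scoring-formula-v2", f"user-{i}")] += 1
print(counts)
```

Logging the experiment name and assigned variant alongside each query and click is what later lets the results be compared per variant.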
  	
  
Analytics Components

•  Commonly used components
   –  Solr
   –  R Stats
   –  Hive
   –  Pig
   –  Commercial
•  Starting with Search and Discovery metrics and analysis gives context into where to make investments for broader analytics
  	
  
Analytics in Practice

•  Simple Counts:
   –  Facets
   –  Term and Document frequencies
   –  Clicks
•  Search and Discovery example metrics
   –  Relevance measures like Mean Reciprocal Rank
   –  Histograms/Drilldowns around Number of Results
   –  Log and navigation analysis
•  Data cleanliness analysis is helpful for finding potential issues in content
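
Mean Reciprocal Rank is the average, over queries, of 1/rank of the first relevant result, with queries that have no relevant result contributing 0. The sketch below assumes the first clicked position from the click logs stands in for "first relevant result", which is an approximation; the function name is illustrative.

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """Compute MRR from one entry per query.

    Each entry is the 1-based rank of the first relevant (here: clicked)
    result, or None when nothing relevant was returned or clicked.
    """
    if not first_relevant_ranks:
        return 0.0
    total = sum(1.0 / rank for rank in first_relevant_ranks if rank is not None)
    return total / len(first_relevant_ranks)


# Four queries: first click at rank 1, rank 2, no click, rank 4.
mrr = mean_reciprocal_rank([1, 2, None, 4])
print(mrr)  # (1 + 0.5 + 0 + 0.25) / 4 = 0.4375
```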
  	
  
Wrap

•  Search, Discovery and Analytics, when combined into a single, coherent system, provide powerful insight into both your content and your users
•  Lucid has combined many of these things into LucidWorks Big Data
   –  http://www.lucidimagination.com/products/lucidworks-search-platform/lucidworks-big-data
•  Design for the big picture when building search-based applications
  	
  
Find It

•  http://www.lucidimagination.com/products/lucidworks-search-platform/lucidworks-big-data
•  http://www.lucidimagination.com
•  grant@lucidimagination.com
•  @gsingers
  	
  

