Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr

Search Discover Analyze

Large Scale Search, Discovery and Analytics with Solr, Mahout and Hadoop

Grant Ingersoll
Chief Scientist
Lucid Imagination


Search is Dead, Long Live Search

• Good keyword search is a commodity and easy to get up and running
• The Bar is Raised
– Relevance is (always will be?) hard
• Holistic view of the data AND the users is critical

[Diagram: Documents, Content Relationships, User Interaction, Access]


Topics

• Quick Background and Needs
• Architecture
– Abstract
– Practical
• SDA In Practice
– Components
– Challenges and Lessons Learned
• Wrap Up


Why Search, Discovery and Analytics (SDA)?

• User Needs
– Real-time, ad hoc access to content
– Aggressive Prioritization based on Importance
– Serendipity
– Feedback/Learning from past
• Business Needs
– Deeper insight into users
– Leverage existing internal knowledge
– Cost effective

[Diagram: Search, Discovery, Analytics]


What Do Developers Need for SDA?

• Fast, efficient, scalable search
– Bulk and Near Real Time Indexing
– Handle billions of records w/ sub-second search and faceting
• Large scale, cost effective storage and processing capabilities
– Need whole data consumption and analysis
– Experimentation/Sampling tools
– Distributed In Memory where appropriate
• NLP and machine learning tools that scale to enhance discovery and analysis


Abstract -> Practical SDA Architecture
[Architecture diagram; labels recovered from the slide, top layer to bottom layer]
• Access (API, UI, Visualization)
• Search, Discovery and Analytics Glue: Stats Package, Machine Learning, Docs Access, User Modeling, Admin (tools named: Pig, Mahout, R, GATE, Others)
• Experiment Mgmt, Service Mgmt, Content Acquisition
• Computation and Storage: Search, NoSQL, KV, DB, Dist. Data, Process Mgmt, Logs, DFS, Shards (repeated)
• Provisioning, Monitoring, Infrastructure


Computation and Storage

Solr
• Document Index
• Document Storage?
• SolrCloud makes sharding easy

Hadoop
• Stores logs, raw files, intermediate files, etc.
• WebHDFS
• Small files are an unnatural act

HBase
• Metric Storage
• User Histories
• Document Storage?

Challenges
• Who is the authoritative store? Solr or HBase?
• Real time vs. Batch
• Where should analysis be done?

Search In Practice

• Three primary concerns
– Performance/Scaling
– Relevance
– Operations: monitoring, failover, etc.
• Business typically cares more about relevance
• Devs more about performance (and then ops)


Search with Solr: Scaling and NRT

• SolrCloud takes care of distributed indexing and search needs
– Transaction logs for recovery
– Automatic leader election, so no more master/worker
– The number of shards must be declared up front for now; shard splitting is coming soon
– Use CloudSolrServer in SolrJ
• NRT Config tips:
– 1 second soft commits for NRT updates
– 1 minute hard commits (no searcher reopen)
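
The NRT config tips above can be sketched as a solrconfig.xml fragment. This is a minimal illustration, assuming the Solr 4.x element names autoCommit, autoSoftCommit, maxTime and openSearcher inside updateHandler; the helper name commit_settings_xml is hypothetical and only renders the XML, it does not talk to Solr.

```python
import xml.etree.ElementTree as ET


def commit_settings_xml(soft_commit_ms=1000, hard_commit_ms=60000):
    """Render soft/hard auto-commit settings as an <updateHandler> fragment.

    Hypothetical helper: it only builds XML text for solrconfig.xml.
    """
    handler = ET.Element("updateHandler", {"class": "solr.DirectUpdateHandler2"})

    # Hard commit every minute: flushes to disk, but does not reopen a searcher.
    auto_commit = ET.SubElement(handler, "autoCommit")
    ET.SubElement(auto_commit, "maxTime").text = str(hard_commit_ms)
    ET.SubElement(auto_commit, "openSearcher").text = "false"

    # Soft commit every second: makes recent updates visible to searches.
    auto_soft = ET.SubElement(handler, "autoSoftCommit")
    ET.SubElement(auto_soft, "maxTime").text = str(soft_commit_ms)

    return ET.tostring(handler, encoding="unicode")


fragment = commit_settings_xml()
print(fragment)
```

The intervals are the ones from the slide; the right values for a given cluster depend on update volume and should be measured rather than assumed.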


Search: Relevance

• ABT – Always Be Testing
– Experiment management is critical
– Top X + Random Sampling of Long Tail
– Click logs
• Track Everything!
– Queries
– Clicks
– Displayed Documents
– Mouse/Scroll tracking???
• Phrases are your friend
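
One way to read "Top X + Random Sampling of Long Tail" is as a rule for choosing which queries to judge for relevance. The sketch below assumes the query log is simply a list of query strings; the function name, parameters and sample data are hypothetical.

```python
import random
from collections import Counter


def sample_queries_for_testing(query_log, top_x=2, tail_sample=2, seed=42):
    """Pick the top_x most frequent queries plus a random sample of the rest."""
    counts = Counter(query_log)
    ranked = [query for query, _ in counts.most_common()]

    head = ranked[:top_x]   # most frequent queries, always tested
    tail = ranked[top_x:]   # the long tail, sampled at random

    rng = random.Random(seed)  # fixed seed keeps the test set reproducible
    sampled_tail = rng.sample(tail, min(tail_sample, len(tail)))
    return head, sampled_tail


log = ["solr"] * 5 + ["hadoop"] * 3 + ["mahout", "hbase", "pig", "hive"]
head, tail = sample_queries_for_testing(log)
print(head, tail)
```

Fixing the seed keeps the judged query set stable between runs, so relevance changes can be compared experiment to experiment.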


Discovery Components

Serendipity
• Trends
• Topics
• Recommendations
• Related Items
• More Like This
• Did you mean?
• Stat. Interesting Phrases

Organization
• Importance
• Clustering
• Classification
• Named Entities
• Time Factors
• Faceting

Data Quality
• Document factor Distributions
• Length
• Boosts
• Duplicates

Challenges
• Many of these are computationally intense or iterative
• Many are subjective and require a lot of experimentation


Discovery with Mahout

• Mahout’s 3 “C”s provide tools that help across many aspects of discovery
– Collaborative Filtering
– Classification
– Clustering
• Also:
– Collocations (Statistically Interesting Phrases)
– SVD
– Others
• Challenges:
– Iterative machine learning algorithms are costly to run
– Mahout is very command line oriented
– Some areas less mature
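
To make "Statistically Interesting Phrases" concrete: a common way to score a candidate two-word phrase is the log-likelihood ratio (G^2) over a 2x2 table of bigram counts. The sketch below is a plain Python illustration of that statistic, not Mahout's code; the function name and the example counts are hypothetical.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table of bigram counts.

    k11: A followed by B        k12: A followed by something else
    k21: something else then B  k22: neither A nor B
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))

    g2 = 0.0
    for observed, r, c in cells:
        if observed > 0:  # empty cells contribute nothing to the sum
            expected = rows[r] * cols[c] / total
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


# Words that co-occur far more often than chance get a high score.
print(log_likelihood_ratio(100, 5, 5, 1000))
# Counts that match independence score zero.
print(log_likelihood_ratio(10, 10, 10, 10))
```

Higher scores mean the pair appears together more (or less) often than independence would predict; a threshold on the score is a judgment call and needs experimentation.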


Aside: Experiment Management

• Plan for running experiments from the beginning across Search and Discovery components
– Your analytics engine should help!
• Types of Experiments to consider
– Indexing/Analysis
– Query parsing
– Scoring formulas
– Machine Learning Models
– Recommendations, many more
• Make it easy to do A/B testing across all experiments and compare and contrast the results
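
A/B testing across experiments needs a stable way to put each user in a variant. One possible approach, sketched here under the assumption that users have string IDs and experiments have names, is to hash the pair; assign_variant and the variant labels are hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Deterministically map a user to a variant of one experiment."""
    # Including the experiment name keeps assignments independent
    # across experiments (scoring, query parsing, recommendations, ...).
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


print(assign_variant("user-17", "scoring-formula"))
print(assign_variant("user-17", "query-parser"))
```

Because the assignment is a pure function of user and experiment, it needs no storage, and logged queries and clicks can be attributed to a variant after the fact.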


Analytics Components

• Commonly used components
– Solr
– R Stats
– Hive
– Pig
– Commercial
• Starting with Search and Discovery metrics and analysis gives context for where to invest in broader analytics


Analytics in Practice

• Simple Counts:
– Facets
– Term and Document frequencies
– Clicks
• Search and Discovery example metrics
– Relevance measures like Mean Reciprocal Rank
– Histograms/Drilldowns around Number of Results
– Log and navigation analysis
• Data cleanliness analysis is helpful for finding potential issues in content
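
Mean Reciprocal Rank, named above, averages 1/rank of the first relevant result over a set of queries, counting 0 when nothing relevant is returned. A minimal sketch, assuming relevance judgments are available as a set of document IDs per query (for example, derived from click logs); the data shapes and names are hypothetical.

```python
def mean_reciprocal_rank(ranked_results, relevant):
    """Average of 1/rank of the first relevant document per query."""
    if not ranked_results:
        return 0.0

    total = 0.0
    for query, results in ranked_results.items():
        wanted = relevant.get(query, set())
        for rank, doc_id in enumerate(results, start=1):
            if doc_id in wanted:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_results)


results = {
    "q1": ["d3", "d1", "d2"],  # first relevant hit at rank 2 -> 0.5
    "q2": ["d5", "d6"],        # first relevant hit at rank 1 -> 1.0
    "q3": ["d7"],              # no relevant hit             -> 0.0
}
judgments = {"q1": {"d1"}, "q2": {"d5"}, "q3": {"d9"}}
print(mean_reciprocal_rank(results, judgments))  # 0.5
```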


Wrap

• Search, Discovery and Analytics, when combined into a single, coherent system, provide powerful insight into both your content and your users
• Solr + Hadoop + Mahout
• Design for the big picture when building search-based applications


Find me

• http://www.lucidimagination.com
• grant@lucidimagination.com
• @gsingers

