Large-Scale Search Discovery Analytics with Hadoop, Mahout, Solr

Search, Discover, Analyze

Large Scale Search, Discovery and Analytics with Solr, Mahout and Hadoop

Grant Ingersoll
Chief Scientist
Lucid Imagination


Search is Dead, Long Live Search

•  Good keyword search is a commodity and easy to get up and running
•  The Bar is Raised
   –  Relevance is (always will be?) hard
•  Holistic view of the data AND the users is critical

[Diagram: Documents, Content Relationships, User Interaction, Access]


Topics

•  Quick Background and needs
•  Architecture
   –  Abstract
   –  Practical
•  SDA In Practice
   –  Components
   –  Challenges and Lessons Learned
•  Wrap Up


Why Search, Discovery and Analytics (SDA)?

•  User Needs
   –  Real-time, ad hoc access to content
   –  Aggressive Prioritization based on Importance
   –  Serendipity
   –  Feedback/Learning from past
•  Business Needs
   –  Deeper insight into users
   –  Leverage existing internal knowledge
   –  Cost effective

[Diagram: Search, Discovery, Analytics]


What Do Developers Need for SDA?

•  Fast, efficient, scalable search
   –  Bulk and Near Real Time Indexing
   –  Handle billions of records w/ sub-second search and faceting
•  Large scale, cost effective storage and processing capabilities
   –  Need whole data consumption and analysis
   –  Experimentation/Sampling tools
   –  Distributed In Memory where appropriate
•  NLP and machine learning tools that scale to enhance discovery and analysis


Abstract -> Practical SDA Architecture

[Architecture diagram; labels recovered from the slide, top layer to bottom layer]

•  Access (API, UI, Visualization)
•  Search, Discovery and Analytics Glue
   –  Stats Package (Pig, R)
   –  Machine Learning (Mahout, GATE, Others)
   –  Docs Access
   –  User Modeling
   –  Admin
•  Experiment Mgmt, Service Mgmt
•  Content Acquisition
•  Computation and Storage
   –  Search (Shards)
   –  DB, NoSQL, KV (Shards)
   –  Dist. Data Process Mgmt
   –  Logs
   –  DFS
•  Provisioning, Monitoring, Infrastructure


Computation and Storage

Solr
•  Document Index
•  Document Storage?
•  SolrCloud makes sharding easy

Hadoop
•  Stores Logs, Raw files, intermediate files, etc.
•  WebHDFS
•  Small files are an unnatural act

HBase
•  Metric Storage
•  User Histories
•  Document Storage?

Challenges
•  Who is the authoritative store? Solr or HBase?
•  Real time vs. Batch
•  Where should analysis be done?


Search In Practice

•  Three primary concerns
   –  Performance/Scaling
   –  Relevance
   –  Operations: monitoring, failover, etc.
•  Business typically cares more about relevance
•  Devs more about performance (and then ops)


Search with Solr: Scaling and NRT

•  SolrCloud takes care of distributed indexing and search needs
   –  Transaction logs for recovery
   –  Automatic leader election, so no more master/worker
   –  Have to declare number of shards now, but splitting coming soon
   –  Use CloudSolrServer in SolrJ
•  NRT Config tips:
   –  1 second soft commits for NRT updates
   –  1 minute hard commits (no searcher reopen)
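
The two commit tips above map onto the `autoSoftCommit` and `autoCommit` settings in the `updateHandler` section of solrconfig.xml. The sketch below only generates that XML fragment with Python's standard library so the intervals are explicit; the helper name `commit_config` and its parameters are made up for illustration, and the values (1000 ms and 60000 ms) are simply the slide's numbers, not tuned recommendations.

```python
import xml.etree.ElementTree as ET


def commit_config(soft_commit_ms=1000, hard_commit_ms=60000):
    """Build an <updateHandler> fragment with the commit intervals from the slide.

    Hypothetical helper: it only renders XML text, it does not talk to Solr.
    """
    handler = ET.Element("updateHandler", {"class": "solr.DirectUpdateHandler2"})

    # Hard commit: flush to stable storage, but do not open a new searcher.
    hard = ET.SubElement(handler, "autoCommit")
    ET.SubElement(hard, "maxTime").text = str(hard_commit_ms)
    ET.SubElement(hard, "openSearcher").text = "false"

    # Soft commit: makes recent updates visible to searches (the NRT part).
    soft = ET.SubElement(handler, "autoSoftCommit")
    ET.SubElement(soft, "maxTime").text = str(soft_commit_ms)

    return ET.tostring(handler, encoding="unicode")


if __name__ == "__main__":
    print(commit_config())
```

The printed fragment would normally be merged by hand into an existing solrconfig.xml rather than generated at runtime.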


Search: Relevance

•  ABT: Always Be Testing
   –  Experiment management is critical
   –  Top X + Random Sampling of Long Tail
   –  Click logs
•  Track Everything!
   –  Queries
   –  Clicks
   –  Displayed Documents
   –  Mouse/Scroll tracking???
•  Phrases are your friend
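
One way to read "Top X + Random Sampling of Long Tail" is as a recipe for choosing which queries to judge: take the X most frequent queries from the query log, then add a random sample of the remaining, less frequent ones. The following is a minimal sketch under that reading; the function name, the flat list-of-strings log format, and the fixed seed are assumptions for illustration.

```python
import random
from collections import Counter


def pick_test_queries(query_log, top_x, tail_sample_size, seed=0):
    """Return (head, tail_sample): the top_x most frequent queries plus a
    random sample of the remaining long-tail queries."""
    counts = Counter(query_log)
    ranked = [query for query, _ in counts.most_common()]

    head = ranked[:top_x]
    tail = ranked[top_x:]

    # Seeded generator so the same log yields the same test set on reruns.
    rng = random.Random(seed)
    tail_sample = rng.sample(tail, min(tail_sample_size, len(tail)))
    return head, tail_sample


if __name__ == "__main__":
    log = ["solr"] * 5 + ["hadoop"] * 3 + ["mahout", "hbase", "pig"]
    head, tail_sample = pick_test_queries(log, top_x=2, tail_sample_size=2)
    print("head:", head)
    print("tail sample:", tail_sample)
```

Keeping the seed fixed makes relevance comparisons repeatable across experiments; varying it over time widens coverage of the long tail.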


Discovery Components

Serendipity
•  Trends
•  Topics
•  Recommendations
•  Related Items
•  More Like This
•  Did you mean?
•  Stat. Interesting Phrases

Organization
•  Importance
•  Clustering
•  Classification
•  Named Entities
•  Time Factors
•  Faceting

Data Quality
•  Document factor Distributions
•  Length
•  Boosts
•  Duplicates

Challenges
•  Many of these are intense calculations or iterative
•  Many are subjective and require a lot of experimentation


Discovery with Mahout

•  Mahout's 3 "C"s provide tools for helping across many aspects of discovery
   –  Collaborative Filtering
   –  Classification
   –  Clustering
•  Also:
   –  Collocations (Statistically Interesting Phrases)
   –  SVD
   –  Others
•  Challenges:
   –  High cost to iterative machine learning algorithms
   –  Mahout is very command line oriented
   –  Some areas less mature

Aside: Experiment Management

•  Plan for running experiments from the beginning across Search and Discovery components
   –  Your analytics engine should help!
•  Types of Experiments to consider
   –  Indexing/Analysis
   –  Query parsing
   –  Scoring formulas
   –  Machine Learning Models
   –  Recommendations, many more
•  Make it easy to do A/B testing across all experiments and compare and contrast the results
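
A/B testing across many experiments usually starts with a stable way to assign each user to a variant. A minimal sketch, assuming users are identified by a string id and that each experiment has a name: hash the pair and take it modulo the number of variants, so the same user always sees the same variant within an experiment while different experiments split users independently. The function and variant names are hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Deterministically map (experiment, user_id) to one of the variants."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return variants[int(digest, 16) % len(variants)]


if __name__ == "__main__":
    for user in ("u1", "u2", "u3"):
        print(user, assign_variant(user, "scoring-formula-v2"))
```

Logging the assigned variant alongside queries and clicks is what lets results be compared per experiment later.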


Analytics Components

•  Commonly used components
   –  Solr
   –  R Stats
   –  Hive
   –  Pig
   –  Commercial
•  Starting with Search and Discovery metrics and analysis gives context into where to make investments for broader analytics


Analytics in Practice

•  Simple Counts:
   –  Facets
   –  Term and Document frequencies
   –  Clicks
•  Search and Discovery example metrics
   –  Relevance measures like Mean Reciprocal Rank
   –  Histograms/Drilldowns around Number of Results
   –  Log and navigation analysis
•  Data cleanliness analysis is helpful for finding potential issues in content

Wrap

•  Search, Discovery and Analytics, when combined into a single, coherent system, provide powerful insight into both your content and your users
•  Lucid has combined many of these things into LucidWorks Big Data
   –  http://www.lucidimagination.com/products/lucidworks-search-platform/lucidworks-big-data
•  Design for the big picture when building search-based applications


Find It

•  http://www.lucidimagination.com/products/lucidworks-search-platform/lucidworks-big-data
•  http://www.lucidimagination.com
•  grant@lucidimagination.com
•  @gsingers
