Trey Grainger
Manager, Search Technology Development
@ CareerBuilder.com
Building a Real-time, Big Data Analytics Platform with Solr
Lucene Revolution 2013 - San Diego, CA
Overview
• Intro to Analytics with Solr
• Real-world examples & Faceting deep dive
• Solr enhancements we’re contributing:
– Distributed Pivot Faceting
– Pivoted Percentile/Stats Faceting
• Data Analytics with Solr… the next frontier
My Background
Trey Grainger
Manager, Search Technology Development
@ CareerBuilder.com
Relevant Background
• Search & Recommendations
• High-volume, Distributed Systems
• NLP, Relevancy Tuning, User Group Testing, & Machine Learning
Other Projects
• Co-author: Solr in Action
• Founder and Chief Engineer @ .com
About Search @CareerBuilder
• Over 2.5 million new jobs each month
• Over 50 million actively searchable resumes
• ~300 globally distributed search servers (in the U.S., Europe, & Asia)
• Thousands of unique, dynamically generated indexes
• Over ½ Billion actively searchable documents
• Over 1 million searches an hour
Our Search Platform
• Generic Search API wrapping Solr + our domain stack
• Goal: Abstract away search into a simple API so that any engineer can build a search-based product with no prior search background
• 3 Supported Methods (with rich syntax):
– AddDocument
– DeleteDocument
– Search
*users pass along their own dynamically-defined schemas on each call
• Building out as a RESTful API so search can be used “from anywhere”
Paradigm shift?
• Originally, search engines were designed to return a ranked list of relevant documents
• Then, features like faceting were added to augment the search experience with aggregate information…
• But… what kinds of products could you build if you only cared about the aggregate calculations?
Workforce Supply & Demand
Faceting Overview
/solr/select/?q=…&facet=true
//Field Faceting
&facet.field=city
"facet_fields":{
  "city":[
    "new york, ny",2337,
    "los angeles, ca",1693,
    "chicago, il",1535,
    … ]}
//Range Faceting
&facet.range=years_experience
&facet.range.start=0
&facet.range.end=10
&facet.range.gap=1
&facet.range.other=after
"facet_ranges":{
  "years_experience":{
    "counts":[
      "0",1010035,
      "1",343831,
      …
      "9",121090
    ], …
    "after":59462}}
//Query Faceting
&facet.query={!frange key="0 to 10 km" l=0 u=10 incl=false}geodist()
&facet.query={!frange key="10 to 25 km" l=10 u=25 incl=false}geodist()
&facet.query={!frange key="25 to 50 km" l=25 u=50 incl=false}geodist()
&facet.query={!frange key="50+" l=50 incl=false}geodist()
&sfield=location
&pt=37.7770,-122.4200
"facet_queries":{
  "0 to 10 km":1187,
  "10 to 25 km":462,
  "25 to 50 km":794,
  "50+":105296
},
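The facet responses above list each value and its count as alternating entries in one flat array. A minimal sketch of turning such a list into an ordered map on the client side follows; the class and method names (`FacetCounts`, `fromFlatList`) are hypothetical, and the list is assumed to have already been parsed from JSON by whatever library the client uses.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Solr or SolrJ.
final class FacetCounts {

    // Converts ["new york, ny", 2337, "los angeles, ca", 1693, ...] into an
    // insertion-ordered map so the original facet ordering is preserved.
    static Map<String, Long> fromFlatList(List<?> flat) {
        if (flat.size() % 2 != 0) {
            throw new IllegalArgumentException("expected alternating value/count entries");
        }
        Map<String, Long> counts = new LinkedHashMap<>();
        for (int i = 0; i < flat.size(); i += 2) {
            String value = String.valueOf(flat.get(i));
            long count = ((Number) flat.get(i + 1)).longValue();
            counts.put(value, count);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Values copied from the field-facet response shown above.
        List<Object> city = Arrays.<Object>asList(
            "new york, ny", 2337, "los angeles, ca", 1693, "chicago, il", 1535);
        System.out.println(fromFlatList(city));
    }
}
```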
Multi-select Faceting with Tags / Excludes
*Slide taken from Ch. 8 of Solr in Action
Faceting with No Filters Selected:
&facet=true
&facet.field=state
&facet.field=price_range
&facet.field=city
Faceting with Filter of state:California selected:
&facet=true
&facet.field=state
&facet.field=price_range
&facet.field=city
&fq=state:California
Multi-Select Faceting with state:California selected:
&facet=true
&facet.field={!ex="stateTag"}state
&facet.field=price_range
&facet.field=city
&fq={!tag="stateTag"}state:California
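The tag/exclude pattern above can be generated mechanically: every selected filter gets a tag, and the facet on the same field excludes that tag so its other values keep their counts. The sketch below is illustrative only. The `MultiSelectFacets` class and the `<field>Tag` naming convention are assumptions, and the parameters are returned unencoded (a real client would still URL-encode them and quote values containing spaces).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical parameter builder, not part of Solr or SolrJ.
final class MultiSelectFacets {

    // facetFields: fields to facet on; selected: field -> chosen value.
    static List<String> buildParams(List<String> facetFields, Map<String, String> selected) {
        List<String> params = new ArrayList<>();
        params.add("facet=true");
        for (String field : facetFields) {
            if (selected.containsKey(field)) {
                // Exclude this field's own filter when counting its facet values.
                params.add("facet.field={!ex=\"" + field + "Tag\"}" + field);
            } else {
                params.add("facet.field=" + field);
            }
        }
        for (Map.Entry<String, String> e : selected.entrySet()) {
            // Tag the filter so the matching facet can exclude it.
            params.add("fq={!tag=\"" + e.getKey() + "Tag\"}" + e.getKey() + ":" + e.getValue());
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> selected = new LinkedHashMap<>();
        selected.put("state", "California");
        for (String p : buildParams(Arrays.asList("state", "price_range", "city"), selected)) {
            System.out.println("&" + p);
        }
    }
}
```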
Supply of Candidates
Why Solr for Analytics?
• Allows “ad-hoc” querying of data by keywords
• Is good at on-the-fly aggregate calculations (facets + stats + functions + grouping)
• Solr is horizontally scalable, and thus able to handle billions of documents
• Insanely fast queries, encouraging user exploration
Supply of Candidates
Demand for Jobs
Supply over Demand (Labor Pressure)
Wait, how’d you do that?
Building Blocks…
/solr/select/?q=...&facet=true&facet.field=state
/solr/select/?q=…&facet=true&facet.field=month*
*string field in format 201305
/solr/select/?q=…&facet=true&facet.field=military_experience
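One way to read the "Supply over Demand" view is as a ratio of two facet results: candidate counts per state from the resume index divided by job counts per state from the jobs index. The sketch below assumes that reading; the `LaborPressure` class is hypothetical and the numbers in `main` are made up for illustration, not taken from the talk.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper combining two facet.field=state results.
final class LaborPressure {

    // Returns supply / demand for every state present in both maps.
    // States with no jobs are skipped to avoid division by zero.
    static Map<String, Double> supplyOverDemand(Map<String, Long> candidates, Map<String, Long> jobs) {
        Map<String, Double> ratio = new TreeMap<>();
        for (Map.Entry<String, Long> e : candidates.entrySet()) {
            Long demand = jobs.get(e.getKey());
            if (demand != null && demand > 0) {
                ratio.put(e.getKey(), e.getValue() / demand.doubleValue());
            }
        }
        return ratio;
    }

    public static void main(String[] args) {
        // Illustrative counts only.
        Map<String, Long> candidates = new HashMap<>();
        candidates.put("ca", 300L);
        candidates.put("ny", 100L);
        Map<String, Long> jobs = new HashMap<>();
        jobs.put("ca", 100L);
        jobs.put("ny", 200L);
        System.out.println(supplyOverDemand(candidates, jobs));
    }
}
```

A ratio above 1 means more candidates than openings in that state; below 1 means the reverse.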
Building Blocks…
/solr/select/?
q="construction worker"&
fq=city:"las vegas, nv"&
facet=true&
facet.field=company
/solr/select/?
q="construction worker"&
fq=city:"las vegas, nv"&
facet=true&
facet.field=lastjobtitle
Building Blocks…
/solr/select/?q=...&facet=true&facet.field=experience_ranges
/solr/select/?q=...&facet=true&facet.field=management_experience
That’s pretty basic… what else can you do?
Radius Faceting
Hiring Comparison per Market
Query 1:
/solr/select/?...
fq={!geofilt sfield=latlong pt=37.777,-122.420 d=80}
&facet=true&facet.field=city
"facet_fields":{
  "city":[
    "san francisco, ca",11713,
    "san jose, ca",3071,
    "oakland, ca",1482,
    "palo alto, ca",1318,
    "santa clara, ca",1212,
    "mountain view, ca",1045,
    "sunnyvale, ca",1004,
    "fremont, ca",726,
    "redwood city, ca",633,
    "berkeley, ca",599]}
Query 2:
/solr/select/?...
&facet=true&facet.field=city&
fq=( _query_:"{!geofilt sfield=latlong pt=37.7770,-122.4200 d=20}" //san francisco
OR _query_:"{!geofilt sfield=latlong pt=37.338,-121.886 d=20}" //san jose
…
OR _query_:"{!geofilt sfield=latlong pt=37.870,-122.271 d=20}" //berkeley
)
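Query 2 repeats the same geofilt clause once per city, so it lends itself to being generated. A minimal sketch, assuming the city centers are already available as "lat,lon" strings: the `MultiRadiusFilter` class is hypothetical, and the result is the raw, unencoded `fq` value.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical builder for an OR of geofilt clauses.
final class MultiRadiusFilter {

    // sfield: spatial field name; radiusKm: geofilt distance in kilometers;
    // points: one "lat,lon" string per market.
    static String build(String sfield, int radiusKm, List<String> points) {
        StringBuilder fq = new StringBuilder("fq=(");
        for (int i = 0; i < points.size(); i++) {
            if (i > 0) {
                fq.append(" OR ");
            }
            fq.append("_query_:\"{!geofilt sfield=").append(sfield)
              .append(" pt=").append(points.get(i))
              .append(" d=").append(radiusKm).append("}\"");
        }
        return fq.append(')').toString();
    }

    public static void main(String[] args) {
        // San Francisco and San Jose coordinates as used in Query 2 above.
        System.out.println(build("latlong", 20,
            Arrays.asList("37.7770,-122.4200", "37.338,-121.886")));
    }
}
```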
Geo-spatial Analytics
Customer-specific Analytics
Customer’s Performance vs. Competition
Event Stream Analysis
A/B Testing
1. Hash all users based upon user identifier in application stack
2. Flow user event streams into Solr with user identifier
3. Implement custom hashing function in Solr
4. Facet based upon custom function
5. Profit!
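The talk does not show the hashing function itself, so the following is only a guess at what steps 1 and 3 could look like: a deterministic hash of experiment name plus user identifier, mapped onto the unit interval and cut into equal slices. The use of CRC32, the 1-based group numbering, and returning 0 for users outside the experiment are all assumptions made for this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical stand-in for the hashing logic; not CareerBuilder's implementation.
final class ExperimentGroups {

    // Returns a group number in 1..numGroups, or 0 if the user falls outside
    // the share of traffic covered by the experiment (numGroups * ratioPerGroup).
    static int experimentGroup(String experimentName, String userId,
                               int numGroups, double ratioPerGroup) {
        if (numGroups < 1 || ratioPerGroup <= 0 || numGroups * ratioPerGroup > 1.0 + 1e-9) {
            throw new IllegalArgumentException("groups must cover at most 100% of users");
        }
        // Including the experiment name gives each experiment its own assignment.
        CRC32 crc = new CRC32();
        crc.update((experimentName + ":" + userId).getBytes(StandardCharsets.UTF_8));
        // CRC32 yields an unsigned 32-bit value; scale it into [0, 1).
        double fraction = crc.getValue() / 4294967296.0;
        int index = (int) (fraction / ratioPerGroup);
        return index < numGroups ? index + 1 : 0;
    }

    public static void main(String[] args) {
        System.out.println(experimentGroup("EXPERIMENT_NAME", "user-42", 4, 0.25));
    }
}
```

Because the assignment depends only on the two strings, the application stack and the search engine can compute the same group independently, with no lookup table to keep in sync.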
A/B Testing – Adding Function to Solr
//Custom Solr Plugin (imports assume the Solr/Lucene 4.x package layout)
import java.io.IOException;
import java.util.Map;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.queries.function.FunctionValues;
import org.apache.lucene.queries.function.ValueSource;
import org.apache.lucene.queries.function.docvalues.IntDocValues;
import org.apache.solr.search.FunctionQParser;
import org.apache.solr.search.SyntaxError;
import org.apache.solr.search.ValueSourceParser;

public class ExperimentGroupFunctionParser extends ValueSourceParser {
  // Parses experimentgroup(name, idField, numberOfGroups, ratioPerGroup)
  public ValueSource parse(FunctionQParser fqp) throws SyntaxError {
    String experimentName = fqp.parseArg();
    ValueSource uniqueID = fqp.parseValueSource();
    int numberOfGroups = fqp.parseInt();
    double ratioPerGroup = fqp.parseDouble();
    return new ExperimentGroupFunction(experimentName, uniqueID, numberOfGroups, ratioPerGroup);
  }
}
public class ExperimentGroupFunction extends ValueSource {
  …
  @Override
  public FunctionValues getValues(Map context, AtomicReaderContext readerContext) throws IOException {
    final FunctionValues vals = uniqueIdFieldValueSource.getValues(context, readerContext);
    return new IntDocValues(this) {
      @Override
      public int intVal(int doc) {
        //returns a deterministically hashed integer indicating which test group a user is in
        return GetExperimentGroup(experimentName, vals.strVal(doc), numGroups, ratioPerGroup);
      }
    };
  }
  …
}
//SolrConfig.xml:
<valueSourceParser name="experimentgroup"
  class="com.careerbuilder.solr.functions.ExperimentGroupFunctionParser" />
A/B Testing – Faceting on Function
//Group 1 Time Series Facet Query:
/solr/select?
facet.range=eventDate&
facet.range.start=2013-04-01T01:00:00.000Z&
facet.range.end=2013-04-07T01:00:00.000Z&
facet.range.gap=+1HOUR&
fq={!frange l=1 u=1 key="Group 1"}experimentgroup("EXPERIMENT_NAME", userIdField, 4, 0.25)
//repeat for groups 2-4
…fq={!frange l=2 u=2 key="Group 2"}experimentgroup("EXPERIMENT_NAME", userIdField, 4, 0.25)
…fq={!frange l=3 u=3 key="Group 3"}experimentgroup("EXPERIMENT_NAME", userIdField, 4, 0.25)
…fq={!frange l=4 u=4 key="Group 4"}experimentgroup("EXPERIMENT_NAME", userIdField, 4, 0.25)
Solr Patches in progress
SOLR-2894: “Distributed Pivot Faceting”
Status: We have submitted a stable patch (including distributed refinement) which is working in production at CareerBuilder.
Note: The community has reported issues with non-text fields (e.g. dates) which need to be resolved before the patch is committed.
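To illustrate what "distributed" adds to pivot faceting, here is a simplified sketch of the merge step: each shard returns nested state/city counts and the coordinator sums them. This is not the SOLR-2894 code. The `PivotMerge` class is hypothetical, and it assumes every shard reports complete counts, whereas a real implementation also needs refinement requests because shards normally return only their top values.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical coordinator-side merge of two-level pivot counts.
final class PivotMerge {

    // Each shard result maps state -> (city -> count); counts for the same
    // state/city pair are summed across shards.
    static Map<String, Map<String, Long>> merge(List<Map<String, Map<String, Long>>> shards) {
        Map<String, Map<String, Long>> merged = new TreeMap<>();
        for (Map<String, Map<String, Long>> shard : shards) {
            for (Map.Entry<String, Map<String, Long>> state : shard.entrySet()) {
                Map<String, Long> cities =
                    merged.computeIfAbsent(state.getKey(), k -> new TreeMap<String, Long>());
                for (Map.Entry<String, Long> city : state.getValue().entrySet()) {
                    cities.merge(city.getKey(), city.getValue(), Long::sum);
                }
            }
        }
        return merged;
    }
}
```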
SOLR-3583: “Stats within (pivot) facets”
Status: We have submitted an early patch (built on top of SOLR-2894, distributed pivot facets), which is in production at CareerBuilder.
SOLR-3583: “Stats within (pivot) facets”
/solr/select?q=...&
facet=true&
facet.pivot=state,city&
facet.stats.percentiles=true&
facet.stats.percentiles.averages=true&
facet.stats.percentiles.field=compensation&
f.compensation.stats.percentiles.requested=10,25,50,75,90&
f.compensation.stats.percentiles.lower.fence=1000&
f.compensation.stats.percentiles.upper.fence=200000&
f.compensation.stats.percentiles.gap=1000
"facet_pivot":{
  "state,city":[{
    "field":"state",
    "value":"california",
    "count":1872280,
    "statistics":[
      "compensation",[
        "percentiles",[
          "10.0","26000.0",
          "25.0","31000.0",
          "50.0","43000.0",
          "75.0","66000.0",
          "90.0","94000.0"],
        "percentiles_average",52613.72,
        "percentiles_count",1514592]],
    "pivot":[{
      "field":"city",
      "value":"los angeles, ca",
      "count":134851,
      "statistics":{
        "compensation":[
          "percentiles",[
            "10.0","26000.0",
            "25.0","31000.0",
            "50.0","45000.0",
            "75.0","70000.0",
            "90.0","95000.0"],
          "percentiles_average",54122.45,
          "percentiles_count",213481]}}
      …
    ]}]}
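The lower.fence, upper.fence, and gap parameters suggest that percentiles are estimated from fixed-width buckets rather than from a full sort. The sketch below shows one plausible reading of those parameters and is not taken from the SOLR-3583 patch: values outside the fences are ignored (an assumption), and each percentile is reported as the lower edge of the bucket where the cumulative count first reaches the requested share. The `BucketedPercentiles` class is hypothetical.

```java
// Hypothetical histogram-based percentile estimator.
final class BucketedPercentiles {

    // requested holds percentiles in the range 0..100; the result has the same
    // order, with NaN entries when no value falls inside the fences.
    static double[] estimate(double[] values, double lowerFence, double upperFence,
                             double gap, double[] requested) {
        int buckets = (int) Math.ceil((upperFence - lowerFence) / gap);
        long[] histogram = new long[buckets];
        long counted = 0;
        for (double v : values) {
            if (v < lowerFence || v >= upperFence) {
                continue; // assumption: out-of-fence values are not counted
            }
            int bucket = Math.min(buckets - 1, (int) ((v - lowerFence) / gap));
            histogram[bucket]++;
            counted++;
        }
        double[] result = new double[requested.length];
        for (int r = 0; r < requested.length; r++) {
            double target = requested[r] / 100.0 * counted;
            long cumulative = 0;
            result[r] = Double.NaN;
            for (int b = 0; b < buckets; b++) {
                cumulative += histogram[b];
                if (cumulative > 0 && cumulative >= target) {
                    result[r] = lowerFence + b * gap; // lower edge of the bucket
                    break;
                }
            }
        }
        return result;
    }
}
```

Bucketing keeps memory bounded by the number of buckets rather than the number of documents, and per-shard histograms can be summed, which is what makes this kind of estimate convenient in a distributed setting.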
Real-world Use Case
[Charts: Stats Pivot Faceting (Percentiles); Stats Pivot Faceting (Average); Field Facet]
Data Analytics with Solr… the next frontier…
Solr Analytics Wish List…
• Faceting Architecture Redesign:
– “All” Facets should be pivot facets (default pivot depth = 1)
– Each “Pivot” could be a field, query, or range
– Meta information (like the Stats patch) could be nested in each pivot
– Backwards compatibility with current facet response format for a while…
• Grouping
– Grouping should also support multiple levels (Pivot Grouping)
– If “Pivot Grouping” were supported… what about faceting within each of the pivot groups?
• Remaining caches should be converted to “per-segment” to enable NRT
• These kinds of changes would enable Solr to return rich Data Analytics calculations in a single search call where many calls are required today.
Recap
Billions of documents
+
Ad-hoc querying by keyword
+
Horizontal Scalability
+
Sub-second query responses
+
Facet on “anything” – field, function, etc.
+
User friendly visualizations/UX
--------------------------------------------------------
Building a real-time, Big Data Analytics Platform with Solr
Contact Info
Yes, we are hiring @CareerBuilder. Come talk with me if you are interested…
Trey Grainger
trey.grainger@careerbuilder.com
@treygrainger
Other presentations:
http://www.treygrainger.com http://solrinaction.com
