SOLR SEARCH WITH SPARK FOR BIG DATA ANALYTICS IN ACTION
Romain Rigaux
GOALS

Build a Web app
Quickly explore data
… with Solr
Make Solr / Hadoop easier to use
ARCHITECTURE

“Just a view” on top of the standard Solr API
REST
HISTORY

V1 USER
HISTORY

V1 ADMIN
ARCHITECTURE

NEXT!
A lot of learning, UX boost needed
Simple: users don’t need to know it is Solr
HISTORY

V2 USER
HISTORY

V2 ADMIN
HISTORY

V2 BETTER UX
ARCHITECTURE
/select
/admin/collections
/get
/luke...
/add_widget
/zoom_in
/select_facet
/select_range...
REST AJAX
Templates + JS Model
www….
ARCHITECTURE

UI FOR FACETS
Layout: all the 2D positioning (cell ids), visual, drag&drop
Collection: dashboard, fields, template, widgets (ids)
Query: search terms, selected facets (q, fqs)
ADDING A WIDGET

LIFECYCLE
Load the initial page
Edit mode and Drag&Drop
/solr/zookeeper/clusterstate.json
/solr/admin/luke…
/get_collection
ADDING A WIDGET

LIFECYCLE
/solr/select?stats=true
/new_facet
Select the field
Guess ranges (number or dates)
Rounding (number or dates)
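The range guessing and rounding steps could look roughly like the sketch below. It assumes the field's minimum and maximum come from Solr's stats component (`stats=true`). The function name `guess_range`, the ten-slot default and the power-of-ten rounding rule are illustrative assumptions, not Hue's actual code, and only numbers are handled (dates would need calendar-aware rounding).

```python
import math


def guess_range(min_value, max_value, slots=10):
    """Return a rounded (start, end, gap) covering [min_value, max_value].

    Hypothetical helper: the rounding rule (snap the gap up to a multiple
    of a power of ten) is an assumption made for illustration only.
    """
    if max_value <= min_value:
        raise ValueError("max_value must be greater than min_value")

    raw_gap = (max_value - min_value) / slots
    # Largest power of ten not exceeding the raw gap, e.g. 876542 -> 100000.
    magnitude = 10 ** math.floor(math.log10(raw_gap))
    # Round the gap up to a multiple of that magnitude.
    gap = math.ceil(raw_gap / magnitude) * magnitude
    # Snap the start down to a gap boundary, then extend until max is covered.
    start = math.floor(min_value / gap) * gap
    end = start + gap * math.ceil((max_value - start) / gap)
    return start, end, gap
```

With a made-up observed range of 12 to 8765432, `guess_range(12, 8765432)` returns `(0, 9000000, 900000)`, the same start, end and gap that appear in the deck's `bytes` range facet query.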
ADDING A WIDGET

LIFECYCLE
Query part 1
Query part 2
Augment Solr response
facet.range={!ex=bytes}bytes&f.bytes.facet.range.start=0&f.bytes.facet.range.end=9000000&f.bytes.facet.range.gap=900000&f.bytes.facet.mincount=0&f.bytes.facet.limit=10
q=Chrome&fq={!tag=bytes}bytes:[900000+TO+1800000]
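As a rough illustration of how the two query parts fit together, the sketch below assembles the parameters shown above with Python's standard library. The helper name `range_facet_params` is hypothetical, and `facet=true` is added because Solr needs it to enable faceting; the other parameter names are the ones from the slide.

```python
from urllib.parse import urlencode


def range_facet_params(q, field, start, end, gap, selected=None):
    """Build Solr query parameters for one range facet.

    Hypothetical helper. The facet is excluded ({!ex=...}) from its own
    tagged filter ({!tag=...}) so the other buckets keep their counts
    while one of them is selected.
    """
    params = [
        ("q", q),
        ("facet", "true"),  # not on the slide; required to enable faceting
        ("facet.range", "{!ex=%s}%s" % (field, field)),
        ("f.%s.facet.range.start" % field, start),
        ("f.%s.facet.range.end" % field, end),
        ("f.%s.facet.range.gap" % field, gap),
        ("f.%s.facet.mincount" % field, 0),
        ("f.%s.facet.limit" % field, 10),
    ]
    if selected is not None:
        low, high = selected
        # Tagged filter query for the bucket the user clicked on.
        params.append(("fq", "{!tag=%s}%s:[%s TO %s]" % (field, field, low, high)))
    return params


params = range_facet_params("Chrome", "bytes", 0, 9000000, 900000,
                            selected=(900000, 1800000))
# urlencode percent-encodes the local params and turns spaces into "+".
query_string = urlencode(params)
```

Keeping the parameters as a list of pairs rather than a dict allows several `fq` or `facet.range` entries in the same request.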
{
  'facet_counts':{
    'facet_ranges':{
      'bytes':{
        'start':10000,
        'counts':[
          '900000',
          3423,
          '1800000',
          339,
          ...
        ]
      }
    }
  }
}
{
  ...,
  'normalized_facets':[
    {
      'extraSeries':[],
      'label':'bytes',
      'field':'bytes',
      'counts':[
        {
          'from':'900000',
          'to':'1800000',
          'selected':True,
          'value':3423,
          'field':'bytes',
          'exclude':False
        }
      ], ...
    }
  ]
}
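The "augment Solr response" step can be sketched as follows. Solr returns range facet counts as one flat list alternating bucket lower bounds and counts; the hypothetical helper `normalize_range_facet` pairs them up into the `normalized_facets` shape shown above. How Hue really computes `selected` and `exclude` is not stated in the slides, so both are simplified assumptions here.

```python
def normalize_range_facet(field, facet_range, gap, selected_from=None):
    """Turn Solr's flat [value, count, value, count, ...] list into dicts.

    Hypothetical helper mirroring the 'normalized_facets' shape; the
    'selected' and 'exclude' logic is a simplified assumption.
    """
    flat = facet_range["counts"]
    counts = []
    # Even positions hold bucket lower bounds, odd positions hold counts.
    for lower, count in zip(flat[0::2], flat[1::2]):
        counts.append({
            "from": lower,
            "to": str(int(lower) + gap),
            "selected": lower == selected_from,
            "value": count,
            "field": field,
            "exclude": False,
        })
    return {"label": field, "field": field, "extraSeries": [], "counts": counts}


# Values taken from the Solr response excerpt in the slides.
solr_range = {"counts": ["900000", 3423, "1800000", 339]}
facet = normalize_range_facet("bytes", solr_range, gap=900000,
                              selected_from="900000")
```

Each resulting entry carries everything a widget needs to draw one bar and to build the filter query when that bar is clicked.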
JSON TO WIDGET

{
  "field":"rate_code",
  "counts":[
    {
      "count":97797,
      "exclude":true,
      "selected":false,
      "value":"1",
      "cat":"rate_code"
    } ...

{
  "field":"medallion",
  "counts":[
    {
      "count":159,
      "exclude":true,
      "selected":false,
      "value":"6CA28FC49A4C49A9A96",
      "cat":"medallion"
    } ...

{
  "extraSeries":[],
  "label":"trip_time_in_secs",
  "field":"trip_time_in_secs",
  "counts":[
    {
      "from":"0",
      "to":"10",
      "selected":false,
      "value":527,
      "field":"trip_time_in_secs",
      "exclude":true
    } ...

{
  "field":"passenger_count",
  "counts":[
    {
      "count":74766,
      "exclude":true,
      "selected":false,
      "value":"1",
      "cat":"passenger_count"
    } ...
REPEAT

UNTIL…
GAME CHANGER!
Possibilities
5.1 / 5.2
Analytic Facets
FACET

FUNCTIONS
Count
Sum
Avg
Percentile
Max
...
Count(id)
Sum(bytes)
Avg(mul(price, quantity))
Percentile(salary, 50, 90)
Max(temperature)
...
FACET

FUNCTIONS
SUB “NESTED”

FACETS
top_os: {
  type: terms,
  field: os,
  limit: 5
}
top_os: {
  type: terms,
  field: os,
  limit: 5,
  facet: {
    by_country: {
      type: terms,
      field: country
    }
  }
}
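A minimal sketch of sending such a nested facet to Solr is given below. It builds the definition as a Python dict and serializes it into the `json.facet` request parameter. The field names `os` and `country` come from the slide; the match-all query and `rows=0` are illustrative choices, and the facet type is spelled `terms`, the name used by Solr's JSON Facet API.

```python
import json
from urllib.parse import urlencode

# Nested terms facet: top 5 operating systems, each broken down by country.
facet_request = {
    "top_os": {
        "type": "terms",
        "field": "os",
        "limit": 5,
        "facet": {
            "by_country": {"type": "terms", "field": "country"},
        },
    },
}

# Solr accepts the facet definition as the json.facet request parameter.
# rows=0 (an illustrative choice) skips the documents and returns facets only.
params = urlencode({
    "q": "*:*",
    "rows": 0,
    "json.facet": json.dumps(facet_request),
})
```

The resulting string can be appended to a collection's `/select` URL; nothing is sent over the network in this sketch.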
FUNCTION + NESTED =

ANALYTICS
states: {
  type: terms,
  field: state,
  facet: {
    by_month: {
      type: range,
      field: time,
      start: "TODAY-6MONTHS",
      end: "TODAY",
      gap: "MONTH",
      facet: {
        avg_sal: "avg(salary)"
      }
    }
  }
}
states: {
  type: terms,
  field: state,
  facet: {
    avg_sal: "avg(salary)"
  }
}
OPERATIONS ON

BUCKETS OF DATA
Counts → Functions
OPERATIONS ON

BUCKETS OF DATA
Nested → nD functions
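To make the "counts → functions" and "nested → nD" idea concrete without a Solr server, the sketch below computes the same kind of result locally in plain Python: documents are bucketed by state, then by month, and a function (the average salary) is applied to each bucket. The sample documents and the helper name `nested_avg` are made up for illustration; Solr does this work server-side.

```python
from collections import defaultdict
from statistics import mean

# Made-up sample documents, for illustration only.
docs = [
    {"state": "CA", "month": "2015-05", "salary": 100},
    {"state": "CA", "month": "2015-05", "salary": 120},
    {"state": "CA", "month": "2015-06", "salary": 90},
    {"state": "NY", "month": "2015-06", "salary": 110},
]


def nested_avg(documents, outer, inner, metric):
    """Bucket by `outer`, then by `inner`, and average `metric` per bucket."""
    buckets = defaultdict(lambda: defaultdict(list))
    for doc in documents:
        buckets[doc[outer]][doc[inner]].append(doc[metric])
    # Replace each list of raw values with its count and average.
    return {
        outer_key: {
            inner_key: {"count": len(values), "avg": mean(values)}
            for inner_key, values in inner_buckets.items()
        }
        for outer_key, inner_buckets in buckets.items()
    }


result = nested_avg(docs, "state", "month", "salary")
```

A plain facet would stop at the `count` of each bucket; adding a function and a second bucketing level is what turns the output into a two-dimensional table of averages.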
SEARCH AS ONLY

APP IN HUE
gethue.com/solr-search-ui-only/
• Spark in your browser
• Notebooks
• New REST Server
SPARK

INDEXING
WHAT
• Open source REST for Spark Shell
• Runs locally or inside YARN
• Spark Scala, PySpark and jar/py submission
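A rough sketch of what "REST for Spark Shell" means in practice is given below. It follows the REST API documented for Apache Livy (`POST /sessions` to start an interactive session, `POST /sessions/{id}/statements` to run code in it); the version of the server embedded in Hue at the time of this talk may have differed. The host, port and helper names are assumptions, and the requests are only built, not sent.

```python
import json
from urllib.request import Request

# Assumption: a Livy server reachable locally on port 8998.
LIVY_URL = "http://localhost:8998"
JSON_HEADERS = {"Content-Type": "application/json"}


def create_session_request(kind="pyspark"):
    """Build the request asking Livy to start an interactive session."""
    body = json.dumps({"kind": kind}).encode("utf-8")
    return Request(LIVY_URL + "/sessions", data=body,
                   headers=JSON_HEADERS, method="POST")


def run_statement_request(session_id, code):
    """Build the request submitting a code snippet to an existing session."""
    body = json.dumps({"code": code}).encode("utf-8")
    url = "%s/sessions/%d/statements" % (LIVY_URL, session_id)
    return Request(url, data=body, headers=JSON_HEADERS, method="POST")


# Pass these to urllib.request.urlopen() against a running server to
# actually execute them; session id 0 is a placeholder.
session_req = create_session_request("pyspark")
statement_req = run_statement_request(0, "sc.parallelize(range(100)).sum()")
```

Because the shell lives behind HTTP, a browser notebook only needs to post snippets and poll for their results.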
SPARK

INDEXING
WHAT
https://github.com/cloudera/hue/tree/master/apps/spark/java
LIVY ARCH
Diagram (LOCAL): Livy Server, Livy REPL, Spark Contexts, Spark Worker
Diagram (YARN): Livy Server, YARN Master, YARN Node (Livy REPL, Spark Context / PySpark), YARN Node (Spark Worker), YARN Node (Spark Worker); numbered steps 1 to 4
SPARK STREAMING
Real time!
Spark Solr
• Python
• Scala
• Charts
NOTEBOOKS / SHELL
WHAT
DEMO
TIME

• Analyze Bay Area bike share
• Visualize one year of data
• Know your users, predict behavior
MISSED

SOMETHING?
demo.gethue.com
• Full Analytics
• Easier indexing
• Geo
• Export/Share results
• Solr Joins, Solr SQL
• Spark, SQL... integration, Hue 4
WHAT’S NEXT
NEW FEATURES
TWITTER
@gethue
USER GROUP
hue-user@
WEBSITE
http://gethue.com
LEARN
http://learn.gethue.com
THANKS!



Big Data Day LA 2015 - Solr Search with Spark for Big Data Analytics in Action with Hue by Romain Rigaux of Cloudera