NYC Lucene/Solr Meetup: Spark / Solr
lucenerevolution.org
October 13-16, Austin, TX
Solr & Spark
https://github.com/LucidWorks/spark-solr/
• Indexing from Spark
• Reading data from Solr
• Solr data as a Spark SQL DataFrame
• Interacting with Solr from the Spark shell
• Document Matching
• Reading Term vectors from Solr for MLlib
About Me …
• Solr user since 2010, committer since April 2014; work for Lucidworks, PMC member since ~May 2015
• Focus mainly on SolrCloud features … and bin/solr!
• Release manager for Lucene / Solr 5.1
• Co-author of Solr in Action
• Several years of experience with Hadoop, Pig, Hive, and ZooKeeper; about 9 months with Spark
• Other contributions include Solr on YARN, the Solr Scale Toolkit, and a Solr/Storm integration project on GitHub
About Solr
• Vibrant, thriving open source community
• Solr 5.2.1 just released!
 Pluggable authentication and authorization
 ~2x indexing performance w/ replication
http://lucidworks.com/blog/indexing-performance-solr-5-2-now-twice-fast/
 Field cardinality estimation using HyperLogLog
 Rule-based replica placement strategy
• Deploy to YARN cluster using Slider
Spark Overview
• Wealth of overview / getting started resources on the Web
 Start here -> https://spark.apache.org/
 Should READ! https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
• Faster, more modernized alternative to MapReduce
 Spark running on Hadoop sorted 100TB in 23 minutes (3x faster than Yahoo’s previous record while using 10x less computing power)
• Unified platform for Big Data
 Great for iterative algorithms (PageRank, K-Means, Logistic regression) & interactive data mining
• Write code in Java, Scala, or Python … REPL interface too
• Runs on YARN (or Mesos), plays well with HDFS
Spark Components
[Stack diagram] Spark Core (the engine: execution model, the shuffle, caching) sits at the center. Layered on top as the UI / API are Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (BSP). Cluster management comes from Hadoop YARN, Mesos, or the standalone manager, with HDFS underneath. Can combine all of these together in the same app!
Physical Architecture
[Architecture diagram]
• Spark Master (daemon): keeps track of live workers, hosts a web UI on port 8080, runs the task scheduler, and restarts failed tasks. Losing a master prevents new applications from being executed; HA can be achieved using ZooKeeper and multiple master nodes.
• Spark Worker Node (1...N of these): each runs a Spark Slave (daemon) plus a Spark Executor (JVM process) that runs in a separate process from the slave daemon and holds the cache. Each task works on some partition of a data set to apply a transformation or action.
• My Spark App: the SparkContext (driver) ships spark-solr-1.0.jar (w/ shaded deps) to the cluster and maintains the RDD graph, DAG scheduler, block tracker, and shuffle tracker.
• Tasks are assigned based on data-locality: when selecting which node to execute a task on, the master takes data locality into account.
A minimal driver-setup sketch follows.
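As a rough illustration (not from the deck), here is a minimal Java sketch of a driver wiring itself to a standalone master; the app name and master URL are placeholders.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class MySparkApp {
  public static void main(String[] args) {
    // Point the driver at a standalone master (spark://host:7077),
    // YARN, or local[*] for testing -- placeholder URL below.
    SparkConf conf = new SparkConf()
        .setAppName("MySparkApp")
        .setMaster("spark://master-host:7077");

    JavaSparkContext jsc = new JavaSparkContext(conf);

    // ... build RDDs / DStreams here; tasks run in executors on the worker nodes

    jsc.stop();
  }
}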
Spark vs. Hadoop’s Map/Reduce
         Operators                                  File System   Fault-Tolerance         Ecosystem
Hadoop   Map-Reduce                                 HDFS, S3      Replicated blocks       Java API, Pig, Hive
Spark    filter, map, flatMap, join, groupByKey,    HDFS, S3      Immutable RDD lineage   Python / Scala / Java API,
         reduce, sortByKey, count, distinct,                                              SparkSQL, GraphX, MLlib, SparkR
         top, cogroup, etc …
RDD Illustrated: Word count
Spark keeps track of the transformations made to generate each RDD. Executors are assigned based on data locality if possible, and narrow transformations occur in the same executor; reduceByKey shuffles data across machine boundaries. (A Java version of the same pipeline is sketched after this slide.)

val file = spark.textFile("hdfs://...")
  file: an RDD read from HDFS, one partition per block, e.g.
  Partition 1: quick brown fox jumped …
  Partition 2: quick brownie recipe …
  Partition 3: quick drying glue …

file.flatMap(line => line.split(" "))
  Split lines into words: quick, brown, fox, quick, quick, …

map(word => (word, 1))
  Map words into pairs with a count of 1: (quick,1), (brown,1), (fox,1), (quick,1), (quick,1)

reduceByKey(_ + _)
  Send all pairs with the same key to the same reducer and sum: (quick,1), (quick,1), (quick,1) -> (quick,3)

Putting it all together:

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
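Since the rest of the deck's examples use the Java API, here is a rough Java (Spark 1.x) equivalent of the Scala word count above; it is illustrative only, and the class name is a placeholder.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.*;
import scala.Tuple2;

public class WordCount {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext(new SparkConf().setAppName("WordCount"));

    JavaRDD<String> file = jsc.textFile("hdfs://...");

    // Split lines into words
    JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
      public Iterable<String> call(String line) { return Arrays.asList(line.split(" ")); }
    });

    // Map words into (word, 1) pairs
    JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
      public Tuple2<String, Integer> call(String word) { return new Tuple2<String, Integer>(word, 1); }
    });

    // Sum counts per word (this step shuffles across machine boundaries)
    JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer a, Integer b) { return a + b; }
    });

    counts.saveAsTextFile("hdfs://...");
  }
}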
Understanding Resilient Distributed Datasets (RDD)
• Read-only partitioned collection of records with fault-tolerance
• Created from external system OR using a transformation of another RDD
• RDDs track the lineage of coarse-grained transformations (map, join, filter, etc)
• If a partition is lost, it can be recomputed by replaying the logged transformations
• User can choose to persist an RDD (for reuse during interactive data mining)
• User can control the partitioning scheme (see the sketch below)
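As a hedged illustration of the last two points (not from the deck), persisting and controlling partitioning in the Java API could look like this:

import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.storage.StorageLevel;

// Assume pairs is a JavaPairRDD<String, Integer>, e.g. from the word count example.
// Persist it in memory so iterative / interactive jobs can reuse it without
// recomputing its lineage.
JavaPairRDD<String, Integer> cached = pairs.persist(StorageLevel.MEMORY_ONLY());

// Control the partitioning scheme explicitly, e.g. hash-partition into 8 partitions.
JavaPairRDD<String, Integer> repartitioned = cached.partitionBy(new HashPartitioner(8));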
Spark & Solr Integration
• https://github.com/LucidWorks/spark-solr/
• Streaming applications
 Real-time, streaming ETL jobs
 Solr as sink for Spark job
 Real-time document matching against stored queries
• Distributed computations (interactive data mining, machine learning)
 Expose results from Solr query as Spark RDD (resilient distributed dataset)
 Optionally process results from each shard in parallel
 Read millions of rows efficiently using deep paging
 SparkSQL DataFrame support (uses Solr schema API) and Term Vectors too!
Spark Streaming: Nuts & Bolts
• Transform a stream of records into small, deterministic batches
 Discretized stream: sequence of RDDs
 Once you have an RDD, you can use all the other Spark libs (MLlib, etc)
 Low-latency micro batches
 Time to process a batch must be less than the batch interval time
• Two types of operators:
 Transformations (group by, join, etc)
 Output (send to some external sink, e.g. Solr)
• Impressive performance!
 4 GB/s (40M records/s) on a 100-node cluster with less than 1 second latency
 Haven’t found any unbiased, reproducible performance comparisons between Storm and Spark Streaming
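A minimal, hedged sketch of the micro-batch skeleton in the Java API (the 1-second batch interval is illustrative):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

SparkConf conf = new SparkConf().setAppName("streaming-skeleton");

// Each batch interval (1 second here) becomes one small RDD in the DStream;
// processing a batch must finish faster than the interval or batches back up.
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

// ... define input DStreams, transformations, and output operators here ...

jssc.start();                // start receiving and processing micro-batches
jssc.awaitTermination();     // block until the streaming job is stopped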
Spark Streaming Example: Solr as Sink
[Data flow diagram: Twitter -> Spark Streaming map() -> Solr]

Submit the job (class TwitterToSolrStreamProcessor extends SparkApp.StreamProcessor, provided by Lucidworks):

./spark-submit --master MASTER --class com.lucidworks.spark.SparkApp spark-solr-1.0.jar \
  twitter-to-solr -zkHost localhost:2181 -collection social

Receive the stream (provided by Spark):

JavaReceiverInputDStream<Status> tweets =
  TwitterUtils.createStream(jssc, null, filters);

Apply various transformations / enrichments to each tweet (e.g. sentiment analysis, language detection) in custom Java / Scala code:

JavaDStream<SolrInputDocument> docs = tweets.map(
  new Function<Status,SolrInputDocument>() {
    // Convert a twitter4j Status object into a SolrInputDocument
    public SolrInputDocument call(Status status) {
      SolrInputDocument doc = new SolrInputDocument();
      …
      return doc;
    }});

Send the documents to Solr (provided by Lucidworks):

SolrSupport.indexDStreamOfDocs(zkHost, collection, 100, docs);
Spark Streaming Example: Solr as Sink
// start receiving a stream of tweets ...
JavaReceiverInputDStream<Status> tweets =
  TwitterUtils.createStream(jssc, null, filters);

// map incoming tweets into SolrInputDocument objects for indexing in Solr
JavaDStream<SolrInputDocument> docs = tweets.map(
  new Function<Status,SolrInputDocument>() {
    public SolrInputDocument call(Status status) {
      SolrInputDocument doc =
        SolrSupport.autoMapToSolrInputDoc("tweet-"+status.getId(), status, null);
      doc.setField("provider_s", "twitter");
      doc.setField("author_s", status.getUser().getScreenName());
      doc.setField("type_s", status.isRetweet() ? "echo" : "post");
      return doc;
    }
  }
);

// when ready, send the docs into a SolrCloud cluster (in batches of 100 docs)
SolrSupport.indexDStreamOfDocs(zkHost, collection, 100, docs);
com.lucidworks.spark.SolrSupport
public static void indexDStreamOfDocs(final String zkHost, final String collection, final int batchSize,
                                      JavaDStream<SolrInputDocument> docs)
{
  docs.foreachRDD(
    new Function<JavaRDD<SolrInputDocument>, Void>() {
      public Void call(JavaRDD<SolrInputDocument> solrInputDocumentJavaRDD) throws Exception {
        solrInputDocumentJavaRDD.foreachPartition(
          new VoidFunction<Iterator<SolrInputDocument>>() {
            public void call(Iterator<SolrInputDocument> solrInputDocumentIterator) throws Exception {
              // look up a SolrServer client for the SolrCloud cluster identified by zkHost
              final SolrServer solrServer = getSolrServer(zkHost);
              List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
              while (solrInputDocumentIterator.hasNext()) {
                batch.add(solrInputDocumentIterator.next());
                if (batch.size() >= batchSize)
                  sendBatchToSolr(solrServer, collection, batch);
              }
              if (!batch.isEmpty())
                sendBatchToSolr(solrServer, collection, batch);
            }
          }
        );
        return null;
      }
    }
  );
}
com.lucidworks.spark.ShardPartitioner
• Custom partitioning scheme for RDD using Solr’s DocRouter
• Stream docs directly to each shard leader using metadata from ZooKeeper, document shard assignment, and ConcurrentUpdateSolrClient
final ShardPartitioner shardPartitioner = new ShardPartitioner(zkHost, collection);
pairs.partitionBy(shardPartitioner).foreachPartition(
  new VoidFunction<Iterator<Tuple2<String, SolrInputDocument>>>() {
    public void call(Iterator<Tuple2<String, SolrInputDocument>> tupleIter) throws Exception {
      ConcurrentUpdateSolrClient cuss = null;
      while (tupleIter.hasNext()) {
        // ... Initialize ConcurrentUpdateSolrClient once per partition, then
        // pull each (docId, doc) tuple and add the doc to the client
        SolrInputDocument doc = tupleIter.next()._2;
        cuss.add(doc);
      }
    }
  });
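The snippet above assumes a pairs RDD keyed by document id. A hedged sketch of how such an RDD might be built from the SolrInputDocument RDDs shown earlier (purely illustrative, not the spark-solr API):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.solr.common.SolrInputDocument;
import scala.Tuple2;

// Assume docRdd is a JavaRDD<SolrInputDocument>, e.g. one RDD taken from the docs
// DStream via foreachRDD. Key each document by its unique id so ShardPartitioner
// can route it to the correct shard leader.
JavaPairRDD<String, SolrInputDocument> pairs = docRdd.mapToPair(
  new PairFunction<SolrInputDocument, String, SolrInputDocument>() {
    public Tuple2<String, SolrInputDocument> call(SolrInputDocument doc) {
      return new Tuple2<String, SolrInputDocument>((String) doc.getFieldValue("id"), doc);
    }
  });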
SolrRDD: Reading data from Solr into Spark
• Can execute any query and expose as an RDD
• SolrRDD produces JavaRDD<SolrDocument>
• Use deep-paging if needed (cursorMark)
• Stream docs from Solr (vs. building lists on the server-side)
• More parallelism using a range filter on a numeric field (_version_), e.g. 10 shards x 10 splits per shard == 100 concurrent Spark tasks (a usage sketch follows)
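A hedged usage sketch based on the SolrRDD constructor shown later in this deck; the query method name used here is an assumption, so consult the spark-solr project for the exact signature.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.common.SolrDocument;
import org.apache.spark.api.java.JavaRDD;
import com.lucidworks.spark.SolrRDD;

SolrRDD solrRDD = new SolrRDD(zkHost, collection);

// Match all docs; cursorMark-based deep paging streams results back shard by shard.
SolrQuery solrQuery = new SolrQuery("*:*");
solrQuery.setRows(1000);

// Hypothetical method name: exposes the query results as JavaRDD<SolrDocument>.
JavaRDD<SolrDocument> results = solrRDD.query(jsc, solrQuery);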
SolrRDD: Reading data from Solr into Spark
[Diagram: how SolrRDD maps a query onto a SolrCloud collection]
• The Spark driver app first reads the collection metadata from ZooKeeper.
• A query (q=*:*) against a two-shard Solr collection becomes a two-partition SolrRDD, one partition per shard.
• Each partition queries its shard directly with q=*:*&rows=1000&distrib=false&cursorMark=*, and the results are streamed back from Solr as a JavaRDD<SolrDocument>.
Solr as a Spark SQL Data Source
• DataFrame is a DSL for distributed data manipulation
• Data source provides a DataFrame
• Uniform way of working with data from multiple sources
• Hive, JDBC, Solr, Cassandra, etc.
• Seamless integration with other Spark technologies: SparkR, Python, MLlib
Map<String, String> options = new HashMap<String, String>();
options.put("zkhost", zkHost);
options.put("collection", "tweets");

DataFrame df = sqlContext.read().format("solr").options(options).load();
long count = df.filter(df.col("type_s").equalTo("echo")).count();
Spark SQL
Query Solr, then expose results as a SQL table
Map<String, String> options = new HashMap<String, String>();
options.put("zkhost", zkHost);
options.put("collection", "tweets");

DataFrame df = sqlContext.read().format("solr").options(options).load();
df.registerTempTable("tweets");

sqlContext.sql("SELECT count(*) FROM tweets WHERE type_s='echo'");
Query Solr from the Spark Shell
Interactive data mining with the full power of Solr queries
ADD_JARS=$PROJECT_HOME/target/spark-solr-1.0-SNAPSHOT.jar bin/spark-shell
val solrDF = sqlContext.load("solr", Map(
"zkHost" -> "localhost:9983",
"collection" -> "gettingstarted"))
solrDF.registerTempTable("tweets")
sqlContext.sql("SELECT COUNT(type_s) FROM tweets WHERE type_s='echo'").show()
Reading Term Vectors from Solr
• Pull TF/IDF (or just TF) for each term in a field for each document in query
results from Solr
• Can be used to construct RDD<Vector> which can then be passed to MLlib:
SolrRDD solrRDD = new SolrRDD(zkHost, collection);
JavaRDD<Vector> vectors =
solrRDD.queryTermVectors(jsc, solrQuery, field, numFeatures);
vectors.cache();
KMeansModel clusters =
KMeans.train(vectors.rdd(), numClusters, numIterations);
// Evaluate clustering by computing Within Set Sum of Squared Errors
double WSSSE = clusters.computeCost(vectors.rdd());
Document Matching using Stored Queries
• For each document, determine which of a large set of stored queries
matches.
• Useful for alerts, alternative flow paths through a stream, etc
• Index a micro-batch into an embedded (in-memory) Solr instance and then
determine which queries match
• Matching framework; you decide where to load the stored queries from and what to do when matches are found (a hypothetical sketch of that contract is shown below)
• Scale it out using Spark … if you need to scale to many queries, check out Luwak
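A purely hypothetical Java sketch of what such a plug-in contract might look like; the interface and method names below are illustrative and are not the actual spark-solr DocFilterContext API.

import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.common.SolrInputDocument;

// Illustrative contract: the framework asks where the stored queries come from
// and calls back whenever a document in the micro-batch matches one of them.
public interface StoredQueryMatchListener {

  // Load the stored queries (e.g. from Solr, a database, or ZooKeeper).
  List<SolrQuery> getStoredQueries();

  // Invoked for each (query, document) match found in the embedded Solr instance.
  void onMatch(SolrQuery matchedQuery, SolrInputDocument doc);
}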
Document Matching using Stored Queries
[Data flow diagram: Twitter -> map() -> DocFilterContext (stored queries) -> matched docs]

Receive the stream (provided by Spark):

JavaReceiverInputDStream<Status> tweets =
  TwitterUtils.createStream(jssc, null, filters);

Convert each twitter4j Status object into a SolrInputDocument (custom Java / Scala code):

JavaDStream<SolrInputDocument> docs = tweets.map(
  new Function<Status,SolrInputDocument>() {
    public SolrInputDocument call(Status status) {
      SolrInputDocument doc = new SolrInputDocument();
      …
      return doc;
    }});

Match against the stored queries (provided by Lucidworks):

JavaDStream<SolrInputDocument> enriched =
  SolrSupport.filterDocuments(docFilterContext, …);

• The DocFilterContext is the key abstraction that lets you plug in how the stored queries are loaded and what action to take when docs match.
• Each micro-batch is indexed into an EmbeddedSolrServer, initialized from configs stored in ZooKeeper, and the stored queries are then run against it to find the matches.
A word about Fusion …
Wrap-up and Q & A
Need more use cases :-)
Feel free to reach out to me with questions:
tim.potter@lucidworks.com / @thelabdude
Editor's Notes
• #4: Solr 5 overview: http://www.slideshare.net/lucidworks/webinar-inside-apache-solr-5
  Who is using Solr in production? Anyone currently evaluating Solr and other technologies for a search project? Anyone using Spark?
• #7: Started out as a research project at UC Berkeley, a platform for exploring new areas of research in distributed systems / Big Data.
  Shorter paper: http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf
  Spark running on Hadoop sorted 100TB in 23 minutes (3x faster than Yahoo's previous record): http://www.datanami.com/2014/10/10/spark-smashes-mapreduce-big-data-benchmark/
  Highly optimized shuffle code and a new network transport sub-system. Key abstraction: the Resilient Distributed Dataset.
  Other projects using / moving to Spark: Mahout (https://www.mapr.com/blog/mahout-spark-what%E2%80%99s-new-recommenders#.VI5CBWTF9kA), Hive, Pig.
• #8: Internals talk: https://www.youtube.com/watch?v=dmL0N3qfSc8
  Spark has all the same basic concepts around optimizing the shuffle stage (custom partitioning, combiners, etc). It recently overhauled the shuffle and network transport subsystem to use Netty and zero-copy techniques.
• #9: Can have multiple master nodes deployed for HA (the leader is elected using ZooKeeper). Akka and Netty under the covers.
  Execution model: create a DAG of RDDs, create a logical execution plan for the DAG, then schedule and execute individual tasks across the cluster.
  Spark organizes tasks into stages; boundaries between stages are where the data needs to be re-organized (such as a groupBy or reduce). Stages are super-operations that happen locally. A task is data + computation, and tasks get scheduled based on data locality.
• #10: Great presentation by the Spark founder: https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia
  MapReduce suffers from having to write intermediate data to disk to be used by other jobs or iterations; there is no good way to share data across jobs / iterations. Data locality is still important.
  Spark chooses to share data across iterations / interactive queries; the hard part is fault-tolerance, which it achieves using RDDs. Less boilerplate code.
  One way to think about Spark is as a more intelligent optimizer that is very good at keeping reused data in memory, versus MapReduce's reliance on persistent storage for fault tolerance and its one-pass computation model. Parallel programs look very much like sequential programs, which makes them easier to develop and reason about.
• #11: Different color boxes indicate partitions of the same RDD. Some text data in HDFS, partitioned by HDFS blocks; Spark assigns tasks to process the blocks based on data locality. Narrow transformations occur in the same executor (no shuffling across machines).
• #12: Spark RDD paper: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  Parallel computations use a restricted set of high-level operators applied to *ALL* elements of a dataset at once: log one operation that is applied to many elements (coarse-grained updates that apply the same operation to many data items). Lineage + partitions == low-overhead recovery.
  Fault-tolerance is achieved by exposing coarse-grained transformations (steps are logged and can be re-played if needed); if a partition is lost, RDDs contain enough information to re-compute the data.
  persist() says to keep the RDD in memory (probably because we are going to reuse it). Lazy execution: Spark generates a DAG of stages to compute the result of an action.
• #13: The two technologies combined provide near-real-time processing, ad hoc queries, batch processing / deep analytics, machine learning, and horizontal scaling. The integration aims to be a framework that reduces boilerplate and gets you started quickly, but you still have to write some code!
• #14: Basically, split a stream into very small discretized batches (1 second is typical) and then all the other Spark RDD goodies apply. (AMP Camp, Tathagata Das.) Probably on par with Storm Trident (micro-batching).
• #15: A series of very small deterministic batch jobs:
  http://www.slideshare.net/pacoid/tiny-batches-in-the-wine-shiny-new-bits-in-spark-streaming
  http://www.cs.duke.edu/~kmoses/cps516/dstream.html
  You don't need a separate stack for streaming apps: instead of Storm for streaming and Spark for interactive data mining, you just have Spark. Spark chops the live stream up into small batches of N seconds (each batch being an RDD). A DStream is a batch of records processed in micro-batches (controlled when the job is configured). The map() step converts Twitter4J Status objects into SolrInputDocuments, OR we could just send JSON directly to a Fusion pipeline and do the mapping in the pipeline.
• #17: This slide shows some of the ugliness that our Solr framework hides from end-users. SolrSupport removes the need to worry about Spark boilerplate for sending a stream of docs to Solr.
• #18: Need to fix SOLR-3382 to get better error reporting when streaming docs to Solr using CUSS.
• #22: The basic process is to query Solr, expose the results as a JavaSchemaRDD, register it as a temp table, and perform queries. Uses Solr's Schema API to get metadata about the fields in the query.
• #24: You can also get a Spark vector by doing: Vector vector = SolrTermVector.newInstance(String docId, HashingTF hashingTF, String rawText) // uses the Lucene StandardAnalyzer