Lessons Learned with Spark at the US Patent & Trademark Office
Christopher Bradford
Big Data Architect at OpenSource Connections
Christopher Bradford
Twitter: @bradfordcp
GitHub: bradfordcp
OpenSource Connections
Exploring Search Technologies - EST
EST – Technology Stack
EST – Data Loading
CSS Ingestion (CSS2C)
Solr Ingestion (C2S)
EST – C2S Process
Note: some connections are omitted for clarity
EST – C2S Process (Scaled Out)
Note: some connections are omitted for clarity
EST – C2S Review
Did it work?
Why change it?
How could we make it better?
EST – Old C2S Process
Note: some connections are omitted for clarity
EST – Spark C2S Process
Note: some connections are omitted for clarity
How did this work out?
Poorly
Poor Performance
joinedRDD = …
joinedRDD.foreach()
    document = … // build document
    sc = new SolrConnection()
    sc.push(document)
    sc.disconnect()
// Job is done
Poor Performance
sc = new SolrConnection()
sc.push(document)
sc.disconnect()
Optimum Performance
joinedRDD = …
sc = new SolrConnection()
joinedRDD.foreach()
    document = … // build document
    sc.push(document)
sc.disconnect()
// Job is done
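The slides do not say why the single shared connection above is not the final answer. One common obstacle in Spark, offered here as an assumption rather than the speaker's stated reason, is that an object created on the driver and used inside `foreach` has to be serialized and shipped to the executors, and live connections generally cannot be. The sketch below illustrates the idea in plain Python with `pickle`; `FakeSolrConnection` and `can_ship_to_workers` are hypothetical names, and a lock stands in for the socket a real connection would hold.

```python
import pickle
import threading


class FakeSolrConnection:
    """Hypothetical stand-in for the slides' SolrConnection.

    It holds a live OS-level resource (a lock here, a socket in practice).
    """

    def __init__(self):
        self._resource = threading.Lock()

    def push(self, document):
        with self._resource:
            pass  # a real client would send the document here


def can_ship_to_workers(obj):
    """Return True if obj survives serialization, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


shippable_row = can_ship_to_workers({"id": 1})                    # plain data
shippable_connection = can_ship_to_workers(FakeSolrConnection())  # live resource
print(shippable_row, shippable_connection)
```

Plain rows serialize without trouble, while the object wrapping a live resource does not, which is one reason to create the connection on the worker side instead.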
joinedRDD = …
joinedRDD.foreachPartition()
    sc = new SolrConnection()
    partition.foreach()
        document = … // build document
        sc.push(document)
    sc.disconnect()
// Job is done
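To make the difference between the `foreach` and `foreachPartition` shapes above concrete, here is a minimal sketch that counts how many connections each shape opens. It is a pure-Python simulation, not Spark code: partitions are plain lists, and `FakeSolrConnection`, `build_document`, and the row counts are all assumptions made for illustration.

```python
class FakeSolrConnection:
    """Hypothetical stand-in for the slides' SolrConnection; counts opens."""

    opened = 0

    def __init__(self):
        FakeSolrConnection.opened += 1
        self.pushed = []

    def push(self, document):
        self.pushed.append(document)

    def disconnect(self):
        pass


def build_document(row):
    # Placeholder for the slides' "build document" step.
    return {"id": row}


def index_per_record(partitions):
    """Mirrors the foreach() shape: one connection per row."""
    for partition in partitions:
        for row in partition:
            sc = FakeSolrConnection()
            sc.push(build_document(row))
            sc.disconnect()


def index_per_partition(partitions):
    """Mirrors the foreachPartition() shape: one connection per partition."""
    for partition in partitions:
        sc = FakeSolrConnection()
        try:
            for row in partition:
                sc.push(build_document(row))
        finally:
            sc.disconnect()  # always release the connection


def count_connections(index_fn, partitions):
    FakeSolrConnection.opened = 0
    index_fn(partitions)
    return FakeSolrConnection.opened


# Three made-up partitions of 100 rows each.
partitions = [list(range(0, 100)), list(range(100, 200)), list(range(200, 300))]
per_record = count_connections(index_per_record, partitions)
per_partition = count_connections(index_per_partition, partitions)
print(per_record, per_partition)  # 300 3
```

The per-record shape pays the connection setup and teardown cost once per row, while the per-partition shape pays it once per partition regardless of how many rows the partition holds.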
Almost
The Solution!
joinedRDD = …
joinedRDD.mapPartitions()
    sc = new SolrConnection()
    partition.foreach()
        document = … // build document
        sc.push(document)
    sc.close()
    return partition.rows
.collect()
joinedRDD = …
joinedRDD.mapPartitions()
    sc = new SolrConnection()
    partition.foreach()
        document = … // build document
        sc.push(document)
    sc.close()
    return partition.rows.count
.collect()
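The two `mapPartitions` variants above differ only in what each partition returns before `.collect()` is called. In Spark, `mapPartitions` is a lazy transformation and `collect` is the action that triggers the work and brings the results back to the driver, so returning a count sends far less data back than returning every row. The sketch below imitates that behaviour with plain Python generators; `map_partitions`, `collect`, and `index_partition` are hypothetical helpers, not Spark APIs.

```python
pushed = []  # stands in for documents sent to Solr


def index_partition(partition):
    """Hypothetical per-partition function: push every row, yield one count."""
    count = 0
    for row in partition:
        pushed.append(row)  # stands in for sc.push(document)
        count += 1
    yield count


def map_partitions(partitions, fn):
    # Lazy, like a transformation: builds a generator and does no work yet.
    return (result for partition in partitions for result in fn(partition))


def collect(lazy):
    # Like an action: forces evaluation and returns the results to the caller.
    return list(lazy)


partitions = [["a", "b", "c"], ["d", "e"]]
lazy = map_partitions(partitions, index_partition)
pushed_before = len(pushed)  # 0: nothing has run yet
counts = collect(lazy)       # [3, 2]: one small number per partition
pushed_after = len(pushed)   # 5: every row was pushed during collect
print(pushed_before, counts, pushed_after)
```

Only one integer per partition travels back to the caller, which also gives a cheap tally of how many documents were indexed.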
Results?
Solr Indexing
Better Solr Indexing
Note: some connections are omitted for clarity
EST – Spark C2S Process v2
Note: some connections are omitted for clarity
Success?
YUP
5x faster than the original C2S process (with optimizations)
What’s Next?
•  Optimization of the C2S Spark job
•  More Spark jobs
•  Newer version of Spark & DSE
•  Scala Spark jobs instead of Java
