Indexing Big Data in the Cloud
Me
Scott Stults
Co-Founder of OpenSource Connections
Solr / Lucene
Bash / Python / Java
Eric
Big Data
Big Data Wrangler
How?
Address a Real Project
Be Agile
Make Small Mistaeks Fast
Succeed BIG
USPTO Goals
Prototype Search UX
Prove Solr:
Scales
Integrates
Excels
Scale?
Our Approach
KISS
YAGNI
(This space intentionally left blank)
Minimal Flair
Record Everything!
Some Numbers
Doc Count: 1.1 Million
Zip Files: 313
Docs per Zip File: 4,000
Zip File Size: 75M
File Size: 300M
Testing
Start some servers
Process a batch
Check the clock
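The "process a batch, check the clock" loop can be sketched as a small timing wrapper. This is only an illustration: `time_batch` and its log format are hypothetical names, not taken from the talk. It appends the command's output plus one timing line to a log file, in the spirit of "Record Everything!".

```shell
# time_batch LOG CMD...: run CMD, append its output to LOG, then append
# one summary line with the elapsed wall-clock seconds and exit status.
# (Hypothetical helper; the name and log format are assumptions.)
time_batch() {
  local log=$1; shift
  local start end status
  start=$(date +%s)
  "$@" >>"$log" 2>&1
  status=$?
  end=$(date +%s)
  echo "$(date -u +%FT%TZ) elapsed=$(( end - start ))s status=$status cmd=$*" >>"$log"
  return "$status"
}
```

A call might look like `time_batch ~/batch.log ./index_batch.sh`, where `index_batch.sh` stands in for whatever script processes one batch.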
start_nodes
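start_nodes() {
    # Launch $MAX_NODES m1.large instances; the launch report is saved to ~/run-output.
    # Mapping format: device=[snapshot]:[size in GiB]:[delete on termination]
    ec2-run-instances ami-1b814f72 \
        --block-device-mapping '/dev/sdb=snap-48adde35::true' \
        --block-device-mapping '/dev/sdi1=:10:false' \
        --block-device-mapping '/dev/sdi2=:10:false' \
        --block-device-mapping '/dev/sdi3=:20:false' \
        --instance-type m1.large \
        --key uspto-proto \
        --instance-count "$MAX_NODES" \
        --group default > ~/run-output
}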
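Since start_nodes saves the launch report to ~/run-output, a follow-up step would likely need the new instance IDs from that file. The sketch below assumes the report is tab-separated with the record type (RESERVATION or INSTANCE) in the first column and the instance ID in the second. `instance_ids` is a hypothetical helper, not part of the original scripts.

```shell
# instance_ids FILE: print the instance ID from each INSTANCE record.
# Assumes tab-separated records with the record type in column 1
# and the instance ID in column 2.
instance_ids() {
  awk -F'\t' '$1 == "INSTANCE" { print $2 }' "$1"
}
```

For example, `instance_ids ~/run-output` would list one ID per line, ready to feed into later steps.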
Gut Check
How fast can we do this?
What can we do in parallel?
Scaling
Raise our instance limit
xargs -P
GNU parallel
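A minimal sketch of the xargs -P pattern, assuming the zip files sit in one local directory. `process_all` is a hypothetical name, and the per-file command here only echoes the file name as a stand-in for the real work (unpacking and indexing).

```shell
# process_all DIR [WORKERS]: run a per-file command over every .zip in DIR,
# keeping up to WORKERS processes running at once (default 4).
process_all() {
  local dir=$1 workers=${2:-4}
  find "$dir" -name '*.zip' -print0 |
    xargs -0 -n 1 -P "$workers" sh -c 'echo "processed $(basename "$1")"' _
}
```

`process_all /data/zips 8` would keep eight workers busy until the directory is exhausted. GNU parallel covers the same pattern, with its `-j` option setting the number of simultaneous jobs.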
Shortcomings
SSH?
Error recovery
One Solr
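One simple way to narrow the error-recovery gap is to retry a failed step a bounded number of times. This is a sketch of that general technique, not something shown in the talk. `retry` and the `RETRY_DELAY` variable are hypothetical names.

```shell
# retry MAX CMD...: run CMD until it succeeds or MAX attempts are used up.
# Waits RETRY_DELAY seconds (default 1) between attempts.
retry() {
  local max=$1; shift
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$(( attempt + 1 ))
    sleep "${RETRY_DELAY:-1}"
  done
}
```

A remote step could then be wrapped as `retry 3 ssh "$host" ./index_batch.sh`, where the host variable and script name are placeholders.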
Alternatives
CloudFormation
Puppet / Chef
Multiple Cores / Shards
Hadoop
Success
Victory Lap
Instances / Time
Thank You
https://github.com/sstults/patent-indexing
@scottstults
#o19s