Indexing Big Data in the Cloud
Me
             Scott Stults
Co-Founder of OpenSource Connections

            Solr / Lucene

        Bash / Python / Java


Eric




Big Data




Big Data Wrangler




How?
 Address a Real Project
       Be Agile
Make Small Mistaeks Fast
     Succeed BIG




USPTO Goals
Prototype Search UX

    Prove Solr:
      Scales
    Integrates
      Excels


Scale?




Our Approach

             KISS
            YAGNI

(This space intentionally left blank)




Minimal Flair




Record Everything!




Some Numbers

Doc Count                   1.1 Million
Zip Files                   313
Docs per Zip File           4,000
Zip File Size               75M
File Size                   300M
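A quick sanity check of these figures, as a sketch only. It assumes "75M" means roughly 75 megabytes per zip file and that 4,000 is a rounded per-file document count; the variable and function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Figures copied from the table above.
zip_files=313
docs_per_zip=4000
zip_size_mb=75   # assumption: "75M" is megabytes per zip file

# Upper bound on the document count if every zip held the full 4,000 docs.
max_docs() {
  echo $(( zip_files * docs_per_zip ))
}

# Total compressed volume to move and unpack, in megabytes.
total_zip_mb() {
  echo $(( zip_files * zip_size_mb ))
}

echo "max docs: $(max_docs)"
echo "compressed total: $(total_zip_mb) MB"
```

The product comes to 1,252,000 documents, a little above the 1.1 million listed, which is consistent with 4,000 being a rounded figure. The compressed total is about 23 GB under the megabyte assumption.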




Testing
Start some servers
 Process a batch
 Check the clock
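The "process a batch, check the clock" step could look something like the sketch below. `time_batch` and `process_batch` are hypothetical names, not taken from the talk; the `sleep` merely stands in for real indexing work.

```shell
#!/usr/bin/env bash
# time_batch CMD [ARGS...]: run a command and report wall-clock seconds
# on stderr, passing the command's exit status through unchanged.
time_batch() {
  local start end status=0
  start=$(date +%s)
  "$@" || status=$?
  end=$(date +%s)
  echo "elapsed_seconds=$(( end - start )) exit_status=$status" >&2
  return "$status"
}

# Hypothetical stand-in for one batch of indexing work.
process_batch() {
  sleep 1
}

time_batch process_batch
```

Writing the timing to stderr keeps it out of any output the batch itself produces, so the numbers are easy to redirect into a separate log.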




start_nodes
# Launch $MAX_NODES m1.large instances and save the launch report.
# Mapping format: device=[snapshot-id]:[size-GiB]:[delete-on-termination]
start_nodes() {
   ec2-run-instances ami-1b814f72 \
     --block-device-mapping '/dev/sdb=snap-48adde35::true' \
     --block-device-mapping '/dev/sdi1=:10:false' \
     --block-device-mapping '/dev/sdi2=:10:false' \
     --block-device-mapping '/dev/sdi3=:20:false' \
     --instance-type m1.large \
     --key uspto-proto \
     --instance-count "$MAX_NODES" \
     --group default > ~/run-output
}




Gut Check

 How fast can we do this?

What can we do in parallel?




Scaling

Raise our instance limit

xargs -P / GNU parallel
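A minimal sketch of fanning work out with `xargs -P`. The `run_parallel` helper and the zip file names are hypothetical, and the worker only echoes the name where a real one would unpack and index the archive.

```shell
#!/usr/bin/env bash
# run_parallel JOBS FILE...: handle each file in its own process,
# running at most JOBS processes at a time.
# Assumes file names contain no whitespace or quote characters.
run_parallel() {
  local jobs=$1
  shift
  # -n 1 hands one name to each worker; -P caps the concurrency.
  printf '%s\n' "$@" | xargs -n 1 -P "$jobs" sh -c 'echo "indexed $1"' _
}

run_parallel 4 a.zip b.zip c.zip d.zip e.zip
```

Output order is not deterministic, since workers finish in whatever order the scheduler allows.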




Shortcomings

     SSH?
 Error recovery
    One Solr




Alternatives
   CloudFormation
     Puppet / Chef
Multiple Cores / Shards
        Hadoop




Success




Victory Lap




Instances / Time




Thank You

https://github.com/sstults/patent-indexing

              @scottstults
                #o19s



