Un-Structured
Or: How I Learned to Stop Worrying and Love the XML
Mike Nibeck, Asim Shaikh
1st NF, 2nd NF, 3rd NF
It’s The Way It’s Done
Maintainability vs. Performance
I’m Feeling Lucky
Solr
• Extension of Apache Lucene
• Full Text Search
• Open Interfaces (XML, JSON, HTTP)
• Faceted Search
• Database Ingest
• Document Indexing (PDF, Word, etc.)
• Spelling Suggestions
• Auto Suggest
• “Cloudy”
• Advanced Input Parsing
• Relevance Ranking
• v4.4
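The "Open Interfaces" and "Faceted Search" items above can be sketched as a plain HTTP query URL. This is a minimal sketch, not taken from the slides: the host, the core name `newspapers`, and the facet fields `state` and `year` are hypothetical, while `q`, `wt`, `rows`, `facet`, and `facet.field` are standard Solr query parameters. No request is sent; the code only builds the URL.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, text, facet_fields, rows=10):
    """Build a Solr /select URL for a full-text query with field facets."""
    params = [
        ("q", text),        # full-text query string
        ("wt", "json"),     # ask for a JSON response
        ("rows", rows),     # number of documents to return
        ("facet", "true"),  # switch faceting on
    ]
    # Solr expects one facet.field parameter per faceted field.
    params += [("facet.field", field) for field in facet_fields]
    return f"{base_url}/{core}/select?{urlencode(params)}"


# Hypothetical host, core, and field names, for illustration only.
url = build_select_url(
    "http://localhost:8983/solr", "newspapers", "pancakes", ["state", "year"]
)
print(url)
```

Any HTTP client can fetch a URL like this, which is what makes the interface "open": the application layer needs no Solr-specific driver.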
You got your chocolate in my peanut butter!
It’s a Hammer.
A really nice, efficient and free hammer.
A Mental Shift
Pancakes & Relevancy
Chronicling America
• 6.8 million documents
• 10 billion vectors
• 50,000 queries/day
• Index 250GB
• +100K documents per month
Congress.gov
• 4 million documents
• 3.3+ million queries/day (user and system)
• 36 GB indexes
• Adding many thousands/month
Library Web Search
• 18+ million documents
• 9,000 queries/day
• 28GB index size
• + many thousands/month
World Digital Library
• 120k documents
• 7 different languages
• 10-50k queries/day
• Index < 1GB
• +100 documents/month
Solr Architecture - congress.gov
[Diagram components: Users, Web Cache, Load Balancer, App Servers, SOLR Cores, Indexing, Database, Filesystem, Legacy Systems, Data Partners]
ETL Processing
Extract, Translate, Load
Master Data Sources
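The three ETL stages named on this slide could be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the presenters' pipeline: the source records, the field names (`bill_id`, `title`, `sponsors`), and the in-memory "master source" are all invented. The load step only serializes the documents as a JSON array, a shape Solr's JSON update handler can accept; nothing is posted.

```python
import json

# Hypothetical master-source records; field names are invented for illustration.
SOURCE_ROWS = [
    {"bill_id": 101, "title": " An Act to Improve Search ", "sponsors": ["Smith", "Jones"]},
    {"bill_id": 102, "title": "A Resolution on Pancakes", "sponsors": []},
]


def extract(rows):
    """Extract: yield raw records from the source (an in-memory list here)."""
    yield from rows


def translate(record):
    """Translate: map one source record onto a flat search document."""
    return {
        "id": str(record["bill_id"]),
        "title": record["title"].strip(),
        "sponsor": list(record["sponsors"]),  # multi-valued field
    }


def load(docs):
    """Load: serialize the documents as a JSON array for an update request."""
    return json.dumps(list(docs))


payload = load(translate(record) for record in extract(SOURCE_ROWS))
print(payload)
```

Keeping each stage a separate function means a new data source only needs its own `extract` and `translate`, while the load step stays unchanged.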
Analyzers, Tokenizers and Filters. Oh My!
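In Solr, an analyzer is a tokenizer followed by a chain of filters. The toy below mimics that shape in plain Python to show how the pieces relate. It is a sketch only: the regex tokenizer, the tiny stop-word list, and all function names are assumptions made for illustration, not Solr's implementation or configuration.

```python
import re

STOP_WORDS = {"a", "an", "and", "the", "of"}  # tiny illustrative list


def tokenize(text):
    """Tokenizer: split raw text into word tokens."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Filter: normalize case so 'Solr' and 'solr' match."""
    return [token.lower() for token in tokens]


def remove_stop_words(tokens):
    """Filter: drop very common words that carry little meaning."""
    return [token for token in tokens if token not in STOP_WORDS]


def analyze(text, filters=(lowercase, remove_stop_words)):
    """Analyzer: run the tokenizer, then apply each filter in order."""
    tokens = tokenize(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


print(analyze("The Analyzers, Tokenizers and Filters"))
# prints ['analyzers', 'tokenizers', 'filters']
```

Filter order matters: here the stop-word filter runs after lowercasing, so "The" is removed even though the stop list holds only lowercase entries.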
Cores? We Don’t Need No Stinkin' Cores
Data Import Handler
Next Steps
Open Source Tools
• PHP / Zend
• Python / Django
• MySQL
• RabbitMQ
• Varnish
• Jenkins
• Graphite, StatsD
Mike Nibeck - mnib@loc.gov
Asim Shaikh - ashaikh@loc.gov

Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented by Mike Nibeck and Asim Shaikh