ProjectHub
Crawling, Indexing, and Searching Software Project Data with Droids, Tika, Solr & friends
Otis Gospodnetić ◦◦ [email_address] ◦◦ @otisg
Sematext Int'l ◦◦ www.sematext.com ◦◦ @sematext
What I Will Cover
  • Who I am
  • What
  • Why
  • Where
  • Architecture
  • Info Gathering & Indexing
  • Search & Extra Search
  • Dog Food
  • Performance & Analytics
  • Ops & Stats
About Otis Gospodnetić
  • Lucene/Solr/Nutch/Mahout/... committer
  • Lucene in Action 1 & 2 co-author
  • Lucene Consulting since 2005
  • Sematext International since 2007
About Sematext
  • Search (Lucene, Solr, Elastic Search...)
  • Web Crawling (Nutch)
  • Machine Learning (Mahout)
  • Big Data (Hadoop, HBase, Voldemort...)
What
  • Search everything about a Software Project
  • Lucene & Hadoop
  • All sub-projects
  • All content
      • Mailing list archives
      • JIRA issues
      • Web site & Wiki pages
      • Source code (local syntax highlighting), trunk
      • Javadoc, trunk
ProjectHub
Why
  • We need it
  • Other Hadoop, Lucene, Solr... users need it
  • Our own playground
  • Live product demos
  • Yummy dog food
Where
  • search-lucene.com
  • search-hadoop.com
  • Other suggestions / needs?
  • In your Enterprise?
Architecture
Tool Matrix
  Data Source | Fetch                                                             | Parse
  JIRA        | URLConnection (feed)                                              | Digester (feed), DOM (item)
  ML          | FileInputStream (fs), URLConnection (feed), Droid (works, unused) | Digester (feed), MIME4J (mbox)
  Web site    | Droids                                                            | Tika via Droids
  Wiki        | Droids                                                            | Tika via Droids
  Source code | svn co                                                            | QDox
  Javadoc     | svn co                                                            | QDox
Information Gathering
  • Multiple independent JVM processes (cron)
  • Different polling frequencies
  • Different data sources / formats:
      • RSS (JIRA, Mailing Lists)
      • Mbox (Mailing Lists)
      • HTTP/HTML (Web site, Wiki)
      • Subversion (source code, Javadoc)
  • Nutch is a beast. Droids is light & simple.
  • ML thread detection is tricky
  • Finding deleted docs (Wiki, Web, Javadoc...)
Thread Detection
  • Email clients are kaput
  • SMTP headers are unreliable
  • Heuristics are needed
      • Try headers
      • Fall back to subjects (get subject skeleton, calculate hash)
      • Factor in time (4 weeks)
  • Use index for thread info retrieval
  • Q: Are there any libraries for this?
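The heuristics on this slide can be sketched roughly as follows. This is only an illustration under stated assumptions, not the ProjectHub code: the message dictionary layout, the two lookup tables, the list of reply prefixes, and the choice of SHA-1 are all hypothetical. Only the order of the steps (headers first, then subject skeleton hash, then a 4-week window) comes from the slide.

```python
import hashlib
import re
from datetime import datetime, timedelta

# Reply/forward prefixes to strip; this list is an assumption, not from the slides.
_PREFIXES = re.compile(r"^\s*((re|fwd?|aw)\s*(\[\d+\])?\s*:\s*)+", re.IGNORECASE)

# "Factor in time (4 weeks)" from the slide.
THREAD_WINDOW = timedelta(weeks=4)


def subject_skeleton(subject):
    """Strip reply prefixes, collapse whitespace, and lowercase the subject."""
    stripped = _PREFIXES.sub("", subject)
    return re.sub(r"\s+", " ", stripped).strip().lower()


def subject_key(subject):
    """Hash of the subject skeleton, usable as a lookup key."""
    return hashlib.sha1(subject_skeleton(subject).encode("utf-8")).hexdigest()


def find_thread(msg, by_message_id, by_subject_key):
    """Return a thread id for msg, or None if it appears to start a new thread.

    msg            -- dict with 'subject', 'date', 'in_reply_to', 'references'
    by_message_id  -- hypothetical lookup: Message-ID -> thread id
    by_subject_key -- hypothetical lookup: subject_key -> (thread id, last date)
    """
    # 1) Try headers: In-Reply-To first, then References from newest to oldest.
    parents = [msg.get("in_reply_to")] + list(reversed(msg.get("references") or []))
    for parent in parents:
        if parent in by_message_id:
            return by_message_id[parent]

    # 2) Fall back to the subject skeleton, but only within the time window.
    entry = by_subject_key.get(subject_key(msg["subject"]))
    if entry is not None:
        thread_id, last_date = entry
        if abs(msg["date"] - last_date) <= THREAD_WINDOW:
            return thread_id
    return None
```

In the setup the slide describes, the two lookups would presumably be queries against the search index ("Use index for thread info retrieval") rather than in-memory dictionaries.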
Indexing
  • Use StreamingUpdateSolrServer
  • AutoCommit use-case
  • Solr index abuse: track seen/unseen
  • &qsrc=indexer
  • &warmUp=true
  • Separate processes – easier reindexing (esp. with frequent project infra changes)
  • Treating quoted portions of ML messages
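The slides do not say how "track seen/unseen" works. One plausible reading, offered here purely as an assumption, is a mark-and-sweep pass: every document a crawl encounters is stamped with the current crawl id, and documents from the same source that were not stamped are treated as deleted. In the sketch below a plain dictionary stands in for the Solr index, and the field names `source` and `last_seen` are hypothetical.

```python
def sweep_unseen(index, source, seen_ids, crawl_id):
    """Mark documents seen in this crawl, then drop the unseen ones.

    index    -- dict standing in for the search index: doc id -> field dict
    source   -- data source being crawled (e.g. "wiki"); other sources are untouched
    seen_ids -- ids of documents the crawler found in this run
    crawl_id -- any value identifying this crawl run

    Returns the sorted list of ids that were removed.
    """
    # Mark: stamp every document the crawler saw in this run.
    for doc_id in seen_ids:
        if doc_id in index and index[doc_id]["source"] == source:
            index[doc_id]["last_seen"] = crawl_id

    # Sweep: anything from this source without the current stamp is gone upstream.
    stale = sorted(
        doc_id
        for doc_id, fields in index.items()
        if fields["source"] == source and fields.get("last_seen") != crawl_id
    )
    for doc_id in stale:
        del index[doc_id]
    return stale
```

Restricting the sweep to one source matters because the data sources are polled by separate processes at different frequencies, so a wiki crawl should never remove JIRA or mailing list documents.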
Search
  • Facets (multi-select)
      • Project
      • Data source/type
      • Author (based on names only)
  • Boosting more recent documents
      • vs. pure relevance vs. newest/oldest first
      • give equivalent of 0.5 year to docs w/ empty updateDate field (e.g. javadocs)
      • recip(map(ms(NOW,updateDate),6.32e11,3.16e12,1.58e10),3.16e-11,4,1)^4
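To make the boost function easier to read, here is a small sketch that re-implements its arithmetic outside Solr. In Solr, recip(x,m,a,b) computes a/(m*x+b) and map(x,min,max,target) returns target when x lies in [min,max] and x otherwise. The constants read naturally in years: 3.16e10 ms is about one year, so 6.32e11 and 3.16e12 are about 20 and 100 years, and 1.58e10 is about half a year. The Python function names are hypothetical, and the trailing ^4 boost weight is left out.

```python
MS_PER_YEAR = 3.16e10  # approximate milliseconds per year, as used on the slide


def solr_map(x, lo, hi, target):
    """Mimic Solr's map(): values inside [lo, hi] are replaced by target."""
    return target if lo <= x <= hi else x


def solr_recip(x, m, a, b):
    """Mimic Solr's recip(): a / (m*x + b)."""
    return a / (m * x + b)


def recency_boost(age_ms):
    """Boost for a document whose updateDate lies age_ms milliseconds in the past."""
    # Ages between ~20 and ~100 years are treated as ~0.5 year.
    effective_age = solr_map(age_ms, 6.32e11, 3.16e12, 1.58e10)
    # 4 / (age_in_years + 1): 4.0 for a brand-new doc, ~2.0 after one year.
    return solr_recip(effective_age, 3.16e-11, 4, 1)


if __name__ == "__main__":
    for years in (0, 0.5, 1, 5, 40):
        print(years, round(recency_boost(years * MS_PER_YEAR), 3))
```

The map() step is what implements the "equivalent of 0.5 year" rule. Assuming a missing updateDate is evaluated as the epoch, ms(NOW,updateDate) comes out at roughly four decades, which falls inside the 20-to-100-year band and is therefore replaced by half a year.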
Search cont'd
  • Query Spellchecker
  • Sematext components:
      • ReSearcher & Relaxer
      • AutoComplete
      • Key Phrase Extractor (2 approaches)
  • Threaded vs. flat view
  • In-document search term highlighting
  • Short URLs
Search cont'd
Dog food #1: Auto-Complete
  • Source: nightly refreshed subject and titles
  • Approach: go directly to selection
  • sematext.com/products/autocomplete/
Dog food #2: ReSearcher & Relaxer
  • Avoid “sorry, no/poor matches”
  • Multiple algos trigger re-searching
  • Different forms of relaxing
  • sematext.com/products/dym-researcher/
Dog food #3: Key Phrases
  • Help narrow search results, like facets
  • 2 types: stored in index vs. calculated from top N hits
  • sematext.com/products/key-phrase-extractor/
Basic Search Analytics
  • Top queries, top terms...
  • Daily, weekly, monthly
  • MRR
  • http://en.wikipedia.org/wiki/Mean_reciprocal_rank
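For readers unfamiliar with MRR (mean reciprocal rank): for each query, take 1/rank of the first relevant result, count 0 when there is none, and average over all queries. The sketch below assumes the rank of the first relevant (for example, first clicked) result has already been extracted per query. The slides do not say how relevance was determined, so that input format is an assumption.

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """Compute MRR from 1-based ranks of the first relevant result per query.

    Use None for a query that had no relevant result; it contributes 0.
    Returns 0.0 for an empty input.
    """
    if not first_relevant_ranks:
        return 0.0
    total = sum(1.0 / rank for rank in first_relevant_ranks if rank is not None)
    return total / len(first_relevant_ranks)


if __name__ == "__main__":
    # Three queries: relevant hit at rank 1, rank 2, and rank 4.
    print(mean_reciprocal_rank([1, 2, 4]))
```

An MRR close to 1 means users typically find what they want at the top of the results, which is why it serves as a target for relevance tuning.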
Very Basic Search Analytics
Real Search Analytics
Performance & Monitoring: RPM
Availability: Site24x7.com
Operations
  • Small EC2 instance: 1.7 GB RAM
  • EBS for data - got burnt once
  • Local disk for index
  • Solr 1.4.1 multi-core
  • Performance monitoring via RPM
  • Availability & performance via site24x7.com
Statistics
  • search-hadoop.com:
      • 110K+ documents
      • ~700 MB optimized
  • search-lucene.com:
      • 170K+ documents
      • ~900 MB optimized
Future
  • Field collapsing (threads)
  • Bot detection (load) DONE
  • Solr duplicate detection (release notes)
  • Relevance tuning (MRR)
  • Open sourcing?
World-wide! Search & Data Analytics Machine Learning & NLP Big Data [email_address] WE ARE HIRING
Questions?
Contact
  • sematext.com
  • blog.sematext.com
  • @sematext
  • @otisg
  • [email_address]
