Friends of Solr - “Nutch and HDFS” Saumitra Srivastav saumitra.srivastav@glassbeam.com Bangalore Apache Solr Group September-2014 Meetup
Friends
Friend #1 
Nutch
What is Nutch? 
-Distributed framework for large scale web crawling 
-but does not have to be large scale at all 
-Based on Apache Hadoop 
-Direct integration with Solr
Overview 
Seed 
(URLs) 
Solr 
Nutch Crawl Fetch Parse
Overview
Components 
-CrawlDB 
-Info about URLs 
-LinkDB 
-Info about links to each URL 
-Segments 
-set of URLs that are fetched as a unit
Segments 
1.crawl_generate 
-set of URLs to be fetched 
2.crawl_fetch 
-status of fetching each URL 
3.content 
-raw content retrieved from each URL 
4.parse_text 
-parsed text of each URL 
5.parse_data 
-outlinks and metadata parsed from each URL 
6.crawl_parse 
-outlink URLs, used to update the crawldb
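Concretely, each fetch round produces a segment directory (named by timestamp) with one sub-directory per part listed above. A sketch that just recreates the layout for illustration (the timestamp is made up; a real segment is produced by the generate/fetch/parse steps):

```shell
# Mock up the layout of one Nutch segment (illustrative only)
SEG=crawl/segments/20140901120000
mkdir -p "$SEG/crawl_generate" "$SEG/crawl_fetch" "$SEG/content" \
         "$SEG/parse_text" "$SEG/parse_data" "$SEG/crawl_parse"
ls "$SEG"
```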
Scale 
-Scalable storage 
-HDFS 
-Scalable crawling 
-Map-Reduce 
-Scalable search 
-SolrCloud 
-Scalable backend 
-Gora
Features 
-Fetcher 
-Multi-threaded fetcher 
-Queues URLs per hostname / domain / IP 
-Limit the number of URLs per round of fetching 
-Default values are polite but can be made more aggressive
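These knobs live in conf/nutch-site.xml. A sketch with a few of the stock Nutch 1.x fetcher properties (check nutch-default.xml in your release for the authoritative names and defaults):

```xml
<!-- Fetcher politeness settings (sketch; tune with care) -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>      <!-- total fetcher threads -->
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>     <!-- seconds between requests to the same queue -->
</property>
<property>
  <name>fetcher.queue.mode</name>
  <value>byHost</value>  <!-- queue per host / domain / ip -->
</property>
```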
Features 
-Crawl Strategy 
-Breadth-first but can be depth-first 
-Configurable via custom ScoringFilters
Features 
-Scoring 
-OPIC (On-line Page Importance Calculation) by default 
-LinkRank 
-Protocols 
-Http, file, ftp, https 
-Respects robots.txt directives
Features 
-Scheduling 
-Fixed or adaptive 
-URL filters 
-Regex, FSA, TLD, prefix, suffix 
-URL normalisers 
-Default, regex
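For example, the stock regex-urlfilter.txt follows this shape: rules are applied top-down, `-` rejects, `+` accepts, first match wins (patterns below are illustrative, not the exact shipped file):

```text
# skip URLs containing characters that usually mean CGI/session noise
-[?*!@=]
# skip common binary/image extensions
-\.(gif|GIF|jpg|JPG|png|PNG|css|js|zip|pdf)$
# accept anything else
+.
```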
Features 
-Parsing with Apache Tika 
-Hundreds of formats supported 
-But some legacy parsers as well 
-Plugins 
-Feeds, Language Identification etc. 
-Pluggable indexing 
-Solr, ES etc.
Common crawled fields 
-url 
-content 
-title 
-anchor 
-site 
-boost 
-digest 
-segment 
-host 
-type 
-arbitrary metadata
Setup 
-Download binary and unzip 
-http://nutch.apache.org/downloads.html 
-Conf Directory
Solr Schema
Solr-Nutch Mapping
Indexing crawled data to Solr 
-Set http.agent.name in conf/nutch-site.xml 
-Copy the fields from Nutch's schema.xml into a Solr core/collection 
-Create a seed directory 
-bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>
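Putting the steps together, a minimal end-to-end run might look like this (the seed URL, Solr core name, and round count are placeholders; run the crawl script from the Nutch install directory after setting http.agent.name):

```shell
# 1. Create the seed directory with one URL per line
mkdir -p urls
echo "http://nutch.apache.org/" > urls/seed.txt

# 2. Crawl two rounds and index into Solr
#    (commented out here; requires a running Solr and a Nutch install)
# bin/crawl urls crawl http://localhost:8983/solr/collection1 2
```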
Friend #2 
HDFS
Why integrate with Hadoop? 
-Hadoop is NOT AT ALL needed to scale your Solr installation 
-Hadoop is NOT AT ALL needed for Solr distributed capabilities
Why integrate with Hadoop? 
-Integrate Solr with HDFS when your whole pipeline is Hadoop-based 
-Avoid moving data and indexes in and out 
-Avoid multiple sinks 
-Avoid redundant provisioning for Solr 
-Individual nodes' disks, etc.
Solr + Hadoop 
-Read and write indexes directly in HDFS 
-Build indexes for Solr with Hadoop's Map-Reduce
Lucene Directory Abstraction 
abstract class Directory { 
  listAll(); 
  createOutput(file, context); 
  openInput(file, context); 
  deleteFile(file); 
  makeLock(file); 
  clearLock(file); 
  ... 
}
HdfsDirectory
Index in HDFS 
-Writes and reads index and transaction log files directly in HDFS 
-Does not use Hadoop Map-Reduce to process Solr data 
-Solr normally relies on the OS filesystem cache for performance 
-HDFS is not well suited to random access
Block Cache 
-enables Solr to cache HDFS index files on read and write 
-LRU semantics 
-Hot blocks are cached
Transaction Log 
-HdfsUpdateLog 
-Extends UpdateLog 
-Triggered by setting the UpdateLog dataDir to something that starts with hdfs:/ 
-No additional configuration needed
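In solrconfig.xml this amounts to pointing the directory factory and the update log at HDFS. A minimal sketch (the hdfs://namenode:8020 paths are placeholders; see the Solr reference guide for the full parameter list):

```xml
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
</directoryFactory>

<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <!-- an hdfs:/ dataDir makes Solr pick HdfsUpdateLog automatically -->
    <str name="dir">hdfs://namenode:8020/solr/tlog</str>
  </updateLog>
</updateHandler>
```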
Running Solr on HDFS 
Cloud mode: 
java -Dsolr.directoryFactory=HdfsDirectoryFactory \ 
  -Dsolr.lock.type=hdfs \ 
  -Dsolr.hdfs.home=hdfs://localhost:5432/solr/ \ 
  -DzkHost=localhost:2181 \ 
  -jar start.jar
Map-Reduce index building 
-Scalable index creation via map-reduce 
-https://github.com/markrmiller/solr-map-reduce-example
Map-Reduce index building 
-Initial implementations sent documents from reducers to SolrCloud over HTTP 
-Not scalable 
-Reducers create indexes in HDFS
Map-Reduce index building 
-Reducers create indexes in HDFS 
-merge the indexes down to the correct number of ‘shards’ 
-zookeeper aware 
-Go-Live
Map-Reduce index building
MorphLines 
-A morphline is a configuration file that allows you to define ETL transformation pipelines 
-replaces Java programming with simple configuration steps 
-Extract content from input files, transform content, load content 
-Uses Tika to extract content from a large variety of input documents
MorphLines 
SOLR_LOCATOR : { 
  collection : collection1 
  zkHost : "127.0.0.1:9983" 
  batchSize : 100 
} 
morphlines : [ 
  { 
    id : morphline1 
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"] 
    commands : [ 
      { readAvroContainer { ... } } 
      { extractAvroPaths {...} } 
      { convertTimestamp {...} } 
      { sanitizeUnknownSolrFields {...} } 
      { loadSolr {...} } 
      .... 
    ] 
  } 
]
Map-Reduce index building 
bin/hadoop --config /tmp/hadoop/sample1 \ 
  jar ~/softwares/solr/solr-4.10.0/dist/solr-map-reduce-*.jar \ 
  -D 'mapred.child.java.opts=-Xmx500m' \ 
  -libjars "$HADOOP_LIBJAR" \ 
  --morphline-file /tmp/readAvroContainer.conf \ 
  --zk-host localhost:2181 \ 
  --output-dir hdfs://localhost/outdir \ 
  --collection twitter \ 
  --log4j log4j.properties \ 
  --go-live --verbose \ 
  "hdfs://localhost/indir"
Thanks 
-Attributions 
•Julien Nioche’s slides on “Large scale crawling with Apache Nutch” 
•Mark Miller’s slides on “First Class Integration of Solr with Hadoop” 
-Connect 
•saumitra.srivastav@glassbeam.com 
•saumitra.srivastav7@gmail.com 
•https://www.linkedin.com/in/saumitras 
•@_saumitra_ 
-Join: 
•http://www.meetup.com/Bangalore-Apache-Solr-Lucene-Group/
