Web Scraping Using Nutch and Solr
● A simple example of using open source code
● Web Scrape a single web site - ours
● Environment and code
– Using CentOS 6.2 ( Linux )
– Apache Nutch 1.6
– Solr 4.2.1
– Java 1.6
Nutch and Solr Architecture
● Nutch crawls URLs and feeds their content to Solr
● Solr indexes content
Where to get source code
● Nutch
– http://nutch.apache.org
● Solr
– http://lucene.apache.org/solr
● Java
– http://java.com
Installing Source - Nutch
● Nutch is delivered as
– apache-nutch-1.6-bin.tar ( 64M )
– apache-nutch-1.6-src.tar ( 20M )
● Copy each tar file to your desired location
● Extract each tar file with
– tar xvf <tar file>
● The second ( source ) tar file is optional ( an install sketch follows below )
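A minimal install sketch, assuming the tar file was downloaded to ~/Downloads and that /opt is the desired install location ( adjust both paths to suit ):

  cd /opt
  cp ~/Downloads/apache-nutch-1.6-bin.tar .
  tar xvf apache-nutch-1.6-bin.tar      # creates the apache-nutch-1.6/ directory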
Installing Source - Solr
● Solr is delivered as
– solr-4.2.1.zip ( 116M )
● Copy file to your desired location
● Extract the zip file with
– unzip <zip file>
Configuring Nutch Part 1
● Assuming we will crawl a single web site
● Ensure that JAVA_HOME is set
● cd apache-nutch-1.6
● Edit the agent name in conf/nutch-site.xml ( the full file is sketched below )
<property>
<name>http.agent.name</name>
<value>Nutch Spider</value>
</property>
● mkdir -p urls ; cd urls ; touch seed.txt
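For reference, a sketch of the shell steps above and the complete conf/nutch-site.xml, assuming a JDK installed under /usr/java/jdk1.6.0 ( substitute your own JAVA_HOME path ):

  export JAVA_HOME=/usr/java/jdk1.6.0   # assumed path - point this at your JDK
  cd apache-nutch-1.6
  mkdir -p urls ; touch urls/seed.txt

  <?xml version="1.0"?>
  <configuration>
    <!-- identifies our crawler to the web sites it visits -->
    <property>
      <name>http.agent.name</name>
      <value>Nutch Spider</value>
    </property>
  </configuration>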
Configuring Nutch Part 2
● Add the following URL ( ours ) to seed.txt
– http://www.semtech-solutions.co.nz
● Change the URL filtering in conf/regex-urlfilter.txt, replacing the line
– # accept anything else
– +.
– with
– +^http://([a-z0-9]*\.)*semtech-solutions.co.nz/
● This filters the URLs found so that only pages from the local site are crawled
( the edited section is sketched below )
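A sketch of the edited section of conf/regex-urlfilter.txt, with the catch-all +. rule replaced and the dot escaped in the domain name:

  # accept anything else
  +^http://([a-z0-9]*\.)*semtech-solutions.co.nz/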
Configuring Solr Part 1
● cd solr-4.2.1/example/solr/collection1/conf
● Add some extra fields to schema.xml after the _version_ field ( a sketch of typical fields follows below )
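The slide does not show the field definitions themselves; the sketch below shows the kind of fields commonly added for Nutch, assuming the default Nutch 1.6 field names and the text_general field type from the Solr example schema ( compare with the schema.xml shipped in apache-nutch-1.6/conf for the authoritative list ):

  <field name="host"    type="string"       stored="false" indexed="true"/>
  <field name="digest"  type="string"       stored="true"  indexed="false"/>
  <field name="segment" type="string"       stored="true"  indexed="false"/>
  <field name="boost"   type="float"        stored="true"  indexed="false"/>
  <field name="tstamp"  type="date"         stored="true"  indexed="false"/>
  <field name="url"     type="string"       stored="true"  indexed="true"/>
  <field name="content" type="text_general" stored="true"  indexed="true"/>
  <field name="anchor"  type="string"       stored="true"  indexed="true" multiValued="true"/>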
Start Solr Server – Part 1
● Within solr-4.2.1/example
● Run the following command
● java -jar start.jar
● Now try to access the Solr admin web page
– http://localhost:8983/solr/admin
● You should now see the admin web site
– ( see next page; a command line check is also sketched below )
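A quick command line check that the server is up, assuming the default collection1 core from the Solr example:

  curl "http://localhost:8983/solr/collection1/admin/ping"
  # the response should contain <str name="status">OK</str>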
Start Solr Server – Part 2
● Solr Admin web page
Run Nutch / Solr
● We are ready to crawl our first web site
● Go to apache-nutch-1.6 directory
● Run the following commands
– touch nutch_start.bash
– chmod 755 nutch_start.bash
– vi nutch_start.bash
● Add the following text to the file
#!/bin/bash
bin/nutch crawl urls -solr http://localhost:8983/solr/ \
-dir crawl -depth 3 -topN 3
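For reference, the same command with each option annotated ( these are the standard Nutch 1.x crawl options ):

  #!/bin/bash
  # urls    - directory holding seed.txt
  # -solr   - Solr instance that will index the crawled content
  # -dir    - local directory where Nutch stores its crawl data
  # -depth  - number of link levels to follow from the seed urls
  # -topN   - maximum number of pages to fetch at each level
  bin/nutch crawl urls -solr http://localhost:8983/solr/ \
  -dir crawl -depth 3 -topN 3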
Run Nutch / Solr
● Now run the nutch bash file
– ./nutch_start.bash
● Select the Logging option in the Solr admin console
● Monitor the Logging console for errors
● The crawl should finish with no errors and the line
– Crawl finished: crawl
– in the crawl window ( a quick check of the result is sketched below )
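A quick sanity check of the crawl database, run from the apache-nutch-1.6 directory:

  bin/nutch readdb crawl/crawldb -stats
  # reports totals such as the number of fetched and unfetched urls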
Check Crawled Data
● Now we check the data that we have crawled
● In Admin Console window
– Set Core Selector to collection1
– Select the Query option
– Click the Execute Query button
● You should now see some of the data that you have crawled ( the same query from the command line is sketched below )
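The same kind of query can be run from the command line, assuming the default collection1 core:

  curl "http://localhost:8983/solr/collection1/select?q=*:*&rows=5&wt=json&indent=true"
  # returns up to five indexed documents as JSON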
Crawled Data
● Crawled data shown in a Solr query
Crawled Data
● That's your first simple crawl completed
● Further reading at
– http://nutch.apache.org
– http://lucene.apache.org/solr
● Now you can
– Add more URLs to your seed.txt
– Increase the depth and breadth of your link search via the options
● -depth ( link levels followed )
● -topN ( pages fetched per level )
– Modify your URL filtering
Contact Us
● Feel free to contact us at
– www.semtech-solutions.co.nz
– info@semtech-solutions.co.nz
● We offer IT project consultancy
● We are happy to hear about your problems
● You only pay for the hours that you need to solve your problems