Web Scraping Using Nutch and Solr
● A simple example of using open source code
● Web Scrape a single web site - ours
● Environment and code
– Using CentOS 6.2 ( Linux )
– Apache Nutch 1.6
– Solr 4.2.1
– Java 1.6
Nutch and Solr Architecture
● Nutch crawls URLs and feeds their content to Solr
● Solr indexes content
Where to get source code
● Nutch
– http://nutch.apache.org
● Solr
– http://lucene.apache.org/solr
● Java
– http://java.com
Installing Source - Nutch
● Nutch is delivered as
– apache-nutch-1.6-bin.tar ( 64M )
– apache-nutch-1.6-src.tar ( 20M )
● Copy each tar file to your desired location
● Extract each tar file with
– tar xvf <tar file>
● The second ( source ) tar file is optional ( an install sketch follows below )
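A minimal install sketch, assuming the tar file was downloaded to ~/Downloads and that /opt is the desired install location ( adjust both paths to suit ):

  cd /opt
  cp ~/Downloads/apache-nutch-1.6-bin.tar .
  tar xvf apache-nutch-1.6-bin.tar      # creates the apache-nutch-1.6/ directory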
Installing Source - Solr
● Solr is delivered as
– solr-4.2.1.zip ( 116M )
● Copy file to your desired location
● Extract the zip file with
– unzip <zip file>
Configuring Nutch Part 1
● Assuming we will crawl a single web site
● Ensure that JAVA_HOME is set
● cd apache-nutch-1.6
● Edit the agent name in conf/nutch-site.xml ( the full file is sketched below )
<property>
<name>http.agent.name</name>
<value>Nutch Spider</value>
</property>
● mkdir -p urls ; cd urls ; touch seed.txt
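For reference, a sketch of the shell steps above and the complete conf/nutch-site.xml, assuming a JDK installed under /usr/java/jdk1.6.0 ( substitute your own JAVA_HOME path ):

  export JAVA_HOME=/usr/java/jdk1.6.0   # assumed path - point this at your JDK
  cd apache-nutch-1.6
  mkdir -p urls ; touch urls/seed.txt

  <?xml version="1.0"?>
  <configuration>
    <!-- identifies our crawler to the web sites it visits -->
    <property>
      <name>http.agent.name</name>
      <value>Nutch Spider</value>
    </property>
  </configuration>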
Configuring Nutch Part 2
● Add the following URL ( ours ) to seed.txt
– http://www.semtech-solutions.co.nz
● Change the URL filtering in conf/regex-urlfilter.txt, replacing the line
– # accept anything else
– +.
– with
– +^http://([a-z0-9]*\.)*semtech-solutions.co.nz/
● This filters the URLs found so that only pages from the local site are crawled
( the edited section is sketched below )
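A sketch of the edited section of conf/regex-urlfilter.txt, with the catch-all +. rule replaced and the dot escaped in the domain name:

  # accept anything else
  +^http://([a-z0-9]*\.)*semtech-solutions.co.nz/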
Configuring Solr Part 1
● cd solr-4.2.1/example/solr/collection1/conf
● Add some extra fields to schema.xml after the _version_ field ( a sketch of typical fields follows below )
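The slide does not show the field definitions themselves; the sketch below shows the kind of fields commonly added for Nutch, assuming the default Nutch 1.6 field names and the text_general field type from the Solr example schema ( compare with the schema.xml shipped in apache-nutch-1.6/conf for the authoritative list ):

  <field name="host"    type="string"       stored="false" indexed="true"/>
  <field name="digest"  type="string"       stored="true"  indexed="false"/>
  <field name="segment" type="string"       stored="true"  indexed="false"/>
  <field name="boost"   type="float"        stored="true"  indexed="false"/>
  <field name="tstamp"  type="date"         stored="true"  indexed="false"/>
  <field name="url"     type="string"       stored="true"  indexed="true"/>
  <field name="content" type="text_general" stored="true"  indexed="true"/>
  <field name="anchor"  type="string"       stored="true"  indexed="true" multiValued="true"/>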
Start Solr Server – Part 1
● Within solr-4.2.1/example
● Run the following command
● java -jar start.jar
● Now try to access the Solr admin web page
– http://localhost:8983/solr/admin
● You should now see the admin web site
– ( see next page; a command line check is also sketched below )
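A quick command line check that the server is up, assuming the default collection1 core from the Solr example:

  curl "http://localhost:8983/solr/collection1/admin/ping"
  # the response should contain <str name="status">OK</str>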
Start Solr Server – Part 2
● Solr Admin web page
Run Nutch / Solr
● We are ready to crawl our first web site
● Go to apache-nutch-1.6 directory
● Run the following commands
– touch nutch_start.bash
– chmod 755 nutch_start.bash
– vi nutch_start.bash
● Add the following text to the file
#!/bin/bash
bin/nutch crawl urls -solr http://localhost:8983/solr/ \
-dir crawl -depth 3 -topN 3
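For reference, the same command with each option annotated ( these are the standard Nutch 1.x crawl options ):

  #!/bin/bash
  # urls    - directory holding seed.txt
  # -solr   - Solr instance that will index the crawled content
  # -dir    - local directory where Nutch stores its crawl data
  # -depth  - number of link levels to follow from the seed urls
  # -topN   - maximum number of pages to fetch at each level
  bin/nutch crawl urls -solr http://localhost:8983/solr/ \
  -dir crawl -depth 3 -topN 3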
Run Nutch / Solr
● Now run the nutch bash file
– ./nutch_start.bash
● Select the Logging option in the Solr admin console
● Monitor the Logging console for errors
● The crawl should finish with no errors and the line
– Crawl finished: crawl
– in the crawl window ( a quick check of the result is sketched below )
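A quick sanity check of the crawl database, run from the apache-nutch-1.6 directory:

  bin/nutch readdb crawl/crawldb -stats
  # reports totals such as the number of fetched and unfetched urls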
Check Crawled Data
● Now we check the data that we have crawled
● In Admin Console window
– Set Core Selector to collection1
– Select the Query option
– Click the Execute Query button
● You should now see some of the data that you have crawled ( the same query from the command line is sketched below )
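The same kind of query can be run from the command line, assuming the default collection1 core:

  curl "http://localhost:8983/solr/collection1/select?q=*:*&rows=5&wt=json&indent=true"
  # returns up to five indexed documents as JSON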
Crawled Data
● Crawled data shown in a Solr query
Crawled Data
● That's your first simple crawl completed
● Further reading at
– http://nutch.apache.org
– http://lucene.apache.org/solr
● Now you can
– Add more URLs to your seed.txt
– Increase the depth and breadth of your link search via the options
● -depth ( link levels followed )
● -topN ( pages fetched per level )
– Modify your URL filtering
Contact Us
● Feel free to contact us at
– www.semtech-solutions.co.nz
– info@semtech-solutions.co.nz
● We offer IT project consultancy
● We are happy to hear about your problems
● You only pay for the hours that you need to solve your problems