Sparkler Presentation for Spark Summit East 2017

Information Retrieval
and Data Science
Thamme Gowda
@thammegowda
Karanjeet Singh
@_karanjeet
A web-crawler on Apache Spark
Feb 7-9, 2017, Spark Summit East 2017, Boston
SPARKLER
Dr. Chris Mattmann
@chrismattmann
https://github.com/USCDataScience/sparkler

ABOUT
Information Retrieval and Data Science (IRDS) Group
University of Southern California, Los Angeles, CA
Home page: https://irds.usc.edu
Email: irds-L@mymaillists.usc.edu
Thamme Gowda, Graduate Student, @thammegowda
Karanjeet Singh, Graduate Student, @karanjeet_tw
Dr. Chris Mattmann, Director, IRDS, @chrismattmann

OVERVIEW
● About Sparkler
● Motivations for building Sparkler
● Sparkler technology stack, internals
● Features of Sparkler
● Dashboard
● Demo
● What’s Next?

ABOUT: SPARKLER
● New Open Source Web Crawler
•A bot program that can fetch resources from the web
● Name: Spark Crawler
● Inspired by Apache Nutch
● Like Nutch: Distributed crawler that can scale horizontally
● Unlike Nutch: Runs on top of Apache Spark
● Easy to deploy and easy to use

MOTIVATION #1
● Challenges in DARPA MEMEX*
•MEMEX System has crawlers to fetch deep and dark web data
•ML-based analysis to assist law enforcement agencies
•Crawls were a black box; we wanted real-time progress reports
● Dr. Chris Mattmann had been considering an upgrade for 3 years
● Technology upgrade needed
* http://memex.jpl.nasa.gov/

WHY A NEW CRAWLER?
Modern Hadoop cluster has no Hadoop (Map-Reduce) left in it!
https://twitter.com/cutting/status/796566255830503424

MOTIVATION #2
● Challenges at DATOIN
•Intro: Datoin.com is a distributed text analytics platform
•Late 2014: migrated the infrastructure from Hadoop MapReduce to Apache Spark
•But the crawler component (powered by Apache Nutch) was left behind
● Met Dr. Chris Mattmann at USC in Web Search Engines class
•Asked for his thoughts on running Nutch on Spark

SPARKLER: TECH STACK
● Batch crawling (similar to Apache Nutch)
● Apache Solr as crawl database
● Multi-module Maven project with OSGi bundles
● Stream crawled content through Apache Kafka
● Parses everything using Apache Tika
● Crawl visualization - Banana

SPARKLER: INTERNALS & WORKFLOW

SPARKLER: CRAWLDB

SPARKLER: RDD

SPARKLER: LINKS PIPELINE

SPARKLER: OUTPUT CONSUMPTION

SPARKLER: FEATURES

SPARKLER #1: Lucene/Solr powered Crawldb
● Crawldb needed indexing
•For real time analytics
•For instant visualizations
● This is an internal data structure of Sparkler
•Exposed over REST API
•Used by Sparkler-ui, the web application
● We chose Apache Solr
● Standalone Solr server or Solr cloud? Yes!
● Glued the crawldb and Spark together using CrawldbRDD
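
Because the crawldb lives in Solr, crawl progress can be read with an ordinary faceted query over Solr's REST API. The following is a minimal Python sketch of that idea, not Sparkler's own code: the core URL `http://localhost:8983/solr/crawldb`, the `status` field name, and the sample response values are all assumptions made for illustration. Only the standard Solr parameters (`q`, `rows`, `facet`, `facet.field`, `wt`) are taken from Solr itself.

```python
from urllib.parse import urlencode


def status_facet_params(facet_field="status"):
    """Solr parameters that count documents per value of one field.

    The field name "status" is hypothetical; use whatever the crawldb schema defines.
    """
    return {
        "q": "*:*",           # match every document in the crawldb
        "rows": 0,            # we only want counts, not documents
        "facet": "true",
        "facet.field": facet_field,
        "wt": "json",
    }


def select_url(core_url, params):
    """Build the URL of a Solr /select request for the given core."""
    return core_url.rstrip("/") + "/select?" + urlencode(params)


def facet_counts(response, facet_field="status"):
    """Turn Solr's flat [value, count, value, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][facet_field]
    return dict(zip(flat[0::2], flat[1::2]))


# Made-up sample response in Solr's default JSON facet layout (not real crawl data).
sample_response = {
    "facet_counts": {
        "facet_fields": {"status": ["FETCHED", 120, "UNFETCHED", 45, "ERROR", 3]}
    }
}

url = select_url("http://localhost:8983/solr/crawldb", status_facet_params())
counts = facet_counts(sample_response)
```

A dashboard or script could poll such a URL while the crawl runs, which is the kind of real-time reporting the indexed crawldb is meant to enable.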

SPARKLER #2: URL Partitioning
● Politeness
•Doesn’t hit the same server too many times in distributed mode
● First version
•Group by: Host name
•Sort by: depth, score
● Customization is easy
•Write your own Solr query
•Take advantage of boosting to alter the ranking
● Partitions the dataset based on the above criteria
● Lazy evaluations and delay between the requests
•Performs parsing instead of waiting
•Inserts delay only when it is necessary
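
The group-by-host, sort-by-depth-and-score scheme above can be sketched in plain Python. This is an in-memory illustration only: in Sparkler the selection is expressed as a Solr query, and the record layout (`url`, `depth`, `score`), the sort directions (shallow first, high score first), and the `top_n` cut-off are assumptions.

```python
from itertools import groupby
from urllib.parse import urlparse


def host_of(record):
    """Host name used as the partition key (empty string if the URL has none)."""
    return urlparse(record["url"]).hostname or ""


def partition_by_host(records, top_n=2):
    """Group records by host; within a host, order by depth, then by score.

    Assumed ordering: smaller depth first, higher score first.
    Each host keeps at most top_n records for the current batch.
    """
    ordered = sorted(records, key=lambda r: (host_of(r), r["depth"], -r["score"]))
    return {host: list(group)[:top_n] for host, group in groupby(ordered, key=host_of)}


# Made-up sample records for illustration.
records = [
    {"url": "http://a.example/x", "depth": 1, "score": 0.2},
    {"url": "http://b.example/", "depth": 0, "score": 0.5},
    {"url": "http://a.example/", "depth": 0, "score": 0.9},
    {"url": "http://a.example/y", "depth": 1, "score": 0.7},
]
partitions = partition_by_host(records, top_n=2)
```

Since every URL of a host lands in one partition, a single worker handles that host and can pace its own requests, which is what makes politeness workable in distributed mode.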

SPARKLER #3: OSGi Plugins
● Plugin interfaces are inspired by Nutch
● Plugins are developed per the OSGi (Open Services Gateway initiative) specification
● We chose the Apache Felix implementation of OSGi
● Migrated a plugin from Nutch
•Regex URL Filter Plugin → The most used plugin in Nutch
● Added JavaScript plugin (described in the next slide)
● //TODO: Migrate more plugins from Nutch
•Mavenize Nutch [NUTCH-2293]
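
To illustrate what a regex URL filter does, here is a simplified Python sketch in the style of Nutch's rule files: each rule is a `+` (accept) or `-` (reject) followed by a regular expression, and the first matching rule decides. The rules shown are examples, and whether Sparkler's migrated plugin behaves identically in every detail is an assumption.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines; blank lines and '#' comments are skipped."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first matching rule; reject if nothing matches."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


# Example rules, written for illustration.
RULES = parse_rules(r"""
# skip common image and archive suffixes
-\.(gif|jpg|png|zip|gz)$
# skip URLs containing query-like characters
-[?*!@=]
# accept everything else
+.
""")
```

For instance, `accepts(RULES, "http://example.com/logo.png")` is rejected by the first rule, while a plain page URL falls through to the final `+.` rule and is accepted.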

SPARKLER #4: JavaScript Rendering
● JavaScript execution* has first-class support
•Allows Sparkler to crawl the Deep/Dark web too
● Distributable on Spark Cluster without pain
•Pure JVM based JavaScript engine
● This is an implementation of FetchFunction
● FetchFunction
•Stream<URL> → Stream<Content>
•Note: URLs are grouped by host
•Preserves cookies and reuses sessions for each iteration
Thanks to: Madhav Sharan
Member of USC IRDS
* JBrowserDriver by MachinePublishers
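
The FetchFunction shape (a stream of URLs in, a stream of content out) together with the "delay only when necessary" idea can be sketched as a Python generator. This is an illustrative stand-in, not Sparkler's JVM implementation: `fetch_stream`, its parameters, and the fake clock used in the demo are all hypothetical names introduced here.

```python
import time


def fetch_stream(urls, fetch, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
    """Lazily map the URLs of one host to (url, content) pairs.

    A pause is inserted only if less than min_delay seconds have passed since
    the previous request, so time the consumer spends parsing counts as waiting.
    """
    last_request = None
    for url in urls:
        if last_request is not None:
            remaining = min_delay - (clock() - last_request)
            if remaining > 0:
                sleep(remaining)
        last_request = clock()
        yield url, fetch(url)


# Demo with a fake clock so it runs instantly; all values are made up.
now = [0.0]
naps = []


def fake_sleep(seconds):
    naps.append(seconds)
    now[0] += seconds


def fake_fetch(url):
    return "<html>" + url + "</html>"


fetched = []
for url, content in fetch_stream(["u1", "u2", "u3"], fake_fetch,
                                 min_delay=1.0, clock=lambda: now[0], sleep=fake_sleep):
    fetched.append(url)
    if url == "u1":
        now[0] += 2.0  # pretend parsing the first page took two seconds
```

In the demo, no pause is needed before the second request because parsing already took longer than the delay, while the third request waits the full second. A real `fetch` could be a method bound to a session object so that cookies carry over between requests.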

SPARKLER #5: Output in Kafka Streams
● The crawler is sometimes an input for applications that perform deeper analysis
•Can’t fit all of that deeper analysis into the crawler
● Integration with such applications is made easy via queues
● We chose Apache Kafka
•Suits our need
•Distributable, Scalable, Fault Tolerant
● FIXME: Larger messages such as Videos
● This is optional; the default output goes to a shared file system (such as HDFS) and is compatible with Nutch
Thanks to: Rahul Palamuttam
MS CS @ Stanford University; Intern @ NASA JPL

SPARKLER #6: Tika, the universal parser
● Apache Tika
•Is a toolkit of parsers
•Detects and extracts metadata, text, and URLs
•Over a thousand different file types
● Main application is to discover outgoing links
● The default implementation of our ParseFunction
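
The link-discovery role of the parse step can be illustrated with a small Python sketch. It uses Python's standard `html.parser` as a stand-in for Apache Tika and handles HTML only, so it shows the idea rather than Sparkler's ParseFunction; the names `LinkCollector` and `parse_outlinks` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect absolute http(s) links from anchor tags, without duplicates."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop the #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute.startswith(("http://", "https://")) and absolute not in self.links:
                    self.links.append(absolute)


def parse_outlinks(base_url, html):
    """Return the outgoing links found in an HTML page, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


# Made-up page for illustration.
page = (
    '<a href="/about">About</a> '
    '<a href="http://other.example/x#top">Other</a> '
    '<a href="mailto:me@example.com">Mail</a> '
    '<a href="/about">About again</a>'
)
outlinks = parse_outlinks("http://example.com/index.html", page)
```

The discovered links are what a crawler would feed back into the crawldb as candidates for the next batch.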

SPARKLER #7: Visual Analytics
● Charts and graphs provide a nice summary of the crawl job
● Real time analytics
● Example:
•Distribution of URLs across hosts/domains
•Temporal activities
•Status reports
● Customizable in real time
● Using Banana Dashboard from Lucidworks
● Sparkler has a subcomponent named sparkler-ui
Thanks to: Manish Dwibedy
MS CS University of Southern California

SPARKLER #7 DASHBOARD

SPARKLER #8: Deployment
● Docker
● Juju Charms
Thanks to: Tom Barber
Spicule Analytics & NASA-JPL

SPARKLER #Next: What’s coming?
● Scoring Crawled Pages (Work in progress)
● Focused Crawling (Work in progress)
● Domain Discovery (Work in progress)
● Detailed documentation and tutorials on wiki (Work in progress)
● Interactive UI
● Crawl Graph Analysis
● Other useful plugins from Nutch
Being used for Polar Deep Insights project
https://www.earthcube.org/group/polar-data-insights-search-analytics-deep-scientific-web

DEMO
$ bin/dockler.sh

QUESTIONS?

THANK YOU
