Information Retrieval
and Data Science
SPARKLER
A web crawler on Apache Spark
Thamme Gowda
@thammegowda
Karanjeet Singh
@_karanjeet
Dr. Chris Mattmann
@chrismattmann
Spark Summit East 2017, Boston, Feb 7-9, 2017
https://github.com/USCDataScience/sparkler
ABOUT
Information Retrieval and Data Science (IRDS) Group
University of Southern California, Los Angeles, CA
Home page: https://irds.usc.edu
Email: irds-L@mymaillists.usc.edu
● Thamme Gowda, Graduate Student, @thammegowda
● Karanjeet Singh, Graduate Student, @karanjeet_tw
● Dr. Chris Mattmann, Director, IRDS, @chrismattmann
OVERVIEW
● About Sparkler
● Motivations for building Sparkler
● Sparkler technology stack, internals
● Features of Sparkler
● Dashboard
● Demo
● What’s next?
ABOUT: SPARKLER
● New open-source web crawler
• A bot program that fetches resources from the web
● Name: Spark Crawler
● Inspired by Apache Nutch
● Like Nutch: a distributed crawler that scales horizontally
● Unlike Nutch: runs on top of Apache Spark
● Easy to deploy and easy to use
MOTIVATION #1
● Challenges in DARPA MEMEX*
• The MEMEX system has crawlers that fetch deep and dark web data
• ML-based analysis assists law enforcement agencies
• Crawls were a black box; we wanted real-time progress reports
● Dr. Chris Mattmann had been considering an upgrade for 3 years
● Technology upgrade needed
* http://memex.jpl.nasa.gov/
WHY A NEW CRAWLER?
Modern Hadoop cluster has no Hadoop (Map-Reduce) left in it!
https://twitter.com/cutting/status/796566255830503424
MOTIVATION #2
● Challenges at DATOIN
• Intro: Datoin.com is a distributed text analytics platform
• Late 2014: migrated the infrastructure from Hadoop MapReduce to Apache Spark
• But the crawler component (powered by Apache Nutch) was left behind
● Met Dr. Chris Mattmann at USC in the Web Search Engines class
• Asked for his thoughts on running Nutch on Spark
SPARKLER: TECH STACK
● Batch crawling (similar to Apache Nutch)
● Apache Solr as the crawl database
● Multi-module Maven project with OSGi bundles
● Streams crawled content through Apache Kafka
● Parses everything using Apache Tika
● Crawl visualization with Banana
SPARKLER: INTERNALS & WORKFLOW
SPARKLER: CRAWLDB
SPARKLER: RDD
SPARKLER: LINKS PIPELINE
SPARKLER: OUTPUT CONSUMPTION
SPARKLER: FEATURES
SPARKLER #1: Lucene/Solr powered Crawldb
● The crawldb needed indexing
• For real-time analytics
• For instant visualizations
● The crawldb is Sparkler's internal data structure
• Exposed over a REST API
• Used by sparkler-ui, the web application
● We chose Apache Solr
● Standalone Solr server or SolrCloud? Yes, either works
● Glued the crawldb to Spark using CrawldbRDD
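To illustrate why an indexed crawldb makes real-time analytics cheap, the sketch below builds a Solr select URL that asks only for facet counts per crawl status. The host, port, core name `crawldb`, and the field name `status` are assumptions made for this example and are not taken from Sparkler's actual schema; only the standard Solr query parameters (`q`, `rows`, `facet`, `facet.field`, `wt`) are real.

```python
from urllib.parse import urlencode

# Hypothetical Solr location and core name; adjust to your deployment.
SOLR_BASE = "http://localhost:8983/solr/crawldb"


def status_facet_url(base=SOLR_BASE, field="status"):
    """Build a Solr /select URL that returns only facet counts for `field`."""
    params = {
        "q": "*:*",            # match every record in the crawldb
        "rows": 0,             # return counts only, no documents
        "facet": "true",
        "facet.field": field,  # assumed field name
        "wt": "json",
    }
    return f"{base}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(status_facet_url())
```

A dashboard could poll a URL like this while the crawl is running, which is roughly the kind of progress report the black-box crawls were missing.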
SPARKLER #2: URL Partitioning
● Politeness
• Doesn’t hit the same server too many times in distributed mode
● First version
• Group by: host name
• Sort by: depth, score
● Customization is easy
• Write your own Solr query
• Take advantage of boosting to alter the ranking
● Partitions the dataset based on the above criteria
● Lazy evaluation and a delay between requests
• Performs parsing instead of waiting
• Inserts a delay only when necessary
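The grouping and the lazy politeness delay described above can be sketched in a few lines. This is a minimal illustration and not Sparkler's implementation: the record layout `(url, depth, score)`, the function names, and the sort direction (shallower depth first, then higher score first) are assumptions.

```python
import time
from collections import defaultdict
from urllib.parse import urlparse


def partition_by_host(records):
    """Group (url, depth, score) records by host; sort each group by depth, then score."""
    groups = defaultdict(list)
    for url, depth, score in records:
        groups[urlparse(url).hostname].append((url, depth, score))
    for host in groups:
        # Assumed ordering: shallower pages first, higher score first within a depth.
        groups[host].sort(key=lambda r: (r[1], -r[2]))
    return dict(groups)


def polite_fetch(urls, fetch, min_gap=1.0, clock=time.monotonic, sleep=time.sleep):
    """Lazily fetch the URLs of one host, sleeping only for the unspent part of min_gap."""
    last = None
    for url in urls:
        if last is not None:
            remaining = min_gap - (clock() - last)
            if remaining > 0:  # time the consumer spent parsing already counts
                sleep(remaining)
        last = clock()
        yield fetch(url)
```

Because `polite_fetch` is a generator, the consumer parses each page before asking for the next one, so the time spent parsing is deducted from the delay instead of being added to it.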
SPARKLER #3: OSGi Plugins
● Plugin interfaces are inspired by Nutch
● Plugins are developed to the OSGi (Open Services Gateway initiative) specification
● We chose the Apache Felix implementation of OSGi
● Migrated a plugin from Nutch
• Regex URL Filter plugin → the most used plugin in Nutch
● Added a JavaScript plugin (described in the next slide)
● //TODO: Migrate more plugins from Nutch
• Mavenize Nutch [NUTCH-2293]
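For readers who have not used a regex URL filter, the idea can be sketched as follows. This is not the plugin's code: it is a small Python illustration of the Nutch-style rule format, in which each rule is a `+` (accept) or `-` (reject) followed by a regular expression and the first matching rule decides. The rule set and the function names are hypothetical, and rejecting URLs that match no rule is an assumption of this sketch.

```python
import re


def compile_rules(text):
    """Parse '+regex' / '-regex' lines; blank lines and '#' comments are skipped."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accept(url, rules):
    """First matching rule decides; a URL that matches no rule is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


# Illustrative rule set, not one shipped with Nutch or Sparkler.
RULES = compile_rules(r"""
# skip common static assets
-\.(gif|jpg|png|css|js)$
# accept everything else over http(s)
+^https?://
""")
```

With these rules, `accept("http://example.com/logo.png", RULES)` is rejected by the first rule, while an ordinary HTML page over http falls through to the second rule and is accepted.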
SPARKLER #4: JavaScript Rendering
● JavaScript execution* has first-class support
• Allows Sparkler to crawl the deep/dark web too
● Distributable on a Spark cluster without pain
• Pure JVM-based JavaScript engine
● This is an implementation of FetchFunction
● FetchFunction
• Stream<URL> → Stream<Content>
• Note: URLs are grouped by host
• Preserves cookies and reuses sessions for each iteration
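The shape of a FetchFunction, a stream of URLs in and a stream of content out with one session shared across a host's URLs, can be sketched as below. This is an illustration under stated assumptions and not Sparkler's interface: `Content`, `Session`, `fetch_function`, and the `transport` callable are hypothetical names, and the transport is injected so the example runs without a network or a browser engine.

```python
from typing import Iterable, Iterator, NamedTuple


class Content(NamedTuple):
    url: str
    status: int
    body: bytes


class Session:
    """Toy stand-in for a browser session: keeps cookies between requests."""

    def __init__(self, transport):
        # transport is a callable(url, cookies) -> (status, body, new_cookies)
        self.transport = transport
        self.cookies = {}

    def get(self, url):
        status, body, new_cookies = self.transport(url, dict(self.cookies))
        self.cookies.update(new_cookies)
        return Content(url, status, body)


def fetch_function(urls: Iterable[str], transport) -> Iterator[Content]:
    """Stream of URLs in, stream of Content out; one session serves the whole group."""
    session = Session(transport)
    for url in urls:
        yield session.get(url)
```

Since the URLs handed to one call all belong to the same host, cookies set by the first response are available to every later request in that group.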
Thanks to: Madhav Sharan
Member of USC IRDS
* JBrowserDriver by MachinePublishers
SPARKLER #5: Output in Kafka Streams
● The crawler is sometimes the input to applications that do deeper analysis
• Can’t fit all of that deeper analysis into the crawler
● Integration with such applications is made easy via queues
● We chose Apache Kafka
• Suits our needs
• Distributable, scalable, fault tolerant
● FIXME: Larger messages such as videos
● This is optional; the default output goes to a shared file system (such as HDFS), compatible with Nutch
Thanks to: Rahul Palamuttam
MS CS @ Stanford University; Intern @ NASA JPL
SPARKLER #6: Tika, the universal parser
● Apache Tika
• Is a toolkit of parsers
• Detects and extracts metadata, text, and URLs
• Supports over a thousand different file types
● Main application in Sparkler is to discover outgoing links
● The default implementation of our ParseFunction
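To make the outgoing-link discovery step concrete, here is a small sketch of what a parse step returns for an HTML page. It deliberately does not use Tika: Python's standard `html.parser` stands in for it and handles HTML only, and the names `LinkExtractor` and `parse_outlinks` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute http(s) links from <a href=...> tags, in document order."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop the #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute.startswith(("http://", "https://")) and absolute not in self.links:
                    self.links.append(absolute)


def parse_outlinks(base_url, html):
    """Return the de-duplicated outgoing links found in `html`."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

The links returned here are what a crawler would feed back into the crawldb as newly discovered URLs for the next batch.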
SPARKLER #7: Visual Analytics
● Charts and graphs provide a nice summary of the crawl job
● Real-time analytics
● Examples:
• Distribution of URLs across hosts/domains
• Temporal activities
• Status reports
● Customizable in real time
● Uses the Banana dashboard from Lucidworks
● Sparkler has a subcomponent named sparkler-ui
Thanks to: Manish Dwibedy
MS CS University of Southern California
SPARKLER #7 DASHBOARD
SPARKLER #8: Deployment
● Docker
● Juju Charms
Thanks to: Tom Barber
Spicule Analytics & NASA-JPL
SPARKLER #Next: What’s coming?
● Scoring Crawled Pages (Work in progress)
● Focused Crawling (Work in progress)
● Domain Discovery (Work in progress)
● Detailed documentation and tutorials on wiki (Work in progress)
● Interactive UI
● Crawl Graph Analysis
● Other useful plugins from Nutch
Being used for the Polar Deep Insights project
https://www.earthcube.org/group/polar-data-insights-search-analytics-deep-scientific-web
DEMO
https://github.com/USCDataScience/sparkler
$ bin/dockler.sh
QUESTIONS?
https://github.com/USCDataScience/sparkler
THANK YOU
https://github.com/USCDataScience/sparkler
