SPARKLER
Thamme Gowda, Karanjeet Singh, Chris Mattmann
Information Retrieval and Data Science
Nov 15th, 2016 @ Apache Big Data EU 2016, Seville, Spain
ABOUT: USC INFORMATION RETRIEVAL
AND DATA SCIENCE GROUP
● Established in August 2012 at the University of Southern California (USC)
● Dr. Chris Mattmann, Director of IRDS and our Advisor
● Funding from NSF, DARPA, NASA, DHS, private industry and other agencies
- in collaboration with NASA JPL
● 3 Postdocs, 30+ Masters and PhD students, and 20+ JPLers over the past 7 years
● Recent topical research in the DARPA XDATA/MEMEX program
Email : irds-L@mymaillists.usc.edu
Website : http://irds.usc.edu/
GitHub : https://github.com/USCDataScience/
ABOUT: US
Karanjeet Singh
Graduate Student at the University of Southern California, USA
Research Interest: Information Retrieval & Natural Language Processing
Research Affiliate at NASA Jet Propulsion Laboratory
Committer and PMC member of Apache Nutch
Thamme Gowda
Graduate Student at the University of Southern California, USA
Research Intern at NASA Jet Propulsion Laboratory, Co-Founder at Datoin
Research Interest: NLP, Machine Learning and Information Retrieval
Committer and PMC member of Apache Nutch, Tika, and Joshua (Incubating)
Dr. Chris Mattmann
Director & Vice Chairman, Apache Software Foundation
Research Interest: Data Science, Open Source, Information Retrieval & NLP
Committer and PMC member of Apache Nutch, Tika, (former) Lucene, OODT, Incubator
OVERVIEW
● About Sparkler
● Motivations for building Sparkler
● Quick intro to Apache Spark
● Sparkler technology stack, internals
● Features of Sparkler
● Comparison with Nutch
● Going forward
ABOUT: SPARKLER
● New Open Source Web Crawler
○ A bot program that can fetch resources from the web
● Name: Spark Crawler
● Inspired by Apache Nutch
● Like Nutch: Distributed crawler that can scale horizontally
● Unlike Nutch: Runs on top of Apache Spark
● Easy to deploy and easy to use
MOTIVATION #1
● Challenges in DARPA MEMEX
○ Intro: The MEMEX system has crawlers that fetch deep and dark web data to assist law enforcement agencies
○ Crawls are something of a black box; we wanted real-time progress reports
● Dr. Chris Mattmann had been considering an upgrade for 3 years
● Technology upgrade needed
WHY A NEW CRAWLER?
Modern Hadoop cluster has no Hadoop (Map-Reduce) left in it!
https://twitter.com/cutting/status/796566255830503424
MOTIVATION #2
● Challenges at DATOIN
○ Intro: Datoin is a distributed text analytics platform
○ Late 2014 - migrated the infrastructure from Hadoop
Map Reduce to Apache Spark
○ But the crawler component (powered by Apache Nutch)
was left behind
● Met Dr. Chris Mattmann at USC in the Web Search Engines class
○ Asked for his thoughts on running Nutch on Spark
○ Agreed to work on it
KEY FEATURES
● High performance & Fault tolerance
● Real time crawl analysis
● Easy to customize
Is the food ready?
How is it going?
I want less salt.
APACHE SPARK: OVERVIEW
● Introduction
● Resilient Distributed Dataset (RDD)
● Driver, Workers & Executors
APACHE SPARK: INTRODUCTION
● Fast and general engine for large scale data processing
● Started at UC Berkeley in 2009
● One of the most popular distributed computing frameworks
● Provides high level APIs in Scala, Java, Python, R
● Integration with Hadoop and its ecosystem
● Open sourced in 2010 under a BSD license; donated to Apache in 2013 and relicensed under Apache v2.0
● Mattmann helped to bring Spark to Apache under
DARPA XDATA effort
Resilient Distributed Dataset (RDD)
● A basic abstraction in Spark
● Immutable, partitioned collection of elements operated on in parallel
● Data in persistent store (HDFS, Cassandra) or in cache (memory, disk)
● Partitions are recomputed on failure or cache eviction
● Two classes of operations
○ Transformations
○ Actions
● Custom RDDs can also be
implemented - we have one!
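The split between transformations and actions can be illustrated with a toy sketch in plain Python. `ToyRDD` is a hypothetical, single-process stand-in and not the Spark API: transformations only record a lineage of functions, and an action replays that lineage partition by partition. Replaying the lineage is also how a lost or evicted partition could be recomputed from its source.

```python
from typing import Callable, Iterable, List


class ToyRDD:
    """Hypothetical, single-process stand-in for an RDD (not the Spark API)."""

    def __init__(self, partitions: List[List[int]], lineage=None):
        self.partitions = partitions      # source data, already split into partitions
        self.lineage = lineage or []      # recorded transformations, not yet applied

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, fn: Callable) -> "ToyRDD":
        return ToyRDD(self.partitions, self.lineage + [("map", fn)])

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(self.partitions, self.lineage + [("filter", pred)])

    def compute(self, index: int) -> Iterable:
        """(Re)compute one partition from the source by replaying the lineage."""
        data = iter(self.partitions[index])
        for kind, fn in self.lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Actions: trigger the computation and return a result to the caller.
    def collect(self) -> list:
        return [x for i in range(len(self.partitions)) for x in self.compute(i)]

    def count(self) -> int:
        return len(self.collect())


rdd = ToyRDD([[1, 2, 3], [4, 5, 6]])
evens_squared = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)  # nothing runs yet
print(evens_squared.collect())  # action: prints [4, 16, 36]
```

Because each `ToyRDD` keeps only the source partitions plus the lineage, calling `compute(1)` again after a simulated failure yields the same values for that partition without touching the others.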
Driver, Workers & Executors
* Photo credit - spark.apache.org
SPARKLER: TECH STACK
● Batch crawling (similar to Apache Nutch)
● Apache Solr as crawl database
● Multi module Maven project with OSGi bundles
● Stream crawled content through Apache Kafka
● Parses everything using Apache Tika
● Crawl visualization - Banana
SPARKLER: INTERNALS & WORKFLOW
SPARKLER: FEATURES
SPARKLER #1: Lucene/Solr powered Crawldb
● Crawldb needed indexing
○ For real time analytics
○ For instant visualizations
● This is the internal data structure of Sparkler
○ Exposed over REST API
○ Used by sparkler-ui, the web application
● We chose Apache Solr
● Standalone Solr Server or Solr Cloud?
● Glued the crawldb and Spark using CrawldbRDD
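Since the crawldb lives in Solr, the next batch of URLs can be expressed as an ordinary Solr query. The sketch below builds such a request URL using Solr's result-grouping parameters (`group`, `group.field`, `group.limit`, `rows`, `sort`). The core URL, the field names (`status`, `hostname`, `depth`, `score`) and the function name are assumptions made for illustration; they are not taken from Sparkler's schema or code.

```python
from urllib.parse import urlencode

# Hypothetical Solr core URL, for illustration only.
SOLR_SELECT = "http://localhost:8983/solr/crawldb/select"


def next_batch_query(top_groups: int, top_n: int) -> str:
    """Build a grouped Solr query URL: up to top_groups hosts, top_n URLs per host."""
    params = {
        "q": "status:UNFETCHED",          # assumed status field and value
        "group": "true",                  # turn on Solr result grouping
        "group.field": "hostname",        # assumed field holding the host name
        "group.limit": top_n,             # max documents returned per group
        "rows": top_groups,               # with grouping on, rows caps the number of groups
        "sort": "depth asc, score desc",  # assumed field names for the ordering
        "wt": "json",
    }
    return SOLR_SELECT + "?" + urlencode(params)


print(next_batch_query(top_groups=252, top_n=1000))
```

Changing the ranking then only means changing the query string, for example by adding a boost, without touching the crawler itself.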
SPARKLER #2: Partitioning by host
● Politeness
○ Doesn't hit the same server too many times in distributed mode
● First version
○ Group by: Host name
○ Sort by: depth, score
● Customization is easy
○ Write your own Solr query
○ Take advantage of boosting to alter the ranking
● Partitions the dataset based on the above criteria
● Lazy evaluation and delay between requests
○ Performs parsing instead of waiting
○ Inserts a delay only when necessary
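The grouping, ordering and lazy delay described above can be sketched in plain Python. This is an in-memory illustration under stated assumptions, not Sparkler's implementation: `partition_by_host` and `with_politeness` are hypothetical names, records are simple `(url, depth, score)` tuples, and hosts are picked alphabetically only to keep the example deterministic.

```python
import time
from collections import defaultdict
from urllib.parse import urlsplit


def partition_by_host(records, top_groups, top_n):
    """Group (url, depth, score) records by host; order by depth asc, score desc."""
    groups = defaultdict(list)
    for url, depth, score in records:
        groups[urlsplit(url).hostname].append((depth, -score, url))
    partitions = {}
    for host in sorted(groups)[:top_groups]:      # arbitrary but deterministic host choice
        ranked = sorted(groups[host])[:top_n]     # keep the best top_n URLs of this host
        partitions[host] = [url for _, _, url in ranked]
    return partitions


def with_politeness(urls, min_delay, now=time.monotonic, sleep=time.sleep):
    """Lazily yield the URLs of one host, sleeping only for the part of min_delay
    that the consumer has not already spent on other work (for example, parsing)."""
    last = None
    for url in urls:
        if last is not None:
            remaining = min_delay - (now() - last)
            if remaining > 0:
                sleep(remaining)
        last = now()
        yield url


records = [
    ("http://a.com/1", 1, 0.5),
    ("http://a.com/2", 0, 0.1),
    ("http://b.org/x", 0, 0.9),
    ("http://a.com/3", 1, 0.9),
]
print(partition_by_host(records, top_groups=10, top_n=2))
# prints {'a.com': ['http://a.com/2', 'http://a.com/3'], 'b.org': ['http://b.org/x']}
```

Each host ends up in exactly one partition, so a single worker owns all requests to that server and the delay can be enforced locally. Because `with_politeness` is a generator, time the consumer spends parsing the previous page counts toward the delay.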
SPARKLER #3: OSGI Plugins
● Plugin interfaces are inspired by Nutch
● Plugins are developed as per the Open Services Gateway initiative (OSGi) specification
● We chose the Apache Felix implementation of OSGi
● Migrated a plugin from Nutch
○ Regex URL Filter Plugin → The most used plugin in
Nutch
● Added JavaScript plugin (described in the next slide)
● //TODO: Migrate more plugins from Nutch
○ Mavenize nutch [NUTCH-2293]
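To give a feel for what a regex URL filter does, here is a minimal sketch in Python, loosely modelled on the rule-file convention of Nutch's regex URL filter: one rule per line, a leading `+` to accept or `-` to reject, first matching rule wins, and a URL that matches no rule is rejected. The function names and the sample rules are illustrative assumptions, and Python's `re` syntax is not identical to Java's, so this is not a drop-in equivalent of the migrated plugin.

```python
import re


def load_rules(text):
    """Parse rules: one per line, '+regex' accepts, '-regex' rejects, '#' starts a comment."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accept(url, rules):
    """First matching rule decides; a URL that matches no rule is rejected."""
    for allowed, pattern in rules:
        if pattern.search(url):
            return allowed
    return False


# Hypothetical rule set, for illustration only.
RULES = load_rules(r"""
# reject common static assets
-\.(gif|jpg|png|css|js)$
# accept everything else on one site
+^https?://example\.com/
""")

print(accept("http://example.com/a.html", RULES))    # prints True
print(accept("http://example.com/logo.png", RULES))  # prints False
print(accept("http://other.org/", RULES))            # prints False
```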
SPARKLER #4: JavaScript Rendering
● JavaScript execution* has first-class support
● Distributable on Spark Cluster without pain
○ Pure JVM based JavaScript engine
● This is an implementation of FetchFunction
● FetchFunction
○ Stream<URL> → Stream<Content>
○ Note: URLs are grouped by host
○ It preserves cookies and reuses sessions for each iteration
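The `Stream<URL> → Stream<Content>` contract can be sketched as a Python generator. Everything here is an assumption for illustration: `Content`, `Session`, `fetch_function` and the fake fetcher are hypothetical stand-ins, and a real implementation would plug in an HTTP client or a JavaScript-capable browser driver where `fetch_one` is called.

```python
from typing import Callable, Dict, Iterable, Iterator, NamedTuple


class Content(NamedTuple):
    url: str
    status: int
    body: bytes


class Session:
    """Hypothetical per-host session that carries cookies between requests."""

    def __init__(self):
        self.cookies: Dict[str, str] = {}


# Stand-in for a real HTTP client or headless-browser call.
Fetcher = Callable[[str, Session], Content]


def fetch_function(urls: Iterable[str], fetch_one: Fetcher) -> Iterator[Content]:
    """Map a stream of URLs (all from one host) to a stream of fetched content."""
    session = Session()        # created once, reused for every URL in the stream
    for url in urls:           # lazy: the next URL is fetched only when asked for
        yield fetch_one(url, session)


def fake_fetch(url: str, session: Session) -> Content:
    """Fake fetcher that counts visits in a cookie to show session reuse."""
    visits = int(session.cookies.get("visits", "0")) + 1
    session.cookies["visits"] = str(visits)
    return Content(url, 200, f"visit {visits}".encode())


pages = list(fetch_function(["http://example.com/a", "http://example.com/b"], fake_fetch))
print([p.body for p in pages])  # prints [b'visit 1', b'visit 2']
```

The second page sees the cookie set while fetching the first, which is the effect of keeping one session per host-grouped stream.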
Thanks to: Madhav Sharan, Member of USC IRDS
* JBrowserDriver by MachinePublishers
SPARKLER #5: Output in Kafka Streams
● The crawler is sometimes the input for applications that do deeper analysis
○ Can't fit all of that deeper analysis into the crawler
● Integration with such applications is made easy via queues
● We chose Apache Kafka
○ Suits our need
■ Distributable, Scalable, Fault Tolerant
● FIXME: Larger messages such as Videos
● This is optional; the default output goes to a shared file system (such as HDFS), compatible with Nutch
Thanks to: Rahul Palamuttam (MS CS @ Stanford University; Intern @ NASA JPL)
SPARKLER #6: Tika, the universal parser
● Apache Tika
○ Is a toolkit of parsers
○ Detects and extracts metadata, text, and URLs
○ Over a thousand different file types
● Main application is to discover outgoing links
● The default implementation of our ParseFunction
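The role of a parse step, turning fetched content into outgoing links for the next crawl round, can be sketched with Python's standard library. This is not Tika and handles only HTML anchors; `LinkExtractor` and `parse_outlinks` are hypothetical names chosen for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def parse_outlinks(base_url, html):
    """Return the outgoing links found in an HTML document, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


page = '<a href="/about">About</a> <a href="next.html">Next</a> <a name="top">Top</a>'
print(parse_outlinks("http://example.com/dir/page.html", page))
# prints ['http://example.com/about', 'http://example.com/dir/next.html']
```

A general-purpose parser toolkit earns its place here because crawled content is not only HTML; the same outlink-discovery step has to work for PDFs, office documents and other formats.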
SPARKLER #7: Visual Analytics
● Charts and graphs provide a nice summary of the crawl job
● Real time analytics
● Example:
○ Distribution of URLs across hosts/domains
○ Temporal activities
○ Status reports
● Customizable in real time
● Using Banana Dashboard from Lucidworks
● Sparkler has a sub component named sparkler-ui
Thanks to: Manish Dwibedy (MS CS, University of Southern California)
SPARKLER #Next: what’s coming?
● Interactive UI
● More plugins
● Scoring Crawled Pages
● Focussed Crawling
● Crawl Graph Analysis
● Domain Discovery (another research challenge)
● Other useful plugins from Nutch
● Detailed documentation and tutorials on wiki
HOW FAST IT RUNS - Comparison with Nutch
● Crawl Iterations : 5
● Fetch Delay : 1 sec
Nutch Configuration
● Version : 1.12
● topN : 50,000
● Fetcher Thread : 1
Hadoop Configuration
● Version : 2.6.0-cdh5.8.2
● Slaves : 2
● Memory : 8G (Map), 16G (Reduce)
● 22 Mappers, 11 Reducers
Sparkler Configuration
● Version : 0.1-SNAPSHOT
● topGroups : 252
● topN : 1000
Spark Configuration
● Version : 1.6.1 with Scala v2.11
● Slaves : 2
● 22 Worker Instances with 210G memory
DIVERSIFIED - Comparison with Nutch
Sparkler Dashboard
SPARKLER IS COMING TO APACHE
Look for the proposal later this week!
● Get involved in our Incubator journey
● Get started: Check out the README and wiki at
https://github.com/USCDataScience/sparkler
Questions?
THANK YOU

