Scalable crawling with
Kafka, scrapy and spark
Maxim Lapan <max.lapan@gmail.com>
About myself
I have been balancing between big data
and machine learning for the last 17
years.
Currently working at Scoutbee on data
processing pipelines, ML models,
architecture, etc.
Wrote the book “Deep Reinforcement
Learning Hands-On”.
HPC cluster deployed in 2004. At 1 TFlop/s it was 4th in the
Russian supercomputer rating; today that is about 1/13 of a single 2080 Ti GPU.
Scoutbee crawling requirements
The core of the Scoutbee business is “procurement
intelligence”, which basically means: find
relevant “suppliers” for a specific user “demand”.
It comes down to answering tricky questions
about companies:
● What does this company manufacture?
● At which locations do they have production
and storage facilities?
● Does the company hold a specific
certification?
● Etc, etc, etc
And there are millions of companies producing
billions of items.
We use lots of datasets to assemble this puzzle into a
single “company profile”, but most of the time
the best source of information is
the company website.
Which simply means: crawling thousands of domains
every week (as quickly as possible, of course).
Custom crawler
● Performance: some components require
domain data to become available in
minutes, not hours
● Efficiency and costs: the data volume is large
and will be processed many times
● Long-tail domains: small manufacturers’
domains are not present in existing
crawls (like Common Crawl)
Surprisingly, all existing crawlers are either:
● fast and expensive, or
● slow and efficient
Motivation for this talk:
● An example of complex distributed system
design that fulfils contradictory requirements
● Kafka and Scrapy: two gems that have
made this possible
● We might consider open-sourcing the
crawler
● Until then, you might write your own.
Architecture overview
● Kafka for storage and message transfer
○ Lots of relatively small data pieces
○ High throughput and low latency
○ Very simple load balancing of tasks
○ Combination of storage and message
passing
● Scrapy for scraping
● Scaling on domains, NOT on URLs
○ We assume domains are relatively small
○ Scaling on domains is much easier (see the
sketch after this list)
● Data processing pipelining
○ As soon as a URL is retrieved and put
into Kafka, we can process it
○ Decreases data latency
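To make the “scale on domains, not on URLs” idea concrete, here is a minimal sketch (not the production code) of a spider worker: every worker joins the same Kafka consumer group, so whole domains are balanced across workers by partition assignment, and each fetched page is pushed to the output topic the moment it arrives. The topic names, message fields and the DomainSpider class are illustrative assumptions.

```python
# Sketch of a domain-level spider worker. Assumes kafka-python and Scrapy;
# topic names, message fields and the spider are illustrative only.
import json

from kafka import KafkaConsumer, KafkaProducer
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DomainSpider(CrawlSpider):
    """Crawls a single domain, never leaving it, and pushes every page to Kafka."""
    name = "domain"
    rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

    def __init__(self, domain, producer, **kwargs):
        self.allowed_domains = [domain]
        self.start_urls = [f"https://{domain}/"]
        self.producer = producer
        super().__init__(**kwargs)

    def parse_page(self, response):
        # Pipelining: downstream consumers can process this page immediately.
        self.producer.send("spider-output", {"url": response.url, "body": response.text})


consumer = KafkaConsumer(
    "spider-input",                      # hypothetical topic, one domain per message
    bootstrap_servers="kafka:9092",
    group_id="spider-workers",           # shared group -> domains are split across workers
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    max_poll_records=1,
)
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# One domain per worker process: Twisted's reactor cannot be restarted, so in
# practice each crawl would run in its own process or container.
message = next(iter(consumer))
process = CrawlerProcess(settings={"LOG_ENABLED": False})
process.crawl(DomainSpider, domain=message.value["domain"], producer=producer)
process.start()
producer.flush()
```

Because a message carries a whole domain rather than a single URL, adding workers scales the crawl with no coordination beyond the consumer group itself.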
Request lifecycle
1. Client sends a domain to the REST API endpoint (a minimal sketch of this step follows the list)
a. A “request ID” is generated (timestamp)
b. On S3 we create a “status file” with
status=running
c. The request is sent to the spider input topic
2. Spider does the crawling
a. Crawl the requested domain
b. Send all the documents to the spider output
topic
c. On crawl completion, send an event to the
events topic
3. HTML converted to plaintext
a. Every HTML document in the spider output topic is
converted into plaintext and sent to the
plaintext output topic
4. Save the data
a. Consume the spider output and plaintext
output topics, writing to the S3 data file
b. On the “end of crawl” event, close the files and
update the “status file” on S3
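A rough sketch of step 1 as a tiny REST endpoint is shown below: it generates a timestamp-based request ID, writes the status file to S3, and produces the request to the spider input topic. The web framework, bucket name and topic name are assumptions for illustration, not the actual service.

```python
# Minimal sketch of the crawl-request endpoint (step 1 of the lifecycle).
# Bucket, topic and route names are hypothetical.
import json
import time

import boto3
from flask import Flask, jsonify, request
from kafka import KafkaProducer

app = Flask(__name__)
s3 = boto3.client("s3")
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

BUCKET = "crawler-data"          # hypothetical bucket name


@app.route("/crawl", methods=["POST"])
def start_crawl():
    domain = request.json["domain"]
    request_id = str(int(time.time() * 1000))        # the "request ID" is just a timestamp

    # The status file lets clients poll the crawl state without touching Kafka.
    s3.put_object(
        Bucket=BUCKET,
        Key=f"status/{request_id}.json",
        Body=json.dumps({"domain": domain, "status": "running"}),
    )

    # Hand the domain over to the spiders via the input topic.
    producer.send("spider-input", {"request_id": request_id, "domain": domain})
    producer.flush()
    return jsonify({"request_id": request_id})
```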
Common Crawl
commoncrawl.org, an open repository of web crawl data:
3.3B URLs from 36M domains. More stats here:
https://commoncrawl.github.io/cc-crawl-statistics/
Updated monthly and available on S3.
It overlaps with ~30% of the domains we need →
a good cost optimisation.
To simplify lookups, we maintain an index of
CC data on S3 for every snapshot (domain → list of
URLs and their locations in the CC data files).
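As a sketch of how such an index can be used: the snippet below looks up a domain in a per-snapshot index file and pulls each record straight out of the public commoncrawl bucket with a ranged read (Common Crawl WARC records are individually gzipped, so one ranged GET per record is enough to decompress it). The index bucket and its JSON layout are hypothetical.

```python
# Sketch of fetching a domain's pages from Common Crawl via our own index.
# INDEX_BUCKET and the index entry format are assumptions; the ranged read of
# an individually-gzipped WARC record matches how CC data is laid out.
import gzip
import json

import boto3

s3 = boto3.client("s3")
INDEX_BUCKET = "our-cc-index"                 # hypothetical index bucket


def cc_records_for_domain(domain: str, snapshot: str):
    """Yield raw WARC records of a domain from a given Common Crawl snapshot."""
    # Hypothetical index entry: [{"warc": "...", "offset": 123, "length": 456}, ...]
    key = f"{snapshot}/{'/'.join(reversed(domain.split('.')))}.json"
    entries = json.loads(
        s3.get_object(Bucket=INDEX_BUCKET, Key=key)["Body"].read()
    )
    for entry in entries:
        # Each record is its own gzip member, so a ranged GET is enough.
        byte_range = f"bytes={entry['offset']}-{entry['offset'] + entry['length'] - 1}"
        chunk = s3.get_object(
            Bucket="commoncrawl", Key=entry["warc"], Range=byte_range
        )["Body"].read()
        yield gzip.decompress(chunk)          # WARC headers + HTTP response + body
```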
Storage
Crawled data will be used many times over a long
period:
● Scalability: millions of domains with
billions of URLs should be easy to access
● Store everything: every header of every
request is stored
● Access performance: we need to retrieve
the data quickly:
○ Sequential read of a domain’s documents
○ Retrieval of an individual document
○ Reading only the metadata of documents
Data is organized in inverse-domain order:
● blog.scoutbee.com → com/scoutbee/blog
For the data format we reused Common Crawl’s:
● WARC for storing requests and
responses
● Metafile with technical information + length
and offset of the compressed chunk in the
data file (see the sketch after this list)
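Below is a minimal sketch of the write side, assuming a metafile format and file names that are purely illustrative: each document becomes its own gzipped WARC record (written with warcio), and the metafile stores the record’s offset and length so that a single document, or only the metadata, can later be read without touching the rest of the data file.

```python
# Sketch of writing a domain's documents as a gzipped WARC file plus a metafile.
# File names and the metafile layout are illustrative assumptions.
import json
from io import BytesIO

from warcio.statusandheaders import StatusAndHeaders
from warcio.warcwriter import WARCWriter


def inverse_domain(domain: str) -> str:
    """blog.scoutbee.com -> com/scoutbee/blog"""
    return "/".join(reversed(domain.split(".")))


def write_domain(domain: str, pages) -> None:
    # In production the files would live under the inverse-domain prefix on S3,
    # e.g. com/scoutbee/blog/data.warc.gz; here we simply write locally.
    meta = {"s3_prefix": inverse_domain(domain), "urls": {}}
    with open("data.warc.gz", "wb") as out:
        writer = WARCWriter(out, gzip=True)
        for url, body in pages:                     # pages: iterable of (url, bytes)
            offset = out.tell()
            record = writer.create_warc_record(
                url,
                "response",
                payload=BytesIO(body),
                http_headers=StatusAndHeaders(
                    "200 OK", [("Content-Type", "text/html")], protocol="HTTP/1.1"
                ),
            )
            writer.write_record(record)             # each record is its own gzip member
            meta["urls"][url] = {"offset": offset, "length": out.tell() - offset}
    with open("meta.json", "w") as mf:
        json.dump(meta, mf)
```

With the offset and length in the metafile, the “individual document” and “metadata only” access patterns become a single ranged S3 GET and a small JSON read, respectively.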
Some numbers
Production crawl, only 3-5 pages per domain:
● 70GB, 2.5M files
● 1M domains
In-depth crawl, up to 10K pages:
● 270K domains, 170M urls
● 1M files, 3.1TB data
● 16TB of HTML text (2.9TB compressed)
(can be processed in about 1 hour on a 64-core
Spark cluster)
Data access and ML pipelines
Several levels of data access:
● Low-level, full control: work with metadata
and data, read directly from S3
● Higher-level wrappers: retrieve documents
by a list of domains
● Specialized Spark utils: automatically
parallelize document transformations in a
generic way (see the sketch below)
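A rough sketch of what the “specialized Spark utils” level could look like: the list of domains is parallelized, each partition reads its domains’ documents via the lower-level wrapper, and a user-supplied function is applied to every document. The helper read_domain_documents() is a hypothetical stand-in for that wrapper.

```python
# Sketch of generic, domain-parallel document transformations with PySpark.
# read_domain_documents() is a placeholder for the lower-level S3 reader.
from pyspark.sql import SparkSession


def read_domain_documents(domain):
    """Placeholder for the lower-level wrapper: yields (url, body) pairs for a domain."""
    # In the real system this would do metafile-driven ranged S3 reads.
    yield from []


def transform_documents(spark: SparkSession, domains, transform, num_partitions=64):
    """Apply transform(url, body) to every document of every domain in parallel."""
    sc = spark.sparkContext
    return (
        sc.parallelize(domains, num_partitions)
          .flatMap(read_domain_documents)    # -> (url, body) pairs
          .map(lambda doc: transform(*doc))
    )


# Usage example: token count of each page's plaintext.
# spark = SparkSession.builder.appName("crawl-transform").getOrCreate()
# rdd = transform_documents(spark, ["scoutbee.com", "example.com"],
#                           lambda url, body: (url, len(body.split())))
# rdd.saveAsTextFile("token-counts/")
```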
Key takeaways
● Don’t be afraid of writing your own systems. Sometimes the solution just doesn’t
exist (but it’s worth checking first!)
● Not that often, but requirements which look contradictory (high throughput ↔
low latency) can sometimes be unified.
● Kafka is amazing
Thanks for your attention!