Scalable crawling with
Kafka, scrapy and spark
Maxim Lapan <max.lapan@gmail.com>
About myself
I have been balancing between big data
and machine learning for the last 17
years.
Currently working at Scoutbee on data
processing pipelines, ML models,
architecture, etc.
Wrote the book “Deep Reinforcement
Learning Hands-On”.
HPC cluster deployed in 2004. At 1 TFlop/s it was 4th in the
Russian supercomputer rating; today that is about 1/13 of a single 2080 Ti GPU.
Scoutbee crawling requirements
The core of the Scoutbee business is “procurement
intelligence”, which basically means: find
relevant “suppliers” for a specific user “demand”.
It comes down to answering tricky questions
about companies:
● What does this company manufacture?
● At which locations do they have production
and storage facilities?
● Does the company hold a specific
certification?
● Etc, etc, etc
And there are millions of companies producing
billions of items.
We use lots of datasets to assemble this puzzle into a
single “company profile”, but most of the time
the best source of information is
the company website.
Which simply means: crawling thousands of domains
every week (as quickly as possible, of course).
Custom crawler
● Performance: some components require
domain data to become available in
minutes, not hours
● Efficiency and costs: the data volume is large
and will be processed many times
● Long-tail domains: small manufacturers’
domains are not present in existing
crawls (like Common Crawl)
Surprisingly, all existing crawlers are either:
● fast and expensive, or
● slow and efficient
Motivation for this talk:
● An example of complex distributed system
design that fulfils contradictory requirements
● Kafka and Scrapy: two gems that have
made this possible
● We might consider open-sourcing the
crawler
● Until then, you might write your own.
Architecture overview
● Kafka for storage and message transfer
○ Lots of relatively small data pieces
○ High throughput and low latency
○ Very simple load balancing of tasks
○ Combination of storage and message
passing
● Scrapy for scraping
● Scaling on domains, NOT on URLs
○ We assume domains are relatively small
○ Scaling on domains is much easier (see the
sketch after this list)
● Data processing pipelining
○ As soon as a URL is retrieved and put
into Kafka, we can process it
○ Decreases data latency
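To make the “scale on domains, not on URLs” idea concrete, here is a minimal sketch (not the production code) of a spider worker: every worker joins the same Kafka consumer group, so whole domains are balanced across workers by partition assignment, and each fetched page is pushed to the output topic the moment it arrives. The topic names, message fields and the DomainSpider class are illustrative assumptions.

```python
# Sketch of a domain-level spider worker. Assumes kafka-python and Scrapy;
# topic names, message fields and the spider are illustrative only.
import json

from kafka import KafkaConsumer, KafkaProducer
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DomainSpider(CrawlSpider):
    """Crawls a single domain, never leaving it, and pushes every page to Kafka."""
    name = "domain"
    rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

    def __init__(self, domain, producer, **kwargs):
        self.allowed_domains = [domain]
        self.start_urls = [f"https://{domain}/"]
        self.producer = producer
        super().__init__(**kwargs)

    def parse_page(self, response):
        # Pipelining: downstream consumers can process this page immediately.
        self.producer.send("spider-output", {"url": response.url, "body": response.text})


consumer = KafkaConsumer(
    "spider-input",                      # hypothetical topic, one domain per message
    bootstrap_servers="kafka:9092",
    group_id="spider-workers",           # shared group -> domains are split across workers
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    max_poll_records=1,
)
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# One domain per worker process: Twisted's reactor cannot be restarted, so in
# practice each crawl would run in its own process or container.
message = next(iter(consumer))
process = CrawlerProcess(settings={"LOG_ENABLED": False})
process.crawl(DomainSpider, domain=message.value["domain"], producer=producer)
process.start()
producer.flush()
```

Because a message carries a whole domain rather than a single URL, adding workers scales the crawl with no coordination beyond the consumer group itself.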
Request lifecycle
1. Client sends a domain to the REST API endpoint (a minimal sketch of this step follows the list)
a. A “request ID” is generated (timestamp)
b. On S3 we create a “status file” with
status=running
c. The request is sent to the spider input topic
2. Spider does the crawling
a. Crawl the requested domain
b. Send all the documents to the spider output
topic
c. On crawl completion, send an event to the
events topic
3. HTML converted to plaintext
a. Every HTML document in the spider output topic is
converted into plaintext and sent to the
plaintext output topic
4. Save the data
a. Consume the spider output and plaintext
output topics, writing to the S3 data file
b. On the “end of crawl” event, close the files and
update the “status file” on S3
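A rough sketch of step 1 as a tiny REST endpoint is shown below: it generates a timestamp-based request ID, writes the status file to S3, and produces the request to the spider input topic. The web framework, bucket name and topic name are assumptions for illustration, not the actual service.

```python
# Minimal sketch of the crawl-request endpoint (step 1 of the lifecycle).
# Bucket, topic and route names are hypothetical.
import json
import time

import boto3
from flask import Flask, jsonify, request
from kafka import KafkaProducer

app = Flask(__name__)
s3 = boto3.client("s3")
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

BUCKET = "crawler-data"          # hypothetical bucket name


@app.route("/crawl", methods=["POST"])
def start_crawl():
    domain = request.json["domain"]
    request_id = str(int(time.time() * 1000))        # the "request ID" is just a timestamp

    # The status file lets clients poll the crawl state without touching Kafka.
    s3.put_object(
        Bucket=BUCKET,
        Key=f"status/{request_id}.json",
        Body=json.dumps({"domain": domain, "status": "running"}),
    )

    # Hand the domain over to the spiders via the input topic.
    producer.send("spider-input", {"request_id": request_id, "domain": domain})
    producer.flush()
    return jsonify({"request_id": request_id})
```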
Common Crawl
commoncrawl.org, an open repository of web crawl data:
3.3B URLs from 36M domains. More stats here:
https://commoncrawl.github.io/cc-crawl-statistics/
Updated monthly and available on S3.
It overlaps with ~30% of the domains we need →
a good cost optimisation.
To simplify lookups, we maintain an index of
CC data on S3 for every snapshot (domain → list of
URLs and their locations in the CC data files).
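As a sketch of how such an index can be used: the snippet below looks up a domain in a per-snapshot index file and pulls each record straight out of the public commoncrawl bucket with a ranged read (Common Crawl WARC records are individually gzipped, so one ranged GET per record is enough to decompress it). The index bucket and its JSON layout are hypothetical.

```python
# Sketch of fetching a domain's pages from Common Crawl via our own index.
# INDEX_BUCKET and the index entry format are assumptions; the ranged read of
# an individually-gzipped WARC record matches how CC data is laid out.
import gzip
import json

import boto3

s3 = boto3.client("s3")
INDEX_BUCKET = "our-cc-index"                 # hypothetical index bucket


def cc_records_for_domain(domain: str, snapshot: str):
    """Yield raw WARC records of a domain from a given Common Crawl snapshot."""
    # Hypothetical index entry: [{"warc": "...", "offset": 123, "length": 456}, ...]
    key = f"{snapshot}/{'/'.join(reversed(domain.split('.')))}.json"
    entries = json.loads(
        s3.get_object(Bucket=INDEX_BUCKET, Key=key)["Body"].read()
    )
    for entry in entries:
        # Each record is its own gzip member, so a ranged GET is enough.
        byte_range = f"bytes={entry['offset']}-{entry['offset'] + entry['length'] - 1}"
        chunk = s3.get_object(
            Bucket="commoncrawl", Key=entry["warc"], Range=byte_range
        )["Body"].read()
        yield gzip.decompress(chunk)          # WARC headers + HTTP response + body
```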
Storage
Crawled data will be used many times over a long
period:
● Scalability: millions of domains with
billions of URLs should be easy to access
● Store everything: every header of every
request is stored
● Access performance: we need to retrieve
the data quickly:
○ Sequential read of a domain’s documents
○ Retrieval of an individual document
○ Reading only the metadata of documents
Data is organized in inverse-domain order:
● blog.scoutbee.com → com/scoutbee/blog
For the data format we reused Common Crawl’s:
● WARC for storing requests and
responses
● Metafile with technical information + length
and offset of the compressed chunk in the
data file (see the sketch after this list)
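Below is a minimal sketch of the write side, assuming a metafile format and file names that are purely illustrative: each document becomes its own gzipped WARC record (written with warcio), and the metafile stores the record’s offset and length so that a single document, or only the metadata, can later be read without touching the rest of the data file.

```python
# Sketch of writing a domain's documents as a gzipped WARC file plus a metafile.
# File names and the metafile layout are illustrative assumptions.
import json
from io import BytesIO

from warcio.statusandheaders import StatusAndHeaders
from warcio.warcwriter import WARCWriter


def inverse_domain(domain: str) -> str:
    """blog.scoutbee.com -> com/scoutbee/blog"""
    return "/".join(reversed(domain.split(".")))


def write_domain(domain: str, pages) -> None:
    # In production the files would live under the inverse-domain prefix on S3,
    # e.g. com/scoutbee/blog/data.warc.gz; here we simply write locally.
    meta = {"s3_prefix": inverse_domain(domain), "urls": {}}
    with open("data.warc.gz", "wb") as out:
        writer = WARCWriter(out, gzip=True)
        for url, body in pages:                     # pages: iterable of (url, bytes)
            offset = out.tell()
            record = writer.create_warc_record(
                url,
                "response",
                payload=BytesIO(body),
                http_headers=StatusAndHeaders(
                    "200 OK", [("Content-Type", "text/html")], protocol="HTTP/1.1"
                ),
            )
            writer.write_record(record)             # each record is its own gzip member
            meta["urls"][url] = {"offset": offset, "length": out.tell() - offset}
    with open("meta.json", "w") as mf:
        json.dump(meta, mf)
```

With the offset and length in the metafile, the “individual document” and “metadata only” access patterns become a single ranged S3 GET and a small JSON read, respectively.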
Some numbers
Production crawl, only 3-5 pages per domain:
● 70GB, 2.5M files
● 1M domains
In-depth crawl, up to 10K pages:
● 270K domains, 170M urls
● 1M files, 3.1TB data
● 16TB of HTML text (2.9TB compressed)
(can be processed in about 1 hour on a 64-core
Spark cluster)
Data access and ML pipelines
Several levels of data access:
● Low-level, full control: work with metadata
and data, read directly from S3
● Higher-level wrappers: retrieve documents
by a list of domains
● Specialized Spark utils: automatically
parallelize document transformations in a
generic way (see the sketch below)
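A rough sketch of what the “specialized Spark utils” level could look like: the list of domains is parallelized, each partition reads its domains’ documents via the lower-level wrapper, and a user-supplied function is applied to every document. The helper read_domain_documents() is a hypothetical stand-in for that wrapper.

```python
# Sketch of generic, domain-parallel document transformations with PySpark.
# read_domain_documents() is a placeholder for the lower-level S3 reader.
from pyspark.sql import SparkSession


def read_domain_documents(domain):
    """Placeholder for the lower-level wrapper: yields (url, body) pairs for a domain."""
    # In the real system this would do metafile-driven ranged S3 reads.
    yield from []


def transform_documents(spark: SparkSession, domains, transform, num_partitions=64):
    """Apply transform(url, body) to every document of every domain in parallel."""
    sc = spark.sparkContext
    return (
        sc.parallelize(domains, num_partitions)
          .flatMap(read_domain_documents)    # -> (url, body) pairs
          .map(lambda doc: transform(*doc))
    )


# Usage example: token count of each page's plaintext.
# spark = SparkSession.builder.appName("crawl-transform").getOrCreate()
# rdd = transform_documents(spark, ["scoutbee.com", "example.com"],
#                           lambda url, body: (url, len(body.split())))
# rdd.saveAsTextFile("token-counts/")
```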
Key takeaways
● Don’t be afraid of writing your own systems. Sometimes the solution just doesn’t
exist (but it’s worth checking first!)
● Not that often, but requirements which look contradictory (high throughput ↔
low latency) can sometimes be unified.
● Kafka is amazing
Thanks for your attention!