This document describes the architecture of a scalable crawling system used by Scoutbee to crawl thousands of company websites weekly. The system uses Kafka to store and transfer crawled data with high throughput and low latency. Scrapy handles distributed crawling, scaled out at the domain level. Spark runs the data processing pipelines, reducing data latency. Crawled data is stored long-term in S3 for efficient retrieval and reuse in machine learning pipelines. Together, these components satisfy the competing requirements of crawling quickly, inexpensively, and at scale.
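As an illustration of how the crawling and messaging pieces might fit together, here is a minimal sketch of a Scrapy item pipeline that publishes each crawled item to a Kafka topic for downstream Spark consumers. This is not Scoutbee's actual code: the broker address, the `crawled-pages` topic name, and keying messages by domain are assumptions made for the example.

```python
# Minimal sketch, assuming kafka-python is installed and a broker is
# reachable at localhost:9092. Topic name "crawled-pages" and the
# key-by-domain scheme are illustrative choices, not details from the source.
import json

from kafka import KafkaProducer


class KafkaExportPipeline:
    """Scrapy item pipeline that pushes serialized items to Kafka."""

    def open_spider(self, spider):
        # One producer per spider process; value_serializer turns item dicts
        # into JSON bytes before they are sent to the broker.
        self.producer = KafkaProducer(
            bootstrap_servers="localhost:9092",
            value_serializer=lambda item: json.dumps(item).encode("utf-8"),
        )

    def process_item(self, item, spider):
        # Key by the spider's first allowed domain (a standard Scrapy spider
        # attribute) so all pages from one site land in the same partition.
        domain = getattr(spider, "allowed_domains", ["unknown"])[0]
        self.producer.send(
            "crawled-pages",
            key=domain.encode("utf-8"),
            value=dict(item),
        )
        return item

    def close_spider(self, spider):
        # Flush any buffered messages before the spider shuts down.
        self.producer.flush()
        self.producer.close()
```

Such a pipeline would be enabled through Scrapy's `ITEM_PIPELINES` setting; keying messages by domain keeps each site's pages in a single Kafka partition, which mirrors the domain-level scaling described above.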