This document discusses using Spark Streaming to process and normalize log streams in real time, scaling from 100k events per second to over 1 million per second. It proposes using RSyslog to collect logs from multiple sources into Kafka, then using Spark Streaming to apply regex matching and field extraction, normalize the data into structured JSON, and write it to additional Kafka topics for storage and further processing. The solution processed 3 billion events per day with less than 20 seconds of end-to-end latency at peak throughput.
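As a rough illustration of the Kafka-in, regex-normalize, Kafka-out flow described above, the sketch below uses the Spark Streaming (DStream) API with the spark-streaming-kafka-0-10 integration. The topic names (`raw-logs`, `normalized-logs`), broker address, batch interval, and the syslog-style regex are all assumptions for illustration, not details from the original document.

```scala
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.{StringDeserializer, StringSerializer}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object LogNormalizer {
  // Hypothetical syslog-style pattern: "<timestamp> <host> <app>: <message>"
  private val LogPattern = """^(\S+ \S+ \S+) (\S+) (\S+): (.*)$""".r

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("log-normalizer")
    val ssc  = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches (assumed)

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "kafka:9092", // assumed broker address
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "log-normalizer",
      "auto.offset.reset"  -> "latest"
    )

    // Consume raw log lines collected by RSyslog into the (assumed) "raw-logs" topic.
    val raw = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("raw-logs"), kafkaParams))

    // Regex-match each line and emit a structured JSON record; drop lines that don't match.
    val normalized = raw.map(_.value).flatMap {
      case LogPattern(ts, host, app, msg) =>
        Some(s"""{"timestamp":"$ts","host":"$host","app":"$app","message":"$msg"}""")
      case _ => None
    }

    // Write each partition to a downstream Kafka topic using a per-partition producer.
    normalized.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val props = new java.util.Properties()
        props.put("bootstrap.servers", "kafka:9092")
        props.put("key.serializer", classOf[StringSerializer].getName)
        props.put("value.serializer", classOf[StringSerializer].getName)
        val producer = new KafkaProducer[String, String](props)
        records.foreach(json => producer.send(new ProducerRecord("normalized-logs", json)))
        producer.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

In a production variant, the per-partition producer would typically be pooled or broadcast rather than created per batch, and unmatched lines might be routed to a dead-letter topic instead of being dropped.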