© Cloudera, Inc. All rights reserved.
Building Effective
Near-Real-Time Analytics with
Spark Streaming and Kudu
Jeremy Beard | Senior Solutions Architect, Cloudera
May 2016
About me
• Jeremy Beard
• Senior Solutions Architect at Cloudera
• 3.5 years at Cloudera
• 6 years data warehousing before that
• jeremy@cloudera.com
Agenda
• What do we mean by near-real-time analytics?
• Which components can we use from the Cloudera stack?
• How do these components fit together?
• How do we implement the Spark Streaming to Kudu path?
• What if I don’t want to write all that code?
Defining near-real-time analytics (for this talk)
• Ability to analyze events happening right now in the real world
• And in the context of all the history that has gone before it
• By “near” we mean this is human scale (seconds), not machine scale (ns/µs)
• Closer to real time is possible in CDH, but is more custom development
• SQL is the lingua franca of analytics
• Millions of people know it or the tools that run on it
• Say what you want to get, not how you want to get it
Components from the Cloudera stack
• Four components come together to make this possible
• Apache Kafka
• Apache Spark
• Apache Kudu (incubating)
• Apache Impala (incubating)
• First we’ll discuss what they are, then how they fit together
Apache Kafka
• Publish-subscribe system
• Publish messages into topics
• Subscribe to messages arriving in topics
• Very high throughput
• Very low latency
• Distributed for fault tolerance and scale
• Supported by Cloudera
Apache Spark
• Modern distributed data processing engine
• Heavy utilizer of memory for speed
• Rich and intuitive API
• Spark Streaming
• Module for running a continuous loop of Spark transformations
• Each iteration is a micro-batch, usually in the single-digit seconds
• Supported by Cloudera (with some exceptions for experimental features)
Apache Kudu (incubating)
• New open-source columnar storage layer
• Data model of tables with a finite number of typed columns
• Very fast random I/O
• Very fast scans
• Developed from scratch in C++
• Client APIs for C++, Java, Python
• First developed in Cloudera, now at Apache Software Foundation
• Currently in beta, not yet supported by Cloudera, not production ready
Apache Impala (incubating)
• Open-source SQL query engine
• Built for one purpose: really fast analytics SQL
• High concurrency
• Queries data stored in HDFS, HBase, and now Kudu
• Standard JDBC/ODBC interface for SQL editors and BI tools
• Uses JIT query compilation and modern CPU instructions
• First developed in Cloudera, now at Apache Software Foundation
• Fully supported by Cloudera and in production at many of our customers
Near-real-time analytics on the Cloudera stack
Implementing Spark Streaming to Kudu
• We define what we want Spark to do each micro-batch
• Spark then takes care of running the micro-batches for us
• We have limited time to process a micro-batch
• Storage lookups must be key lookups or very short scans
• A lot of repetitive boilerplate code to get up and running
Typical stages of a Spark Streaming to Kudu pipeline
• Sourcing from a queue of data
• Translating into a structured format
• Deriving the storage records
• Planning how to update the storage layer
• Applying the planned mutations to the storage layer
Queue sourcing
• Each micro-batch we first have to bring in data to process
• This is near-real-time, so we expect a queue of messages waiting to be processed
• Kafka fits this requirement very well
• Native no-data-loss integration with Spark Streaming
• Partitioned topics automatically parallelize across Spark executors
• Fault recovery is simple because Kafka does not drop consumed messages
• In Spark Streaming this is the creation of a DStream object
• For Kafka use KafkaUtils.createDirectStream()
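The DStream model above can be sketched without Spark or Kafka: each micro-batch drains whatever is waiting on the queue and hands it to a batch function. A minimal Python sketch, with an in-memory deque standing in for a Kafka topic (all names are illustrative):

```python
from collections import deque

def run_micro_batches(queue, process_batch, num_batches):
    """Drain the queue once per micro-batch, like one DStream iteration."""
    results = []
    for _ in range(num_batches):
        batch = []
        while queue:  # take everything that has arrived so far
            batch.append(queue.popleft())
        results.append(process_batch(batch))
    return results

# Three messages "arrive" before the first micro-batch fires
topic = deque(["msg1", "msg2", "msg3"])
print(run_micro_batches(topic, len, num_batches=2))  # [3, 0]
```

In the real pipeline, `KafkaUtils.createDirectStream()` produces this stream of micro-batches for you, with offsets tracked so no data is lost on failure.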
Translation
• Arriving messages could be in any format (XML, CSV, binary, proprietary, etc.)
• We need them in a common structured record format to effectively transform them
• When messages arrive, translate them first
• Avro’s GenericRecord is a widely adopted in-memory record format
• In Spark Streaming job use DStream.map() to define translation
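A translator is typically just a parsing function applied per message. A minimal sketch for delimited text, using a plain dict in place of an Avro GenericRecord (field names are illustrative):

```python
def translate_delimited(message, field_names, delimiter=","):
    """Translate one raw delimited message into a structured record."""
    values = message.split(delimiter)
    if len(values) != len(field_names):
        raise ValueError("field count mismatch: %r" % message)
    return dict(zip(field_names, values))

record = translate_delimited("42,2016-05-01,NYC", ["id", "date", "city"])
print(record["city"])  # NYC
```

In the Spark Streaming job this function would be the argument to `DStream.map()`, so every arriving message is translated before any further transformation.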
Derivation
• We need to create the records that we want to write to the storage layer
• Often not identical to the arriving records
• Derive the storage records from the arriving records
• Spark SQL can define the transformation, but much more plumbing code is required
• May also require deriving from existing records in the storage layer
• Enrichment using reference data is a common example
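Enrichment is usually a keyed lookup against reference data, merged into the arriving record. A minimal Spark-free sketch (the `sensor_id`/`location` fields are hypothetical):

```python
def derive_storage_records(arriving, reference_by_key):
    """Derive storage records by enriching arriving records with reference data."""
    derived = []
    for rec in arriving:
        ref = reference_by_key.get(rec["sensor_id"], {})
        storage = dict(rec)  # copy, so the arriving record is untouched
        storage["location"] = ref.get("location", "unknown")
        derived.append(storage)
    return derived

arriving = [{"sensor_id": "s1", "reading": 20.5}]
reference = {"s1": {"location": "Gate 4"}}
print(derive_storage_records(arriving, reference))
```

The key point is that the reference lookup must itself be a fast key lookup (e.g. against Kudu), or the micro-batch will not finish in time.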
Planning
• With derived storage records in hand we need to plan the storage mutations
• When existing records are never updated it is straightforward
• Just plan inserts
• When updates for a key can occur it is a bit harder
• Plan insert if key does not exist, plan update if key does exist
• When all versions of a key are kept it can be a lot more complicated
• Insert arriving record, update metadata on existing records (e.g. end date)
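The upsert and history-tracking cases above can be sketched in plain Python (Spark-free; the `key` and `end_date` field names are illustrative):

```python
def plan_upserts(derived, existing_keys):
    """Plan INSERT when the key is new, UPDATE when it already exists."""
    return [("UPDATE" if rec["key"] in existing_keys else "INSERT", rec)
            for rec in derived]

def plan_with_history(rec, current_version, event_time):
    """All versions kept: end-date the current version, insert the new one."""
    plans = []
    if current_version is not None:
        plans.append(("UPDATE", dict(current_version, end_date=event_time)))
    plans.append(("INSERT", dict(rec, end_date=None)))
    return plans

print(plan_upserts([{"key": 1}, {"key": 2}], existing_keys={1}))
```

Note the planner only decides what mutations to make; actually applying them is the next stage, so the two concerns stay testable in isolation.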
Storing
• With the planned mutations for the micro-batch, we apply them to the storage
• For Kudu this requires using the Kudu client Java API
• Applied mutations are immediately visible to Impala users
• Use RDD.foreachPartition() so that you can open one Kudu connection per JVM
• Alternatively, write to Solr; a good option where SQL is not required
• Alternatively, write to HBase, but its scans are too slow for analytics queries
• Alternatively, write to HDFS, but it does not support updates or deletes
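The storing stage just replays the planned mutations against the storage layer. A minimal sketch where a dict stands in for a Kudu table (the real pipeline would apply operations through the Kudu Java client instead):

```python
def apply_mutations(table, plans):
    """Apply planned (operation, record) mutations to the storage layer."""
    for op, rec in plans:
        if op == "INSERT":
            table[rec["key"]] = rec
        elif op == "UPDATE":
            table[rec["key"]].update(rec)
        else:
            raise ValueError("unknown operation: %s" % op)
    return table

table = {}
apply_mutations(table, [("INSERT", {"key": 1, "v": 1})])
apply_mutations(table, [("UPDATE", {"key": 1, "v": 2})])
print(table[1]["v"])  # 2
```

With Kudu, each mutation becomes an operation applied through a session object, and once flushed the rows are immediately visible to Impala queries.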
Performance considerations
• Repartition the arriving records across all the cores of the Spark job
• If using Spark SQL, lower the number of shuffle partitions from default 200
• Use Spark Streaming backpressure to optimize micro-batch size
• If using Kafka, also use spark.streaming.kafka.maxRatePerPartition
• Experiment with micro-batch lengths to balance latency vs. throughput
• Ensure storage lookup predicates are at least by key, or face full table scans
• Avoid connecting and disconnecting from storage every micro-batch
• Singleton pattern can help to keep a connection per JVM
• Avoid instantiating objects for each record where they could be reused
• Batch mutations for higher throughput
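The singleton-connection point is worth a concrete sketch: open the storage connection lazily, once per JVM (here, per Python process), and reuse it across micro-batches rather than reconnecting each time:

```python
_connection = None  # one shared connection per process

def get_connection(factory):
    """Lazily open the storage connection once and reuse it thereafter."""
    global _connection
    if _connection is None:
        _connection = factory()
    return _connection

opens = []
c1 = get_connection(lambda: opens.append("open") or "conn")
c2 = get_connection(lambda: opens.append("open") or "conn")
print(c1 is c2, opens)  # True ['open']
```

In Scala/Java the same idea is usually a lazily initialized static field inside the closure passed to `foreachPartition`, so each executor JVM connects exactly once.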
New on Cloudera Labs: Envelope
• A pre-developed Spark Streaming application that implements these stages
• Pipelines are defined as simple configuration using a properties file
• Custom implementations of stages can be referenced in the configuration
• Available on Cloudera Labs (cloudera.com/labs)
• Not supported by Cloudera, not production ready
Envelope built-in functionality
• Queue source for Kafka
• Translators for delimited text, key-value pairs, and binary Avro
• Lookup of existing storage records
• Deriver for Spark SQL transformations
• Planners for appends, upserts, and history tracking
• Storage system for Kudu
• Support for many of the described performance considerations
• All stage implementations are also pluggable with user-provided classes
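A pipeline definition along these lines might look like the following properties file. The property names here are hypothetical, chosen only to illustrate the shape of the configuration; the real keys are documented with Envelope on Cloudera Labs.

```properties
# Hypothetical Envelope-style pipeline configuration (illustrative names only)
source = kafka
source.brokers = broker1:9092
translator = delimited
translator.delimiter = ,
deriver = sql
deriver.query = SELECT key, reading, location FROM stream JOIN reference USING (sensor_id)
planner = upsert
storage = kudu
storage.table = traffic_conditions
```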
Example pipeline: Traffic
Example pipeline: FIX
Thank you
jeremy@cloudera.com