State Management in Structured Streaming
Chandan Prakash
00Copyright 2018 © Qubole
Agenda
● Structured Streaming : Brief Intro
● Types of Stream Processing : Stateless vs Stateful
● State in Stream Processing
● State Store in Stream Processing
● State Management in Old Spark Streaming
● State Management in Structured Streaming
● Demo with Code Example
● Quiz, Food for Thought
What does this picture represent ?
Batch Processing vs. Stream Processing
Structured Streaming : Brief Intro
● Built on Spark SQL engine
● Illusion : the stream of incoming data is treated as an unbounded Input Table, the processing logic as a SQL query, and the output of processing as a Results Table
● Internally, the query gets converted into incremental micro-batch processing
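The "unbounded table" idea can be illustrated without Spark. The following is a minimal plain-Python sketch, not Spark code: the function name `run_incremental_query` and the sample data are made up for illustration. It keeps a running Results Table for a query like "count rows per word" and updates it with only the new rows of each micro-batch.

```python
from collections import Counter


def run_incremental_query(micro_batches):
    """Yield the Results Table after each micro-batch.

    The query being modelled is: SELECT word, count(*) GROUP BY word
    over an input table that keeps growing.
    """
    result_table = Counter()
    for batch in micro_batches:
        # Only the newly arrived rows are processed in each trigger.
        for word in batch:
            result_table[word] += 1
        yield dict(result_table)


# Hypothetical input: three micro-batches of words.
batches = [["spark", "flink"], ["spark"], ["samza", "spark"]]
snapshots = list(run_incremental_query(batches))
```

The last snapshot equals what a batch job over all rows seen so far would produce, which is the illusion the slide describes.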
Structured Streaming Query Example
Types of Stream Processing
● Stateless Streaming
○ Processing of every record is independent
○ Operations like map, filter
● Stateful Streaming
○ Processing of a record depends on previous records
○ Operations like counting records per distinct key, deduplicating records
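A small plain-Python sketch of the difference, under the assumption that records are simple strings. `stateless_pipeline` and `Deduplicator` are hypothetical names used only here.

```python
def stateless_pipeline(records):
    # filter + map: every record is handled on its own,
    # nothing is remembered between records or batches.
    return [r.upper() for r in records if r]


class Deduplicator:
    """Stateful: the output depends on every record seen earlier."""

    def __init__(self):
        self.seen = set()  # this set is the "state"

    def process(self, batch):
        fresh = []
        for record in batch:
            if record not in self.seen:
                self.seen.add(record)
                fresh.append(record)
        return fresh


upper = stateless_pipeline(["a", "", "b"])
dedup = Deduplicator()
first = dedup.process(["x", "y", "x"])
second = dedup.process(["y", "z"])  # "y" was already seen in the first batch
```

The second batch drops "y" only because the first batch was remembered; losing `seen` on a failure would change the output, which is why state needs a reliable store.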
State in Stream Processing
● State of Streaming Progress
○ Metadata of stream processing : offsets
○ Keeps track of how much data has been processed so far
○ Needed for fault tolerance
○ Present in both stateless and stateful processing
● State of Data
○ Intermediate information carried between records
○ Needed by operations like aggregation, deduplication
○ Present in stateful processing
Note: When we say “State”, in general it means the state of data for processing. The other one is called metadata/offsets.
State Store in Streaming
● Reliable place providing read and write of intermediate data (state)
● Can sustain streaming failures and restore processing from the same point
● Options : in-memory (e.g. an in-memory HashMap), file systems, storage systems
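To make "restore processing from the same point" concrete, here is a minimal plain-Python sketch. Everything in it is an assumption for illustration: `InMemoryStateStore` stands in for a reliable store (a real in-memory map would not survive a process crash), the batch id doubles as the progress metadata, and `fail_at_batch` simulates a failure.

```python
import copy


class InMemoryStateStore:
    """Stand-in for a reliable store: keeps one committed state per version."""

    def __init__(self):
        self.committed = {}  # version -> state dict

    def commit(self, version, state):
        self.committed[version] = copy.deepcopy(state)

    def restore(self, version):
        return copy.deepcopy(self.committed.get(version, {}))


def run(source, store, batch_size, start_batch=0, fail_at_batch=None):
    """Count records per key; return the id of the next batch to process."""
    state = store.restore(start_batch)
    batch_id = start_batch
    while batch_id * batch_size < len(source):
        if batch_id == fail_at_batch:
            return batch_id  # simulated crash before this batch is processed
        batch = source[batch_id * batch_size:(batch_id + 1) * batch_size]
        for key in batch:
            state[key] = state.get(key, 0) + 1
        batch_id += 1
        store.commit(batch_id, state)  # state version N = state after N batches
    return batch_id


source = ["a", "b", "a", "c", "a", "b"]
store = InMemoryStateStore()
resume_from = run(source, store, batch_size=2, fail_at_batch=2)
finished_at = run(source, store, batch_size=2, start_batch=resume_from)
final_state = store.restore(finished_at)
```

After the restart the job picks up the committed state for batch 2 and processes only the remaining records, so no record is counted twice.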
State Management in Old (DStream) Spark Streaming
● RDD based Streaming
● Inefficient, flawed design
○ State persisted with offset metadata
○ Complete snapshot persistence every microbatch
○ Tightly coupled, synchronous with Spark RDD tasks
○ No provision for incremental state persistence
○ Processing overhead, bottleneck as state grows
State Management in Structured Streaming
Fundamental shift from Old Spark Streaming
● Decoupled from offsets/metadata checkpointing
● Asynchronous to Spark Tasks/Jobs
● Incremental State persistence
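A rough way to see why incremental persistence matters is to count how many state entries get written under each approach. This is a plain-Python sketch with made-up function names and data, not a measurement of Spark.

```python
def full_snapshot_writes(batches):
    """Entries written when the whole state is persisted every micro-batch."""
    state, written = {}, 0
    for batch in batches:
        for key in batch:
            state[key] = state.get(key, 0) + 1
        written += len(state)  # every key is rewritten, changed or not
    return written


def incremental_writes(batches):
    """Entries written when only the keys touched in a batch are persisted."""
    state, written = {}, 0
    for batch in batches:
        changed = {}
        for key in batch:
            state[key] = state.get(key, 0) + 1
            changed[key] = state[key]
        written += len(changed)  # only the delta for this batch
    return written


batches = [["a", "b", "c"], ["a"], ["d"], ["a", "b"]]
full = full_snapshot_writes(batches)
incremental = incremental_writes(batches)
```

With full snapshots the cost of each batch grows with the total state size; with deltas it grows only with the number of keys the batch touched.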
HDFS backed State Management
1. In-memory HashMap + HDFS
2. Versioned key-value store per partition
3. Versioned delta file per partition
4. Partition task is scheduled on the same executor where the previous state is held
5. Synchronous write to the HashMap and the delta file output stream
6. Asynchronous daemon thread per executor for snapshotting and file purging/deletion in HDFS
7. Only one thread in an executor can write to a delta file, but threads from multiple executors can try to write to the same delta file.
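The versioned delta and snapshot idea from the list above can be sketched in plain Python against a local temporary directory. This is a simplified model, not Spark's implementation: the class `DeltaStateStore`, the JSON file format, and the `<version>.delta` / `<version>.snapshot` file names are assumptions for illustration, changes are buffered and written at commit time rather than streamed, and values are assumed never to be `None` (it is used as a delete marker).

```python
import json
import os
import tempfile


class DeltaStateStore:
    """In-memory map plus one delta file per committed version (sketch)."""

    def __init__(self, directory):
        self.directory = directory
        self.version = 0
        self.state = {}    # in-memory map serving reads
        self.pending = {}  # changes made since the last commit

    def put(self, key, value):
        self.state[key] = value
        self.pending[key] = value

    def remove(self, key):
        self.state.pop(key, None)
        self.pending[key] = None  # None marks a deletion in the delta

    def commit(self):
        self.version += 1
        path = os.path.join(self.directory, f"{self.version}.delta")
        with open(path, "w") as f:
            json.dump(self.pending, f)
        self.pending = {}
        return self.version

    def snapshot(self):
        # In the design above this runs on a background maintenance thread.
        path = os.path.join(self.directory, f"{self.version}.snapshot")
        with open(path, "w") as f:
            json.dump(self.state, f)

    @classmethod
    def load(cls, directory, version):
        """Rebuild a version from the latest snapshot plus later deltas."""
        store = cls(directory)
        snapshots = [int(name.split(".")[0]) for name in os.listdir(directory)
                     if name.endswith(".snapshot")]
        base = max((v for v in snapshots if v <= version), default=0)
        if base:
            with open(os.path.join(directory, f"{base}.snapshot")) as f:
                store.state = json.load(f)
        for v in range(base + 1, version + 1):
            with open(os.path.join(directory, f"{v}.delta")) as f:
                for key, value in json.load(f).items():
                    if value is None:
                        store.state.pop(key, None)
                    else:
                        store.state[key] = value
        store.version = version
        return store


directory = tempfile.mkdtemp()
store = DeltaStateStore(directory)
store.put("a", 1); store.put("b", 1); store.commit()     # version 1
store.put("a", 2); store.commit()                        # version 2
store.snapshot()                                         # 2.snapshot
store.remove("b"); store.put("c", 5); store.commit()     # version 3
recovered = DeltaStateStore.load(directory, 3)
older = DeltaStateStore.load(directory, 1)
```

Snapshots bound recovery time: loading version 3 reads one snapshot and one delta instead of replaying every delta from the beginning.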
Code Entities in HDFS backed State Management
● StatefulOperators
○ define the computation logic to be executed against the state store for the set of rows in a partition.
● StateStoreOps
○ prepares a StateStoreRDD for doing computations against the state store with the computation logic passed by the stateful operator.
● storeUpdateFunction
○ contains the computation logic defining what to do against the state store with the data generated in a partition task.
● HDFSBackedStateStore
○ concrete implementation of State Store using a concurrent hashmap, backed by the HDFS file system for persistence.
● HDFSBackedStateStoreProvider
○ contains methods to get a given store and execute maintenance tasks (snapshotting, purging, deleting files, cleaning old states).
● StateStoreCoordinator
○ ensures the task for a partition gets scheduled on an executor where its last versioned state is maintained in the hashmap.
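The coordinator's role can be pictured with a small plain-Python sketch. The names `StateStoreLocationTracker`, `report_active`, `preferred_executor` and `pick_executor` are hypothetical and are not Spark's API. The sketch also assumes that locality is a preference: if the preferred executor is not available, the task runs elsewhere and the state has to be reloaded from the persisted files.

```python
class StateStoreLocationTracker:
    """Remembers which executor last held the state of each partition."""

    def __init__(self):
        self._locations = {}  # partition id -> executor id

    def report_active(self, partition_id, executor_id):
        self._locations[partition_id] = executor_id

    def preferred_executor(self, partition_id):
        return self._locations.get(partition_id)


def pick_executor(partition_id, available_executors, tracker):
    """Prefer the executor that already has the state loaded in memory."""
    preferred = tracker.preferred_executor(partition_id)
    if preferred in available_executors:
        return preferred
    # Fallback: any executor; it would rebuild the state from storage.
    return available_executors[0]


tracker = StateStoreLocationTracker()
tracker.report_active(0, "exec-1")
tracker.report_active(1, "exec-2")
same_place = pick_executor(0, ["exec-1", "exec-2"], tracker)
fallback = pick_executor(1, ["exec-1", "exec-3"], tracker)
```

Scheduling on the preferred executor avoids re-reading snapshot and delta files at the start of every micro-batch.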
Code Flow of Stateful Structured Streaming
Quiz Time
Possible issues with the HDFS-backed implementation in production?
● State is constrained by executor memory
● The same executor memory has to be shared with RDD computation
● A single daemon thread is responsible for snapshotting entire state hashmaps, cleaning up files, etc.
In-Memory HashMap
Possible Solution ?
Food for Thought
Embedded/Local Store : RocksDB
● Embedded key-value data store
● An improved fork of LevelDB, open sourced by Facebook
● Brings the database close to the processing
● Pros :
○ No memory issues (unlike the in-memory HashMap)
○ No network latency (unlike Cassandra)
○ Fast writes : buffer + sequential transaction log
○ Isolation
● Cons :
○ Not distributed
○ Not replicated
○ Maintenance overhead, uses non-JVM memory
● Architecture :
○ Memtable : in-memory buffer
○ Change log
○ SSTables on disk
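The three architecture pieces can be sketched in a few lines of plain Python. This is a toy model of the log-structured idea, not RocksDB's API: `TinyLSM` and `memtable_limit` are made-up names, the "disk" tables are kept as in-memory sorted lists, and there is no compaction.

```python
import bisect


class TinyLSM:
    """Toy store: change log + memtable + immutable sorted tables."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.log = []       # change log: sequential appends only
        self.memtable = {}  # in-memory write buffer
        self.sstables = []  # sorted (key, value) lists, newest last

    def put(self, key, value):
        self.log.append((key, value))  # logged first, for crash recovery
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # A full memtable becomes an immutable sorted table.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.log = []  # flushed entries no longer need the log

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest table wins
            i = bisect.bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None


db = TinyLSM(memtable_limit=2)
db.put("k1", "a")
db.put("k2", "b")   # memtable is full, so it is flushed to a sorted table
db.put("k1", "c")   # newer value lives in the memtable
latest = db.get("k1")
flushed = db.get("k2")
```

Writes never modify a sorted table in place, which is what makes them fast: one sequential log append and one in-memory update.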
RocksDB in Streaming Systems
● Apache Flink
https://flink.apache.org/features/2018/01/30/incremental-checkpointing.html
● Apache Samza
https://samza.apache.org/learn/documentation/0.7.0/container/state-management.html
● Kafka Streams
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Internal+Data+Management
Summary
● What stateful processing and state are in streaming
● Architecture of state management in stateful processing of Structured Streaming
● Code Example
● Why an embedded store like RocksDB is so important in stream processing
Thank You. Questions?
Qubole Blog : https://www.qubole.com/blog/

Editor's Notes

  • #2: How many of you have an idea about streaming, have worked on any streaming system, or understand the term “state management”? This should be useful for every one of you. State is information about past input that can be used to influence the processing of future input; we will see this in detail. Feel free to ask questions at any point during the presentation.
  • #3: Why would you like to listen to this? Although the talk is specific to Spark Structured Streaming, the design, architecture, concepts and thought process behind why it is the way it is will give you a good understanding of any streaming technology. They are all like distant cousins of the same family, and you will see many overlaps between different streaming systems. Understanding one helps you understand the others. Many of them copy, or say are inspired by, each other. This will give you the perspective of a streaming engine developer.
  • #4: Quick question: What do you infer from this picture?
  • #5: This pretty much sums up the difference between batch and stream processing. Batch is data at rest: you take a chunk of data each time you process. In streaming, you keep getting data and need to process it as and when it comes.
  • #7: We will see a running version of this example on a Qubole Notebook after understanding State Management. START THE CLUSTER. The objective of showing this code example is to give you an idea of stateful processing, so that when we talk about state management you can relate and understand easily.
  • #8: Having given a rough idea of Structured Streaming, let's start with the actual topic we want to discuss today. By analogy to SQL, the select and where clauses of a query are usually stateless, but join, group by and aggregation functions like sum and count require state.
  • #9: Intermediate information in stream processing. State of progress: offsets/commits.
  • #11: Often easy to understand when compared with the predecessor. Evolution is a constant process; something new comes because of the limitations of the old. Story about experience with stateless stream processing, maintaining offsets in ZooKeeper.
  • #12: This is the main meat of this talk, which I want to go into in detail.
  • #13: Prepared the diagram based on my understanding of the internal code and how it works in the upcoming Spark 2.4. It is very important to note here that all these concepts, like incremental checkpointing and asynchronous state management, are not specific to Spark Streaming. You will find them in other streaming systems like Flink, etc., under different names.
  • #14: Slide for those interested in checking out the code themselves: the classes/interfaces/methods involved in state management. Won't go into detail; instead will show the code flow of state management in the next slide.
  • #15: The stateful operator is the place where the logic to interact with the state store resides. Show code.
  • #16: Before I go forward, do you have any questions here? Because now I have a question for you.
  • #17: Do you see any possible issues with this architecture? Honestly, I have not encountered any issues, but let's discuss what the possible issues with this approach could be.
  • #18: Go back to the architecture diagram.
  • #19: Had intentionally not talked about RocksDB at the start; now is the time. Really wanted to talk about this embedded storage or local persistent store.
  • #20: Why embedded storage? It became popular in the flash memory/SSD era: writing to local disks became much faster compared to the client-server model over the network to storage systems. Sequential read/write: analogy of an airport conveyor belt for spinning disks, with latency from rotation and seek time to reach the right sector of the data. Hadoop was about moving processing closer to data; RocksDB is about moving the database closer to processing. Improved LevelDB: multithreaded writes and compaction, support for Bloom filters while reading data, improved compaction logic similar to HBase.
  • #21: RocksDB is present in almost every recent streaming system that needs to keep unlimited state without the penalty of a network call. Storm: currently does not use local storage like RocksDB. It still relies on remote stores like Redis, HBase, Cassandra. Samza: features at LinkedIn, like the personalized feed sent to your wall, are decided after joining a lot of information with the available feed using Samza. Kafka and Samza were written by the same people at LinkedIn, who later went on to found a company called Confluent, where they wrote Kafka Streams. So you will find many similarities.
  • #22: As said in the beginning, understanding one system will help us understand others. Understanding RocksDB is one example. Incremental checkpointing, snapshotting and asynchronous state management are other concepts. Technologies might be different and implementations might be different, but after all they are trying to solve similar problems of the distributed world, with the same challenges, limitations and expectations, like fault tolerance, exactly-once processing, etc., which will be there everywhere.
  • #23: Please keep a close watch on Qubole Engineering. We write a lot of interesting stuff on big data on the cloud: Spark, the open-sourced SparkLens, tuning, Hive, Presto, AWS.