Arvind Heda, Kapil Malik
Indicium: Interactive Querying at Scale
#EUeco9
What’s in the session …
• Unified Data Platform on Spark
– Single data source for all scheduled / ad hoc jobs and interactive lookup / queries
– Data Pipeline
– Compute Layer
– Interactive Queries?
• Indicium: Part 1 (managed context pool)
• Indicium: Part 2 (smart query scheduler)
Unified Data Platform
Unified Data Platform…(for anything / everything)
• Common Data Lake for storing
– Transactional data
– Behavioral data
– Computed data
• Drives all decisions / recommendations / reporting / analysis from the same store.
• Single data source for all Decision Edges, Algorithms, BI tools, and ad hoc and interactive Query / Analysis tools
• Data Platform needs to support
– Scale – Store everything from summary to raw data.
– Concurrency – Handle multiple requests within an acceptable user response time.
– Ad hoc drill down to any level – query, join, correlation on any dimension.
Unified Data Platform
[Figure: architecture diagram. Labels: Query UI; Interactive Query; Spark Context (Yarn), shown twice; Scheduled Jobs; Compute jobs; BI scheduled reports; Data Collection service; Real time lookup; HDFS / S3. Layer labels: Compute Layer, Data Pipeline.]
Features
Columns: Features / Details / Approach
• Data Persistence
– Details: Store large data volume of Txn, Behavioural and Computed data
– Approach: Spark – Parquet format on S3 / HDFS
• Data Transformations
– Details: Transformation / Aggregation – correlations and enrichments
– Approach: Batch Processing – Kafka / Java / Spark Jobs
• Algorithmic Access
– Details: Aggregated / Raw Data Access for scheduled Algorithms
– Approach: Spark Processes with SQL Context based data access
• Decision Making
– Details: Aggregated Data Access for decision in real time
– Approach: In memory cache of aggregated data
• Reporting BI / Ad Hoc Query
– Details: Aggregated / Raw Data Access for scheduled reports (BI); Aggregated / Raw Data Access for Ad Hoc Queries
– Approach: BI tool with defined scheduled Spark SQL queries on Data store
• Interactive Queries
– Details: Drill down data access on BI tools for concurrent users; Ad hoc Query / Analysis on data for concurrent users
– Approach: Scaling challenges for Spark SQL?
Data Pipeline
• Kafka / Sqoop based data collection
• Live lookup store for real time decisions
• Tenant / Event and time based data partition
• Time based compaction to optimize query on sparse data
• Summary Profile data to reduce Joins
• Shared compute resources but different contexts for Scheduled / Ad Hoc jobs or for Algorithmic / Human touchpoints
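A minimal sketch of tenant / event / time based partitioning, assuming Hive-style `key=value` directories (the slides do not state the actual layout). The root path, the key names `tenant`, `event` and `dt`, and both function names are hypothetical.

```python
from datetime import date, timedelta


def partition_path(root: str, tenant: str, event: str, day: date) -> str:
    """Build the directory for one tenant, event type and day.

    Assumption: Hive-style key=value directory names; the key names
    are illustrative and not taken from the talk.
    """
    return f"{root.rstrip('/')}/tenant={tenant}/event={event}/dt={day.isoformat()}"


def last_n_days(root: str, tenant: str, event: str, today: date, n: int) -> list:
    """List the partitions a query over the last n days has to read."""
    return [
        partition_path(root, tenant, event, today - timedelta(days=i))
        for i in range(n)
    ]
```

A query scoped to one tenant, one event type and a short time window then touches only a handful of directories instead of the whole store.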
Compute Layer
• No real ‘real time’ queries – FIFO scheduling for user tasks
• Static or rigid resource allocation between scheduled and ad hoc queries / jobs
• Short-lived and stateless contexts – no stickiness for user-defined views such as temp tables.
• Interactive queries?
What was needed for Interactive query…
• SQL-like Query Tool for Ad Hoc Analysis
• Scalability for concurrent users
– Fair Scheduling
– Responsiveness
• High Availability
• Performance – specifically for scans and joins
• Extensibility – User Views / Datasets / UDFs
Indicium?
Indicium: Part 1
Managed Context Pool
Managed Context Pool
[Figure: architecture diagram. Labels: Apache Zeppelin; Spark Job-server; SQL Context (Yarn); HDFS, shown three times.]
Managed Context Pool
Apache Zeppelin 0.6
• SQL-like query tool and a notebook
• Custom interpreter
– Configuration: SJS (Spark Job-Server) server + context
– Statement execution: Make asynchronous REST calls to SJS
• Concurrency – Multiple interpreters and notebooks
Spark Job-Server 0.6.x
• Custom SQL context with catalog override
• Custom application to execute queries
• High Availability: Multiple SJS servers and multiple contexts per server
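A hedged sketch of the URLs such an interpreter might call. It uses Spark Job-Server's documented `POST /jobs` (with `appName`, `classPath` and `context` query parameters) and `GET /jobs/<jobId>` endpoints. The host, port, application name and job class shown in the comments are hypothetical, and the talk's custom application is not described in the slides.

```python
from urllib.parse import urlencode


def submit_url(server: str, app_name: str, class_path: str, context: str) -> str:
    """URL for POST /jobs; without sync=true the call returns a job id at once."""
    query = urlencode(
        {"appName": app_name, "classPath": class_path, "context": context}
    )
    return f"{server.rstrip('/')}/jobs?{query}"


def status_url(server: str, job_id: str) -> str:
    """URL for GET /jobs/<jobId>, polled until the job finishes."""
    return f"{server.rstrip('/')}/jobs/{job_id}"


# Hypothetical values for illustration only:
#   submit_url("http://sjs-1:8090", "indicium", "com.example.SqlQueryJob", "ctx-1")
```

Because the interpreter configuration fixes the server and context, every statement from that interpreter goes to the same context in this design.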
Managed Context Pool
Features
• Familiar SQL interface on notebooks
• Concurrent multi-user support
• Visualization Dashboards
• Long running Spark Job – to support User Defined Views
• Access control on Spark APIs
• Custom SQL context with custom catalog
– Intercept lookupTable calls to query actual data
– Table wrappers for time windows - like select count(*) from `lastXDays(table)`
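One way a catalog override could interpret a time-window wrapper, sketched under assumptions: `X` is taken to be a positive integer (for example `last7Days(events)`), and the window is assumed to include today. The function name and return shape are hypothetical; the real catalog hook is Spark-internal and not shown in the slides.

```python
import re
from datetime import date, timedelta

# Matches wrapper names such as last7Days(events); assumed syntax.
_WRAPPER = re.compile(r"^last(\d+)Days\((\w+)\)$")


def resolve_table(name: str, today: date):
    """Return (table, start_day, end_day); the days are None for a plain table."""
    match = _WRAPPER.match(name)
    if match is None:
        return name, None, None
    days, table = int(match.group(1)), match.group(2)
    # Assumption: the window is inclusive of today, so 7 days spans today-6..today.
    return table, today - timedelta(days=days - 1), today
```

The resolved date range can then be turned into a partition filter before the query reads any data.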
Managed Context Pool
Issues
• Interpreter hard wired to a context
• FIFO scheduling: Single statement per interpreter-context pair – across notebooks / across users
• No automated failure handling
– Detecting a dead context / SJS server
– Recovery from the context / server failure
• No dynamic scheduling / load balancing
– No way to identify an overloaded context
• Incompatible with Spark 2.x
Indicium: Part 2
Smart Query Scheduler
Smart Query Scheduler
[Figure: architecture diagram. Labels: Apache Zeppelin; Smart Query Scheduler; Spark Job-server; SQL Context (Yarn); HDFS, shown three times.]
Smart Query Scheduler
Zeppelin 0.7
• Supports per notebook statement execution
SJS 0.7 Custom Fork
• Support for Spark 2.x
Smart Query Scheduler:
• Scheduling: API to dynamically bind SJS server + context for every job / query
Other Optimizations:
• Monitoring: Monitor jobs running per context
• Availability: Track health of SJS servers and contexts and ensure a healthy context in the pool
Smart Query Scheduler
Dynamic scheduling for every query
• Zeppelin interpreter agnostic of actual SJS / context
• Load balancing of jobs per context
• Query Classification and intelligent routing
• Dynamic scaling / de-scaling the pool size
• Shared Cache
• User Defined Views
• Workspaces or custom time window view for every interpreter
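Per-query binding with load balancing can be sketched as "pick the healthy context with the fewest running jobs". This is an assumed policy for illustration; the slides do not say which balancing rule the scheduler uses, and the `Context` fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Context:
    server: str            # SJS server hosting the context
    name: str              # context name on that server
    running_jobs: int = 0  # taken from per-context job monitoring
    healthy: bool = True   # taken from health tracking


def pick_context(pool):
    """Bind a query to the least-loaded healthy context (assumed policy)."""
    candidates = [c for c in pool if c.healthy]
    if not candidates:
        raise RuntimeError("no healthy context in pool")
    return min(candidates, key=lambda c: c.running_jobs)
```

Since the choice is made for every query, the Zeppelin interpreter no longer needs to know which SJS server or context it talks to.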
Query Classification / routing
Custom resource configurations for contexts dedicated to complex or asynchronous queries / jobs:
• Classify queries, based on heuristics / historic data, into light / heavy queries and route them to different contexts.
• Separate contexts for interactive vs background queries
– An export table call does not starve an interactive SQL query
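A toy classifier in the spirit of the bullets above. Everything specific here is an assumption: the 30-second threshold, the join-count heuristic, and the pool names are invented for illustration and are not the rules used in the talk.

```python
import re
from statistics import median

# Hypothetical pool names for the two query classes.
ROUTES = {"light": "interactive-pool", "heavy": "background-pool"}


def classify(sql: str, history_secs=(), heavy_after_secs: float = 30.0) -> str:
    """Label a query light or heavy.

    Past runtimes of the same query win when available; otherwise fall back
    to a crude syntactic heuristic (assumed, not from the slides).
    """
    if history_secs:
        return "heavy" if median(history_secs) > heavy_after_secs else "light"
    text = sql.lower()
    joins = len(re.findall(r"\bjoin\b", text))
    has_limit = re.search(r"\blimit\s+\d+", text) is not None
    if joins >= 2 or (joins == 1 and not has_limit):
        return "heavy"
    return "light"


def route(sql: str, history_secs=()) -> str:
    """Return the pool a query should be sent to."""
    return ROUTES[classify(sql, history_secs)]
```

Keeping the two classes on separate contexts is what prevents a long export from queueing in front of an interactive query.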
Spark Dynamic Context
Elastic scaling of contexts, co-existing on the same cluster as scheduled batch jobs
• Scale up in the daytime, when user load is high
• Scale down at night, when overnight batch jobs are running
• Scaling also helped to create reserved bandwidth for any set of users, if needed.
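The day / night policy can be sketched as a target pool size per hour. The pool sizes and the 08:00 to 20:00 day window are made-up defaults, and both function names are hypothetical.

```python
def target_pool_size(hour: int, day_size: int = 8, night_size: int = 2,
                     day_start: int = 8, day_end: int = 20) -> int:
    """Contexts wanted at a given hour (0-23); all defaults are illustrative."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    return day_size if day_start <= hour < day_end else night_size


def scaling_delta(current: int, hour: int) -> int:
    """Positive: contexts to start; negative: contexts to stop."""
    return target_pool_size(hour) - current
```

Stopping contexts at night releases their executors back to the cluster for the overnight batch jobs.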
Shared Cache
Alluxio to store common datasets
• Single cache for common datasets across contexts
– Avoids replication across contexts
– Cached data safe from executor / context crashes
• Dedicated refresh thread to release / update data consistently across contexts
Persistent User Defined Views
• Users can define a temp view for a SQL query
• Replicated across all SJS servers + contexts
• Definitions persisted in DB so that temp views are re-registered whenever a context restarts.
• Load on start to warm up views
• TTL support for expiry
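A minimal sketch of persisted view definitions with TTL, assuming an in-memory dict as a stand-in for the database table. `ViewStore`, `warm_up` and the `register` callback are hypothetical names.

```python
import time


class ViewStore:
    """Holds view definitions with an expiry time (stand-in for the DB)."""

    def __init__(self, clock=time.time):
        self._rows = {}      # view name -> (sql, expires_at)
        self._clock = clock  # injectable so expiry can be tested

    def define(self, name: str, sql: str, ttl_secs: float) -> None:
        self._rows[name] = (sql, self._clock() + ttl_secs)

    def live_views(self) -> dict:
        """Definitions whose TTL has not yet expired."""
        now = self._clock()
        return {n: sql for n, (sql, exp) in self._rows.items() if exp > now}


def warm_up(store: ViewStore, register) -> None:
    """Run on context start: re-register every unexpired view."""
    for name, sql in store.live_views().items():
        register(name, sql)
```

On Spark 2.x the `register` callback could be something like `lambda name, sql: spark.sql(sql).createOrReplaceTempView(name)`, called once per context so the view exists on all of them.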
Workspaces
• Support for multiple custom catalogs in SQL context for table resolution
• Custom time range / source / caching
– Global
– Per catalog
– Per table
• Configurable via Zeppelin interpreter
• Decoupled time range from query syntax
– Join a behavior table (refer to last 30 days) with a lookup table (fetch complete data)
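The global / per catalog / per table settings suggest a most-specific-wins lookup, sketched below. That precedence order is an assumption, as are the workspace dict layout and the convention that `None` means "fetch complete data".

```python
def resolve_days(workspace: dict, catalog: str, table: str):
    """Time range in days for a table; None means no time restriction.

    Assumed precedence: per-table, then per-catalog, then global.
    """
    per_table = workspace.get("tables", {})
    key = f"{catalog}.{table}"
    if key in per_table:
        return per_table[key]
    per_catalog = workspace.get("catalogs", {})
    if catalog in per_catalog:
        return per_catalog[catalog]
    return workspace.get("global")


# Hypothetical workspace: behavior tables see 30 days, one lookup table is
# read in full, and everything else defaults to 7 days.
example_workspace = {
    "global": 7,
    "catalogs": {"behavior": 30},
    "tables": {"lookup.country": None},
}
```

With such a workspace the join in the last bullet needs no time predicate in the SQL itself: each side of the join gets its range from configuration.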
Automated Pool Management
• Monitoring scripts to track and restart unhealthy / unresponsive SJS servers / contexts
• APIs on SJS to stop / start / refresh context / SJS
• APIs to refresh cached tables / views
• APIs on Router Service to reconfigure routing / pool size and resource allocation
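One monitoring pass could look like the sketch below: probe each context and restart it after several consecutive failed checks. The three-strikes rule and the injected `is_healthy` / `restart` callbacks are assumptions; the actual scripts and SJS calls are not shown in the slides.

```python
def sweep(contexts, is_healthy, restart, failures: dict, max_failures: int = 3):
    """Run one health-check pass and return the contexts that were restarted.

    `failures` maps context -> consecutive failed probes and is kept by the
    caller between passes.
    """
    restarted = []
    for ctx in contexts:
        if is_healthy(ctx):
            failures[ctx] = 0
            continue
        failures[ctx] = failures.get(ctx, 0) + 1
        if failures[ctx] >= max_failures:
            restart(ctx)       # e.g. a stop / start call against SJS
            failures[ctx] = 0  # give the fresh context a clean slate
            restarted.append(ctx)
    return restarted
```

Requiring several consecutive failures avoids restarting a context that is merely slow to answer one probe while it runs a heavy query.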
Thank You!
Questions & Answers
kapil.ee06@gmail.com
arvind_heda@yahoo.com
References
• Apache Zeppelin: https://zeppelin.apache.org/
• Spark Job-server: https://github.com/spark-jobserver/spark-jobserver
• Alluxio: http://www.alluxio.org/
Scale ….
• Data
– ~ 100 TB
– ~ 1000 Event Types
• 100+ Active concurrent users
• 30+ Automated Agents
• 10000+ Scheduled / 3000+ Ad Hoc Analysis
• Avg data churn per Analysis > 200 GB
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spark Job-Server with Arvind Heda Kapil Malik
