td-spark internals: Extending Spark with Airframe - Spark Meetup Tokyo #3 2020

Copyright 1995-2020 Arm Limited (or its affiliates). All rights reserved.
Taro L. Saito
Arm Treasure Data
July 31st, 2020
Spark Meetup Tokyo #3
td-spark internals
Extending Spark with Airframe

About Me: Taro L. Saito
● Ph.D., Principal Software Engineer at Arm Treasure Data
● Living in the US for 5 years
● Created Presto as a service
● Processing 1 million SQL queries / day on the cloud (Presto Webinar)
● OSS:
● Airframe, snappy-java (used in Parquet, Spark core), sbt-sonatype, etc.
● Books: WIP

Challenge: Adding Treasure Data Support to Spark
● PlazmaDB: Cloud Data Store of Treasure Data
● MessagePack-based columnar format (MPC1)
● Each table column is represented as a sequence of MessagePack values
● What was necessary for supporting Spark?
● td-spark driver (td-spark-assembly.jar)
■ MPC1 <-> DataFrame conversion
● Plazma Public API
■ APIs for reading and writing MPC1 files from PlazmaDB
● Created these two components with Airframe OSS
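
As a rough illustration of the columnar idea above (each table column kept as its own sequence of values), the sketch below pivots rows into per-column sequences and back. It uses plain Scala collections only. The actual MPC1 encoding stores MessagePack values and is not reproduced here, and the names `ColumnarSketch`, `toColumns`, and `toRows` are hypothetical.

```scala
// Hypothetical sketch: pivot rows into per-column value sequences and back.
// This is NOT the MPC1 format itself; real column blocks hold MessagePack values.
object ColumnarSketch {
  type Row = Map[String, Any]

  // Row-oriented -> column-oriented: one value sequence per column name
  def toColumns(schema: Seq[String], rows: Seq[Row]): Map[String, Seq[Any]] =
    schema.map(col => col -> rows.map(row => row.getOrElse(col, null))).toMap

  // Column-oriented -> row-oriented: combine the i-th value of every column
  def toRows(schema: Seq[String], columns: Map[String, Seq[Any]]): Seq[Row] = {
    val rowCount = schema.headOption.map(c => columns(c).size).getOrElse(0)
    (0 until rowCount).map(i => schema.map(col => col -> columns(col)(i)).toMap)
  }
}
```

A reader that only needs some columns can then touch only those sequences, which is the usual motivation for a columnar layout.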

Airframe: Core Scala Modules of Treasure Data
● Airframe
● Scala OSS assets built from our knowledge, production experience, and design decisions
● 20+ Common Utilities for Scala
● Dependency Injection (DI)
● Airframe RPC
■ HTTP Server, Client builder (ScalaMatsuri. Tokyo, October 2020)
● AirSpec
■ Testing framework for Scala (ScalaDays. Seattle, May 2021)
[Diagram labels: Programming; Knowledge, Experiences, Design Decisions; OSS (Airframe); Outcome: Products, 24/7 Services, Business Values]

Airframe Modules Used Inside td-spark
[Diagram: td-spark.jar (Master and Worker: Design, SparkContext, TDSparkContext, TDSparkService, MPC1 Reader/Writer, IO Manager), the DataFrame <-> MPC1 conversion, and the Plazma Public API, annotated with the Airframe modules they use: Airframe DI, Airframe RPC, airframe-codec, airframe-msgpack, airframe-json, airframe-http, airframe-finagle, airframe-fluentd, airframe-config, airframe-launcher, airframe-jmx, airframe-metrics, airframe-control, airframe-log]

Reading MPC1 Partitions and Column Blocks
[Diagram: table data stored as mpc1 partitions -> Plazma Public API -> column blocks (columnar data download, parallel read) -> td-spark / td-pyspark -> DataFrame -> user programs]

Uploading DataFrame as MPC1 Partitions
[Diagram: user programs (td-spark / td-pyspark) -> DataFrame -> format conversion -> mpc1 partitions -> parallel upload -> Amazon S3 -> copy / transaction (Plazma Public API) -> table data]

Airframe RPC
[Diagram: an RPC interface is used to generate a Router, an HTTP/gRPC client, a Scala.js client, and an Open API spec (API documentation); an RPC implementation runs on the RPC web server and receives RPC calls (JSON) from Scala.js web applications, microservices, and cross-language RPC clients. Modules shown: sbt-airframe, airframe-http, airframe-http-finagle, airframe-http-rx, airframe-codec, airframe-gRPC]
● Use Scala As An RPC Interface
● Generate HTTP Server/Client (REST or gRPC)
● HTTP calls -> JSON/MessagePack data -> Remote Scala function calls

td-spark: Adding More Functions to Spark
● Using an implicit class to extend SparkSession (spark variable)
● Adding TD-specific functionalities
● Time series data queries
■ e.g., spark.td.table("TD's table").within("-1h").df
● Predicate pushdown for time-series data
● etc.
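
The extension pattern above can be sketched without Spark on the classpath. In this minimal example `FakeSession` stands in for `SparkSession`, and `TDOps`, `TableRef`, and `SessionWithTD` are hypothetical names chosen for illustration; td-spark's real classes and method signatures may differ.

```scala
// Stand-in for org.apache.spark.sql.SparkSession so the sketch runs without Spark
final class FakeSession(val appName: String)

// Hypothetical table handle that records an optional time-range condition
final case class TableRef(name: String, timeRange: Option[String] = None) {
  def within(range: String): TableRef = copy(timeRange = Some(range))
}

object TDExtension {
  // Hypothetical entry point returned by `session.td`
  final class TDOps(session: FakeSession) {
    def table(name: String): TableRef = TableRef(name)
  }

  // The implicit class adds a `td` member to any FakeSession once imported
  implicit class SessionWithTD(session: FakeSession) {
    def td: TDOps = new TDOps(session)
  }
}
```

After `import TDExtension._`, a call such as `new FakeSession("demo").td.table("sample_table").within("-1h")` compiles even though `FakeSession` itself has no `td` member, which is how extra methods can appear on the `spark` variable without modifying Spark.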

Tips: Avoid Task Serialization Errors with Airframe DI
● Serializable
● spark.conf key-value properties inside SparkContext
● Non-Serializable
● Complex service objects
● Solution: Airframe DI (Dependency Injection)
● Distribute the service design (= how to construct objects) with the jar file
● Build service objects from the design (20+ components and config objects)
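
The idea in the bullets above can be sketched in plain Scala: only a small serializable description travels to the worker, and the heavy service object is constructed there. This sketch does not use the Airframe DI API; `ServiceConfig`, `ApiService`, `ServiceDesign`, and `WorkerSide` are hypothetical stand-ins, and the endpoint value is made up.

```scala
// Small, serializable settings (comparable to key-value properties in spark.conf)
final case class ServiceConfig(endpoint: String, retries: Int)

// A "complex service object": deliberately not Serializable
final class ApiService(val config: ServiceConfig) {
  def describe: String = s"${config.endpoint} (retries=${config.retries})"
}

// Hypothetical stand-in for a design: a serializable recipe for building the service
final case class ServiceDesign(config: ServiceConfig) {
  def build(): ApiService = new ApiService(config)
}

object WorkerSide {
  // Called inside a task: only `design` needs to be shipped,
  // and the service is built locally on the worker
  def withService[A](design: ServiceDesign)(body: ApiService => A): A =
    body(design.build())
}
```

A task closure that captures `ServiceDesign` stays serializable because case classes are `Serializable`, while one that captures `ApiService` directly would not be.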
[Diagram: sending TDSparkService from the Master to a Worker directly causes a serialization error; instead, td-spark.jar on both sides carries the Design, and the Worker builds TDSparkService from the Design and Config with Airframe DI (Build OK!)]

Flexible Format Conversion with MessagePack
[Diagram: Airframe Codec packs/unpacks MessagePack to convert between DataFrame, MPC1 (Plazma Public API), and JDBC ResultSet]

Spark 3.0 and PySpark Support

Resources
● Airframe: https://wvlet.org/airframe/
● Airframe Meetup #1 ~ #3 reports
● ScalaMatsuri 2019 presentation
■ And more!!
● td-spark documentation: https://treasure-data.github.io/td-spark/
● See Also: Spark with Airframe (@smdmts)
● Spark to Spark data transfer with MessagePack-based airframe-codec
● Spark -> AWS service call management with airframe-control

Appendix

Treasure Data: A Ready-to-Use Cloud Data Platform
[Diagram: data collection (logs, device data, batch data) -> cloud storage (PlazmaDB, table schema) -> distributed data processing (jobs); surrounding services: job management, SQL editor, scheduler, workflows, machine learning; components are marked as Treasure Data OSS or third-party OSS]

TDSparkContext

td-pyspark
● Supporting PySpark
● Access Scala methods of td-spark:
■ sparkContext._jvm.(jvm package name).method(...)
● Conversion to PySpark’s DataFrame
● DataFrame(Scala DataFrame, sqlContext)

Predicate Pushdown
● Traverse DataFrame Column Filters
● Extract time conditions (e.g., -1d, -1w, -7d, etc.)
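
The traversal described above can be sketched with a tiny filter tree. The `Filter` classes below are hypothetical and only loosely modeled on data source filters; they are not Spark's classes. The sketch also assumes that relative expressions such as -1d have already been resolved to epoch-second bounds on a column named `time`.

```scala
// Minimal filter tree (hypothetical, not Spark's classes)
sealed trait Filter
final case class GreaterThanOrEqual(column: String, value: Long) extends Filter
final case class LessThan(column: String, value: Long) extends Filter
final case class EqualTo(column: String, value: Any) extends Filter
final case class And(left: Filter, right: Filter) extends Filter

// Half-open time range [start, end) in epoch seconds; None means unbounded
final case class TimeRange(start: Option[Long] = None, end: Option[Long] = None)

object TimeRangeExtractor {
  // Walk the filter tree and tighten the range for each condition on "time"
  def extract(filter: Filter, range: TimeRange = TimeRange()): TimeRange =
    filter match {
      case And(l, r) => extract(r, extract(l, range))
      case GreaterThanOrEqual("time", v) =>
        range.copy(start = Some(range.start.fold(v)(s => math.max(s, v))))
      case LessThan("time", v) =>
        range.copy(end = Some(range.end.fold(v)(e => math.min(e, v))))
      case _ => range // non-time filters are ignored here and evaluated later
    }
}
```

The extracted range can then be used to skip partitions outside the requested time window before any column data is downloaded.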

Class Loader Hierarchy of Databricks
● Base class loader
● User library class loader
● td-spark.jar will be loaded here
● Shared between multiple notebooks
■ Static variables used inside td-spark.jar will be shared by multiple notebooks!
● REPL class loader
● Shared between multiple notebooks
● Spark-library class loader
● Notebook-local
● Notebook-local class loader
● Caching local instances in static variables in td-spark caused ClassNotFound errors in other notebooks

Using Presto with Spark
● presto-jdbc
● Submit select * from (Original SQL) limit 0 => Query result schema
● JDBC ResultSet => Airframe Codec => DataFrame
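
The schema probe above can be sketched as follows: wrap the original SQL in a `limit 0` query and read column names and types from the standard JDBC metadata. `SchemaProbe`, `schemaQuery`, and `columns` are hypothetical names; only `java.sql` interfaces are used, and the conversion into a DataFrame through Airframe Codec is not shown.

```scala
import java.sql.ResultSet

object SchemaProbe {
  // Wrap the user's SQL so the engine reports the result schema without returning rows
  def schemaQuery(originalSql: String): String = {
    val body = originalSql.trim.stripSuffix(";").trim
    s"select * from ($body) limit 0"
  }

  // Read (column name, type name) pairs from JDBC metadata; JDBC columns are 1-indexed
  def columns(rs: ResultSet): Seq[(String, String)] = {
    val meta = rs.getMetaData
    (1 to meta.getColumnCount).map(i => meta.getColumnName(i) -> meta.getColumnTypeName(i))
  }
}
```

For example, `SchemaProbe.schemaQuery("select a, b from t;")` yields `select * from (select a, b from t) limit 0`.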

td-prestobase: A Proxy Gateway to Presto Clusters
● td-prestobase is a proxy gateway to Presto clusters that speaks the standard Presto protocol, so it supports any Presto client (e.g., presto-cli, JDBC, ODBC)
● td-spark uses presto-jdbc and td-prestobase APIs to run Presto queries
