Copyright 1995-2020 Arm Limited (or its affiliates). All rights reserved.
Taro L. Saito
Arm Treasure Data
July 31st, 2020
Spark Meetup Tokyo #3
td-spark internals
Extending Spark with Airframe
About Me: Taro L. Saito
● Ph.D., Principal Software Engineer at Arm Treasure Data
● Living in the US for 5 years
● Created Presto as a service
● Processing 1 million SQL queries per day on the cloud (see: Presto Webinar)
● OSS:
● Airframe, snappy-java (used in Parquet, Spark core), sbt-sonatype, etc.
● Books:
WIP
Challenge: Adding Treasure Data Support to Spark
● PlazmaDB: Cloud Data Store of Treasure Data
● MessagePack-based columnar format (MPC1)
● Each table column is represented as a sequence of MessagePack values
● What was necessary for supporting Spark?
● td-spark driver (td-spark-assembly.jar)
■ MPC1 <-> DataFrame conversion
● Plazma Public API
■ APIs for reading and writing MPC1 files from PlazmaDB
● Created these two components with Airframe OSS
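To make the columnar idea concrete, here is a minimal sketch of pivoting rows into one value sequence per column and back. It only illustrates the layout described above: it does not encode anything as MessagePack and is not the MPC1 file format. The names `ColumnarSketch`, `Row`, `toColumns`, and `toRows` are hypothetical.

```scala
object ColumnarSketch {
  // A row is a map from column name to value (hypothetical representation).
  type Row = Map[String, Any]

  // Pivot rows into one value sequence per column; a missing value becomes None.
  def toColumns(schema: Seq[String], rows: Seq[Row]): Map[String, Seq[Option[Any]]] =
    schema.map(name => name -> rows.map(_.get(name))).toMap

  // Rebuild rows from the column sequences (inverse of toColumns).
  def toRows(schema: Seq[String], columns: Map[String, Seq[Option[Any]]]): Seq[Row] = {
    val rowCount = schema.headOption.map(name => columns(name).size).getOrElse(0)
    (0 until rowCount).map { i =>
      schema.flatMap(name => columns(name)(i).map(value => name -> value)).toMap
    }
  }
}
```

In a real columnar file each of these sequences would be stored contiguously, which is what lets a reader download only the columns a query needs.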
Airframe: Core Scala Modules of Treasure Data
● Airframe
● Scala OSS assets built from our knowledge, production experience, and design decisions
● 20+ Common Utilities for Scala
● Dependency Injection (DI)
● Airframe RPC
■ HTTP Server, Client builder (ScalaMatsuri. Tokyo, October 2020)
● AirSpec
■ Testing framework for Scala (ScalaDays. Seattle, May 2021)
[Figure: Knowledge, Experiences, and Design Decisions; Products, 24/7 Services, and Business Values; Programming; OSS Outcome: Airframe]
Airframe Modules Used Inside td-spark
[Figure: Airframe modules used inside td-spark. Diagram labels: Master, Worker, Design, SparkContext, TDSparkContext, TDSparkService, MPC1 Reader/Writer, IO Manager, DataFrame, MPC1, Plazma Public API, td-spark.jar. Modules shown: Airframe DI, Airframe RPC, airframe-codec, airframe-msgpack, airframe-http, airframe-finagle, airframe-fluentd, airframe-config, airframe-launcher, airframe-jmx, airframe-metrics, airframe-control, airframe-log, airframe-json]
Reading MPC1 Partitions and Column Blocks
[Figure: Table data is stored as mpc1 partitions. The Plazma Public API serves it as column blocks (columnar data download); td-spark and td-pyspark read the blocks in parallel and expose them to user programs as DataFrames]
Uploading DataFrame as MPC1 Partitions
[Figure: User programs pass a DataFrame to td-spark / td-pyspark, which converts it into mpc1 partitions (format conversion) and uploads them in parallel to Amazon S3. Diagram labels on the Plazma Public API side: Copy, Transaction, Table Data]
Airframe RPC
[Figure: Airframe RPC overview. Diagram labels: RPC Interface, RPC Impl, Router, RPC Web Server, HTTP/gRPC Client, Scala.js Client, Scala.js Web Application, Cross-Language RPC Client, Open API Spec, API Documentation, Micro Service, RPC Calls, JSON, Generate, Create. Modules shown: sbt-airframe, airframe-http, airframe-http-finagle, airframe-http-rx, airframe-codec, airframe-gRPC]
● Use Scala As An RPC Interface
● Generate HTTP Server/Client (REST or gRPC)
● HTTP calls -> JSON/MessagePack data -> Remote Scala function calls
td-spark: Adding More Functions to Spark
● Using an implicit class to extend SparkSession (spark variable)
● Adding TD-specific functionality
● Time series data queries
■ e.g., spark.td.table("TD's table").within("-1h").df
● Predicate pushdown for time-series data
● etc.
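The implicit-class technique mentioned on this slide can be sketched as follows. This is a minimal illustration, not td-spark's code: `Session` is a stand-in for Spark's `SparkSession` so that the example compiles without Spark on the classpath, and `TDOps`, `TableRef`, and `SessionWithTD` are hypothetical names.

```scala
object TDExtensionSketch {
  // Stand-in for org.apache.spark.sql.SparkSession (assumption for this sketch).
  final class Session(val appName: String)

  // Hypothetical table handle; `within` only records the requested time range.
  final case class TableRef(name: String, timeRange: Option[String] = None) {
    def within(range: String): TableRef = copy(timeRange = Some(range))
  }

  // Entry point that groups the added, TD-specific methods.
  final class TDOps(session: Session) {
    def table(name: String): TableRef = TableRef(name)
  }

  // The implicit class adds a `td` member to any Session once it is imported.
  implicit class SessionWithTD(session: Session) {
    def td: TDOps = new TDOps(session)
  }
}
```

After `import TDExtensionSketch._`, a call such as `new Session("demo").td.table("sample_table").within("-1h")` compiles even though `Session` itself defines no `td` method. Grouping the extensions under a single `td` member keeps them from colliding with the session's own methods.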
Tips: Avoid Task Serialization Errors with Airframe DI
● Serializable
● spark.conf key-value properties inside SparkContext
● Non-Serializable
● Complex service objects
● Solution: Airframe DI (Dependency Injection)
● Distribute the service design (= how to construct objects) with the jar file
● Build service objects from the design (20+ components and config objects)
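The idea of shipping a recipe instead of live objects can be sketched in plain Scala. This does not use Airframe DI itself; `ServiceConfig`, `Service`, `ServiceDesign`, and `roundTrip` are hypothetical names, and Java serialization stands in for what happens when a task closure is sent to a worker.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object DesignSketch {
  // Plain configuration values: safe to serialize.
  final case class ServiceConfig(endpoint: String, retries: Int)

  // A service holding live resources; deliberately NOT Serializable.
  final class Service(val config: ServiceConfig) {
    val connection = new Object // stands in for sockets, thread pools, etc.
  }

  // The "design": a serializable description of how to construct the service.
  final case class ServiceDesign(config: ServiceConfig) {
    def build(): Service = new Service(config)
  }

  // Serialize and deserialize a value, as if it had been shipped to a worker.
  def roundTrip[A <: java.io.Serializable](a: A): A = {
    val buf = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buf)
    out.writeObject(a)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
    try in.readObject().asInstanceOf[A] finally in.close()
  }
}
```

Only `ServiceDesign` crosses the wire; the worker calls `build()` locally, so the non-serializable `Service` never has to be captured by a task closure.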
[Figure: Sending TDSparkContext / TDSparkService objects from the master to a worker causes a serialization error. Instead, the Design and Config are distributed with td-spark.jar, and the worker builds TDSparkService from the Design with Airframe DI ("Build OK!")]
Flexible Format Conversion with MessagePack
[Figure: Airframe Codec packs and unpacks MessagePack to convert between DataFrame, MPC1 (Plazma Public API), and JDBC ResultSet]
Spark 3.0 and PySpark Support
Resources
● Airframe: https://wvlet.org/airframe/
● Airframe Meetup #1 ~ #3 reports
● ScalaMatsuri 2019 presentation
■ And more!!
● td-spark documentation: https://treasure-data.github.io/td-spark/
● See Also: Spark with Airframe (@smdmts)
● Spark to Spark data transfer with MessagePack-based airframe-codec
● Spark -> AWS service call management with airframe-control
Appendix
Treasure Data: A Ready-to-Use Cloud Data Platform
[Figure: Treasure Data platform overview. Data Collection (logs, device data, batch data), Cloud Storage (PlazmaDB, table schema), Distributed Data Processing (jobs, job management, SQL editor, scheduler, workflows, machine learning). Legend: Treasure Data OSS, Third Party OSS]
TDSparkContext
td-pyspark
● Supporting PySpark
● Access Scala methods of td-spark:
■ sparkContext._jvm.(jvm package name).method(...)
● Conversion to PySpark’s DataFrame
● DataFrame(Scala DataFrame, sqlContext)
Predicate Pushdown
● Traverse DataFrame Column Filters
● Extract time conditions (e.g., -1d, -1w, -7d, etc.)
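The two steps above can be sketched as a small filter traversal plus a parser for relative offsets. This is a simplified illustration under stated assumptions: the `Filter` ADT is a stand-in for Spark's data source filter classes, only the `h`, `d`, and `w` units are handled, and all names are hypothetical. It is not td-spark's actual time-range syntax.

```scala
import java.time.Duration

object RelativeTimeSketch {
  // Stand-in filter tree (assumption; real pushdown receives Spark's filter classes).
  sealed trait Filter
  final case class And(left: Filter, right: Filter) extends Filter
  final case class TimeWithin(offset: String) extends Filter
  final case class Other(expr: String) extends Filter

  // Traverse the tree and collect every time condition found in it.
  def timeOffsets(f: Filter): List[String] = f match {
    case And(l, r)     => timeOffsets(l) ++ timeOffsets(r)
    case TimeWithin(o) => List(o)
    case Other(_)      => Nil
  }

  private val Offset = """-(\d{1,9})([hdw])""".r

  // Parse offsets such as "-1h", "-7d", or "-1w"; None if the string is not recognized.
  def parse(s: String): Option[Duration] = s match {
    case Offset(n, unit) =>
      val amount = n.toLong
      Some(unit match {
        case "h" => Duration.ofHours(amount)
        case "d" => Duration.ofDays(amount)
        case "w" => Duration.ofDays(amount * 7)
      })
    case _ => None
  }

  // Convert an offset into a [start, end) range of epoch seconds ending at nowEpochSec.
  def rangeEndingAt(nowEpochSec: Long, offset: String): Option[(Long, Long)] =
    parse(offset).map(d => (nowEpochSec - d.getSeconds, nowEpochSec))
}
```

Once a condition is reduced to a concrete time range, a reader can skip any partition whose time span falls entirely outside it, which is the point of pushing the predicate down.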
Class Loader Hierarchy of Databricks
● Base class loader
● User library class loader
● td-spark.jar will be loaded here
● Shared between multiple notebooks
■ Static variables used inside td-spark.jar will be shared by multiple notebooks!
● REPL class loader
● Shared between multiple notebooks
● Spark-library class loader
● Notebook-local
● Notebook-local class loader
● Caching local instances in static variables in td-spark caused a ClassNotFound error in other notebooks
Using Presto with Spark
● presto-jdbc
● Submit select * from (Original SQL) limit 0 => Query result schema
● JDBC ResultSet => Airframe Codec => DataFrame
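The schema probe described above can be sketched with standard JDBC calls. This is an illustration under assumptions, not td-spark's implementation: `SchemaProbeSketch` and its method names are hypothetical, and the caller is assumed to supply an already-open `java.sql.Connection` (for example, one obtained from presto-jdbc).

```scala
import java.sql.{Connection, ResultSetMetaData}

object SchemaProbeSketch {
  // Wrap the original query so that no rows come back, only result metadata.
  def probeSql(originalSql: String): String = {
    val trimmed = originalSql.trim.stripSuffix(";").trim
    s"select * from ($trimmed) limit 0"
  }

  // Read (column name, type name) pairs from the JDBC metadata (columns are 1-indexed).
  def columns(meta: ResultSetMetaData): Seq[(String, String)] =
    (1 to meta.getColumnCount).map(i => (meta.getColumnName(i), meta.getColumnTypeName(i)))

  // Run the probe over an existing JDBC connection and return the result schema.
  def resultSchema(conn: Connection, originalSql: String): Seq[(String, String)] = {
    val stmt = conn.createStatement()
    try {
      val rs = stmt.executeQuery(probeSql(originalSql))
      try columns(rs.getMetaData) finally rs.close()
    } finally stmt.close()
  }
}
```

Knowing the column names and types before any rows arrive lets the DataFrame schema be fixed up front, after which the actual ResultSet rows can be converted.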
td-prestobase: A Proxy Gateway to Presto Clusters
● td-prestobase is a proxy gateway to Presto clusters that speaks the standard Presto protocol, so it supports any Presto client (e.g., presto-cli, JDBC, ODBC)
● td-spark uses presto-jdbc and td-prestobase APIs for making Presto queries