In-memory data pipeline and warehouse at scale
using Spark, Spark SQL, Tachyon and Parquet
Buzzwords Berlin - 2015
Radu Chilom (radu.chilom@gmail.com)
Ema Iancuta (iorhian@gmail.com)

• Big data analytics / machine learning
• 6+ years with Hadoop ecosystem
• 2 years with Spark
• http://atigeo.com/
• A research group that focuses on the technical problems that exist in the big data industry and provides open source solutions
• http://bigdataresearch.io/

Agenda
• Intro
• Use Case
• Data pipeline with Spark
• Spark Job Rest Service
• Spark SQL Rest Service (Jaws)
• Parquet
• Tachyon
• Demo

Use Case
• Build an in-memory data pipeline for millions of financial transactions, used downstream by data scientists for detecting fraud
• Ingestion from S3 to our Tachyon/HDFS cluster
• Data transformation
• Data warehouse

Apache Spark
• “fast and general engine for large-scale data processing”
• Built around the concept of the RDD (resilient distributed dataset)
• API for Java/Scala/Python (80 operators)
• Powers a stack of high-level tools including Spark SQL, MLlib and Spark Streaming

Public S3 Bucket: public-financial-transactions
public-financial-transactions (s3-bucket)
  scheme/
    scheme.csv
  data/
    input-0.csv
  data2/
    input-1.csv
  . . .

1. Ingestion
• Download from S3
• Resolving the wildcards means listing the files' metadata
• Listing the metadata for a large number of files from external sources can take a long time

Listing the metadata (distributed)
[Diagram: the Driver distributes folder1 to folder6 across three Workers, and the Workers list the files in each folder (file-11, file-12, file-21, ..., file-61).]

Listing the metadata (distributed)
• For fine tuning, specify the number of partitions
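
The idea of listing metadata in parallel can be sketched without a cluster. The block below is a plain-Python analogy, not the authors' Spark code: a thread pool stands in for the workers, `list_folder`, `list_all` and `demo` are hypothetical helper names, local temporary folders stand in for S3, and `num_partitions` plays the role of the partition count that the slide suggests tuning.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def list_folder(folder):
    """List (path, size) metadata for every file directly inside one folder."""
    entries = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            entries.append((path, os.path.getsize(path)))
    return entries


def list_all(folders, num_partitions):
    """List many folders in parallel; num_partitions caps the parallelism."""
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        per_folder = pool.map(list_folder, folders)  # one task per folder
        return [entry for entries in per_folder for entry in entries]


def demo():
    """Build a small folder tree on local disk and list it with 3 workers."""
    with tempfile.TemporaryDirectory() as root:
        folders = []
        for i in range(1, 4):
            folder = os.path.join(root, f"folder{i}")
            os.mkdir(folder)
            folders.append(folder)
            for j in range(1, 3):
                with open(os.path.join(folder, f"file-{i}{j}"), "w") as f:
                    f.write("x" * (i * j))
        listing = list_all(folders, num_partitions=3)
        return [(os.path.basename(path), size) for path, size in listing]


if __name__ == "__main__":
    for name, size in demo():
        print(name, size)
```

Too few workers leaves the listing mostly sequential, while too many adds scheduling overhead for little gain, which is presumably why the partition count is worth tuning.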

Download Files
• Unbalanced partitions

Unbalanced partitions
Partition 0
  transactions.csv
Partition 1
  input.csv
  data.csv
  values.csv
  buzzwords.csv
  buzzwords.txt

Balancing partitions
Partition 0
  (0, transactions.csv)
  (2, data.csv)
  (4, buzzwords.csv)
Partition 1
  (1, input.csv)
  (3, values.csv)
  (5, buzzwords.txt)

Balancing partitions
• Keep in mind that repartitioning your data is a fairly expensive operation.
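
A minimal plain-Python sketch of the balancing step, assuming it works by pairing each file with a running index and assigning it to partition `index % num_partitions`. The function name `balance` is hypothetical, and the slides do not show the actual Spark code. In Spark, `zipWithIndex` followed by a partitioner on the index would be one way to get the same effect, but that mapping is an assumption.

```python
def balance(files, num_partitions):
    """Pair each file with its index, then place it in partition index % n."""
    partitions = [[] for _ in range(num_partitions)]
    for index, name in enumerate(files):
        partitions[index % num_partitions].append((index, name))
    return partitions


# File names taken from the "Unbalanced partitions" slide.
files = ["transactions.csv", "input.csv", "data.csv",
         "values.csv", "buzzwords.csv", "buzzwords.txt"]

balanced = balance(files, num_partitions=2)
for number, contents in enumerate(balanced):
    print(f"Partition {number}: {contents}")
```

With two partitions, the even indices land in partition 0 and the odd ones in partition 1, matching the layout on the slide.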

2. Data Transformation
• Data cleaning is the first step in any data science project
• For this use case:
  - Remove lines that don't match the structure
  - Remove “useless” columns
  - Transform data to be in a consistent format
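
The three cleaning steps might look roughly like the sketch below. The record layout (id, date, amount, country, note), the sample lines and the names `clean_line` and `clean` are all assumptions made for illustration, since the slides do not describe the real schema.

```python
EXPECTED_FIELDS = 5          # hypothetical layout: id, date, amount, country, note
KEEP_COLUMNS = (0, 1, 2, 3)  # drop the free-text "note" column


def clean_line(line):
    """Return a cleaned record, or None if the line does not match the structure."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != EXPECTED_FIELDS:
        return None
    try:
        amount = float(fields[2])
    except ValueError:
        return None
    record = [fields[i] for i in KEEP_COLUMNS]
    record[2] = f"{amount:.2f}"      # consistent amount format
    record[3] = record[3].zfill(3)   # consistent 3-digit numeric country code
    return record


def clean(lines):
    """Apply clean_line to every line and keep only the valid records."""
    cleaned = (clean_line(line) for line in lines)
    return [record for record in cleaned if record is not None]


raw = [
    "t1, 2015-06-01, 19.9, 276, ok",
    "broken line without enough fields",
    "t2, 2015-06-01, abc, 276, bad amount",
    "t3, 2015-06-02, 5, 40, ",
]
print(clean(raw))
```

Because `clean_line` handles one line at a time and keeps no state, the same function could be applied per record in a distributed map-and-filter step.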

Find Country char code
• Join

Numeric Format   Alpha 2 Format   Name
276              DE               Germany

• Problem with skew in the key distribution
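
To see why a skewed key hurts a join, the sketch below counts how many rows each partition receives when rows are partitioned by key. It is a simplification under stated assumptions: the key distribution is invented, `partition_sizes` is a hypothetical helper, and `key % num_partitions` stands in for a hash partitioner.

```python
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count how many rows land in each partition when partitioning by key."""
    counts = Counter(key % num_partitions for key in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


# Hypothetical skewed data: 90% of the rows carry numeric country code 276.
keys = [276] * 900 + [250] * 50 + [203] * 30 + [233] * 20

sizes = partition_sizes(keys, num_partitions=4)
print(sizes)
```

All rows with the dominant key end up in the same partition, so a single task does most of the work while the others sit nearly idle.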

Metrics for Join

Find Country char code
• Broadcast Country Codes Map
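
A plain-Python sketch of the broadcast-map alternative to the join: the small country-code table is handed to every partition and looked up locally, so no rows need to be shuffled by key. In Spark the table would typically be shipped with `SparkContext.broadcast` and read through `.value`; here it is simply passed as an argument. Only the 276 to DE pair comes from the slides. The other entries, the `"??"` fallback and the function names are assumptions.

```python
# Small lookup table: numeric country code -> alpha-2 code.
# Only the 276 -> DE pair comes from the slides; the rest is illustrative.
COUNTRY_CODES = {"276": "DE", "250": "FR", "380": "IT"}


def add_alpha2(record, codes):
    """Append the alpha-2 code for the record's numeric country code."""
    numeric = record[-1]
    return record + [codes.get(numeric, "??")]


def transform_partition(records, codes):
    """Map-side lookup: each partition uses its own copy of the small map."""
    return [add_alpha2(record, codes) for record in records]


partitions = [
    [["t1", "19.90", "276"], ["t2", "5.00", "276"]],
    [["t3", "7.50", "250"], ["t4", "1.00", "999"]],
]
result = [transform_partition(part, COUNTRY_CODES) for part in partitions]
print(result)
```

Each partition is processed where it already lives, so a skewed key no longer concentrates the work in one task. This only works while the lookup table is small enough to copy to every worker.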

Metrics

Transformation with Join vs Broadcasted Map (skewed key)
[Chart: run time in seconds (axis from 0 to 300) against number of rows (1, 2 and 3 million), with one series for Join and one for Broadcasted Map.]

Spark-Job-Rest
• Supports multiple contexts
• Launches a new process for each Spark context
• Inter-process communication with Akka actors
• Easy context creation & job runs
• Supports Java and Scala code
• Friendly UI
https://github.com/Atigeo/spark-job-rest

Build a data warehouse
• Hive
• Apache Pig
• Impala
• Presto
• Stinger (Hive on Tez)
• Spark SQL

Spark SQL
• Support for multiple input formats
• Rich language interfaces
• RDD-aware optimizer
[Diagram labels: RDD; DataFrame / SchemaRDD; JDBC; HiveQL; SQL]

Creating a data frame

Explore data
Perform a simple query:
> Directly on the data frame
  - select
  - filter
  - join
  - groupBy
  - agg
  - count
  - sort
  - where, etc.
> Registering a temporary table

Creating a data warehouse
https://github.com/Atigeo/xpatterns-spark-parquet

File Formats
• TextFile
• SequenceFile
• RCFile (RowColumnar)
• ORCFile (OptimizedRowColumnar)
• Avro
• Parquet
  > columnar format
  > good for aggregation queries
  > only the required columns are read from disk
  > nested data structures
  > schema with the data
  > Spark SQL supports schema evolution
  > efficient compression
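
The benefit of a columnar layout for aggregation queries can be illustrated with a toy in-memory example. This is not Parquet itself: the records and the `to_columns` helper are hypothetical, and the point is only that summing one field touches a single column in the columnar layout.

```python
rows = [
    {"id": "t1", "amount": 19.9, "country": "DE"},
    {"id": "t2", "amount": 5.0, "country": "DE"},
    {"id": "t3", "amount": 7.5, "country": "FR"},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every whole record is visited to sum one field.
total_from_rows = sum(record["amount"] for record in rows)

# Column layout: only the "amount" column is touched.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)
```

On disk the same idea means a query that needs only `amount` can skip the bytes of every other column, and values of one type stored together tend to compress well.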

Tachyon
• Memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks
• Pluggable underlying file system: HDFS, S3,…

Caching in Spark SQL
• Caches data in a columnar format
• Automatically tunes compression

Spark cache vs Tachyon
• The Spark context might crash, taking the data cached in it along
• GC kicks in when data is cached on the JVM heap
• Tachyon lets different applications share data

Jaws Spark SQL REST
- Highly scalable and resilient data warehouse
- Submit queries concurrently and asynchronously
- RESTful alternative to Spark SQL JDBC, with an interactive UI
- Available since Spark 0.9.1 with Shark
- Support for Spark SQL and Hive on MapReduce (and more to come)
https://github.com/Atigeo/jaws-spark-sql-rest

Jaws main features
- Akka actors to communicate across instances
- Supports cancelling queries
- Supports large results retrieval
- Parquet in-memory warehouse
- Returns persisted logs, results and query history
- Provides a metadata browser
- Configuration file to fine-tune Spark

Code available at
https://github.com/big-data-research/in-memory-data-pipeline

Q & A
© 2013 Atigeo, LLC. All rights reserved. Atigeo and the xPatterns logo are trademarks of Atigeo. The information herein is for informational purposes only and represents the current view of Atigeo as of
the date of this presentation. Because Atigeo must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Atigeo, and Atigeo cannot guarantee the
accuracy of any information provided after the date of this presentation. ATIGEO MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

