Introduction to Apache Tajo:
Future of Data Warehouse
Jihoon Son / Gruter Inc.
I am
● Jihoon Son (@jihoonson)
○ Ph.D. at Korea University
○ Tajo project co-founder
○ Committer and PMC member of Apache Tajo
○ Research engineer at Gruter
○ LinkedIn
■ https://www.linkedin.com/in/jihoonson
Today's Topic: Tajo
● What is Tajo?
○ Tajo / tάːzo / 타조
○ "Ostrich" in Korean
■ Fastest two-legged animal in the world
● What is Apache Tajo?
○ Our ostrich can do SQL processing on big data!
■ SQL-on-Hadoop system
■ Apache top-level project
Maybe You Think ...
● SQL-on-Hadoop? Boring...
This Ostrich is Different!
SQL-on-Hadoop Systems
● Long-running ETL jobs
● Low-latency interactive analysis
SQL-on-Hadoop Systems: Long-running ETL jobs
● Requirements
○ Stable query execution
■ Fault tolerance
● Can avoid query resubmission
○ Adaptation to dynamic environments
■ Available resources, unpredictable delays, ...
SQL-on-Hadoop Systems: Low-latency interactive analysis
● Requirements
○ Fast query execution
■ Several query execution techniques
■ In-memory processing
Tajo is Designed for Both Workloads
● Long-running ETL jobs
● Low-latency interactive analysis
Who is using Tajo?
Use Cases: SK Telecom
● Data warehousing & analysis
○ 1st telco in South Korea
■ 40 TB/day compressed data (2014)
SK Telecom: Before Tajo
(Figure: ETL pipeline across Operational Systems, Integration Layer, and Data Warehouse. Components shown: Marketing, Sales, ERP, SCM, ODS, Staging Area, Data Vault, Data Marts, Strategic Marts. Platforms shown: Hadoop, MPP DBMS.)
SK Telecom: After Tajo
(Figure: ETL pipeline across Operational Systems, Integration Layer, and Data Warehouse. Components shown: Marketing, Sales, ERP, SCM, ODS, Staging Area, Data Vault, Data Marts, Strategic Marts.)
● Long-running ETL jobs
● Ad-hoc analysis
Use Cases: SK Telecom
● Significantly reduced ETL & analysis time
○ Daily analysis becomes possible
○ More exploratory analysis becomes possible with the remaining resources
Use Cases: Bluehole Studio
● Game log analysis
○ Finding principal causes of service-quality deficiencies
● Tajo on EMR
● Their first log analysis system
○ Easy and rapid deployment of Tajo
○ Low learning curve thanks to standard SQL
● Immediate action becomes possible for user complaints and hidden bugs
Use Cases: Melon
● Data discovery
○ Music streaming service (26 million users)
○ Analysis of purchase history for targeted marketing
● Significantly reduced analysis time
○ Faster analysis by replacing Hive with Tajo
○ More analysis becomes possible
So, Why should you use Tajo?
● Easy to use
○ ANSI SQL standard compliance (SQL:2003)
■ CTAS, window functions, ...
○ Mature SQL features
■ Most existing queries can be executed without modification
○ Various data format support
■ Text, JSON, ORC, Parquet, ...
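The two SQL features named above, CTAS (CREATE TABLE AS SELECT) and window functions, can be illustrated with a minimal sketch. It runs on SQLite (3.25 or later) purely as a stand-in engine, not on Tajo, and the `plays` table and its columns are hypothetical names chosen for illustration.

```python
import sqlite3


def top_track_per_user():
    """Rank tracks per user with a window function and store the result via CTAS."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical listening log: one row per (user, track) with a play count.
    conn.execute(
        "CREATE TABLE plays (user_id INTEGER, track TEXT, play_count INTEGER)"
    )
    conn.executemany(
        "INSERT INTO plays VALUES (?, ?, ?)",
        [(1, "a", 5), (1, "b", 9), (2, "a", 3), (2, "c", 1)],
    )
    # CTAS: the new table is defined by the query result.
    # RANK() OVER (...) is the window function; it ranks rows inside each user.
    conn.execute(
        """
        CREATE TABLE ranked AS
        SELECT user_id,
               track,
               RANK() OVER (PARTITION BY user_id ORDER BY play_count DESC) AS rnk
        FROM plays
        """
    )
    rows = conn.execute(
        "SELECT user_id, track FROM ranked WHERE rnk = 1 ORDER BY user_id"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(top_track_per_user())
```

Because both constructs are part of the SQL standard, a query written this way should need little or no change when moved between compliant engines, which is the point the slide makes.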
So, Why should you use Tajo?
● Optimized performance
○ Optimized code
■ Optimized I/O performance
● Nearly max I/O performance (~120 MB/s) per disk
■ Off-heap data processing
● Mitigates GC overhead
○ Cost-based query plan optimization
■ Join ordering
■ Best algorithm selection
● According to input size
■ Progressive optimization
● Further optimizes the query plan during query execution
● Especially effective for long-running queries
■ => Efficient star schema processing
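The ideas of size-based algorithm selection, join ordering, and progressive re-optimization can be sketched in a few lines. This is an illustration only and not Tajo's actual optimizer: the `Relation` type, the function names, the size thresholds, and the "join smallest inputs first" heuristic are all assumptions made for the example.

```python
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass(frozen=True)
class Relation:
    """Hypothetical input to a join, described only by its size."""
    name: str
    size_bytes: int


def choose_join_algorithm(left, right, broadcast_limit=64 * MB, hash_limit=1024 * MB):
    """Pick a join algorithm from the size of the smaller input (assumed thresholds)."""
    smaller = min(left.size_bytes, right.size_bytes)
    if smaller <= broadcast_limit:
        return "broadcast"   # ship the small side to every worker
    if smaller <= hash_limit:
        return "hash"        # build an in-memory hash table on the small side
    return "sort-merge"      # neither side fits in memory


def order_joins(relations):
    """Greedy join ordering heuristic: process the smallest inputs first."""
    return [r.name for r in sorted(relations, key=lambda r: r.size_bytes)]


def replan(left, right_estimate, right_observed):
    """Progressive optimization: choose again once the real input size is known."""
    planned = choose_join_algorithm(left, right_estimate)
    adjusted = choose_join_algorithm(left, right_observed)
    return planned, adjusted


if __name__ == "__main__":
    fact = Relation("fact", 500 * 1024 * MB)
    dim_estimate = Relation("dim", 2048 * MB)   # estimate before execution
    dim_observed = Relation("dim", 10 * MB)     # size seen after filtering at run time
    print(replan(fact, dim_estimate, dim_observed))
    print(order_joins([fact, dim_estimate, Relation("tiny", MB)]))
```

In a star schema the dimension tables are usually much smaller than the fact table, so a size-aware choice tends to pick the cheap broadcast path for them, which is one way to read the last bullet above.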
So, Why should you use Tajo?
● Various storage type support
Logical Data Warehouse with Tajo
(Figure: a global view spanning Application, DBMS, NoSQL, cloud storage, and on-premise storage.)
● Fast delivery
● Easy maintenance
● Simple data flow
How fast is Tajo?
Evaluation on Cloud Environment
● Google Cloud Platform
○ Instance type: n1-standard-8
■ 8 cores, 30 GB RAM
Target Systems
● Hive (0.12)
○ Baseline performance
○ Default configuration provided by GCP
■ Uses all available CPU and memory
● Tajo (0.11.0)
○ Default configuration provided by GCP
■ Uses all available CPU and memory
● Spark SQL (1.5.0)
○ Default configuration provided by GCP
■ Uses all available CPU and memory
■ Tungsten enabled by default
○ spark.sql.shuffle.partitions is adjusted for better performance
TPC-DS
● Data
○ 24 tables
■ Plain text format
■ Stored on Google Cloud Storage
● Queries
○ Only queries that can be executed on every system without modification
■ Exception: Hive 0.12 doesn't support implicit joins, so every query had to be changed for Hive
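For readers unfamiliar with the term, an implicit join lists tables in the FROM clause separated by commas and puts the join condition in WHERE, while an explicit join uses JOIN ... ON. The sketch below shows the kind of rewrite described above and checks that both forms return the same rows. It uses SQLite as a stand-in engine with a tiny made-up data set; the query is not one of the benchmark queries, although the table and column names follow the TPC-DS schema.

```python
import sqlite3

# Implicit join: comma-separated tables, join condition in WHERE.
IMPLICIT = """
SELECT d.d_year, SUM(ss.ss_net_paid)
FROM store_sales ss, date_dim d
WHERE ss.ss_sold_date_sk = d.d_date_sk
GROUP BY d.d_year
ORDER BY d.d_year
"""

# Explicit join: the same query rewritten with JOIN ... ON.
EXPLICIT = """
SELECT d.d_year, SUM(ss.ss_net_paid)
FROM store_sales ss
JOIN date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
GROUP BY d.d_year
ORDER BY d.d_year
"""


def run_both():
    """Run both query forms on the same toy data and return both results."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE date_dim (d_date_sk INTEGER, d_year INTEGER)")
    conn.execute("CREATE TABLE store_sales (ss_sold_date_sk INTEGER, ss_net_paid REAL)")
    conn.executemany("INSERT INTO date_dim VALUES (?, ?)", [(1, 2001), (2, 2002)])
    conn.executemany(
        "INSERT INTO store_sales VALUES (?, ?)", [(1, 10.0), (1, 2.5), (2, 4.0)]
    )
    implicit_rows = conn.execute(IMPLICIT).fetchall()
    explicit_rows = conn.execute(EXPLICIT).fetchall()
    conn.close()
    return implicit_rows, explicit_rows


if __name__ == "__main__":
    implicit_rows, explicit_rows = run_both()
    print(implicit_rows == explicit_rows, implicit_rows)
```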
SF 1000, 50 instances
(Figure: result charts; one annotation reads "Cannot be run on 1TB".)
SF 10000, 50 instances
(Figure: result charts.)
Demo
Simple Demo on EMR
● Using the TPC-H data set, but
○ Lineitem table is stored on HDFS
○ Orders table is stored on PostgreSQL
○ Other tables are stored on S3
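The shape of such a federated query, one SQL statement joining tables that live in different stores, can be imitated in a small self-contained sketch. Here three separate in-memory SQLite databases stand in for HDFS, PostgreSQL, and S3; they are not real connections to those systems, and the schema aliases `hdfs`, `pgsql`, and `s3` and the sample rows are assumptions made for illustration. Column names follow the TPC-H schema.

```python
import sqlite3


def federated_revenue():
    """Join tables held in three separate databases with a single query."""
    conn = sqlite3.connect(":memory:")
    # Each ATTACH creates an independent in-memory database. They are
    # stand-ins for the three storage systems in the demo, nothing more.
    conn.execute("ATTACH DATABASE ':memory:' AS hdfs")
    conn.execute("ATTACH DATABASE ':memory:' AS pgsql")
    conn.execute("ATTACH DATABASE ':memory:' AS s3")

    conn.execute("CREATE TABLE hdfs.lineitem (l_orderkey INTEGER, l_extendedprice REAL)")
    conn.execute("CREATE TABLE pgsql.orders (o_orderkey INTEGER, o_custkey INTEGER)")
    conn.execute("CREATE TABLE s3.customer (c_custkey INTEGER, c_name TEXT)")

    conn.executemany(
        "INSERT INTO hdfs.lineitem VALUES (?, ?)", [(1, 10.0), (1, 5.0), (2, 7.5)]
    )
    conn.executemany("INSERT INTO pgsql.orders VALUES (?, ?)", [(1, 100), (2, 200)])
    conn.executemany(
        "INSERT INTO s3.customer VALUES (?, ?)", [(100, "alice"), (200, "bob")]
    )

    # One query spans all three "stores"; the caller sees a single global view.
    rows = conn.execute(
        """
        SELECT c.c_name, SUM(l.l_extendedprice) AS revenue
        FROM hdfs.lineitem AS l
        JOIN pgsql.orders AS o ON l.l_orderkey = o.o_orderkey
        JOIN s3.customer AS c ON o.o_custkey = c.c_custkey
        GROUP BY c.c_name
        ORDER BY c.c_name
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(federated_revenue())
```

The point of the sketch is the query text: the storage location appears only as a qualifier on the table name, so the join logic stays the same wherever the data lives.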
Apache Tajo
● Is excellent for both long-running ETL jobs and exploratory ad-hoc analysis
● Is very fast
● Supports query federation on diverse data sources
Get Involved!
● We are recruiting contributors!
● General
○ http://tajo.apache.org/
● Getting Started
○ http://tajo.apache.org/docs/current/getting_started.html
● Downloads
○ http://tajo.apache.org/downloads.html
● Issue tracker
○ http://issues.apache.org/jira/browse/TAJO
● Join the mailing list
○ dev-subscribe@tajo.apache.org
○ issues-subscribe@tajo.apache.org
Useful Links
● EMR bootstrap
○ https://github.com/awslabs/emr-bootstrap-actions/tree/master/tajo
● How to set up Tajo on EMR
○ http://www.gruter.com/blog/setting-up-a-tajo-cluster-on-amazon-emr/
Q & A

