Apache Tajo: An Open Source Big Data Warehouse
(what's new in recent releases)
HadoopSphere - Virtual Conclave 2015
Hyunsik Choi, Gruter Inc. (hschoi @ gruter.com)
  
Agenda
• Tajo Overview
• Milestones and 0.10 Features
• What's Next
  
Tajo: A Big Data Warehouse System
• Apache Top-level project
• Distributed and scalable data warehouse system on various data sources (e.g., HDFS, S3, HBase, …)
• Low-latency and long-running batch queries in a single system
• Features
    • ANSI SQL compliance
    • Mature SQL features
    • Partitioned table support
    • Java/Python UDF support
    • JDBC driver and Java-based asynchronous API
    • Read/write support for CSV, JSON, RCFile, SequenceFile, Parquet, ORC
  
 
 
 
 
[Diagram: a Master Server running TajoMaster, and a Slave Server]

Common Scenarios
• Extraction, Transformation, Loading (ETL)
• Interactive BI/analytics on web-scale big data
• Data discovery and exploratory analysis with R and existing SQL tools

Use Cases: Replacement of Commercial DW
• Example: the largest telecom company in South Korea
• Goal:
    • Replace slow ETL workloads on datasets of several TB
    • Generate many daily reports on user behavior
    • Run ad-hoc analysis on terabyte-scale data sets
• Key benefits of Tajo:
    • Simplification of DW ETL, OLAP, and Hadoop ETL into a unified system
    • Savings on commercial DW license fees
    • Much lower cost, and more data analysis within the same SLA

Use Cases: Data Discovery
• Example: music streaming service (26 million users)
• Goal:
    • Analysis of purchase history for target marketing
• Benefits:
    • Query interactivity on large data sets
    • Ability to use existing BI visualization tools

When Is Tajo the Right Choice?
• You want a unified system for batch and interactive queries on Hadoop, Amazon S3, or HBase.
• You want mixed use of a Hadoop-based DW and an RDBMS-based DW, or want to replace an existing RDBMS DW.
• You want to use existing SQL tools on a Hadoop DW.

Milestones
• 0.8: More features, SQL compatibility, stability
• 0.9: Analytical function
• 0.10: Eco-system expansion
• 0.11: More features
    • Python UDF
    • Nested schema
    • Tablespace support
    • Query federation
    • Better query scheduler

Selected Features in 0.10

HBase Storage Support
• You can use SQL to access HBase tables.
• Tajo supports HBase storage
    • CREATE (EXTERNAL) / DROP / INSERT (OVERWRITE) / SELECT
    • Bulk insertion through direct HFile writing

    CREATE TABLE hbase_t1 (key TEXT, col1 TEXT, col2 INT)
    USING hbase
    WITH (
      'table' = 't1',
      'columns' = ':key,cf1:col1,cf2:col2',
      'hbase.zookeeper.quorum' = 'host1:2181,host2:2181'
    )

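The 'columns' property in the statement above ties each table column to an HBase target. The following is a minimal Python sketch of how such a mapping string can be read. It assumes, as the example suggests, that entries line up positionally with the declared columns and that `:key` stands for the row key. The function name `parse_hbase_columns` and the `"<row key>"` marker are hypothetical and are not part of Tajo.

```python
def parse_hbase_columns(mapping, table_columns):
    """Pair each declared table column with its HBase target.

    Assumption: the i-th mapping entry belongs to the i-th table column.
    """
    entries = [entry.strip() for entry in mapping.split(",")]
    if len(entries) != len(table_columns):
        raise ValueError("mapping and column list differ in length")

    targets = {}
    for column, entry in zip(table_columns, entries):
        # "cf1:col1" -> family "cf1", qualifier "col1"; ":key" -> family ""
        family, _, qualifier = entry.partition(":")
        if family == "" and qualifier == "key":
            targets[column] = ("<row key>", None)  # hypothetical marker
        else:
            targets[column] = (family, qualifier)
    return targets


if __name__ == "__main__":
    result = parse_hbase_columns(":key,cf1:col1,cf2:col2",
                                 ["key", "col1", "col2"])
    for column, target in result.items():
        print(column, "->", target)
```

Read this way, `key` maps to the row key, `col1` to qualifier `col1` in column family `cf1`, and `col2` to qualifier `col2` in column family `cf2`.
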
Better AWS Support
• Optimized for S3 and EMR environments
• Fixed many bugs related to S3
• EMR bootstrap supported in the AWS Labs GitHub repo
• A quick guide for Tajo on EMR
    • http://www.gruter.com/blog/setting-up-a-tajo-cluster-on-amazon-emr/
• EMR bootstrap for Tajo on EMR
    • https://github.com/awslabs/emr-bootstrap-actions/tree/master/tajo

Better SQL Tool Support via Thin JDBC
[Diagram: ETL tools, BI tools, and reporting tools connect through Tajo JDBC to a Tajo cluster, which sits on HDFS, HBase, S3, and Swift]

Improved Performance and Stability
• Off-heap sort operator for ORDER BY (TAJO-907)
• Hash shuffle I/O improvement (TAJO-374, TAJO-987)
• Skew handling for hash shuffle
• Automatic choice of the degree of parallelism at runtime
• Many query optimizer improvements
• Added Master HA (TAJO-704)
• More error-tolerant shuffle fetch (TAJO-789, TAJO-953)

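The slide does not say how Tajo handles skew, so the Python sketch below only illustrates the problem that skew handling addresses: when one key dominates, hash partitioning sends most rows to a single partition. The function names, the CRC32-based hash, and the sample keys are all assumptions made for this illustration.

```python
import zlib
from collections import Counter


def partition_of(key, num_partitions):
    """Assign a key to a partition with a deterministic hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_sizes(keys, num_partitions):
    """Count how many rows land in each partition."""
    counts = Counter(partition_of(key, num_partitions) for key in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew_ratio(sizes):
    """Largest partition divided by the mean partition size."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean if mean else 0.0


if __name__ == "__main__":
    # 900 rows share one hot key; 100 rows have distinct keys.
    keys = ["hot"] * 900 + ["k%d" % i for i in range(100)]
    sizes = partition_sizes(keys, 4)
    print("partition sizes:", sizes)
    print("skew ratio:", round(skew_ratio(sizes), 2))
```

All rows with the hot key land in the same partition, so one worker receives at least 900 of the 1,000 rows regardless of how many partitions exist.
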
What's New in Tajo 0.11

Nested Data and JSON Support
• Nested data is becoming common
    • JSON, BSON, XML, Protocol Buffer, Avro, Parquet, …
    • Many web applications commonly use JSON.
    • MongoDB uses JSON documents by default.
    • Many HBase users also store JSON documents in a cell.
• Flattening causes a lot of data and computation overhead.
• Tajo 0.11 natively supports nested data types.

How to Create a Nested Schema Table
• Use the 'RECORD' keyword to define a complex data type.

Loose Schema for Self-Describing Formats
• You can handle schema evolution with ALTER ADD COLUMN.

How to Retrieve Nested Fields
[Figure: three panels labeled Input Data, Table Definition, and SQL]

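The example on this slide did not survive as text. As a rough Python illustration of what retrieving a nested field involves, the sketch below resolves a dotted path against one JSON record. The record, the field names, and the helper `get_nested` are made up for this example and do not come from the deck.

```python
import json


def get_nested(record, path):
    """Follow a dotted path such as 'name.first' through nested dicts.

    Returns None when any step of the path is missing.
    """
    current = record
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


if __name__ == "__main__":
    # Hypothetical input line, as it might appear in a JSON data file.
    line = '{"id": 1, "name": {"first": "Ada", "last": "Lovelace"}}'
    record = json.loads(line)
    print(get_nested(record, "name.first"))
    print(get_nested(record, "name.middle"))
```

Keeping the record nested avoids copying the parent fields into one row per child value, which is the flattening overhead mentioned earlier in the deck.
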
Query Federation and Tablespace Support
• Query support across multiple data sources
    • You can perform a join or union among tables on different systems.
• Benefits:
    • Data offload from an RDBMS to Hadoop and vice versa
    • Mixed use of an existing RDBMS and Hadoop
    • Access to NoSQL and various storages through SQL
    • A unified interface for SQL tools
[Diagram: Apache Tajo on top of HDFS, NoSQL, S3, and Swift]

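To make the idea of a join across different systems concrete, here is a small Python sketch under loud assumptions: an in-memory SQLite database stands in for the RDBMS, a list of JSON lines stands in for files on Hadoop, and the join is a plain hash join. The table, the field names, and the data are hypothetical, and this is not how Tajo executes federated queries internally.

```python
import json
import sqlite3


def rdbms_customers():
    """Stand-in for an RDBMS source: returns (id, name) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)",
                     [(1, "alice"), (2, "bob")])
    rows = conn.execute("SELECT id, name FROM customers").fetchall()
    conn.close()
    return rows


# Stand-in for JSON log files stored on a distributed file system.
LOG_LINES = [
    '{"customer_id": 1, "amount": 30}',
    '{"customer_id": 1, "amount": 12}',
    '{"customer_id": 3, "amount": 5}',
]


def federated_join(customers, log_lines):
    """Hash join: build on the small RDBMS side, probe with log events."""
    name_by_id = {cid: name for cid, name in customers}
    joined = []
    for line in log_lines:
        event = json.loads(line)
        name = name_by_id.get(event["customer_id"])
        if name is not None:  # inner join: drop unmatched events
            joined.append((name, event["amount"]))
    return joined


if __name__ == "__main__":
    print(federated_join(rdbms_customers(), LOG_LINES))
```

The point of federation is that the user writes one SQL statement and the engine plays the role of `federated_join`, hiding where each table lives.
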
Datasets Stored in Various Formats/Storages
[Figure: data formats (SequenceFile, RCFile, Protocol Buffer, ORC) and storage types]

Tablespace
• Tablespace
    • A registered storage space
    • A tablespace is identified by a unique URI
    • Configuration and policy are shared by all tables in the same tablespace
• It allows users to reuse registered storages and their configuration.

Tablespace Configuration
[Figure: configuration example with callouts for the tablespace name and the tablespace URI]

Create Table on a Specified Tablespace

    CREATE TABLE uptodate (key TEXT, …) TABLESPACE hbase1;

    CREATE TABLE archive (l_orderkey bigint, …) TABLESPACE warehouse
    USING text WITH ('text.delimiter' = '|');

Here hbase1 and warehouse are tablespace names, and text is the format name.

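The tablespace properties listed above (a name, a unique URI, and configuration shared by every table in the space) can be modeled in a few lines of Python. This is only a conceptual sketch: the class `TablespaceRegistry`, its methods, the URIs, and the `replication` setting are hypothetical and do not reflect Tajo's configuration format or API.

```python
class TablespaceRegistry:
    """Toy registry: each tablespace has a unique name and a unique URI."""

    def __init__(self):
        self._by_name = {}
        self._uris = set()

    def register(self, name, uri, config=None):
        if name in self._by_name:
            raise ValueError("tablespace name already registered: " + name)
        if uri in self._uris:
            raise ValueError("tablespace URI already registered: " + uri)
        self._by_name[name] = {"uri": uri, "config": dict(config or {})}
        self._uris.add(uri)

    def table_location(self, space_name, table_name):
        """Tables live under the URI of the tablespace they were created in."""
        uri = self._by_name[space_name]["uri"]
        return uri.rstrip("/") + "/" + table_name

    def config_for(self, space_name):
        """Every table in a tablespace sees the same configuration."""
        return self._by_name[space_name]["config"]


if __name__ == "__main__":
    registry = TablespaceRegistry()
    # Hypothetical URI and setting, for illustration only.
    registry.register("warehouse", "hdfs://namenode:8020/tajo/warehouse",
                      {"replication": 3})
    print(registry.table_location("warehouse", "archive"))
    print(registry.config_for("warehouse"))
```

Registering the storage once and referring to it by name is what lets a CREATE TABLE statement say only `TABLESPACE warehouse` instead of repeating connection details.
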
Operation Push Down

    SELECT x, SUM(y) FROM table1 WHERE x > 100 GROUP BY x

• Filter, projection, or group-by operations can be pushed down into the underlying storage (such as an RDBMS, HBase, Elasticsearch, …).

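The effect of push-down can be seen by counting how many rows leave the underlying storage. The Python sketch below uses an in-memory SQLite database as a stand-in for that storage. The table contents and the `x > 100` predicate are assumptions made for this example, and the two functions are hypothetical rather than Tajo code.

```python
import sqlite3


def build_source():
    """Stand-in for the underlying storage: 400 rows, two per x in 1..200."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (x INTEGER, y INTEGER)")
    conn.executemany("INSERT INTO table1 VALUES (?, ?)",
                     [(x, y) for x in range(1, 201) for y in (1, 2)])
    return conn


def scan_without_pushdown(conn):
    """Fetch every row, then filter and aggregate in the query engine."""
    rows = conn.execute("SELECT x, y FROM table1").fetchall()
    totals = {}
    for x, y in rows:
        if x > 100:
            totals[x] = totals.get(x, 0) + y
    return len(rows), totals


def scan_with_pushdown(conn):
    """Let the storage apply the filter and group-by before returning rows."""
    rows = conn.execute(
        "SELECT x, SUM(y) FROM table1 WHERE x > 100 GROUP BY x").fetchall()
    return len(rows), dict(rows)


if __name__ == "__main__":
    conn = build_source()
    fetched_plain, totals_plain = scan_without_pushdown(conn)
    fetched_pushed, totals_pushed = scan_with_pushdown(conn)
    print("rows fetched without push-down:", fetched_plain)
    print("rows fetched with push-down:", fetched_pushed)
    print("same result:", totals_plain == totals_pushed)
```

Both strategies produce the same totals, but with push-down only one row per group crosses the boundary between storage and engine.
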
Current Status of Storages
• Storages:
    • HDFS support
    • Amazon S3 and OpenStack Swift
    • HBase Scanner and Writer (HFile and Put modes)
    • JDBC-based Scanner and Writer (work in progress)
    • Kafka Scanner (patch available)
    • Elasticsearch (patch available)
• Data formats:
    • Text, JSON, RCFile, SequenceFile, Avro, Parquet, and ORC (patch available)

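Python UDF
• Python UDF and UDAF are supported in Tajo
• http://tajo.apache.org/docs/devel/functions/python.html

    # output_type is imported from tajo_util, as in the Tajo Python function docs
    from tajo_util import output_type

    # Returns the constant 1 as an int4
    @output_type('int4')
    def return_one():
        return 1

    # Returns a fixed greeting as text
    @output_type('text')
    def helloworld():
        return 'Hello, World'

    # Adds two values and returns an int4
    @output_type('int4')
    def sum_py(a, b):
        return a + b
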
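UDFs like the ones above are ordinary Python functions, so their logic can be checked outside Tajo. The sketch below does this with a local stand-in for the `output_type` decorator. The stand-in and the `declared_output_type` attribute are hypothetical; the real decorator is supplied by Tajo and may behave differently.

```python
def output_type(type_name):
    """Local stand-in for Tajo's decorator: only records the declared type."""
    def decorator(func):
        func.declared_output_type = type_name  # hypothetical attribute
        return func
    return decorator


@output_type('int4')
def return_one():
    return 1


@output_type('text')
def helloworld():
    return 'Hello, World'


@output_type('int4')
def sum_py(a, b):
    return a + b


if __name__ == "__main__":
    # Plain calls exercise the function bodies without a running cluster.
    print(return_one(), helloworld(), sum_py(2, 3))
    print(sum_py.declared_output_type)
```

Testing the function bodies this way catches ordinary bugs early; type conversion between SQL and Python values still has to be verified against a real Tajo instance.
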
Get Involved!
• We are recruiting contributors!
• General
    • http://tajo.apache.org
• Getting Started
    • http://tajo.apache.org/docs/0.10.0/getting_started.html
• Downloads
    • http://tajo.apache.org/downloads.html
• Jira (issue tracker)
    • https://issues.apache.org/jira/browse/TAJO
• Join the mailing list
    • dev-subscribe@tajo.apache.org
    • issues-subscribe@tajo.apache.org