http://kylin.io
Apache Kylin Introduction
韩卿|Luke Han
Sr. Product Manager | lukehan@apache.org | @lukehq
v2015.3
Agenda
• What’s Apache Kylin?
• Features & Tech Highlights
• Performance
• Roadmap
• Q & A
What’s Kylin
Extreme OLAP Engine for Big Data
Kylin is an open source distributed analytics engine from eBay that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.
kylin / ˈkiːˈlɪn / 麒麟
--n. (in Chinese art) a mythical animal of composite form
• Open sourced on Oct 1st, 2014
• Accepted as an Apache Incubator project on Nov 25th, 2014
Big Data Era
• More and more data is becoming available on Hadoop
• Limitations of existing Business Intelligence (BI) tools
  • Limited support for Hadoop
  • Data size growing exponentially
  • High latency of interactive queries
  • Scale-up architecture
• Challenges in adopting Hadoop as an interactive analysis system
  • Majority of analyst groups are SQL savvy
  • No mature SQL interface on Hadoop
  • OLAP capability in the Hadoop ecosystem is not ready yet
Business Needs for Big Data Analysis
• Sub-second query latency on billions of rows
• ANSI SQL for both analysts and engineers
• Full OLAP capability to offer advanced functionality
• Seamless integration with BI tools
• Support for high cardinality and high dimensions
• High concurrency – thousands of end users
• Distributed, scale-out architecture for large data volumes
Why not build an engine from scratch?
Analytics Query Taxonomy
[Pyramid diagram; side labels: Strategy, Operation, Transaction; OLAP, OLTP]
• High Level Aggregation: very high level, e.g. GMV by site by vertical by weeks
• Analysis Query: middle level, e.g. GMV by site by vertical by category (level x), past 12 weeks
• Drill Down to Detail: detail level (summary table)
• Low Level Aggregation: first-level aggregation
• Transaction Level: transaction data
Kylin is designed to accelerate 80+% of analytics queries on Hadoop
Technical Challenges
• Huge volume of data
  • Table scan
• Big table joins
  • Data shuffling
• Analysis at different granularities
  • Runtime aggregation is expensive
• MapReduce jobs
  • Batch processing
OLAP Cube – Balance between Space and Time
Cuboid lattice for four dimensions (time, item, location, supplier):
• 0-D (apex) cuboid: no dimensions, the full aggregate
• 1-D cuboids: time; item; location; supplier
• 2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
• 3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
• 4-D (base) cuboid: (time, item, location, supplier)
• Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells
1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier>
2. (9/15, milk, Urbana, *) - <time, item, location>
3. (*, milk, Urbana, *) - <item, location>
4. (*, milk, Chicago, *) - <item, location>
5. (*, milk, *, *) - <item>
• Cuboid = one combination of dimensions
• Cube = all combinations of dimensions (all cuboids)
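The cuboid lattice above can be enumerated mechanically. The sketch below uses the four dimension names from the slide; the function name `enumerate_cuboids` is made up for illustration and is not part of Kylin.

```python
from itertools import combinations

# Dimension names taken from the slide's example.
DIMENSIONS = ("time", "item", "location", "supplier")


def enumerate_cuboids(dimensions):
    """Return {k: [cuboids with k dimensions]} for k = 0..len(dimensions)."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }


cuboids = enumerate_cuboids(DIMENSIONS)
total = sum(len(group) for group in cuboids.values())

for k, group in cuboids.items():
    print(f"{k}-D cuboids ({len(group)}): {group}")
print(f"total cuboids: {total}")  # 2 ** 4 = 16
```

Each level k holds "4 choose k" cuboids (1, 4, 6, 4, 1), which is why the full cube for N dimensions has 2^N cuboids.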
From Relational to Key-Value
Kylin Architecture Overview
Components shown in the diagram:
• 3rd party apps (web app, mobile…) via REST API
• SQL-based tools (BI tools: Tableau…) via JDBC/ODBC
• REST Server
• Query Engine
• Routing: low latency (seconds) and mid latency (minutes)
• Metadata
• Cube Build Engine (MapReduce…)
• Hadoop / Hive: star schema data
• OLAP Cube (HBase): key-value data
Legend:
• Online analysis data flow
• Offline data flow
Notes:
• Clients/users interact with Kylin via SQL
• The OLAP cube is transparent to users
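As a rough illustration of the REST path in the diagram, the sketch below builds (but does not send) a SQL query request. The `/kylin/api/query` path and the `sql`/`project` body fields follow Kylin's documented REST API as far as I know; the host, port, credentials, and project name are placeholders, and the details should be checked against the Kylin version in use.

```python
import base64
import json
import urllib.request


def build_query_request(base_url, user, password, sql, project):
    """Build (but do not send) a POST request for Kylin's query endpoint."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    body = json.dumps({"sql": sql, "project": project}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/kylin/api/query",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder host, credentials, and project name (assumptions, not from the slides).
request = build_query_request(
    "http://localhost:7070", "ADMIN", "KYLIN",
    "SELECT COUNT(*) FROM test_kylin_fact", "default",
)
print(request.get_method(), request.full_url)
# To run it against a live server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

The client only sends SQL; whether the answer comes from a cube is decided on the server side, which is what "the OLAP cube is transparent to users" means in practice.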
How Does Kylin Utilize Hadoop Components?
• Hive
  • Input source
  • Pre-joins the star schema during cube building
• MapReduce
  • Pre-aggregates metrics during cube building
• HDFS
  • Stores intermediate files during cube building
• HBase
  • Stores the data cube
  • Serves queries on the data cube
  • Coprocessors are used for query processing
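The division of labour above (pre-join, pre-aggregate, store as key-value) can be mimicked in a few lines of plain Python. This is a toy sketch: the tables, column names, and key layout are invented for illustration and do not reflect Kylin's actual row-key encoding or job implementation.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy star schema: one fact table and one dimension table.
fact_rows = [
    {"cal_dt": "2015-03-01", "site_id": 0, "price": 10.0},
    {"cal_dt": "2015-03-01", "site_id": 1, "price": 5.0},
    {"cal_dt": "2015-03-02", "site_id": 0, "price": 7.5},
]
site_dim = {0: "US", 1: "UK"}

# Step 1 (Hive's role on the slide): pre-join fact and dimension into a flat table.
flat_rows = [
    {"cal_dt": r["cal_dt"], "site_name": site_dim[r["site_id"]], "price": r["price"]}
    for r in fact_rows
]

# Step 2 (MapReduce's role): pre-aggregate SUM(price) and COUNT(*) for every cuboid.
dimensions = ("cal_dt", "site_name")
cube = defaultdict(lambda: [0.0, 0])  # key -> [sum_price, row_count]
for k in range(len(dimensions) + 1):
    for cuboid in combinations(dimensions, k):
        for row in flat_rows:
            # Step 3 (HBase's role): the key is the cuboid plus its dimension values.
            key = (cuboid, tuple(row[d] for d in cuboid))
            cube[key][0] += row["price"]
            cube[key][1] += 1

# "SUM(price), COUNT(*) ... WHERE site_name = 'US'" becomes a single key lookup.
print(cube[(("site_name",), ("US",))])  # [17.5, 2]
```

Query time no longer depends on the number of fact rows, only on the number of pre-aggregated keys that match, which is the space-for-time trade the deck describes.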
Features Highlights
• Extremely Fast OLAP Engine at Scale
  Kylin is designed to reduce query latency on Hadoop for 10+ billion rows of data
• ANSI SQL Interface on Hadoop
  Kylin offers ANSI SQL on Hadoop and supports most ANSI SQL query functions
• Seamless Integration with BI Tools
  Kylin currently integrates with BI tools such as Tableau
• Interactive Query Capability
  Users can interact with Hadoop data via Kylin at sub-second latency, faster than Hive queries on the same dataset
• MOLAP Cube
  Users can define a data model and pre-build it in Kylin over 10+ billion raw data records
Features Highlights…
• Compression and encoding support
• Incremental refresh of cubes
• Approximate query capability for distinct count (HyperLogLog)
• Leverages HBase coprocessors to reduce query latency
• Job management and monitoring
• Easy web interface to manage, build, monitor and query cubes
• Security capability to set ACLs at cube/project level
• LDAP integration support
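To make the approximate distinct count bullet concrete, here is a minimal sketch of the textbook HyperLogLog algorithm. It is not Kylin's implementation; the class, the register count, and the use of SHA-1 as the hash are illustrative choices.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch with 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash: the first p bits pick a register, the rest give the rank.
        h = int.from_bytes(hashlib.sha1(str(item).encode("utf-8")).digest()[:8], "big")
        index = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[index] = max(self.registers[index], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw


hll = HyperLogLog()
for seller_id in range(50_000):
    hll.add(seller_id)
print(round(hll.estimate()))  # an estimate near 50000, not an exact count
```

The sketch keeps only 4,096 small registers regardless of how many values are added, and two sketches can be merged by taking the register-wise maximum, which is what makes the measure cheap to pre-aggregate per cuboid.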
Cube Designer
Job Management
Query and Visualization
Tableau Integration
Data Modeling
• Source: Star Schema (a Fact table with Dim tables)
• Mapping: Cube Metadata
  • Cube: …
  • Fact Table: …
  • Dimensions: …
  • Measures: …
  • Storage (HBase): …
• Target: HBase Storage
  • Row Key (row A, row B, row C) → Column Family → Column (Val 1, Val 2, Val 3)
• Roles: End User, Cube Modeler, Admin
Cube Build Job Flow
How To Store Cube? – HBase Schema
Query Engine – Kylin Explain Plan
SQL:
  SELECT test_cal_dt.week_beg_dt, test_category.category_name, test_category.lvl2_name, test_category.lvl3_name,
         test_kylin_fact.lstg_format_name, test_sites.site_name,
         SUM(test_kylin_fact.price) AS GMV, COUNT(*) AS TRANS_CNT
  FROM test_kylin_fact
  LEFT JOIN test_cal_dt ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt
  LEFT JOIN test_category ON test_kylin_fact.leaf_categ_id = test_category.leaf_categ_id
       AND test_kylin_fact.lstg_site_id = test_category.site_id
  LEFT JOIN test_sites ON test_kylin_fact.lstg_site_id = test_sites.site_id
  WHERE test_kylin_fact.seller_id = 123456 OR test_kylin_fact.lstg_format_name = 'New'
  GROUP BY test_cal_dt.week_beg_dt, test_category.category_name, test_category.lvl2_name, test_category.lvl3_name,
           test_kylin_fact.lstg_format_name, test_sites.site_name
Plan:
  OLAPToEnumerableConverter
    OLAPProjectRel(WEEK_BEG_DT=[$0], category_name=[$1], CATEG_LVL2_NAME=[$2], CATEG_LVL3_NAME=[$3], LSTG_FORMAT_NAME=[$4], SITE_NAME=[$5], GMV=[CASE(=($7, 0), null, $6)], TRANS_CNT=[$8])
      OLAPAggregateRel(group=[{0, 1, 2, 3, 4, 5}], agg#0=[$SUM0($6)], agg#1=[COUNT($6)], TRANS_CNT=[COUNT()])
        OLAPProjectRel(WEEK_BEG_DT=[$13], category_name=[$21], CATEG_LVL2_NAME=[$15], CATEG_LVL3_NAME=[$14], LSTG_FORMAT_NAME=[$5], SITE_NAME=[$23], PRICE=[$0])
          OLAPFilterRel(condition=[OR(=($3, 123456), =($5, 'New'))])
            OLAPJoinRel(condition=[=($2, $25)], joinType=[left])
              OLAPJoinRel(condition=[AND(=($6, $22), =($2, $17))], joinType=[left])
                OLAPJoinRel(condition=[=($4, $12)], joinType=[left])
                  OLAPTableScan(table=[[DEFAULT, TEST_KYLIN_FACT]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])
                  OLAPTableScan(table=[[DEFAULT, TEST_CAL_DT]], fields=[[0, 1]])
                OLAPTableScan(table=[[DEFAULT, test_category]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8]])
              OLAPTableScan(table=[[DEFAULT, TEST_SITES]], fields=[[0, 1, 2]])
How To Optimize Cube? – Full Cube vs. Partial Cube
• Full Cube
  • Pre-aggregates all dimension combinations
  • “Curse of dimensionality”: an N-dimension cube has $2^{N}$ cuboids
• Partial Cube
  • To avoid dimension explosion, the dimensions are divided into different aggregation groups
  • $2^{N+M+L} \rightarrow 2^{N} + 2^{M} + 2^{L}$
  • For a cube with 30 dimensions divided into 3 groups, the number of cuboids drops from about 1 billion to about 3 thousand
  • $2^{30} \rightarrow 2^{10} + 2^{10} + 2^{10}$
• Tradeoff between online aggregation and offline pre-aggregation
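The arithmetic behind the 30-dimension example can be checked directly. The two helper functions below are illustrative names, and the partial-cube count follows the slide's simple estimate in which each aggregation group is cubed independently.

```python
def full_cube_cuboids(num_dimensions):
    """Number of cuboids when every dimension combination is pre-aggregated."""
    return 2 ** num_dimensions


def partial_cube_cuboids(group_sizes):
    """Slide's estimate: each aggregation group is cubed independently."""
    return sum(2 ** size for size in group_sizes)


full = full_cube_cuboids(30)
partial = partial_cube_cuboids([10, 10, 10])
print(f"full cube:    {full:,} cuboids")     # 1,073,741,824
print(f"partial cube: {partial:,} cuboids")  # 3,072
print(f"reduction:    {full // partial:,}x")
```

The price of the smaller cube is that a query mixing dimensions from different groups has no exact pre-built cuboid and needs more aggregation at query time, which is the online/offline tradeoff noted on the slide.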
How To Optimize Cube? – Partial Cube
How To Optimize Cube? – Incremental Building
Kylin vs. Hive
# | Query Type             | Return Dataset | Query on Kylin (s) | Query on Hive (s) | Comments
1 | High Level Aggregation | 4              | 0.129              | 157.437           | 1,217 times
2 | Analysis Query         | 22,669         | 1.615              | 109.206           | 68 times
3 | Drill Down to Detail   | 325,029        | 12.058             | 113.123           | 9 times
4 | Drill Down to Detail   | 524,780        | 22.42              | 6383.21           | 278 times
5 | Data Dump              | 972,002        | 49.054             | N/A               |
[Bar chart: Hive vs. Kylin for SQL #1, SQL #2, SQL #3; axis from 0 to 200]
[Query taxonomy pyramid: High Level Aggregation, Analysis Query, Drill Down to Detail, Low Level Aggregation, Transaction Level]
Based on a case with 12+ billion records
Performance - Concurrency
Linear scale-out with more nodes
Performance - Query Latency
90th percentile of queries: < 5s
Green line: 90th percentile of queries
Gray line: 95th percentile of queries
Kylin Evolution Roadmap
Timeline 2013 to 2015, with markers at Sep 2013, Jan 2014, Sep 2014, Q1 2015, and TBD:
• Initial Prototype for MOLAP
  • Basic end-to-end POC
• MOLAP
  • Incremental Refresh
  • ANSI SQL
  • ODBC Driver
  • Web GUI
  • ACL
  • Open Source
• HOLAP
  • Streaming OLAP
  • JDBC Driver
  • New UI
  • Excel Support
  • … more
• Next Gen
  • Automation
  • Capacity Management
  • In-Memory Analysis (TBD)
  • Spark (TBD)
  • … more
• Future…
Kylin Ecosystem
• Kylin Core
  • Fundamental framework of the Kylin OLAP engine
• Extension
  • Plugins to support additional functions and features
• Integration
  • Lifecycle management support to integrate with other applications
• Interface
  • Allows third-party users to build more features via user interfaces atop the Kylin core
• Driver
  • ODBC and JDBC drivers
Diagram groups around Kylin OLAP Core:
• Extension: Security, Redis Storage, Spark Engine, Docker
• Interface: Web Console, Customized BI, Ambari/Hue Plugin
• Integration: ODBC Driver, ETL, Drill, SparkSQL
Open Source
• Kylin Site: http://kylin.io
• Twitter: @ApacheKylin
• GitHub: apache/incubator-kylin
• WeChat (微信): ApacheKylin
If you want to go fast, go alone.
If you want to go far, go together.
--African Proverb
