hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Spark
BigData NoSQL System: ApsaraDB HBase and Spark
Wei Li
ApsaraDB HBase X-Pack team
Overview
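User profiles
Crawler-collected information
Anti-fraud system
Order data
User behavior analysis
User profiles
Recommendation engine
Massive real-time data processing
Monitoring data
Trajectory and device data
Geographic information
Regional distribution statistics
Dimension tables and result tables
Offline analysis
Massive real-time data storage
Game
Social
News
Massive posts and articles
Chats and comments
Massive real-time data processing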
BigData processing scenario
Personalized recommendation, Safety control, Statistical analysis, Time & space timing, Feeds
New retail, Finance, E-commerce, New manufacturing
System iterations and challenges
Centralized database
 Distributed database
 Hadoop
 HBase X-Pack
Architecture & Implementation
ApsaraDB HBase X-Pack Architecture 
One-stop shop for big data processing: storage, search, and computing, with scalability, real-time performance, and flexibility
Two integrations
•  Storage and search integration
•  Online and offline integration
One cost reduction
•  Complex computing flexibility
ApsaraDB HBase X-Pack Deployment
[Deployment diagram. Tiers: streaming compute, online storage & search, complex analysis. Components: Spark Streaming, Phoenix, Search Index, HBase + Solr, Parquet, Spark, BDS.]
Spark analysis HBase Data
[Diagram: Spark on HBase. Labels: HBase, Phoenix, DataSource API, Get API, Scan API, snapshot/region, filter, newAPIHadoopRDD, put/create API, PhoenixInputFormat, multi-get, range scan, TableSnapshotInputFormat; Spark parser, GetPartition, required columns, PrunedFilter, schema mapping.]
Performance
•  Distributed scan
•  SQL optimizations such as partition pruning, column pruning, and predicate pushdown
•  Direct reading of HFiles
•  Automatic conversion to column-based storage
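The three SQL optimizations named on this slide can be illustrated without Spark or HBase. The sketch below is a minimal in-memory model, not the X-Pack connector: `REGIONS`, `prune_regions`, and `scan` are hypothetical names, and the sample rows are made up. It assumes regions are ordered by start key and that rows inside a region are sorted by row key.

```python
from bisect import bisect_right

# Hypothetical stand-in for an HBase table: (region start key, sorted rows).
REGIONS = [
    ("",  [("a1", {"name": "ann", "age": 31, "city": "HZ"}),
           ("c7", {"name": "bob", "age": 17, "city": "BJ"})]),
    ("m", [("m3", {"name": "cat", "age": 45, "city": "SH"}),
           ("p9", {"name": "dan", "age": 29, "city": "HZ"})]),
    ("t", [("u2", {"name": "eve", "age": 52, "city": "SZ"})]),
]

def prune_regions(start_key, stop_key):
    """Partition pruning: indices of regions that can hold keys in [start_key, stop_key)."""
    starts = [start for start, _ in REGIONS]
    first = max(bisect_right(starts, start_key) - 1, 0)
    selected = []
    for i in range(first, len(REGIONS)):
        if starts[i] >= stop_key:
            break
        selected.append(i)
    return selected

def scan(start_key, stop_key, columns, predicate):
    """Range scan that filters and projects at the source."""
    regions_read, rows = [], []
    for i in prune_regions(start_key, stop_key):
        regions_read.append(i)
        for key, row in REGIONS[i][1]:
            if not (start_key <= key < stop_key):
                continue
            if not predicate(row):          # predicate pushdown
                continue
            rows.append((key, {c: row[c] for c in columns}))  # column pruning
    return regions_read, rows

regions_read, rows = scan("b", "n", ["name"], lambda r: r["age"] >= 18)
print(regions_read, rows)
```

Only two of the three regions are touched, one row survives the filter, and only the requested column is returned, so less data crosses from storage to the compute side.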
Spark analysis HBase Data
One-click archiving
•  Row-oriented to column-oriented
•  20x performance improvement
•  More stable HBase cluster
[Diagram: BDS archiving HBase data for Spark. Labels: HBase (hlog, region, hfile), BDS, BulkLoad, Spark driver and executors (Executor0, Executor1, ExecutorX), start/stop config. Paths shown: /tmp/20190606/00/15/, /tmp/20190606/00/30/, /tmp/20190606/23/45/, /tmp/20190607/00/00/, ……, /data/20190606.]
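The directory names on this slide step in 15-minute increments, and the archive converts row-oriented data to a column-oriented layout. The sketch below illustrates both ideas under stated assumptions: that each directory is labelled by the end of its 15-minute window (inferred from /tmp/20190606/23/45/ being followed by /tmp/20190607/00/00/), and that a row is a flat dictionary. `window_dir` and `to_columns` are hypothetical helpers, not part of BDS or X-Pack Spark.

```python
from datetime import datetime, timedelta

WINDOW_MINUTES = 15  # matches the 15-minute steps in the slide's /tmp paths

def window_dir(ts, root="/tmp"):
    """Directory for the window containing ts, labelled by window end (assumption)."""
    start = ts.replace(minute=ts.minute - ts.minute % WINDOW_MINUTES,
                       second=0, microsecond=0)
    end = start + timedelta(minutes=WINDOW_MINUTES)
    return f"{root}/{end:%Y%m%d}/{end:%H}/{end:%M}/"

def to_columns(rows):
    """Pivot row dicts into one list per column (row-oriented to column-oriented)."""
    names = sorted({name for row in rows for name in row})
    return {name: [row.get(name) for row in rows] for name in names}

rows = [
    {"rowkey": "u1", "clicks": 3},
    {"rowkey": "u2", "clicks": 7},
]
print(window_dir(datetime(2019, 6, 6, 0, 7)))   # /tmp/20190606/00/15/
print(to_columns(rows))
```

A columnar layout lets an analytical query read only the columns it needs, which is the usual reason such an archive speeds up scans.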
ApsaraDB X-Pack Spark expands the HBase ecosystem
[Diagram: X-Pack Spark connected to POLARDB, RDS, Redis, HBase, Phoenix 4.x, Phoenix 5.0, Kafka, LogHub, ADS4PG, DataHub, ODPS, TableStore, and OSS. Other labels: schema, BulkGet.]
ApsaraDB X-Pack Spark cost & dynamic
•  Compute resource elasticity (1 times lower)
•  OSS storage resource flexibility (3 times lower)
[Diagram: cluster with Master1, Master2, Core nodes, and Elastic nodes on ECS; storage on OSS & DFS and HDFS.]
[Chart: ElasticNode usage in node*h, axis from 0 to 300; labels: 00:00-05:00, 5 node, 10 node, 5 node.]
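The chart on this slide measures cost in node-hours. The arithmetic below is a sketch with hypothetical numbers only: it assumes the workload needs 10 nodes between 00:00 and 05:00 and 5 nodes for the rest of the day, and compares that with keeping 10 nodes running all day. The slide does not give the actual schedule or the resulting figures.

```python
def node_hours(schedule):
    """Total node-hours for a list of (hours, nodes) segments."""
    return sum(hours * nodes for hours, nodes in schedule)

# Hypothetical fixed cluster: sized for the peak, running all day.
fixed = node_hours([(24, 10)])

# Hypothetical elastic cluster: 10 nodes for the 5 peak hours, 5 nodes otherwise.
elastic = node_hours([(5, 10), (19, 5)])

saving = 1 - elastic / fixed
print(fixed, elastic, round(saving, 3))   # 240 145 0.396
```

Under these assumed numbers, scaling elastic nodes in only for the peak window uses about 40% fewer node-hours than a cluster sized for the peak.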
ApsaraDB X-Pack Spark data desktop
Scheduling, dependencies, interactive
Solutions
HBase X-Pack: Product recommendation platform

Scenario: As the app's user base grows, the customer is preparing to launch a product recommendation feature, which requires real-time ETL, storage, and model computation over user behavior logs.
HBase X-Pack: Integrated data processing platform

Pain points
• Sharing one cluster between online HBase and offline analysis degrades online HBase query performance
• Spark and Hive SQL jobs that read and write HBase directly in bulk affect HBase stability

Values
• HBase data is archived asynchronously to the Spark data warehouse, with no impact on the online service
• After Spark analysis, result data is written back with the bulkload method, which does not affect the online business
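HBase bulk loading hands the cluster ready-made files sorted by row key, grouped per region, instead of sending individual writes through the normal write path. The sketch below shows only that grouping and sorting step in plain Python. `plan_bulkload`, the split keys, and the sample rows are hypothetical, and no HBase API is called.

```python
from bisect import bisect_right

def plan_bulkload(rows, split_keys):
    """Group (row_key, value) pairs by target region and sort each group by key.

    split_keys are the sorted region boundaries: region 0 holds keys below
    split_keys[0], region i holds keys in [split_keys[i-1], split_keys[i]).
    """
    groups = [[] for _ in range(len(split_keys) + 1)]
    for key, value in rows:
        groups[bisect_right(split_keys, key)].append((key, value))
    return [sorted(group, key=lambda kv: kv[0]) for group in groups]

# Hypothetical analysis results and region boundaries.
results = [("p9", 1), ("a1", 2), ("m3", 3), ("u2", 4)]
for region, group in enumerate(plan_bulkload(results, ["m", "t"])):
    print(region, group)
```

Because each group is already sorted and aligned to one region, the files can be attached to the table directly, which is why the slide describes bulkload as leaving online traffic unaffected.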
HBase X-Pack: Big data risk control platform
[Diagram labels: Kafka, Spark Streaming, HBase, Spark (SQL, MLlib), Parquet (HDFS, OSS), Load.]
•  Real-time messages: Kafka receives messages collected in real time, and simple processing can be done with Spark Streaming.
•  Daily incremental archive: the data streamed into the storage service each day is archived incrementally to the Spark offline warehouse.
•  Offline data warehouse: stores the full data set, in columnar format on HDFS.
•  Full model training: Spark supports complex computation; MLlib and Python are suited to computing over the data and training models.
•  Model loading: the new model is loaded into the model service so that the service can provide control decisions externally.
•  Risk control simulation: when a new rule or a new model is added at the training center, its quality can be verified by training on the full data in the Spark offline warehouse.
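The risk control simulation step, checking a new rule against the full historical data before it goes live, can be sketched as a replay. The example below is an illustration only: it assumes historical events carry a known `fraud` label, and the events, the `amount` field, and the threshold rule are all hypothetical.

```python
def evaluate_rule(rule, events):
    """Replay a candidate rule over labelled history; return (precision, recall)."""
    tp = fp = fn = 0
    for event in events:
        flagged = rule(event)
        if flagged and event["fraud"]:
            tp += 1
        elif flagged:
            fp += 1
        elif event["fraud"]:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical archived events with a known outcome.
HISTORY = [
    {"amount": 9000, "fraud": True},
    {"amount": 50, "fraud": False},
    {"amount": 7000, "fraud": False},
    {"amount": 8000, "fraud": True},
    {"amount": 300, "fraud": True},
]

def large_amount(event):
    """Candidate rule: flag any transaction above a hypothetical threshold."""
    return event["amount"] > 5000

print(evaluate_rule(large_amount, HISTORY))
```

Running the replay against the offline warehouse rather than the online store keeps the evaluation load away from the serving path.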
HBase X-Pack: Game log processing platform



https://yq.aliyun.com/articles/702337?spm=a2c4e.11163080.searchblog.27.154c2ec1x4glPb
Values

• Supports high-performance offline and real-time computing
• Manages data job scheduling
• Supports elastic scaling of compute (cost savings)
• Supports hot and cold storage (cost savings)
• Fits data lake scenarios, with high-throughput mass storage of structured and unstructured data
HBase X-Pack: Real-time scenario

Values

• Pre-computation produces a common indicator layer; the real-time analysis and processing capabilities of HBase and Solr then serve real-time report calculations for different services.
• Pre-computation uses Spark Streaming, with a delay of less than 10 s.
• Spark Streaming can be used with HBase for deduplication and for joins against dimension tables.
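The last bullet, using a key-value store from a streaming job for deduplication and dimension-table joins, can be sketched with plain dictionaries. This is a minimal illustration rather than Spark Streaming code: `seen` stands in for an HBase table keyed by event id, `DIM_CITY` stands in for a dimension table, and every name and value is hypothetical.

```python
from collections import Counter

# Hypothetical dimension table: user id to city.
DIM_CITY = {"u1": "Hangzhou", "u2": "Beijing"}

def process_batch(batch, seen, dim, totals):
    """Process one micro-batch: drop duplicates, join the dimension, update the indicator."""
    for event in batch:
        if event["id"] in seen:        # already processed in an earlier batch
            continue
        seen[event["id"]] = True       # record the id, as a put to the store would
        city = dim.get(event["user"], "unknown")   # dimension-table lookup
        totals[city] += event["amount"]
    return totals

seen, totals = {}, Counter()
process_batch([{"id": "e1", "user": "u1", "amount": 10},
               {"id": "e2", "user": "u2", "amount": 5}], seen, DIM_CITY, totals)
process_batch([{"id": "e1", "user": "u1", "amount": 10},   # replayed duplicate
               {"id": "e3", "user": "u3", "amount": 7}], seen, DIM_CITY, totals)
print(dict(totals))
```

Keeping the seen-id set in an external store rather than in job memory is what lets deduplication survive across batches and restarts.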
HBase X-Pack: Offline data warehouse
[Diagram labels: Spark, Spark Streaming; PolarDB, RDS, ADB, HBase, Mongo, Redis, Spark; Spark (Parquet, HIVEMeta).]
•  Operational data layer: the rawest data, held in message middleware such as Kafka or LogHub, or in online databases such as PolarDB, RDS, Mongo, and HBase.
•  Detail wide-table layer: Spark batch ETL or Spark Streaming builds detailed wide tables.
•  Public summary wide-table layer: classification and modeling in Spark by business theme, such as daily/monthly reports and model training.
•  Public dimension layer: static dimension tables.
•  Data application layer: highly summarized data produced by the offline data warehouse is stored in online databases to serve queries.
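The layers above form a pipeline: raw operational records are widened with dimension attributes, then summarized by business theme, and the summary is what gets served online. The sketch below walks one tiny data set through those steps in plain Python. The table names, fields, and values are hypothetical, and a real deployment would run these steps as Spark jobs.

```python
from collections import defaultdict

# Operational data layer: raw order records (hypothetical).
ods_orders = [
    {"order_id": 1, "user_id": "u1", "amount": 30.0, "day": "2019-06-06"},
    {"order_id": 2, "user_id": "u2", "amount": 20.0, "day": "2019-06-06"},
    {"order_id": 3, "user_id": "u1", "amount": 5.0, "day": "2019-06-06"},
    {"order_id": 4, "user_id": "u1", "amount": 8.0, "day": "2019-06-07"},
]

# Public dimension layer: static user attributes (hypothetical).
dim_users = {"u1": {"region": "east"}, "u2": {"region": "north"}}

def build_detail_wide(orders, users):
    """Detail wide-table layer: attach dimension attributes to every order."""
    return [{**order, **users.get(order["user_id"], {"region": "unknown"})}
            for order in orders]

def build_daily_summary(wide_rows):
    """Public summary layer: total amount per (day, region)."""
    summary = defaultdict(float)
    for row in wide_rows:
        summary[(row["day"], row["region"])] += row["amount"]
    return dict(summary)

wide = build_detail_wide(ods_orders, dim_users)
daily = build_daily_summary(wide)   # the data application layer would serve this
print(daily)
```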
•  If you are interested in the online SQL analysis engine
•  If you are interested in the Spark kernel and ecosystem
We are hiring!
ApsaraDB HBase X-Pack:
https://help.aliyun.com/document_detail/93899.html
Thanks!
