hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Spark

BigData NoSQL System: ApsaraDB HBase and Spark
Wei Li
ApsaraDB HBase X-Pack team

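BigData processing scenario
Personalized recommendation, Safety control, Statistical Analysis, Time & space timing, Feeds
New retail, Finance, E-commerce, New manufacturing, Game, Social, News
User profiles
Crawler-collected information
Anti-fraud system
Order data
User behavior analysis
User profiles
Recommendation engine
Massive real-time data processing
Monitoring data
Trajectory and device data
Geographic information
Regional distribution statistics
Dimension tables and result tables
Offline analysis
Massive real-time data storage
Massive posts and articles
Chats and comments
Massive real-time data processing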

System iterations and challenges
Centralized database
Distributed database
Hadoop
HBase X-Pack

ApsaraDB HBase X-Pack Architecture
One-stop shop for big data processing: Storage & Search & Computing, Scalability & Real-Time & Flexibility
Two integrations
•  Storage and Search integration
•  Online and offline integration

One cost reduction
•  Complex computing flexibility

ApsaraDB HBase X-Pack Deployment
Streaming Compute: SparkStreaming
Online storage & Search: Phoenix, Search Index, HBase + Solr
Complex analysis: Spark, Parquet
BDS

Spark analysis of HBase data
[Diagram labels, Spark access paths to HBase: HBase; Phoenix; DataSource API; Get API, Scan API; Snapshot region; Filter; newAPIHadoopRDD; put/create API; PhoenixInputFormat; Multi get, Range Scan; TableSnapshotInputFormat]
[Diagram labels, Spark on HBase: Spark Parser; GetPartition; Required Columns; PrunedFilter; Schema Mapping]
Performance
•  Distributed scan
•  SQL optimizations such as partition pruning, column pruning, and predicate pushdown
•  Direct reading of HFiles
•  Automatic conversion to column-based storage
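
The pruning and pushdown items above can be illustrated with a small sketch. It is not the X-Pack Spark implementation. It only shows, under assumed names, how a planner might turn row-key predicates into scan bounds and request just the required columns. The `rowkey` column name, the predicate tuple format, and `plan_scan` itself are hypothetical.

```python
def plan_scan(required_columns, predicates):
    """Turn ANDed predicates into a scan description.

    predicates: list of (column, op, value) tuples (hypothetical format).
    Row-key range predicates become scan bounds (pushdown); everything
    else is kept as a residual filter to evaluate on the returned rows.
    """
    start_row, stop_row, residual = None, None, []
    for column, op, value in predicates:
        if column == "rowkey" and op == ">=":
            # Keep the tightest lower bound.
            start_row = value if start_row is None else max(start_row, value)
        elif column == "rowkey" and op == "<":
            # Keep the tightest upper bound.
            stop_row = value if stop_row is None else min(stop_row, value)
        else:
            residual.append((column, op, value))
    return {
        "start_row": start_row,             # inclusive lower bound, or None
        "stop_row": stop_row,               # exclusive upper bound, or None
        "columns": list(required_columns),  # column pruning
        "residual": residual,               # filters not pushed into the range
    }


plan = plan_scan(
    ["cf:age"],
    [("rowkey", ">=", "u100"), ("rowkey", "<", "u200"), ("cf:age", ">", 30)],
)
print(plan)
```

With a plan like this, only the row range u100 to u200 and one column would be read, instead of scanning the whole table.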

Spark analysis of HBase data
One-click archiving
•  Row-oriented to column-oriented
•  Performance improves 20 times
•  HBase cluster is more stable
[Diagram labels: BDS; HBase (hlog, region, hfile); Spark (Driver, Executor0, Executor1, ExecutorX); start/stop config; BulkLoad]
/tmp/20190606/00/15/
/tmp/20190606/00/30/
/tmp/20190606/23/45/
/tmp/20190607/00/00/
……
/data/20190606
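
The directory names on this slide suggest that increments land in 15-minute folders and that one day of increments ends up under `/data/<day>`. The sketch below only reproduces that naming pattern. The 15-minute step, the rule that a folder is named after the end of its window, and both function names are assumptions read off the paths, not documented behavior.

```python
from datetime import date, datetime, timedelta


def increment_dirs(day, step_minutes=15, root="/tmp"):
    """List the increment folders assumed to belong to `day`.

    Each folder is named after the END of its window, so the last
    folder of a day carries the next day's date at 00:00.
    """
    start = datetime(day.year, day.month, day.day)
    steps = 24 * 60 // step_minutes
    dirs = []
    for i in range(1, steps + 1):
        window_end = start + timedelta(minutes=i * step_minutes)
        dirs.append(f"{root}/{window_end:%Y%m%d}/{window_end:%H}/{window_end:%M}/")
    return dirs


def archive_dir(day, root="/data"):
    """Daily archive folder for `day` (assumed layout)."""
    return f"{root}/{day:%Y%m%d}"


dirs = increment_dirs(date(2019, 6, 6))
print(dirs[0], dirs[-1], archive_dir(date(2019, 6, 6)))
```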

ApsaraDB X-Pack Spark expands the HBase ecosystem

[Diagram labels: X-Pack Spark; POLARDB; RDS; Redis; HBase; Phoenix4.x; Phoenix5.0; Kafka; LogHub; ADS4PG; DataHub; ODPS; TableStore; OSS; schema; BulkGet]

ApsaraDB X-Pack Spark cost & dynamic
•  Compute resource elasticity (cost 1 time lower)
•  OSS storage resource flexibility (cost 3 times lower)
[Diagram labels: Master1, Master2; Core (x3); Elastic (x6); ECS; OSS&DFS; HDFS]
[Chart: ElasticNode usage in node*h, axis 0 to 300; labels: 00:00-05:00, 5 node, 10 node, 5 node]
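
The chart's unit, node*h, is simply nodes multiplied by hours. A minimal sketch of that arithmetic follows, assuming one possible reading of the labels: 10 nodes during 00:00-05:00 and 5 nodes for the rest of the day, compared with a fixed 10-node cluster. Both schedules are illustrative assumptions, not figures from the talk.

```python
def node_hours(schedule):
    """Sum node*h over (start_hour, end_hour, nodes) segments."""
    return sum((end - start) * nodes for start, end, nodes in schedule)


# Hypothetical schedules for one day.
fixed = [(0, 24, 10)]               # always sized for the peak
elastic = [(0, 5, 10), (5, 24, 5)]  # scale out only for the nightly window

fixed_cost = node_hours(fixed)
elastic_cost = node_hours(elastic)
saving = 1 - elastic_cost / fixed_cost
print(fixed_cost, elastic_cost, round(saving, 3))
```

Under these assumed numbers the elastic schedule uses 145 node*h against 240 node*h for the fixed cluster.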

ApsaraDB X-Pack Spark data desktop
Scheduling, dependencies, interactive

HBase X-Pack Product recommendation platform

Scenario: As the number of users of the app grows, the customer is preparing to
launch a product recommendation feature, which requires real-time ETL, storage, and
model computation on the user behavior logs.

HBase X-Pack: Integrated data processing platform

Pain points

• Online HBase and offline analysis share a cluster, which degrades
online HBase query performance
• Spark and Hive SQL read and write HBase directly in bulk,
which affects HBase stability

Values

• HBase data is archived asynchronously to the Spark data
warehouse, with no impact on online traffic
• After Spark analysis, the result data is written back with the bulkload
method, which does not affect the online business.
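
Bulk loading stays off the online write path because files are prepared outside the region servers. For that to work, rows must be grouped by target region and sorted by row key. The sketch below shows that preparation step in plain Python. The split keys, the sample rows, and `partition_for_bulkload` are hypothetical, and a real job would write HFiles instead of returning lists.

```python
from bisect import bisect_right
from collections import defaultdict


def partition_for_bulkload(rows, split_keys):
    """Group (row_key, value) pairs by region and sort each group.

    split_keys must be sorted. Region i covers
    [split_keys[i-1], split_keys[i]): start inclusive, end exclusive.
    """
    buckets = defaultdict(list)
    for key, value in rows:
        region = bisect_right(split_keys, key)  # index of the owning region
        buckets[region].append((key, value))
    # Each region's rows must be in row-key order before files are written.
    return {region: sorted(pairs) for region, pairs in sorted(buckets.items())}


rows = [(b"u3", 1), (b"a1", 2), (b"m5", 3), (b"m0", 4)]
print(partition_for_bulkload(rows, [b"m0", b"t0"]))
```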

HBase X-Pack: Big data risk control platform

[Diagram labels: Kafka; Spark Streaming; HBase; Spark (SQL, MLlib); Parquet (HDFS, OSS); Load]

•  Real-time messages: Kafka receives the messages collected in real time, and Spark Streaming can do simple processing on them
•  Daily incremental archive: the data increments streamed into the storage service each day are archived to the Spark offline warehouse
•  Offline data warehouse: stores the full data set, in columnar format on HDFS.
•  Full model training: Spark supports complex computation, and MLlib and Python are suitable for training models on the data
•  Model data load: the new model, trained offline, is loaded into the model service, which provides risk control decisions externally
•  Risk control simulation: when a new rule or a new model is added at the training center, its quality can be verified by training on the full
data set in the Spark offline warehouse.
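
The risk control simulation step can be pictured as replaying a candidate rule over labeled history and measuring how well it separates fraud from normal traffic. The following is a minimal sketch of that idea in plain Python. The record fields (`amount`, `is_fraud`), the sample rule, and `evaluate_rule` are hypothetical, and a real run would read the full data set from the offline warehouse.

```python
def evaluate_rule(rule, records):
    """Replay `rule` over labeled records and return precision and recall."""
    tp = fp = fn = 0
    for record in records:
        flagged = rule(record)
        if flagged and record["is_fraud"]:
            tp += 1   # correctly flagged
        elif flagged:
            fp += 1   # false alarm
        elif record["is_fraud"]:
            fn += 1   # missed fraud
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}


# Hypothetical history and candidate rule.
history = [
    {"amount": 900, "is_fraud": True},
    {"amount": 950, "is_fraud": False},
    {"amount": 100, "is_fraud": True},
    {"amount": 50, "is_fraud": False},
]
print(evaluate_rule(lambda r: r["amount"] > 800, history))
```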

HBase X-Pack: Game log processing platform

https://yq.aliyun.com/articles/702337?spm=a2c4e.11163080.searchblog.27.154c2ec1x4glPb
Values

• Supports high-performance offline computing and real-time
computing
• Manages data job scheduling
• Supports elastic scaling of compute (cost savings)
• Supports hot and cold storage (cost savings)
• Fits data lake scenarios, with high-throughput mass
storage of structured and unstructured data

HBase X-Pack:Real-time scene

Values

• Pre-computation produces a common metrics layer, and the real-time
analysis and processing capabilities of HBase & Solr serve the
real-time report calculations of different services.
• Pre-computation uses Spark Streaming, with a delay of less than 10s
• Spark Streaming can work with HBase for deduplication and
for joining dimension tables
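
The last point can be sketched as the work done on one micro-batch: skip events that were already seen, then enrich the rest from a dimension table. In the sketch below a Python set and dict stand in for the HBase tables that would hold the seen keys and the dimension rows. The field names (`event_id`, `item_id`) and `process_batch` are hypothetical.

```python
def process_batch(events, seen_ids, dim_table):
    """Deduplicate one micro-batch and join it with a dimension table.

    seen_ids:  set standing in for a key-value table of processed event ids.
    dim_table: dict standing in for a dimension table keyed by item id.
    """
    output = []
    for event in events:
        event_id = event["event_id"]
        if event_id in seen_ids:
            continue                       # duplicate: already processed
        seen_ids.add(event_id)             # remember it for later batches
        dim_row = dim_table.get(event["item_id"], {})
        output.append({**event, **dim_row})  # enrich with dimension columns
    return output


seen = set()
dims = {"i1": {"category": "game"}}
print(process_batch([{"event_id": "e1", "item_id": "i1"}], seen, dims))
```

Because `seen_ids` outlives a single call, duplicates are dropped across batches as well as within one.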

HBase X-Pack:offline data warehouse

[Diagram labels: Spark; Spark Streaming; PolarDB, RDS, ADB, HBase, Mongo, Redis, Spark; Spark (Parquet, HIVEMeta)]
•  Operational data layer: the rawest data, held in message middleware such as Kafka or LogHub, or in online databases such as PolarDB, RDS, Mongo, and HBase.
•  Detail wide-table layer: use Spark batch ETL or Spark Streaming to build detailed wide tables
•  Public summary wide-table layer: classify and model the data in Spark by business theme, for example daily/monthly reports and model training.
•  Public dimension layer: static dimension tables
•  Data application layer: highly summarized data produced by the offline data warehouse is stored in the online database for query serving.
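
The layering above can be sketched end to end on toy data: raw operational rows are joined with a dimension table into wide rows, which are then aggregated into a daily summary ready to be served from an online store. The field names, the sample data, and both functions are hypothetical, and in the architecture described here each step would be a Spark job.

```python
from collections import defaultdict


def build_wide_rows(orders, user_dim):
    """Detail wide-table layer: attach dimension attributes to raw rows."""
    return [
        {**order, "city": user_dim.get(order["user_id"], {}).get("city", "unknown")}
        for order in orders
    ]


def daily_summary(wide_rows):
    """Summary layer: total amount per (day, city)."""
    totals = defaultdict(float)
    for row in wide_rows:
        totals[(row["day"], row["city"])] += row["amount"]
    return dict(totals)


# Hypothetical operational data and dimension table.
orders = [
    {"user_id": "u1", "day": "20190606", "amount": 10.0},
    {"user_id": "u2", "day": "20190606", "amount": 5.0},
    {"user_id": "u1", "day": "20190606", "amount": 2.5},
]
user_dim = {"u1": {"city": "Hangzhou"}}
print(daily_summary(build_wide_rows(orders, user_dim)))
```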

•  If you are interested in the online SQL analysis engine

•  If you are interested in the Spark kernel and ecosystem
We are hiring!
ApsaraDB HBase X-Pack:
https://help.aliyun.com/document_detail/93899.html
