Big Machine Data - Two Exemplary Applications in China
Jianmin Wang
Tsinghua University
Beijing, China
Agenda
• Background
• Two Exemplary Applications in China
Who are we?
• Institute for Data Science,
Tsinghua University
• Founded in April 2014
• Missions & Status Quo
– Recruiting world-class researchers and engineers from industry and academia
– Long-term dedication to system research and industry practice
– Leading China’s big data strategy, especially for industrial big data
Big Data Landscape
[Figure: big data landscape with three numbered sources: people-generated, computer-generated, and machine-generated data]
Machine Generated Data
• Exists broadly in
– Industrial business
– Agriculture
– Utilities
– Military
– Smarter cities
– Logistics
– Smart devices
– Scientific research
• Data rate
– 24x7, up to a million data points/s, from millions of devices
• Data type
– Mostly time series, temporal sequences, spatio-temporal data, and array data
• Data usage
– Real-time processing
– From monitoring to content-, shape-, and signal-based query and analysis
• Industrial businesses have entered the era of “big data”, but the real challenge is how to extract value from data.
• Machine generated data is the core of industrial big data
Big machine data is beyond the 3Vs
Our research spans the big data lifecycle
1 Storage
2 Preprocessing
3 Access & Exploration
4 Modeling & Analytics
Agenda
• Background
• Two Exemplary Applications in China
1 Industrial Sensor Data Management:
Cassandra at China Sany Group
2 Climate Data Management:
Cassandra at China Meteorological Administration
© 2015. All Rights Reserved.
China Sany Group
More than 200K active engineering machines in more than 150 countries
SANY Group is a global company in the construction machinery industry.
In 2011, SANY became the only company from China listed among the world’s top 500 companies in the construction machinery industry.
Pipeline of Industrial Sensor Data Processing
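[Figure: on-board Sany devices (Sany Motion Controller SYMC, Sany Industrial Display SYLD, Sany Mobile Terminal SYMT, and the product main control cabinet) send vehicle operating-condition data as SCP protocol packets through a wireless base station (wireless to wired, designated IP and port) and the Internet to user computers, which are used by service staff (rapid-response engineers, documentation engineers, ...) and by owners. The diagram numbers four steps (1 to 4) and carries the labels execute, collect, decision, and transfer.]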
The data records the operational
statuses of the machineries
5000 kinds of sensors
50 billion records per year
Technology Roadmap in Sany Group
• 2008: started managing sensor data
• 2010: 60K machines
• 2012: 80K machines; only 6 months of data could be kept online
• 2014: >100K machines; all data online
• 2020: >500K machines; >10K users
• Database migration: SQL Server ➡ Oracle, then Oracle ➡ Cassandra
Why Cassandra?
•Cost performance
•Scalability
•P2P Architecture
Operation   Worst case   Average case
Write       30%          2x
Query       22.6%        10x
Software Stack of Sensor Data Management
Stages: Collect, Store, Analyze. Components named in the diagram: Storm, Map/Reduce.
Storage layout (device as row key, one column family per sensor, one column per received time, monitored value as cell):
row key   sensor1 (cf1)                      sensor2 (cf2)                      ...
          received time1   received time2    received time1   received time2
device1   value            value             value            value            ...
device2   value            value             value            value            ...
...       ...              ...               ...              ...              ...
[Figure: structured storage (gather time) compared with Cassandra storage (machine, gather time, sensors)]
Schema Design – Row and Column
• Use each sensor as a Column Family (CF)
• In each Column Family (CF)
– Use the machine as the row key
– Use the gather time as the column name
– Use the sensor reading as the column value
– Columns of each row are stored in sorted order
– The number of columns can grow easily
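The layout described above can be sketched in memory as follows. This is only an illustration, not Cassandra code: the class name `SensorColumnFamily`, the use of integers for gather times, and numeric readings are all assumptions made for the example. It reads the schema as one column family per sensor, the machine as row key, and the gather time as a sorted column name.

```python
from bisect import bisect_left, bisect_right, insort
from collections import defaultdict


class SensorColumnFamily:
    """In-memory stand-in for one column family holding one sensor's readings."""

    def __init__(self, sensor_name):
        self.sensor_name = sensor_name
        # row key (machine id) -> list of (gather_time, value), sorted by time
        self.rows = defaultdict(list)

    def write(self, machine_id, gather_time, value):
        # insort keeps the columns of a row ordered by gather time
        insort(self.rows[machine_id], (gather_time, value))

    def slice(self, machine_id, start, end):
        """Return readings with start <= gather_time <= end for one machine."""
        columns = self.rows.get(machine_id, [])
        lo = bisect_left(columns, (start,))
        hi = bisect_right(columns, (end, float("inf")))
        return columns[lo:hi]


cf = SensorColumnFamily("oil_temperature")  # hypothetical sensor name
cf.write("machine-1", 30, 71.5)
cf.write("machine-1", 10, 70.0)
cf.write("machine-1", 20, 70.8)
recent = cf.slice("machine-1", 15, 30)
```

Because the columns stay sorted, a time-range query for one machine and one sensor is a contiguous slice, which is the property the schema relies on.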
[Figure: one column family per sensor (sensor1, sensor2, ..., sensorN), each with rows keyed by machine and columns named by gather time; ~5000 sensors give 5000+ column families]
Cassandra v1.2, CQL2 (not CQL3)
Why 5000+ Column Families?
• Cassandra v1.* with CQL2 (not CQL3) does not support compound primary keys or clustering keys
– Storing several sensors in one CF makes programming more complex
– The row key or column name must be split manually
• All the data in one SSTable belongs to one specific CF
– When querying a specific sensor, there is no need to scan unrelated data
Single-CF alternatives that pack two fields into one key:
Row Key                   Column Name
machine_id                sensor_id : gather_time
sensor_id : machine_id    gather_time
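To make the "manual split" concrete, here is a minimal sketch of the first alternative (row key `machine_id`, column name `sensor_id : gather_time`). The helper names, the `:` separator handling, and the 13-digit zero padding are assumptions for illustration; the slides do not show the actual encoding.

```python
SEPARATOR = ":"


def make_column_name(sensor_id, gather_time):
    # Zero-pad the time so that string order matches numeric time order.
    return f"{sensor_id}{SEPARATOR}{gather_time:013d}"


def split_column_name(column_name):
    sensor_id, gather_time = column_name.rsplit(SEPARATOR, 1)
    return sensor_id, int(gather_time)


def columns_for_sensor(row, sensor_id):
    """Filter one machine's row (a dict) down to a single sensor's readings."""
    prefix = sensor_id + SEPARATOR
    return {
        split_column_name(name)[1]: value
        for name, value in row.items()
        if name.startswith(prefix)
    }


row = {
    make_column_name("s1", 1000): 3.2,
    make_column_name("s2", 1000): 7.7,
    make_column_name("s1", 2000): 3.4,
}
s1_readings = columns_for_sensor(row, "s1")
```

Every read and write has to go through this encoding, and a row mixes data from many sensors, which is the extra complexity that one column family per sensor avoids.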
Challenge 1 – Creating Schema Hang
• Problem
– Create 5000+ CFs in batch
– Creation cost increases dramatically
[Figure: time cost (ms, axis from 0 to 15000) against CF serial number (1 to 541). Annotations: creating 1 CF takes 0.1 s; creating 1 CF takes 10 s.]
• Root Cause
– Protocol conflict
• Between the gossip protocol and the request propagation mechanism
– Message overhead
• The whole schema may be transmitted instead of only the changed part
Memory used by Gossip
Node   Receive schema message   Send schema message   Total
       memory cost              memory cost
N1     4.465G                   4.236G                8.70G
N2     4.308G                   4.907G                9.21G
N3     4.236G                   4.024G                8.26G
N4     4.808G                   4.387G                9.19G
N5     6.111G                   6.373G                12.48G
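A rough model shows why sending the whole schema hurts when thousands of column families are created in a batch. The function below is a back-of-the-envelope sketch, not a measurement: it counts CF definitions shipped to a single peer and assumes that, in the "whole schema" case, every creation resends all definitions created so far.

```python
def definitions_sent(num_cfs, whole_schema):
    """CF definitions shipped to one peer while creating num_cfs column families."""
    if whole_schema:
        # After creating the k-th CF, all k definitions are sent again.
        return sum(range(1, num_cfs + 1))
    # Delta propagation: only the newly created definition is sent each time.
    return num_cfs


full = definitions_sent(5000, whole_schema=True)
delta = definitions_sent(5000, whole_schema=False)
ratio = full / delta
```

Under this model the traffic grows quadratically with the number of column families, so the cost per creation keeps rising as the batch proceeds.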
Challenge 1 – Creating Schema Hang
• Solution
– Gossip takes effect only when:
• Propagation messages are lost or time out
• Nodes recover from a failure
– Creation time cost then stays constant
[Figure: Adaptive Lazy Gossip. Schema changes are propagated directly between nodes; the gossiped metadata (LOAD, STATUS, SCHEMA, VERSION, ...) carries the SCHEMA entry with a delay of t seconds. The diagram numbers four steps (1 to 4).]
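One possible reading of the lazy gossip idea is sketched below. It is a toy model, not Cassandra's implementation or the authors' code: the `Node` class and both functions are hypothetical. Schema changes are pushed directly, and gossip is used only to catch up peers that missed a push.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    schema_version: int = 0
    up: bool = True


def propagate(change_version, peers):
    """Push a schema change directly; return the peers that missed it."""
    missed = []
    for peer in peers:
        if peer.up:
            peer.schema_version = change_version
        else:
            missed.append(peer)  # message lost or timed out
    return missed


def lazy_gossip_round(source, pending):
    """Gossip the schema only to peers that missed a push and are reachable again."""
    still_pending = []
    for peer in pending:
        if peer.up:
            peer.schema_version = source.schema_version
        else:
            still_pending.append(peer)
    return still_pending


n1, n2, n3 = Node("N1", schema_version=1), Node("N2"), Node("N3", up=False)
pending = propagate(1, [n2, n3])   # N3 is down and misses the push
n3.up = True                       # N3 recovers from its failure
pending_after = lazy_gossip_round(n1, pending)
```

In the common case no schema data travels through gossip at all, which is how the creation cost can stay flat as the number of column families grows.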
Challenge 2 – Balancing Consistency & Throughput
• Production environment
– Sany production: 5-node cluster, 2x4 cores, 64GB memory
• Problem
– Throughput = 200K data points/sec
– 75% of the data is written successfully to only one replica, while the other replicas are stale (inconsistent)
• Cassandra is NOT very consistent
• Big obstacle for query operations
– Repair is required, but it is very slow
[Figure: experiment on Amazon EC2 (5 nodes, 2 cores, 8GB); rywc = read-your-writes consistency]
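The trade-off between consistency and throughput can be stated with the usual replica overlap rule: a read is guaranteed to see the latest acknowledged write when the number of replicas read (R) plus the number of replicas that acknowledged the write (W) exceeds the replication factor (N). The sketch below only encodes that rule; the replication factor of 3 is an assumption, since the slides do not state it.

```python
def overlaps(replication_factor, write_acks, read_acks):
    """True if every read set must intersect every write set (R + W > N)."""
    return read_acks + write_acks > replication_factor


def quorum(replication_factor):
    return replication_factor // 2 + 1


N = 3  # assumed replication factor
fast_but_stale = overlaps(N, write_acks=1, read_acks=1)
quorum_both = overlaps(N, write_acks=quorum(N), read_acks=quorum(N))
write_one_read_all = overlaps(N, write_acks=1, read_acks=N)
```

Writing to a single replica maximizes write throughput, but a read that touches only one replica can then miss the write until repair catches up, which matches the stale-replica problem described above.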
Challenge 2 – Balancing Consistency & Throughput
• Root cause of slow repair
– Too many column families (5000+)
– Too many ranges in the consistent hashing ring
• 256 virtual nodes (VN) per physical node
• Too many Merkle trees (ranges x CFs)
• Experience and suggestions
– Repair CFs and ranges one by one
• Do not repair the whole keyspace (all CFs) at once
– Repair the important CFs first
– Perform repair under light workload
[Figure: example ring with 5 physical nodes, 2 VNs each, 10 ranges in total. For each range and each CF, a Merkle tree is created and compared between two nodes.]
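The "ranges x CFs" product can be worked out directly, and a generic Merkle tree shows what is compared per range and CF. This is a sketch: the hash function (SHA-256) and the tree construction are generic choices for illustration and are not taken from Cassandra's repair code.

```python
import hashlib


def merkle_tree_count(physical_nodes, vnodes_per_node, column_families):
    ranges = physical_nodes * vnodes_per_node
    return ranges * column_families


def merkle_root(values):
    """Root hash over a list of byte strings (last node duplicated when odd)."""
    level = [hashlib.sha256(v).digest() for v in values]
    if not level:
        return hashlib.sha256(b"").digest()
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


figure_ranges = merkle_tree_count(5, 2, 1)          # the 10 ranges in the figure
production_trees = merkle_tree_count(5, 256, 5000)  # 256 VNs and 5000 CFs

replica_a = [b"row1", b"row2", b"row3"]
replica_b = [b"row1", b"row2", b"stale"]
in_sync = merkle_root(replica_a) == merkle_root(replica_b)
```

With 5 nodes, 256 virtual nodes each, and 5000 column families, the product is 6.4 million trees, which is why repairing one CF and one range at a time is more manageable than repairing the whole keyspace.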
Challenge 3 – Heterogeneous Nodes
• Problem
– How to assign the data partitions in a heterogeneous cluster?
• Experimental study
– Deploy a heterogeneous cluster
• 2 powerful servers and 8 PCs
– Throughput performance
• Heterogeneous cluster vs. a cluster with only the 2 powerful servers
Assign the positions of the nodes (i.e., tokens) in the ring according to their computing capacities
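A minimal sketch of capacity-proportional token assignment follows. It ignores replication, which the slides identify as the complicating factor, and the capacity ratio of 4:1 between servers and PCs as well as the small ring size are assumptions chosen to keep the numbers readable.

```python
def assign_tokens(capacities, ring_size=2 ** 64):
    """Return {node: token}; each node owns the range that ends at its token."""
    total = sum(capacities.values())
    tokens = {}
    cumulative = 0
    for node, capacity in capacities.items():
        cumulative += capacity
        # The token is the last position of the node's range.
        tokens[node] = ring_size * cumulative // total - 1
    return tokens


def range_lengths(tokens, ring_size=2 ** 64):
    """Length of the range owned by each node, walking the ring in token order."""
    ordered = sorted(tokens.items(), key=lambda item: item[1])
    lengths = {}
    previous = ordered[-1][1] - ring_size  # wrap around from the last token
    for node, token in ordered:
        lengths[node] = token - previous
        previous = token
    return lengths


# 2 powerful servers and 8 PCs; relative capacities are hypothetical.
capacities = {"server1": 4, "server2": 4}
capacities.update({f"pc{i}": 1 for i in range(1, 9)})
tokens = assign_tokens(capacities, ring_size=1600)
lengths = range_lengths(tokens, ring_size=1600)
```

Each server ends up with a range four times as long as a PC's range, so the primary load follows capacity; replica placement would still shift part of that load onto neighboring nodes.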
Challenge 3 – Heterogeneous Nodes
• Root Cause
– The replica mechanism complicates the imbalance problem
• Each node’s configuration may affect other nodes’ performance
– The Virtual Node (VN) mechanism cannot fit all scenarios
• Too many VNs make the lookup table too big and slow down repair
• The maximum number of VNs on a physical node is 1536 (restricted by the Cassandra source code)
[Figure: the capacity of N1 is the worst, and E is short, but N1 is responsible for many data records]
For an operation sent to the cluster:
• N5 finishes the operation quickly
• But N5 has to wait for N1, which is slow
Challenge 3 – Heterogeneous Nodes
• Solution
– Initialize the cluster properly
• Use Quadratic Programming (QP) to find the best positions of the (virtual) servers
• Has been deployed successfully at China Sany Group
– Scaling out the cluster
• Use a dynamic algorithm to find the best position for the newly added server
Scaling out: Xiangdong Huang, Jianmin Wang et al. Optimizing Data Partition for Scaling out NoSQL Cluster. Concurrency and Computation: Practice and Experience (Early View)
[Figure: scaling out by finding the best position for the new node]
Optimize:
1. the order of the nodes in the ring
2. the range lengths in the ring
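The idea of searching for the best position of a new node can be illustrated with a simple greedy heuristic. This is not the QP formulation and not the algorithm from the cited paper; it ignores replication and node order, and the function name, range lengths, and capacities are hypothetical. It tries splitting each existing range and keeps the split that minimizes the worst load-to-capacity ratio.

```python
def best_split(range_lengths, capacities, new_capacity):
    """Pick the range to split so the worst load/capacity ratio is minimal.

    range_lengths and capacities are dicts keyed by node name.
    Returns (owner_to_split, length_taken_by_new_node, worst_ratio).
    """
    best = None
    for owner, length in range_lengths.items():
        # Split so that the owner and the new node end up with the same ratio.
        taken = length * new_capacity / (new_capacity + capacities[owner])
        ratios = [
            (range_lengths[n] - (taken if n == owner else 0)) / capacities[n]
            for n in range_lengths
        ]
        ratios.append(taken / new_capacity)
        worst = max(ratios)
        if best is None or worst < best[2]:
            best = (owner, taken, worst)
    return best


lengths = {"n1": 500, "n2": 300, "n3": 200}
capacities = {"n1": 2, "n2": 2, "n3": 1}
owner, taken, worst = best_split(lengths, capacities, new_capacity=1)
```

In this example the new node relieves `n1`, the node with the highest load per unit of capacity, after which `n3` becomes the most heavily loaded node.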
Datasets & Results in China Sany Group
• 5000+ column families for sensor data
• 100K+ engineering machines
• Amount of historical data loaded
– From April 2012 to now
• Data size
– Tens of billions of operational status records
– Several billion GPS records
• Write throughput
– 5 nodes (2*4 CPU cores, 64GB memory, 9TB disk)
– 20K TPS under regular workload, 200K at peak
Industrial Big Data Platform: More Requirements Beyond Sany Applications
• Even higher throughput
– High frequency sensors
– High volume sensors
– 10+ M data points/second
• Native time-series query
– Time and value based query
– Richer set of analytical queries
– <1 second response
• Synchronization
– Edge synchronization
– Compression, out-of-order, retransmission
• Adaptive deep compression
– Different data, different algorithms
– Transparent to query
– Deep compression of historical data
• Moving object support
– Spatial-temporal index
– Trajectory based queries
Industrial Data Analysis Pipeline
[Figure: pipeline from the specified operational status data (Boolean values, status values, analogue values) through basic indicators, baseline, and variance to specific features, common features, and outliers. Counts shown in the figure: 1.046 billion (twice) and 8030.]
Basic indicators by data type (a baseline and a variance are derived for each):
• General: count, frequency, ...
• Analogue: average, variance, extremum, …
• Boolean: times, duration, …
• States: change times, duration, …
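The indicator, baseline, and outlier steps can be sketched as below. The slides do not define how the baseline and variance are computed, so this example assumes the baseline is the fleet-wide mean of an indicator and flags machines that deviate from it by more than a chosen number of standard deviations; all function names and the threshold are hypothetical.

```python
from statistics import mean, pvariance


def analogue_indicators(samples):
    return {
        "average": mean(samples),
        "variance": pvariance(samples),
        "extremum": max(samples, key=abs),
    }


def boolean_indicators(samples, interval_s):
    """times = number of False-to-True switches; duration = seconds spent True."""
    previous = [False] + samples[:-1]
    times = sum(1 for prev, cur in zip(previous, samples) if cur and not prev)
    duration = interval_s * sum(1 for s in samples if s)
    return {"times": times, "duration": duration}


def outliers(indicator_by_machine, threshold=2.0):
    """Machines whose indicator is more than threshold std devs from the baseline."""
    values = list(indicator_by_machine.values())
    baseline = mean(values)
    spread = pvariance(values) ** 0.5
    if spread == 0:
        return []
    return [
        machine
        for machine, value in indicator_by_machine.items()
        if abs(value - baseline) > threshold * spread
    ]


fleet = {f"m{i}": 10 for i in range(9)}
fleet["m9"] = 30  # one machine with an unusually high indicator
flagged = outliers(fleet)
```

The same pattern applies to any of the listed indicators: compute it per machine, derive a baseline across the group, and inspect the machines that sit far from it.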
Industry Practice – Value-Added Analytics
• Driver profile
• Hydraulic oil temperature analysis
• Temporal parameter analysis for vehicle start
• Parameter correlation
• Spatial analysis for failure
• Key component anomaly detection
Application areas: Service, Quality Control, R&D
Big Data Application 1 – Concrete Pump Truck Tip-over Detection
• Tip-over of a concrete pump truck is mainly caused by insufficient support from the leg cylinders, and it is a major production safety issue
[Figure: horizontal inclination angle]
• Quickly spot and prevent dangerous operation through group behavior analysis of concrete pump trucks
[Figure: overall distribution of the horizontal (X-axis) and vertical (Y-axis) inclination angles of concrete pumps, with unstable and idle instances marked; inclination angle vibration level filter; inclination angle distribution of an individual concrete pump]
• Idle instances: unplugging operation leads to malfunctioning
• Unstable instances: early degradation pattern of the cylinder
• Typical instances: stable oscillation
Data-driven detection of anomalies and potential accidents
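A vibration level filter over inclination angles could look like the sketch below. The slides do not specify the filter, so this assumes the vibration level is the standard deviation of the angle samples in a window, and the two thresholds (in degrees) are hypothetical values chosen only to separate the three groups named in the slides.

```python
from statistics import pstdev


def vibration_level(angles):
    """Standard deviation of inclination angle samples within one window."""
    return pstdev(angles)


def classify(angles, idle_max=0.05, unstable_min=1.0):
    level = vibration_level(angles)
    if level < idle_max:
        return "idle"
    if level > unstable_min:
        return "unstable"
    return "typical"


flat = classify([0.0] * 10)
steady = classify([0.0, 0.5, 0.0, -0.5] * 5)
shaky = classify([0.0, 3.0, 0.0, -3.0] * 5)
```

In practice the thresholds would be derived from the group distribution of all trucks rather than fixed by hand.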
Big Data Application 2 – Fault Diagnosis
• Time series pattern analysis and spatial correlation show that the leakage problem of the master cylinder is highly correlated with a high-speed rail construction project (the Hangzhou-Shenzhen high-speed rail)
• Investigation proved that the salt-spray environment and the water quality along the coast caused the corrosion of the cylinder’s potted component
Big Data Application 3 – Spare Components Demand Forecasting
• Traditional approach is based on marketers’ experience
• New approach
– Combines real-time data from machines, sales history, vehicle holdings, environment, GDP, etc.
• Result
– Cuts the inactive spare part inventory by half
[Figure: number of spare parts (units, axis from 0 to 250) per ten-day period (early, mid, and late month) from 2012/10 to 2013/6. Series: actual spare parts demand, the company's actual prepared stock, and the results of multi-region collaborative spare components prediction based on matrix factorization.]
The predicted result fits the actual demand better
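Matrix factorization for multi-region demand can be sketched in plain Python. This is a generic SGD factorization under stated assumptions, not the model used at Sany: rows stand for regions, columns for time periods, `None` marks a period to be predicted, and the demand values, rank, learning rate, and regularization are made up for the example.

```python
import random


def factorize(demand, rank=2, epochs=2000, lr=0.01, reg=0.02, seed=0):
    """Fit demand[r][t] ~ dot(P[r], Q[t]) by SGD, skipping missing (None) cells."""
    rng = random.Random(seed)
    rows, cols = len(demand), len(demand[0])
    P = [[rng.random() for _ in range(rank)] for _ in range(rows)]
    Q = [[rng.random() for _ in range(rank)] for _ in range(cols)]
    for _ in range(epochs):
        for r in range(rows):
            for t in range(cols):
                if demand[r][t] is None:
                    continue
                err = demand[r][t] - sum(P[r][k] * Q[t][k] for k in range(rank))
                for k in range(rank):
                    p, q = P[r][k], Q[t][k]
                    P[r][k] += lr * (err * q - reg * p)
                    Q[t][k] += lr * (err * p - reg * q)
    return P, Q


def predict(P, Q, r, t):
    return sum(p * q for p, q in zip(P[r], Q[t]))


def squared_error(demand, P, Q):
    return sum(
        (cell - predict(P, Q, r, t)) ** 2
        for r, row in enumerate(demand)
        for t, cell in enumerate(row)
        if cell is not None
    )


# Hypothetical demand: 3 regions x 3 periods, one unknown cell.
demand = [[2, 4, 6], [1, 2, None], [3, 6, 9]]
P, Q = factorize(demand)
forecast = predict(P, Q, 1, 2)  # estimate for the missing cell
```

Regions with similar demand patterns share latent factors, so a region with little history can borrow strength from the others, which is the collaborative aspect named in the figure legend.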
1 Industrial Data Management:
Cassandra at China Sany Group
2 Climate Data Management:
Cassandra at China Meteorological Administration
Pipeline of Climate Data Processing
[Figure: 1 Collection, 2 Transmission, 3 Access, 4 Browsing, connected through the data center and the Internet]
Characteristics of Climate Data
• Data categories: model, ground, aerological, satellite, radar, lightning, typhoon
• Model data (e.g., T639): wind field, temperature field, humidity, rainfall, snowfall, ...
• Each variable (e.g., temperature) has levels (800Pa, 850Pa, 900Pa, ...) and forecasts (8AM 3h, 8AM 6h, …, 8PM 3h, 8PM 6h)
Challenge in Meteorological Application – Pattern Data
• Hierarchical pattern data + flat others
• A highly efficient data delivery system for end users
– Support millions of small files
– Access data fast
– Scan data in various orders
• Performance requirement
– Get ~1MB of data in 50ms
– 600 concurrent clients
[Figure: directory tree of pattern data. Root / → pattern (T639, ...) → variable (wind, temper, ...) → level (800, 850, 900) → time (2014.2.18.08, 2014.2.18.20, 2014.2.19.08, ...) → ageing (3, 6, 9, ...). The tree levels are labeled d1 to d5 and the leaves t1 to t7.]
Why Cassandra?
• Scalability
• Fast data reads and writes
• Some columns are sorted
– Easy to scan data sequentially
• Time-based compaction (Cassandra v2.0 and later) for time series
key                       3h     6h     9h     …
T639/temperature/800Pa    file   file   file   …
1. Get the data where key=‘T639…/800Pa’
2. Retrieve the data before 6h, or retrieve the data after 6h
key                       3h     6h     9h     12h    …    3h     6h     9h
T639/temperature/800Pa    file   file   file   file   …    file   file   file
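The two access steps above (look up the row by key, then take the columns before or after 6h) can be sketched with a sorted in-memory row. This only illustrates why sorted columns make such scans cheap; the `SortedRow` class and the file names are hypothetical, and integer hours stand in for the column names.

```python
from bisect import bisect_left, bisect_right


class SortedRow:
    """One row whose columns are kept sorted by name (forecast hour)."""

    def __init__(self):
        self.names = []   # sorted column names
        self.values = []  # payloads aligned with names

    def put(self, hour, payload):
        i = bisect_left(self.names, hour)
        if i < len(self.names) and self.names[i] == hour:
            self.values[i] = payload  # overwrite an existing column
        else:
            self.names.insert(i, hour)
            self.values.insert(i, payload)

    def before(self, hour):
        """Columns whose name is smaller than hour."""
        i = bisect_left(self.names, hour)
        return list(zip(self.names[:i], self.values[:i]))

    def after(self, hour):
        """Columns whose name is larger than hour."""
        i = bisect_right(self.names, hour)
        return list(zip(self.names[i:], self.values[i:]))


store = {"T639/temperature/800Pa": SortedRow()}
row = store["T639/temperature/800Pa"]          # step 1: get the row by key
for hour in (9, 3, 12, 6):
    row.put(hour, f"file-{hour}h")
earlier = row.before(6)                        # step 2: slice around 6h
later = row.after(6)
```

Both slices are contiguous runs of the sorted column list, so no column outside the requested range needs to be touched.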
Solution – Schema Design for Pattern Data
• Data items
– 5-tuple: (pattern, variable, level, time, ageing)
– Pattern and variable are unordered
– Level, time, and ageing are ordered
[Figure: data space with the axes time, level, and ageing for each (pattern, variable) pair; labels: ColumnFamily, Row key, Column]
Performance Results
• 10 servers: 2*4 CPU cores, 64GB memory, 9TB SAS disk
• Store 7 kinds of model data
– 16TB per day
• Get data quickly
– 100 times faster than the older system
Thank you


Tsinghua University: Two Exemplary Applications in China

  • 1. Big Machine Data - Two Exemplary Applications in China Jianmin Wang Tsinghua University Beijing, China
  • 2. Agenda • Background • Two Exemplary Applications in China
  • 3. Who we are? • Institute for Data Science, Tsinghua University • Founded in April 2014 • Missions & Status Quo – Recruiting world-class researchers and engineers from industry and academia – Long-term dedication to system research and industry practice – Leading China’s big data strategy, especially for industrial big data
  • 4. BIG data Big Data Landscape People generated 2 1 3 Computer generated Machine generated
  • 5. Machine Generated Data • Broadly exist – Industrial business – Agriculture – Utility – Military – Smarter City – Logistics – Smart devices – Science research Data Rate 24*7, up to million data points/s, and millions of devices DataType Mostly are time-series, temporal sequence, and spatial-temporal and array data Data Usage Real-time processing. From monitoring to content, shape, signal based query and analysis
  • 6.  Industrial businesses have entered the era of “big data”, but the real challenge is how to extract value from data.  Machine generated data is the core of industrial big data Big Machine data is beyond 3Vs
  • 7. Our research spans big data lifecycle Storage1 Access & Exploration3 Preprocessing2 Modeling & Analytics4
  • 8. Agenda • Background • Two Exemplary Applications in China
  • 9. 1 Industrial Sensor Data Management: Cassandra at China Sany Group 2 Climate Data Management: Cassandra at China Meteorological Administration 9© 2015. All Rights Reserved.
  • 10. China Sany Group 10© 2015. All Rights Reserved. More than 200K active engineering machineries In more than 150 countries SANY Group is a global company in the construction machinery industry. In 2011, SANY became China’s unique company listed among the world’s top 500 companies in the construction machinery industry.
  • 11. Pipeline of Industrial Sensor Data Processing © 2015. All Rights Reserved. 11 Internet 三一运动控制器 SYMC 三一工业显示屏 SYLD 三一移动终端 SYMT 产品主控制柜 基 于 SCP协 议 包 车 辆 工 况 数 据 无线基站 无线到有线 指定IP与端口 快反工程师 资料工程师 ... 用户计算机 服务人员 业主 1 2 3 4 execute collect decision transfer The data records the operational statuses of the machineries 5000 kinds of sensors 50 billion records per year
  • 12. 2008 • Start managing sensor data 2010 • 60k machineries 2012 • 80k machineries • Can only support 6 month data online 2014 • >100k machineries • All data online 2020 •>500K machineries •>10K users Technology Roadmap in Sany Group © 2015. All Rights Reserved. 12 SQL Server ➡ Oracle Oracle ➡ Cassandra Why Cassandra? •Cost performance •Scalability •P2P Architecture Operation Worst case Average case Write 30% 2x Query 22.6% 10x
  • 13. Software Stack of Sensor Data Management © 2015. All Rights Reserved. 13 Collect Store Analyze Storm 设备(主键) 工况1(列族1) 工况2(列族2) …… 接收时间1 (列1) 接收时间 2 (列2) …… 接收时间 1 (列1) 接收时间 2 (列2) …… …… 设备1 监测值 监测值 …… 监测值 监测值 …… …… 设备2 监测值 监测值 …… 监测值 监测值 …… …… …… …… …… …… …… …… …… …… Map/Reduce row key sensor1(cf1) sensor2(cf2) device2 received time1 received time2 received time1 received time2 device1 value value value value value value value value
  • 14. Structured Storage gathertime Cassandra Storage: machine gather time sensors 。。。 。。。 Schema Design – Row and Column • Use sensor as Column Family (CF) • In each Column Family (CF) – Use as the row key – Use as the column name – Use as the column value – Columns of each row are sorted in advance – The number of columns is readily increasable machine gather time 。。。 machine gather time 。。。 … sensor1 sensor2 sensorN ~5000 sensors 5000+ column families Cassandra v1.2 CQL2 (not CQL3)
  • 15. © 2015. All Rights Reserved. 15 Why 5000+ Column Families? • Cassandra V1.* does not support multiple primary key & clustering key • This makes programming more complex • Manually split the row key or column name • All the data in one SSTABLE belongs to a specific CF • When querying a specified sensor, we need not scan unnecessary data Row Key Column Name machine_id sensor_id : gather_time Row Key Column Name sensor_id : machine_id gather_time Cassandra v1.2 CQL2 (not CQL3)
  • 16. Challenge 1 – Creating Schema Hang • Problem – Create 5000+ CFs in batch – Creation cost increases dramatically © 2015. All Rights Reserved. 0 5000 10000 15000 1 28 55 82 109 136 163 190 217 244 271 298 325 352 379 406 433 460 487 514 541 TimeCost(ms) CF Serial Number Time Cost Create 1 CF: 10s Create 1 CF: 0.1s • Root Cause – Protocol Conflict • Between Gossip Protocol and Request Propagation Mechanism – Message Overhead • May transform the whole schema instead of the changed part ReceiveSchema Message Memory Cost SendSchema Message Memory Cost Total N1 4.465G 4.236G 8.70G N2 4.308G 4.907G 9.21G N3 4.236G 4.024G 8.26G N4 4.808G 4.387G 9.19G N5 6.111G 6.373G 12.48G Memory used by Gossip
  • 17. Challenge 1 – Creating Schema Hang • Solution – Gossip takes effect only when: • Propagation messages lost/timeout • Nodes recovered from a failure – Creation time cost can keep constant 17 Propagate LOAD STATUS SHCEMA VERSION ... LOAD STATUS SHCEMA 延迟:t秒 VERSION ... metadata metadata Delay T sT strategy: 1 2 34 Adaptive Lazy Gossip 3 4
  • 18. Challenge 2 – Balancing Consistency & Throughput • Production environment – Sany production: 5 nodes cluster, 2x4 cores 64GB • Problem – Throughput = 200K data points/sec – 75% data is written successfully only in one replica, while the other replicas are stale (inconsistent) • Cassandra is NOT very consistent • Big obstacle for query operation – Repair is required, but is very slow © 2015. All Rights Reserved. 18 Experiment on Amazon EC2 2 cores, 8GB, 5 nodes rywc: read your write consistency
  • 19. Challenge 2 – Balancing Consistency & Throughput • Root Cause of slow Repair – Too many column families (5000+) – Too many ranges in the consistent hashing ring • 256 virtual nodes (VN) per physical node • Too many merkle trees (ranges x CFs) • Experience and Suggestions – Repair CFs and ranges one by one • Do not repair the whole keyspace (all CFs) at once – Repair the important CFs first – Perform repair at light workload © 2015. All Rights Reserved. 19 - 5 physical nodes - each has 2 VNs - 10 ranges in total For each range and each CF, create merkle tree and compare them between two nodes
  • 20. Challenge 3 – Heterogeneous Nodes • Problem – How to assign the data partitions in a heterogeneous cluster? • Experiment Study – Deploy a heterogeneous cluster • 2 powerful servers and 8 PCs – Throughput performance • Heterogeneous cluster cluster only with the 2 powerful servers © 2015. All Rights Reserved. 20 Assign the position of the nodes (i.e. Tokens) in the ring according to their computing capacities
  • 21. Challenge 3 – Heterogeneous Nodes • Root Cause – The replica mechanism makes the unbalanced problem complicated • Each Node’s configurations may impact other nodes’ performance – The Virtual Node (VN) mechanism cannot fit all scenarios • Too many VNs make the lookup table too big and slow down repair speed • Max #VNs in a physical node is 1536 (restricted by Cassandra source code) © 2015. All Rights Reserved. 21 The capacity of N1 is the worst, and E is short But N1 is responsible for many data records to the cluster: • N5 finish the operation quickly • But N5 has to wait for N1, which is slow
  • 22. Challenge 3 – Heterogeneous Nodes • Solution – Initialize the cluster properly • Use Quadratic Optimization (QP) to find the best positions of the (virtual) servers • Has been deployed to China Sany Group successfully – Scaling out the cluster • Use a dynamic algorithm to find the best positions for the new added server © 2015. All Rights Reserved. 22 Scaling out: Xiangdong Huang, Jianmin Wang et al. Optimizing Data Partition for Scaling out NoSQL Cluster. Concurrency and Computation: Practice and Experience (Early View) Scaling out: find the best position Optimize: 1. the order of the nodes in the ring 2. the range length of the ring
  • 23. Datasets & Results in China Sany Group • 5000+ column families for sensor data • 100K+ engineering machineries • Amount of historical data loaded – From 2012.4 to now • Data size – Tens of billions operational statuses records – Several billion GPS data – Write throughput – 5 nodes (2*4 cores CPU, 64GB memory, 9TB Disk) – 20K TPS as regular workload, 200K at peak 23
  • 24. Industrial Big Data Platform: More Requirements (Beyond Sany Applications) • Even higher throughput – High-frequency sensors – High-volume sensors – 10+ M data points/second • Native time-series query – Time- and value-based queries – Richer set of analytical queries – <1 second response • Synchronization – Edge synchronization – Compression, out-of-order, retransmission • Adaptive deep compression – Different data, different algorithms – Transparent to query – Deep compression of historical data • Moving object support – Spatial-temporal index – Trajectory-based queries
  • 25. Industrial Data Analysis Pipeline • [Figure: specified operational status data (Boolean values, status values, analogue values) → 1.046 billion basic indicators → 8030 baselines → 1.046 billion variances → outliers, split into common features and specific features] • Basic indicators, each with a baseline and a variance – General: count, frequency, ... – Analogue: average, variance, extremum, ... – Boolean: times, duration, ... – State changes: times, duration, ...
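The indicator → baseline → variance → outlier flow in the pipeline figure can be sketched as follows. The slides do not say how the baseline or the outlier rule is defined, so everything here is an assumption: the baseline is taken as the fleet mean of one indicator, the spread as its population standard deviation, and `find_outliers` and `threshold` are hypothetical names.

```python
from statistics import mean, pstdev
from typing import Dict


def find_outliers(indicator_by_machine: Dict[str, float],
                  threshold: float = 3.0) -> Dict[str, float]:
    """Flag machines whose indicator deviates from the fleet baseline.

    Assumption: baseline = fleet mean, spread = population standard
    deviation. Returns {machine: deviation in units of the spread}.
    """
    values = list(indicator_by_machine.values())
    baseline = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return {}  # all machines identical, nothing to flag
    scores = {m: (v - baseline) / spread for m, v in indicator_by_machine.items()}
    return {m: s for m, s in scores.items() if abs(s) > threshold}


if __name__ == "__main__":
    # Hypothetical indicator: average hydraulic oil temperature per machine.
    fleet = {f"m{i}": 50.0 for i in range(10)}
    fleet["m10"] = 100.0
    print(find_outliers(fleet, threshold=2.0))
```

The same function could be applied per indicator (count, average, duration, and so on), giving one baseline and one spread for each.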
  • 26. Industry Practice – Value-Added Analytics • Service, Quality Control, R&D • Driver profile • Hydraulic oil temperature analysis • Temporal parameter analysis for vehicle start • Parameter correlation • Spatial analysis for failure • Key components anomaly detection
  • 27. Big Data Application 1 – Concrete Pump Truck Tip-over Detection • Tip-over of a concrete pump truck is mainly caused by insufficient support from the leg cylinders, which is a major production safety issue • [Figure label: horizontal inclination angle]
  • 28. Big Data Application 1 – Concrete Pump Truck Tip-over Detection • Quickly spot and prevent dangerous operations through group behavior analysis of concrete pump trucks • [Figure: overall distribution of the horizontal (X-axis) and vertical (Y-axis) inclination angles of concrete pumps, with unstable instances and idle instances marked] • Inclination angle vibration level filter • [Figure: inclination angle distribution of an individual concrete pump]
  • 29. Big Data Application 1 – Concrete Pump Truck Tip-over Detection • Data-driven detection of anomalies and potential accidents • Typical instances: stable oscillation • Unstable instances: early degradation pattern of the cylinder • Idle instances: unplugging operation leads to malfunctioning
  • 30. Big Data Application 2 – Fault Diagnosis • Time series pattern analysis and spatial correlation showed that the leakage problem of the master cylinder is highly correlated with a high-speed rail construction project (Hangzhou-Shenzhen high-speed rail) • Investigation proved that the salt-spray environment and the water quality along the seaside caused the corrosion of the cylinder's potted component • [Figure label: salt-spray corrosion environment]
  • 31. Big Data Application 3 – Spare Components Demand Forecasting • Traditional approach is based on marketers' experience • New approach – Combines real-time data from machines, sales history, vehicle holdings, environment, GDP, etc. • Result – Inactive spare part inventory reduced by half – The predicted result fits the actual demand better • [Figure: number of spare parts (units) per ten-day period from 2012/10 to 2013/6, comparing actual demand, the stock actually prepared by the enterprise, and the results of multi-region collaborative spare components prediction based on matrix factorization]
  • 32. 1 Industrial Data Management: Cassandra at China Sany Group 2 Climate Data Management: Cassandra at China Meteorological Administration
  • 33. Pipeline of Climate Data Processing • [Figure: four numbered stages (Collection, Transmission, Access, Browsing) linking the Data Center and the Internet]
  • 35. Challenge in Meteorological Application – Pattern Data • Hierarchical pattern data + flat others • A highly efficient data delivery system for end users – Support millions of small files – Access data fast – Scan data in various orders • Performance requirement – Get ~1MB of data in 50ms – 600 concurrent clients • [Figure: directory tree rooted at /, with successive layers for patterns (T639, ...), variables (wind, temper, ...), levels (800, 850, 900), times (2014.2.18.08, 2014.2.18.20, 2014.2.19.08), and ageing (3, 6, 9, ...); nodes labelled d1 to d5 and t1 to t7]
  • 36. Why Cassandra? • Scalability • Fast data reads/writes • Some columns are sorted – Easy to scan data sequentially • Time-based compaction (Cassandra >= v2.0) for time series • Example row – key: T639/temperature/800Pa; columns: 3h, 6h, 9h, 12h, …; each value is a file • Query steps – 1. Get the data where key=‘T639…/800Pa’ – 2. Retrieve the data before 6h, or retrieve the data after 6h
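The two query steps above rely on columns being stored in sorted order within a row. The sketch below imitates that behaviour with an in-memory structure; it is not Cassandra client code, and the class `WideRowStore`, its methods, and the file names are hypothetical.

```python
import bisect
from collections import defaultdict


class WideRowStore:
    """Toy in-memory stand-in for a column family with sorted columns."""

    def __init__(self):
        self._cols = defaultdict(list)  # row key -> sorted column names
        self._vals = defaultdict(dict)  # row key -> {column name: value}

    def put(self, key, column, value):
        """Insert or overwrite one column, keeping column names sorted."""
        if column not in self._vals[key]:
            bisect.insort(self._cols[key], column)
        self._vals[key][column] = value

    def scan(self, key, start=None, end=None):
        """Return (column, value) pairs with start <= column < end, in order."""
        cols = self._cols.get(key, [])
        lo = 0 if start is None else bisect.bisect_left(cols, start)
        hi = len(cols) if end is None else bisect.bisect_left(cols, end)
        return [(c, self._vals[key][c]) for c in cols[lo:hi]]


if __name__ == "__main__":
    store = WideRowStore()
    row = "T639/temperature/800Pa"
    for hours in (9, 3, 12, 6):  # insertion order does not matter
        store.put(row, hours, f"file-{hours}h")
    print(store.scan(row, end=6))    # step 2a: columns before 6h
    print(store.scan(row, start=6))  # step 2b: columns from 6h onward
```

Locating the row by key and then slicing a contiguous run of sorted columns is what makes sequential scans cheap in this layout.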
  • 37. Solution – Schema Design for Pattern Data • Data items – 5-tuple – Pattern and variable are unordered – Level, time, ageing are ordered • [Figure: the data space of each (pattern, variable) pair, with axes time, level, and ageing, mapped onto ColumnFamily, row key, and column; shown next to the directory tree of patterns (T639, ...), variables (wind, temper, ...), levels (800, 850, 900), times (2014.2.18.08, 2014.2.18.20, 2014.2.19.08), and ageing (3, 6, 9, ...)]
  • 38. Performance Results • 10 servers: 2*4 cores of CPU, 64GB memory, 9TB SAS Disk • Store 7 kinds of model data – 16TB per day • Get data quickly – 100 times faster than the older system