SlideShare a Scribd company logo
Big Data Analytics
with MariaDB AX
Maria Luisa Raviol
Senior Sales Engineer
EMEA
Agenda
• The Task - Analytics – Why and what
• The Requirements – What do we need for analytics
• The Solution – Column Based Storage
• The Product – MariaDB AX and MariaDB ColumnStore
• The Uses – MariaDB ColumnStore in action
Why Analytics ?
• Get the most value of your data asset
• Faster Better decision making process
• Cost reduction
• New products and services
What is likely
to happen?
Why is it
happening?
Types of analytics
What is
happening?
What should I
do about it?
Descriptive: What happened ?
● Reports
○ Sales Report
○ Expense summary
● Ad-hoc requests to analyst
Diagnostics: Why did it happen
• Aggregates: aggregate measure over one or
more dimension
– Find total sales
– Top five product ranked by sales
• Roll-ups: Aggregate at different levels of
dimension hierarchy
– given total sales by city, roll-up to get sales by
state
• Drill-down: Inverse of roll-ups
– given total sales by state, drill-down to get total
by city
• Slicing and Dicing:
– Equality and range selections on one or more
dimensions
Predictive: What is likely to happen
• Sales Prediction
– Analyze data to identify trends, spot
weakness or determine conditions
among broader data sets for making
decisions about the future
• Targeted marketing
– what is likelihood of a customer buying
a particular product based on past
buying behavior
Big Data Analytics Use Cases
By industry
Finance
Identify trade patterns
Detect fraud and anomolies
Predict trading outcomes
Manufacturing
Simulations to improve design/yield
Detect production anomolies
Predict machine failures (sensor data)
Telecom
Behavioral analysis of customer calls
Network analysis (perf and reliability)
Healthcare
Find genetic profiles/matches
Analyze health vs spending
Predict viral oubreaks
The analytics workload
• Deals with data from a high level perspective
• Handles data in large groups of rows
– SELECTs data by date, customer location, product id etc.
– Data is loaded in batch or streamed in
– Data is mostly just INSERTed
• Dealing with individual data items is usually ineffective
• Data structures are optimized for analytics use and performance
• Data is sometimes purged, but just as often not
• Contains structured, semi-structured and sometimes unstructured data
• Data often comes from many different sources, internal and external
• Queries are ad-hoc, largely
• Transactions and ACID requirements are relaxed
Analytics database requirements
• Fast access to large amounts of data
• Scalable as data grows over time
– Analytics requirements increasing
– Regulatory requirements
– New data sources are added
• Load performance must be fast, scalable and predictable
• Data loading should be very flexible due to the different sources of data
– Some data loaded in batch, other is streamed
• Query performance also need to be scalable
• Data compression is a requirement
– Data size constraints, as well as read performance from disk
In summary, what do we need
• Something that can compress data A LOT
• Something that can be written to with fast and predictable performance
• Something that doesn't necessarily support transactions
– It doesn't hurt, but performance is so much more important
• Something that can support analytics queries
– Ad-hoc queries
– Aggregate queries
• Something that can scale as data grows
• Something that can still have a level of high availability
• Something that works with analytics tools, like Tableau, R etc.
Existing Approaches
Limited real time analytics
Slow releases of product innovation
Expensive hardware and software
Data Warehouses
Hadoop / NoSQL
LIMITED SQL SUPPORT
DIFFICULT TO
INSTALL/MANAGE
LIMITED TALENT POOL
DATA LAKE W/ NO DATA
MANAGEMENT
Hard to use
The Solution
Distributed Column based storage
Row-oriented vs. Column-oriented format
• Row oriented
– Rows stored sequentially in
a file
– Scans through every record
row by row
• Column oriented:
– Each column is stored in a
separate file
– Scans only the relevant
columns
ID Fname Lname State Zip Phone Age Sex
1 Bugs Bunny NY 11217 (718) 938-3235 34 M
2 Yosemite Sam CA 95389 (209) 375-6572 52 M
3 Daffy Duck NY 10013 (212) 227-1810 35 M
4 Elmer Fudd ME 04578 (207) 882-7323 43 M
5 Witch Hazel MA 01970 (978) 744-0991 57 F
ID
1
2
3
4
5
Fname
Bugs
Yosemite
Daffy
Elmer
Witch
Lname
Bunny
Sam
Duck
Fudd
Hazel
State
NY
CA
NY
ME
MA
Zip
11217
95389
10013
04578
01970
Phone
(718) 938-3235
(209) 375-6572
(212) 227-1810
(207) 882-7323
(978) 744-0991
Age
34
52
35
43
57
Sex
M
M
M
M
F
SELECT Fname FROM People WHERE State = 'NY'
Single-Row Operations - Insert
Row oriented:
new rows appended to
the end.
Column oriented:
new value added to
each file.
Key Fname Lname State Zip Phone Age Sex
1 Bugs Bunny NY 11217 (718) 938-3235 34 M
2 Yosemite Sam CA 95389 (209) 375-6572 52 M
3 Daffy Duck NY 10013 (212) 227-1810 35 M
4 Elmer Fudd ME 04578 (207) 882-7323 43 M
5 Witch Hazel MA 01970 (978) 744-0991 57 F
6 Marvin Martian CA 91602 (818) 761-9964 26 M
Key
1
2
3
4
5
Fname
Bugs
Yosemite
Daffy
Elmer
Witch
Lname
Bunny
Sam
Duck
Fudd
Hazel
State
NY
CA
NY
ME
MA
Zip
11217
95389
10013
04578
01970
Phone
(718) 938-3235
(209) 375-6572
(212) 227-1810
(207) 882-7323
(978) 744-0991
Age
34
52
35
43
57
Sex
M
M
M
M
F
6 Marvin Martian CA 91602 (818) 761-9964 26 M
Columnar insert not efficient for singleton insertions (OLTP). Batch loads touches row vs.
column. Batch load on column-oriented is faster (compression, no indexes).
Single-Row Operations - Update
Row oriented:
Update 100% of rows
means change 100%
of blocks on disk.
Column oriented:
Just update the blocks
needed to be updated
Key Fname Lname State Zip Phone Age Sex
1 Bugs Bunny NY 11217 (718) 938-3235 34 M
2 Yosemite Sam CA 95389 (209) 375-6572 52 M
3 Daffy Duck NY 10013 (212) 227-1810 35 M
4 Elmer Fudd ME 04578 (207) 882-7323 43 M
5 Witch Hazel MA 01970 (978) 744-0991 57 F
Key
1
2
3
4
5
Fname
Bugs
Yosemite
Daffy
Elmer
Witch
Lname
Bunny
Sam
Duck
Fudd
Hazel
State
NY
CA
NY
ME
MA
Zip
11217
95389
10013
04578
01970
Phone
(718) 938-3235
(209) 375-6572
(212) 227-1810
(207) 882-7323
(978) 744-0991
Age
34
52
35
43
57
Sex
M
M
M
M
F
Single-Row Operations - Delete
Row oriented:
new rows deleted
Column oriented:
value deleted from
each file
Key Fname Lname State Zip Phone Age Sex
1 Bugs Bunny NY 11217 (718) 938-3235 34 M
2 Yosemite Sam CA 95389 (209) 375-6572 52 M
3 Daffy Duck NY 10013 (212) 227-1810 35 M
4 Elmer Fudd ME 04578 (207) 882-7323 43 M
5 Witch Hazel MA 01970 (978) 744-0991 57 F
6 Marvin Martian CA 91602 (818) 761-9964 26 M
Key
1
2
3
4
5
Fname
Bugs
Yosemite
Daffy
Elmer
Witch
Lname
Bunny
Sam
Duck
Fudd
Hazel
State
NY
CA
NY
ME
MA
Zip
11217
95389
10013
04578
01970
Phone
(718) 938-3235
(209) 375-6572
(212) 227-1810
(207) 882-7323
(978) 744-0991
Age
34
52
35
43
57
Sex
M
M
M
M
F
6 Marvin Martian CA 91602 (818) 761-9964 26 M
Changing the table structure
Row oriented:
requires rebuilding of
the whole table
Column oriented:
Create new file for the
new column
Column-oriented is very flexible for adding columns, no need for a full rebuild
required with it.
Key Fname Lname State Zip Phone Age Sex Active
1 Bugs Bunny NY 11217 (718) 938-3235 34 M Y
2 Yosemite Sam CA 95389 (209) 375-6572 52 M N
3 Daffy Duck NY 10013 (212) 227-1810 35 M N
4 Elmer Fudd ME 04578 (207) 882-7323 43 M Y
5 Witch Hazel MA 01970 (978) 744-0991 57 F N
Key
1
2
3
4
5
Fname
Bugs
Yosemite
Daffy
Elmer
Witch
Lname
Bunny
Sam
Duck
Fudd
Hazel
State
NY
CA
NY
ME
MA
Zip
11217
95389
10013
04578
01970
Phone
(718) 938-3235
(209) 375-6572
(212) 227-1810
(207) 882-7323
(978) 744-0991
Age
34
52
35
43
57
Sex
M
M
M
M
F
Active
Y
N
N
Y
N
MariaDB ColumnStore Architecture
MariaDB ColumnStore
High performance columnar storage engine that supports a wide variety
of analytical use cases in highly scalable distributed environments
Parallel query
processing for distributed
environments
Faster, More
Efficient Queries
Single Interface for
OLTP and analytics
Easy to Manage and
Scale
Easier Enterprise
Analytics
Power of SQL and
Freedom of Open
Source to Big Data
Analytics
Better Price
Performance
Easier Enterprise
Analytics
ANSI SQL
Single SQL Front-end
• Use a single SQL interface for analytics and OLTP
• Leverage MariaDB Security features - Encryption for
data in motion , role based access and auditability
Full ANSI SQL
• No more SQL “like” query
• Support complex join, aggregation and window
function
Easy to manage and scale
• Eliminate needs for indexes and views
• Automated horizontal/vertical partitioning
• Linear scalable by adding new nodes as data grows
• Out of box connection with BI tools
Faster, More
Efficient Queries
Optimized for Columnar storage
• Columnar storage reduces disk I/O
• Blazing fast read-intensive workload
• Ultra fast data import
Parallel distributed query execution
• Distributed queries into series of parallel operations
• Fully parallel high speed data ingestion
Highly available analytic environment
• Built-in Redundancy
• Automatic fail-over
Parallel
Query Processing
MariaDB ColumnStore Architecture
Columnar Distributed Data Storage
User Connections
User Module nUser Module 1
Performance
Module n
Performance
Module 2
Performance
Module 1
MariaDB
Front End
Query Engine
User Module
Processes SQL Requests
Performance Module
Distributed Processing Engine
Storage Architecture
• Columnar storage
– Each column stored as separate file
– No index management for query
performance tuning
– Online Schema changes: Add new column
without impacting running queries
• Automatic horizontal partitioning
– Logical partition every 8 Million rows
– In memory metadata of partition min and max
– Query engine performs partition elimination.
– No partition management for query
performance tuning
• Compression
– Accelerate decompression rate
– Reduce I/O for compressed blocks
Column 1
Extent 1 (8 million rows, 8MB 64MB)
Extent 2 (8 million rows)
Extent M (8 million rows)
Column 2 Column 3 ... Column N
Data automatically arranged by
• Column – Acts as Vertical Partitioning
• Extents – Acts as horizontal partition
Vertical
Partition
Horizontal
Partition
...
Vertical
Partition
Vertical
Partition
Vertical
Partition
Horizontal
Partition
Horizontal
Partition
High Performance Query Processing
Horizontal
Partition:
8 Million Rows
Extent 2
Horizontal
Partition:
8 Million Rows
Extent 3
Horizontal
Partition:
8 Million Rows
Extent 1
Storage Architecture reduces I/O
• Only touch column files
that are in filter, projection,
group by, and join conditions
• Eliminate disk block touches
to partitions outside filter
and join conditions
Extent 1:
ShipDate: 2016-01-12 - 2016-03-05
Extent 2:
ShipDate: 2016-03-05 - 2016-09-23
Extent 3:
ShipDate: 2016-09-24 - 2017-01-06
SELECT Item, sum(Quantity) FROM Orders
WHERE ShipDate between ‘2016-01-01’ and ‘2016-01-31’
GROUP BY Item
Id OrderId Line Item Quantity Price Supplier ShipDate ShipMode
1 1 1 Laptop 5 1000 Dell 2016-01-12 G
2 1 2 Monitor 5 200 LG 2016-01-13 G
3 2 1 Mouse 1 20 Logitech 2016-02-05 M
4 3 1 Laptop 3 1600 Apple 2016-01-31 P
... ... ... ... ... ... ... ... ...
8M 2016-03-05
8M+1 2016-03-05
... ... ... ... ... ... ... ... ...
16M 2016-09-23
16M+1 2016-09-24
... ... ... ... ... ... ... ... ...
24M 2017-01-06
ELIMINATED PARTITION
ELIMINATED PARTITION
MariaDB ColumnStore
Analytics Use Cases
Healthcare / Life Science Industry
Genome analysis
• In-depth genome research for the dairy industry to improve production of milk and protein.
• Fast data load for large amount of genome dataset (DNA data for 7billion cows in US - 20GB per load)
Healthcare spending analysis
• Analyze 3TB of US health care spending for 155 conditions with 7 years of historical data
• Used sankey diagram, treemap, and pyramid chart to analyze trends by age, sex, type of care, and condition
Why MariaDB ColumnStore
• Strong security features including role based data access and audit plug in
• MPP architecture handles analytics on big data with high speed
• Easy to analyze archived data with SQL based analytics
• Does not require DBA to index or partition data
Telecommunication Industry
Customer behavior analysis
• Analyze call data record to segment customers based on their behavior
• Data-driven analysis for customer satisfaction
• Create behavioral based upsell or cross-sell opportunity
Call data analysis
• Data size: 6TB
• Ingest 1.5 million rows of logs per day with 30million texts and 3million calls
• Call and network quality analysis
• Provide higher quality customer services based on data
Why MariaDB ColumnStore
• ColumnStore support time based partitioning and time-series analysis
• Fast data load for real-time analytics
• MPP architecture handles analytics on big data with high speed
• Easy to analyze the archived data with SQL based analytics
Writing Data
ETL
32
Bulk Data Load
cpimport, LOAD DATA INFILE
Bulk Data Export
mysql client, odbc, jdbc
Integration with MariaDB
ColumnStore cpimport and sql
interface
Bulk Data Load: cpimport
• Fastest way to load data into MariaDB ColumnStore
• Load data from CSV file
cpimport dbName tblName [loadFile]
• Load data from Standard Input
mysql -e 'select * from source_table;' -N db2 | cpimport destination_db
destination_tbl -s 't‘
• Load data from Binary Source file
cpimport -I1 mydb mytable sourcefile.bin
• Multiple tables in can be loaded in parallel by launching multiple jobs
• Read queries continue without being blocked
• Successful cpimport is auto-committed
• In case of errors, entire load is rolled back
Traditional way of
importing data into
any MariaDB storage
engine table
Bulk Data Load:
LOAD DATA INFILE
Up to 2 times slower
than cpimport for
large size imports
mysql> load data infile '/tmp/
outfile1.txt' into table destinationTable;
Query OK, 9765625 rows affected
(2 min 20.01 sec)
Records: 9765625 Deleted:
0 Skipped: 0 Warnings: 0
Either success or
error operation can
be rolled back
Sizing
Minimum Spec
UM
4 core,
32 G RAM PM
4 core,
16 G RAM
Typical Server spec
PM
8 core 64G RAM
UM
8 core, 264G RAM
Data Storage
External Data Volumes
• Maximum 2 data volume per IO
channel per PM node server
• up to 2TB on the disk per data
volume ≈ Max 4 TB per PM node
Local disk
Up to 2TB on the disk per
PM node server
DETAILED SIZING GUIDE
based on data size
and workload
Thank you

More Related Content

PDF
Big Data Analytics with MariaDB AX
PDF
The final frontier v3
PDF
The final frontier
PPTX
Maximizing Service Maps To Include The Critical CIs on The Mainframe
PDF
Transactional and Analytics together: MariaDB and ColumnStore
PDF
Sesión técnica: Big Data Analytics con MariaDB ColumnStore
PDF
MariaDB AX: Analytics with MariaDB ColumnStore
PDF
MariaDB AX: Solución analítica con ColumnStore
Big Data Analytics with MariaDB AX
The final frontier v3
The final frontier
Maximizing Service Maps To Include The Critical CIs on The Mainframe
Transactional and Analytics together: MariaDB and ColumnStore
Sesión técnica: Big Data Analytics con MariaDB ColumnStore
MariaDB AX: Analytics with MariaDB ColumnStore
MariaDB AX: Solución analítica con ColumnStore

Similar to Delivering fast, powerful and scalable analytics #OPEN18 (20)

PDF
In-depth session: Big Data Analytics with MariaDB AX
PDF
Big Data Analytics with MariaDB ColumnStore
PDF
MariaDB ColumnStore - LONDON MySQL Meetup
PDF
Big-Data-Analysen mit MariaDB ColumnStore
PDF
Big Data LDN 2017: Big Data Analytics with MariaDB ColumnStore
PDF
Big Data Analytics with MariaDB ColumnStore
PDF
04 2017 emea_roadshowmilan_mariadb columnstore
PDF
What to expect from MariaDB Platform X5, part 2
PPT
Making MySQL Great For Business Intelligence
PDF
[db tech showcase OSS 2017] A23: Analytics with MariaDB ColumnStore by MariaD...
PDF
[db tech showcase OSS 2017] A25: Replacing Oracle Database at DBS Bank by Mar...
PDF
Introduction of MariaDB AX / TX
PPTX
Modernizing Mission-Critical Apps with SQL Server
PDF
MariaDB ColumnStore
PDF
Big Data Analytics with MariaDB ColumnStore
PDF
Operational-Analytics
PDF
MariaDB ColumnStore
PDF
Intro to column stores
PDF
Fast, Powerful and Scalable Analytics
PDF
Columnar databases on Big data analytics
In-depth session: Big Data Analytics with MariaDB AX
Big Data Analytics with MariaDB ColumnStore
MariaDB ColumnStore - LONDON MySQL Meetup
Big-Data-Analysen mit MariaDB ColumnStore
Big Data LDN 2017: Big Data Analytics with MariaDB ColumnStore
Big Data Analytics with MariaDB ColumnStore
04 2017 emea_roadshowmilan_mariadb columnstore
What to expect from MariaDB Platform X5, part 2
Making MySQL Great For Business Intelligence
[db tech showcase OSS 2017] A23: Analytics with MariaDB ColumnStore by MariaD...
[db tech showcase OSS 2017] A25: Replacing Oracle Database at DBS Bank by Mar...
Introduction of MariaDB AX / TX
Modernizing Mission-Critical Apps with SQL Server
MariaDB ColumnStore
Big Data Analytics with MariaDB ColumnStore
Operational-Analytics
MariaDB ColumnStore
Intro to column stores
Fast, Powerful and Scalable Analytics
Columnar databases on Big data analytics
Ad

More from Kangaroot (20)

PPTX
So you think you know SUSE?
PDF
Live demo: Protect your Data
PDF
RootStack - Devfactory
PDF
Welcome at OPEN'22
PDF
EDB Postgres in Public Sector
PDF
Deploying NGINX in Cloud Native Kubernetes
PDF
Cloud demystified, what remains after the fog has lifted.
PDF
Zimbra at Kangaroot / OPEN{virtual}
PDF
NGINX Controller: faster deployments, fewer headaches
PDF
Kangaroot EDB Webinar Best Practices in Security with PostgreSQL
PDF
Do you want to start with OpenShift but don’t have the manpower, knowledge, e...
PDF
Red Hat multi-cluster management & what's new in OpenShift
PDF
There is no such thing as “Vanilla Kubernetes”
PDF
Elastic SIEM (Endpoint Security)
PDF
Hashicorp Vault - OPEN Public Sector
PDF
Kangaroot - Bechtle kadercontracten
PDF
Red Hat Enterprise Linux 8
PDF
Kangaroot open shift best practices - straight from the battlefield
PDF
Kubecontrol - managed Kubernetes by Kangaroot
PDF
OpenShift 4, the smarter Kubernetes platform
So you think you know SUSE?
Live demo: Protect your Data
RootStack - Devfactory
Welcome at OPEN'22
EDB Postgres in Public Sector
Deploying NGINX in Cloud Native Kubernetes
Cloud demystified, what remains after the fog has lifted.
Zimbra at Kangaroot / OPEN{virtual}
NGINX Controller: faster deployments, fewer headaches
Kangaroot EDB Webinar Best Practices in Security with PostgreSQL
Do you want to start with OpenShift but don’t have the manpower, knowledge, e...
Red Hat multi-cluster management & what's new in OpenShift
There is no such thing as “Vanilla Kubernetes”
Elastic SIEM (Endpoint Security)
Hashicorp Vault - OPEN Public Sector
Kangaroot - Bechtle kadercontracten
Red Hat Enterprise Linux 8
Kangaroot open shift best practices - straight from the battlefield
Kubecontrol - managed Kubernetes by Kangaroot
OpenShift 4, the smarter Kubernetes platform
Ad

Recently uploaded (20)

PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
Machine learning based COVID-19 study performance prediction
PPT
Teaching material agriculture food technology
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PPTX
Spectroscopy.pptx food analysis technology
PDF
Review of recent advances in non-invasive hemoglobin estimation
PDF
Approach and Philosophy of On baking technology
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PPTX
Programs and apps: productivity, graphics, security and other tools
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PPTX
Big Data Technologies - Introduction.pptx
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PPTX
MYSQL Presentation for SQL database connectivity
PPTX
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
PDF
Encapsulation_ Review paper, used for researhc scholars
Digital-Transformation-Roadmap-for-Companies.pptx
Mobile App Security Testing_ A Comprehensive Guide.pdf
Agricultural_Statistics_at_a_Glance_2022_0.pdf
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Machine learning based COVID-19 study performance prediction
Teaching material agriculture food technology
Per capita expenditure prediction using model stacking based on satellite ima...
Spectroscopy.pptx food analysis technology
Review of recent advances in non-invasive hemoglobin estimation
Approach and Philosophy of On baking technology
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Programs and apps: productivity, graphics, security and other tools
Building Integrated photovoltaic BIPV_UPV.pdf
Big Data Technologies - Introduction.pptx
Dropbox Q2 2025 Financial Results & Investor Presentation
Chapter 3 Spatial Domain Image Processing.pdf
MYSQL Presentation for SQL database connectivity
ACSFv1EN-58255 AWS Academy Cloud Security Foundations.pptx
Encapsulation_ Review paper, used for researhc scholars

Delivering fast, powerful and scalable analytics #OPEN18

  • 1. Big Data Analytics with MariaDB AX Maria Luisa Raviol Senior Sales Engineer EMEA
  • 2. Agenda • The Task - Analytics – Why and what • The Requirements – What do we need for analytics • The Solution – Column Based Storage • The Product – MariaDB AX and MariaDB ColumnStore • The Uses – MariaDB ColumnStore in action
  • 3. Why Analytics ? • Get the most value of your data asset • Faster Better decision making process • Cost reduction • New products and services
  • 4. What is likely to happen? Why is it happening? Types of analytics What is happening? What should I do about it?
  • 5. Descriptive: What happened ? ● Reports ○ Sales Report ○ Expense summary ● Ad-hoc requests to analyst
  • 6. Diagnostics: Why did it happen • Aggregates: aggregate measure over one or more dimension – Find total sales – Top five product ranked by sales • Roll-ups: Aggregate at different levels of dimension hierarchy – given total sales by city, roll-up to get sales by state • Drill-down: Inverse of roll-ups – given total sales by state, drill-down to get total by city • Slicing and Dicing: – Equality and range selections on one or more dimensions
  • 7. Predictive: What is likely to happen • Sales Prediction – Analyze data to identify trends, spot weakness or determine conditions among broader data sets for making decisions about the future • Targeted marketing – what is likelihood of a customer buying a particular product based on past buying behavior
  • 8. Big Data Analytics Use Cases By industry Finance Identify trade patterns Detect fraud and anomolies Predict trading outcomes Manufacturing Simulations to improve design/yield Detect production anomolies Predict machine failures (sensor data) Telecom Behavioral analysis of customer calls Network analysis (perf and reliability) Healthcare Find genetic profiles/matches Analyze health vs spending Predict viral oubreaks
  • 9. The analytics workload • Deals with data from a high level perspective • Handles data in large groups of rows – SELECTs data by date, customer location, product id etc. – Data is loaded in batch or streamed in – Data is mostly just INSERTed • Dealing with individual data items is usually ineffective • Data structures are optimized for analytics use and performance • Data is sometimes purged, but just as often not • Contains structured, semi-structured and sometimes unstructured data • Data often comes from many different sources, internal and external • Queries are ad-hoc, largely • Transactions and ACID requirements are relaxed
  • 10. Analytics database requirements • Fast access to large amounts of data • Scalable as data grows over time – Analytics requirements increasing – Regulatory requirements – New data sources are added • Load performance must be fast, scalable and predictable • Data loading should be very flexible due to the different sources of data – Some data loaded in batch, other is streamed • Query performance also need to be scalable • Data compression is a requirement – Data size constraints, as well as read performance from disk
  • 11. In summary, what do we need • Something that can compress data A LOT • Something that can be written to with fast and predictable performance • Something that doesn't necessarily support transactions – It doesn't hurt, but performance is so much more important • Something that can support analytics queries – Ad-hoc queries – Aggregate queries • Something that can scale as data grows • Something that can still have a level of high availability • Something that works with analytics tools, like Tableau, R etc.
  • 12. Existing Approaches Limited real time analytics Slow releases of product innovation Expensive hardware and software Data Warehouses Hadoop / NoSQL LIMITED SQL SUPPORT DIFFICULT TO INSTALL/MANAGE LIMITED TALENT POOL DATA LAKE W/ NO DATA MANAGEMENT Hard to use
  • 14. Row-oriented vs. Column-oriented format • Row oriented – Rows stored sequentially in a file – Scans through every record row by row • Column oriented: – Each column is stored in a separate file – Scans only the relevant columns ID Fname Lname State Zip Phone Age Sex 1 Bugs Bunny NY 11217 (718) 938-3235 34 M 2 Yosemite Sam CA 95389 (209) 375-6572 52 M 3 Daffy Duck NY 10013 (212) 227-1810 35 M 4 Elmer Fudd ME 04578 (207) 882-7323 43 M 5 Witch Hazel MA 01970 (978) 744-0991 57 F ID 1 2 3 4 5 Fname Bugs Yosemite Daffy Elmer Witch Lname Bunny Sam Duck Fudd Hazel State NY CA NY ME MA Zip 11217 95389 10013 04578 01970 Phone (718) 938-3235 (209) 375-6572 (212) 227-1810 (207) 882-7323 (978) 744-0991 Age 34 52 35 43 57 Sex M M M M F SELECT Fname FROM People WHERE State = 'NY'
  • 15. Single-Row Operations - Insert Row oriented: new rows appended to the end. Column oriented: new value added to each file. Key Fname Lname State Zip Phone Age Sex 1 Bugs Bunny NY 11217 (718) 938-3235 34 M 2 Yosemite Sam CA 95389 (209) 375-6572 52 M 3 Daffy Duck NY 10013 (212) 227-1810 35 M 4 Elmer Fudd ME 04578 (207) 882-7323 43 M 5 Witch Hazel MA 01970 (978) 744-0991 57 F 6 Marvin Martian CA 91602 (818) 761-9964 26 M Key 1 2 3 4 5 Fname Bugs Yosemite Daffy Elmer Witch Lname Bunny Sam Duck Fudd Hazel State NY CA NY ME MA Zip 11217 95389 10013 04578 01970 Phone (718) 938-3235 (209) 375-6572 (212) 227-1810 (207) 882-7323 (978) 744-0991 Age 34 52 35 43 57 Sex M M M M F 6 Marvin Martian CA 91602 (818) 761-9964 26 M Columnar insert not efficient for singleton insertions (OLTP). Batch loads touches row vs. column. Batch load on column-oriented is faster (compression, no indexes).
  • 16. Single-Row Operations - Update Row oriented: Update 100% of rows means change 100% of blocks on disk. Column oriented: Just update the blocks needed to be updated Key Fname Lname State Zip Phone Age Sex 1 Bugs Bunny NY 11217 (718) 938-3235 34 M 2 Yosemite Sam CA 95389 (209) 375-6572 52 M 3 Daffy Duck NY 10013 (212) 227-1810 35 M 4 Elmer Fudd ME 04578 (207) 882-7323 43 M 5 Witch Hazel MA 01970 (978) 744-0991 57 F Key 1 2 3 4 5 Fname Bugs Yosemite Daffy Elmer Witch Lname Bunny Sam Duck Fudd Hazel State NY CA NY ME MA Zip 11217 95389 10013 04578 01970 Phone (718) 938-3235 (209) 375-6572 (212) 227-1810 (207) 882-7323 (978) 744-0991 Age 34 52 35 43 57 Sex M M M M F
  • 17. Single-Row Operations - Delete Row oriented: new rows deleted Column oriented: value deleted from each file Key Fname Lname State Zip Phone Age Sex 1 Bugs Bunny NY 11217 (718) 938-3235 34 M 2 Yosemite Sam CA 95389 (209) 375-6572 52 M 3 Daffy Duck NY 10013 (212) 227-1810 35 M 4 Elmer Fudd ME 04578 (207) 882-7323 43 M 5 Witch Hazel MA 01970 (978) 744-0991 57 F 6 Marvin Martian CA 91602 (818) 761-9964 26 M Key 1 2 3 4 5 Fname Bugs Yosemite Daffy Elmer Witch Lname Bunny Sam Duck Fudd Hazel State NY CA NY ME MA Zip 11217 95389 10013 04578 01970 Phone (718) 938-3235 (209) 375-6572 (212) 227-1810 (207) 882-7323 (978) 744-0991 Age 34 52 35 43 57 Sex M M M M F 6 Marvin Martian CA 91602 (818) 761-9964 26 M
  • 18. Changing the table structure Row oriented: requires rebuilding of the whole table Column oriented: Create new file for the new column Column-oriented is very flexible for adding columns, no need for a full rebuild required with it. Key Fname Lname State Zip Phone Age Sex Active 1 Bugs Bunny NY 11217 (718) 938-3235 34 M Y 2 Yosemite Sam CA 95389 (209) 375-6572 52 M N 3 Daffy Duck NY 10013 (212) 227-1810 35 M N 4 Elmer Fudd ME 04578 (207) 882-7323 43 M Y 5 Witch Hazel MA 01970 (978) 744-0991 57 F N Key 1 2 3 4 5 Fname Bugs Yosemite Daffy Elmer Witch Lname Bunny Sam Duck Fudd Hazel State NY CA NY ME MA Zip 11217 95389 10013 04578 01970 Phone (718) 938-3235 (209) 375-6572 (212) 227-1810 (207) 882-7323 (978) 744-0991 Age 34 52 35 43 57 Sex M M M M F Active Y N N Y N
  • 20. MariaDB ColumnStore High performance columnar storage engine that supports a wide variety of analytical use cases in highly scalable distributed environments Parallel query processing for distributed environments Faster, More Efficient Queries Single Interface for OLTP and analytics Easy to Manage and Scale Easier Enterprise Analytics Power of SQL and Freedom of Open Source to Big Data Analytics Better Price Performance
  • 21. Easier Enterprise Analytics ANSI SQL Single SQL Front-end • Use a single SQL interface for analytics and OLTP • Leverage MariaDB Security features - Encryption for data in motion , role based access and auditability Full ANSI SQL • No more SQL “like” query • Support complex join, aggregation and window function Easy to manage and scale • Eliminate needs for indexes and views • Automated horizontal/vertical partitioning • Linear scalable by adding new nodes as data grows • Out of box connection with BI tools
  • 22. Faster, More Efficient Queries Optimized for Columnar storage • Columnar storage reduces disk I/O • Blazing fast read-intensive workload • Ultra fast data import Parallel distributed query execution • Distributed queries into series of parallel operations • Fully parallel high speed data ingestion Highly available analytic environment • Built-in Redundancy • Automatic fail-over Parallel Query Processing
  • 23. MariaDB ColumnStore Architecture Columnar Distributed Data Storage User Connections User Module nUser Module 1 Performance Module n Performance Module 2 Performance Module 1 MariaDB Front End Query Engine User Module Processes SQL Requests Performance Module Distributed Processing Engine
  • 24. Storage Architecture • Columnar storage – Each column stored as separate file – No index management for query performance tuning – Online Schema changes: Add new column without impacting running queries • Automatic horizontal partitioning – Logical partition every 8 Million rows – In memory metadata of partition min and max – Query engine performs partition elimination. – No partition management for query performance tuning • Compression – Accelerate decompression rate – Reduce I/O for compressed blocks Column 1 Extent 1 (8 million rows, 8MB 64MB) Extent 2 (8 million rows) Extent M (8 million rows) Column 2 Column 3 ... Column N Data automatically arranged by • Column – Acts as Vertical Partitioning • Extents – Acts as horizontal partition Vertical Partition Horizontal Partition ... Vertical Partition Vertical Partition Vertical Partition Horizontal Partition Horizontal Partition
  • 25. High Performance Query Processing Horizontal Partition: 8 Million Rows Extent 2 Horizontal Partition: 8 Million Rows Extent 3 Horizontal Partition: 8 Million Rows Extent 1 Storage Architecture reduces I/O • Only touch column files that are in filter, projection, group by, and join conditions • Eliminate disk block touches to partitions outside filter and join conditions Extent 1: ShipDate: 2016-01-12 - 2016-03-05 Extent 2: ShipDate: 2016-03-05 - 2016-09-23 Extent 3: ShipDate: 2016-09-24 - 2017-01-06 SELECT Item, sum(Quantity) FROM Orders WHERE ShipDate between ‘2016-01-01’ and ‘2016-01-31’ GROUP BY Item Id OrderId Line Item Quantity Price Supplier ShipDate ShipMode 1 1 1 Laptop 5 1000 Dell 2016-01-12 G 2 1 2 Monitor 5 200 LG 2016-01-13 G 3 2 1 Mouse 1 20 Logitech 2016-02-05 M 4 3 1 Laptop 3 1600 Apple 2016-01-31 P ... ... ... ... ... ... ... ... ... 8M 2016-03-05 8M+1 2016-03-05 ... ... ... ... ... ... ... ... ... 16M 2016-09-23 16M+1 2016-09-24 ... ... ... ... ... ... ... ... ... 24M 2017-01-06 ELIMINATED PARTITION ELIMINATED PARTITION
  • 27. Healthcare / Life Science Industry Genome analysis • In-depth genome research for the dairy industry to improve production of milk and protein. • Fast data load for large amount of genome dataset (DNA data for 7billion cows in US - 20GB per load) Healthcare spending analysis • Analyze 3TB of US health care spending for 155 conditions with 7 years of historical data • Used sankey diagram, treemap, and pyramid chart to analyze trends by age, sex, type of care, and condition Why MariaDB ColumnStore • Strong security features including role based data access and audit plug in • MPP architecture handles analytics on big data with high speed • Easy to analyze archived data with SQL based analytics • Does not require DBA to index or partition data
  • 28. Telecommunication Industry Customer behavior analysis • Analyze call data record to segment customers based on their behavior • Data-driven analysis for customer satisfaction • Create behavioral based upsell or cross-sell opportunity Call data analysis • Data size: 6TB • Ingest 1.5 million rows of logs per day with 30million texts and 3million calls • Call and network quality analysis • Provide higher quality customer services based on data Why MariaDB ColumnStore • ColumnStore support time based partitioning and time-series analysis • Fast data load for real-time analytics • MPP architecture handles analytics on big data with high speed • Easy to analyze the archived data with SQL based analytics
  • 31. Bulk Data Load cpimport, LOAD DATA INFILE Bulk Data Export mysql client, odbc, jdbc Integration with MariaDB ColumnStore cpimport and sql interface
  • 32. Bulk Data Load: cpimport • Fastest way to load data into MariaDB ColumnStore • Load data from CSV file cpimport dbName tblName [loadFile] • Load data from Standard Input mysql -e 'select * from source_table;' -N db2 | cpimport destination_db destination_tbl -s 't‘ • Load data from Binary Source file cpimport -I1 mydb mytable sourcefile.bin • Multiple tables in can be loaded in parallel by launching multiple jobs • Read queries continue without being blocked • Successful cpimport is auto-committed • In case of errors, entire load is rolled back
  • 33. Traditional way of importing data into any MariaDB storage engine table Bulk Data Load: LOAD DATA INFILE Up to 2 times slower than cpimport for large size imports mysql> load data infile '/tmp/ outfile1.txt' into table destinationTable; Query OK, 9765625 rows affected (2 min 20.01 sec) Records: 9765625 Deleted: 0 Skipped: 0 Warnings: 0 Either success or error operation can be rolled back
  • 34. Sizing Minimum Spec UM 4 core, 32 G RAM PM 4 core, 16 G RAM Typical Server spec PM 8 core 64G RAM UM 8 core, 264G RAM Data Storage External Data Volumes • Maximum 2 data volume per IO channel per PM node server • up to 2TB on the disk per data volume ≈ Max 4 TB per PM node Local disk Up to 2TB on the disk per PM node server DETAILED SIZING GUIDE based on data size and workload