Modernizing Your Data Warehouse using APS
Big data. Small data. All data.
Stéphane Fréchette - SQL Server MVP - @sfrechette
Database / Business Intelligence Solution Architect
- Gartner, “The State of Data Warehousing in 2012”
1 Increasing data volumes
2 Real-time data
3 New data sources and types
4 Cloud-born data
Data sources
The modern data warehouse 
Data sources Non-relational data
Insights from all your data 
Enrich and optimize your data from non-traditional sources 
5
Roadblocks to a modern data warehouse
Keep legacy investment
Buy new tier-one hardware appliance
Acquire Big Data solution
Acquire business intelligence
Limited scalability and ability to handle new data types
Significant training and data silos
High acquisition and migration costs
Complex with low adoption
Introducing the Microsoft Analytics Platform System
The turnkey modern data warehouse appliance
• Relational and non-relational data in a single appliance
• Enterprise-ready Hadoop
• Integrated querying across Hadoop and PDW using T-SQL
• Direct integration with Microsoft BI tools such as Microsoft Excel
• Near real-time performance with In-Memory Columnstore
• Ability to scale out to accommodate growing data
• Removal of data warehouse bottlenecks with MPP SQL Server
• Concurrency that fuels rapid adoption
• Industry’s lowest data warehouse appliance price per terabyte
• Value through a single appliance solution
• Value with flexible hardware options using commodity hardware
Microsoft Analytics Platform System 
The turnkey modern data warehouse appliance
Evolution in the nature and use of data in the enterprise
[Chart: data complexity (variety and velocity, up to petabytes) against value to the business; stages shown: historical analysis, insight analysis, predictive analytics, predictive forecasting]
What is Hadoop?
Hadoop clusters provide scale-out storage and distributed data processing on commodity hardware
[Diagram: Hadoop stack with the groups operational services, core services, data services, and load & extract. Components shown: Ambari, MapReduce, HDFS, Flume, Sqoop, NFS, WebHDFS, Oozie, YARN, Hive & HCatalog, Pig, Falcon, HBase. Below the stack: a Hadoop cluster of compute & storage nodes.]
Manageable, secured, and highly available Hadoop integrated into the appliance
High performance and tuned within the appliance
End-user authentication with Active Directory
Accessible insights for everyone with Microsoft BI tools
Managed and monitored using System Center
100-percent Apache Hadoop
[Diagram labels: SQL Server Parallel Data Warehouse, PolyBase, Microsoft HDInsight]
[Diagram labels: Parallel Data Warehouse workload, HDInsight workload, Fabric, Hardware, Appliance]
A region is a logical container within an appliance
Each workload contains the following boundaries:
• Security
• Metering
• Servicing
Bringing Hadoop point solutions and the data warehouse together for users and IT
Provides a single T-SQL query model for PDW and Hadoop with rich features of T-SQL, including joins without ETL
Uses the power of MPP to enhance query execution performance
Supports Windows Azure HDInsight to enable new hybrid cloud scenarios
Provides the ability to query non-Microsoft Hadoop distributions, such as Hortonworks and Cloudera
[Diagram: a SELECT against SQL Server Parallel Data Warehouse goes through PolyBase to Microsoft Azure HDInsight, Microsoft HDInsight, Hortonworks for Windows and Linux, and Cloudera, and comes back as a result set]
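A minimal sketch of what "joins without ETL" could look like in practice. All object names here are hypothetical: dbo.DimProduct stands in for an ordinary PDW table and dbo.Twitter_Table for an external table that PolyBase maps onto files in Hadoop. The point is only that both sides are addressed with the same T-SQL.

```sql
-- Hypothetical objects (not from the deck):
--   dbo.DimProduct    regular PDW table
--   dbo.Twitter_Table external table over Hadoop files
SELECT p.ProductName,
       COUNT(*)                                  AS Mentions,
       AVG(CAST(t.Sentiment AS DECIMAL(5, 2)))   AS AvgSentiment
FROM dbo.DimProduct AS p
JOIN dbo.Twitter_Table AS t
    ON t.Product = p.ProductCode
GROUP BY p.ProductName
ORDER BY Mentions DESC;
```

No data is copied into the warehouse ahead of time; the Hadoop rows are read when the query runs.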
Direct and parallelized HDFS access
Enhancing the Data Movement Service (DMS) of APS to allow direct communication between HDFS data nodes and PDW compute nodes
[Diagram: non-relational data (social apps, sensor and RFID, mobile apps, web apps) sits in Hadoop; relational data comes from traditional schema-based data warehouse applications. Regular T-SQL goes to the enhanced PDW query engine, which works with an external table, external data source, and external file format and reaches Hadoop through the HDFS bridge in PDW. Output label: Results.]
[Diagram labels: Source systems; Hadoop / Data Lake (Cloudera, Hortonworks, HDInsight); APS (SQL Server Parallel Data Warehouse, PolyBase, Microsoft HDInsight); Day / Hour / Minute Refresh; MapReduce; T-SQL; SQL Server Data Marts; SQL Server Reporting Services; SQL Server Analysis Services; Analytics / Ad-hoc / Visualization]
HDFS File / Directory
//hdfs/social_media/twitter
//hdfs/social_media/twitter/Daily.log
Hadoop
Dynamic binding
Column filtering
Row filtering
User    Location  Product  Date
Sean    CA        xbox     5-15-14
Audie   CO        excel    5-15-14
Suz     WA        xbox     5-15-14
Tom     IL        sqls     5-13-14
Sanjay  MN        wp8      5-14-14
Roger   TX        ssas     5-14-14
Steve   AL        ssrs     5-13-14
Sentiment, Rtwt, and Hour values (row assignment not recoverable): -1 0 1 1 1 1 5 0 8 0 0 0 8 8 2 2 1 23 23
SELECT [User], Product, Sentiment
FROM Twitter_Table
WHERE [Hour] = DATEPART(HOUR, GETDATE()) - 1
AND [Date] = CAST(GETDATE() AS DATE)
AND Sentiment >= 0
Improve APS operations by extending PolyBase
HDFS file formats: Textfile and RCFile support
• Microsoft Azure HDInsight
• HDInsight on APS
• Hortonworks Data Platform 1.3 and 2.0 (Linux/Windows Server)
• Cloudera Linux 4.3
Security and permission model
External table source and file format syntax
Microsoft Azure Storage Blobs
AU1 PolyBase v2
Analytics Platform System (powered by PolyBase)
Big Data insights for anyone
New insights with familiar tools through native Microsoft BI integration
Minimizes IT intervention for discovering data with tools such as Microsoft Excel
Enables DBAs and power users to join relational and Hadoop data with T-SQL
Takes advantage of high adoption of Excel, Power View, PowerPivot, and SQL Server Analysis Services
Offers Hadoop tools like MapReduce, Hive, and Pig for data scientists
[Diagram audiences: everyone else using Microsoft BI tools, power users, data scientists]
CREATE EXTERNAL TABLE table_name
({<column_definition>} [,...n])
{WITH (
    DATA_SOURCE = <data_source>,
    FILE_FORMAT = <file_format>,
    LOCATION = '<file_path>',
    [REJECT_VALUE = <value>],
    …)};
1 DATA_SOURCE: referencing external data source
2 FILE_FORMAT: referencing external file format
3 LOCATION: path of the Hadoop file/folder
4 REJECT_VALUE: (optional) reject parameters
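As an illustration of the template above, here is one way the Twitter data from the earlier slide might be declared. This is a sketch under assumptions: the column types, the folder path, the reject threshold, and the names HadoopCluster and PipeDelimitedText are all hypothetical, and the data source and file format are assumed to exist already.

```sql
-- Hypothetical external table; column types are guesses from the sample rows.
-- [User], [Hour] and [Date] are bracketed because they collide with
-- T-SQL keywords or type names.
CREATE EXTERNAL TABLE dbo.Twitter_Table (
    [User]    VARCHAR(50) NOT NULL,
    Location  CHAR(2)     NULL,
    Product   VARCHAR(20) NULL,
    Sentiment INT         NULL,
    Rtwt      INT         NULL,
    [Hour]    INT         NULL,
    [Date]    DATE        NULL
)
WITH (
    DATA_SOURCE  = HadoopCluster,            -- assumed to exist
    FILE_FORMAT  = PipeDelimitedText,        -- assumed to exist
    LOCATION     = '/social_media/twitter/', -- folder: every file in it is read
    REJECT_TYPE  = VALUE,
    REJECT_VALUE = 10                        -- fail the query after 10 bad rows
);
```

Pointing LOCATION at a folder rather than a single file means newly landed files are picked up by later queries without redefining the table.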
CREATE EXTERNAL DATA SOURCE datasource_name
{WITH (
    TYPE = <data_source>,
    LOCATION = '<location>',
    [JOB_TRACKER_LOCATION = '<jb_location>']
)};
1 TYPE: type of external data source
2 LOCATION: location of external data source
3 JOB_TRACKER_LOCATION: enabling or disabling of MapReduce job generation
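A filled-in version of the data source template might look like the following. The name, host address, and ports are placeholders rather than values from the deck. The option name JOB_TRACKER_LOCATION follows the slide; later SQL Server releases document the equivalent setting as RESOURCE_MANAGER_LOCATION, so check the syntax for the version in use.

```sql
-- Hypothetical Hadoop cluster; address and ports are placeholders.
CREATE EXTERNAL DATA SOURCE HadoopCluster
WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://10.0.0.10:8020',
    -- Supplying a job tracker location allows PolyBase to generate
    -- MapReduce jobs on the cluster; leaving it out disables that.
    JOB_TRACKER_LOCATION = '10.0.0.10:8050'
);
```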
CREATE EXTERNAL FILE FORMAT fileformat_name
{WITH (
    FORMAT_TYPE = <type>,
    [SERDE_METHOD = '<serde_method>',]
    [DATA_COMPRESSION = '<compr_method>',]
    [FORMAT_OPTIONS (<format_options>)]
)};
1 FORMAT_TYPE: type of external file format
2 SERDE_METHOD: (de)serialization method [Hive RCFile]
3 DATA_COMPRESSION: compression method
4 FORMAT_OPTIONS: (optional) format options [text files]
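For the RCFile case, a definition could look like the sketch below. The format name is hypothetical; the SerDe and codec are standard Hadoop/Hive class names, chosen here only as an example.

```sql
-- Hypothetical file format for Hive RCFile data.
CREATE EXTERNAL FILE FORMAT RcFileFormat
WITH (
    FORMAT_TYPE = RCFILE,
    SERDE_METHOD = 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe',
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.DefaultCodec'
);
```

FORMAT_OPTIONS is left out because the options on the next slide apply to delimited text files.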
<format_options> ::=
    [,FIELD_TERMINATOR = 'value']
    [,STRING_DELIMITER = 'value']
    [,DATE_FORMAT = 'value']
    [,USE_TYPE_DEFAULT = 'value']
1 FIELD_TERMINATOR: column delimiter
2 STRING_DELIMITER: delimiter for string data types
3 DATE_FORMAT: to specify a particular date format
4 USE_TYPE_DEFAULT: how missing entries are handled
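A delimited-text format using these options might be declared as follows. The name and the delimiter choices are assumptions for illustration. DATE_FORMAT is omitted because the right pattern depends on how dates are written in the files.

```sql
-- Hypothetical file format for pipe-delimited text files.
CREATE EXTERNAL FILE FORMAT PipeDelimitedText
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',   -- column delimiter
        STRING_DELIMITER = '"',   -- wraps string values
        USE_TYPE_DEFAULT = TRUE   -- missing values become the type default
                                  -- (for example 0 for numeric columns)
    )
);
```

With USE_TYPE_DEFAULT = FALSE, missing values would be stored as NULL instead.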
Bringing islands of Hadoop data together 
Running high performance queries against Hadoop data 
Archiving data warehouse data to Hadoop (move) 
Exporting relational data to Hadoop (copy) 
Importing Hadoop data into a data warehouse (copy)
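The import and export scenarios listed above can be sketched with CREATE TABLE AS SELECT and CREATE EXTERNAL TABLE AS SELECT. Every object name, column, and path below is hypothetical, and the data source and file format are assumed to exist.

```sql
-- Import (copy): materialize Hadoop data as a distributed PDW table.
CREATE TABLE dbo.Twitter_Imported
WITH (DISTRIBUTION = HASH([User]), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT * FROM dbo.Twitter_Table;   -- hypothetical external table

-- Export (copy): write relational rows out to files in Hadoop.
CREATE EXTERNAL TABLE dbo.Sales_2012_Export
WITH (
    LOCATION    = '/archive/sales/2012/',
    DATA_SOURCE = HadoopCluster,
    FILE_FORMAT = PipeDelimitedText
)
AS
SELECT * FROM dbo.FactSales        -- hypothetical PDW table
WHERE OrderDateKey < 20130101;

-- Archive (move): after verifying the export, remove the rows locally.
DELETE FROM dbo.FactSales
WHERE OrderDateKey < 20130101;
```

The archived rows stay queryable through dbo.Sales_2012_Export, so a view that unions it with dbo.FactSales could keep reports unchanged.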
Microsoft Analytics Platform System 
The turnkey modern data warehouse appliance
Scale up
Rowstore
Diminishing scale as requirements grow
Sub-optimal performance for many data warehouse queries
[Diagram: querying data by row; rows R1 to R6 with columns C1 to C4 stored across Page 1, Page 2, and Page 3. Labels: Data, Forklift, Forklift.]
Scale out
• Multiple nodes with dedicated CPU, memory, and storage
• Ability to incrementally add hardware for near-linear scale to multiple petabytes
• Ability to handle query complexity and concurrency at scale
• No “forklift” of prior warehouse to increase capacity
• Ability to scale out HDInsight and PDW
Scaling out your data to petabytes
Scale-out technologies in the Analytics Platform System
[Diagram: a PDW unit with additional PDW / HDInsight units, on a scale from 0 terabytes to 6 petabytes]
Blazing-fast performance
MPP and In-Memory Columnstore for next-generation performance
Up to 100x faster queries
Up to 15x more compression
Updateable clustered columnstore vs. table with customary indexing
• Store data in columnar format for massive compression
• Load data into or out of memory for next-generation performance with up to 60% improvement in data loading speed
• Updateable and clustered for real-time trickle loading
[Diagram: columnstore index representation; parallel query execution from query to results]
Why is a clustered columnstore index important?
• Saves space
• Provides easier management by eliminating maintenance of secondary indexes
• Supports all PDW data types, including high-precision decimal data types and more
[Chart: space used in GB (table with 101 million rows), where space used = table space + index space; y-axis 0.0 to 20.0, bars numbered 1 to 6; callout: 91% savings]
In-Memory Columnstore is featured in the storage engine in PDW AU1
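To make the columnstore and scale-out points concrete, a fact table could be declared as in the sketch below. The table, its columns, and the choice of distribution key are hypothetical.

```sql
-- Hypothetical fact table: rows are hash-distributed across compute nodes
-- and stored as an updateable clustered columnstore index.
CREATE TABLE dbo.FactSales (
    OrderDateKey INT            NOT NULL,
    CustomerKey  INT            NOT NULL,
    ProductKey   INT            NOT NULL,
    Quantity     INT            NOT NULL,
    SalesAmount  DECIMAL(18, 2) NOT NULL
)
WITH (
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
);
```

Because the columnstore is the table's only storage structure, there are no secondary indexes to maintain, which is the management benefit the slide refers to.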
Relational query execution processing
1 SQL queries sent to control node
2 Control node creates query execution plan
3 Query plan creates distributed queries to run on each compute node
4 Distributed queries sent to compute nodes (all running in parallel)
5 Control node collects query results and returns them to user
[Diagram: the client sends the user query to the control node, which creates the query plan; compute nodes process query plan operations in parallel; the control node aggregates the query results and returns them to the client. Other labels: Appliance, Management.]
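One way to look at the plan the control node builds is the PDW EXPLAIN statement, which returns the plan without running the query. The table below is hypothetical and assumed to be hash-distributed on CustomerKey.

```sql
-- Returns the distributed query plan as XML; the query itself is not executed.
EXPLAIN
SELECT CustomerKey,
       SUM(SalesAmount) AS TotalSales
FROM dbo.FactSales      -- hypothetical table, DISTRIBUTION = HASH(CustomerKey)
GROUP BY CustomerKey;
```

Since the grouping column is also the distribution column in this sketch, all rows for a given customer sit in one distribution, so each compute node can aggregate its own rows and the control node only has to combine the partial result sets.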
Great performance with mixed workloads
[Diagram labels: ERP, CRM, LOB apps; Hadoop / Big Data; ETL/ELT with SSIS, DQS, MDS; ETL/ELT with DWLoader; SQL Server SMP; Analytics Platform System (PDW, PolyBase, HDInsight); reporting and cubes; BI tools; ad hoc queries; intra-day; near real-time; fast ad hoc; real-time; Columnstore; PolyBase; CRTAS; Link Table; ROLAP / MOLAP; DirectQuery; SNAC]
Microsoft Analytics Platform System 
The turnkey modern data warehouse appliance
High performance using commodity hardware
Price per terabyte for leading vendors
Significantly lower price per terabyte than the closest competitor
Lower storage costs with Windows Server 2012 Storage Spaces
[Chart: price per terabyte for user-available storage (compressed), in thousands of dollars ($0 to $30), for Oracle, EMC, IBM, Teradata, and Microsoft. NOTE: Orange line indicates average price per terabyte.]
Hardware and software engineered together
The ease of an appliance
Co-engineered with HP, Dell, and Quanta best practices
Leading performance with commodity hardware
Integrated support plan with a single Microsoft PDW contact
Pre-configured, built, and tuned software and hardware
PolyBase
HDInsight
Hardware architecture
[Diagram labels, Rack #1: InfiniBand (x2), Ethernet (x2), PDW region, HDInsight region, control node, master node, failover nodes, three sets of compute nodes with economical disk storage, networking]
[Diagram labels, Rack #2: InfiniBand (x2), Ethernet (x2), failover node, three sets of compute nodes with economical disk storage, HDI extension base unit (x2), HDI active scale unit (x2)]
[Unit detail labels: HST-01, HST-02, HSA-01, HST-02, economical disk storage, IB and Ethernet]
Active Unit: Addition of two or three compute nodes depending on OEM hardware configuration and related storage
Passive Unit: Host for non-worker HDInsight nodes
Failover Node: High availability for the rack
• PDW engine
• DMS Manager
• SQL Server 2012 Enterprise Edition (PDW build)
[Diagram: Base Unit with Host 1 to Host 4 and the virtual machines CTL, MAD, AD, VMM, Compute 1, and Compute 2; economical disk storage on direct attached SAS; IB and Ethernet]
Software details
• All hosts run Windows Server 2012 Standard and Windows Azure Virtual Machines
• Fabric or workload in Hyper-V Virtual Machines
• Fabric virtual machine, management server (MAD01), and control server (CTL) share one server
• PDW agent that runs on all hosts and all virtual machines
• DWConfig and Admin Console
• Windows Storage Spaces and Azure Storage blobs
[Diagram: Base Unit (Host 1 to Host 4) and Passive Unit (Host 5) with the virtual machines CTL, MAD, AD, FAB, VMM, Compute 1, and Compute 2; economical disk storage on direct attached SAS; IB and Ethernet]
Virtual machine migration can be used to move workload nodes to new hosts after hardware failure
Cluster Shared Volumes
• Enable all nodes to access logical unit numbers (LUNs) on economical disk storage
• Use Server Message Block (SMB3) protocol
Failover capabilities
• Uses one cluster across the whole appliance
• Automatically migrates virtual machines on host failure
• Enforces rules with affinity and anti-affinity maps
• Uses Windows Failover Cluster Manager
Modernizing Your Data Warehouse using APS

  • 1. Modernizing Your Data Warehouse using APS Big data. Small data. All data. Stéphane Fréchette - SQL Server MVP - @sfrechette Database / Business Intelligence Solution Architect
  • 2. - Gartner, “The State of Data Warehousing in 2012”
  • 3. Increasing data volumes 1 Real-time data 2 New data sources and types 3 4 Cloud-born data Data sources
  • 4.  The modern data warehouse Data sources Non-relational data
  • 5. Insights from all your data Enrich and optimize your data from non-traditional sources 5
  • 6. Roadblocks to a modern data warehouse Keep legacy investment Buy new tier-one hardware appliance Acquire Big Data solution Acquire business intelligence Limited scalability and ability to handle new data types Significant training and data silos High acquisition and migration costs Complex with low adoption
  • 7. Introducing the Microsoft Analytics Platform System The turnkey modern data warehouse appliance • Relational and non-relational data in a single appliance • Enterprise-ready Hadoop • Integrated querying across Hadoop and PDW using T-SQL • Direct integration with Microsoft BI tools such as Microsoft Excel • Near real-time performance with In-Memory Columnstore • Ability to scale out to accommodate growing data • Removal of data warehouse bottlenecks with MPP SQL Server • Concurrency that fuels rapid adoption • Industry’s lowest data warehouse appliance price per terabyte • Value through a single appliance solution • Value with flexible hardware options using commodity hardware
  • 8. Microsoft Analytics Platform System The turnkey modern data warehouse appliance
  • 9. Evolution in the nature and use of data in the enterprise Data complexity: variety and velocity Petabytes Historical analysis Insight analysis Predictive analytics Predictive forecasting Value to the business
  • 10. What is Hadoop? Microsoft Confidential 10 OPERATIONAL SERVICES AMBARI Core Services DATA SERVICES MAP REDUCE HDFS FLUME SQOOP LOAD & EXTRACT NFS WebHDFS OOZIE YARN HIVE & HCATALOG PIG FALCON HBASE Hadoop Cluster compute & . . . storage . . . . . compute & storage . . Hadoop clusters provide scale-out storage and distributed data processing on commodity hardware
  • 11. Manageable, secured, and highly available Hadoop integrated into the appliance High performance and tuned within the appliance End-user authentication with Active Directory Accessible insights for everyone with Microsoft BI tools Managed and monitored using System Center 100-percent Apache Hadoop SQL Server Parallel Data Warehouse PolyBase Microsoft HDInsight
  • 12. Parallel Data Warehouse workload HDInsight workload Fabric Hardware Appliance A region is a logical container within an appliance Each workload contains the following boundaries: • Security • Metering • Servicing
  • 13. Bringing Hadoop point solutions and the data warehouse together for users and IT Provides a single T-SQL query model for PDW and Hadoop with rich features of T-SQL, including joins without ETL Uses the power of MPP to enhance query execution performance Supports Windows Azure HDInsight to enable new hybrid cloud scenarios Provides the ability to query non-Microsoft Hadoop distributions, such as Hortonworks and Cloudera SQL Server Parallel Data Warehouse Microsoft Azure HDInsight PolyBase Microsoft HDInsight Hortonworks for Windows and Linux Cloudera Select… Result set
  • 14. Results Direct and parallelized HDFS access Enhancing the Data Movement Service (DMS) of APS to allow direct communication between HDFS data nodes and PDW compute nodes Non-relational data Social apps Sensor and RFID Mobile apps Web apps Hadoop Relational data Traditional schema-based data warehouse applications Regular T-SQL External table External data source External file format Enhanced PDW query engine HDFS bridge PDW
  • 15. Hadoop / Data Lake (Cloudera, Hortonworks, HDInsight) Source systems Day / Hour / Minute Refresh SQL Server Data Marts SQL Server Reporting Services SQL Server Analytics / Ad-hoc / Visualization MapReduce T-SQL SQL Server Parallel Data Warehouse PolyBase Microsoft HDInsight Analysis Services APS
  • 16. HDFS File / Directory //hdfs/social_media/twitter //hdfs/social_media/twitter/Daily.log 1 0 Hadoop Dynamic binding Column filtering Row filtering User Location Product Sentiment Rtwt Hour Date Sean Audie Suz Tom Sanjay Roger Steve CA CO WA IL MN TX AL xbox excel xbox sqls wp8 ssas ssrs -1 0 1 1 1 1 5 0 8 0 0 0 8 8 2 2 1 23 23 5-15-14 5-15-14 5-15-14 5-13-14 5-14-14 5-14-14 5-13-14 SELECT User, Product, Sentiment FROM Twitter_Table WHERE Hour = Current - 1 AND Date = Today AND Sentiment >= 0
  • 17. Improve APS operations by extending PolyBase HDFS file formats Textfile and RCFile support • Microsoft Azure HDInsight • HDInsight on APS • Hortonworks Data Platform 1.3 and 2.0 (Linux/Windows Server) • Cloudera Linux 4.3 Security and permission model External table source and file format syntax Microsoft Azure Storage Blobs AU1 PolyBase v2 Analytics Platform System (powered by PolyBase)
  • 18. Big Data insights for anyone New insights with familiar tools through native Microsoft BI integration Minimizes IT intervention for discovering data with tools such as Microsoft Excel Enables DBA and power users to join relational and Hadoop data with T-SQL Takes advantage of high adoption of Excel, Power View, PowerPivot, and SQL Server Analysis Services Offers Hadoop tools like MapReduce, Hive, and Pig for data scientists Everyone else using Microsoft BI tools Power users Data scientist
  • 19. CREATE EXTERNAL TABLE table_name ({<column_definition>}[,..n ]) {WITH ( DATA_SOURCE = <data_source>, FILE_FORMAT = <file_format>, LOCATION =‘<file_path>’, [REJECT_VALUE = <value>], …)}; 1 Referencing external data source 2 Referencing external file format 3 Path of the Hadoop file/folder 4 (Optional) Reject parameters
  • 20. CREATE EXTERNAL DATA SOURCE datasource_name {WITH ( TYPE = <data_source>, LOCATION =‘<location>’, [JOB_TRACKER_LOCATION = ‘<jb_location>’] }; 1 Type of external data source 2 Location of external data source Enabling or disabling of MapReduce job generation 3
  • 21. CREATE EXTERNAL FILE FORMAT fileformat_name {WITH ( FORMAT_TYPE = <type>, [SERDE_METHOD = ‘<sede_method>’,] [DATA_COMPRESSION = ‘<compr_method>’, [FORMAT_OPTIONS (<format_options>)] }; 1 Type of external data source 2 (De)Serialization method [Hive RCFile] 3 Compression method 4 (Optional) Format Options [Text Files]
  • 22. <Format Options> :: = [,FIELD_TERMINATOR = ‘value’], [,STRING_DELIMITER = ‘value’], [,DATE_FORMAT = ‘value’], [USE_TYPE_DEFAULT = ‘value’] 1 Column delimiter 2 Delimiter for string data types 3 To specify a particular date format 4 How missing entries are handled
  • 23. Bringing islands of Hadoop data together Running high performance queries against Hadoop data Archiving data warehouse data to Hadoop (move) Exporting relational data to Hadoop (copy) Importing Hadoop data into a data warehouse (copy)
  • 24. Microsoft Analytics Platform System The turnkey modern data warehouse appliance
  • 25. Scale up Rowstore Diminishing scale as requirements grow Data Querying data by row Page 1 Page 2 Page 3 C1 C2 C3 C4 R1 R1 R1 R1 R2 R2 R2 R2 R3 R3 R3 R3 R4 R4 R4 R4 R5 R5 R5 R5 R6 R6 R6 R6 Sub-optimal performance for many data warehouse queries Forklift Forklift
  • 26. Scale out Multiple nodes with dedicated CPU, memory, and storage Ability to incrementally add hardware for near-linear scale to multiple petabytes Ability to handle query complexity and concurrency at scale No “forklift” of prior warehouse to increase capacity Ability to scale out HDInsight and PDW Scaling out your data to petabytes Scale-out technologies in the Analytics Platform System PDW / HDInsight PDW / HDInsight PDW / HDInsight PDW PDW / HDInsight PDW / HDInsight PDW / HDInsight 0 terabytes 6 petabytes
  • 27. Blazing-fast performance MPP and In-Memory Columnstore for next-generation performance Up to 100x faster queries Updateable clustered columnstore vs. table with customary indexing • Store data in columnar format for massive compression • Load data into or out of memory for next-generation performance with up to 60% improvement in data loading speed • Updateable and clustered for real-time trickle loading Up to 15x more compression Columnstore index representation Parallel query execution Query Results
  • 28. Why is a clustered columnstore index important? • Saves space • Provides easier management by eliminating maintenance of secondary indexes • Supports all PDW data types, including high-precision decimal data types and more [Chart: space used in GB for a table with 101 million rows, where space used = table space + index space; six configurations compared on an axis from 0.0 to 20.0 GB; 91% savings] In-Memory Columnstore is featured in the storage engine in PDW AU1
  • 29. Relational query execution processing (1) SQL queries sent to control node (2) Control node creates query execution plan (3) Query plan creates distributed queries to run on each compute node (4) Distributed queries sent to compute nodes (all running in parallel) (5) Control node collects query results and returns them to user [Diagram labels: Client, User query, Control, Create query plan, Compute (x4), Appliance, Management, Aggregate query results, Query results; compute nodes process query plan operations in parallel]
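The flow on this slide is a scatter-gather pattern, which can be sketched in a few lines of Python. This is a toy model under stated assumptions, not PDW internals: the node names, the sample data, and the functions `run_distributed_query` and `control_node_execute` are hypothetical, and threads stand in for separate compute nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each compute node holds one slice of a sales table
# as (region, quantity) rows.
COMPUTE_NODES = {
    "compute1": [("East", 5), ("West", 2)],
    "compute2": [("East", 3), ("West", 4)],
    "compute3": [("East", 7), ("West", 6)],
}


def run_distributed_query(node_rows):
    """Work done on one compute node: sum quantity per region for its slice."""
    partial = {}
    for region, quantity in node_rows:
        partial[region] = partial.get(region, 0) + quantity
    return partial


def control_node_execute():
    """Fan the same distributed query out to every node, then merge results."""
    with ThreadPoolExecutor(max_workers=len(COMPUTE_NODES)) as pool:
        # Each node processes only its own slice, all at the same time.
        partials = list(pool.map(run_distributed_query, COMPUTE_NODES.values()))
    merged = {}
    for partial in partials:
        for region, subtotal in partial.items():
            merged[region] = merged.get(region, 0) + subtotal
    return merged


result = control_node_execute()
print(result)
```

The point of the sketch is that the control node only merges small partial results, while the scan over the raw rows is split across the compute nodes.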
  • 30. Great performance with mixed workloads [Diagram labels: sources ERP, CRM, LOB apps, Hadoop / Big Data; loading via ETL/ELT with SSIS, DQS, MDS and ETL/ELT with DWLoader (intra-day, near real-time); Analytics Platform System with PDW, PolyBase, HDInsight; consumers SQL Server SMP, reporting and cubes, BI tools, ad hoc queries (fast ad hoc); connections Columnstore, PolyBase, CRTAS, Link Table, Real-Time ROLAP / MOLAP, DirectQuery, SNAC]
  • 31. Microsoft Analytics Platform System The turnkey modern data warehouse appliance
  • 32. High performance using commodity hardware: price per terabyte for leading vendors. Significantly lower price per terabyte than the closest competitor. Lower storage costs with Windows Server 2012 Storage Spaces. [Chart: price per terabyte for user-available storage (compressed), in thousands of dollars on an axis from $0 to $30, for Oracle, EMC, IBM, Teradata, and Microsoft. NOTE: Orange line indicates average price per terabyte.]
  • 33. Hardware and software engineered together: the ease of an appliance • Co-engineered with HP, Dell, and Quanta best practices • Leading performance with commodity hardware • Integrated support plan with a single Microsoft PDW contact • Pre-configured, built, and tuned software and hardware [Diagram labels: PolyBase, HDInsight]
  • 34. Hardware architecture [Diagram labels: Rack #1, Rack #2; PDW region, HDInsight region; control node, master node, failover nodes; compute nodes with economical disk storage; HDI extension base unit, HDI active scale unit; networking over InfiniBand and Ethernet; hosts HST-01, HST-02, HSA-01] Active Unit: addition of two or three compute nodes, depending on OEM hardware configuration, and related storage. Passive Unit: host for non-worker HDInsight nodes. Failover Node: high availability for the rack.
  • 35. Software details • All hosts run Windows Server 2012 Standard and Windows Azure Virtual Machines • Fabric or workload in Hyper-V Virtual Machines • Fabric virtual machine, management server (MAD01), and control server (CTL) share one server • PDW agent that runs on all hosts and all virtual machines • DWConfig and Admin Console • Windows Storage Spaces and Azure Storage blobs [Diagram labels: Base Unit with Host 1 to Host 4; virtual machines CTL, MAD, AD, VMM, Compute 1, Compute 2; economical disk storage; IB and Ethernet; direct attached SAS; software components PDW engine, DMS Manager, SQL Server 2012 Enterprise Edition (PDW build)]
  • 36. Virtual machine migration can be used to move workload nodes to new hosts after hardware failure. Cluster Shared Volumes • Enable all nodes to access logical unit numbers (LUNs) on economical disk storage • Use Server Message Block (SMB3) protocol Failover capabilities • Uses one cluster across the whole appliance • Automatically migrates virtual machines on host failure • Enforces rules with affinity and anti-affinity maps • Uses Windows Failover Cluster Manager [Diagram labels: Base Unit with Host 1 to Host 4; Passive Unit with Host 5; virtual machines CTL, MAD, AD, VMM, FAB, Compute 1, Compute 2; economical disk storage; IB and Ethernet; direct attached SAS]