Hadoop
Henk van der Valk – Technical Sales Professional
Jan Pieter Posthuma – Microsoft BI Consultant
7/11/2013
JOIN THE PASS COMMUNITY
Become a PASS member for free and join the world's biggest SQL Server Community.
• Access to online training content
• Access to events at discounted rates
• Join Local Chapters
• Join Virtual Chapters
• Personalize your PASS website experience

2
Agenda
• Introduction
• Hadoop
• HDFS
• Data access to HDFS
• Map/Reduce
• Hive
• Data access from HDFS
• SQL PDW PolyBase
• Wrap up

3
Introduction Henk
• 10 years of Unisys-EMEA Performance Center
• 2002 – Largest SQL DWH in the world (SQL 2000)
• Project Real (SQL 2005)
• ETL WR – loading 1TB within 30 mins (SQL 2008)
• Contributed to various SQL whitepapers
• Schuberg Philis – 100% uptime for mission critical applications
• Since April 1st, 2011 – Microsoft SQL PDW, Western Europe
• SQLPass speaker & volunteer since 2005

4
Introduction

[Reference architecture diagram: Big Data sources (raw, unstructured) – sensors, devices, bots, crawlers, Azure Marketplace – and source systems (ERP, CRM, LOB apps) are loaded into HDInsight on Windows Azure / HDInsight on Windows Server and into SQL Server Parallel Data Warehouse (historical data beyond the active window). Data is summarized & loaded, then integrated/enriched with enterprise ETL (SSIS, DQS, MDS) into SQL Server FTDW data marts and SQL Server Analysis Services, and surfaced as business insights through SQL Server Reporting Services, interactive reports, performance scorecards, and SQL Server StreamInsight alerts/notifications for data- & compute-intensive applications.]

5
Introduction Jan Pieter Posthuma
• Technical Lead Microsoft BI and Big Data consultant
• Inter Access, local consultancy firm in the Netherlands
• Architect role at multiple projects
• Analysis Services, Reporting Services, PerformancePoint Services, Big Data, HDInsight, Cloud BI
http://twitter.com/jppp
http://linkedin.com/jpposthuma
jan.pieter.posthuma@interaccess.nl

6
Hadoop
• Hadoop is a collection of software to create a data-intensive distributed cluster running on commodity hardware.
• Original idea by Google (2003).
• Widely accepted by database vendors as a solution for unstructured data
• Microsoft partners with HortonWorks and delivers their HortonWorks Data Platform as Microsoft HDInsight
• Available as an Azure service and on premise
• HortonWorks Data Platform (HDP) is 100% Open Source!

7
Hadoop

[Component diagram: HDFS at the base, Map/Reduce and HBase on top, Hive & Pig and Sqoop / PolyBase above, with Avro (serialization) and Zookeeper alongside, connecting out to BI, ETL, RDBMS and reporting tools.]

• HDFS – distributed, fault-tolerant file system
• MapReduce – framework for writing/executing distributed, fault-tolerant algorithms
• Hive & Pig – SQL-like declarative languages
• Sqoop / PolyBase – packages for moving data between HDFS and relational DB systems
• + Others…
• Hadoop 2.0

8
HDFS
Large File

[Illustration: a 6,440 MB file, shown as a long bit stream, is color-coded and split into blocks. With a block size of 64 MB this gives blocks 1 through 100 of 64 MB each, plus block 101 holding the remaining 40 MB.]

Files are composed of a set of blocks
• Typically 64 MB in size
• Each block is stored as a separate file in the local file system (e.g. NTFS)

9
HDFS

HDFS was designed with the expectation that failures (both hardware and software) would occur frequently.

[Architecture diagram: a NameNode, a BackupNode holding namespace backups, and multiple DataNodes that write to local disk and communicate with the NameNode (heartbeat, balancing, replication, etc.).]

Hadoop 2.0 is more decentralized
• Interaction between DataNodes
• Less dependent on the primary NameNode
Data access to HDFS
• FTP – Upload your data files
• Streaming – Via AVRO (RPC) or Flume
• Hadoop command – hadoop fs -copyFromLocal
• Windows Azure BLOB storage – HDInsight Service (Azure) uses BLOB storage instead of local VM storage. Data can be uploaded without a provisioned Hadoop cluster
• PolyBase – Feature of PDW 2012. Direct read/write data access to the datanodes.

11
Data access
Hadoop command Demo

12
13
Map/Reduce
• MR: all functions in a batch-oriented architecture
• Map: Apply the logic to the data, e.g. a page-hit count.
• Reduce: Reduces (aggregates) the results of the Mappers to one result.
• YARN: splits the JobTracker into a Resource Manager and a Node Manager. MR in Hadoop 2.0 uses YARN as its JobTracker.

14
Map/Reduce
Total page hits

15
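For comparison with the Hive slides that follow: the total-page-hits result of the Map/Reduce demo is the same aggregation a single HiveQL statement expresses. A minimal sketch, assuming a w3c_log table over the demo's IIS log files with a cs_uri_stem column (both names are assumptions):

-- Hypothetical HiveQL equivalent of the page-hit-count Map/Reduce demo;
-- Hive compiles the GROUP BY into map and reduce stages automatically.
SELECT cs_uri_stem AS page,
       COUNT(*)    AS total_hits
FROM w3c_log
GROUP BY cs_uri_stem
ORDER BY total_hits DESC;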
Hive
• Built for easy data retrieval
• Uses Map/Reduce
• Created by Facebook
• HiveQL: SQL-like language
• Stores data in tables, which are stored as HDFS file(s)
• Only initial INSERT supported, no UPDATE or DELETE
• External tables possible on existing (CSV) file(s) – see the sketch after this slide
• Extra language options to use the benefits of Hadoop
• Stinger initiative: Phase 1 (0.11) and Phase 2 (0.12). Improve Hive 100x

16
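A minimal sketch of the external-table option above, assuming CSV files already sitting in an HDFS folder; the path and column names are illustrative assumptions, and dropping the table removes only the Hive metadata, not the files:

-- Hive external table over existing CSV files in HDFS (names/path assumed).
CREATE EXTERNAL TABLE page_hits_csv (
    log_date STRING,
    url      STRING,
    hits     INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/pagehits/';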
Hive
Star schema join – (Based on TPC-DS Query 27)

SELECT
    col5, avg(col6)
FROM
    store_sales_fact ssf                                          -- 41 GB
    join item_dim on (ssf.col1 = item_dim.col1)                   -- 58 MB
    join date_dim on (ssf.col2 = date_dim.col2)                   -- 11 MB
    join custdmgrphcs_dim on (ssf.col3 = custdmgrphcs_dim.col3)   -- 80 MB
    join store_dim on (ssf.col4 = store_dim.col4)                 -- 106 KB
GROUP BY col5
ORDER BY col5
LIMIT 100;

Cluster: 6 Nodes (2 Name, 4 Compute – dual core, 14GB)

17
Hive

File Type                                   # MR jobs   Input Size   # Mappers   Time
Text / Hive 0.10                            5           43.1 GB      179         21:00 min
Text / Hive 0.11                            1           38.0 GB      151         4:06 min
RC / Hive 0.11                              1           8.21 GB      76          2:16 min
ORC / Hive 0.11                             1           2.83 GB      38          1:44 min
RC / Hive 0.11 / Partitioned / Bucketed     1           1.73 GB      19          1:44 min
ORC / Hive 0.11 / Partitioned / Bucketed    1           687 MB       27          1:19 min

Data: ~64x less data
Time: ~16x faster

18
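A sketch of the kind of DDL behind the faster rows above (Hive 0.11+). The partition and bucket choices are assumptions for illustration, not the exact setup used for these measurements:

-- ORC storage plus partitioning and bucketing (compare the last table rows).
CREATE TABLE store_sales_fact_orc (
    col1 INT,      -- item key
    col3 INT,      -- customer demographics key
    col4 INT,      -- store key
    col5 STRING,
    col6 DOUBLE
)
PARTITIONED BY (col2 INT)              -- assumed: date key as partition column
CLUSTERED BY (col1) INTO 32 BUCKETS    -- assumed bucket column and count
STORED AS ORC;

-- Populate from the plain-text table; the partition column goes last in the SELECT.
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.enforce.bucketing=true;
INSERT INTO TABLE store_sales_fact_orc PARTITION (col2)
SELECT col1, col3, col4, col5, col6, col2 FROM store_sales_fact;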
Data access from Hadoop
• Excel
• FTP
• Hadoop command – hadoop fs -copyToLocal
• ODBC[1] – Via Hive (HiveQL) data can be extracted.
• Power Query[2] – Capable of extracting data directly from HDFS or Azure BLOB storage
• PolyBase – Feature of PDW 2012. Direct read/write data access to the datanodes.

[1] http://www.microsoft.com/en-us/download/details.aspx?id=40886
[2] Power BI Excel add-in – http://www.powerbi.com

19
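Whichever client sits on top (Excel via the Hive ODBC driver, Power Query, a reporting tool), the query that Hive ultimately executes is ordinary HiveQL. A minimal sketch against the hypothetical page_hits_csv table from the Hive slide:

-- Example of a query a Hive ODBC client might submit (table name assumed).
SELECT url, SUM(hits) AS total_hits
FROM page_hits_csv
GROUP BY url;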
Data access
Excel 2013 Demo

20
21
PDW – PolyBase

[Diagram: a PDW appliance ("This is PDW!") with multiple SQL Server compute nodes connected directly to the DataNodes (DN) of a Hadoop cluster via PolyBase, with Sqoop shown as the alternative transfer path.]

22
PDW – External Tables
• An external table is PDW's representation of data residing in HDFS
• The "table" (metadata) lives in the context of a SQL Server database
• The actual table data resides in HDFS
• No support for DML operations
• No concurrency control or isolation level guarantees

CREATE EXTERNAL TABLE table_name ({<column_definition>} [,...n ])
{WITH (LOCATION = '<URI>', [FORMAT_OPTIONS = (<VALUES>)])}
[;]

LOCATION: required to indicate the location of the Hadoop cluster
FORMAT_OPTIONS: optional format options associated with parsing of data from HDFS (e.g. field delimiters & reject-related thresholds)

23
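The use-case slide that follows queries a ClickStream external table whose definition is never shown; a sketch of what it might look like, with column types, HDFS URI and format options as assumptions (FIELD_TERMINATOR stands in for the field-delimiter option mentioned above):

-- Hypothetical definition of the ClickStream external table used on the next slide.
CREATE EXTERNAL TABLE ClickStream (
    URL       VARCHAR(500),
    EventDate DATE,
    UserID    INT
)
WITH (
    LOCATION = 'hdfs://MyHadoop:5000/clickstream/',   -- assumed HDFS URI
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|')           -- assumed delimiter option
);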
PDW – Hadoop use cases & examples

[1] Retrieve data from HDFS with a PDW query
• Seamlessly join structured and semi-structured data
SELECT Username FROM ClickStream c, User u WHERE c.UserID = u.ID
AND c.URL = 'www.bing.com';

[2] Import data from HDFS to PDW
• Parallelized CREATE TABLE AS SELECT (CTAS)
• External tables as the source
• PDW table, either replicated or distributed, as the destination
CREATE TABLE ClickStreamInPDW WITH (DISTRIBUTION = HASH(URL))
AS SELECT URL, EventDate, UserID FROM ClickStream;

[3] Export data from PDW to HDFS
• Parallelized CREATE EXTERNAL TABLE AS SELECT (CETAS)
• External table as the destination; creates a set of HDFS files
CREATE EXTERNAL TABLE ClickStream2 (URL, EventDate, UserID)
WITH (LOCATION = 'hdfs://MyHadoop:5000/joe', FORMAT_OPTIONS (...))
AS SELECT URL, EventDate, UserID FROM ClickStreamInPDW;
SQL Server 2012
PDW
Polybase demo

25
Wrap up
Hadoop: 'just another data source' @ your fingertips!
• Batch processing large datasets before loading into your DWH
• Offloading DWH data, but keeping it accessible for analysis/reporting

Integrate Hadoop via Sqoop, ODBC (Hive) or PolyBase
Near future: deeper integration between Hadoop and SQL PDW

Try Hadoop / HDInsight yourself:
• Azure: http://www.windowsazure.com/en-us/pricing/free-trial/
• Web PI: http://www.microsoft.com/web/downloads/platform.aspx

26
Q&A

27
References
Microsoft Big Data
http://www.microsoft.com/bigdata
Windows Azure HDInsight Service (3 months free trial)
http://www.windowsazure.com/en-us/services/hdinsight/
SQL Server Parallel Data Warehouse (PDW) Landing Page
http://www.microsoft.com/PDW
http://www.upgradetopdw.com
Introduction to PolyBase
http://www.microsoft.com/en-us/sqlserver/solutionstechnologies/data-warehousing/polybase.aspx

28
Thanks!

29


Editor's Notes

• #12: DEMO: upload a local file with hadoop fs -copyFromLocal
• #13: Hadoop command, CloudXplorer
• #15: DEMO: total hit count on W3C logs
• #16: TotalHits MR job
• #17: 'Of the 150k jobs Facebook runs daily, only 500 are MapReduce jobs. The rest is HiveQL.'
• #19: Hive <0.11 stores data in plain text files and has no join optimization; a typical DWH query (star schema join) results in 6 MR jobs. Hive 0.11 introduces (O)RC files, loosely based on column store indexes, plus join optimization; a typical DWH query results in 1 MR job. Hive 0.12 uses YARN and Tez, optimized for DWH queries with less overhead than MR.
• #20: DEMO: retrieving data via ODBC and Power Query in Excel
• #21: Via Excel Data Explorer (Azure BLOB storage)