Hadoop
Henk van der Valk – Technical Sales Professional
Jan Pieter Posthuma – Microsoft BI Consultant
7/11/2013
JOIN THE PASS COMMUNITY
Become a PASS member for free and join the world's biggest SQL Server Community.
• Access to online training content
• Access to events at discounted rates
• Join Local Chapters
• Join Virtual Chapters
• Personalize your PASS website experience

2
Agenda
• Introduction
• Hadoop
• HDFS
• Data access to HDFS
• Map/Reduce
• Hive
• Data access from HDFS
• SQL PDW PolyBase
• Wrap up

3
Introduction Henk
• 10 years of Unisys-EMEA Performance Center
• 2002 – Largest SQL DWH in the world (SQL 2000)
• Project Real (SQL 2005)
• ETL WR – loading 1TB within 30 mins (SQL 2008)
• Contributed to various SQL whitepapers
• Schuberg Philis – 100% uptime for mission critical applications
• Since April 1st, 2011 – Microsoft SQL PDW, Western Europe
• SQLPass speaker & volunteer since 2005

4
Introduction

[Reference architecture diagram: Big Data sources (raw, unstructured) – sensors, devices, bots, crawlers, Azure Marketplace – and source systems (ERP, CRM, LOB apps) are loaded into HDInsight on Windows Azure / HDInsight on Windows Server and into SQL Server Parallel Data Warehouse (historical data beyond the active window). Data is summarized & loaded, then integrated/enriched with enterprise ETL (SSIS, DQS, MDS) into SQL Server FTDW data marts and SQL Server Analysis Services, and surfaced as business insights through SQL Server Reporting Services, interactive reports, performance scorecards, and SQL Server StreamInsight alerts/notifications for data- & compute-intensive applications.]

5
Introduction Jan Pieter Posthuma
• Technical Lead Microsoft BI and Big Data consultant
• Inter Access, local consultancy firm in the Netherlands
• Architect role at multiple projects
• Analysis Services, Reporting Services, PerformancePoint Services, Big Data, HDInsight, Cloud BI
http://twitter.com/jppp
http://linkedin.com/jpposthuma
jan.pieter.posthuma@interaccess.nl

6
Hadoop
• Hadoop is a collection of software to create a data-intensive distributed cluster running on commodity hardware.
• Original idea by Google (2003).
• Widely accepted by database vendors as a solution for unstructured data
• Microsoft partners with HortonWorks and delivers their HortonWorks Data Platform as Microsoft HDInsight
• Available as an Azure service and on premise
• HortonWorks Data Platform (HDP) is 100% Open Source!

7
Hadoop

[Component diagram: HDFS at the base, Map/Reduce and HBase on top, Hive & Pig and Sqoop / PolyBase above, with Avro (serialization) and Zookeeper alongside, connecting out to BI, ETL, RDBMS and reporting tools.]

• HDFS – distributed, fault-tolerant file system
• MapReduce – framework for writing/executing distributed, fault-tolerant algorithms
• Hive & Pig – SQL-like declarative languages
• Sqoop / PolyBase – packages for moving data between HDFS and relational DB systems
• + Others…
• Hadoop 2.0

8
HDFS
Large File

[Illustration: a 6,440 MB file, shown as a long bit stream, is color-coded and split into blocks. With a block size of 64 MB this gives blocks 1 through 100 of 64 MB each, plus block 101 holding the remaining 40 MB.]

Files are composed of a set of blocks
• Typically 64 MB in size
• Each block is stored as a separate file in the local file system (e.g. NTFS)

9
HDFS

HDFS was designed with the expectation that failures (both hardware and software) would occur frequently.

[Architecture diagram: a NameNode, a BackupNode holding namespace backups, and multiple DataNodes that write to local disk and communicate with the NameNode (heartbeat, balancing, replication, etc.).]

Hadoop 2.0 is more decentralized
• Interaction between DataNodes
• Less dependent on the primary NameNode
Data access to HDFS
• FTP – Upload your data files
• Streaming – Via AVRO (RPC) or Flume
• Hadoop command – hadoop fs -copyFromLocal
• Windows Azure BLOB storage – HDInsight Service (Azure) uses BLOB storage instead of local VM storage. Data can be uploaded without a provisioned Hadoop cluster
• PolyBase – Feature of PDW 2012. Direct read/write data access to the datanodes.

11
Data access
Hadoop command Demo

12
13
Map/Reduce
• MR: all functions in a batch-oriented architecture
• Map: Apply the logic to the data, e.g. a page-hit count.
• Reduce: Reduces (aggregates) the results of the Mappers to one result.
• YARN: splits the JobTracker into a Resource Manager and a Node Manager. MR in Hadoop 2.0 uses YARN as its JobTracker.

14
Map/Reduce
Total page hits

15
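For comparison with the Hive slides that follow: the total-page-hits result of the Map/Reduce demo is the same aggregation a single HiveQL statement expresses. A minimal sketch, assuming a w3c_log table over the demo's IIS log files with a cs_uri_stem column (both names are assumptions):

-- Hypothetical HiveQL equivalent of the page-hit-count Map/Reduce demo;
-- Hive compiles the GROUP BY into map and reduce stages automatically.
SELECT cs_uri_stem AS page,
       COUNT(*)    AS total_hits
FROM w3c_log
GROUP BY cs_uri_stem
ORDER BY total_hits DESC;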
Hive
• Built for easy data retrieval
• Uses Map/Reduce
• Created by Facebook
• HiveQL: SQL-like language
• Stores data in tables, which are stored as HDFS file(s)
• Only initial INSERT supported, no UPDATE or DELETE
• External tables possible on existing (CSV) file(s) – see the sketch after this slide
• Extra language options to use the benefits of Hadoop
• Stinger initiative: Phase 1 (0.11) and Phase 2 (0.12). Improve Hive 100x

16
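A minimal sketch of the external-table option above, assuming CSV files already sitting in an HDFS folder; the path and column names are illustrative assumptions, and dropping the table removes only the Hive metadata, not the files:

-- Hive external table over existing CSV files in HDFS (names/path assumed).
CREATE EXTERNAL TABLE page_hits_csv (
    log_date STRING,
    url      STRING,
    hits     INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/pagehits/';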
Hive
Star schema join – (Based on TPC-DS Query 27)

SELECT
    col5, avg(col6)
FROM
    store_sales_fact ssf                                          -- 41 GB
    join item_dim on (ssf.col1 = item_dim.col1)                   -- 58 MB
    join date_dim on (ssf.col2 = date_dim.col2)                   -- 11 MB
    join custdmgrphcs_dim on (ssf.col3 = custdmgrphcs_dim.col3)   -- 80 MB
    join store_dim on (ssf.col4 = store_dim.col4)                 -- 106 KB
GROUP BY col5
ORDER BY col5
LIMIT 100;

Cluster: 6 Nodes (2 Name, 4 Compute – dual core, 14GB)

17
Hive

File Type                                   # MR jobs   Input Size   # Mappers   Time
Text / Hive 0.10                            5           43.1 GB      179         21:00 min
Text / Hive 0.11                            1           38.0 GB      151         4:06 min
RC / Hive 0.11                              1           8.21 GB      76          2:16 min
ORC / Hive 0.11                             1           2.83 GB      38          1:44 min
RC / Hive 0.11 / Partitioned / Bucketed     1           1.73 GB      19          1:44 min
ORC / Hive 0.11 / Partitioned / Bucketed    1           687 MB       27          1:19 min

Data: ~64x less data
Time: ~16x faster

18
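A sketch of the kind of DDL behind the faster rows above (Hive 0.11+). The partition and bucket choices are assumptions for illustration, not the exact setup used for these measurements:

-- ORC storage plus partitioning and bucketing (compare the last table rows).
CREATE TABLE store_sales_fact_orc (
    col1 INT,      -- item key
    col3 INT,      -- customer demographics key
    col4 INT,      -- store key
    col5 STRING,
    col6 DOUBLE
)
PARTITIONED BY (col2 INT)              -- assumed: date key as partition column
CLUSTERED BY (col1) INTO 32 BUCKETS    -- assumed bucket column and count
STORED AS ORC;

-- Populate from the plain-text table; the partition column goes last in the SELECT.
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.enforce.bucketing=true;
INSERT INTO TABLE store_sales_fact_orc PARTITION (col2)
SELECT col1, col3, col4, col5, col6, col2 FROM store_sales_fact;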
Data access from Hadoop
• Excel
• FTP
• Hadoop command – hadoop fs -copyToLocal
• ODBC[1] – Via Hive (HiveQL) data can be extracted.
• Power Query[2] – Capable of extracting data directly from HDFS or Azure BLOB storage
• PolyBase – Feature of PDW 2012. Direct read/write data access to the datanodes.

[1] http://www.microsoft.com/en-us/download/details.aspx?id=40886
[2] Power BI Excel add-in – http://www.powerbi.com

19
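Whichever client sits on top (Excel via the Hive ODBC driver, Power Query, a reporting tool), the query that Hive ultimately executes is ordinary HiveQL. A minimal sketch against the hypothetical page_hits_csv table from the Hive slide:

-- Example of a query a Hive ODBC client might submit (table name assumed).
SELECT url, SUM(hits) AS total_hits
FROM page_hits_csv
GROUP BY url;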
Data access
Excel 2013 Demo

20
21
PDW – PolyBase

[Diagram: a PDW appliance ("This is PDW!") with multiple SQL Server compute nodes connected directly to the DataNodes (DN) of a Hadoop cluster via PolyBase, with Sqoop shown as the alternative transfer path.]

22
PDW – External Tables
• An external table is PDW's representation of data residing in HDFS
• The "table" (metadata) lives in the context of a SQL Server database
• The actual table data resides in HDFS
• No support for DML operations
• No concurrency control or isolation level guarantees

CREATE EXTERNAL TABLE table_name ({<column_definition>} [,...n ])
{WITH (LOCATION = '<URI>', [FORMAT_OPTIONS = (<VALUES>)])}
[;]

LOCATION: required to indicate the location of the Hadoop cluster
FORMAT_OPTIONS: optional format options associated with parsing of data from HDFS (e.g. field delimiters & reject-related thresholds)

23
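The use-case slide that follows queries a ClickStream external table whose definition is never shown; a sketch of what it might look like, with column types, HDFS URI and format options as assumptions (FIELD_TERMINATOR stands in for the field-delimiter option mentioned above):

-- Hypothetical definition of the ClickStream external table used on the next slide.
CREATE EXTERNAL TABLE ClickStream (
    URL       VARCHAR(500),
    EventDate DATE,
    UserID    INT
)
WITH (
    LOCATION = 'hdfs://MyHadoop:5000/clickstream/',   -- assumed HDFS URI
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|')           -- assumed delimiter option
);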
PDW – Hadoop use cases & examples

[1] Retrieve data from HDFS with a PDW query
• Seamlessly join structured and semi-structured data
SELECT Username FROM ClickStream c, User u WHERE c.UserID = u.ID
AND c.URL = 'www.bing.com';

[2] Import data from HDFS to PDW
• Parallelized CREATE TABLE AS SELECT (CTAS)
• External tables as the source
• PDW table, either replicated or distributed, as the destination
CREATE TABLE ClickStreamInPDW WITH (DISTRIBUTION = HASH(URL))
AS SELECT URL, EventDate, UserID FROM ClickStream;

[3] Export data from PDW to HDFS
• Parallelized CREATE EXTERNAL TABLE AS SELECT (CETAS)
• External table as the destination; creates a set of HDFS files
CREATE EXTERNAL TABLE ClickStream2 (URL, EventDate, UserID)
WITH (LOCATION = 'hdfs://MyHadoop:5000/joe', FORMAT_OPTIONS (...))
AS SELECT URL, EventDate, UserID FROM ClickStreamInPDW;
SQL Server 2012
PDW
Polybase demo

25
Wrap up
Hadoop: 'just another data source' @ your fingertips!
• Batch processing large datasets before loading into your DWH
• Offloading DWH data, but keeping it accessible for analysis/reporting

Integrate Hadoop via Sqoop, ODBC (Hive) or PolyBase
Near future: deeper integration between Hadoop and SQL PDW

Try Hadoop / HDInsight yourself:
• Azure: http://www.windowsazure.com/en-us/pricing/free-trial/
• Web PI: http://www.microsoft.com/web/downloads/platform.aspx

26
Q&A

27
References
Microsoft Big Data
http://www.microsoft.com/bigdata
Windows Azure HDInsight Service (3 months free trial)
http://www.windowsazure.com/en-us/services/hdinsight/
SQL Server Parallel Data Warehouse (PDW) Landing Page
http://www.microsoft.com/PDW
http://www.upgradetopdw.com
Introduction to PolyBase
http://www.microsoft.com/en-us/sqlserver/solutionstechnologies/data-warehousing/polybase.aspx

28
Thanks!

29


Editor's Notes

• #12: DEMO: upload a local file with hadoop fs -copyFromLocal
• #13: Hadoop command, CloudXplorer
• #15: DEMO: total hit count on W3C logs
• #16: TotalHits MR job
• #17: 'Of the 150k jobs Facebook runs daily, only 500 are MapReduce jobs. The rest is HiveQL.'
• #19: Hive <0.11 stores data in plain text files and has no join optimization; a typical DWH query (star schema join) results in 6 MR jobs. Hive 0.11 introduces (O)RC files, loosely based on column store indexes, plus join optimization; a typical DWH query results in 1 MR job. Hive 0.12 uses YARN and Tez, optimized for DWH queries with less overhead than MR.
• #20: DEMO: retrieving data via ODBC and Power Query in Excel
• #21: Via Excel Data Explorer (Azure BLOB storage)