Big Data in the Real World
Orlando PASS
October 2013

http://www.pssug.org
Mark Kromer
http://www.kromerbigdata.com
@kromerbigdata
@mssqldude
What we’ll (try to) cover today

‣ What is Big Data?
‣ The Big Data and Apache Hadoop environment
‣ Big Data Analytics
‣ SQL Server in the Big Data world
‣ Microsoft + Hortonworks (Yahoo!) = HDInsight

Big Data 101
‣ 3 V’s
‣ Volume – Terabytes of records, transactions, tables, files
‣ Velocity – Batch, near-real-time, real-time (analytics), streams
‣ Variety – Structured, unstructured, semi-structured, and all of the above in a mix

‣ Text Processing
‣ Techniques for processing and analyzing unstructured (and structured) LARGE files

‣ Analytics & Insights
‣ Distributed File System & Programming
Mark’s Big Data Myths
‣ Big Data ≠ NoSQL
‣ NoSQL shares the Internet-scale Web origins of the Hadoop stack (Yahoo!, Google, Facebook, et al.) but is not the same thing

‣ Facebook, for example, uses HBase from the Hadoop stack
‣ NoSQL does not have to be Big Data

‣ Big Data ≠ Real Time
‣ Big Data is primarily about batch processing huge files in a distributed manner
and analyzing data that was otherwise too complex to provide value

‣ Use in-memory analytics for real time insights

‣ Big Data ≠ Data Warehouse
‣ I still refer to large multi-TB DWs as “VLDB”
‣ Big Data is about crunching stats in text files for discovery of new patterns and
insights

‣ Use the DW to aggregate and store the summaries of those calculations for
reporting
‣ Batch Processing
‣ Commodity Hardware
‣ Data Locality, no shared storage
‣ Scales linearly
‣ Great for large text file processing, not so great on small files
‣ Distributed programming paradigm
Popular Hadoop Distributions
Hosted PaaS Hadoop platforms: Amazon
EMR, Pivotal, Microsoft Hadoop on Azure
Popular NoSQL Distributions
Transactional-based, not analytics schemas
Popular MPP Distributions
Big Data as distributed, scale-out, sharded data stores
Big Data Analytics Web Platform - Example
MapReduce Framework (Map)
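using System.Text.RegularExpressions;
using Microsoft.Hadoop.MapReduce;

// expected (field count), pagePos (index of the page field), and hit (the value
// emitted per record) are defined elsewhere; the slide does not show them.
public class TotalHitsForPageMap : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        context.Log(inputLine);
        // Split the log line on runs of whitespace.
        var parts = Regex.Split(inputLine, @"\s+");
        if (parts.Length != expected) // only take records with all values
        {
            return;
        }
        // Emit one (page, hit) pair per valid record.
        context.EmitKeyValue(parts[pagePos], hit);
    }
}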
MapReduce Framework (Reduce & Job)
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Hadoop.MapReduce;

// Sums the hit counts emitted for each page key.
public class TotalHitsForPageReducerCombiner : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
    {
        context.EmitKeyValue(key, values.Sum(e => long.Parse(e)).ToString());
    }
}

// Wires the mapper and reducer together and reads the input/output paths
// from the W3C_INPUT and W3C_OUTPUT environment variables.
public class TotalHitsJob : HadoopJob<TotalHitsForPageMap, TotalHitsForPageReducerCombiner>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        var retVal = new HadoopJobConfiguration();
        retVal.InputPath = Environment.GetEnvironmentVariable("W3C_INPUT");
        retVal.OutputFolder = Environment.GetEnvironmentVariable("W3C_OUTPUT");
        retVal.DeleteOutputFolder = true;
        return retVal;
    }
}
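To make the map, shuffle, and reduce steps easier to follow, here is a minimal single-machine sketch in Python of the same hit-counting idea. It is not the .NET SDK and does not run on Hadoop. The field count, the page position, and the sample log lines are assumptions made up for illustration; the slides do not show the real values.

```python
import re
from collections import defaultdict

# Hypothetical log layout: these values are NOT shown on the slides.
EXPECTED_FIELDS = 5   # assumed number of whitespace-separated fields per line
PAGE_POS = 2          # assumed index of the page field
HIT = "1"             # each valid line counts as one hit


def map_line(line):
    """Map step: emit (page, "1") for well-formed lines, nothing otherwise."""
    parts = re.split(r"\s+", line.strip())
    if len(parts) != EXPECTED_FIELDS:
        return []
    return [(parts[PAGE_POS], HIT)]


def reduce_hits(key, values):
    """Reduce step: sum the string counts collected for one page."""
    return key, str(sum(int(v) for v in values))


def run_job(lines):
    """Group mapper output by key (the shuffle), then reduce each group."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            grouped[key].append(value)
    return dict(reduce_hits(k, v) for k, v in sorted(grouped.items()))


# Made-up sample input for illustration only.
sample_log = [
    "2013-05-07 12:11:00 /index.html 200 1024",
    "2013-05-07 12:11:01 /about.html 200 512",
    "2013-05-07 12:11:02 /index.html 200 1024",
    "malformed line",
]
totals = run_job(sample_log)
print(totals)
```

On a real cluster the framework performs the grouping across nodes; here a dictionary plays that role so the data flow is visible in one place.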
Get Data into Hadoop
‣ Linux-style shell commands to access data in HDFS
‣ Put file in HDFS: hadoop fs -put sales.csv /import/sales.csv
‣ List files in HDFS:
c:\Hadoop>hadoop fs -ls /import
Found 1 items
-rw-r--r-- 1 makromer supergroup 114 2013-05-07 12:11 /import/sales.csv

‣ View file in HDFS:
c:\Hadoop>hadoop fs -cat /import/sales.csv
Kromer,123,5,55
Smith,567,1,25
Jones,123,9,99
James,11,12,1
Johnson,456,2,2.5
Singh,456,1,3.25
Yu,123,1,11

‣ Now, we can work on the data with MapReduce, Hive, Pig, etc.
Use Hive for Data Schema and Analysis
create external table ext_sales
(
    lastname string,
    productid int,
    quantity int,
    sales_amount float
)
row format delimited fields terminated by ','
stored as textfile
location '/user/makromer/hiveext/input';

LOAD DATA INPATH '/user/makromer/import/sales.csv' OVERWRITE INTO TABLE ext_sales;
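The external table above puts a schema over a plain text file at read time. As a rough illustration of that idea outside Hive, the Python sketch below applies the same column names and types to the sample sales.csv rows and then totals sales per product. The aggregation is a hypothetical example query; the deck does not show which queries were run.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the sales.csv listing in the deck.
SALES_CSV = """Kromer,123,5,55
Smith,567,1,25
Jones,123,9,99
James,11,12,1
Johnson,456,2,2.5
Singh,456,1,3.25
Yu,123,1,11
"""

# Column names and types taken from the ext_sales table definition.
SCHEMA = [("lastname", str), ("productid", int),
          ("quantity", int), ("sales_amount", float)]


def read_with_schema(text):
    """Apply the schema while reading; the file itself stays plain text."""
    for row in csv.reader(io.StringIO(text)):
        yield {name: cast(value) for (name, cast), value in zip(SCHEMA, row)}


def sales_by_product(rows):
    """Hypothetical query: total sales_amount grouped by productid."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["productid"]] += row["sales_amount"]
    return dict(totals)


totals = sales_by_product(read_with_schema(SALES_CSV))
print(totals)
```

In Hive the equivalent would be a GROUP BY over ext_sales, executed as distributed jobs rather than a local loop.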
Sqoop
Data transfer to & from Hadoop & SQL Server
‣ sqoop import --connect jdbc:sqlserver://localhost --username sqoop --password password --table customers -m 1

‣ > hadoop fs -cat /user/mark/customers/part-m-00000
‣ > 5,Bob Smith
‣ sqoop export --connect jdbc:sqlserver://localhost --username sqoop --password password -m 1 --table customers --export-dir /user/mark/data/employees3

‣ 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Transferred 201 bytes in 32.6364 seconds (6.1588 bytes/sec)

‣ 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Exported 4 records.
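When these transfers are scripted, it can help to assemble the Sqoop arguments in one place. The Python sketch below only builds the argument list and does not run Sqoop. The helper name sqoop_command is hypothetical, and it swaps the slide's inline password for Sqoop's -P option (prompt on the console), which is a deliberate change from the commands shown above.

```python
import shlex


def sqoop_command(action, table, connect, username, mappers=1, export_dir=None):
    """Build (but do not run) a sqoop import/export argument list."""
    if action not in ("import", "export"):
        raise ValueError("action must be 'import' or 'export'")
    args = [
        "sqoop", action,
        "--connect", connect,
        "--username", username,
        "-P",                      # prompt for the password instead of passing it inline
        "--table", table,
        "-m", str(mappers),        # number of parallel map tasks
    ]
    if action == "export":
        if export_dir is None:
            raise ValueError("export requires export_dir")
        args += ["--export-dir", export_dir]
    return args


import_args = sqoop_command("import", "customers",
                            "jdbc:sqlserver://localhost", "sqoop")
export_args = sqoop_command("export", "customers",
                            "jdbc:sqlserver://localhost", "sqoop",
                            export_dir="/user/mark/data/employees3")
print(shlex.join(import_args))
print(shlex.join(export_args))
```

Keeping the arguments as a list avoids shell-quoting mistakes if the command is later handed to a process launcher.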
SQL Server Big Data – Data Loading

[Diagram labels: Amazon S3 Bucket, Data Loading, Amazon HDFS & EMR]
Role of NoSQL in a Big Data Analytics Solution
‣ Use NoSQL to store data quickly without the overhead of RDBMS
‣ HBase, Plain Old HDFS, Cassandra, MongoDB, Dynamo, just to name a few

‣ Why NoSQL?
‣ In the world of “Big Data”
‣ “Schema later”
‣ Ignore ACID properties
‣ Drop data into key-value store quick & dirty
‣ Worry about query & read later

‣ Why NOT NoSQL?
‣ In the world of Big Data Analytics, you will need support from analytical tools with a
SQL, SAS, MR interface

‣ SQL Server and NoSQL
‣ Not a natural fit
‣ Use HDFS or your favorite NoSQL database
‣ Consider turning off SQL Server locking mechanisms
‣ Focus on writes, not reads (read uncommitted)
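The "schema later" idea above can be sketched in a few lines of Python. A plain dictionary stands in for the key-value store, and the record shapes and key names are made up for illustration; a real deployment would use one of the stores named on the slide.

```python
import json

# A plain dict stands in for a key-value store (assumption for illustration).
store = {}


def put(key, raw_value):
    """Write path: store the raw payload as-is, with no parsing or validation."""
    store[key] = raw_value


def read_as(key, fields):
    """Read path: apply a schema only now, keeping just the requested fields."""
    record = json.loads(store[key])
    return {name: record.get(name) for name in fields}


# Two differently shaped records are both accepted at write time.
put("evt:1", '{"user": "kromer", "page": "/index.html", "ms": 120}')
put("evt:2", '{"user": "smith", "page": "/about.html"}')

first = read_as("evt:1", ["user", "ms"])
second = read_as("evt:2", ["user", "ms"])
print(first)
print(second)
```

The write path stays cheap because nothing is checked; the cost shows up at read time, where missing fields have to be handled, which is the trade-off the "Why NOT NoSQL?" bullets point at.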
SQL Server Big Data Environment
‣ SQL Server Database
‣ SQL 2012 Enterprise Edition
‣ Page Compression
‣ 2012 Columnar Compression on Fact Tables
‣ Clustered Index on all tables
‣ Auto-update Stats Async
‣ Partition Fact Tables by month and archive data with sliding window technique
‣ Drop all indexes before nightly ETL load jobs
‣ Rebuild all indexes when ETL completes

‣ SQL Server Analysis Services
‣ SSAS 2012 Enterprise Edition
‣ 2008 R2 OLAP cubes partition-aligned with DW
‣ 2012 in-memory tabular cubes
‣ All access through MSMDPUMP or SharePoint
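The sliding window technique mentioned above keeps a fixed number of monthly partitions online and moves the oldest one out as each new month arrives. The Python sketch below models only that bookkeeping, with in-memory dictionaries standing in for partitions; the class name and the two-month window are assumptions for illustration, and in SQL Server the move itself is typically done with partition switching rather than by copying rows.

```python
from datetime import date


def month_key(d):
    """Partition key in YYYY-MM form, which sorts chronologically as a string."""
    return d.strftime("%Y-%m")


class SlidingWindowFactTable:
    """Toy model: keep the newest N monthly partitions, archive the rest."""

    def __init__(self, months_to_keep):
        self.months_to_keep = months_to_keep
        self.partitions = {}   # "YYYY-MM" -> list of rows kept online
        self.archive = {}      # "YYYY-MM" -> list of rows moved out

    def load(self, row_date, row):
        """Add a row to its month's partition, then slide the window if needed."""
        self.partitions.setdefault(month_key(row_date), []).append(row)
        self._slide()

    def _slide(self):
        while len(self.partitions) > self.months_to_keep:
            oldest = min(self.partitions)
            self.archive[oldest] = self.partitions.pop(oldest)


table = SlidingWindowFactTable(months_to_keep=2)
table.load(date(2013, 8, 15), ("Kromer", 55.0))
table.load(date(2013, 9, 3), ("Smith", 25.0))
table.load(date(2013, 10, 1), ("Jones", 99.0))
print(sorted(table.partitions), sorted(table.archive))
```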
SQL Server Big Data Analytics Features

‣ Columnstore
‣ Sqoop adapter
‣ PolyBase
‣ Hive
‣ In-memory analytics
‣ Scale-out MPP
Microsoft’s Data Solution – Big Data & PDW
[Diagram labels: Familiar End User Tools (Excel with PowerPivot, Predictive Analytics, Embedded BI, Data Market Place); BI Platform (SSRS, SSAS); Hundreds of TB of Data (structured); Petabytes of Data (unstructured); Hadoop on Windows Azure; Hadoop on Windows Server; Parallel Data Warehouse; Connectors; Data Market; Unstructured and Structured Data (Sensors, Devices, Bots, Crawlers, ERP, CRM, LOB, Apps)]
MICROSOFT BIG DATA
[Diagram labels: immersive data experiences (PowerPivot, Power View, Self-Service, Collaboration, Corporate Apps, Devices); connecting with the world's data (Combine, Discover, Refine); any data, any size, anywhere (Microsoft HDInsight Server, HDInsight Service, StreamInsight, Parallel Data Warehouse); Relational, Non-relational, Analytical, Streaming]
Microsoft .NET Hadoop APIs
‣ WebHDFS
‣ Linq to Hive
‣ MapReduce
‣ C#
‣ Java
‣ Hive
‣ Pig
‣ http://hadoopsdk.codeplex.com/
‣ SQL on Hadoop
‣ Cloudera Impala
‣ Teradata SQL-H
‣ Microsoft PolyBase
‣ Hadapt
Data Movement to the Cloud

‣ Use Windows Azure Blob Storage
• Already stored in 3 copies
• Hadoop can read from Azure blob storage
• Allows you to upload while using no Hadoop network or CPU resources

‣ Compress files
• Hadoop can read Gzip
• Uses less network resources than uncompressed
• Costs less for direct storage costs
• Compress directories where source files are created as well.
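As a small illustration of the compression point, the Python sketch below gzips a block of repetitive CSV text and compares sizes before and after. The data is one sample row from the deck repeated many times, which is an assumption chosen to make the effect visible; real savings depend on the actual files.

```python
import gzip

# One sample row repeated to stand in for a large, repetitive text file.
SAMPLE_ROW = b"Kromer,123,5,55\r\n"
raw = SAMPLE_ROW * 10_000


def compress_for_upload(data):
    """Gzip the payload before upload; Hadoop can read gzip-compressed text."""
    return gzip.compress(data)


compressed = compress_for_upload(raw)
restored = gzip.decompress(compressed)

print(len(raw), "bytes raw")
print(len(compressed), "bytes gzipped")
```

Fewer bytes means less to send over the network and less to pay for in storage, which is the reasoning behind the bullets above.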

Wrap-up
‣ What is a Big Data approach to Analytics?
‣ Massive scale
‣ Data discovery & research
‣ Self-service
‣ Reporting & BI
‣ Why do we take this Big Data Analytics approach?
‣ TBs of change data in each subject area
‣ The data in the sources are variable and unstructured
‣ SSIS ETL alone couldn’t keep up or handle complexity
‣ SQL Server 2012 columnstore and tabular SSAS 2012 are key to using SQL
Server for Big Data

‣ With the configs mentioned previously, SQL Server works great
‣ Analytics on Big Data also requires Big Data Analytics tools
‣ Aster, Tableau, PowerPivot, SAS, Parallel Data Warehouse

