Big Data with SQL Server
Philly Code Camp 2013.1
May 2013
http://www.pssug.org
Mark Kromer
http://www.kromerbigdata.com
@kromerbigdata
@mssqldude
makromer@microsoft.com
‣What is Big Data?
‣The Big Data and Apache Hadoop environment
‣Big Data Analytics
‣SQL Server in the Big Data world
‣Microsoft + Hortonworks (Yahoo!) = HDInsight
What we’ll (try to) cover today
Big Data 101
‣ 3 V’s
‣ Volume – Terabyte records, transactions, tables, files
‣ Velocity – Batch, near-time, real-time (analytics), streams.
‣ Variety – Structured, unstructured, semi-structured, and all of the above in a mix
‣ Text Processing
‣ Techniques for processing and analyzing unstructured (and structured) LARGE files
‣ Analytics & Insights
‣ Distributed File System & Programming
‣ Batch Processing
‣ Commodity Hardware
‣ Data Locality, no shared storage
‣ Scales linearly
‣ Great for large text file processing, not so great on small files
‣ Distributed programming paradigm
Popular Hadoop Distributions
Hosted PaaS Hadoop platforms: Amazon EMR, Pivotal, Microsoft Hadoop on Azure
‣ Big Data ≠ NoSQL
‣ NoSQL shares the Internet-scale Web origins of the Hadoop stack (Yahoo!, Google, Facebook, et al.), but it is not the same thing
‣ Facebook, for example, uses HBase from the Hadoop stack
‣ Big Data ≠ Real Time
‣ Big Data is primarily about batch processing huge files in a distributed manner
and analyzing data that was otherwise too complex to provide value
‣ Use in-memory analytics for real time insights
‣ Big Data ≠ Data Warehouse
‣ I still refer to large multi-TB DWs as “VLDB”
‣ Big Data is about crunching stats in text files for discovery of new patterns and
insights
‣ Use the DW to aggregate and store the summaries of those calculations for
reporting
Mark’s Big Data Myths
Big Data Analytics Web Platform - Example
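using System.Text.RegularExpressions;
using Microsoft.Hadoop.MapReduce;

public class TotalHitsForPageMap : MapperBase
{
    // Assumed values: the slide does not define these. Adjust them to your W3C log layout.
    private const int expected = 14;  // fields in a complete log record (assumption)
    private const int pagePos = 4;    // index of the requested-page field (assumption)
    private const string hit = "1";   // each complete record counts as one hit (assumption)

    public override void Map(string inputLine, MapperContext context)
    {
        context.Log(inputLine);
        // Split the log line on runs of whitespace.
        var parts = Regex.Split(inputLine, @"\s+");
        if (parts.Length != expected) // only take records with all values
        {
            return;
        }
        // Emit (page, "1") so the reducer can sum hits per page.
        context.EmitKeyValue(parts[pagePos], hit);
    }
}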
MapReduce Framework (Map)
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Hadoop.MapReduce;

public class TotalHitsForPageReducerCombiner : ReducerCombinerBase
{
    // Sums the per-page hit counts emitted by the mapper (or by an earlier combiner pass).
    public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
    {
        context.EmitKeyValue(key, values.Sum(e => long.Parse(e)).ToString());
    }
}

public class TotalHitsJob : HadoopJob<TotalHitsForPageMap, TotalHitsForPageReducerCombiner>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        // Input and output locations come from environment variables.
        var retVal = new HadoopJobConfiguration();
        retVal.InputPath = Environment.GetEnvironmentVariable("W3C_INPUT");
        retVal.OutputFolder = Environment.GetEnvironmentVariable("W3C_OUTPUT");
        retVal.DeleteOutputFolder = true; // clear results from a previous run
        return retVal;
    }
}
MapReduce Framework (Reduce & Job)
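To make the data flow of the two slides above concrete, here is a minimal single-process sketch in Python of the same map, shuffle, and reduce steps. It is not the .NET SDK and runs nothing on a cluster; the four-field log layout and the names EXPECTED_FIELDS and PAGE_POS are assumptions made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical log layout for illustration: "<date> <time> <page> <status>"
EXPECTED_FIELDS = 4
PAGE_POS = 2


def map_line(line):
    """Map step: emit (page, 1) for complete records, nothing otherwise."""
    parts = re.split(r"\s+", line.strip())
    if len(parts) != EXPECTED_FIELDS:
        return []
    return [(parts[PAGE_POS], 1)]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_hits(key, values):
    """Reduce step: sum the hit counts for one page."""
    return key, sum(values)


def total_hits(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of log lines."""
    pairs = [pair for line in lines for pair in map_line(line)]
    return dict(reduce_hits(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    sample = [
        "2013-05-07 12:11:00 /index.html 200",
        "2013-05-07 12:11:01 /about.html 200",
        "2013-05-07 12:11:02 /index.html 304",
        "malformed line",
    ]
    print(total_hits(sample))
```

On a real cluster the framework performs the shuffle and runs many mappers and reducers in parallel; the sketch only shows what each step contributes.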
‣ Linux-style shell commands to access data in HDFS
‣ Put file in HDFS: hadoop fs -put sales.csv /import/sales.csv
‣ List files in HDFS:
c:\Hadoop>hadoop fs -ls /import
Found 1 items
-rw-r--r-- 1 makromer supergroup 114 2013-05-07 12:11 /import/sales.csv
‣ View file in HDFS:
c:\Hadoop>hadoop fs -cat /import/sales.csv
Kromer,123,5,55
Smith,567,1,25
Jones,123,9,99
James,11,12,1
Johnson,456,2,2.5
Singh,456,1,3.25
Yu,123,1,11
‣ Now, we can work on the data with MapReduce, Hive, Pig, etc.
Get Data into Hadoop
create external table ext_sales
(
    lastname string,
    productid int,
    quantity int,
    sales_amount float
)
row format delimited fields terminated by ','
stored as textfile
location '/user/makromer/hiveext/input';

LOAD DATA INPATH '/user/makromer/import/sales.csv'
OVERWRITE INTO TABLE ext_sales;
Use Hive for Data Schema and Analysis
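Once the schema is in place, a typical first analysis is a per-product sales total. The sketch below is plain Python, not Hive: it applies the same four-column schema to the sample sales.csv rows shown earlier and computes what grouping by productid and summing sales_amount would return. The names SALES_CSV, COLUMNS, and sales_by_product are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the sales.csv listing in the deck.
SALES_CSV = """Kromer,123,5,55
Smith,567,1,25
Jones,123,9,99
James,11,12,1
Johnson,456,2,2.5
Singh,456,1,3.25
Yu,123,1,11
"""

# Column names taken from the ext_sales table definition.
COLUMNS = ["lastname", "productid", "quantity", "sales_amount"]


def sales_by_product(text):
    """Sum sales_amount per productid, like a GROUP BY with SUM."""
    totals = defaultdict(float)
    reader = csv.DictReader(io.StringIO(text), fieldnames=COLUMNS)
    for row in reader:
        totals[int(row["productid"])] += float(row["sales_amount"])
    return dict(totals)


if __name__ == "__main__":
    for product_id, total in sorted(sales_by_product(SALES_CSV).items()):
        print(product_id, total)
```

The point of Hive is that the same aggregation runs as distributed jobs over files far too large for a single process like this one.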
‣ sqoop import --connect jdbc:sqlserver://localhost --username sqoop --password password --table customers -m 1
‣ > hadoop fs -cat /user/mark/customers/part-m-00000
‣ > 5,Bob Smith
‣ sqoop export --connect jdbc:sqlserver://localhost --username sqoop --password password -m 1 --table customers --export-dir /user/mark/data/employees3
‣ 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Transferred 201 bytes in 32.6364 seconds (6.1588 bytes/sec)
‣ 12/11/11 22:19:24 INFO mapreduce.ExportJobBase: Exported 4 records.
Sqoop
Data transfer to & from Hadoop & SQL Server
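When Sqoop runs from a scheduled job, it can help to build the command line in code rather than by hand. The following is a rough Python sketch under stated assumptions: the helper name sqoop_command is hypothetical, and it swaps the slide's inline --password for Sqoop's -P option (prompt on the console) so the password stays out of the process list. It only builds the argument list and does not run Sqoop.

```python
import shlex


def sqoop_command(direction, connect, username, table, mappers=1, export_dir=None):
    """Build the argument list for a 'sqoop import' or 'sqoop export' run."""
    if direction not in ("import", "export"):
        raise ValueError("direction must be 'import' or 'export'")
    args = [
        "sqoop", direction,
        "--connect", connect,
        "--username", username,
        "-P",                      # prompt for the password instead of passing it inline
        "--table", table,
        "-m", str(mappers),        # number of parallel map tasks
    ]
    if direction == "export":
        if export_dir is None:
            raise ValueError("export needs the HDFS directory to read from")
        args += ["--export-dir", export_dir]
    return args


if __name__ == "__main__":
    cmd = sqoop_command("export", "jdbc:sqlserver://localhost", "sqoop",
                        "customers", export_dir="/user/mark/data/employees3")
    # Print a shell-safe version of the command for review.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The returned list can be handed to subprocess.run on a machine where Sqoop is installed.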
SQL Server Big Data – Data Loading
Amazon HDFS & EMR Data Loading
Amazon S3 Bucket
Role of NoSQL in a Big Data Analytics Solution
‣ Use NoSQL to store data quickly without the overhead of an RDBMS
‣ HBase, Plain Old HDFS, Cassandra, MongoDB, Dynamo, just to name a few
‣ Why NoSQL?
‣ In the world of “Big Data”
‣ “Schema later”
‣ Ignore ACID properties
‣ Drop data into key-value store quick & dirty
‣ Worry about query & read later
‣ Why NOT NoSQL?
‣ In the world of Big Data Analytics, you will need support from analytical tools with a SQL, SAS, or MR interface
‣ SQL Server and NoSQL
‣ Not a natural fit
‣ Use HDFS or your favorite NoSQL database
‣ Consider turning off SQL Server locking mechanisms
‣ Focus on writes, not reads (read uncommitted)
‣ SQL Server Database
‣ SQL 2012 Enterprise Edition
‣ Page Compression
‣ 2012 Columnar Compression on Fact Tables
‣ Clustered Index on all tables
‣ Auto-update statistics asynchronously
‣ Partition Fact Tables by month and archive data with sliding window technique
‣ Drop all indexes before nightly ETL load jobs
‣ Rebuild all indexes when ETL completes
‣ SQL Server Analysis Services
‣ SSAS 2012 Enterprise Edition
‣ 2008 R2 OLAP cubes partition-aligned with DW
‣ 2012 cubes are in-memory tabular cubes
‣ All access through MSMDPUMP or SharePoint
SQL Server Big Data Environment
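The sliding window technique for the monthly fact-table partitions can be sketched as simple date arithmetic. This Python example only computes which month boundaries stay online and which month is next to archive; it issues no SQL Server commands, and the function names and the three-month retention in the demo are assumptions for illustration.

```python
from datetime import date


def add_months(first_of_month, offset):
    """Return the first day of the month 'offset' months away."""
    index = first_of_month.year * 12 + (first_of_month.month - 1) + offset
    return date(index // 12, index % 12 + 1, 1)


def sliding_window(today, months_to_keep):
    """Return (partitions_to_keep, partition_to_archive) as first-of-month dates."""
    current = today.replace(day=1)
    keep = [add_months(current, -i) for i in range(months_to_keep - 1, -1, -1)]
    archive = add_months(current, -months_to_keep)
    return keep, archive


if __name__ == "__main__":
    keep, archive = sliding_window(date(2013, 5, 7), months_to_keep=3)
    print("keep:", [d.isoformat() for d in keep])
    print("archive:", archive.isoformat())
```

Each month the window slides forward by one: a new partition is added for the incoming month, and the oldest one is switched out and archived.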
‣Columnstore
‣Sqoop adapter
‣PolyBase
‣Hive
‣In-memory analytics
‣Scale-out MPP
SQL Server Big Data Analytics Features
[Diagram]
Data sources: Sensors, Devices, Bots, Crawlers, ERP, CRM, LOB apps
Unstructured and structured data
Data platforms: Parallel Data Warehouse, Hadoop on Windows Azure, Hadoop on Windows Server, Connectors
Data volumes: petabytes of data (unstructured), hundreds of TB of data (structured)
BI platform: SSRS, SSAS
Familiar end-user tools: Excel with PowerPivot, Embedded BI, Predictive Analytics
Data Market Place: Data Market
Microsoft’s Data Solution – Big Data & PDW
MICROSOFT BIG DATA
[Diagram]
Discover, Combine, Refine
Relational, Non-relational, Streaming
Immersive data experiences; connecting with the world's data; any data, any size, anywhere
Self-Service, Collaboration, Corporate Apps, Devices
Analytical, Parallel Data Warehouse, Microsoft HDInsight Server, HDInsight Service, StreamInsight, PowerPivot, Power View
Microsoft .NET Hadoop APIs
‣ WebHDFS
‣ Linq to Hive
‣ MapReduce
‣ C#
‣ Java
‣ Hive
‣ Pig
‣ http://hadoopsdk.codeplex.com/
‣ SQL on Hadoop
‣ Cloudera Impala
‣ Teradata SQL-H
‣ Microsoft PolyBase
‣ Hadapt
Data Movement to the Cloud
‣Use Windows Azure Blob Storage
• Already stored in 3 copies
• Hadoop can read from Azure blob storage
• Allows you to upload while using no Hadoop network or CPU resources
‣Compress files
• Hadoop can read Gzip
• Uses less network resources than uncompressed
• Costs less for direct storage costs
• Compress directories where source files are created as well.
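As a small illustration of the compression advice, the sketch below gzips a file before upload using only the Python standard library. The helper name gzip_file and the sample data are made up for this example, and the upload to blob storage itself is left out.

```python
import gzip
import os
import shutil
import tempfile


def gzip_file(src_path):
    """Write a gzip-compressed copy next to the source and return its path."""
    dst_path = src_path + ".gz"
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream in chunks so large files fit in memory
    return dst_path


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    source = os.path.join(workdir, "sales.csv")
    with open(source, "wb") as handle:
        handle.write(b"Kromer,123,5,55\n" * 1000)  # repetitive sample data
    compressed = gzip_file(source)
    print(os.path.getsize(source), "bytes ->", os.path.getsize(compressed), "bytes")
```

One trade-off to keep in mind: a gzip file cannot be split, so Hadoop processes each .gz file with a single map task. Many moderately sized compressed files tend to parallelize better than one very large one.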
‣ What is a Big Data approach to Analytics?
‣ Massive scale
‣ Data discovery & research
‣ Self-service
‣ Reporting & BI
‣ Why do we take this Big Data Analytics approach?
‣ TBs of change data in each subject area
‣ The data in the sources are variable and unstructured
‣ SSIS ETL alone couldn’t keep up or handle complexity
‣ SQL Server 2012 columnstore and tabular SSAS 2012 are key to using SQL
Server for Big Data
‣ With the configs mentioned previously, SQL Server works great
‣ Analytics on Big Data also requires Big Data Analytics tools
‣ Aster, Tableau, PowerPivot, SAS, Parallel Data Warehouse
Wrap-up
