SQLSaturday #230 Rheinland
Sascha Dittmann
Software Architect & Developer – Ernst & Young GmbH
www.sascha-dittmann.de
Georg Urban
Snr. Technology Solution Professional | Data Platform
georg.urban@microsoft.com
13.07.2013
SQLSaturday #230 - Introduction to Microsoft Big Data (Part 1)
THE HADOOP ECOSYSTEM
Big Data Characteristics: "3 Vs"
How to deal with the "3 Vs"?
A brief history of Hadoop
2002: Apache Nutch, an open source search engine, is started by Doug Cutting
2003: Google publishes a paper on GFS (Google File System)
2004: Nutch Distributed File System (NDFS) is developed
2004: Google publishes a paper on MapReduce
2005: MapReduce is implemented on NDFS
2006: Doug Cutting joins Yahoo! & starts Apache Hadoop subproject
2008: Hadoop is made an Apache top-level project
…Yahoo's search index runs on a 10,000-node cluster
…Hadoop breaks record on 1 TB sort: 209 s on 910 nodes
…New York Times converts 4 TB of archives to PDFs in 24 h on 100 nodes
http://labs.google.com/papers/mapreduce.htm
Today: Hadoop becomes a synonym for Big Data processing
Hadoop: The popular Face of Big Data
RDBMS & Hadoop Comparison
              Traditional RDBMS          MapReduce
Data Volume   Terabytes                  Petabytes / Exabytes
Access        Interactive & Batch        Batch
Updates       Read / Write many times    Write once, Read many times
Structure     Static Schema              Dynamic Schema
Integrity     High (ACID)                Low (BASE*)
Scaling       Non-linear                 Linear
DBA Ratio     1:40                       1:3000
Source: Tom White's Hadoop: The Definitive Guide
*Basically Available, Soft state, Eventual consistency
MapReduce is simple… (well: basically)
The Hadoop Ecosystem (simplified)
Source: Tom White's Hadoop: The Definitive Guide
The Hadoop Ecosystem (parts of it…)
HBase (Column DB)
Hive
Mahout
Oozie
Sqoop
HBase/Cassandra/Couch/MongoDB
Avro
Zookeeper
Pig
Karmasphere
Flume
Cascading
R
Ambari
HCatalog
Datameer
Hortonworks
Cloudera
Splunk
HStreaming
MapR
Hadapt
Hadoop = MapReduce + HDFS
There's even more: Mahout for machine learning
• Scalable machine learning library that leverages the Hadoop infrastructure
• Key use cases:
  • Recommendation mining
  • Clustering
  • Classification
• Algorithms: K-means Clustering, Naïve Bayes, Decision Tree, Neural network, Hierarchical Clustering, Positive Matrix Factorization and more…
R for statistical computing
• An open and extensible statistical computing environment
• Based on the S language
• Used by Data Scientists to explore data and generate graphical output
• A well-developed programming language
• Many "Packages" available to extend R
…but: That's not Enterprise ready… Really not…
HDINSIGHT
SERVER & SERVICE
Big Data in the Enterprise should…
fit into an existing IT infrastructure
be easy to manage
rely on existing skill sets
be cost optimized
Why Apache Hadoop on Windows?
• According to IDC, Windows Server held 73% market share in 2012
• Hadoop was traditionally built for Linux servers, so there are a large number of underserved organizations
• According to the 2012 Barclays CIO study, big data outranks virtualization as the #1 trend driving spending initiatives
• Unstructured data growth exceeds 80% year over year in most enterprises
• Apache Hadoop is the de facto big data platform for processing massive amounts of unstructured data
• Complementary to existing Microsoft technologies
• There is a huge untapped community of Windows developers and ecosystem partners
• A strong Microsoft-Hortonworks partnership and 18 months of development make this a natural next step
OS, Cloud, VM, Appliance
Enterprise Hadoop Distribution: Hortonworks Data Platform (HDP)
• Hadoop designed for Enterprises
• The "really complete" Open Source Distribution
• Eco-System designed for Interoperability
[Diagram: HORTONWORKS DATA PLATFORM (HDP)]
HADOOP CORE: Distributed Data Storage & Processing
PLATFORM SERVICES: Enterprise Availability
DATA SERVICES: Store, Process & Connect
OPERATIONAL SERVICES: Management of Hadoop Environment
Leadership that Starts at the Core
• Driving next generation Hadoop
  • YARN, MapReduce2, HDFS2, High Availability, Disaster Recovery
• 420k+ lines authored since 2006
  • More than twice nearest contributor
• Deeply integrating w/ecosystem
  • Enabling new deployment platforms (ex. Windows & Azure, Linux & VMware HA)
  • Creating deeply engineered solutions (ex. Teradata big data appliance)
• All Apache, NO holdbacks
  • 100% of code contributed to Apache
HDInsight: Windows-optimized Hadoop
Big Data @Microsoft
Microsoft HDInsight Server on Windows Server
Windows Azure HDInsight Service (Cloud)
Enterprise Ready Hadoop
Simplicity & Manageability of Windows
AD Integration
Monitoring (System Center)
Integrated in Microsoft Business Intelligence
JavaScript, HiveODBC, .NET
…
Up and running in minutes with HDInsight Service
Microsoft Big Data Solution (two months ago…)
BIG DATA IN THE CLOUD
Windows Azure: Elastic Big Data
[Diagram: Windows Azure HDInsight Service. A Hadoop cluster (Hadoop on Azure) with one Name Node and four Data Nodes on HDFS, connected to Azure Blob Storage.]
On Premise Enterprise Content
• Transactional DBs
• On Prem logs
• Internal sensors
Cloud Enterprise Content
• Generated in Azure
3rd Party Content
• Azure Datamarket
• Generated/stored elsewhere
• Public content
• Delivered online
Azure Blob Storage
SQL Azure
Application end point
Using Blob Storage From HDInsight
• HDInsight cluster is bound to one "default" blob storage account & container at cluster create time
• Using the "default" container requires no special addressing to access ("/" == root folder, etc.)
• Access additional blob storage accounts or containers:
  asv[s]://<container>@<account>.blob.core.windows.net/<path>
• Storage accounts need to be registered in site-config.xml:
  <property>
    <name>fs.azure.account.key.accountname</name>
    <value>enterthekeyvaluehere</value>
  </property>
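The addressing pattern on this slide can be sketched in Python. This is only an illustration of the asv[s]:// URI template and the property entry shown above; the function names `asv_uri` and `account_key_property` and the sample container and account names are hypothetical and not part of HDInsight.

```python
def asv_uri(container: str, account: str, path: str = "", secure: bool = False) -> str:
    """Build a URI of the form asv[s]://<container>@<account>.blob.core.windows.net/<path>."""
    scheme = "asvs" if secure else "asv"
    # Strip a leading slash so the path is not doubled after the host name.
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def account_key_property(account: str, key: str) -> str:
    """Render a <property> entry following the pattern shown on the slide."""
    return (
        "<property>\n"
        f"  <name>fs.azure.account.key.{account}</name>\n"
        f"  <value>{key}</value>\n"
        "</property>"
    )


# Hypothetical names, for illustration only.
print(asv_uri("mycontainer", "myaccount", "/input/twitter"))
print(account_key_property("myaccount", "enterthekeyvaluehere"))
```

The `secure` flag only switches the scheme between asv and asvs, mirroring the `asv[s]` notation in the template.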
Transporting Data with AzCopy
• Utility for moving data to/from Azure Blob Storage (like robocopy)
• 50 MB/s transfer rate in data center
Container     Blob Name
mycontainer   a.txt
mycontainer   b.txt
mycontainer   dir1\c.txt
mycontainer   dir1\dir2\d.txt
Intro to HDInsight
Map/Reduce
[Diagram: three DataNodes each run Map, Sort and Shuffle; their output feeds a single Reduce]
Input records:
0067011990999991950051507004+68750
0043011990999991950051512004+68750
0043011990999991950051518004+68750
0043012650999991949032412004+62300
0043012650999991949032418004+62300
Map output (year, temperature):
1949,0
1950,22
1950,55
1952,-11
1950,33
After Sort and Shuffle:
1949,0
1950,[22,33,55]
1952,-11
Reduce output (maximum temperature per year):
1949,0
1950,55
1952,-11
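The data flow on this slide can be retraced in a few lines of Python. This is a single-process sketch, not Hadoop code: it starts from the (year, temperature) pairs shown as map output, groups them by year as the shuffle step does, and reduces each group to its maximum. The function names are illustrative only.

```python
from collections import defaultdict

# (year, temperature) pairs as emitted by the map phase on the slide.
map_output = [(1949, 0), (1950, 22), (1950, 55), (1952, -11), (1950, 33)]


def shuffle(pairs):
    """Group values by key and sort, as the sort/shuffle phase does."""
    grouped = defaultdict(list)
    for year, temp in pairs:
        grouped[year].append(temp)
    return {year: sorted(temps) for year, temps in sorted(grouped.items())}


def reduce_max(grouped):
    """Reduce each group of temperatures to its maximum."""
    return {year: max(temps) for year, temps in grouped.items()}


grouped = shuffle(map_output)
result = reduce_max(grouped)
print(grouped)  # {1949: [0], 1950: [22, 33, 55], 1952: [-11]}
print(result)   # {1949: 0, 1950: 55, 1952: -11}
```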
Map/Reduce with Combine
[Diagram: three DataNodes each run Map, Combine, Sort and Shuffle; their output feeds a single Reduce]
Input records:
0067011990999991950051507004+68750
0043011990999991950051512004+68750
0043011990999991950051518004+68750
0043012650999991949032412004+62300
0043012650999991949032418004+62300
Map output (year, temperature):
1949,0
1950,22
1950,55
1952,-11
1950,33
Combine output (local maximum per node):
1949,0
1950,55
1952,-11
1950,33
After Sort and Shuffle:
1949,0
1950,[33,55]
1952,-11
Reduce output (maximum temperature per year):
1949,0
1950,55
1952,-11
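The effect of the combine step can be sketched the same way. The slide does not say which pairs were produced on which DataNode, so the split in `node_outputs` below is an assumption chosen to reproduce the numbers shown; the function names are illustrative. Taking the maximum works as a combiner because it is associative and commutative, so applying it per node first does not change the final result.

```python
from collections import defaultdict

# Assumed distribution of the slide's map output across three nodes.
node_outputs = [
    [(1949, 0), (1950, 22), (1950, 55)],
    [(1952, -11)],
    [(1950, 33)],
]


def combine(pairs):
    """Pre-aggregate on one node: keep only the local maximum per year."""
    best = {}
    for year, temp in pairs:
        best[year] = max(temp, best.get(year, temp))
    return sorted(best.items())


def shuffle(pairs):
    """Group the combined values by year."""
    grouped = defaultdict(list)
    for year, temp in pairs:
        grouped[year].append(temp)
    return {year: sorted(temps) for year, temps in sorted(grouped.items())}


combined = [pair for node in node_outputs for pair in combine(node)]
grouped = shuffle(combined)
result = {year: max(temps) for year, temps in grouped.items()}

print(combined)  # [(1949, 0), (1950, 55), (1952, -11), (1950, 33)]
print(grouped)   # {1949: [0], 1950: [33, 55], 1952: [-11]}
print(result)    # {1949: 0, 1950: 55, 1952: -11}
```

Only four pairs cross the network instead of five, and the reducer still produces the same output.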
Map/Reduce (JavaScript)
Refining with Pig Latin
pig
  .from("/user/Sascha/input/twitter")
  .mapReduce("/user/…/FollowersCount.js", "User, Followers:long")
  .orderBy("Followers DESC")
  .take(10)
  .to("/user/Sascha/output/Top10Followers")
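As a rough illustration of what the orderBy and take steps in this chain compute, here is a small Python sketch. It assumes the mapReduce step yields rows of (User, Followers), which is what the schema string suggests; the contents of FollowersCount.js are not shown on the slide, and the function name and sample rows below are hypothetical.

```python
import heapq


def top_followers(rows, n=10):
    """Return the n rows with the most followers, in descending order.

    rows is an iterable of (user, followers) tuples.
    """
    return heapq.nlargest(n, rows, key=lambda row: row[1])


# Hypothetical sample rows, for illustration only.
sample = [("alice", 120), ("bob", 4500), ("carol", 980), ("dave", 15)]
print(top_followers(sample, n=2))  # [('bob', 4500), ('carol', 980)]
```

`heapq.nlargest` avoids sorting the full input when only the top few rows are needed, which is the same idea as ordering descending and taking the first 10.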
Pig Latin
Map in C# (Classic)
Reduce in C# (Classic)
Map/Reduce with C#
.NET Job Submission Framework (Map)
.NET Job Submission Framework (Reduce)
Many thanks to the volunteers!
Big prize draw!
• At the end of the event (approx. 6:00 pm)
• Win lots of prizes!
• So:
Visit our sponsors!
Our "You Rock!" sponsors
Many thanks to all our sponsors!
Gold
Silver
Bronze
Media sponsors:
Hands-on event: PASS Camp 2013!

Editor's Notes

  • #21: In that capacity, Arun allows Hortonworks to be instrumental in working with the community to drive the roadmap for Core Hadoop, where the focus today is on things like YARN, MapReduce2, HDFS2 and more. For Core Hadoop, in absolute terms, Hortonworkers have contributed more than twice as many lines of code as the next closest contributor, and even more if you include Yahoo, our development partner. Taking such a prominent role also enables us to ensure that our distribution integrates deeply with the ecosystem: on choice of deployment platforms such as Windows, Azure and more, and also in creating deeply engineered solutions with key partners such as Teradata. And consistent with our approach, all of this is done in 100% open source.