On the move with Big Data
Hadoop, Pig, Sqoop, SSIS…

Stéphane Fréchette
Thursday February 13, 2014
Who am I?
My name is Stéphane Fréchette

SQL Server MVP - I’m a Database & Business Intelligence Professional and Founder | CEO
of
I have a passion for architecting, designing and building solutions that matter.
Self-proclaimed Open Data Hacker/Advocate. I founded Gatineau Ouverte, a citizen-led
initiative which aims to promote open access to civic data of the city of Gatineau.

Twitter: @sfrechette
Blog: stephanefrechette.com
Email: stephanefrechette@ukubu.com
Session Outline
• What is Big Data?
• Apache Hadoop
• Hadoop Ecosystem
• Windows Azure HDInsight
• On the move…
• SSIS, Sqoop, Pig

• Demos
• Resources
What is Big Data?

Apache Hadoop
• Open-source software framework that allows for the distributed processing
of large data sets across clusters of computers using simple programming
models
• Designed to scale up from single servers to thousands of machines, each
offering local computation and storage
Hadoop Ecosystem
• Core components:
• HDFS (Hadoop Distributed File System) -> Storage
• MapReduce -> Processing
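
The split between storage and processing is easier to see with the processing model written out. Below is a minimal, single-machine sketch of the MapReduce pattern in plain Python (a word count). It is not Hadoop code: the function names are hypothetical, and a real cluster would run the map and reduce steps in parallel over blocks stored in HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data on the move", "move the data"]))
```

The point of the model is that `map_phase` and `reduce_phase` only ever see one line or one key at a time, which is what lets a framework spread them across many machines.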
What is Pig?
• Write complex MapReduce jobs using a simple script language (Pig Latin)

• A platform for analyzing large data sets that consists of a high-level language
for expressing data analysis programs
• Pig translates and compiles complex MapReduce jobs on the fly

http://pig.apache.org
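
A Pig Latin script describes a data flow as a chain of named relations. The sketch below mirrors one such flow (load, filter, group, aggregate) in plain Python, with the Pig Latin statement each step roughly corresponds to shown as a comment. The file name, field names, and sample rows are hypothetical, and the Python only imitates the result; Pig itself would compile the script into MapReduce jobs.

```python
from collections import defaultdict

# Hypothetical input rows, standing in for:
#   trips = LOAD 'trips.csv' USING PigStorage(',') AS (city:chararray, km:int);
trips = [("Gatineau", 12), ("Ottawa", 3), ("Gatineau", 7), ("Ottawa", 25)]


def long_trips_per_city(rows, min_km):
    """Return {city: (trip_count, total_km)} for trips of at least min_km."""
    # long_trips = FILTER trips BY km >= 5;
    long_trips = [(city, km) for city, km in rows if km >= min_km]

    # by_city = GROUP long_trips BY city;
    by_city = defaultdict(list)
    for city, km in long_trips:
        by_city[city].append(km)

    # totals = FOREACH by_city GENERATE group, COUNT(long_trips), SUM(long_trips.km);
    return {city: (len(kms), sum(kms)) for city, kms in by_city.items()}


if __name__ == "__main__":
    print(long_trips_per_city(trips, min_km=5))
```

Each commented Pig Latin line is a single statement, whereas the equivalent hand-written MapReduce job would need separate mapper and reducer code for the grouping step.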
What is Sqoop?
• Command-line interface application to transfer bulk data between Hadoop
and relational datastores

http://sqoop.apache.org
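
As an illustration of what a bulk transfer looks like, here is a small Python helper that assembles the argument list for a `sqoop import` from SQL Server into HDFS. It only builds the command and does not run it. The host, database, table, and paths are hypothetical placeholders; the options used (`--connect`, `--username`, `--password-file`, `--table`, `--target-dir`, `--num-mappers`) are standard Sqoop 1 import arguments, and connecting to SQL Server assumes the Microsoft JDBC driver is available to Sqoop.

```python
import shlex


def sqoop_import_command(jdbc_url, username, password_file, table, target_dir, mappers=4):
    """Build the argument list for a `sqoop import` run (does not execute it)."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,             # JDBC connection string of the source database
        "--username", username,
        "--password-file", password_file,  # file holding the password, kept off the command line
        "--table", table,                  # source table to copy
        "--target-dir", target_dir,        # HDFS directory that receives the rows
        "--num-mappers", str(mappers),     # number of parallel map tasks
    ]


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    command = sqoop_import_command(
        jdbc_url="jdbc:sqlserver://sqlhost.example.com:1433;databaseName=Sales",
        username="etl_user",
        password_file="/user/etl_user/.sqlserver-password",
        table="Orders",
        target_dir="/data/sales/orders",
    )
    print(shlex.join(command))
```

Building the command as a list rather than a single string keeps the semicolon in the JDBC URL from being interpreted by a shell if the list is later passed to `subprocess.run`.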
What is Hive?
• A data warehouse infrastructure built on top of Hadoop for providing data
summarization, query, and analysis
• Provides an SQL-like language called HiveQL to query data
• Integration between Hadoop and BI and visualization tools

http://hive.apache.org
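
To give a feel for how close HiveQL is to SQL, the sketch below composes a simple summarization query as text. Nothing is executed: the helper name, table, and column names are hypothetical, and in practice the resulting statement would be submitted through a Hive client or an ODBC/JDBC connection.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _checked(name):
    """Accept only plain identifiers, since they are spliced into the query text."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"not a plain identifier: {name!r}")
    return name


def hive_totals_query(table, group_column, amount_column):
    """Return a HiveQL aggregation query as a string (hypothetical helper)."""
    table = _checked(table)
    group_column = _checked(group_column)
    amount_column = _checked(amount_column)
    return (
        f"SELECT {group_column}, COUNT(*) AS row_count, SUM({amount_column}) AS total\n"
        f"FROM {table}\n"
        f"GROUP BY {group_column}"
    )


if __name__ == "__main__":
    print(hive_totals_query("orders", "order_date", "amount"))
```

The generated statement reads like ordinary SQL; Hive turns it into jobs that scan the underlying files in Hadoop.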
What is SSIS?
• SQL Server Integration Services is a platform for data integration and
workflow applications. A fast and flexible tool used for data extraction,
transformation, and loading (ETL).
• Contains a rich set of built-in tasks and transformations; tools for constructing
packages…
• Used to solve complex business problems
Windows Azure HDInsight
• HDInsight is a Hadoop-based service from Microsoft that brings a 100
percent Apache Hadoop solution to the cloud
• Based on the Hortonworks Data Platform
• Scalable, on-demand service
Demos
(let’s move some data…)
Resources
• Apache Projects (list with links) http://bit.ly/MfpLtE
• Windows Azure HDInsight http://bit.ly/1dnlAX1
• HDInsight Tutorials and Guide http://bit.ly/LWRYol
• Hortonworks Sandbox 2.0 http://bit.ly/1gkkCte
• Hortonworks Tutorial Gallery http://bit.ly/1nvMAEX
• Microsoft JDBC Driver 4.0 for SQL Server http://bit.ly/1kEgJ7O
• Microsoft Hive ODBC Driver http://bit.ly/NFkhcH
• GitHub: WindowsAzure / azure-content http://bit.ly/1hfthlF
• SSIS Custom Task – Disorderly Data (Ken Ross) http://bit.ly/1nvIH2G
  • GitHub https://github.com/kzhen/SSISHDFS
What Questions Do You Have?
Thank You
For attending this session

