BigData in Azure
Venkatesh
Introduction to Azure
• Azure Cloud Service
• PaaS
• IaaS
What is BigData
• Analyzing extremely large datasets computationally to
reveal patterns, trends and associations.
• Characterized by 3Vs (Volume, Velocity and Variety).
• Enhanced insight and decision making.
BigData vs Database
Microsoft BigData solutions
• Microsoft supports Hadoop-based BigData solutions.
• Built on top of Hortonworks Data Platform (HDP)
• Three distinct solutions based on HDP
• HDInsight
• HDP for Windows
• Microsoft Analytics Platform
Microsoft Data Platform
Hadoop
• Hadoop – Framework for solving big data problems using a scale-out “divide
and conquer” approach
• HDFS – Hadoop Distributed File System. Allows data to be split across
multiple nodes.
• MapReduce – Enables distributed processing.
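To make the "divide and conquer" idea concrete, here is a minimal single-process sketch of the MapReduce pattern in plain Python. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample chunks are hypothetical, and each string stands in for a block of a file that HDFS would hold on a different node.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    """Run map over every chunk independently, then shuffle and reduce."""
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return reduce_phase(shuffle(mapped))


# Each string plays the role of one data block stored on a separate node.
chunks = ["big data in azure", "big data on hadoop"]
counts = word_count(chunks)
```

In a real cluster the map calls would run in parallel on the nodes that hold each block, and only the grouped intermediate pairs would move across the network.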
Hadoop Components
• Cluster – Collection of server nodes that stores data using HDFS and processes it.
• Datastore – The data store on each server is part of a distributed storage service
(HDFS or an equivalent)
• Query – Big data processing queries run using MapReduce
HDInsight
• Implementation of Hadoop that runs on Azure Platform
• Pay only for what you use
• Dynamic allocation of Nodes in the cluster
• Integrated with Azure storage
HDInsight - Data Storage
• HDInsight supports the following types of storage:
• HDFS (Standard Hadoop)
• Azure Blob storage
• HBase
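As a small illustration of the Blob storage option, Hadoop jobs on HDInsight usually address blob data through a wasb:// (or TLS-protected wasbs://) URI rather than an hdfs:// path. The sketch below only builds such a URI. The helper name wasb_uri and the container, account and path values are hypothetical placeholders.

```python
def wasb_uri(container, account, path, secure=True):
    """Build a WASB URI for a blob in an Azure Storage container.

    Shape: wasb[s]://<container>@<account>.blob.core.windows.net/<path>
    """
    scheme = "wasbs" if secure else "wasb"
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


# Hypothetical container, storage account and blob path.
uri = wasb_uri("logs", "mystorage", "/2015/web.log")
```

Because the data lives in the storage account and not on the cluster nodes, the same URI remains valid after the cluster is deleted and recreated.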
HDInsight – Data Processing
• Run jobs directly on the cluster using MapReduce
• Use external programs to connect to the cluster.
• Pig – Execute queries by writing scripts in a high-level language
• Hive – SQL-like queries on the data
• Mahout – ML library for performing data mining queries
• Storm – Real time computation for processing fast, large streams of data
Data Loading Options
Designing for HDInsight
• Determine the analytical goals and source data
• Plan and configure the infrastructure
• Obtain data and submit it to HDInsight
• Process the data
• Evaluate the results
• Tune the solution
Azure DataLake
• Single place to store all structured and semi-structured data in native format
• Unlimited data size
• Compatible with HDFS
Creating HDInsight Cluster
Summary
• Hadoop – De facto solution to the Big Data problem
• Windows Azure HDInsight Service
• Native Hadoop implementation
• Managed Hadoop Service for Windows Azure



Editor's Notes

  • #6: HDP is 100% compatible with Apache Hadoop. Open Enterprise Hadoop Data Platform – Enterprise ready.
      • HDInsight – Available to Azure subscribers. Runs HDP on Azure clusters and integrates with Azure storage.
      • HDP for Windows – On-premises solution for running Hadoop on Windows Server, on physical or virtual machines.
      • Microsoft Analytics Platform – Massively Parallel Processing (MPP) in Microsoft Parallel Data Warehouse (PDW) with Hadoop.
  • #9:
      • Cluster – The cluster is managed by a server called the name node, which knows all the cluster servers and which parts of the data files are stored on each one. To store incoming data, the name node directs the client to the appropriate data node. The name node also manages replication of data files across the other cluster members, which communicate with each other to replicate the data.
      • DataStore – Key/value store, document store (XML or JSON), binary store, column store, graph store.
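The name node's bookkeeping role described in note #9 can be sketched as a toy model. This is an illustration only: the class and method names are hypothetical, and the round-robin placement is a simplifying assumption. Real HDFS uses a rack-aware placement policy and a far richer protocol.

```python
class NameNode:
    """Toy name node: tracks which data nodes hold each block of a file."""

    def __init__(self, data_nodes, replication=3):
        self.data_nodes = list(data_nodes)
        # Cannot keep more replicas than there are data nodes.
        self.replication = min(replication, len(self.data_nodes))
        self.block_map = {}  # block id -> list of data nodes holding a replica
        self._next = 0       # round-robin cursor (assumption, not HDFS policy)

    def allocate(self, block_id):
        """Choose the data nodes a client should write this block to."""
        n = len(self.data_nodes)
        targets = [self.data_nodes[(self._next + i) % n]
                   for i in range(self.replication)]
        self._next = (self._next + 1) % n
        self.block_map[block_id] = targets
        return targets

    def locate(self, block_id):
        """Tell a reading client which data nodes hold the block."""
        return self.block_map[block_id]


# Hypothetical four-node cluster keeping three replicas of each block.
name_node = NameNode(["node-a", "node-b", "node-c", "node-d"], replication=3)
first = name_node.allocate("file1-block0")
second = name_node.allocate("file1-block1")
```

The point of the model is that the name node holds only metadata. The block contents themselves travel directly between the client and the data nodes.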
  • #10:
      • Cassandra™: A scalable multi-master database with no single points of failure.
      • Chukwa™: A data collection system for managing large distributed systems.
      • HBase™: A scalable, distributed database that supports structured data storage for large tables.
      • HCatalog™: A table and storage management service for data created using Apache Hadoop.
      • Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
      • Mahout™: A scalable machine learning and data mining library.
      • Pig™: A high-level data-flow language and execution framework for parallel computation.
      • ZooKeeper™: A high-performance coordination service for distributed applications.
  • #11:
      • HDFS – When you store your data using HDFS, it is contained within the nodes of the cluster and must be accessed through the HDFS API. When the cluster is decommissioned, the data is lost as well.
      • Azure Storage – Provides several advantages: you can load the data using standard tools, the data is retained when you decommission the cluster, the cost is lower, and other processes in Azure can access the data.
      • HBase – A NoSQL wide-column data store implemented as a distributed system that provides data processing and storage over multiple nodes in a Hadoop cluster. It provides a random, real-time, read/write data store designed to host tables that can contain billions of rows and millions of columns.
  • #12:
      • Pig – Another framework. Using the Pig Latin language, you can create MapReduce jobs more easily than by coding them in Java yourself. The language has simple statements for loading data, storing it in intermediate steps, and computing over it, and it is MapReduce aware.
      • Hive – For when you want to work with the data on your cluster in a relational-friendly format. Hive allows you to create a data warehouse on top of HDFS or other file systems and uses a language called HiveQL, which has a lot in common with the Structured Query Language (SQL).
      • Mahout – A machine learning library that allows you to perform data mining queries that examine data files to extract specific types of information. For example, it supports recommendation mining (finding users' preferences from their behavior), clustering (grouping documents with similar topic content), and classification (assigning new documents to a category based on existing categorization).
      • Storm – A distributed real-time computation system for processing fast, large streams of data. It allows you to build trees and directed acyclic graphs (DAGs) that asynchronously process data items using a user-defined number of parallel tasks. It can be used for real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more.
  • #13:
      • Interactive data ingestion – For small amounts of data.
      • Automated batch upload to HDInsight – Using SSIS to gather disparate data sources and push them to HDInsight.
      • Relational data – Sqoop, for loading relational data into HDInsight.
      • Web log files – Flume.
  • #15: Built on top of Azure’s hyperscale network and supports both single files that can be multiple petabytes in size,