A Year in Review - Building a Comprehensive Data Management Program @ Microsoft Research
What Exactly Is Big Data?
Wikipedia: “Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications”
Critical tool for Microsoft’s businesses
Opportunity to deliver transformative new capabilities to our enterprise customers
MSR and Big Data
First, the sword: Shame on us…
Many undergrads with better big data capabilities
Martians versus Earthlings
Finally… big data has been fully embraced by MSR as:
• A vital tool to enable research
• A vital area in which to do research
We are MAKING THE INVESTMENT
Microsoft Research’s Centralized Data Management and Data Processing Platform
Founded June 2013
Project Vision
Motivation:
• Numerous Areas of Research are Driven by Data (Research Need)
• Data comes in very different forms from very different sources (Adapting to Change)
• Identified the need for a standardized Data Storage and Data Processing resource for MSR (Community)
• Many different research groups were processing and storing the same data sets (Shared Knowledge / Data Sharing)
• Some research groups were not aware that so many different types of data were available (Communication / Collaboration)
Diagram labels: RESEARCH DATA (INTERNAL AND EXTERNAL); Adapting to Change; Community; Collaboration; Shared Knowledge; Data Sharing; Research Need
Guiding Principles:
• Secure and Compliant (e.g. Data Security, Privacy and Ethics)
• World-wide Access (equal opportunity for access and use given to all MSR labs)
• Created through Partnerships with teams throughout Microsoft
• Driven by Researcher Needs and Requirements (e.g. Tools, Hardware, Software, Datasets)
• Flexibility
Diagram labels: RESEARCH DATA (INTERNAL AND EXTERNAL); Security; Driven by Researcher Needs; Research and Product Team Partnerships; Global Access; Compliance; Ethics
Goals:
• Centralized, Compliant, and Curated Data Storage Facilities
• Multi-Purpose Data Processing Architecture (mix of different types of Hardware)
• Flexibility with Software
• Active User Community (supported through Outreach and Training)
Diagram labels: RESEARCH DATA (INTERNAL AND EXTERNAL); Centralized; Compliant; Curated; User Community; Flexibility with Software and Tools; Blend of Technology and Services
Centralized Data Management
Research and Innovation Support
Innovative Hardware and Tools
Partnerships
Data Privacy and Security
Community and Outreach
System Architecture
Diagram labels: RESEARCH DATA (INTERNAL AND EXTERNAL); Hadoop; GPU; HPC; Azure; Sandbox; Bing
Diagram labels: RESEARCH DATA (INTERNAL AND EXTERNAL); MNIST
Diagram label: Bing
Data Management
Diagram labels: RESEARCH DATA (INTERNAL AND EXTERNAL); MNIST; Compliance; Security; Data Management; Ethics; Policy
Compliance, Security, Ethics
• Policy / Procedure
• Standardization / Common Platform
• Technology

• Corporate Technology and Compliance
• Standardization / Common Platform
• Technology

• Ethical Review Board / Legal and Corporate Affairs
• Standardization / Common Platform
• Technology
Fun Examples
F#
Naiad
Skype Translator
Azure ML
Discussion / Questions / Next Steps
