Hadoop as the Platform for the
Smartgrid at TVA
August 26, 2010
Topics

•   Introduction
•   Retrospective on the openPDC project
•   What Is Hadoop?
•   Current Smartgrid Obstacles
•   Cloudera Enterprise as The New Smartgrid Platform
•   Summary




                  Copyright 2010 Cloudera Inc. All rights reserved   2
Today’s speaker – Josh Patterson

 • josh@cloudera.com
 • Master’s Thesis: self-organizing mesh networks
    • Published in IAAI-09: TinyTermite: A Secure Routing Algorithm
 • Conceived, built, and led Hadoop integration for the
   openPDC project at TVA
    • Led small team which designed classification techniques
      for timeseries and Map Reduce
    • Open source work at http://openpdc.codeplex.com
 • Now: Solutions Architect at Cloudera



What is the openPDC?

• The openPDC is a complete set of applications for
  processing streaming time-series data in real-time
   • Measured data is gathered with GPS time from multiple input
     sources, time-sorted, provided to user-defined actions, and
     dispersed to custom output destinations for archival
• NERC funded
• Started at the Tennessee Valley Authority (TVA)
• Now in use by many government controlled power
  companies around the world
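
The gather, time-sort, and dispatch flow described above can be sketched in a few lines of Python. This is only an illustration of the idea, not openPDC code: the `Measurement` type, the `time_sorted` and `concentrate` functions, and the sample values are all hypothetical names and data invented for this sketch, and it assumes each input source already delivers its own measurements in time order.

```python
import heapq
from typing import Callable, Iterable, Iterator, List, NamedTuple


class Measurement(NamedTuple):
    timestamp: float  # GPS-synchronized time in seconds (hypothetical representation)
    source: str       # identifier of the input device
    value: float      # measured quantity


def time_sorted(*streams: Iterable[Measurement]) -> Iterator[Measurement]:
    """Merge per-source streams (each already time-ordered) into one time-ordered stream."""
    return heapq.merge(*streams, key=lambda m: m.timestamp)


def concentrate(streams, actions: List[Callable], outputs: List[Callable]) -> None:
    """Hand every time-ordered measurement to each user action, then to each output."""
    for m in time_sorted(*streams):
        for action in actions:
            action(m)
        for out in outputs:
            out(m)


# Two hypothetical sources sampled at 30 Hz, offset by one sample.
pmu_a = [Measurement(0 / 30, "pmu_a", 60.00), Measurement(2 / 30, "pmu_a", 60.01)]
pmu_b = [Measurement(1 / 30, "pmu_b", 59.99)]

seen, archive = [], []
concentrate([pmu_a, pmu_b], actions=[seen.append], outputs=[archive.append])
```

After the call, `archive` holds the three measurements interleaved by timestamp rather than grouped by source.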


openPDC Topology




openPDC: Why?

Northeast Blackout of 2003
• Significant failure of US power grid in 2003 due to cascading
  effects
• SCADA provided, at best, a limited view of what happened
• NERC mandated that companies collect high resolution data
  and store for later analysis
• Power grid in US is aging rapidly, cost of needed overhaul is
  significant




How “Big Data” Challenged the openPDC Project

 “We Need More Power, Scotty”



 • Data was sampled 30 times a second
 • Number of sensors (Phasor Measurement Units / PMU) was
   increasing rapidly (was 120, heading towards 1000 over next 2
   years, currently taking in 4.2 billion samples per day)
 • Cost of SAN storage became excessive
 • Little analysis possible on SAN due to poor read rates on large
   amounts (TBs) of data
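
The figures on this slide can be checked with a little arithmetic. The sketch below uses only the numbers quoted above (30 samples per second, 120 PMUs, 4.2 billion samples per day, 1000 PMUs planned). Two points are assumptions rather than statements from the slides: that the gap between the frame count and the quoted sample count comes from each PMU reporting several measured values per frame, and that the per-PMU rate stays constant as the fleet grows.

```python
SAMPLES_PER_SECOND = 30          # from the slide
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
PMUS_NOW = 120                   # from the slide
PMUS_TARGET = 1000               # from the slide
SAMPLES_PER_DAY_NOW = 4.2e9      # from the slide

# One frame per PMU per sampling instant.
frames_per_pmu_per_day = SAMPLES_PER_SECOND * SECONDS_PER_DAY  # 2,592,000
frames_per_day_now = frames_per_pmu_per_day * PMUS_NOW         # 311,040,000

# Assumption: the quoted 4.2 billion counts individual measured values,
# so this ratio is the implied number of values carried per frame.
values_per_frame = SAMPLES_PER_DAY_NOW / frames_per_day_now    # roughly 13.5

# Assumption: the per-PMU rate is unchanged at 1000 PMUs.
samples_per_day_target = SAMPLES_PER_DAY_NOW * PMUS_TARGET / PMUS_NOW  # 3.5e10

print(f"{frames_per_day_now:,} frames/day today")
print(f"{values_per_frame:.1f} values per frame (implied)")
print(f"{samples_per_day_target:,.0f} samples/day at {PMUS_TARGET} PMUs")
```

Under those assumptions the daily intake grows from 4.2 billion to about 35 billion samples, which is the growth that made SAN cost and read rates a problem.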

Major Themes for Storage and Processing Needs

•   Scale Out, not Up
•   Linear scalability in cost and processing power
•   Robust in the face of hardware failure
•   No vendor lock in




Storage Needs: The Data Deluge

 • At 1000 PMU sensors we were looking at needing to store 500TB of data
 • The Data Deluge
     • “Eighteen months ago, Li & Fung, a firm that manages supply chains for retailers,
       saw 100 gigabytes of information flow through its network each day. Now the
       amount has increased tenfold.”
     •   http://www.economist.com/opinion/displaystory.cfm?story_id=15579717

 • Internet of Things
     • HP's Peter Hartwell: "one trillion nanoscale sensors and actuators will need the
       equivalent of 1000 internets: the next huge demand for computing!"
Processing Needs: Needle in a Haystack

• The “Haystack” in PMU data typically involved scanning
  through TBs of data to find the one particular event we
  were interested in
• RDBMSs simply do not work with high resolution
  timeseries data
• Need for Ad-Hoc processing on data to explore network
  effects and look at how events cascade across the grid




The Solution: Hadoop

• A scalable fault-tolerant distributed system for data storage
  and processing (open source under the Apache license)

• Two primary components
   • Hadoop Distributed File System (HDFS): self-healing high-bandwidth
     clustered storage
   • MapReduce: fault-tolerant distributed processing

• Key value proposition
   •   Flexible -> store data without a schema and add it later as needed
   •   Affordable -> cost / TB at a fraction of traditional options
   •   Broadly adopted -> a large and active ecosystem
   •   Proven at scale -> dozens of petabyte-plus implementations in
       production today
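
To make the MapReduce half of this concrete, here is a minimal single-process sketch of the programming model in Python. It is not Hadoop's API: the `map_reduce` helper, the comma-separated record layout (`pmu_id,timestamp,frequency`), and the sample lines are all assumptions made up for illustration. On a real cluster the framework runs the map and reduce functions in parallel across machines and handles the grouping step itself.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_reduce(records: Iterable[str],
               mapper: Callable[[str], Iterator[Tuple[str, float]]],
               reducer: Callable[[str, List[float]], float]) -> Dict[str, float]:
    """Run map, group-by-key (the 'shuffle'), and reduce in one process."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def mapper(line: str) -> Iterator[Tuple[str, float]]:
    """Parse one hypothetical 'pmu_id,timestamp,frequency' line and key it by sensor."""
    pmu_id, _timestamp, frequency = line.split(",")
    yield pmu_id, float(frequency)


def reducer(pmu_id: str, frequencies: List[float]) -> float:
    """Keep the highest frequency seen for one sensor."""
    return max(frequencies)


lines = [
    "pmu_a,0.000,60.00",
    "pmu_b,0.000,59.99",
    "pmu_a,0.033,60.02",
    "pmu_b,0.033,59.98",
]
peak_frequency = map_reduce(lines, mapper, reducer)
```

Because each record is mapped independently and each key is reduced independently, both phases can be split across as many machines as there are input blocks and keys, which is where the fault-tolerant parallelism comes from.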
HDFS As Cheap and Scalable Storage

• HDFS is robust in the face of machine failure
• A big thing was cost – we could linearly grow our cluster
  as needed by just adding new machines
• Ran on commodity hardware – we didn’t have to buy
  expensive (and relatively slow), proprietary SAN setups
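
A rough way to see why adding machines helps is to estimate how long one full read of the data takes when it is spread over many disks that are read in parallel. The sketch below is an idealized back-of-envelope model, not a benchmark: the `scan_hours` function is hypothetical, the default of 100 MB/s per disk is an assumed round figure, and the model ignores replication, network transfer, CPU time, and uneven data placement.

```python
def scan_hours(data_tb: float, disks: int, mb_per_s_per_disk: float = 100.0) -> float:
    """Hours to read data_tb terabytes spread evenly over `disks` drives read in parallel."""
    total_mb = data_tb * 1_000_000                      # 1 TB = 10^6 MB (decimal units)
    seconds = total_mb / (disks * mb_per_s_per_disk)    # aggregate throughput scales with disks
    return seconds / 3600


one_disk = scan_hours(3, disks=1)       # about 8.3 hours
hundred_disks = scan_hours(3, disks=100)  # 5 minutes, expressed in hours

print(f"3 TB on 1 disk:    {one_disk:.1f} h")
print(f"3 TB on 100 disks: {hundred_disks * 60:.0f} min")
```

In this model the scan time falls in direct proportion to the number of disks, which is the linear growth the slide refers to.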




MapReduce Provides a Powerful Parallel Processing
Framework
• We found Map Reduce to be the perfect framework to
  quickly process large amounts of PMU (timeseries) data
• Created a machine learning algorithm in Map Reduce
  which detected “unbounded oscillations” in grid data
• Map Reduce based oscillation scan of a few TBs takes
  minutes
• A scan of comparable data from a SAN would take days
  or weeks
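
The slides do not describe the machine learning algorithm itself, so the sketch below swaps in a much simpler heuristic purely to show how an oscillation scan fits the map and reduce shape. Everything in it is an assumption for illustration: the record layout `(pmu_id, timestamp, value)`, the one-second window of 30 samples, the 1.2 growth factor, and the rule that a signal counts as growing when its peak-to-peak amplitude rises in every successive window. It runs in a single process; on Hadoop each reducer call would run on whichever node receives that sensor's key.

```python
import math
from collections import defaultdict


def mapper(record):
    """Key each sample by sensor so all samples of one PMU reach the same reducer."""
    pmu_id, timestamp, value = record
    yield pmu_id, (timestamp, value)


def reducer(pmu_id, samples, window=30, growth=1.2):
    """Flag a sensor whose peak-to-peak amplitude grows by `growth` in every successive window."""
    ordered = [value for _, value in sorted(samples)]   # restore time order
    amplitudes = [
        max(ordered[i:i + window]) - min(ordered[i:i + window])
        for i in range(0, len(ordered) - window + 1, window)
    ]
    growing = len(amplitudes) >= 3 and all(
        later >= growth * earlier > 0
        for earlier, later in zip(amplitudes, amplitudes[1:])
    )
    return pmu_id, growing


def run(records):
    """Single-process stand-in for the map, shuffle, and reduce phases."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


def signal(pmu_id, envelope):
    """Four seconds of a synthetic 1 Hz oscillation around 60, sampled at 30 Hz."""
    return [(pmu_id, n / 30, 60 + 0.01 * envelope(n / 30) * math.sin(2 * math.pi * n / 30))
            for n in range(120)]


records = signal("steady", lambda t: 1.0) + signal("growing", lambda t: math.exp(0.5 * t))
flags = run(records)
```

Here `flags` marks only the sensor with the exponentially growing envelope. A production classifier would need to cope with noise, missing samples, and varying oscillation frequencies, none of which this heuristic attempts.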



What is common across Hadoop-able problems?

 Nature of the data
 • Complex data
 • Multiple data sources
 • Lots of it

 Nature of the analysis
 • Batch processing
 • Parallel execution
 • Spread data over a cluster of servers
   and take the computation to the data

What Analysis is Possible With Hadoop?


 • Text mining
 • Index building
 • Graph creation and analysis
 • Pattern recognition
 • Collaborative filtering
 • Prediction models
 • Sentiment analysis
 • Risk assessment




Benefits of Analyzing With Hadoop

 • Previously impossible/impractical to do this analysis

 • Analysis conducted at lower cost

 • Analysis conducted in less time

 • Greater flexibility




The Storm of the Data Deluge is Brewing

• Challenges of the openPDC project were just the first
  wave
• Storage requirements are accelerating
• Disk speeds are relatively constant
• Seeing signs of data deluge, GE now using open sourced
  Hadoop-based timeseries classifiers developed in the
  openPDC project




Coming Power Grid Stressors

• Larger fluctuations in power demands
   • Ex: Millions of new electric cars all charging in the evenings
• An aging power grid that requires more capital infusion
  than most companies have allocated for these purposes
   • Grid infrastructure is older than most realize
   • Maintenance policies generally only look at age of equipment




The Power Grid Domain is Slow to Evolve

• Power companies are slow to adopt technology
   • They generally have poor maps of their overall infrastructure
• Coming pressures are going to force power companies
  to have to analyze TBs and PBs of data
• Ad-Hoc analysis will be needed to explore the complex
  relationships in this data




Broader Emerging Smartgrid Themes

• Simply adding lots of sensors is only a very small part of
  the solution
• Collection, storage, and processing are in themselves all
  difficult problems
• In order to build a more effective Smartgrid, platforms
  are needed that handle these things well
• Smartgrid sensor collection is a subset of the larger
  undercurrent of emerging massive sensor based
  systems


Even Broader Theme: Internet of Things

• We’re collecting sensor data everywhere, not just the
  Smartgrid
• Many of the techniques described above can be easily
  done with Hadoop
   • Open Source generalized collector system is called “Flume”
• Examples:
   • Weather sensors
   • Mesh networks – battlefield UAVs
   • Cell Phones – Google Android as a collector


Next Generation Sensor Platform: Hadoop and
Related Projects




The Companies That Provide Real Results for
Sensor Platforms Will Win
• Much of today’s Smartgrid talk is just hype
• Few “solutions” actually fix anything, only put sensors
  on things
• Analysis is where the true value lies
   • But you need a complete platform to be in position to analyze
     the data




Harnessing Hadoop Has Its Challenges
 • Ease of use – command line interface only; data import and
   access require development skills
 • Complexity – more than 12 different components, with different
   versions, dependencies and patch requirements
 • Manageability – Hadoop is challenging to configure, upgrade,
   monitor and administer
 • Interoperability – limited support for popular databases and
   analytical tools

Cloudera’s Distribution for Hadoop, version 3
The industry’s leading Hadoop distribution


[Diagram: CDH3 components: Hue, Hue SDK, Oozie, Pig/Hive, Flume, Sqoop, HBase, Zookeeper]



•   Open source – 100% Apache licensed
•   Simplified – Component versions & dependencies managed for you
•   Integrated – All components & functions interoperate through standard APIs
•   Reliable – Patched with fixes from future releases to improve stability
•   Supported – Employs project founders and committers for >70% of components
Who is Cloudera?

• Enterprise software & services company providing the industry’s
  leading Hadoop-based data management platform
   • Founding team came from large Web companies



• Products: Cloudera Enterprise & Cloudera’s Distribution for Hadoop
   • All necessary packages, matched, tested and supported
   • Tools to support production use of Hadoop
   • The leading distribution for the enterprise


• Contributors and committers
   • Fixing, patching and adding features

Hear More Examples @ Hadoop World 2010
http://www.cloudera.com/company/press-center/hadoop-world-nyc/


 • 2nd annual event focused on practical
   applications of Hadoop

 • Date: October 12th 2010

 • Location: Hilton New York

 • Keynote from Tim O’Reilly – founder of
   O’Reilly Media

 • Pre and post conference training
   available for Hadoop and related projects

 • 36 business and technical focused sessions


Questions?




