IDC’s Perspective On Big Data Outside Of HPC
Big Data: A General Definition
Value +
• Lots of data
• Time critical
• Multiple types (e.g., numbers, text, video)
• Worth something to someone
Defining Big Data: For the Broader IT Market
Top Drivers For Implementing Big Data
Organizational Challenges With Big Data: Government Compared To All Others
Big Data Software
Big Data Software Technology Stack
Big Data Software Shortcomings -- Today
HPDA = BIG DATA MEETS HPC AND ADVANCED SIMULATION
HPDA (High Performance Data Analysis): Data-Intensive Simulation and Analytics
HPDA = tasks involving sufficient data volumes and algorithmic complexity to require HPC resources/approaches
• Established (simulation) or newer (analytics) methods
• Structured data, unstructured data, or both
• Regular (e.g., Hadoop) or irregular (e.g., graph) patterns
• Government, industry, or academia
• Upward extensions of commercial business problems
• Accumulated results of iterative problem-solving methods (e.g., stochastic modeling, parametric modeling)
HPDA Market Drivers
• More input data (ingestion)
  - More powerful scientific instruments/sensor networks
  - More transactions/higher scrutiny (fraud, terrorism)
• More output data for integration/analysis
  - More powerful computers
  - More realism
  - More iterations in available time
• Real-time and near-real-time requirements
  - Catch fraud before it hits credit cards
  - Catch terrorists before they strike
  - Diagnose patients before they leave the office
  - Provide insurance quotes before callers leave the phone
• The need to pose more intelligent questions
  - Smarter mathematical models and algorithms
Data Movement Is Expensive: In Energy and Time-to-Solution
Energy Consumption
• 1 MW ≈ $1 million
• Computing 1 calculation ≈ 1 picojoule
• Moving 1 calculation's result = up to 100 picojoules
• => It can take 100 times more energy to move the results of a calculation than to perform the calculation in the first place.
Strategies
• Accelerate data movement (bandwidth, latency)
• Minimize data movement (e.g., data reduction, in-memory compute, in-storage compute, etc.)
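To make the 100x figure concrete, here is a back-of-the-envelope sketch using only the numbers on this slide; the machine size (10^15 operations per second) and the reading of "$1 million per MW" as an annual power cost are assumptions added for illustration.

```python
# Back-of-the-envelope energy comparison using only the figures above:
# ~1 picojoule to perform a calculation, up to ~100 picojoules to move its
# result, and roughly $1 million per MW (assumed here to be an annual cost).
PJ = 1e-12           # joules per picojoule
OPS_PER_SEC = 1e15   # assumed machine size: 10^15 operations per second

compute_power_w = OPS_PER_SEC * 1 * PJ     # power spent on arithmetic (~1 kW)
movement_power_w = OPS_PER_SEC * 100 * PJ  # power if every result is moved (~100 kW)

print(f"arithmetic alone:       {compute_power_w / 1e3:.0f} kW")
print(f"moving every result:    {movement_power_w / 1e3:.0f} kW")
print(f"movement/compute ratio: {movement_power_w / compute_power_w:.0f}x")

# At ~$1M per MW per year, ~100 kW of data movement is on the order of
# $100K per year for this hypothetical machine, which is why the strategies
# above are to accelerate data movement or avoid it altogether.
```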
Different Systems for Different Jobs
Partitionable Big Data Work
• Most jobs are here!
• Goal: search
• Regular access patterns (locality)
• Global memory not important
• Standard clusters + Hadoop, Cassandra, etc.
versus
Non-Partitionable Work
• Toughest jobs (e.g., graph analysis)
• Goal: discovery
• Irregular access patterns
• Global memory very important
• Systems turbo-charged for data movement + graph processing
HPC architectures today are compute-centric (FLOPS vs. IOPS)
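As a rough illustration of this contrast (hypothetical, not from the deck): a partitionable aggregation splits cleanly across independent workers and merges at the end, while a breadth-first graph traversal touches memory wherever the edges lead, which is why it resists partitioning and benefits from a large global memory.

```python
from collections import deque
from multiprocessing import Pool

# Partitionable work: count events per key. Each chunk is processed
# independently (map) and the partial counts are merged at the end (reduce).
def count_chunk(chunk):
    counts = {}
    for key in chunk:
        counts[key] = counts.get(key, 0) + 1
    return counts

def partitionable_count(keys, workers=4):
    chunks = [keys[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        partials = pool.map(count_chunk, chunks)
    merged = {}
    for partial in partials:
        for key, n in partial.items():
            merged[key] = merged.get(key, 0) + n
    return merged

# Non-partitionable work: breadth-first search over a graph. The next vertex
# touched depends on the edges just discovered, so access is irregular and
# the working set cannot be cleanly split across cluster nodes.
def reachable(graph, start):
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

if __name__ == "__main__":
    print(partitionable_count(["us", "de", "us", "fr", "de", "us"], workers=2))
    print(reachable({"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}, "a"))
```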
IDC HPDA Server Forecast
• Fast growth from a small starting point
• In 2015, conservatively approaching $1B
END-USE EXAMPLES OF BIG DATA TODAY
Some Major Use Cases for HPDA
• Fraud/error detection across massive databases
  - A horizontal use, applicable in many domains
• National security/crime-fighting
  - SIGINT/anomaly detection/anti-hacking
  - Anti-terrorism (including evacuation planning)/anti-crime
• Health care/medical informatics
  - Drug design, personalized medicine
  - Outcomes-based diagnosis & treatment planning
  - Systems biology
• Customer acquisition/retention
• Smart electrical grids
• Design of social network architectures
Use Case: PayPal Fraud Detection / Internet Commerce
Slides and permission provided by PayPal, an eBay company
The Problem
Finding suspicious patterns that we don’t even know exist in related data sets.
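The speaker notes later in this deck describe one such pattern: collusion between multiple parties, for example a single credit card showing activity from four or more users. The sketch below is purely illustrative (account and card identifiers and the threshold are invented); PayPal's actual system ran graph analytics on dedicated HPC hardware, as described in the notes.

```python
from collections import defaultdict

# Toy illustration of one collusion signal mentioned in the talk notes:
# a single credit card showing activity from several distinct accounts.
# This is NOT PayPal's pipeline, just a sketch of the account<->card
# relationships that a real-time graph query would scan.
def suspicious_cards(transactions, min_accounts=4):
    """transactions: iterable of (account_id, card_id) pairs."""
    accounts_per_card = defaultdict(set)
    for account_id, card_id in transactions:
        accounts_per_card[card_id].add(account_id)
    return {card: users
            for card, users in accounts_per_card.items()
            if len(users) >= min_accounts}

txns = [("acct1", "card9"), ("acct2", "card9"), ("acct3", "card9"),
        ("acct4", "card9"), ("acct5", "card7")]
print(suspicious_cards(txns))   # card9 is flagged: four distinct accounts
```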
What Kind of Volume?
PayPal’s Data Volumes And HPDA Requirements
Where PayPal Used HPC
The Results
• $710 million saved in the first year in fraud that they wouldn’t have been able to detect before
GEICO: Real-Time Insurance Quotes
• Problem: Need accurate automated phone quotes in 100 ms. They couldn’t do these calculations nearly fast enough on the fly.
• Solution: Each weekend, use a new HPC cluster to pre-calculate quotes for every American adult and household (60-hour run time)
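The approach amounts to trading a weekend of batch computation for a constant-time lookup during each call. A minimal sketch of that precompute-then-look-up pattern follows; the profile fields and the rating formula are invented for illustration and stand in for the real actuarial model.

```python
import time

# Hypothetical sketch of the precompute-then-lookup pattern: a weekend batch
# job evaluates an expensive rating model for every known profile, and the
# phone system only does a dictionary lookup within its 100 ms budget.
def expensive_rating_model(profile):
    # Stand-in for the real calculation (far too slow to run per phone call).
    age, vehicle_value, prior_claims = profile
    return round(vehicle_value * 0.03 + prior_claims * 250 + max(0, 25 - age) * 40, 2)

def weekend_batch(profiles):
    return {profile: expensive_rating_model(profile) for profile in profiles}

quote_table = weekend_batch({(34, 18000, 0), (22, 9000, 1), (51, 32000, 2)})

start = time.perf_counter()
quote = quote_table[(22, 9000, 1)]            # the only work done during the call
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"quote=${quote}, lookup took {elapsed_ms:.3f} ms (well under 100 ms)")
```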
Global Courier Service: Fraud/Error Detection
Here’s a real-world example of one of the biggest names in global package delivery. Their problem is not so different from PayPal’s. This courier service is doing real-time fraud detection on huge volumes of packages that come into their sorting facility from many locations and leave the facility for many other locations around the world.
• Check 1 billion-plus packages per hour in the central sorting facility
• Benchmark won by an HPC vendor with a turbo-charged interconnect and memory system
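For scale, a quick arithmetic check (not from the deck) shows why this is a real-time problem rather than a batch one:

```python
# One billion packages per hour works out to roughly 280,000 checks per second.
packages_per_hour = 1_000_000_000
per_second = packages_per_hour / 3600
print(f"{per_second:,.0f} packages per second")   # ~277,778
```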
Apollo Group/University of Phoenix: Student Recruitment and Retention
Apollo Group is approaching 300,000 online students. To maintain and grow, they have to target millions of prospective students.
• Must target millions of potential students
• Must track student performance for early identification of potential dropouts; “churn” is very expensive
• Solution: sophisticated, cluster-based Big Data models
Use Case: Schrödinger (Drug Discovery in the Cloud)
Schrödinger uses the cloud for this High Performance Data Analysis problem; that’s not so surprising, since molecular dynamics codes are often highly parallel.
Architecture
Optum + Mayo Initiative to Move Past Procedures-Based Healthcare
You may have seen the recent news that Optum, which is part of UnitedHealth Group, is teaming with the Mayo Clinic to build a large center ($500K) in Cambridge, Massachusetts to lay the research groundwork for outcomes-based medicine.
• Data: 100M UnitedHealth Group claims (20 years) + 5M Mayo Clinic archived patient records; option for genomic data
• Findings will be published
• Goal: outcomes-based care
Summary: HPDA Market Opportunity
• HPDA: simulation + newer high-performance analytics
  - IDC predicts fast growth from a small starting point
• HPC and high-end commercial analytics are converging
  - Algorithmic complexity is the common denominator
  - Technologies will evolve greatly
• Economically important use cases are emerging
• No single HPC solution is best for all problems
  - Clusters with MapReduce/Hadoop will handle most but not all work (e.g., graph analysis)
  - New technologies will be required in many areas
• IDC believes our growth estimates could be conservative
HPDA User Talks: HPC User Forums, UK, Germany, France, China and U.S.
• HPC in Evolutionary Biology, Andrew Meade, University of Reading
• HPC in Pharmaceutical Research: From Virtual Screening to All-Atom Simulations of Biomolecules, Jan Kriegl, Boehringer-Ingelheim
• European Exascale Software Initiative, Jean-Yves Berthou, Electricite de France
• Real-time Rendering in the Automotive Industry, Cornelia Denk, RTT-Munich
• Data Analysis and Visualization for the DoD HPCMP, Paul Adams, ERDC
• Why HPCs Hate Biologists, and What We're Doing About It, Titus Brown, Michigan State University
• Scalable Data Mining and Archiving in the Era of the Square Kilometre Array (the Square Kilometre Array Telescope Project), Chris Mattmann, NASA/JPL
• Big Data and Analytics in HPC: Leveraging HPC and Enterprise Architectures for Large Scale Inline Transactional Analytics in Fraud Detection at PayPal, Arno Kolster, PayPal, an eBay Company
• Big Data and Analytics Vendor Panel: How Vendors See Big Data Impacting the Markets and Their Products/Services, Panel Moderator: Chirag Dekate, IDC
• Data Analysis and Visualization of Very Large Data, David Pugmire, ORNL
• The Impact of HPC and Data-Centric Computing in Cancer Research, Jack Collins, National Cancer Institute
• Urban Analytics: Big Cities and Big Data, Paul Muzio, City University of New York
• Stampede: Intel MIC And Data-Intensive Computing, Jay Boisseau, Texas Advanced Computing Center
• Big Data Approaches at Convey, John Leidel
• Cray Technical Perspective On Data-Intensive Computing, Amar Shan
• Data-intensive Computing Research At PNNL, John Feo, Pacific Northwest National Laboratory
• Trends in High Performance Analytics, David Pope, SAS
• Processing Large Volumes of Experimental Data, Shane Canon, LBNL
• SGI Technical Perspective On Data-Intensive Computing, Eng Lim Goh, SGI
• Big Data and PLFS: A Checkpoint File System For Parallel Applications, John Bent, EMC
• HPC Data-intensive Computing Technologies, Scott Campbell, Platform/IBM
Editor's Notes
  • #3: Here’s a general definition of Big Data using the now-familiar schema of the “four V’s.” This isn’t specific to high performance data analysis; it applies to Big Data across all markets. To qualify as Big Data in this general context, the data set has to be large in volume and critical to analyze within a given timeframe, it has to include multiple types of data, and it has to be worthwhile to someone, preferably with a monetary value.
  • #11: The emerging market for high performance data analysis is narrower than that. As I said a minute ago, it’s the market being formed by the convergence of data-intensive simulation and data-intensive analytical methods, so it’s really a union set. As the slide shows, this evolving market is very inclusive in relation to methods, types of data, and market sectors. The common denominator across these segments is the use of models that incorporate algorithmic complexity. You typically don’t find that kind of algorithmic complexity in online transaction processing or in commercial applications such as supply chain management and customer relationship management. The ultimate criterion for HPDA is that it requires HPC resources.
  • #12: There are important HPDA market drivers on the data ingestion side and the data output side. Data sources have become much more powerful. CERN’s Large Hadron Collider generates 1PB/second when it’s running. The Square Kilometer Array telescope will produce 1EB/day when it becomes operational in 2016. But those are extreme examples. Much more common are sensor networks for power grids and other things, gene sequencers, MRI machines, and so on. Online sales transactions produce a lot of data and a lot of opportunity for fraud. Standards, regulations and lawsuits are on the rise. Boeing stores all its engineering data for the 30-year lifetime of their commercial airplanes, not just as a reference for designing future planes but in case there’s a crash and a lawsuit. On the output side, more powerful HPC systems are kicking out lots more data in response to the growing user requirements you see listed here.
  • #13: Moving data costs time and money. Energy has become very expensive. It can take 100 times more energy to move the results of a calculation than to perform the calculation in the first place. It’s no wonder that oil and gas companies, for example, still rely heavily on courier services for overnight shipping of disk drives; it would take too long and cost too much to send the data over a computer network. If you’re a vendor, you have two main strategies available to you: you can speed up data movement, mainly through better interconnects, or you can minimize data movement by pre-filtering data or bringing the compute to the data, or you can do both.
  • #14: The data in most HPDA jobs assigned to HPC resources will continue to have regular access patterns, whether the data is structured or unstructured. This means it can be partitioned and mapped onto a standard cluster or other distributed-memory machine for running Hadoop or other software. But there’s a rising tide of data work that exhibits irregular access patterns and can’t take advantage of data-locality processing features. Caches are highly inefficient for jobs like this. These jobs benefit from global memory combined with powerful interconnects and other data movement capabilities. Partitionable jobs are very important now and non-partitionable jobs are becoming more important. By the way, SGI systems address both types. One general remark is that as the data analysis side of HPC expands, HPC architectures will need to become less compute-centric and offer more support for data integration and analysis. “Many current approaches to big data have been about ‘search’ – the ability to efficiently find something that you know is there in your data,” said Arvind Parthasarathi, President of YarcData. “uRiKA was purposely built to solve the problem of ‘discovery’ in big data – to discover things, relationships or patterns that you don’t know exist. By giving organizations the ability to do much faster hypothesis validation at scale and in real time, we are enabling the solution of business problems that were previously difficult or impossible – whether it be discovering the ideal patient treatment, investigating fraud, detecting threats, finding new trading algorithms or identifying counter-party risk. Basically, we are systematizing serendipity.”
  • #15: HPC servers are often used for more than one purpose. IDC classifies HPC servers according to the primary purpose they’re used for. So, an HPDA server is one that’s used more than 50% for HPDA work. As this table shows, IDC forecasts that revenue for HPC servers acquired primarily for HPDA use will grow robustly (10.4% CAGR) to approach $1 billion in 2015. Because HPDA revenue starts as such a relatively small chunk of overall HPC server revenue, the HPDA share of the overall HPC server revenue will still be in the single digits in 2015, despite the fast growth rate.
  • #16: Let’s look at some real-world use cases
  • #17: This slide lists some of the most prominent use cases, meaning ones where repeated sales of HPC products have been happening. Fraud detection and life sciences are emerging fastest. BTW, I didn’t include financial services here because we’ve been tracking back-office FSI analytics as part of the HPC market for more than 20 years. But FSI is an important part of the high performance data analysis market, though not an easy one to penetrate for the first time.
  • #18: I want to zero in more on the PayPal example because they gave me permission to use these slides and because in many ways they are representative of a larger group of commercial companies whose business requirements are pushing them up into HPC. The slides are from a talk PayPal gave at IDC’s September 2012 HPC User Forum meeting in Dearborn, Michigan. By the way, if you want a copy of this talk or any of the long list of talks on one of our first slides, just email me at sconway [at] idc.com
  • #19: PayPal is an eBay subsidiary and, among other things, has responsibility for detecting fraud across eBay and Skype. Five years ago, a day’s worth of data was processed in overnight batch runs and fraud wasn’t detected until as much as two weeks later. They realized they needed to detect fraud in real time, and for that they needed graph analysis. They were most interested in detecting collusion between multiple parties, such as when a credit card shows activity from four or more users. They needed to be able to stop that before the credit card got hit. IBM Watson on the Jeopardy game show was amazing, but it was a needle-in-a-haystack problem, meaning that Watson could only find answers that were already in its database. PayPal’s problem was different, because there was no visible needle to be found. Graph analysis let them uncover hidden relationships and behavior patterns.
  • #20: This gives you an idea of PayPal’s data volumes and HPDA requirements. These are going up all the time.
  • #21: Here’s what PayPal is using. For the serious fraud detection and analysis, they’re using SGI servers and storage on an InfiniBand network. For the less-challenging work that doesn’t involve pattern discovery and real-time requirements, they’re running Hadoop on a cluster. By the way, PayPal says HPC has already saved them $710 million in fraud they wouldn’t have been able to detect before.
  • #22: This gives you an idea of PayPal’s data volumes and HPDA requirements. These are going up all the time.
  • #23: For cost and growth reasons, GEICO moved to automated insurance quotes on the phone. They needed to provide quotes instantaneously, in 100 milliseconds or less. They couldn’t do these calculations nearly fast enough on the fly. GEICO’s solution was to install an HPC system and every weekend run updated quotes for every adult and every household in the United States. That takes 60 wall clock hours today. The phones tap into the stored quotes and return the correct one in 100 milliseconds.
  • #24: Here’s a real-world example of one of the biggest names in global package delivery. Their problem is not so different from PayPal’s. This courier service is doing real-time fraud detection on huge volumes of packages that come into their sorting facility from many locations and leave the facility for many other locations around the world. They ran a difficult benchmark. The winner hasn’t been publicly announced yet, but IDC’s back channels tell us the vendor has a 3-letter name that starts with S.
  • #26: Schrödinger is a global life sciences software company with offices in Munich and Mannheim. One of the major things they do is use molecular dynamics to identify promising candidates for new drugs to combat cancer and other diseases – and it seems they’ve been using the cloud for this High Performance Data Analysis problem. That’s not so surprising, since molecular dynamics codes are often highly parallel.
  • #27: Here’s the architecture they used. Note that they were already using HPC in their on-premises data center, but the resources weren’t big enough for this task. That’s why they burst out to Amazon EC2 using a software management layer from Cycle Computing to access more than 50,000 additional cores. Bringing a new drug to market can cost as much as £10 billion and a decade of time, so security is a major concern with commercial drug discovery. Apparently, Schrödinger felt confident about the cloud security measures.
  • #28: You may have seen the recent news that Optum, which is part of UnitedHealth Group, is teaming with the Mayo Clinic to build a huge center in Cambridge, Massachusetts to lay the research groundwork for outcomes-based medicine. They’ll have more than 100 million patient records at their disposal for this enormous data-intensive work. They’ll be using data-intensive methods to look at other aspects of health care, too. A week ago, UnitedHealth Group issued a press release in which they said they believe that improved efficiencies alone could reduce Medicare costs by about 40%, obviating much of the need for the major reforms the political parties have been fighting about.
  • #29: In the U.S., the largest urban gangs are the Crips and the Bloods. They’re rival gangs that are at each other’s throats all the time, fighting for money and power. Both gangs are national in scope, but the national organizations aren’t that strong. The branches of these gangs in each city have a lot of autonomy to do what they want. What you see here, again in blurred form, was something that astounded the police department of Atlanta, Georgia, a city with about 4 million inhabitants. Through real-time monitoring of social networks, they were able to witness, as it happened, the planned merger of these rival gangs in their city. This information allowed the police to adapt their own plans accordingly.
  • #30: In summary, we defined HPDA and told you that IDC is forecasting rapid growth from a small base. HPDA is about the convergence of data-intensive HPC and high-end commercial analytics. One of the most interesting aspects of this, to us, is that the demands of the commercial market are moving this along faster in the commercial sector than in the traditional HPC market. PayPal is a great example of this (story of how PayPal was shy about presenting at the User Forum; both sides should be learning from each other). On the analytics side, some attractive use cases are already out there. In the time allotted to us here, we described some of the more prominent ones, but there are many others. Most of the work will be done on clusters, but some economically important use cases need more capable architectures, especially for graph analytics. Many of the large investment firms are IDC clients, so our growth estimates tend to err on the side of conservatism. There is potential for the HPDA market to grow faster than our current forecast. But we talk with a lot of people and we update the forecasts often, so we don’t get too far off the mark.
  • #31: This is a partial list of the user and vendor talks on this topic that we’ve lined up in the past two years as part of the HPC User Forum. IDC has operated the HPC User Forum since 1999 for a volunteer steering committee made up of senior HPC people from government, industry and academia – organizations like Boeing, GM, Ford, NSF and others. We hold meetings across the world, and the talks listed here include perspectives on High Performance Data Analysis from the Americas, Europe and Asia.I’ll ask Chirag to explain how we define High Performance Data Analysis. I’ll return later to walk you through some real-world use cases. Chirag...