v7.0 – 09/07/2012




Accelerating Decisions Through
Enterprise Hadoop
Evolving Hadoop to support Enterprise Computing




Joey Jablonski
Practice Director, Analytic Services




           ©2012 DataDirect Networks. All Rights Reserved.                                       ddn.com
Agenda for The Data Challenge

►   Overview of DataDirect Networks

►   What is Storage Fusion Processing™,
                      its advantages & applications

►   Overview of Analytics

►   Introduction to Apache Hadoop

►   An overview of the DDN hScaler solution

►   Conclusion


DDN | We Accelerate Information Insight

     DDN provides a competitive advantage by maximizing your
     datacenter investment while mitigating growth challenges
     over your discovery process.
 ►   Established: 1998
 ►   Revenue: $226M (2011) – Profitable, Fast Growth
 ►   Main Office: Sunnyvale, California, USA
 ►   Employees: 600+ Worldwide
 ►   Worldwide Presence: 16 Countries
 ►   Installed Base: 1,000+ End Customers; 50+ Countries
 ►   Go To Market: Global Partners, Resellers, Direct




 World-Renowned & Award-Winning



DDN | 15 Years in HPC
  Investment In Scale & Innovation

[Timeline figure, 1998 to 2012. Milestones, left to right: DDN founded and incorporated; first HPC customer (NASA); SFA project inception; WOS project inception; largest private storage company (IDC); 500+ employees. Product labels: S2A3000, S2A6000, S2A8000, S2A9550, S2A9900, 6620, 10K, 12K. A row labeled AWARDS runs along the bottom.]
Storage Fusion Processing™

[Diagram: DDN’s Storage Fusion Architecture. Components shown: Storage Media, SAS Interface, RAID Controller, Storage Server, two Network Interfaces, Compute Resource, GRIDScaler™, Applications.]

      • Driving Imperatives = Improved OPEX
             Massive bandwidth and low latency to storage media
             Multi-core processors + Big DRAMs
             Virtualization / Hypervisor
DDN | Appliance Portfolio

GRIDScaler™    EXAScaler™

              SFA12K-E                SFA10K-E                SFA10K-M
Bandwidth     40GB/s                  15GB/s                  2GB/s
Flash IOPS    1.4M                    840K                    840K
Scales to     1680 Drives             1200 Drives             120 Drives
              In-Storage Processing   In-Storage Processing   In-Storage Processing

WOS6000
  4U, 60-Drive System
  8 x GbE per Node
  2PB/Rack, 23PB/Cluster
  25B Objects/Rack

Maximize Value: Best-In-Class Performance to Accelerate Applications

Minimize OPEX: >2x More Data Center Efficient Than Competing Systems

Minimize Overhead: Autonomous System Fault Management & Recovery
Storage Fusion Processing™
A Unique DDN Vision

Embedded Data-Intensive Applications
Within Storage Infrastructure

► Reduce complexity, infrastructure, administration, TCO
► Reduce infrastructure & OPEX
► Increase performance for latency-sensitive applications
► Success today with: File-Systems, iRODS, Hadoop, BWA, FASTA/SAM/BAM
► Work with your research teams to:
  • Identify application candidates
  • Port to our VMs/Hypervisor and Benchmark
  • Deploy to your community

Candidate applications: Gap Aligners? Molecular Dynamics? Deep and wide search? Query engine?
Why Is Data Analytics So Hard?

[Figure: two side-by-side diagrams. Technical: Hacking Skills, Math & Statistics knowledge, and Substantive Expertise, with regions labeled Data Science and Traditional Research. Business: Business Acumen, Analytics, Decisioning, Poor Communications, Curiosity.]
Analytics | Looking for Actionable Data

Billions of Data Points to Consider

•   Consumer purchasing trends
•   Product perception
•   Drug Discovery
•   Genomics
•   Surveillance
•   Financial Analysis
How do I leverage Analytics?

[Diagram labels: Insight, Modify Behavior, Improved Results]
Data Gravity
Warps the Application Space

[Diagram labels: Applications, DATA, Services]
Today's Enterprise Picture

[Diagram labels: Empowered Users, Enabled Users, Aware Users, The Cloud]
The tools of the Trade
[Diagram: the Hadoop Ecosystem around Core Apache Hadoop. Data blocks numbered 1 to 6 appear in order in one row and scattered (4, 3, 5 and 2, 6, 1) in two others; Map and Reduce stages are labeled.]
Hadoop & HPC Compared

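[Figure: HPC and Hadoop compared under two headings, Data Locality and Inter-process Communication. The HPC row shows data blocks 1 to 6 in order. The Hadoop row shows blocks scattered (4, 3, 5 and 2, 6, 1) above blocks 1 to 6 in order. In both rows, Job Input is divided into Slice 1 through Slice n.]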
Organizational Scalability
[Chart: Adoption (vertical axis, higher is better) against Capacity (horizontal axis), with a line marked "Goal for Human Costs".]
Hadoop Cluster Lifecycle

[Diagram: a cycle of Deploy, Manage, Monitor, Respond, Upgrade, shown across the Software Platform and the Hardware Platform.]
Infrastructure Chargeback

• Visibility to Trends
• Actionable Reporting
• Limits & Enforcement

[Screenshot: Site Overview]
Analytics Services Portfolio

Architect
• Data Transformation
• Data & Analytics Strategy
• Security Strategy in shared-data Environments
• DR&BC
• Data Curation
• Solution Sizing
• Data Center Preparation
• Process Integration
• ETL planning
• Compliance Planning

Deploy
• hScaler Installation
• hScaler Upgrade
• Environment Integration
• Performance Testing
• Operational Validation
• Factory Build

Manage
• Data Curation
• hScaler Administration
• System Tuning
• Health Checks

Customize
• Data Migration
• DR&BC
• Application Integration
• Data Curation
• Application Development
• Data Cleansing

Support
• Phone/Email
• Phone Home Monitoring
• Patches & Upgrades
• Remote Diagnostics
Apache Hadoop
Genomics Application Examples

 ►    Apache Hadoop™ MapReduce™ computing efficiency:
      • Algorithm performance should scale with CPU count
      • The algorithm should be embarrassingly parallel
      • There should be no dependence on how the data is distributed
      • The data should be static

 ►    Example genomics applications that work well within Hadoop:
      • Crossbow. Whole genome re-sequencing & SNP genotyping (short reads)
      • Contrail. De novo assembly from short sequencing reads.
      • Myrna. Fast short-read & differential gene expression aligner (RNA-seq)
      • PeakRanger. Cloud-enabled peak caller for ChIP-seq data.
      • Quake. Quality-aware detection and sequencing error correction tool.
      • BlastReduce. High-performance short read mapping.
      • CloudBLAST. Hadoop implementation of NCBI’s Blast.
      • MrsRF. Algorithm for analyzing large evolutionary trees.
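
As a rough illustration of the efficiency criteria on this slide, the sketch below counts bases across a set of short reads in a map/reduce style. It is plain Python, not Hadoop code: the function names (`map_read`, `merge_counts`, `base_counts`) and the sample reads are assumptions made for the example. Each read is processed independently (embarrassingly parallel), the input is static, and the merged result does not depend on how the reads are ordered or partitioned.

```python
from collections import Counter
from functools import reduce
from typing import Iterable


def map_read(read: str) -> Counter:
    """Map step: count the bases in one read, independently of all other reads."""
    return Counter(read.upper())


def merge_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial counts; the merge order does not matter."""
    return left + right


def base_counts(reads: Iterable[str]) -> Counter:
    """Apply the map step to every read, then fold the partial results together."""
    return reduce(merge_counts, map(map_read, reads), Counter())


if __name__ == "__main__":
    # Hypothetical sample reads, for illustration only.
    print(base_counts(["ACGT", "GGCA", "TTAC"]))
```

Because `map_read` touches only its own read, the map calls could in principle run on as many CPUs as there are reads, which is the property the first two bullets ask for.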
CloudBLAST Application Example

     CloudBLAST is a MapReduce version of the commonly used
     bioinformatics application NCBI BLAST.

     [Diagram: a StreamInputFormat stage feeds sequence sets S = {s1, s2, … sk} to CPU-0 through CPU-N; a Data Merger is shown below the CPUs.]

     1. Stream Input Formatted data is split
        into “960 long chunks” based on newlines.
     2. The data “chunks” are split into sequences, which serve as
        keys for the MapReduce job.
     3. BLAST output is written to a local file.

Based on work by Andréa Matsunaga, Maurício Tsugawa and José Fortes - University of Florida
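
The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CloudBLAST implementation: it assumes one sequence per input line, uses a hypothetical `toy_search` function in place of a real BLAST invocation, keeps results in memory instead of writing a local file, and all function names and the `chunk_size` parameter are invented for the example.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def split_into_chunks(lines: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Step 1: group newline-delimited records into chunks of at most chunk_size."""
    chunk: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        chunk.append(line)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final, possibly shorter, chunk


def map_chunk(chunk: List[str], search: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Steps 2 and 3: treat each sequence as a key and pair it with its search output."""
    return [(sequence, search(sequence)) for sequence in chunk]


def merge(partials: Iterable[List[Tuple[str, str]]]) -> List[Tuple[str, str]]:
    """Data merger: concatenate the per-chunk outputs into one result list."""
    merged: List[Tuple[str, str]] = []
    for part in partials:
        merged.extend(part)
    return merged


def toy_search(sequence: str) -> str:
    """Hypothetical stand-in for a BLAST run: it only reports the sequence length."""
    return f"len={len(sequence)}"


if __name__ == "__main__":
    text = "ACGT\nGGCATT\n\nTTAC\n"  # hypothetical input, one sequence per line
    chunks = list(split_into_chunks(text.splitlines(), chunk_size=2))
    print(merge(map_chunk(chunk, toy_search) for chunk in chunks))
```

Each call to `map_chunk` depends only on its own chunk, so the chunks could be handed to separate CPUs, as the diagram suggests, and merged afterwards.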

How DDN can
    Accelerate Your Analytics
►   Lower Total Cost of Ownership and Improved OPEX:
    • Scale – Dynamically add capacity to match your complex workloads
    • Value – Grow storage capacity economically: Access, Solve, Archive
    • High Availability - Always running with world-class 24/7 service & support

►   Drive Innovation:
    • Performance at Scale – A homogeneous platform that performs at scale
    • Eloquent - Leverage virtualization to deliver an analytics platform that
      provides the quickest answers to your most complex questions
    • Collaboration – Centralize & share discoveries across the globe, securely

►   Deliver Experience:
    • Fifteen Years of HPC – Government Labs, DoE, and Universities trust DDN
    • The HPC community relies on DDN – 60% of the Top 500 supercomputers & growing
    • Single vendor solution - OEMs provide DDN with their datacenter solutions.



Thank you – Questions?



DataDirect Networks, Information in Motion, Silicon Storage Appliance, S2A, Storage Fusion Architecture, SFA, Storage Fusion Fabric, Web Object Scaler, WOS, EXAScaler, GRIDScaler,
       xSTREAMScaler, NAS Scaler, ReAct, ObjectAssure, In-Storage Processing and SATAssure are all trademarks of DataDirect Networks. Any unauthorized use is prohibited.


DDN Accelerating-Decisions-Through-Enterprise-Hadoop-final

  • 1. v7.0 – 09/07/2012 Accelerating Decisions Through Enterprise Hadoop Evolving Hadoop to support Enterprise Computing v7.0 – 09/07/2012 Joey Jablonski Practice Director, Analytic Services ©2012 DataDirect Networks. All Rights Reserved. ddn.com
  • 2. Agenda for The Data Challenge ► Overview of DataDirect Network ► What is Storage Fusion Processing™, it’s advantages & applications ► Overview of Analytics ► Introduction to Apache Hadoop ► An overview of DDN hScaler solution ► Conclusion ©2012 DataDirect Networks. All Rights Reserved. ddn.com
  • 3. DDN | We Accelerate Information Insight DDN provides a competitive advantage by maximizing your datacenter investment while mitigating growth challenges over your discovery process. ► Established: 1998 ► Revenue: $226M (2011) – Profitable, Fast Growth ► Main Office: Sunnyvale, California, USA ► Employees: 600+ Worldwide ► Worldwide Presence: 16 Countries ► Installed Base: 1,000+ End Customers; 50+ Countries ► Go To Market: Global Partners, Resellers, Direct World-Renowned & Award-Winning ©2012 DataDirect Networks. All Rights Reserved. ddn.com
  • 4. DDN | 15 Years in HPC Investment In Scale & Innovation First HPC DDN Customer Incorporated DDN 1st Customer SFA Project WOS Project Largest private 500+ FOUNDED NASA Inception Inception storage co. (IDC) EMPLOYEES 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 S2A8000 S2A9900 S2A6000 S2A9550 S2A3000 AWARDS 6620 10K 12K ©2012 DataDirect Networks. All Rights Reserved. ddn.com
  • 5. Agenda for The Data Challenge ► Overview of DataDirect Network ► What is Storage Fusion Processing™, it’s advantages & applications ► Overview of Analytics ► Introduction to Apache Hadoop ► An overview of DDN hScaler solution ► Conclusion ©2012 DataDirect Networks. All Rights Reserved. ddn.com
  • 6. Storage Fusion Processing™ Applications DDN’s Storage Fusion GRIDScaler™ Architecture Network Interface Network Interface SAS Storage Server Interface Compute Storage RAID Resource Media Controller • Driving Imperatives = Improved OPEX  Massive bandwidth and low latency to storage media  Multi-core processors + Big DRAMs  Virtualization / Hypervisor ©2012 DataDirect Networks. All Rights Reserved. ddn.com
  • 7. DDN | Appliance Portfolio GRIDScaler™ EXAScaler™ SFA12K-E SFA10K-E SFA10K-M WOS6000 Bandwidth: 40GB/s Bandwidth: 15GB/s Bandwidth: 2GB/s 4U, 60-Drive System Flash IOPS: 1.4M Flash IOPS: 840K Flash IOPS: 840K 8 x GbE per Node Scales to 1680 Drives Scales to 1200 dives Scales to 120 dives 2PB/Rack, 23PB/Cluster In-Storage Processing In-Storage Processing In-Storage Processing 25B Objects/Rack Maximize Value: Best-In-Class Performance to Accelerate Applications Minimize OPEX: >2x More Data Center Efficient Than Competing Systems Minimize Overhead: Autonomous System Fault Management & Recovery ©2012 DataDirect Networks. All Rights Reserved. ddn.com
  • 8. Storage Fusion Processing™ A Unique DDN Vision Embedded Data-Intensive Applications Within Storage Infrastructure ►Reduce complexity, infrastructure, administration, TCO ►Reduce infrastructure & OPEX ►Increase performance for latency sensitive applications ►Success today with: File-Systems, iRODS, Hadoop, BWA, FASTA/SAM/BAM ►Work with your research teams to: • Identify application candidates Gap Aligners? • Port to our VMs/Hypervisor and Benchmark Molecular Dynamics? • Deploy to your community Deep and wide search? Query engine? ©2012 DataDirect Networks. All Rights Reserved. ddn.com
  • 9. Agenda for The Data Challenge ► Overview of DataDirect Network ► What is Storage Fusion Processing™, it’s advantages & applications ► Overview of Analytics ► Introduction to Apache Hadoop ► An overview of DDN hScaler solution ► Conclusion ©2012 DataDirect Networks. All Rights Reserved. ddn.com
  • 10. Why Data Analytics is so Hard? Technical Business Hacking Skills Business Acumen Data Science Analytics Math & Decisioning Traditional Research Substantive Statistics Poor Communications Curiosity Expertise knowledge ©2012 DataDirect Networks. All Rights Reserved. ddn.com
  • 11. Analytics | Looking for Actionable Data Billions of Data Points to Consider • Consumer purchasing trends • Product perception • Drug Discovery • Genomics • Surveillance • Financial Analysis ©2012 DataDirect Networks. All Rights Reserved. ddn.com
  • 12. How do I leverage Analytics? Improved Results Modify Insight Behavior ©2012 DataDirect Networks. All Rights Reserved. ddn.com
  • 13. Data Gravity Warps the Application Space Applications DATA Services ©2012 DataDirect Networks. All Rights Reserved. ddn.com
  • 14. Todays Enterprise Picture Empowered Enabled Aware Users Users Users The Cloud ©2012 DataDirect Networks. All Rights Reserved. ddn.com
  • 15. Agenda for The Data Challenge ► Overview of DataDirect Network ► What is Storage Fusion Processing™, it’s advantages & applications ► Overview of Analytics ► Introduction to Apache Hadoop ► An overview of DDN hScaler solution ► Conclusion ©2012 DataDirect Networks. All Rights Reserved. ddn.com
  • 16. The tools of the Trade Ecosystem Hadoop 4 3 5 Core Apache Hadoop 2 6 1 Map Reduce 1 2 3 4 5 6 ©2012 DataDirect Networks. All Rights Reserved. ddn.com
  • 17. Hadoop & HPC Compared Data Locality Inter-process Communication Job Input HPC 1 2 3 4 5 6 Slic Slic e1 en 4 3 5 Job Input 2 6 1 Hadoop Slic Slic e1 en 1 2 3 4 5 6 ©2012 DataDirect Networks. All Rights Reserved. ddn.com
  • 18. Organizational Scalability
    [Chart labels: Adoption, Capacity, Goal for Human Costs; Higher is Better]
  • 20. Hadoop Cluster Lifecycle
    [Diagram: lifecycle stages Deploy, Upgrade, Manage, Respond, Monitor; layers Software Platform and Hardware Platform]
  • 21. Infrastructure Chargeback
    • Visibility to Trends
    • Actionable Reporting
    • Limits & Enforcement
    [Screenshot: Site Overview]
  • 22. Analytics Services Portfolio Architect Deploy Manage Customize • Data Transformation • hScaler Installation • Data Curation • Data Migration • Data & Analytics • hScaler Upgrade • hScaler Administration • DR&BC Strategy • Environment Integration • System Tuning • Application Integration • Security Strategy in • Performance Testing • Health Checks • Data Curation shared-data • Operational Validation • Application Development Environments • Factory Build • Data Cleansing • DR&BC • Data Curation • Solution Sizing • Data Center Preparation Support • Process Integration • Phone/Email • ETL planning • Phone Home Monitoring • Compliance Planning • Patches & Upgrades • Remote Diagnostics
  • 23. Apache Hadoop Genomics Application Examples
    ► Apache Hadoop™ MapReduce™ computing efficiency:
      • Algorithm performance should scale with CPU count
      • The algorithm should be embarrassingly parallel
      • There should be no dependence on how the data is distributed
      • The data should be static
    ► Example genomics applications that work well within Hadoop:
      • Crossbow. Whole genome re-sequencing & SNP genotyping (short reads).
      • Contrail. De novo assembly from short sequencing reads.
      • Myrna. Fast short-read & differential gene expression aligner (RNA-seq).
      • PeakRanger. Cloud-enabled peak caller for ChIP-seq data.
      • Quake. Quality-aware detection and sequencing error correction tool.
      • BlastReduce. High-performance short read mapping.
      • CloudBLAST. Hadoop implementation of NCBI’s BLAST.
      • MrsRF. Algorithm for analyzing large evolutionary trees.
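The efficiency criteria on this slide can be illustrated with a minimal single-process sketch. It is not Hadoop code: the task (counting nucleotide bases in short reads) and the names `map_read`, `reduce_counts` and `run_job` are assumptions chosen for illustration. It shows that when each record is processed independently and the merge step is order-insensitive, the result does not depend on how the static input is distributed.

```python
from collections import Counter
from functools import reduce


def map_read(read):
    """Map step: process one read independently of every other read."""
    return Counter(read)


def reduce_counts(left, right):
    """Reduce step: merging counters is associative and commutative,
    so the order in which partial results arrive does not matter."""
    return left + right


def run_job(partitions):
    """Map every read in every partition, then reduce the partial results.
    On a real cluster each partition would be handled by a separate task;
    here everything runs in one process (illustrative assumption)."""
    partials = [map_read(read) for part in partitions for read in part]
    return reduce(reduce_counts, partials, Counter())


reads = ["ACGT", "GGCA", "TTAC", "CGCG"]

# Two different ways of distributing the same static data across workers.
layout_a = [reads[:2], reads[2:]]
layout_b = [reads[:1], reads[1:]]

result_a = run_job(layout_a)
result_b = run_job(layout_b)
```

Because no read depends on another, adding workers only changes how the partitions are cut, which is the property the first three criteria describe.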
  • 24. CloudBLAST Application Example
    CloudBLAST is a MapReduce version of the commonly used bioinformatics application NCBI BLAST.
    [Diagram: StreamInputFormat splits the input into sequence sets S = {s1, s2, … sk}, which are processed on CPU 0 through CPU N and combined by a Data Merger]
    1. Stream-input-formatted data is split into “960 long chunks” based on newlines.
    2. Data “chunks” are split into sequences that serve as keys for the MapReduce job.
    3. BLAST output is written to a local file.
    Based on work by Andréa Matsunaga, Maurício Tsugawa and José Fortes, University of Florida.
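The three steps above can be sketched in plain Python. This is a hedged illustration of the split, map and merge pattern only, not the CloudBLAST implementation: it does not call NCBI BLAST, `placeholder_search` is a hypothetical stand-in for the search step, records are keyed by FASTA header as an assumption, and the chunk size of two records is arbitrary rather than the “960” quoted on the slide.

```python
def parse_fasta(text):
    """Split newline-delimited FASTA text into (header, sequence) records."""
    records, header, seq = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header is not None:
        records.append((header, "".join(seq)))
    return records


def make_chunks(records, records_per_chunk):
    """Group whole records into chunks so no sequence is cut in half."""
    return [records[i:i + records_per_chunk]
            for i in range(0, len(records), records_per_chunk)]


def placeholder_search(sequence):
    """Hypothetical stand-in for a BLAST search: reports length and GC count."""
    gc = sum(1 for base in sequence if base in "GC")
    return f"len={len(sequence)} gc={gc}"


def map_chunk(chunk):
    """Map step: one worker processes one chunk and returns its local output."""
    return [f"{header}\t{placeholder_search(seq)}" for header, seq in chunk]


def merge(outputs):
    """Data merger: concatenate the per-worker outputs in chunk order."""
    return [line for out in outputs for line in out]


fasta = """>s1
ACGT
GG
>s2
TTTT
>s3
GCGC
"""

records = parse_fasta(fasta)
chunks = make_chunks(records, records_per_chunk=2)
merged = merge(map_chunk(chunk) for chunk in chunks)
```

Splitting on record boundaries rather than at fixed byte offsets keeps every sequence intact within a single chunk, which is why the input format splits on newlines.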
  • 26. How DDN Can Accelerate Your Analytics
    ► Lower Total Cost of Ownership and Improved OPEX:
      • Scale: Dynamically add capacity to match your complex workloads
      • Value: Grow storage capacity economically: Access, Solve, Archive
      • High Availability: Always running with world-class 24/7 service & support
    ► Drive Innovation:
      • Performance at Scale: A homogeneous platform that performs at scale
      • Eloquent: Leverage virtualization to deliver an analytics platform that provides the quickest answers to your most complex questions
      • Collaboration: Centralize & share discoveries across the globe, securely
    ► Deliver Experience:
      • Fifteen Years of HPC: Government Labs, DoE, and Universities trust DDN
      • The HPC community relies on DDN: 60% of the Top 500 supercomputers & growing
      • Single vendor solution: OEMs provide DDN with their datacenter solutions
  • 27. Thank you – Questions? DataDirect Networks, Information in Motion, Silicon Storage Appliance, S2A, Storage Fusion Architecture, SFA, Storage Fusion Fabric, Web Object Scaler, WOS, EXAScaler, GRIDScaler, xSTREAMScaler, NAS Scaler, ReAct, ObjectAssure, In-Storage Processing and SATAssure are all trademarks of DataDirect Networks. Any unauthorized use is prohibited.