Using Distributed, In-Memory Computing for Fast Data Analysis
WSTA Seminar
September 14, 2011

Bill Bain (wbain@scaleoutsoftware.com)

Copyright © 2011 by ScaleOut Software, Inc.

Agenda
• The Need for Memory-Based, Distributed Storage
• What Is a Distributed Data Grid (DDG)?
• Performance Advantages and Architecture
• Migrating Data to the Cloud and Across Global Sites
• Parallel Data Analysis
• Comparison of DDG to File-Based Map/Reduce

The Need for Memory-Based Storage
Example: Web server farm
• A load balancer directs incoming client requests to Web servers.
• Web and app. server farms build Web pages and run business logic.
• The database server holds all mission-critical, line-of-business (LOB) data.
• Server farms share fast-changing data using a DDG to avoid bottlenecks and maximize scalability.
[Figure: Internet, load balancer, Web servers with a distributed, in-memory data grid, app. servers with a distributed, in-memory data grid, and database servers with a RAID disk array (labeled "Bottleneck"), connected by Ethernet.]

The Need for Memory-Based Storage
Example: Cloud application
• The application runs as multiple virtual servers (VS).
• Application instances store and retrieve LOB data from a cloud-based file system or database.
• Applications need fast, scalable storage for fast-changing data.
• A distributed data grid runs as multiple virtual servers to provide “elastic,” in-memory storage.
[Figure: a cloud application made up of app virtual servers (App VS), a distributed data grid made up of grid virtual servers (Grid VS), and cloud-based storage.]

What is a Distributed Data Grid?
• A new “vertical” storage tier:
    – Adds a missing layer to boost performance.
    – Uses in-memory, out-of-process storage.
    – Avoids repeated trips to backing storage.
• A new “horizontal” storage tier:
    – Allows data sharing among servers.
    – Scales performance & capacity.
    – Adds high availability.
    – Can be used independently of backing storage.
[Figure: storage tiers shown for two servers: processor cache, L2 cache, application memory (“in-process”), distributed cache (“out-of-process”), and backing storage.]

Distributed Data Grids: A Closer Look
• Incorporates a client-side, in-process cache (“near cache”):
    – Is transparent to the application.
    – Holds recently accessed data.
• Boosts performance:
    – Eliminates repeated network data transfers & deserialization.
    – Reduces access times to near “in-process” latency.
    – Is automatically updated if the distributed grid changes.
    – Supports various coherency models (coherent, polled, event-driven).
[Figure: application memory (“in-process”), client-side cache (“in-process”), and distributed cache (“out-of-process”).]

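The near-cache behavior described above can be sketched as follows. This is a minimal illustration, not ScaleOut's implementation: the `GridStub` and `NearCache` classes, the per-object version counter, and the use of `pickle` for serialization are all assumptions made for the example. It shows a simple version-check coherency model in which a read asks the grid only for the object's version and transfers and deserializes the full object only when the cached copy is missing or stale.

```python
import pickle


class GridStub:
    """Hypothetical stand-in for the out-of-process grid: serialized bytes plus a version."""

    def __init__(self):
        self._store = {}          # key -> (version, serialized bytes)
        self.full_reads = 0       # counts full object transfers, for illustration only

    def put(self, key, obj):
        version = self._store.get(key, (0, b""))[0] + 1
        self._store[key] = (version, pickle.dumps(obj))

    def version(self, key):
        return self._store[key][0]

    def get(self, key):
        self.full_reads += 1
        return self._store[key]


class NearCache:
    """Client-side, in-process cache that keeps deserialized objects."""

    def __init__(self, grid):
        self._grid = grid
        self._local = {}          # key -> (version, deserialized object)

    def read(self, key):
        current = self._grid.version(key)      # cheap version check
        cached = self._local.get(key)
        if cached is not None and cached[0] == current:
            return cached[1]                   # hit: no transfer, no deserialization
        version, blob = self._grid.get(key)    # miss or stale: fetch and deserialize once
        obj = pickle.loads(blob)
        self._local[key] = (version, obj)
        return obj


grid = GridStub()
grid.put("ABC", [101.5, 102.0])
cache = NearCache(grid)
first = cache.read("ABC")
second = cache.read("ABC")        # served from the near cache
grid.put("ABC", [101.5, 102.0, 99.75])
third = cache.read("ABC")         # version changed, so the object is refetched
```

In this sketch the second read returns the already-deserialized object, which is the saving the slide attributes to the client-side cache.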
Performance Benefit of Client-side Cache
• Eliminates repeated network data transfers.
• Eliminates repeated object deserialization.
[Figure: bar chart "Average Response Time" for 10KB objects with a 20:1 read/update ratio; y-axis in microseconds (0 to 3500); bars for DDG and DBMS.]

Top 5 Benefits of Distributed Data Grids
1. Faster access time for business logic state or database data
2. Scalable throughput to match a growing workload and keep response times low
3. High availability to prevent data loss if a grid server (or network link) fails
4. Shared access to data across the server farm
5. Advanced capabilities for quickly and easily mining data using scalable “map/reduce” analysis
[Figure: chart "Access Latency vs. Throughput"; x-axis: throughput (accesses/sec); y-axis: access latency (msec); curves for Grid and DBMS.]

Scaling the Distributed Data Grid
• A distributed data grid must deliver scalable throughput.
• To do so, its architecture must eliminate bottlenecks to scaling:
     – Avoid centralized scheduling to eliminate hot spots.
     – Use data partitioning and maintain load balance to allow scaling.
     – Use fixed rather than full replication to avoid n-fold overhead.
     – Use low-overhead heart-beating.
• Example of linear throughput scaling:
[Figure: chart "Read/Write Throughput" for 10KB objects; y-axis: accesses/second (0 to 80,000); x-axis: nodes (4 to 64), with the object count growing from 16,000 to 256,000.]

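The partitioning and fixed-replication points can be illustrated with a small sketch. It rests on assumptions and is not any product's placement algorithm: the node names, the `REPLICAS` count of one, and the hash-modulo placement rule are all hypothetical. It shows that with a fixed number of replicas the copies written per update stay constant as nodes are added, whereas full replication writes one copy per node.

```python
import hashlib

REPLICAS = 1  # assumed fixed replica count per object (hypothetical value)


def owners(key, nodes, replicas=REPLICAS):
    """Return the primary node followed by its replica nodes (hash-modulo placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    count = min(replicas + 1, len(nodes))
    return [nodes[(start + i) % len(nodes)] for i in range(count)]


def copies_per_update(node_count, full_replication, replicas=REPLICAS):
    """Number of object copies written for one update."""
    return node_count if full_replication else min(replicas + 1, node_count)


nodes = [f"node{i}" for i in range(4)]
placement = owners("ticker:ABC", nodes)

# Columns: node count, copies with fixed replication, copies with full replication.
for n in (4, 16, 64):
    print(n, copies_per_update(n, full_replication=False),
          copies_per_update(n, full_replication=True))
```

Hash-modulo placement is used here only for brevity. It relocates many keys whenever the node count changes, so it says nothing about how a grid rebalances.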
Typical Commercial Distributed Data Grids
• Partition objects to scale throughput and avoid hot spots.
• Synchronize access to objects across all servers.
• Dynamically rebalance objects to avoid hot spots.
• Replicate each cached object for high availability.
• Detect server or network failures and self-heal.
[Figure: a client application retrieves an object through the client library, which holds a cached copy; the distributed cache consists of four cache services connected by Ethernet, with labels "Object", "Copy", and "Replica".]

Wide Range of Applications
Financial Services            E-commerce
• Portfolio risk analysis     • Session-state storage
• VaR calculations            • Application state storage
• Monte Carlo simulations     • Online banking
• Algorithmic trading         • Loan applications
• Market message caching      • Wealth management
• Derivatives trading         • Online learning
• Pricing calculations        • Hotel reservations
                              • News story caching
Other Applications
• Edge servers: chat, email   • Shopping carts
• Online gaming servers       • Social networking
• Scientific computations     • Service call tracking
• Command and control         • Online surveys

Importance for Cloud Computing
• Cloud computing:
     – Makes elastic resources readily available, but…
     – Clouds have relatively slow interconnects.
• Distributed data grids add significant value in the cloud:
     – Allow data sharing across a group of virtual servers.
     – Elastically scale throughput as needed.
     – Provide low-latency, object-oriented storage.
• Clouds provide the elastic platform for parallel data analysis.
• DDGs provide the efficiency and scalability needed to overcome the cloud’s limited interconnect speed.

DDGs Simplify Data Migration to the Cloud
• Distributed data grids can automatically bridge on-premise and cloud-based data grids to unify access.
• This enables seamless access to data across multiple sites.
[Figure: a cloud application (App VS) with a cloud-hosted distributed data grid (SOSS VS) in a cloud of virtual servers; the user's on-premise applications (Server App) with an on-premise distributed data grid (SOSS Host) and a backing store; the link between the two grids is labeled "Automatically Migrate Data".]

DDGs Enable Seamless Global Access
[Figure: mirrored data centers and satellite data centers, each running a distributed data grid of SOSS servers (SOSS SVR), joined into a global distributed data grid.]

Introducing Parallel Data Analysis
• The goal:
     – Quickly analyze a large set of data for patterns and trends.
     – How? Run a method E (“eval”) across a set of objects D in parallel.
     – Optionally merge the results using method M (“merge”).
• Evolution of parallel analysis:
     – '80s: “SIMD/SPMD” (Flynn, Hillis)
     – '90s: “Domain decomposition” (Intel, IBM)
     – '00s: “Map/reduce” (Google, Hadoop, Dryad)
• Applications:
     – Search, financial services, business intelligence, simulation
[Figure: methods E and M applied to a grid of data objects D, producing a single result.]

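As a rough illustration of the E/M pattern above, the following sketch runs an eval function over a set of objects in parallel and then folds the results with a merge function. The names `parallel_analyze`, `eval_fn`, and `merge_fn` are hypothetical, and a local thread pool stands in for the grid's servers; a real grid would run E on the servers that hold the data.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def parallel_analyze(objects, eval_fn, merge_fn=None, workers=4):
    """Run eval_fn (E) on every object in parallel; optionally fold results with merge_fn (M)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(eval_fn, objects))   # map preserves input order
    if merge_fn is None or not results:
        return results
    return reduce(merge_fn, results)


# Example: find the longest word (E measures one object, M keeps the larger of two results).
words = ["grid", "throughput", "cache", "replica"]
longest = parallel_analyze(
    words,
    eval_fn=lambda w: (len(w), w),
    merge_fn=lambda a, b: a if a >= b else b,
)
print(longest)
```

Leaving `merge_fn` unset returns the per-object results as a list, which corresponds to the "optionally merge" case.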
Example in Financial Services
Analyze trading strategies across stock histories:
Why?
• Back-testing systems help guard against risks in deploying new
  trading strategies.
• Performance is critical for “first to market” advantage.
• Uses a significant amount of market data and computation time.
How?
• Write method E to analyze trading strategies across a single
  stock history.
• Write method M to merge two sets of results.
• Populate the data store with a set of stock histories.
• Run method E in parallel on all stock histories.
• Merge the results with method M to produce a report.
• Refine and repeat…
Stage the Data for Analysis
• Step 1: Populate the distributed data grid with objects, each of which represents a price history for a ticker symbol.

Code the Eval and Merge Methods
• Step 2: Write a method to evaluate a stock history based on parameters:
       Results EvalStockHistory(StockHistory history, Parameters params)
       {
           <analyze trading strategy for this stock history>
           return results;
       }

• Step 3: Write a method to merge the results of two evaluations:
       Results MergeResults(Results results1, Results results2)
       {
           <merge both results>
           return results;
       }

• Notes:
      – This code can be run as a sequential calculation on in-memory data.
      – No explicit accesses to the distributed data grid are used.

Run the Analysis
• Step 4: Invoke parallel evaluation and merging of results:
       Results Invoke(EvalStockHistory, MergeResults, querySpec, params);
[Figure: EvalStockHistory() followed by MergeResults().]

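Steps 1 through 4 can be tied together in a small, sequential sketch. Everything concrete here is assumed for illustration: the dictionary standing in for the grid, the made-up tickers and prices, the `threshold` parameter, the toy "best one-day gain" analysis, and the `invoke` helper, which only mimics the shape of the `Invoke` call above and is not the product's API.

```python
from functools import reduce

# Step 1: stage the data (a dict stands in for the grid; tickers and prices are made up).
grid = {
    "AAA": [10.0, 10.5, 10.25, 11.0],
    "BBB": [20.0, 19.0, 19.5, 19.25],
    "CCC": [5.0, 5.25, 5.5, 5.25],
}


# Step 2: evaluate one stock history (toy analysis: best one-day gain, kept if above a threshold).
def eval_stock_history(item, params):
    ticker, prices = item
    gains = [later - earlier for earlier, later in zip(prices, prices[1:])]
    best = max(gains)
    return {ticker: best} if best >= params["threshold"] else {}


# Step 3: merge two result sets.
def merge_results(results1, results2):
    merged = dict(results1)
    merged.update(results2)
    return merged


# Step 4: select objects with a query, evaluate each one, and merge the results.
def invoke(eval_fn, merge_fn, query_spec, params):
    selected = [(k, v) for k, v in grid.items() if query_spec(k)]
    return reduce(merge_fn, (eval_fn(item, params) for item in selected), {})


report = invoke(eval_stock_history, merge_results,
                query_spec=lambda ticker: ticker != "BBB",
                params={"threshold": 0.5})
```

As the notes on the previous slide suggest, the eval and merge functions contain no grid access of their own, so they can be tested sequentially like this before being run in parallel.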
[Figure: the client starts a parallel analysis; .eval() runs on each stock history to produce results; the results are combined by .merge() in successive stages until a single result is returned to the client.]

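The staged merging in the diagram can be sketched as a pairwise reduction. This is an illustrative assumption about how such a merge tree might be arranged, not a description of the product's internals, and `tree_merge` is a hypothetical helper. Each round merges neighboring results, so n results need roughly log2(n) rounds, and the merges within one round are independent of each other.

```python
def tree_merge(results, merge_fn):
    """Combine results pairwise in rounds; returns (final_result, rounds_used)."""
    if not results:
        raise ValueError("no results to merge")
    level = list(results)
    rounds = 0
    while len(level) > 1:
        # Merge neighbors: (0,1), (2,3), ...
        nxt = [merge_fn(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])     # odd one out advances unmerged
        level = nxt
        rounds += 1
    return level[0], rounds


total, rounds = tree_merge([3, 1, 4, 1, 5, 9], lambda a, b: a + b)
print(total, rounds)
```

This arrangement relies on the merge method being associative, so that the grouping of results does not change the final answer.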
DDG Minimizes Data Motion
• File-based map/reduce must move data to memory for analysis:
[Figure: three M/R servers each run E in server memory on data D loaded from the file system / database.]
• Memory-based DDG analyzes data in place:
[Figure: three grid servers each run E directly on data D held in the distributed data grid.]

[Figure: parallel analysis with file-based map/reduce; the .eval() stage on each stock history and the .merge() stages are each marked "File I/O" before the results are returned to the client.]

Performance Impact of Data Motion
Measured random access to DDG data to simulate file I/O:

Comparison of DDGs and File-Based M/R
                        DDG                                         File-Based M/R
Data set size           Gigabytes to terabytes                      Terabytes to petabytes
Data repository         In-memory                                   File / database
Data view               Queried object collection                   File-based key/value pairs
Development time        Low                                         High
Automatic scalability   Yes                                         Application dependent
Best use                Quick-turn analysis of memory-based data    Complex analysis of large datasets
I/O overhead            Low                                         High
Cluster mgt.            Simple                                      Complex
High availability       Memory-based                                File-based

Walk-Away Points
• Developers need fast, scalable, highly available, and sharable
  memory-based storage for scaled-out applications.
• Distributed data grids (DDGs) address these needs with:
     – Fast access time & scalable throughput
     – Highly available data storage
     – Support for parallel data analysis
• Cloud-based and globally distributed applications need DDGs to:
     – Support scalable data access for “elastic” applications.
     – Efficiently and easily migrate data across sites.
     – Avoid relatively slow cloud I/O storage and interconnects.
• DDGs offer simple, fast “map/reduce” parallel analysis:
     – Make it easy to develop applications and configure clusters.
     – Avoid file I/O overhead for datasets that fit in memory-based grids.
     – Deliver automatic, highly scalable performance.
Distributed Data Grids for
Server Farms & High Performance Computing

        www.scaleoutsoftware.com

Using Distributed In-Memory Computing for Fast Data Analysis

  • 1. Using Distributed, In-Memory Computing for Fast Data Analysis WSTA Seminar September 14, 2011 Bill Bain (wbain@scaleoutsoftware.com) Copyright © 2011 by ScaleOut Software, Inc.
  • 2. Agenda • The Need for Memory-Based, Distributed Storage • What Is a Distributed Data Grid (DDG) • Performance Advantages and Architecture • Migrating Data to the Cloud and Across Global Sites • Parallel Data Analysis • Comparison of DDG to File-Based Map/Reduce 2 WSTA Seminar
  • 3. The Need for Memory-Based Storage Example: Web server farm: Internet • Load-balancer directs POW E R FAU LT DATA AL A RM Load-balancer incoming client requests Ethernet to Web servers. • Web and app. server farms build Web pages W eb Server Distributed, In-Memory DataServer W eb Server W eb Server W eb Server W eb Server W eb Grid and run business logic. Ethernet • Database server holds all mission-critical, LOB data. D atabase R aid D isk D atabase Server Array Server Bottleneck • Server farms share fast- Ethernet changing data using a Distributed, In-Memory Data Grid DDG to avoid bottlenecks and maximize scalability. App. Server App. Server App. Server App. Server 3 WSTA Seminar
  • 4. The Need for Memory-Based Storage Example: Cloud Application: Cloud Application • Application runs as multiple, App VS App VS App VS virtual servers (VS). App VS App VS • Application instances store and retrieve LOB data from cloud- Grid VS Grid VS based file system or database. Grid VS Distributed Data Grid • Applications need fast, scalable storage for fast-changing data. • Distributed data grid runs as multiple, virtual servers to provide “elastic,” in-memory storage. Cloud-Based Storage 4 WSTA Seminar
  • 5. What is a Distributed Data Grid? • A new “vertical” storage tier: Processor Processor Cache Cache – Adds missing layer to boost performance. – Uses in-memory, out-of-process L2 Cache L2 Cache storage. – Avoids repeated trips to backing Application Memory Application Memory “In-Process” “In-Process” storage. • A new “horizontal” storage tier: Distributed Distributed Cache Cache – Allows data sharing among servers. “Out-of- “Out-of- Process” Process” – Scales performance & capacity. – Adds high availability. Backing – Can be used independently of Storage backing storage. 5 WSTA Seminar
  • 6. Distributed Data Grids: A Closer Look • Incorporates a client-side, in- Application Memory process cache (“near cache”): “In-Process” – Transparent to the application – Holds recently accessed data. Client-side • Boosts performance: Cache – Eliminates repeated network data “In-Process” transfers & deserialization. – Reduces access times to near “in- process” latency. Distributed – Is automatically updated if the Cache distributed grid changes. “Out-of- Process” – Supports various coherency models (coherent, polled, event-driven) 6 WSTA Seminar
  • 7. Performance Benefit of Client-side Cache • Eliminates repeated network data transfers. • Eliminates repeated object deserialization. [Chart: average response time in microseconds (axis 0 to 3500), DDG vs. DBMS, for 10KB objects with a 20:1 read/update ratio.]
  • 8. Top 5 Benefits of Distributed Data Grids 1. Faster access time for business logic state or database data. 2. Scalable throughput to match a growing workload and keep response times low. 3. High availability to prevent data loss if a grid server (or network link) fails. 4. Shared access to data across the server farm. 5. Advanced capabilities for quickly and easily mining data using scalable “map/reduce” analysis. [Chart: access latency (msec) vs. throughput (accesses/sec), grid vs. DBMS.]
  • 9. Scaling the Distributed Data Grid • Distributed data grid must deliver scalable throughput. • To do so, its architecture must eliminate bottlenecks to scaling: – Avoid centralized scheduling to eliminate hot spots. – Use data partitioning and maintain load-balance to allow scaling. – Use a fixed number of replicas rather than full replication to avoid n-fold overhead. – Use low-overhead heart-beating. • Example of linear throughput scaling: [Chart: read/write throughput for 10KB objects in accesses/second (axis 0 to 80,000) as the grid grows from 4 to 64 nodes and the object count from 16,000 to 256,000.]
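To make the partitioning and fixed-replication points concrete, here is a small sketch. It assumes a simple hash-modulo placement, which is a simplification chosen for illustration and not how any particular grid assigns objects (real grids also rebalance as servers join or leave). The function name `owners` and the node names are hypothetical.

```python
import hashlib


def owners(key, nodes, replicas=1):
    """Return the primary node plus `replicas` backup nodes for a key.

    The number of copies stays fixed as the grid grows, unlike full
    replication, where every node would hold every object.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    count = min(replicas + 1, len(nodes))
    return [nodes[(start + i) % len(nodes)] for i in range(count)]


nodes = [f"node{i}" for i in range(4)]
placement = {f"obj{i}": owners(f"obj{i}", nodes) for i in range(1000)}

# Every object has exactly two copies (primary + one replica).
copies_per_object = {len(v) for v in placement.values()}

# Count how many objects each node serves as primary.
primary_load = {n: 0 for n in nodes}
for v in placement.values():
    primary_load[v[0]] += 1
```

With 64 nodes the same call still returns two owners per object, so update cost per object does not grow with grid size.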
  • 10. Typical Commercial Distributed Data Grids • Partition objects to scale throughput and avoid hot spots. • Synchronize access to objects across all servers. • Dynamically rebalance objects to avoid hot spots. • Replicate each cached object for high availability. • Detect server or network failures and self-heal. [Diagram: client application and client library retrieve a cached copy of an object; the distributed cache consists of cache services connected by Ethernet that hold the object copy and its replica.]
  • 11. Wide Range of Applications Financial Services: • Portfolio risk analysis • VaR calculations • Monte Carlo simulations • Algorithmic trading • Market message caching • Derivatives trading • Pricing calculations. E-commerce: • Session-state storage • Application state storage • Online banking • Loan applications • Wealth management • Online learning • Hotel reservations • News story caching • Shopping carts • Social networking • Service call tracking • Online surveys. Other Applications: • Edge servers: chat, email • Online gaming servers • Scientific computations • Command and control.
  • 12. Importance for Cloud Computing • Cloud computing: – Makes elastic resources readily available, but… – Clouds have relatively slow interconnects. • Distributed data grids add significant value in the cloud: – Allow data sharing across a group of virtual servers. – Elastically scale throughput as needed. – Provide low-latency, object-oriented storage. • Clouds provide the elastic platform for parallel data analysis. • DDGs provide the efficiency and scalability needed to overcome the cloud’s limited interconnect speed.
  • 13. DDGs Simplify Data Migration to the Cloud • Distributed data grids can automatically bridge on-premise and cloud-based data grids to unify access. • This enables seamless access to data across multiple sites. [Diagram: user’s on-premise application servers with an on-premise distributed data grid of SOSS hosts and backing store; cloud application of App VS instances with a cloud-hosted distributed data grid of SOSS VS instances; data is automatically migrated between the two grids.]
  • 14. DDGs Enable Seamless Global Access [Diagram: a global distributed data grid linking mirrored data centers and satellite data centers, each running a distributed data grid of SOSS servers.]
  • 15. Introducing Parallel Data Analysis • The goal: – Quickly analyze a large set of data for patterns and trends. – How? Run a method E (“eval”) across a set of objects D in parallel. – Optionally merge the results using method M (“merge”). • Evolution of parallel analysis: – '80s: “SIMD/SPMD” (Flynn, Hillis) – '90s: “Domain decomposition” (Intel, IBM) – '00s: “Map/reduce” (Google, Hadoop, Dryad) • Applications: – Search, financial services, business intelligence, simulation. [Diagram: method E applied to a grid of objects D, with method M merging the outputs into a single result.]
  • 16. Example in Financial Services Analyze trading strategies across stock histories: Why? • Back-testing systems help guard against risks in deploying new trading strategies. • Performance is critical for “first to market” advantage. • Back-testing uses a significant amount of market data and computation time. How? • Write method E to analyze trading strategies across a single stock history. • Write method M to merge two sets of results. • Populate the data store with a set of stock histories. • Run method E in parallel on all stock histories. • Merge the results with method M to produce a report. • Refine and repeat…
  • 17. Stage the Data for Analysis • Step 1: Populate the distributed data grid with objects, each of which represents a price history for a ticker symbol.
  • 18. Code the Eval and Merge Methods • Step 2: Write a method to evaluate a stock history based on parameters: Results EvalStockHistory(StockHistory history, Parameters params) { <analyze trading strategy for this stock history> return results; } • Step 3: Write a method to merge the results of two evaluations: Results MergeResults(Results results1, Results results2) { <merge both results> return results; } • Notes: – This code can be run as a sequential calculation on in-memory data. – No explicit accesses to the distributed data grid are used.
  • 19. Run the Analysis • Step 4: Invoke parallel evaluation and merging of results: Results Invoke(EvalStockHistory, MergeResults, querySpec, params); [Diagram: EvalStockHistory() and MergeResults() being shipped to the grid.]
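Steps 1 through 4 can be sketched end to end on a single machine. The following is an illustration under stated assumptions, not the grid's actual API: `invoke` is a hypothetical stand-in that uses a local thread pool instead of grid servers, a plain dict plays the role of the populated grid, and the "strategy" (counting days whose gain exceeds a threshold) is invented purely to give the eval method something to compute.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def eval_stock_history(history, params):
    """Eval: count days whose close-to-close gain exceeds the threshold."""
    threshold = params["threshold"]
    hits = sum(
        1
        for prev, cur in zip(history, history[1:])
        if (cur - prev) / prev > threshold
    )
    return {"hits": hits, "days": len(history) - 1}


def merge_results(r1, r2):
    """Merge: combine two partial results into one."""
    return {"hits": r1["hits"] + r2["hits"], "days": r1["days"] + r2["days"]}


def invoke(eval_fn, merge_fn, objects, params):
    """Run eval_fn on every object in parallel, then merge the results.

    Assumes at least one object; neither user method touches the store.
    """
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda obj: eval_fn(obj, params), objects))
    return reduce(merge_fn, partials)


# Step 1 stand-in: one price history per (made-up) ticker symbol.
histories = {
    "AAA": [10.0, 10.5, 10.4, 11.0],
    "BBB": [20.0, 19.0, 19.5],
    "CCC": [5.0, 5.3],
}

report = invoke(
    eval_stock_history, merge_results, histories.values(), {"threshold": 0.03}
)
print(report)  # {'hits': 3, 'days': 6}
```

As on the slide, the eval and merge methods are ordinary sequential code over in-memory data; only `invoke` knows about parallelism. A local thread pool shows the structure but not the speedup a multi-server grid would aim for.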
  • 20. [Diagram: start parallel analysis; .eval() runs on each of six stock histories to produce six results; .merge() combines them pairwise into three results, and a final .merge() produces the results returned to the client.]
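The combining pattern in this diagram can be sketched as a level-by-level pairwise merge. This is an assumed, strictly binary variant (the diagram's final step appears to merge three results at once), and `tree_merge` is a hypothetical helper name. The merges within one level are independent of each other, which is what would let a grid run them in parallel; the sketch runs them sequentially for clarity.

```python
def tree_merge(results, merge_fn):
    """Merge a non-empty list pairwise, level by level.

    Returns the final merged value and the number of merge levels used.
    """
    if not results:
        raise ValueError("need at least one result")
    level = list(results)
    rounds = 0
    while len(level) > 1:
        # Merge neighbors (0,1), (2,3), ...; these merges are independent.
        merged = [
            merge_fn(level[i], level[i + 1])
            for i in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2:
            merged.append(level[-1])  # odd element carries to the next level
        level = merged
        rounds += 1
    return level[0], rounds


# Six partial results, as in the diagram: 6 -> 3 -> 2 -> 1.
total, rounds = tree_merge([1, 2, 3, 4, 5, 6], lambda a, b: a + b)
print(total, rounds)  # 21 3
```

The number of levels grows with the logarithm of the number of partial results, so the merge phase adds little latency even for many objects, provided the merge method is associative.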
  • 21. DDG Minimizes Data Motion • File-based map/reduce must move data to memory for analysis. [Diagram: M/R servers run E on data D in server memory after loading it from a file system / database.] • Memory-based DDG analyzes data in place. [Diagram: grid servers run E directly on data D held in the distributed data grid.]
  • 22. [Diagram: the same parallel analysis run as file-based map/reduce; file I/O is needed to load the six stock histories before .eval(), and again between the .merge() stages, before the results are returned to the client.]
  • 23. Performance Impact of Data Motion Measured random access to DDG data to simulate file I/O.
  • 24. Comparison of DDGs and File-Based M/R
      Attribute | DDG | File-Based M/R
      Data set size | Gigabytes to terabytes | Terabytes to petabytes
      Data repository | In-memory | File / database
      Data view | Queried object collection | File-based key/value pairs
      Development time | Low | High
      Automatic scalability | Yes | Application dependent
      Best use | Quick-turn analysis of memory-based data | Complex analysis of large datasets
      I/O overhead | Low | High
      Cluster mgt. | Simple | Complex
      High availability | Memory-based | File-based
  • 25. Walk-Away Points • Developers need fast, scalable, highly available and sharable memory-based storage for scaled-out applications. • Distributed data grids (DDGs) address these needs with: – Fast access time & scalable throughput – Highly available data storage – Support for parallel data analysis • Cloud-based and globally distributed applications need DDGs to: – Support scalable data access for “elastic” applications. – Efficiently and easily migrate data across sites. – Avoid relatively slow cloud I/O storage and interconnects. • DDGs offer simple, fast “map/reduce” parallel analysis: – Make it easy to develop applications and configure clusters. – Avoid file I/O overhead for datasets that fit in memory-based grids. – Deliver automatic, highly scalable performance.
  • 26. Distributed Data Grids for Server Farms & High Performance Computing www.scaleoutsoftware.com