Using Distributed In-Memory Computing for Fast Data Analysis

WSTA Seminar
September 14, 2011

Bill Bain (wbain@scaleoutsoftware.com)

Copyright © 2011 by ScaleOut Software, Inc.

Agenda
• The Need for Memory-Based, Distributed
Storage
• What Is a Distributed Data Grid (DDG)
• Performance Advantages and Architecture
• Migrating Data to the Cloud and Across Global
Sites
• Parallel Data Analysis
• Comparison of DDG to File-Based
Map/Reduce

The Need for Memory-Based Storage
Example: Web server farm:
• Load-balancer directs incoming client requests
to Web servers.
• Web and app. server farms build Web pages
and run business logic.
• Database server holds all mission-critical,
LOB data.
• Server farms share fast-changing data using a
DDG to avoid bottlenecks and maximize
scalability.
[Diagram: Internet, load-balancer, Web server farm, and app. server farm connected by Ethernet, with distributed, in-memory data grids in the Web and app. server tiers; the database server with its RAID disk array is marked as the bottleneck]

The Need for Memory-Based Storage
Example: Cloud Application:
• Application runs as multiple, virtual servers (VS).
• Application instances store and retrieve LOB data
from cloud-based file system or database.
• Applications need fast, scalable storage for
fast-changing data.
• Distributed data grid runs as multiple, virtual
servers to provide “elastic,” in-memory storage.
[Diagram: a cloud application of App VS instances, above a distributed data grid of Grid VS instances, above cloud-based storage]

What is a Distributed Data Grid?
• A new “vertical” storage tier:
– Adds missing layer to boost performance.
– Uses in-memory, out-of-process storage.
– Avoids repeated trips to backing storage.
• A new “horizontal” storage tier:
– Allows data sharing among servers.
– Scales performance & capacity.
– Adds high availability.
– Can be used independently of backing storage.
[Diagram: storage hierarchy on two servers: processor cache, L2 cache, application memory (“in-process”), distributed cache (“out-of-process”), and backing storage]

Distributed Data Grids: A Closer Look
• Incorporates a client-side, in-process cache (“near cache”):
– Transparent to the application.
– Holds recently accessed data.
• Boosts performance:
– Eliminates repeated network data transfers & deserialization.
– Reduces access times to near “in-process” latency.
– Is automatically updated if the distributed grid changes.
– Supports various coherency models (coherent, polled, event-driven).
[Diagram: application memory (“in-process”), client-side cache (“in-process”), distributed cache (“out-of-process”)]
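
The near-cache idea can be sketched in a few lines. This is only an illustration under stated assumptions, not any product's API: `Grid` and `NearCache` are hypothetical names, the "grid" is a local dictionary holding pickled bytes, and coherency is approximated by a cheap version check on every read.

```python
import pickle


class Grid:
    """Hypothetical stand-in for the out-of-process grid.

    Objects are stored serialized, together with a version number.
    """

    def __init__(self):
        self._store = {}   # key -> (version, serialized bytes)
        self.reads = 0     # counts full object transfers

    def put(self, key, obj):
        version = self._store.get(key, (0, b""))[0] + 1
        self._store[key] = (version, pickle.dumps(obj))

    def version(self, key):
        # Cheap call: returns only the version, not the object.
        return self._store[key][0]

    def get(self, key):
        # Expensive call: transfers the serialized object.
        self.reads += 1
        return self._store[key]


class NearCache:
    """Hypothetical client-side cache holding deserialized objects."""

    def __init__(self, grid):
        self._grid = grid
        self._local = {}   # key -> (version, deserialized object)

    def get(self, key):
        cached = self._local.get(key)
        if cached is not None and cached[0] == self._grid.version(key):
            return cached[1]            # hit: no transfer, no deserialization
        version, blob = self._grid.get(key)
        obj = pickle.loads(blob)        # miss or stale: fetch and deserialize
        self._local[key] = (version, obj)
        return obj
```

Repeated reads of an unchanged object cost one version check each. The full transfer and deserialization happen only on the first read and after an update.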

Performance Benefit of Client-side Cache

• Eliminates repeated network data transfers.
• Eliminates repeated object deserialization.

[Chart: Average Response Time in microseconds (axis 0 to 3500), DDG vs. DBMS; 10KB objects, 20:1 read/update]

Top 5 Benefits of Distributed Data Grids
1. Faster access time for business logic state or database data
2. Scalable throughput to match a growing workload and keep
response times low
3. High availability to prevent data loss if a grid server (or network
link) fails
4. Shared access to data across the server farm
5. Advanced capabilities for quickly and easily mining data using
scalable “map/reduce” analysis
[Chart: Access Latency vs. Throughput; access latency (msec) against throughput (accesses / sec), Grid vs. DBMS]

Scaling the Distributed Data Grid
• Distributed data grid must deliver scalable throughput.
• To do so, its architecture must eliminate bottlenecks to
scaling:
– Avoid centralized scheduling to eliminate hot spots.
– Use data partitioning and maintain load-balance to allow scaling.
– Use fixed vs. full replication to avoid n-fold overhead.
– Use low overhead heart-beating.
• Example of linear throughput scaling:
[Chart: Read/Write Throughput in accesses / second (axis 0 to 80,000) for 10KB objects, from 4 to 64 nodes, with the number of objects growing from 16,000 to 256,000]
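
Partitioning and fixed replication can be illustrated with a small sketch. It assumes simple hash-modulo placement, which is a simplification chosen for brevity and not necessarily how any commercial grid assigns objects. The names `partition_of` and `placement` are hypothetical.

```python
import hashlib


def partition_of(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (illustrative only)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def placement(key: str, servers: list, replicas: int = 1) -> list:
    """Return the primary server followed by `replicas` replica servers.

    The number of copies is fixed at replicas + 1, no matter how many
    servers join the grid. Full replication would instead store one copy
    on every server, so update cost would grow n-fold with n servers.
    """
    if replicas >= len(servers):
        raise ValueError("need more servers than replicas")
    primary = partition_of(key, len(servers))
    return [servers[(primary + i) % len(servers)] for i in range(replicas + 1)]
```

Any client can compute an object's location from its key alone, so no central scheduler sits on the access path. With one replica, each update touches two servers whether the grid has 4 nodes or 64.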

Typical Commercial Distributed Data Grids
• Partition objects to scale throughput and avoid hot
spots.
• Synchronize access to objects across all servers.
• Dynamically rebalance objects to avoid hot spots.
• Replicate each cached object for high availability.
• Detect server or network failures and self-heal.
[Diagram: a client application retrieves an object through the client library, which holds a cached copy; the distributed cache spans four cache services connected by Ethernet and stores each object copy with a replica]

Wide Range of Applications
Financial Services
• Portfolio risk analysis
• VaR calculations
• Monte Carlo simulations
• Algorithmic trading
• Market message caching
• Derivatives trading
• Pricing calculations
E-commerce
• Session-state storage
• Application state storage
• Online banking
• Loan applications
• Wealth management
• Online learning
• Hotel reservations
• News story caching
• Shopping carts
• Social networking
• Service call tracking
• Online surveys
Other Applications
• Edge servers: chat, email
• Online gaming servers
• Scientific computations
• Command and control

Importance for Cloud Computing
• Cloud computing:
– Makes elastic resources readily available, but…
– Clouds have relatively slow interconnects.
• Distributed data grids add significant value in the cloud:
– Allow data sharing across a group of virtual servers.
– Elastically scale throughput as needed.
– Provide low latency, object-oriented storage.
• Clouds provide the elastic platform for parallel data
analysis.
• DDGs provide the efficiency and scalability needed to
overcome the cloud’s limited interconnect speed.

DDGs Simplify Data Migration to the Cloud
• Distributed data grids can automatically bridge on-
premise and cloud-based data grids to unify access.
• This enables seamless access to data across
multiple sites.
[Diagram: cloud applications running as App VS instances use a cloud-hosted distributed data grid (SOSS VS); the user's on-premise applications (Server App) use an on-premise distributed data grid (SOSS Host) with a backing store; data is automatically migrated between the two grids]

DDGs Enable Seamless Global Access

[Diagram: mirrored data centers and satellite data centers, each running a distributed data grid of SOSS servers (SOSS SVR), combined into a global distributed data grid]

Introducing Parallel Data Analysis
• The goal:
– Quickly analyze a large set of data for patterns and trends.
– How? Run a method E (“eval”) across a set of objects D in parallel.
– Optionally merge the results using method M (“merge”).
• Evolution of parallel analysis:
– '80s: “SIMD/SPMD” (Flynn, Hillis)
– '90s: “Domain decomposition” (Intel, IBM)
– '00s: “Map/reduce” (Google, Hadoop, Dryad)
• Applications:
– Search, financial services, business intelligence, simulation
[Diagram: method E is applied to a grid of objects D, and method M merges the outputs into a single result]
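
The eval/merge pattern can be sketched generically as follows. This is a single-process illustration, not a grid API: `parallel_analyze` is a hypothetical name, and a thread pool stands in for the grid servers. In CPython, threads do not speed up CPU-bound pure-Python work, so the sketch shows the shape of the computation and makes no performance claim.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def parallel_analyze(objects, eval_fn, merge_fn=None, max_workers=4):
    """Run eval_fn (E) on every object (D), then optionally merge with merge_fn (M)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with objects.
        results = list(pool.map(eval_fn, objects))
    if merge_fn is None:
        return results          # merging is optional
    if not results:
        return None             # nothing to merge
    return reduce(merge_fn, results)
```

For example, `parallel_analyze(data, lambda d: d * d, lambda a, b: a + b)` computes a sum of squares: the first function plays the role of E and the second the role of M.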

Example in Financial Services
Analyze trading strategies across stock histories:
Why?
• Back-testing systems help guard against risks in deploying new
trading strategies.
• Performance is critical for “first to market” advantage.
• Uses a significant amount of market data and computation time.
How?
• Write method E to analyze trading strategies across a single
stock history.
• Write method M to merge two sets of results.
• Populate the data store with a set of stock histories.
• Run method E in parallel on all stock histories.
• Merge the results with method M to produce a report.
• Refine and repeat…

Stage the Data for Analysis

• Step 1: Populate the distributed data grid with objects each of which
represents a price history for a ticker symbol:
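
A minimal sketch of this staging step, with a plain dictionary standing in for the distributed data grid. The `StockHistory` fields and the `populate` helper are assumptions made for illustration; a real grid client would store each object under its ticker key through the grid's own API.

```python
from dataclasses import dataclass, field


@dataclass
class StockHistory:
    """Hypothetical price-history object: one per ticker symbol."""
    ticker: str
    closes: list = field(default_factory=list)  # closing prices, oldest first


def populate(grid: dict, price_rows) -> dict:
    """Group (ticker, close) rows into one StockHistory object per ticker."""
    for ticker, close in price_rows:
        history = grid.setdefault(ticker, StockHistory(ticker))
        history.closes.append(close)
    return grid
```

Keeping one object per ticker lets the later eval step work on a complete price history without any cross-object lookups.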

Code the Eval and Merge Methods
• Step 2: Write a method to evaluate a stock history based on parameters:
Results EvalStockHistory(StockHistory history, Parameters params)
{
<analyze trading strategy for this stock history>
return results;
}

• Step 3: Write a method to merge the results of two evaluations:
Results MergeResults(Results results1, Results results2)
{
<merge both results>
return results;
}

• Notes:
– This code can be run as a sequential calculation on in-memory data.
– No explicit accesses to the distributed data grid are used.
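
To make the two methods concrete, here is a sketch in Python with an invented strategy: buy when the price rises above its moving average, sell when it falls below. The strategy, the `Results` fields, and the `window` parameter are all assumptions for illustration and are not taken from the presentation.

```python
from dataclasses import dataclass


@dataclass
class Results:
    trades: int = 0
    profit: float = 0.0
    best_ticker: str = ""
    best_profit: float = float("-inf")


def eval_stock_history(ticker, closes, params):
    """Evaluate one stock history with a toy moving-average strategy."""
    window = params["window"]
    trades, profit, entry = 0, 0.0, None
    for i in range(window, len(closes)):
        average = sum(closes[i - window:i]) / window
        if entry is None and closes[i] > average:
            entry = closes[i]                 # open a position
        elif entry is not None and closes[i] < average:
            profit += closes[i] - entry       # close the position
            trades += 1
            entry = None
    return Results(trades, profit, ticker, profit)


def merge_results(r1, r2):
    """Merge two results: add the totals and keep the better ticker."""
    best = r1 if r1.best_profit >= r2.best_profit else r2
    return Results(r1.trades + r2.trades, r1.profit + r2.profit,
                   best.best_ticker, best.best_profit)
```

As the notes above say, neither function touches the grid. Both can be tested sequentially, for example with `functools.reduce(merge_results, ...)` over a handful of in-memory histories.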

Run the Analysis
• Step 4: Invoke parallel evaluation and merging of results:
Results Invoke(EvalStockHistory, MergeResults, querySpec,
params);

EvalStockHistory()

MergeResults()

[Diagram: the client starts the parallel analysis; .eval() runs on each stock history to produce results; .merge() combines the results in successive stages until a single result is returned to the client]
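
The staged merging can be sketched as a pairwise tree reduction. This is an illustration of the general technique, assuming a two-argument merge function. `tree_merge` is a hypothetical name, and the sketch runs the merges sequentially even though the merges within one stage are independent of each other.

```python
def tree_merge(results, merge_fn):
    """Combine results pairwise, stage by stage, until one remains."""
    level = list(results)
    if not level:
        raise ValueError("no results to merge")
    while len(level) > 1:
        # Merge neighbours: (0, 1), (2, 3), ... These merges are
        # independent, so a grid could run them on different servers.
        merged = [merge_fn(level[i], level[i + 1])
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            merged.append(level[-1])   # odd one out moves up unchanged
        level = merged
    return level[0]
```

Each stage halves the number of outstanding results, so the number of stages grows logarithmically with the number of evaluated objects.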

DDG Minimizes Data Motion
• File-based map/reduce must move data to memory for analysis:
[Diagram: three M/R servers each run E in server memory on data D that is first read from the file system / database]
• Memory-based DDG analyzes data in place:
[Diagram: three grid servers each run E directly on the data D held in the distributed data grid]

[Diagram: the same eval/merge flow under file-based map/reduce, with file I/O at the .eval() stage and at each .merge() stage before the result is returned to the client]

Performance Impact of Data Motion
Measured random access to DDG data to simulate file I/O:

Comparison of DDGs and File-Based M/R
                        DDG                                       File-Based M/R
Data set size           Gigabytes->terabytes                      Terabytes->petabytes
Data repository         In-memory                                 File / database
Data view               Queried object collection                 File-based key/value pairs
Development time        Low                                       High
Automatic scalability   Yes                                       Application dependent
Best use                Quick-turn analysis of memory-based data  Complex analysis of large datasets
I/O overhead            Low                                       High
Cluster mgt.            Simple                                    Complex
High availability       Memory-based                              File-based

Walk-Away Points
• Developers need fast, scalable, highly available, and sharable
memory-based storage for scaled-out applications.
• Distributed data grids (DDGs) address these needs with:
– Fast access time & scalable throughput
– Highly available data storage
– Support for parallel data analysis
• Cloud-based and globally distributed applications need DDGs to:
– Support scalable data access for “elastic” applications.
– Efficiently and easily migrate data across sites.
– Avoid relatively slow cloud I/O storage and interconnects.
• DDGs offer simple, fast “map/reduce” parallel analysis:
– Make it easy to develop applications and configure clusters.
– Avoid file I/O overhead for datasets that fit in memory-based grids.
– Deliver automatic, highly scalable performance.

Distributed Data Grids for
Server Farms & High Performance Computing

www.scaleoutsoftware.com
