Enabling Operational Intelligence:
Making Hadoop Real-Time
Copyright © 2014 by ScaleOut Software, Inc.
Los Angeles Big Data Users Group
August 6, 2014
Bill Bain, CEO (wbain@scaleoutsoftware.com)
2 ScaleOut Software, Inc.
• Operational Intelligence vs. Business Intelligence
• Data-Parallel Computation
• Implementing Using an In-Memory Data Grid:
• Distributing the Data Across a Cluster
• Running the Computation using “Parallel Method Invocation”
• An Example in Financial Services
• Implementing In-Memory Hadoop MapReduce
• Video Demo
• Examples of Applications in Operational Intelligence
Agenda
• Develops and markets In-Memory Data Grids,
software middleware for:
• Scaling application performance and
• Providing operational intelligence using
• In-memory data storage and computing
• Dr. William Bain, Founder & CEO
• Career focused on parallel computing – Bell Labs, Intel, Microsoft
• Three prior start-ups; the last was acquired by Microsoft, and its product now ships as
Network Load Balancing in Windows Server
• Eight years in the market; 400 customers, 10,000 servers
• Sample customers:
About ScaleOut Software
Goal: Provide immediate feedback to a system handling live data.
A few examples:
• Equity trading: to minimize risk during a trading day
• Ecommerce: for real-time recommendations
• Reservations systems: to identify issues, reroute, etc.
• Credit cards & wire transfers: to detect fraud in real time
• Smart grids: to optimize power distribution & detect issues
Online Systems Need Operational Intelligence
Big Data Analytics
Real-Time vs. Batch Analytics
Real-time (“Operational Intelligence”):
• Live data sets
• Gigabytes to terabytes
• In-memory storage
• Minutes to seconds
• Best uses: tracking live data; immediately identifying trends and capturing opportunities; providing immediate feedback
• Products shown: Analytics Server, hServer
Batch (“Business Intelligence”):
• Static data sets
• Petabytes
• Disk storage
• Hours to minutes
• Best uses: analyzing warehoused data; mining for long-term trends
• Products shown: Hadoop, IBM, Teradata, SAS, SAP
• Operational intelligence can co-exist with business intelligence:
• Processes streaming data close to its sources.
• Provides real-time, “tactical” feedback (e.g., recommendations, alerts).
• Translates data for storage in the data warehouse (ETL).
• Data warehouse provides “strategic” guidance.
• Using the same tool set (e.g., Hadoop MapReduce) lowers TCO:
• Leverages common skill set.
• Simplifies design (e.g., loading data into HDFS).
Integrated View of Analytics
• To keep up with fast-growing “live” workloads and maintain fast response times:
• Ex.: Handle incoming data streams in real time.
• Ex.: Process updates to the data set based on incoming data.
• To identify and respond to trends in fast-changing data:
• Ex.: Evaluate data set changes in real time.
• Ex.: Respond to identified patterns within seconds.
Challenges for Operational Intelligence
[Chart: Growth in Web Servers, in millions, 1995 to 2010. Source: Netcraft]
[Chart: Growth in “Big Data”, in exabytes, 2000 to 2013]
“More data has been created in the past three years than in the past 40,000.”
The Solution: Data-Parallel Computing
• Straightforward, well understood model of parallel computation
• An alternative to task-parallel computation (e.g., Storm)
• Simple: runs the same code on multiple, in-memory data items.
• Powerful: maintains a “live,” in-memory model of a real-world system
• Fast: avoids data motion, which would otherwise limit speedup.
Analyze Data (Eval)
Combine Results
(Merge)
Server Cluster
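The eval/merge model above can be sketched on a single JVM. This is only an illustration, not ScaleOut's API: the class name `EvalMergeSketch` and its methods are hypothetical, and a parallel stream stands in for the server cluster.

```java
import java.util.List;

// Illustrative only: a single-JVM stand-in for a cluster-wide eval/merge run.
final class EvalMergeSketch {
    // "Eval": analyze one data item independently of all the others.
    static long eval(int item) {
        return (long) item * item;
    }

    // "Merge": combine two partial results; it must be associative.
    static long merge(long a, long b) {
        return a + b;
    }

    // Run eval on every item in parallel, then merge the partial results.
    static long run(List<Integer> items) {
        return items.parallelStream()
                .mapToLong(EvalMergeSketch::eval)
                .reduce(0L, EvalMergeSketch::merge);
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(1, 2, 3, 4))); // sum of squares: 30
    }
}
```

Because the same code runs on every item and the merge is associative, the work can be split across any number of workers without changing the result.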
Track “live” data and analyze it in real time:
Implementing OI
In-Memory
State
NoSQL
Storage
Real-Time
Data Parallel
Analysis
• Storm implements pipelined execution of tasks by “bolts” on
incoming data streams.
• Streams can be distributed to bolts with configurable mappings.
• Developer controls the number of tasks per bolt.
• Storm uses a centralized master node and ZooKeeper for fault tolerance.
• Key strength: continuous
processing of input
streams
• Issues:
• Complexity / tuning
• Minimizing data motion
• Managing global state
Quick Comparison to Storm
Data-Parallel Enables Linear Speedup
Avoids data motion (network or disk I/O), which limits throughput:
Data-Parallel Computing Is Not New
• 1980’s: Special Purpose Hardware: “SIMD”
Thinking Machines Connection Machine 5
• 1990’s: General Purpose Parallel Supercomputers:
“Domain Decomposition”, “SPMD”
Intel iPSC/2
IBM SP1
Data-Parallel Computing Is Not New
• 1990’s – early 2000’s: HPC on Clusters: “MPI”
• Since 2003: Clusters, the Cloud, and IMDGs: “MapReduce”
HP Blade Servers
Amazon EC2, Windows Azure
• In-memory data grid
(IMDG) holds active
entities undergoing
state changes in
memory.
• Backing store
optionally holds large
population of entities.
• IMDG processes
incoming stream of
state changes.
• Analytics engine
examines entities in real
time and generates
alerts within seconds
as needed.
A Data-Parallel Architecture for
Operational Intelligence
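The flow described on this slide (entities held in memory, a stream of state changes, and an analysis pass that raises alerts) might look roughly like the following. It is a minimal single-JVM sketch; `LiveEntitySketch`, the entity ids, and the numeric threshold are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

// Illustrative only: a concurrent map plays the role of the in-memory data grid.
final class LiveEntitySketch {
    // In-memory state for active entities, keyed by a hypothetical entity id.
    private final Map<String, Double> state = new ConcurrentHashMap<>();

    // Apply one incoming state change by adding a delta to the entity's value.
    void applyUpdate(String entityId, double delta) {
        state.merge(entityId, delta, Double::sum);
    }

    // Examine all entities in parallel; return the sorted ids whose value exceeds the limit.
    List<String> alerts(double limit) {
        return state.entrySet().parallelStream()
                .filter(e -> e.getValue() > limit)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        LiveEntitySketch grid = new LiveEntitySketch();
        grid.applyUpdate("sensor-1", 5.0);
        grid.applyUpdate("sensor-1", 7.0);
        grid.applyUpdate("sensor-2", 3.0);
        System.out.println(grid.alerts(10.0)); // only sensor-1 exceeds 10
    }
}
```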
In-Memory Data Grid (IMDG) stores “live” data in a cluster:
• Fits in the business logic layer:
• Follows object-oriented view of data
(vs. relational view).
• Stores collections of Java/.NET
objects shared by multiple clients.
• Uses create/read/update/delete
and query APIs to access data.
• Implemented across a cluster of
servers or VMs:
• Scales storage and throughput
by adding servers.
• Provides high availability
in case a server fails.
In-Memory Data Grid for Live Data
• IMDG’s collections of objects act like in-memory collections:
• Unstructured, typically instances of a class (stored as serialized blobs)
• Individually accessible and updatable
• IMDG adds attributes:
• Accessible by global key
• Queryable by properties
• Highly available
• Optional timeouts
• Distributed locking
• Integration with a backing store
• Optional dependency relationships
• Asynchronous event handling
IMDGs Store “Live” Data
Basic “CRUD” APIs:
• Create(key, obj, timeout)
• Read(key)
• Update(key, obj)
• Delete(key)
and…
• Lock(key)
• Unlock(key)
Object
key
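To make the CRUD and locking calls listed above concrete, here is a minimal single-process sketch. It is not the IMDG's API: `CrudStoreSketch` is a hypothetical class, timeouts are omitted, and the locks are local to one JVM rather than distributed.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative only: keyed object storage with per-key locks in one JVM.
final class CrudStoreSketch<K, V> {
    private final Map<K, V> objects = new ConcurrentHashMap<>();
    private final Map<K, ReentrantLock> locks = new ConcurrentHashMap<>();

    // Create: store the object only if the key is not already present.
    boolean create(K key, V obj) {
        return objects.putIfAbsent(key, obj) == null;
    }

    // Read: return the stored object, or null if the key is absent.
    V read(K key) {
        return objects.get(key);
    }

    // Update: replace the object only if the key already exists.
    boolean update(K key, V obj) {
        return objects.replace(key, obj) != null;
    }

    // Delete: remove the object; report whether anything was removed.
    boolean delete(K key) {
        return objects.remove(key) != null;
    }

    // Lock: block until the calling thread holds the lock for this key.
    void lock(K key) {
        locks.computeIfAbsent(key, k -> new ReentrantLock()).lock();
    }

    // Unlock: release the key's lock if the calling thread holds it.
    void unlock(K key) {
        ReentrantLock l = locks.get(key);
        if (l != null && l.isHeldByCurrentThread()) {
            l.unlock();
        }
    }
}
```

A typical read-modify-write would call `lock(key)`, then `read`, then `update`, and finally `unlock(key)` in a `finally` block.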
Spark / Spark Streaming from U.C. Berkeley AMPLab:
• In-memory computing to accelerate and extend Hadoop MapReduce using
data-parallel operators in Scala.
• Stores data as “resilient
distributed datasets” (RDDs):
• Distributed across cluster
• Immutable
• Hold data from/output to HDFS.
• Store data stream as a sequence of RDDs.
• Comparison to IMDG:
• Not designed for “live” data:
• Lacks CRUD on individual objects.
• Lacks high availability.
• Designed for “data parallel” operators.
Quick Comparison to Spark
Data-Parallel Computing Using PMI
“Parallel Method Invocation” (PMI): an object-oriented version of
data-parallel computing from the HPC community:
• Serves as a platform for MapReduce and other data-parallel operators.
• Selects objects using a parallel query on data hosted in the IMDG.
• Runs user-defined methods in parallel across the cluster.
Analyze Data (Eval)
Combine Results
(Merge)
In-Memory Data Grid Runs
Data-Parallel Computation.
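The three PMI steps (select by query, run a user-defined method in parallel, merge the results) can be imitated locally as follows. This is a sketch under stated assumptions: `LocalInvokable` and `PmiSketch.invoke` are hypothetical names, a `Predicate` stands in for the parallel query, and a parallel stream stands in for the grid.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Illustrative only: a local analogue of query + eval + merge.
final class PmiSketch {
    // Hypothetical counterpart of an eval/merge pair: T = object, P = parameter, R = result.
    interface LocalInvokable<T, P, R> {
        R eval(T obj, P param);
        R merge(R first, R second);
    }

    // Select objects with the query, evaluate each in parallel, then merge pair-wise.
    static <T, P, R> Optional<R> invoke(List<T> objects, Predicate<T> query,
                                        P param, LocalInvokable<T, P, R> method) {
        return objects.parallelStream()
                .filter(query)
                .map(obj -> method.eval(obj, param))
                .reduce(method::merge);
    }

    // Example method: scale each selected value by the parameter, then sum.
    static final LocalInvokable<Integer, Integer, Integer> SCALED_SUM =
            new LocalInvokable<Integer, Integer, Integer>() {
                @Override public Integer eval(Integer value, Integer factor) {
                    return value * factor;
                }
                @Override public Integer merge(Integer a, Integer b) {
                    return a + b;
                }
            };

    public static void main(String[] args) {
        Optional<Integer> total = invoke(List.of(5, 20, 30), v -> v > 10, 2, SCALED_SUM);
        System.out.println(total.orElse(0)); // (20 + 30) * 2 = 100
    }
}
```

The result is an `Optional` because the query may select no objects, in which case there is nothing to merge.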
Integrate analysis into a stock trading platform:
• The IMDG holds market data and hedging strategies.
• Updates to market data
continuously flow through
the IMDG.
• The IMDG performs
repeated data-parallel
analysis on hedging
strategies and alerts
traders in real time.
• IMDG automatically and dynamically
scales its throughput to handle new
hedging strategies by adding servers.
Example in Financial Services
Selects all relevant objects in a distributed collection.
• Query spec matches data’s object-oriented properties.
• Selected objects are fed to the analysis engine on the local server.
Step 1: Select with Parallel Query
Java Example: Parallel Query
public class Portfolio {
    private long id;
    private Set<Stock> longPositions;
    private Set<Stock> shortPositions;
    private double totalValue;
    private Region region;
    private boolean alerted; // alert for trading

    @SossIndexAttribute // query-able property
    public double getTotalValue() {…}

    @SossIndexAttribute // query-able property
    public Region getRegion() {…}

    // Evaluates the positions against a market snapshot (body elided on the slide)
    public Set<Long> evalPositions(MarketSnapshot ms) {…}
}

// Select US portfolios whose total value exceeds 1,000,000
NamedCache pset = CacheFactory.getCache("portfolios");
Set<Portfolio> res = pset.queryObjects(Portfolio.class,
    and(greaterThan("totalValue", 1000000),
        equals("region", Region.US)));
• Create method to analyze a queried portfolio and another method to
pair-wise merge the result sets of alerted portfolios:
Java Example: Parallel Method Invocation
public class PortfolioAnalysis implements
        Invokable<Portfolio, MarketSnapshot, Set<Long>>
{
    public Set<Long> eval(Portfolio p, MarketSnapshot ms)
            throws InvokeException {
        // update portfolio and return id if alerted:
        return p.evalPositions(ms);
    }

    public Set<Long> merge(Set<Long> set1, Set<Long> set2)
            throws InvokeException {
        set1.addAll(set2);
        return set1; // merged set of alerted portfolio ids
    }
}
• Run a parallel method invocation on a queried set of portfolios and
return set of ids for alerted portfolios:
Java Example: Parallel Method Invocation
NamedCache pset = CacheFactory.getCache("portfolios");
InvokeResult alertedPortfolios = pset.invoke(
    PortfolioAnalysis.class,
    Portfolio.class,
    and(greaterThan("totalValue", 1000000), // query spec
        equals("region", Region.US)),
    marketSnapshot, // parameters
    ...
);
System.out.println("The alerted portfolios are " +
    alertedPortfolios.getResult());
• IMDG ships user’s code and libraries to its servers.
• IMDG automatically schedules analysis operations across all grid
servers and cores:
• The analysis runs on all objects selected
by the parallel query.
• Each grid server analyzes its locally stored
objects to minimize data motion.
• Parallel execution ensures fast
completion time:
• IMDG automatically distributes
workload across servers/cores.
• Scaling the IMDG automatically
handles larger data sets.
Running the Analysis
• The IMDG automatically merges all analysis results:
• The IMDG first merges all results within each grid server in parallel.
• It then merges results across all grid servers to create one combined
result.
• Efficient parallel merge
minimizes the delay in
combining all results.
• The IMDG delivers the
combined result to the
invoking application as
one object.
Merging the Results
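The two-level merge described on this slide can be sketched as follows. The code is illustrative only: `TwoLevelMergeSketch` is a hypothetical name, nested lists stand in for the results held on each grid server, and the result type (a set of ids) is borrowed from the portfolio example.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Illustrative only: merge within each "server" first, then across servers.
final class TwoLevelMergeSketch {
    // Pair-wise merge that leaves both inputs unmodified.
    static Set<Long> merge(Set<Long> a, Set<Long> b) {
        Set<Long> out = new TreeSet<>(a);
        out.addAll(b);
        return out;
    }

    static Set<Long> mergeAll(List<List<Set<Long>>> resultsByServer) {
        // Level 1: every server merges its own results; servers proceed in parallel.
        List<Set<Long>> perServer = resultsByServer.parallelStream()
                .map(local -> local.stream()
                        .reduce(new TreeSet<Long>(), TwoLevelMergeSketch::merge))
                .collect(Collectors.toList());
        // Level 2: merge the per-server results into one combined result.
        return perServer.stream()
                .reduce(new TreeSet<Long>(), TwoLevelMergeSketch::merge);
    }

    public static void main(String[] args) {
        List<List<Set<Long>>> byServer = List.of(
                List.of(Set.of(1L), Set.of(2L)),   // results on server A
                List.of(Set.of(2L, 3L)));          // results on server B
        System.out.println(mergeAll(byServer));    // [1, 2, 3]
    }
}
```

Merging locally first means only one result per server has to cross the network in the second level.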
• Measured a similar financial services application (back testing stock
trading strategies on stock histories)
• Hosted IMDG in Amazon EC2 using 75 servers holding 1 TB of stock
history data in memory
• IMDG handled a continuous stream of updates (1.1 GB/s)
• Results: analyzed 1 TB in 4.1 seconds (250 GB/s) with linear scaling
Sample Performance Results for PMI
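As a rough check of these figures, the arithmetic below assumes 1 TB = 1024 GB and an even spread of data and work across the 75 servers; both are assumptions, not statements from the slide.

```latex
\[
\frac{1024\ \text{GB}}{4.1\ \text{s}} \approx 250\ \text{GB/s},
\qquad
\frac{1024\ \text{GB}}{75\ \text{servers}} \approx 13.7\ \text{GB per server},
\qquad
\frac{250\ \text{GB/s}}{75\ \text{servers}} \approx 3.3\ \text{GB/s per server}
\]
```

Under those assumptions each server scans about 13.7 GB of locally held data in the 4.1 seconds, which is consistent with the reported aggregate rate.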
Benefits:
• Enables use of Hadoop MapReduce for operational intelligence.
• Accelerates data access by holding data in memory.
• Analyzes and updates “live” data.
• Reduces overheads of standard
Hadoop distributions:
• Batch scheduling
• Disk access
• Data shuffling
• Mandatory key sorting
• Enables new features, e.g.:
• Global combining, optional sorting
Using PMI to Implement
“In-Memory” Hadoop MapReduce
• A Hadoop distribution does not have to be installed unless HDFS is used.
• The developer starts MapReduce applications from a remote workstation.
• The IMDG automatically builds a reusable “invocation grid” of JVMs on the
grid’s servers for PMI and ships the application’s jars.
• Results are stored in the IMDG, HDFS, or optionally globally merged and
returned to the remote workstation.
Running MapReduce on the IMDG
Run In-Memory MR with YARN
• YARN transparently integrates batch and in-memory MapReduce into
a single execution framework with shared access to HDFS.
• For example, hServer can transparently run Apache Hive in-memory.
Example of ScaleOut hServer with Hortonworks
Example of Hive
Running on hServer
Run MapReduce as two PMI
phases:
• Data can be input from either the
IMDG or an external data source.
• Works with any input/output format
compatible with the Apache
distribution.
• IMDG uses its data-parallel
execution engine (PMI) to invoke
the mappers and the reducers.
• Eliminates batch scheduling
overhead.
• Intermediate results are stored
within the IMDG.
• Minimizes data motion between the
mappers and reducers.
• Allows optional sorting.
• Output of a single reducer/combiner
optionally can be globally merged.
Implementing MapReduce
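The idea of running MapReduce as two in-memory, data-parallel phases can be illustrated with a word count. This is a sketch, not hServer's implementation: `InMemoryWordCountSketch` is a hypothetical class, Java parallel streams stand in for PMI, and an ordinary map holds the intermediate results.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustrative only: word count as a parallel map phase and a parallel reduce phase.
final class InMemoryWordCountSketch {
    static Map<String, Integer> wordCount(List<String> lines) {
        // Phase 1 (map): emit (word, 1) pairs in parallel and group them by key in memory.
        Map<String, List<Integer>> intermediate = lines.parallelStream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(word -> word,
                        Collectors.mapping(word -> 1, Collectors.toList())));

        // Phase 2 (reduce): sum the values for each key in parallel.
        Map<String, Integer> counts = intermediate.entrySet().parallelStream()
                .collect(Collectors.toMap(Map.Entry::getKey,
                        e -> e.getValue().stream().mapToInt(Integer::intValue).sum()));

        // Sorting is optional; a TreeMap is used here only for readable output.
        return new TreeMap<>(counts);
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("a b a", "b c"))); // {a=2, b=2, c=1}
    }
}
```

No job is scheduled and nothing is written to disk between the phases; the intermediate data stays in memory.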
• IMDG adds grid input format for
accessing key/value pairs held in
the IMDG.
• MapReduce programs optionally
can output results to IMDG with
grid output format.
• Grid Record Reader optimizes
access to key/value pairs to
eliminate network overhead.
• Applications can access and
update key/value pairs as
operational data during analysis.
Accessing IMDG Data for M/R
• IMDG adds Dataset Record Reader (wrapper) to cache HDFS
data during program execution.
• Hadoop automatically retrieves data from ScaleOut IMDG on
subsequent runs.
• Dataset Record Reader
stores and retrieves data
with minimum network
and memory overheads.
• Tests with Terasort
benchmark have
demonstrated 11X
faster access latency
over HDFS without IMDG.
Optional Caching of HDFS Data
IMDG needs multiple in-memory
storage models:
• Named cache, optimized for
rich semantics on large
objects:
• Property-based query
• Distributed locking
• Access from remote grids
• Named map, optimized for
efficient storage and bulk
analysis (e.g., MapReduce):
• Highly efficient object storage
• Pipelined, bulk-access
mechanisms
Optimized In-Memory Storage
In-Memory Named Map:
• Stores key/value pairs in chunks.
• Allows CRUD operations on kvps.
• Automatically organizes chunks into
splits.
• Uses per-split hash table to access
keys and manage multi-valued
keys.
• Stores shuffled data set between
mappers and reducers.
• Pipelines chunks to mappers and
from reducers.
• Optionally uses memory mapped
files to reduce access latency.
• Provides support for sorting keys.
Named Map Optimizations
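A per-split hash table with multi-valued keys, as listed above, could be organized along these lines. The sketch is an assumption-laden simplification: `SplitMapSketch` is a hypothetical class, chunking and pipelining are left out, and the code is not thread-safe.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: key/value pairs partitioned into splits, each with its own hash table.
final class SplitMapSketch<K, V> {
    private final List<Map<K, List<V>>> splits = new ArrayList<>();

    SplitMapSketch(int splitCount) {
        for (int i = 0; i < splitCount; i++) {
            splits.add(new HashMap<>());
        }
    }

    // Choose the split from the key's hash; floorMod keeps the index non-negative.
    private Map<K, List<V>> splitFor(K key) {
        return splits.get(Math.floorMod(key.hashCode(), splits.size()));
    }

    // Append a value; a key may hold several values (a multi-valued key).
    void put(K key, V value) {
        splitFor(key).computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Return all values stored under the key, or an empty list if there are none.
    List<V> get(K key) {
        return splitFor(key).getOrDefault(key, List.of());
    }

    // Number of distinct keys held in one split.
    int keysInSplit(int index) {
        return splits.get(index).size();
    }
}
```

Because every key maps to exactly one split, a mapper or reducer assigned to a split can work on its keys without consulting any other split.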
• Measured performance:
• Startup times reduced to a few milliseconds
• Word count benchmark shows 20X speedup.
• Real-world example shows >40X speedup.
• MapReduce optimizations:
• Optional sorting
• Optional multicast of parameters to mappers
• Optional O(logN) global combining (avoids
single reducer)
• Optional HDFS caching
• Optional reuse of JVMs across jobs
• Current limitations:
• No specific security for multi-tenancy
• Intermediate data must fit in the IMDG
Performance & Optimizations
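The O(log N) global combining mentioned above can be pictured as a binary merge tree. The sketch below is one way to do it, not necessarily how hServer does: `TreeCombineSketch` is a hypothetical class, and the pairs within a round are merged sequentially here although they are independent and could run in parallel.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;

// Illustrative only: combine N partial results pair-wise in about log2(N) rounds.
final class TreeCombineSketch {
    static <R> R combine(List<R> partials, BinaryOperator<R> merge) {
        if (partials.isEmpty()) {
            throw new IllegalArgumentException("no partial results to combine");
        }
        List<R> level = new ArrayList<>(partials);
        while (level.size() > 1) {
            List<R> next = new ArrayList<>();
            // Merge neighbours; an unpaired last element moves up unchanged.
            for (int i = 0; i < level.size(); i += 2) {
                next.add(i + 1 < level.size()
                        ? merge.apply(level.get(i), level.get(i + 1))
                        : level.get(i));
            }
            level = next;
        }
        return level.get(0);
    }

    // Number of rounds needed to reduce n partial results to one.
    static int rounds(int n) {
        int r = 0;
        while (n > 1) {
            n = (n + 1) / 2;
            r++;
        }
        return r;
    }

    public static void main(String[] args) {
        System.out.println(combine(List.of(1, 2, 3, 4, 5), Integer::sum)); // 15
        System.out.println(rounds(8)); // 3
    }
}
```

A single reducer would perform N - 1 merges one after another; the tree performs the same merges in a number of rounds that grows only logarithmically with N.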
• Invocation grids can be re-used across MapReduce jobs:
Accelerating Start-Up Times
public static void main(String argv[]) throws Exception {
    // Configure and load the invocation grid
    InvocationGrid grid = HServerJob.getInvocationGridBuilder("myGrid")
        // Add JAR files as IG dependencies
        .addJar("main-job.jar").addJar("first-library.jar")
        // Add classes as IG dependencies
        .addClass(MyMapper.class).addClass(MyReducer.class)
        // Define custom JVM parameters
        .setJVMParameters("-Xms512M -Xmx1024M")
        .load();
    // Run 10 jobs on the same invocation grid
    for (int i = 0; i < 10; i++) {
        Configuration conf = new Configuration();
        // The preloaded invocation grid is passed as the parameter to the job
        Job job = new HServerJob(conf, "Job number " + i, false, grid);
        //......Configure the job here.........
        // Run the job
        job.waitForCompletion(true);
    }
    // Unload the invocation grid when we are done
    grid.unload();
}
• IMDG can run Apache Hive distribution
unchanged.
• Accelerates queries for datasets hosted in
HDFS or the IMDG:
• Intermediate data must fit within the IMDG.
• Challenges we faced:
• Requires YARN to transparently invoke
MapReduce on IMDG.
• IMDG must use multiple JVMs per server
since Hive tasks are not thread-safe.
• IMDG must support Hadoop’s distributed
cache (required by Hive).
Running Hive on In-Memory Data
• Assume we have a named map called “customers” of
customer objects:
Example: Querying a Named Map
public class Customer implements Serializable
{
private int customerId;
private String firstName;
private String lastName;
private String login;
public int getCustomerId() {
return customerId;}
public String getFirstName() {
return firstName;}
...
}
• Create a table view of a named map:
• Associates class properties with columns.
• Allows properties to be omitted.
• Allows use of custom serialization.
Example: Querying a Named Map
hive> CREATE TABLE
customers (customerid int, firstname string, lastname string, login string)
STORED BY
'com.scaleoutsoftware.soss.hserver.hive.HServerHiveStorageHandler'
TBLPROPERTIES ("hserver.map.name" = "customers");
OK
Time taken: 0.508 seconds
• Now query the named map:
Example: Querying a Named Map
hive> SELECT * FROM customers;
..............................
1 Eduardo Hazelrigg ehazelrigg
13 Serena Sadberry ssadberry
9 Ermelinda Manganaro emanganaro
5 Edda Speir espeir
17 Tomeka Stovall tstovall
21 Luciano Perkinson lperkinson
25 Jacob Garrow jgarrow
33 Quincy Kreutzer qkreutzer
37 Iona Speir ispeir
41 Ermelinda Thielen ethielen
Time taken: 0.475 seconds, Fetched: 100 row(s)
The Challenge: Operational intelligence to quickly evaluate
and respond to sub-second market changes:
• Hedge fund tracks a set of hedging strategies:
• Strategies can cover various market
sectors, such as high-tech, automotive,
energy, consumer, real estate, etc.
• Each strategy contains list of holdings
and rules for managing the holdings
(such as target allocations).
• Updates to market data
continuously arrive during
the trading day.
• Challenge: The hedge fund must be able to quickly update and
analyze its hedging strategies and provide alerts to traders.
Demo of the Finserv Application
• Delivers a stream of alerts to traders
within a few seconds.
• Enables the trader to examine strategy details in real time:
Output: Real-Time Alerts
• Video Link
Video
Fast map/reduce reconciles inventory and order systems
for an online retailer:
• Challenge: Inventory and online
order management are handled
by different applications.
• Reconciled once per day.
• Inaccurate orders reduce margins.
• Solution:
• Host SKUs in IMDG updated in real
time by order & inventory systems.
• Use MapReduce to reconcile in two minutes.
• Results: Real-time reconciliation ensures accurate orders.
Example in Ecommerce: Inventory
Management
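A reconciliation pass of the kind described here might be structured as below. The slide gives no implementation details, so everything in the sketch is an assumption: the `OrderLine` type, the SKU strings, and the rule that a shortfall is ordered quantity minus stock on hand.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustrative only: total the orders per SKU in parallel, then compare with stock on hand.
final class ReconcileSketch {
    // Hypothetical order record: units ordered for one SKU.
    static final class OrderLine {
        final String sku;
        final int quantity;

        OrderLine(String sku, int quantity) {
            this.sku = sku;
            this.quantity = quantity;
        }
    }

    // Returns, for each oversold SKU, how many ordered units exceed the stock on hand.
    static Map<String, Integer> shortfalls(List<OrderLine> orders, Map<String, Integer> onHand) {
        // Map + reduce: group order lines by SKU and sum their quantities.
        Map<String, Integer> ordered = orders.parallelStream()
                .collect(Collectors.groupingBy(o -> o.sku,
                        Collectors.summingInt(o -> o.quantity)));

        Map<String, Integer> result = new TreeMap<>();
        ordered.forEach((sku, quantity) -> {
            int stock = onHand.getOrDefault(sku, 0);
            if (quantity > stock) {
                result.put(sku, quantity - stock);
            }
        });
        return result;
    }

    public static void main(String[] args) {
        List<OrderLine> orders = List.of(
                new OrderLine("A", 3), new OrderLine("A", 4), new OrderLine("B", 1));
        System.out.println(shortfalls(orders, Map.of("A", 5, "B", 2))); // {A=2}
    }
}
```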
• IMDG holds customer
information for active
Web users.
• IMDG saves/retrieves
customer information
from backing store.
• Web browsers send
activity information to
analytics engine.
• IMDG updates customer history and
preferences.
• Analytics engine identifies browsing and
buying patterns.
• Analytics engine makes suggestions in
real time and also sends email follow-ups.
Example: Web Shopping
• Track
connectivity
issues.
• Obtain time-
sensitive
business data.
• Offer enhanced
services.
• Increase
security.
Example: Telecommunications
Optimize Operations
Customer Experience
Historical queries
for real-time data
enrichment
Stream
persistence for
future analysis
Network
Elements
• Online systems need operational
intelligence on “live” data for
immediate feedback.
• Operational intelligence can be
implemented using standard
data-parallel computing
techniques, such as M/R.
• In-memory data grids provide
an excellent platform for
operational intelligence:
• Host and update “live” data.
• Implement high availability.
• Offer fast, data-parallel
computation for immediate
feedback.
Recap
Additional Information
• ScaleOut StateServer®
• In-Memory Data Grid for Windows and
Linux
• Scales application performance.
• Industry-leading performance and ease of use
• ScaleOut GeoServer® adds
• WAN based data replication for DR
• Breakthrough technology for global
data access
• ScaleOut Analytics Server® adds
• Real-time data analysis for “live” data
• Comprehensive management tools
• ScaleOut hServer®
• Full Hadoop Map/Reduce engine (>20X faster*)
• Hadoop Map/Reduce on live, in-memory data
ScaleOut Software Products
[Diagram: ScaleOut StateServer In-Memory Data Grid, composed of four grid services]
*in benchmark testing
Many Use Cases:
• Authorizations / Payment
Processing / Mobile Payments
• Service Activation
• Inventory Management
• Sensor Data / SCADA
• Real Time Tracking
• Fraud Detection
• Situational Awareness
• Churn Management
• Market Feed / Event Handlers
• Execution Rules
• Financial: Risk, P&L, Pricing
• Operational Risk Compliance
The Need for Real-Time Analytics
Across Key Industries:
• CPG
• Financial
• Telco
• Retail
• Utilities
• Manufacturing
• Logistics
• IC / DoD
• Life Sciences
• Government
• Health Care
• Law enforcement
• Brick and mortar stores need to compete with online experience.
• Point-of-sale identifies opt-in customers to analytics engine.
• RFID tags identify product selection and availability in showroom.
• Analytics engine sends real-time advisories to sales staff via tablet.
Example: Retail Shopping
• Typically used for very large, static, offline datasets
• Data must be copied from disk-based storage (e.g., HDFS) into
memory for analysis.
• Hadoop Map/Reduce adds lengthy batch scheduling and data
shuffling overhead.
Problem: Hadoop Cannot Efficiently
Perform Real-Time Analytics
// This job will run using the Hadoop
// job tracker:
public static void main(String[] args)
throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(
TextInputFormat.class);
job.setOutputFormatClass(
TextOutputFormat.class);
FileInputFormat.addInputPath(
job, new Path(args[0]));
FileOutputFormat.setOutputPath(
job, new Path(args[1]));
job.waitForCompletion(true);
}
// This job will run using ScaleOut hServer:
public static void main(String[] args)
throws Exception {
Configuration conf = new Configuration();
Job job = new HServerJob(conf, "wordcount");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(
TextInputFormat.class);
job.setOutputFormatClass(
TextOutputFormat.class);
FileInputFormat.addInputPath(
job, new Path(args[0]));
FileOutputFormat.setOutputPath(
job, new Path(args[1]));
job.waitForCompletion(true);
}
Configuring MapReduce for the IMDG
• Without YARN, subclass the Hadoop Job class with a one-line change (below).
• With YARN, just replace the MapReduce execution framework.
• Mark class properties as indexes for query:
• Define a query using these properties:
Parallel Query Example (C#)
class Stock {
[SossIndex]
public string Ticker { get; set; }
public decimal TotalShares { get; set; }
public decimal Price { get; set; }}
NamedCache cache = CacheFactory.GetCache("Stocks");
var q = from s in cache.QueryObjects<Stock>()
where s.Ticker == "GOOG" || s.Ticker == "ORCL"
select s;
Console.WriteLine("{0} Stocks found", q.Count());
• Create method to analyze each queried stock object:
• Create method to pair-wise merge the results:
Example of Analysis Code (C#)
// "params" is a reserved keyword in C#, so the parameter is named calcParams
static decimal eval(Stock stock, StockCalcParams calcParams)
{
    return stock.Price * stock.TotalShares;
}

static decimal merge(decimal r1, decimal r2)
{
    return r1 + r2;
}
• Run a parallel method invocation:
Invoking the Parallel Analysis (C#)
NamedCache cache = CacheFactory.GetCache("Stocks");
decimal valueOfSelectedStocks =
    (from s in cache.QueryObjects<Stock>()
     where s.Ticker == "GOOG" || s.Ticker == "ORCL"
     select s)
    .Invoke(new StockCalcParams(…),
        new Func<Stock, StockCalcParams, decimal>(eval))
    .Merge(new Func<decimal, decimal, decimal>(merge));
Console.WriteLine("The value of selected stocks is {0}",
    valueOfSelectedStocks);

PPT
Teaching material agriculture food technology
PDF
Machine learning based COVID-19 study performance prediction
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PDF
cuic standard and advanced reporting.pdf
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
CIFDAQ's Market Insight: SEC Turns Pro Crypto
Cloud computing and distributed systems.
NewMind AI Monthly Chronicles - July 2025
Advanced methodologies resolving dimensionality complications for autism neur...
Review of recent advances in non-invasive hemoglobin estimation
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
Understanding_Digital_Forensics_Presentation.pptx
The Rise and Fall of 3GPP – Time for a Sabbatical?
Encapsulation theory and applications.pdf
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
Encapsulation_ Review paper, used for researhc scholars
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
Digital-Transformation-Roadmap-for-Companies.pptx
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Teaching material agriculture food technology
Machine learning based COVID-19 study performance prediction
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
cuic standard and advanced reporting.pdf
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy

Making Hadoop Realtime by Dr. William Bain of ScaleOut Software

  • 1. Enabling Operational Intelligence: Making Hadoop Real-Time Copyright © 2014 by ScaleOut Software, Inc. Los Angeles Big Data Users Group August 6, 2014 Bill Bain, CEO (wbain@scaleoutsoftware.com)
  • 2. 2 ScaleOut Software, Inc. • Operational Intelligence vs. Business Intelligence • Data-Parallel Computation • Implementing Using an In-Memory Data Grid: • Distributing the Data Across a Cluster • Running the Computation using “Parallel Method Invocation” • An Example in Financial Services • Implementing In-Memory Hadoop MapReduce • Video Demo • Examples of Applications in Operational Intelligence Agenda
  • 3. 3 ScaleOut Software, Inc. • Develops and markets In-Memory Data Grids, software middleware for: • Scaling application performance and • Providing operational intelligence using • In-memory data storage and computing • Dr. William Bain, Founder & CEO • Career focused on parallel computing – Bell Labs, Intel, Microsoft • 3 prior start-ups, last acquired by Microsoft and product now ships as Network Load Balancing in Windows Server • Eight years in the market; 400 customers, 10,000 servers • Sample customers: About ScaleOut Software
  • 4. 4 ScaleOut Software, Inc. Goal: Provide immediate feedback to a system handling live data. A few examples: • Equity trading: to minimize risk during a trading day • Ecommerce: for real-time recommendations • Reservations systems: to identify issues, reroute, etc. • Credit cards & wire transfers: to detect fraud in real time • Smart grids: to optimize power distribution & detect issues Online Systems Need Operational Intelligence
  • 5. 5 ScaleOut Software, Inc. Big Data Analytics: Real-Time vs. Batch Analytics. Batch “Business Intelligence” (Hadoop, IBM, Teradata, SAS, SAP): • Static data sets • Petabytes • Disk storage • Hours to minutes • Best uses: analyzing warehoused data; mining for long-term trends. Real-time “Operational Intelligence” (Analytics Server, hServer): • Live data sets • Gigabytes to terabytes • In-memory storage • Minutes to seconds • Best uses: tracking live data; immediately identifying trends and capturing opportunities; providing immediate feedback.
  • 6. 6 ScaleOut Software, Inc. • Operational intelligence can co-exist with business intelligence: • Processes streaming data close to its sources. • Provides real-time, “tactical” feedback (e.g., recommendations, alerts). • Translates data for storage in the data warehouse (ETL). • Data warehouse provides “strategic” guidance. • Using the same tool set (e.g., Hadoop MapReduce) lowers TCO: • Leverages common skill set. • Simplifies design (e.g., loading data into HDFS). Integrated View of Analytics
  • 7. 7 ScaleOut Software, Inc. • To keep up with fast growing “live” workloads & maintain fast response times: • Ex.: Handle incoming data streams in real time. • Ex.: Process updates to data set based on incoming data. • To identify and respond to trends in fast-changing data: • Ex.: Evaluate data set changes in real time. • Ex.: Respond to identified patterns within seconds. Challenges for Operational Intelligence [Chart: Growth in Web Servers, in millions, 1995 to 2010. Source: Netcraft] [Chart: Growth in “Big Data”, in exabytes, 2000 to 2013] “More data has been created in the past three years than in the past 40,000.”
  • 8. 8 ScaleOut Software, Inc. The Solution: Data-Parallel Computing • Straightforward, well understood model of parallel computation • An alternative to task-parallel computation (e.g., Storm) • Simple: runs the same code on multiple, in-memory data items. • Powerful: maintains a “live,” in-memory model of a real-world system • Fast: avoids data motion which lowers speedup. Analyze Data (Eval) Combine Results (Merge) Server Cluster
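The eval/merge model on this slide can be sketched in plain Java. This is a minimal single-process illustration, not ScaleOut's API: the class `EvalMergeSketch`, the threshold test, and the sample prices are all assumptions made for the example, and a parallel stream stands in for the server cluster.

```java
import java.util.List;

// Hypothetical illustration: "eval" runs the same code on every in-memory item,
// "merge" combines two partial results into one.
class EvalMergeSketch {
    // Eval: analyze one data item (here: is the price above the threshold?).
    static long eval(double price, double threshold) {
        return price > threshold ? 1L : 0L;
    }

    // Merge: pair-wise, associative combination of two partial results.
    static long merge(long a, long b) {
        return a + b;
    }

    // Runs eval on all items in parallel and folds the results with merge.
    static long countAbove(List<Double> prices, double threshold) {
        return prices.parallelStream()
                     .map(p -> eval(p, threshold))
                     .reduce(0L, EvalMergeSketch::merge);
    }

    public static void main(String[] args) {
        List<Double> prices = List.of(10.0, 55.5, 72.25, 3.0, 90.0);
        // Counts the items priced above 50.0.
        System.out.println(countAbove(prices, 50.0));
    }
}
```

Because `merge` is associative, partial results can be combined in any grouping, which is what lets the work be split across cores or servers without changing the answer.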
  • 9. 9 ScaleOut Software, Inc. Track “live” data and analyze in real-time: Implementing OI In-Memory State NoSQL Storage Real-Time Data Parallel Analysis
  • 10. 10 ScaleOut Software, Inc. • Storm implements pipelined execution of tasks by “bolts” on incoming data streams. • Streams can be distributed to bolts with configurable mappings. • Developer controls the number of tasks per bolt. • Storm uses a centralized master node and Zookeeper for fault-tolerance. • Key strength: continuous processing of input streams • Issues: • Complexity / tuning • Minimizing data motion • Managing global state Quick Comparison to Storm
  • 11. 11 ScaleOut Software, Inc. Data-Parallel Enables Linear Speedup Avoids data motion (network or disk I/O) which limits throughput:
  • 12. 12 ScaleOut Software, Inc. Data-Parallel Computing Is Not New • 1980’s: Special Purpose Hardware: “SIMD” Thinking Machines Connection Machine 5 • 1990’s: General Purpose Parallel Supercomputers: “Domain Decomposition”, “SPMD” Intel IPSC-2 IBM SP1
  • 13. 13 ScaleOut Software, Inc. Data-Parallel Computing Is Not New • 1990’s – early 2000’s: HPC on Clusters: “MPI” • Since 2003: Clusters, the Cloud, and IMDGs: “MapReduce” HP Blade Servers Amazon EC2, Windows Azure
  • 14. 14 ScaleOut Software, Inc. • In-memory data grid (IMDG) holds active entities undergoing state changes in memory. • Backing store optionally holds large population of entities. • IMDG processes incoming stream of state changes. • Analytics engine examines entities in real time and generates alerts within seconds as needed. A Data-Parallel Architecture for Operational Intelligence
  • 15. 15 ScaleOut Software, Inc. In-Memory Data Grid (IMDG) stores “live” data in a cluster: • Fits in the business logic layer: • Follows object-oriented view of data (vs. relational view). • Stores collections of Java/.NET objects shared by multiple clients. • Uses create/read/update/delete and query APIs to access data. • Implemented across a cluster of servers or VMs: • Scales storage and throughput by adding servers. • Provides high availability in case a server fails. In-Memory Data Grid for Live Data
  • 16. 16 ScaleOut Software, Inc. • IMDG’s collections of objects act like in-memory collections: • Unstructured, typically instances of a class (stored as serialized blobs) • Individually accessible / update-able • IMDG adds attributes: • Accessible by global key • Query-able by properties • Highly available • Optional timeouts • Distributed locking • Integration with a backing store • Optional dependency relationships • Asynchronous event handling IMDGs Store “Live” Data Basic “CRUD” APIs: • Create(key, obj, tout) • Read(key) • Update(key, obj) • Delete(key) and… • Lock(key) • Unlock(key) Object key
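To make the CRUD-plus-locking surface concrete, here is a single-process stand-in written with standard Java collections. It is not the ScaleOut API: `LocalGridSketch` and its method signatures are hypothetical, the `tout` (timeout) argument from the slide is omitted, and nothing here is distributed or highly available.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Single-process stand-in for the CRUD and lock calls listed on the slide.
// Names and behavior are illustrative assumptions, not a real grid client.
class LocalGridSketch<V> {
    private final Map<String, V> objects = new ConcurrentHashMap<>();
    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    // Create: stores the object only if the key is not already present.
    boolean create(String key, V obj) { return objects.putIfAbsent(key, obj) == null; }

    // Read: returns the stored object, or null if the key is absent.
    V read(String key) { return objects.get(key); }

    // Update: replaces the object only if the key already exists.
    boolean update(String key, V obj) { return objects.replace(key, obj) != null; }

    // Delete: removes the object; returns true if something was removed.
    boolean delete(String key) { return objects.remove(key) != null; }

    // Lock: per-key mutual exclusion for read-modify-write sequences.
    void lock(String key) { locks.computeIfAbsent(key, k -> new ReentrantLock()).lock(); }

    // Unlock: releases the per-key lock if one was ever created for this key.
    void unlock(String key) {
        ReentrantLock l = locks.get(key);
        if (l != null) l.unlock();
    }

    public static void main(String[] args) {
        LocalGridSketch<Integer> grid = new LocalGridSketch<>();
        grid.create("sku-1", 10);
        grid.lock("sku-1");
        try {
            grid.update("sku-1", grid.read("sku-1") - 1); // guarded decrement
        } finally {
            grid.unlock("sku-1");
        }
        System.out.println(grid.read("sku-1"));
    }
}
```

The lock/update/unlock sequence in `main` shows why the slide lists locking next to CRUD: an update that depends on the current value needs exclusive access to that key while it runs.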
  • 17. 17 ScaleOut Software, Inc. Spark / Spark Streaming from U.C. Berkeley AMPLab: • In-memory computing to accelerate and extend Hadoop MapReduce using data-parallel operators in Scala. • Stores data as “resilient distributed datasets” (RDDs): • Distributed across cluster • Immutable • Hold data from/output to HDFS. • Store data stream as a sequence of RDDs. • Comparison to IMDG: • Not designed for “live” data: • Lacks CRUD on individual objects. • Lacks high availability. • Designed for “data parallel” operators. Quick Comparison to Spark
  • 18. 18 ScaleOut Software, Inc. Data-Parallel Computing Using PMI “Parallel Method Invocation” (PMI): an object-oriented version of data-parallel computing from the HPC community: • Serves as a platform for MapReduce and other data-parallel operators. • Selects objects using a parallel query on data hosted in the IMDG. • Runs user-defined methods in parallel across the cluster. Analyze Data (Eval) Combine Results (Merge) In-Memory Data Grid Runs Data-Parallel Computation.
  • 19. 19 ScaleOut Software, Inc. Integrate analysis into a stock trading platform: • The IMDG holds market data and hedging strategies. • Updates to market data continuously flow through the IMDG. • The IMDG performs repeated data-parallel analysis on hedging strategies and alerts traders in real time. • IMDG automatically and dynamically scales its throughput to handle new hedging strategies by adding servers. Example in Financial Services
  • 20. 20 ScaleOut Software, Inc. Selects all relevant objects in a distributed collection. • Query spec matches data’s object-oriented properties. • Selected objects are fed to the analysis engine on the local server. Step 1: Select with Parallel Query
  • 21. 21 ScaleOut Software, Inc. Java Example: Parallel Query public class Portfolio { private long id; private Set<Stock> longPositions; private Set<Stock> shortPositions; private double totalValue; private Region region; private boolean alerted; // alert for trading @SossIndexAttribute // query-able property public double getTotalValue() {…} @SossIndexAttribute // query-able property public Region getRegion() {…} public Set<Long> evalPositions(MarketSnapshot ms) {…} } NamedCache pset = CacheFactory.getCache("portfolios"); Set<Portfolio> res = pset.queryObjects(Portfolio.class, and(greaterThan("totalValue", 1000000), equals("region", Region.US)));
  • 22. 22 ScaleOut Software, Inc. • Create method to analyze a queried portfolio and another method to pair-wise merge the result sets of alerted portfolios: Java Example: Parallel Method Invocation public class PortfolioAnalysis implements Invokable<Portfolio, MarketSnapshot, Set<Long>> { public Set<Long> eval(Portfolio p, MarketSnapshot ms) throws InvokeException { // update portfolio and return id if alerted: return p.evalPositions(ms); } public Set<Long> merge(Set<Long> set1, Set<Long> set2) throws InvokeException { set1.addAll(set2); return set1; // merged set of alerted portfolio ids }}
  • 23. 23 ScaleOut Software, Inc. • Run a parallel method invocation on a queried set of portfolios and return set of ids for alerted portfolios: Java Example: Parallel Method Invocation NamedCache pset = CacheFactory.getCache("portfolios"); InvokeResult alertedPortfolios = pset.invoke( PortfolioAnalysis.class, Portfolio.class, and(greaterThan("totalValue", 1000000), // query spec equals("region", Region.US)), marketSnapshot, // parameters ... ); System.out.println("The alerted portfolios are " + alertedPortfolios.getResult());
  • 24. 24 ScaleOut Software, Inc. • IMDG ships user’s code and libraries to its servers. • IMDG automatically schedules analysis operations across all grid servers and cores: • The analysis runs on all objects selected by the parallel query. • Each grid server analyzes its locally stored objects to minimize data motion. • Parallel execution ensures fast completion time: • IMDG automatically distributes workload across servers/cores. • Scaling the IMDG automatically handles larger data sets. Running the Analysis
  • 25. 25 ScaleOut Software, Inc. • The IMDG automatically merges all analysis results: • The IMDG first merges all results within each grid server in parallel. • It then merges results across all grid servers to create one combined result. • Efficient parallel merge minimizes the delay in combining all results. • The IMDG delivers the combined result to the invoking application as one object. Merging the Results
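The two-level merge described on this slide can be sketched as follows. This is an assumption-laden illustration, not the IMDG's implementation: the class `TwoLevelMergeSketch`, the three "servers", and the portfolio ids are invented for the example, and the rounds run sequentially here, whereas on a real cluster the pair-wise merges within a round would run concurrently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.BinaryOperator;

class TwoLevelMergeSketch {
    // Level 1: fold all results produced on one server into a single partial result.
    // Assumes the list is non-empty.
    static <R> R mergeWithinServer(List<R> results, BinaryOperator<R> merge) {
        R acc = results.get(0);
        for (int i = 1; i < results.size(); i++) {
            acc = merge.apply(acc, results.get(i));
        }
        return acc;
    }

    // Level 2: merge per-server partials pair-wise in rounds,
    // roughly log2(N) rounds for N servers. Assumes the list is non-empty.
    static <R> R mergeAcrossServers(List<R> partials, BinaryOperator<R> merge) {
        List<R> round = new ArrayList<>(partials);
        while (round.size() > 1) {
            List<R> next = new ArrayList<>();
            for (int i = 0; i < round.size(); i += 2) {
                next.add(i + 1 < round.size()
                        ? merge.apply(round.get(i), round.get(i + 1))
                        : round.get(i)); // odd one out advances unchanged
            }
            round = next;
        }
        return round.get(0);
    }

    // User-defined merge: union of two sets of alerted ids (inputs are not modified).
    static Set<Long> union(Set<Long> a, Set<Long> b) {
        Set<Long> out = new TreeSet<>(a);
        out.addAll(b);
        return out;
    }

    static Set<Long> demo() {
        // Three hypothetical servers, each holding the eval results of its local objects.
        List<List<Set<Long>>> perServer = List.of(
                List.of(Set.of(1L), Set.of(2L, 3L)),
                List.of(Set.of(7L)),
                List.of(Set.of(4L), Set.of(9L)));
        List<Set<Long>> partials = new ArrayList<>();
        for (List<Set<Long>> local : perServer) {
            partials.add(mergeWithinServer(local, TwoLevelMergeSketch::union));
        }
        return mergeAcrossServers(partials, TwoLevelMergeSketch::union);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Merging locally first keeps most of the combining work on the server that produced the results, so only one partial result per server has to cross the network.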
  • 26. 26 ScaleOut Software, Inc. • Measured a similar financial services application (back testing stock trading strategies on stock histories) • Hosted IMDG in Amazon EC2 using 75 servers holding 1 TB of stock history data in memory • IMDG handled a continuous stream of updates (1.1 GB/s) • Results: analyzed 1 TB in 4.1 seconds (250 GB/s) with linear scaling Sample Performance Results for PMI
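As a sanity check on the figures above, the aggregate rate follows directly from the data size and the elapsed time. The arithmetic below assumes 1 TB = 1024 GB, which reproduces the slide's rounded 250 GB/s; the per-server figure is a derived estimate that assumes the load is spread evenly across the 75 servers, and is not a number reported on the slide.

```latex
\frac{1\ \text{TB}}{4.1\ \text{s}}
  = \frac{1024\ \text{GB}}{4.1\ \text{s}}
  \approx 250\ \text{GB/s},
\qquad
\frac{250\ \text{GB/s}}{75\ \text{servers}}
  \approx 3.3\ \text{GB/s per server}
```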
  • 27. 27 ScaleOut Software, Inc. Benefits: • Enables use of Hadoop MapReduce for operational intelligence. • Accelerates data access by holding data in memory. • Analyzes and updates “live” data. • Reduces overheads of standard Hadoop distributions: • Batch scheduling • Disk access • Data shuffling • Mandatory key sorting • Enables new features, e.g.: • Global combining, optional sorting Using PMI to Implement “In-Memory” Hadoop MapReduce
  • 28. 28 ScaleOut Software, Inc. • A Hadoop distribution does not have to be installed unless HDFS is used. • The developer starts MapReduce applications from a remote workstation. • The IMDG automatically builds a reusable “invocation grid” of JVMs on the grid’s servers for PMI and ships the application’s jars. • Results are stored in the IMDG, HDFS, or optionally globally merged and returned to the remote workstation. Running MapReduce on the IMDG
  • 29. 29 ScaleOut Software, Inc. Run In-Memory MR with YARN • YARN transparently integrates batch and in-memory MapReduce into a single execution framework with shared access to HDFS. • For example, hServer can transparently run Apache Hive in-memory. Example of ScaleOut hServer with Hortonworks Example of Hive Running on hServer
  • 30. 30 ScaleOut Software, Inc. Run MapReduce as two PMI phases: • Data can be input from either the IMDG or an external data source. • Works with any input/output format compatible with the Apache distribution. • IMDG uses its data-parallel execution engine (PMI) to invoke the mappers and the reducers. • Eliminates batch scheduling overhead. • Intermediate results are stored within the IMDG. • Minimizes data motion between the mappers and reducers. • Allows optional sorting. • Output of a single reducer/combiner optionally can be globally merged. Implementing MapReduce
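The idea of running MapReduce as two data-parallel phases with in-memory intermediate results can be sketched in plain Java. This is only an illustration under stated assumptions: `TwoPhaseWordCountSketch` is a hypothetical name, a `ConcurrentHashMap` stands in for the grid's intermediate store, parallel streams stand in for PMI, and none of the Hadoop or hServer classes are used.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.stream.Collectors;

class TwoPhaseWordCountSketch {
    // Phase 1 (map): each input line is processed in parallel. Emitted (word, 1) pairs
    // are grouped by key in an in-memory map that stands in for the intermediate store.
    static Map<String, ConcurrentLinkedQueue<Integer>> mapPhase(List<String> lines) {
        Map<String, ConcurrentLinkedQueue<Integer>> intermediate = new ConcurrentHashMap<>();
        lines.parallelStream().forEach(line -> {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    intermediate.computeIfAbsent(word, k -> new ConcurrentLinkedQueue<>()).add(1);
                }
            }
        });
        return intermediate;
    }

    // Phase 2 (reduce): each key's values are summed in parallel; keys are never sorted.
    static Map<String, Integer> reducePhase(Map<String, ConcurrentLinkedQueue<Integer>> intermediate) {
        return intermediate.entrySet().parallelStream()
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        e -> e.getValue().stream().mapToInt(Integer::intValue).sum()));
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        return reducePhase(mapPhase(lines));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("live data in memory", "data in motion")));
    }
}
```

Grouping by key in a hash map rather than sorting mirrors the slide's point that sorting becomes optional once intermediate results live in memory.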
  • 31. 31 ScaleOut Software, Inc. • IMDG adds grid input format for accessing key/value pairs held in the IMDG. • MapReduce programs optionally can output results to IMDG with grid output format. • Grid Record Reader optimizes access to key/value pairs to eliminate network overhead. • Applications can access and update key/value pairs as operational data during analysis. Accessing IMDG Data for M/R
  • 32. 32 ScaleOut Software, Inc. • IMDG adds Dataset Record Reader (wrapper) to cache HDFS data during program execution. • Hadoop automatically retrieves data from ScaleOut IMDG on subsequent runs. • Dataset Record Reader stores and retrieves data with minimum network and memory overheads. • Tests with Terasort benchmark have demonstrated 11X faster access latency over HDFS without IMDG. Optional Caching of HDFS Data
  • 33. 33 ScaleOut Software, Inc. IMDG needs multiple in-memory storage models: • Named cache, optimized for rich semantics on large objects: • Property-based query • Distributed locking • Access from remote grids • Named map, optimized for efficient storage and bulk analysis (e.g., MapReduce): • Highly efficient object storage • Pipelined, bulk-access mechanisms Optimized In-Memory Storage
  • 34. 34 ScaleOut Software, Inc. In-Memory Named Map: • Stores key/value pairs in chunks. • Allows CRUD operations on kvps. • Automatically organizes chunks into splits. • Uses per-split hash table to access keys and manage multi-valued keys. • Stores shuffled data set between mappers and reducers. • Pipelines chunks to mappers and from reducers. • Optionally uses memory mapped files to reduce access latency. • Provides support for sorting keys. Named Map Optimizations
  • 35. 35 ScaleOut Software, Inc. • Measured performance: • Startup times reduced to a few milliseconds • Word count benchmark shows 20X speedup. • Real-world example shows >40X speedup. • MapReduce optimizations: • Optional sorting • Optional multicast of parameters to mappers • Optional O(logN) global combining (avoids single reducer) • Optional HDFS caching • Optional reuse of JVMs across jobs • Current limitations: • No specific security for multi-tenancy • Intermediate data must fit in the IMDG Performance & Optimizations
  • 36. 36 ScaleOut Software, Inc. • Invocation grids can be re-used across MapReduce jobs: Accelerating Start-Up Times public static void main(String argv[]) throws Exception { //Configure and load the invocation grid InvocationGrid grid = HServerJob.getInvocationGridBuilder("myGrid"). // Add JAR files as IG dependencies addJar("main-job.jar"). addJar("first-library.jar"). // Add classes as IG dependencies addClass(MyMapper.class). addClass(MyReducer.class). // Define custom JVM parameters setJVMParameters("-Xms512M -Xmx1024M"). load(); //Run 10 jobs on the same invocation grid for(int i=0; i<10; i++) { Configuration conf = new Configuration(); //The preloaded invocation grid is passed as the parameter to the job Job job = new HServerJob(conf, "Job number "+i, false, grid); //......Configure the job here......... //Run the job job.waitForCompletion(true); } //Unload the invocation grid when we are done grid.unload(); }
  • 37. 37 ScaleOut Software, Inc. • IMDG can run Apache Hive distribution unchanged. • Accelerates queries for datasets hosted in HDFS or the IMDG: • Intermediate data must fit within the IMDG. • Challenges we faced: • Requires YARN to transparently invoke MapReduce on IMDG. • IMDG must use multiple JVMs per server since Hive tasks are not thread-safe. • IMDG must support Hadoop’s distributed cache (required by Hive). Running Hive on In-Memory Data
  • 38. 38 ScaleOut Software, Inc. • Assume we have a named map called “customers” of customer objects: Example: Querying a Named Map public class Customer implements Serializable { private int customerId; private String firstName; private String lastName; private String login; public int getCustomerId() { return customerId;} public String getFirstName() { return firstName;} ... }
  • 39. 39 ScaleOut Software, Inc. • Create a table view of a named map: • Associates class properties with columns. • Allows properties to be omitted. • Allows use of custom serialization. Example: Querying a Named Map hive> CREATE TABLE customers (customerid int, firstname string, lastname string, login string) STORED BY 'com.scaleoutsoftware.soss.hserver.hive.HServerHiveStorageHandler' TBLPROPERTIES ("hserver.map.name" = "customers"); OK Time taken: 0.508 seconds
  • 40. 40 ScaleOut Software, Inc. • Now query the named map: Example: Querying a Named Map hive> SELECT * FROM customers; .............................. 1 Eduardo Hazelrigg ehazelrigg 13 Serena Sadberry ssadberry 9 Ermelinda Manganaro emanganaro 5 Edda Speir espeir 17 Tomeka Stovall tstovall 21 Luciano Perkinson lperkinson 25 Jacob Garrow jgarrow 33 Quincy Kreutzer qkreutzer 37 Iona Speir ispeir 41 Ermelinda Thielen ethielen Time taken: 0.475 seconds, Fetched: 100 row(s)
  • 41. 41 ScaleOut Software, Inc. The Challenge: Operational intelligence to quickly evaluate and respond to sub-second market changes: • Hedge fund tracks a set of hedging strategies: • Strategies can cover various market sectors, such as high-tech, automotive, energy, consumer, real estate, etc. • Each strategy contains list of holdings and rules for managing the holdings (such as target allocations). • Updates to market data continuously arrive during the trading day. • Challenge: The hedge fund must be able to quickly update and analyze its hedging strategies and provide alerts to traders. Demo of the Finserv Application
  • 42. 42 ScaleOut Software, Inc. • Delivers a stream of alerts to traders within a few seconds. • Enables the trader to examine strategy details in real time: Output: Real-Time Alerts
  • 43. 43 ScaleOut Software, Inc. • Video Link Video
  • 44. 44 ScaleOut Software, Inc. Fast map/reduce reconciles inventory and order systems for an online retailer: • Challenge: Inventory and online order management are handled by different applications. • Reconciled once per day. • Inaccurate orders reduce margins. • Solution: • Host SKUs in IMDG updated in real time by order & inventory systems. • Use MapReduce to reconcile in two minutes. • Results: Real-time reconciliation ensures accurate orders. Example in Ecommerce: Inventory Management
  • 45. 45 ScaleOut Software, Inc. • IMDG holds customer information for active Web users. • IMDG saves/retrieves customer information from backing store. • Web browsers send activity information to analytics engine. • IMDG updates customer history and preferences. • Analytics engine identifies browsing and buying patterns. • Analytics engine makes suggestions in real-time. Also sends email follow-ups. Example: Web Shopping
  • 46. 46 ScaleOut Software, Inc. • Track connectivity issues. • Obtain time-sensitive business data. • Offer enhanced services. • Increase security. Example: Telecommunications Optimize Operations Customer Experience Historical queries for real-time data enrichment Stream persistence for future analysis Network Elements
  • 47. 47 ScaleOut Software, Inc. • Online systems need operational intelligence on “live” data for immediate feedback. • Operational intelligence can be implemented using standard data-parallel computing techniques, such as M/R. • In-memory data grids provide an excellent platform for operational intelligence: • Host and update “live” data. • Implement high availability. • Offer fast, data-parallel computation for immediate feedback. Recap
  • 49. 49 ScaleOut Software, Inc. • ScaleOut StateServer® • In-Memory Data Grid for Windows and Linux • Scales application performance. • Industry-leading performance and ease of use • ScaleOut GeoServer® adds • WAN based data replication for DR • Breakthrough technology for global data access • ScaleOut Analytics Server® adds • Real-time data analysis for “live” data • Comprehensive management tools • ScaleOut hServer® • Full Hadoop Map/Reduce engine (>20X faster*) • Hadoop Map/Reduce on live, in-memory data ScaleOut Software Products ScaleOut StateServer In-Memory Data Grid Grid Service Grid Service Grid Service Grid Service *in benchmark testing
  • 50. 50 ScaleOut Software, Inc. Many Use Cases: • Authorizations / Payment Processing / Mobile Payments • Service Activation • Inventory Management • Sensor Data / SCADA • Real Time Tracking • Fraud Detection • Situational Awareness • Churn Management • Market Feed / Event Handlers • Execution Rules • Financial: Risk, P&L, Pricing • Operational Risk Compliance The Need for Real-Time Analytics Across Key Industries: • CPG • Financial • Telco • Retail • Utilities • Manufacturing • Logistics • IC / DoD • Life Sciences • Government • Health Care • Law enforcement
  • 51. 51 ScaleOut Software, Inc. • Brick and mortar stores need to compete with online experience. • Point-of-sale identifies opt-in customers to analytics engine. • RFID tags identify product selection and availability in showroom. • Analytics engine sends real-time advisories to sales staff via tablet. Example: Retail Shopping
  • 52. 52 ScaleOut Software, Inc. • Typically used for very large, static, offline datasets • Data must be copied from disk-based storage (e.g., HDFS) into memory for analysis. • Hadoop Map/Reduce adds lengthy batch scheduling and data shuffling overhead. Problem: Hadoop Cannot Efficiently Perform Real-Time Analytics
  • 53. 53 ScaleOut Software, Inc. // This job will run using the Hadoop // job tracker: public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = new Job(conf, "wordcount"); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setMapperClass(Map.class); job.setReducerClass(Reduce.class); job.setInputFormatClass( TextInputFormat.class); job.setOutputFormatClass( TextOutputFormat.class); FileInputFormat.addInputPath( job, new Path(args[0])); FileOutputFormat.setOutputPath( job, new Path(args[1])); job.waitForCompletion(true); } // This job will run using ScaleOut hServer: public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); Job job = new HServerJob(conf, "wordcount"); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setMapperClass(Map.class); job.setReducerClass(Reduce.class); job.setInputFormatClass( TextInputFormat.class); job.setOutputFormatClass( TextOutputFormat.class); FileInputFormat.addInputPath( job, new Path(args[0])); FileOutputFormat.setOutputPath( job, new Path(args[1])); job.waitForCompletion(true); } Configuring MapReduce for the IMDG • Without YARN, subclass the Hadoop Job class with a one-line change (below). • With YARN, just replace the MapReduce execution framework.
  • 54. 54 ScaleOut Software, Inc. • Mark class properties as indexes for query: • Define a query using these properties: Parallel Query Example (C#) class Stock { [SossIndex] public string Ticker { get; set; } public decimal TotalShares { get; set; } public decimal Price { get; set; }} NamedCache cache = CacheFactory.GetCache("Stocks"); var q = from s in cache.QueryObjects<Stock>() where s.Ticker == "GOOG" || s.Ticker == "ORCL" select s; Console.WriteLine("{0} Stocks found", q.Count());
  • 55. 55 ScaleOut Software, Inc. • Create method to analyze each queried stock object: • Create method to pair-wise merge the results: Example of Analysis Code (C#) static decimal eval(Stock stock, StockCalcParams calcParams) { return stock.Price * stock.TotalShares; } static decimal merge(decimal r1, decimal r2) { return r1 + r2; }
  • 56. 56 ScaleOut Software, Inc. • Run a parallel method invocation: Invoking the Parallel Analysis (C#) NamedCache cache = CacheFactory.GetCache("Stocks"); decimal valueOfSelectedStocks = (from s in cache.QueryObjects<Stock>() where s.Ticker == "GOOG" || s.Ticker == "ORCL" select s) .Invoke(new StockCalcParams(…), new Func<Stock, StockCalcParams, decimal>(eval)) .Merge(new Func<decimal, decimal, decimal>(merge)); Console.WriteLine("The value of selected stocks is {0}", valueOfSelectedStocks);