The datascientists workplace of the future, IBM developerDays 2014, Vienna by Romeo Kienzler

© 2013 IBM Corporation
The Data Scientists Workplace of the Future - Workshop
SwissRE, 11.6.14
Romeo Kienzler
IBM Center of Excellence for Data Science, Cognitive Systems and BigData
(A joint-venture between IBM Research Zurich and IBM Innovation Center DACH)
Source: http://www.kdnuggets.com/2012/04/data-science-history.jpg

The Data Scientists Workplace of the Future - Credits
Romeo Kienzler
IBM Innovation Center
● Parts of these slides have been copied from and/or revised by
  ● Dr. Anand Ranganathan, IBM Watson Research Lab
  ● Dr. Stefan Mück, IBM BigData Leader Europe
  ● Dr. Berthold Rheinwald, IBM Almaden Research Lab
  ● Dr. Diego Kuonen, Statoo Consulting
  ● Dr. Abdel Labbi, IBM Zurich Research Lab
  ● Brandon MacKenzie, IBM Software Group

What is DataScience?
Source: Statoo.com http://slidesha.re/1kmNiX0

DataScience at present
● Tools (http://blog.revolutionanalytics.com/2014/01/in-data-scientist-survey-r-is-the-most-used-tool-other-than-databases.html)
  ● SQL (42%)
  ● R (33%)
  ● Python (26%)
  ● Excel (25%)
  ● Java, Ruby, C++ (17%)
  ● SPSS, SAS (9%)
● Limitations (Single Node usage)
  ● Main Memory
  ● CPU <> Main Memory Bandwidth
  ● CPU
  ● Storage <> Main Memory Bandwidth (either Single node or SAN)

DataScience at present - Demo
● Assume a 1 TB file on a hard drive
● Split into 16 files (on line boundaries, named x00 .. x15)
    split -d -n l/16 output.json
● Distribute on 4 nodes
    for i in $(seq 0 15); do f=$(printf 'x%02d' "$i"); scp "$f" "id@node$((i % 4 + 1)):~/"; done
● Perform calculation in parallel
    for i in $(seq 0 15); do
      f=$(printf 'x%02d' "$i")
      ssh "id@node$((i % 4 + 1))" "awk -F: '{print \$6}' ~/$f | grep -i samsung | grep -c breathtaking" &
    done > result
    wait
● Merge result
    awk '{s += $1} END {print s}' result
Source: http://sergeytihon.wordpress.com/2013/03/20/the-data-science-venn-diagram/
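The split, distribute, compute and merge steps of this demo can be sketched on a single machine. This is a minimal illustration only: worker threads stand in for the 4 nodes, an in-memory list stands in for the 1 TB file, and the function names (`count_matches`, `split_into_chunks`, `parallel_count`) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def count_matches(lines):
    """Count records whose 6th ':'-separated field matches both keywords."""
    hits = 0
    for line in lines:
        fields = line.split(":")
        if len(fields) < 6:
            continue
        field = fields[5]  # awk's $6 is index 5 in Python
        # "samsung" is matched case-insensitively (grep -i), "breathtaking" is not.
        if "samsung" in field.lower() and "breathtaking" in field:
            hits += 1
    return hits


def split_into_chunks(lines, n_chunks):
    """Round-robin split, standing in for the `split` step."""
    return [lines[i::n_chunks] for i in range(n_chunks)]


def parallel_count(lines, n_chunks=16, n_workers=4):
    """Count per chunk in parallel, then merge the partial counts."""
    chunks = split_into_chunks(lines, n_chunks)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_matches, chunks))
    return sum(partial_counts)
```

The merge step is a plain sum because counting is associative, which is what makes this workload easy to parallelise.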

What is BIG data?

What is BIG data?
Big Data
Hadoop

What is BIG data?
Business Intelligence
Data Warehouse

BigData == Hadoop?
Hadoop BigData
Hadoop

What is beyond “Data Warehouse”?
Data Lake
Data Warehouse

First “BigData” UseCase ?
● Google Index
  ● 40 x 10^9 = 40.000.000.000 => 40 billion pages indexed
  ● Will break 100 PB barrier soon
  ● Derived from MapReduce
  ● Now “Caffeine”, based on “Percolator”
    ● Incremental vs. batch
    ● In-Memory vs. disk

Map-Reduce → Hadoop → BigInsights

BigData UseCases
● CERN LHC
  ● 25 petabytes per year
● Facebook
  ● Hive Datawarehouse
  ● 300 PB, growing 600 TB / d
  ● > 100 k servers
● Genomics
● Enterprises
  ● Data center analytics (Logfiles, OS/NW monitors, ...)
  ● Predictive Maintenance, Cybersecurity
  ● Social Media Analytics
  ● DWH offload
  ● Call Detail Record (CDR) data preservation
    http://www.balthasar-glaettli.ch/vorratsdaten/

Why is Big Data important?

BigData Analytics
Source: http://www.strategy-at-risk.com/2008/01/01/what-we-do/

BigData Analytics – Predictive Analytics
"sometimes it's not who has the best algorithm that wins; it's who has the most data."
(C) Google Inc.
The Unreasonable Effectiveness of Data¹
¹http://www.csee.wvu.edu/~gidoretto/courses/2011-fall-cp/reading/TheUnreasonable%20EffectivenessofData_IEEE_IS2009.pdf
No Sampling => Work with full dataset => No p-Value/z-Scores anymore

We need Data Parallelism

Aggregated Bandwidth between CPU, Main Memory and Hard Drive
Scanning 1 TB (at 10 GByte/s per node)
- 1 Node - 100 sec
- 10 Nodes - 10 sec
- 100 Nodes - 1 sec
- 1000 Nodes - 100 msec
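The figures on this slide follow from dividing the data volume by the aggregated bandwidth. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 10^12 bytes), perfectly even data distribution and no coordination overhead; `scan_time_seconds` is a hypothetical helper name:

```python
TB = 10**12  # bytes, decimal units assumed
GB = 10**9


def scan_time_seconds(data_bytes, per_node_bandwidth, nodes):
    """Time to scan the data once if every node reads its share in parallel."""
    return data_bytes / (per_node_bandwidth * nodes)


for nodes in (1, 10, 100, 1000):
    seconds = scan_time_seconds(1 * TB, 10 * GB, nodes)
    print(f"{nodes:>5} nodes: {seconds:g} s")
```

Real clusters fall short of this ideal because of skew, network transfer and scheduling, so the numbers are best read as an upper bound on the speed-up.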

Fault Tolerance / Commodity Hardware
AMD Turion II Neo N40L (2x 1,5GHz / 2MB / 15W), 8 GB RAM,
3TB SEAGATE Barracuda 7200.14
< CHF 500
● 100 K => 200 X (2, 4, 3) => 400 Cores, 1,6 TB RAM, 200 TB HD
● MTBF ~ 365 d > 1,5 d
Source: http://www.cloudcomputingpatterns.org/Watchdog
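The reasoning behind these numbers can be sketched as follows. The sketch assumes that the budget of 100 K is spent on nodes costing 500 each, that "MTBF ~ 365 d" refers to a single node, and that nodes fail independently at a constant rate; both function names are hypothetical.

```python
def cluster_from_budget(budget, node_price, cores_per_node, ram_gb_per_node):
    """Return (nodes, total cores, total RAM in GB) affordable within the budget."""
    nodes = budget // node_price
    return nodes, nodes * cores_per_node, nodes * ram_gb_per_node


def cluster_mtbf_days(node_mtbf_days, nodes):
    """Expected days between failures anywhere in the cluster.

    With independent nodes and constant failure rates, the rates add up,
    so the cluster-wide MTBF is the node MTBF divided by the node count.
    """
    return node_mtbf_days / nodes


nodes, cores, ram_gb = cluster_from_budget(100_000, 500, 2, 8)
print(nodes, cores, ram_gb)
print(cluster_mtbf_days(365, nodes))
```

With 200 nodes some node fails roughly every two days on average, which is why fault tolerance has to be handled in software rather than by buying more reliable hardware.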

NoSQL Databases
● Column Store
  – Hadoop / HBASE
  – Cassandra
  – Amazon Simple DB
● JSON / Document Store
  – MongoDB
  – CouchDB
● Key / Value Store
  – Amazon DynamoDB
  – Voldemort
● Graph DBs
  – DB2 SPARQL Extension
  – Neo4J
● MP RDBMS
  – DB2 DPF, DB2 pureScale, PureData for Operational Analytics
  – Oracle RAC
  – Greenplum

http://nosql-database.org/ lists > 150 NoSQL databases

CAP Theorem / Brewer's Theorem¹
● It is impossible for a distributed computer system to simultaneously guarantee all 3 properties
  – Consistency (all nodes see the same data at the same time)
  – Availability (every request receives a response about whether it was successful or failed)
  – Partition tolerance (the system continues to operate despite failure of part of the system)
● What about ACID?
  – Atomicity
  – Consistency
  – Isolation
  – Durability
● BASE, the new ACID
  – Basically Available
  – Soft state
  – Eventual consistency
    • Monotonic Read Consistency
    • Monotonic Write Consistency
    • Read Your Own Writes
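Eventual consistency and "Read Your Own Writes" can be illustrated with a toy in-memory store. This is a sketch under simplifying assumptions, not how any of the databases named above is implemented: replication happens only when `sync()` is called, conflicts are resolved by "newest version wins", and all class and method names are hypothetical.

```python
class EventuallyConsistentStore:
    """Toy replicated key/value store with delayed propagation."""

    def __init__(self, n_replicas=3):
        # Each replica maps key -> (version, value).
        self.replicas = [{} for _ in range(n_replicas)]
        self.pending = []  # updates not yet propagated to all replicas
        self.version = 0

    def write(self, key, value, replica=0):
        """Apply the write on one replica and queue it for the others."""
        self.version += 1
        self.replicas[replica][key] = (self.version, value)
        self.pending.append((key, self.version, value))
        return replica  # a session can remember this to read its own writes

    def read(self, key, replica):
        """Read from a single replica; the result may be stale."""
        entry = self.replicas[replica].get(key)
        return entry[1] if entry else None

    def sync(self):
        """Propagate queued updates to every replica, newest version wins."""
        for key, version, value in self.pending:
            for data in self.replicas:
                current = data.get(key)
                if current is None or current[0] < version:
                    data[key] = (version, value)
        self.pending.clear()


store = EventuallyConsistentStore()
home = store.write("x", 1)
print(store.read("x", home))  # the writing session sees its own write
print(store.read("x", 1))     # another replica is still stale
store.sync()
print(store.read("x", 1))     # after propagation all replicas agree
```

Routing a session's reads to the replica it wrote to is one simple way to get "Read Your Own Writes" on top of an eventually consistent system.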


What role is the cloud playing here?

“Elastic” Scale-Out
Source: http://www.cloudcomputingpatterns.org/Continuously_Changing_Workload

Linear scale-out of CPU Cores, Storage, Memory
Source: http://www.cloudcomputingpatterns.org/Elastic_Platform

How do Databases Scale-Out?
Shared Disk Architectures

How do Databases Scale-Out?
Shared Nothing Architectures
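In a shared nothing architecture every node owns a disjoint part of the data, so rows have to be assigned to nodes by some rule. One common rule is hash partitioning, sketched below as an illustration; the slide itself does not prescribe a partitioning scheme, and `node_for_key` and `partition` are hypothetical names.

```python
import hashlib


def node_for_key(key, n_nodes):
    """Map a key to a node index using a stable hash of its string form."""
    digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_nodes


def partition(rows, key_index, n_nodes):
    """Distribute rows over n_nodes shards by hashing the partitioning column."""
    shards = [[] for _ in range(n_nodes)]
    for row in rows:
        shards[node_for_key(row[key_index], n_nodes)].append(row)
    return shards
```

Because the mapping depends only on the key, any node can work out where a row lives without consulting shared storage, which is the property that separates this design from shared disk.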

Hadoop?
Shared Nothing Architecture?
Shared Disk Architecture?

Data Science on Hadoop
SQL (42%)
R (33%)
Python (26%)
Excel (25%)
Java, Ruby, C++ (17%)
SPSS, SAS (9%)
Data Science Hadoop

Large Scale Data Ingestion
● Traditionally
  ● Crawl to local file system (e.g. wget http://www.heise.de/newsticker/)
  ● Export RDBMS data to CSV (local file system)
  ● Batched FTP server uploads
  ● Then: Copy to HDFS
● BigInsights
  ● Use one of the built-in importers
  ● Imports directly into HDFS
  ● Use Eclipse tooling to deploy custom importers easily

Large Scale Data Ingestion (ETL on M/R)
● Modern ETL (Extract, Transform, Load) tools support Hadoop as
  ● Source, Sink (HDFS)
  ● Engine (MapReduce)
● Example: InfoSphere DataStage

Real-Time/ In-Memory Data Ingestion
● If volume can be reduced dramatically during first processing steps
  ● Feature Extraction of
    ● Video
    ● Audio
    ● Semistructured Text (e.g. Logfiles)
    ● Structured Text
  ● Filtering
  ● Compression
● Recommendation: Usage of Streaming Engines
  ● IBM InfoSphere Streams
  ● Twitter Storm (now Apache incubator)
  ● Apache Spark Streaming
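The idea of reducing volume during the first processing steps can be sketched with plain Python generators, which process one record at a time without keeping the stream in memory. This only illustrates the pattern and does not use the API of any of the engines listed above; the log line format and all names are assumptions.

```python
import re

# Assumed log format: "<timestamp> <LEVEL> <message>"
LOG_PATTERN = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<msg>.*)$")


def parse(lines):
    """Turn semistructured log lines into records, dropping unparsable ones."""
    for line in lines:
        match = LOG_PATTERN.match(line.rstrip("\n"))
        if match:
            yield match.groupdict()


def keep_problems(events):
    """Filtering step: only warnings and errors are passed on."""
    for event in events:
        if event["level"] in ("WARN", "ERROR"):
            yield event


def extract_features(events):
    """Feature extraction step: keep a compact tuple instead of the full text."""
    for event in events:
        yield (event["ts"], event["level"], len(event["msg"]))


def pipeline(lines):
    return extract_features(keep_problems(parse(lines)))
```

Each stage consumes the previous one lazily, so only the reduced records ever need to be stored.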

SQL on Hadoop
● IBM BigSQL (ANSI 92 compliant)
● HIVE (SQL dialect)
● Cloudera Impala
● Lingual
● ...
SQL Hadoop

BigSQL V3.0 – ANSI SQL 92 compliant
IBM BigInsights v3.0, with Big SQL 3.0, is the only Hadoop distribution to successfully run ALL 99 TPC-DS queries and ALL 22 TPC-H queries without modification.
Source: http://www.ibmbigdatahub.com/blog/big-deal-about-infosphere-biginsights-v30-big-sql

BigSQL V3.0 – Architecture

BigSQL V3.0 – Demo (small)
● 32 GB Data, ~650.000.000 rows (small, Innovation Center Zurich)
● 3 TB Data, ~60.937.500.000 rows (medium, Innovation Center Zurich)
● 0.7 PB Data, ~1.421875×10¹³ rows (large, Innovation Center Hursley)

CREATE EXTERNAL TABLE trace (
  hour integer, employeeid integer,
  departmentid integer, clientid integer,
  date string, timestamp string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/biadmin/32Gtest';

-- This query runs on 32 GB / ~650.000.000 rows in HDFS
SELECT count(hour), hour FROM trace GROUP BY hour ORDER BY hour;
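For comparison, the same aggregation on a single node can be sketched in Python. It assumes the comma-separated column layout declared in the table definition (hour in the first column) and is only practical while the file fits the limits of one machine; `count_by_hour` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def count_by_hour(csv_file):
    """Equivalent of: SELECT count(hour), hour ... GROUP BY hour ORDER BY hour."""
    counts = Counter(int(row[0]) for row in csv.reader(csv_file) if row)
    return [(count, hour) for hour, count in sorted(counts.items())]


# Made-up sample rows in the assumed layout:
# hour, employeeid, departmentid, clientid, date, timestamp
sample = io.StringIO(
    "9,1,10,100,2014-06-11,09:15:00\n"
    "9,2,10,101,2014-06-11,09:40:00\n"
    "10,1,10,100,2014-06-11,10:05:00\n"
)
print(count_by_hour(sample))
```

On Hadoop the SQL engine performs the same grouping, but spread over all nodes that hold blocks of the file.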

R on Hadoop
● IBM BigR (based on the SystemML project from IBM Almaden Research)
● RHadoop
● RHIPE
● ...
“R” Hadoop

BigR (based on SystemML)
Example: Gaussian Non-negative Matrix Factorization
package gnmf;
import java.io.IOException;
import java.net.URISyntaxException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
public class MatrixGNMF
{
public static void main(String[] args) throws IOException, URISyntaxException
{
if(args.length < 10)
{
System.out.println("missing parameters");
System.out.println("expected parameters: [directory of v] [directory of w] [directory of h] " +
"[k] [num mappers] [num reducers] [replication] [working directory] " +
"[final directory of w] [final directory of h]");
System.exit(1);
}
String vDir = args[0];
String wDir = args[1];
String hDir = args[2];
int k = Integer.parseInt(args[3]);
int numMappers = Integer.parseInt(args[4]);
int numReducers = Integer.parseInt(args[5]);
int replication = Integer.parseInt(args[6]);
String outputDir = args[7];
String wFinalDir = args[8];
String hFinalDir = args[9];
JobConf mainJob = new JobConf(MatrixGNMF.class);
String vDirectory;
String wDirectory;
String hDirectory;
FileSystem.get(mainJob).delete(new Path(outputDir));
vDirectory = vDir;
hDirectory = hDir;
wDirectory = wDir;
String workingDirectory;
String resultDirectoryX;
String resultDirectoryY;
long start = System.currentTimeMillis();
System.gc();
System.out.println("starting calculation");
System.out.print("calculating X = WT * V... ");
workingDirectory = UpdateWHStep1.runJob(numMappers, numReducers, replication,
UpdateWHStep1.UPDATE_TYPE_H, vDirectory, wDirectory, outputDir, k);
resultDirectoryX = UpdateWHStep2.runJob(numMappers, numReducers, replication,
workingDirectory, outputDir);
FileSystem.get(mainJob).delete(new Path(workingDirectory));
System.out.println("done");
System.out.print("calculating Y = WT * W * H... ");
wDirectory, outputDir);
resultDirectoryY = UpdateWHStep4.runJob(numMappers, replication, workingDirectory,
UpdateWHStep4.UPDATE_TYPE_H, hDirectory, outputDir);
System.out.print("calculating H = H .* X ./ Y... ");
hDirectory, resultDirectoryX, resultDirectoryY, hFinalDir, k);
FileSystem.get(mainJob).delete(new Path(resultDirectoryX));
FileSystem.get(mainJob).delete(new Path(resultDirectoryY));
System.out.print("storing back H... ");
FileSystem.get(mainJob).delete(new Path(hDirectory));
hDirectory = workingDirectory;
System.out.print("calculating X = V * HT... ");
UpdateWHStep1.UPDATE_TYPE_W, vDirectory, hDirectory, outputDir, k);
resultDirectoryX = UpdateWHStep2.runJob(numMappers, numReducers, replication,
workingDirectory, outputDir);
System.out.print("calculating Y = W * H * HT... ");
hDirectory, outputDir);
resultDirectoryY = UpdateWHStep4.runJob(numMappers, replication, workingDirectory,
UpdateWHStep4.UPDATE_TYPE_W, wDirectory, outputDir);
System.out.print("calculating W = W .* X ./ Y... ");
wDirectory, resultDirectoryX, resultDirectoryY, wFinalDir, k);
FileSystem.get(mainJob).delete(new Path(resultDirectoryX));
FileSystem.get(mainJob).delete(new Path(resultDirectoryY));
System.out.print("storing back W... ");
FileSystem.get(mainJob).delete(new Path(wDirectory));
package gnmf;
import gnmf.io.MatrixObject;
import gnmf.io.MatrixVector;
import gnmf.io.TaggedIndex;
import java.util.Iterator;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
public class UpdateWHStep2
{
static class UpdateWHStep2Mapper extends MapReduceBase
implements Mapper<TaggedIndex, MatrixVector, TaggedIndex,
MatrixVector>
{
@Override
public void map(TaggedIndex key, MatrixVector value,
OutputCollector<TaggedIndex, MatrixVector> out,
Reporter reporter) throws IOException
{
out.collect(key, value);
}
}
static class UpdateWHStep2Reducer extends MapReduceBase
implements Reducer<TaggedIndex, MatrixVector, TaggedIndex,
MatrixObject>
{
@Override
public void reduce(TaggedIndex key, Iterator<MatrixVector> values,
OutputCollector<TaggedIndex, MatrixObject> out, Reporter
reporter)
throws IOException
{
MatrixVector result = null;
while(values.hasNext())
{
MatrixVector current = values.next();
if(result == null)
{
result = current.getCopy();
} else
{
result.addVector(current);
}
}
if(result != null)
{
out.collect(new TaggedIndex(key.getIndex(),
TaggedIndex.TYPE_VECTOR_X),
new MatrixObject(result));
}
}
}
public static String runJob(int numMappers, int numReducers, int
replication,
String inputDir, String outputDir) throws IOException
{
String workingDirectory = outputDir + System.currentTimeMillis() +
"-UpdateWHStep2/";
JobConf job = new JobConf(UpdateWHStep2.class);
job.setJobName("MatrixGNMFUpdateWHStep2");
job.setInputFormat(SequenceFileInputFormat.class);
FileInputFormat.setInputPaths(job, new Path(inputDir));
package gnmf;
import gnmf.io.MatrixCell;
import gnmf.io.MatrixFormats;
import gnmf.io.MatrixObject;
import gnmf.io.MatrixVector;
import gnmf.io.TaggedIndex;
import java.util.Iterator;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
public class UpdateWHStep1
{
public static final int UPDATE_TYPE_H = 0;
public static final int UPDATE_TYPE_W = 1;
static class UpdateWHStep1Mapper extends MapReduceBase
implements Mapper<TaggedIndex, MatrixObject, TaggedIndex, MatrixObject>
{
private int updateType;
@Override
public void map(TaggedIndex key, MatrixObject value,
OutputCollector<TaggedIndex, MatrixObject> out,
Reporter reporter) throws IOException
{
if(updateType == UPDATE_TYPE_W && key.getType() == TaggedIndex.TYPE_CELL)
{
MatrixCell current = (MatrixCell) value.getObject();
out.collect(new TaggedIndex(current.getColumn(), TaggedIndex.TYPE_CELL),
new MatrixObject(new MatrixCell(key.getIndex(), current.getValue())));
} else
{
out.collect(key, value);
}
}
@Override
public void configure(JobConf job)
{
updateType = job.getInt("gnmf.updateType", 0);
}
}
static class UpdateWHStep1Reducer extends MapReduceBase
implements Reducer<TaggedIndex, MatrixObject, TaggedIndex, MatrixVector>
{
private double[] baseVector = null;
private int vectorSizeK;
@Override
public void reduce(TaggedIndex key, Iterator<MatrixObject> values,
OutputCollector<TaggedIndex, MatrixVector> out, Reporter reporter)
throws IOException
{
if(key.getType() == TaggedIndex.TYPE_VECTOR)
{
if(!values.hasNext())
throw new RuntimeException("expected vector");
MatrixFormats current = values.next().getObject();
if(!(current instanceof MatrixVector))
throw new RuntimeException("expected vector");
baseVector = ((MatrixVector) current).getValues();
} else
{
while(values.hasNext())
{
MatrixCell current = (MatrixCell) values.next().getObject();
if(baseVector == null)
{
out.collect(new TaggedIndex(current.getColumn(), TaggedIndex.TYPE_VECTOR),
new MatrixVector(vectorSizeK));
} else
{
if(baseVector.length == 0)
throw new RuntimeException("base vector is corrupted");
MatrixVector resultingVector = new MatrixVector(baseVector);
resultingVector.multiplyWithScalar(current.getValue());
if(resultingVector.getValues().length == 0)
throw new RuntimeException("multiplying with scalar failed");
out.collect(new TaggedIndex(current.getColumn(), TaggedIndex.TYPE_VECTOR),
resultingVector);
}
}
baseVector = null;
}
}
@Override
public void configure(JobConf job)
{
vectorSizeK = job.getInt("dml.matrix.gnmf.k", 0);

Java Implementation (>1500 lines of code)
Equivalent SystemML Implementation (10 lines of code)
Experimenting with multiple variants!
W = W*max(V%*%t(H) - alphaW JW, 0)/(W%*%H%*%t(H))
H = H*max(t(W)%*%V - alphaH JH, 0)/(t(W)%*%W%*%H)
W = W*((S*V)%*%t(H))/((S*(W%*%H))%*%t(H))
H = H*(t(W)%*%(S*V))/(t(W)%*%(S*(W%*%H)))
W = W*(V/(W%*%H) %*% t(H))/(E%*%t(H))
H = H*(t(W)%*%(V/(W%*%H)))/(t(W)%*%E)
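The basic update that the Java job prints as "H = H .* X ./ Y" and "W = W .* X ./ Y" can be written compactly for small matrices. The sketch below uses plain Python lists and is not the SystemML or BigR implementation; the small constant `eps` guarding against division by zero, the random initialisation and all function names are assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def scale(m, num, den, eps=1e-9):
    """Element-wise m * num / den (eps avoids division by zero)."""
    return [[x * n / (d + eps) for x, n, d in zip(mr, nr, dr)]
            for mr, nr, dr in zip(m, num, den)]


def squared_error(v, w, h):
    """Squared Frobenius norm of V - W*H."""
    wh = matmul(w, h)
    return sum((a - b) ** 2 for vr, whr in zip(v, wh) for a, b in zip(vr, whr))


def gnmf(v, k, iterations=50, seed=0):
    """Factorise non-negative V (n x m) into W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        wt = transpose(w)
        h = scale(h, matmul(wt, v), matmul(matmul(wt, w), h))  # H = H .* (WT V) ./ (WT W H)
        ht = transpose(h)
        w = scale(w, matmul(v, ht), matmul(matmul(w, h), ht))  # W = W .* (V HT) ./ (W H HT)
    return w, h
```

A call such as `gnmf([[1, 2, 3], [2, 4, 6], [3, 6, 9]], k=2)` returns non-negative factors whose product approximates the input, and `squared_error` can be used to watch the approximation improve over the iterations.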

BigR (based on SystemML)
SystemML compiles hybrid runtime plans ranging from in-memory, single machine (CP) to large-scale, cluster (MR) compute
● Challenge
  ● Guaranteed hard memory constraints (budget of JVM size)
  ● for arbitrarily complex ML programs
● Key Technical Innovations
  ● CP & MR Runtime: Single machine & MR operations, integrated runtime
  ● Caching: Reuse and eviction of in-memory objects
  ● Cost Model: Accurate time and worst-case memory estimates
  ● Optimizer: Cost-based runtime plan generation
  ● Dyn. Recompiler: Re-optimization for initial unknowns
[Figure "Hybrid Plans": runtime over data size for CP, CP/MR and MR plans. Annotations: "High performance computing for small data sizes.", "Gradually exploit MR parallelism", "Scalable computing for large data sizes."]

BigR Architecture
[Figure: R Clients with IBM R Packages; SystemML Statistics Engine; Embedded R Execution; Data Sources. Numbered annotations (1 to 3) include "Pull data (summaries) to R client" and "Or, push R functions right on the data".]
© 2014 IBM Corporation, IBM Internal Use Only

BigR Demo (small)
library(bigr)
bigr.connect(host="bigdata", port=7052, database="default",
             user="biadmin", password="xxx")
is.bigr.connected()
tbr <- bigr.frame(dataSource="DEL",
                  coltypes=c("numeric","numeric","numeric","numeric","character","character"),
                  dataPath="/user/biadmin/32Gtest", delimiter=",",
                  header=F, useMapReduce=T)
h <- bigr.histogram.stats(tbr$V1, nbins=24)

BigR Demo (small)
  class bins   counts centroids
1   ALL    0 18289280  1.583333
2   ALL    1    15360  2.750000
3   ALL    2    55040  3.916667
4   ALL    3   189440  5.083333
5   ALL    4   579840  6.250000
6   ALL    5  5292160  7.416667
7   ALL    6  8074880  8.583333
8   ALL    7 15653120  9.750000
...
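The shape of this output (one count and one centroid per bin) can be reproduced on a small in-memory sample with an equal-width histogram. This is a sketch only: it assumes that the centroids are bin midpoints, which the slide does not state, and `histogram_stats` is a hypothetical name, not the BigR function.

```python
def histogram_stats(values, nbins):
    """Return (counts, centroids) of an equal-width histogram over values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for v in values:
        if width == 0:
            idx = 0  # all values identical: everything lands in the first bin
        else:
            # The maximum value belongs to the last bin, not to a bin past the end.
            idx = min(int((v - lo) / width), nbins - 1)
        counts[idx] += 1
    centroids = [lo + (i + 0.5) * width for i in range(nbins)]
    return counts, centroids
```

Only the per-bin counts have to be combined across data partitions, which is why a histogram over hundreds of millions of rows fits the MapReduce model well.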

BigR Demo (small)
jpeg('hist.jpg')
bigr.histogram(tbr$V1, nbins=24)
# This command runs on 32 GB / ~650.000.000 rows in HDFS
dev.off()

BigR Demo (small)
Sampling, Resampling, Bootstrapping
vs
Whole Dataset Processing
What is your experience?

Python on Hadoop
python Hadoop

SPSS on Hadoop

BigSheets Demo (small)
This command runs on 32 GB / ~650.000.000 rows in HDFS

Text Extraction (SystemT, AQL)

If this is not enough? → BigData AppStore

BigData AppStore, Eclipse Tooling
● Write your apps in
  ● Java (MapReduce)
  ● PigLatin, Jaql
  ● BigSQL/Hive/BigR
● Deploy it to BigInsights via Eclipse
● Automatically
  ● Schedule
  ● Update
    ● HDFS files
    ● BigSQL tables
    ● BigSheets collections

Questions?
http://www.ibm.com/software/data/bigdata/
Twitter: @RomeoKienzler, @IBMEcosystem_DE, @IBM_ISV_Alps

DFT/Audio Analytics (as promised)
library(tuneR)
# White noise with an embedded sine tone
a <- readWave("whitenoisesine.wav")
f <- fft(a@left)
jpeg('rplot_wnsine.jpg')
plot(Re(f)^2)
dev.off()
# White noise only, for comparison
a <- readWave("whitenoise.wav")
f <- fft(a@left)
jpeg('rplot_wn.jpg')
plot(Re(f)^2)
dev.off()
# Convert the samples for use with BigR
a <- readWave("whitenoisesine.wav")
brv <- as.bigr.vector(a@left)
al <- as.list(a@left)
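What the two plots compare can be sketched without audio files: a sine tone buried in noise shows up as a single dominant bin in the spectrum. The sketch uses a naive O(n²) DFT from the standard library and a synthetic signal, and it plots nothing. It uses the squared magnitude of each coefficient, whereas the R code above plots the squared real part. All names and signal parameters are assumptions.

```python
import cmath
import math
import random


def dft(samples):
    """Naive discrete Fourier transform (fine for short signals only)."""
    n = len(samples)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(samples))
            for k in range(n)]


def power_spectrum(samples):
    """Squared magnitude of each DFT coefficient."""
    return [abs(c) ** 2 for c in dft(samples)]


def dominant_bin(samples):
    """Index of the strongest bin between DC and the Nyquist frequency."""
    spectrum = power_spectrum(samples)
    half = spectrum[1:len(samples) // 2]
    return 1 + max(range(len(half)), key=half.__getitem__)


def noisy_sine(n=64, cycles=5, noise=0.2, seed=0):
    """Synthetic stand-in for the 'white noise plus sine' recording."""
    rng = random.Random(seed)
    return [math.sin(2 * math.pi * cycles * t / n) + rng.uniform(-noise, noise)
            for t in range(n)]
```

For the default signal, `dominant_bin(noisy_sine())` picks out the bin of the embedded tone, while pure noise spreads its energy over all bins without such a peak.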

Backup Slides

Map-Reduce
Source: http://www.cloudcomputingpatterns.org/Map_Reduce
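The three phases of Map-Reduce (map, shuffle by key, reduce) can be sketched in a few lines with word count as the usual example. Everything runs in one process here, so this shows the programming model only and none of Hadoop's distribution or fault tolerance; the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and emit (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count(lines):
    def mapper(line):
        return ((word.lower(), 1) for word in line.split())

    def reducer(key, values):
        return sum(values)

    return reduce_phase(shuffle(map_phase(lines, mapper)), reducer)
```

In a real cluster the map and reduce calls run on different nodes and the shuffle moves data over the network, but the mapper and reducer keep the same shape.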
