Distributed DataFrame (DDF)
Simplifying Big Data
For The Rest Of Us
Christopher Nguyen, PhD
Co-Founder & CEO
DATA INTELLIGENCE FOR ALL
Agenda
1. Motivation: Problem & Solution
2. DDF Features & Benefits
3. Demo
• Former Engineering Director of Google Apps (Google Founders’ Award)
• Former Professor and Co-Founder of the Computer Engineering program at HKUST
• PhD Stanford, BS U.C. Berkeley Summa cum Laude
Christopher Nguyen, PhD
Adatao Inc.
Co-Founder & CEO
The Small Data world vs. the Big Data tool landscape: RHadoop, HDFS, MapReduce, Hive, HBase, Pig, SummingBird, Crunch, Storm, Scalding, Cascading
import scala.util.Random
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Helper partition the slide's code assumes (not shown on the slide):
// the parent partition plus a per-partition random seed.
class SeededPartition(val prev: Partition, val seed: Int) extends Partition {
  override val index: Int = prev.index
}

class RandomSplitRDD(prev: RDD[Array[Double]], seed: Long, lower: Double,
    upper: Double, isTraining: Boolean)
  extends RDD[Array[Double]](prev) {

  // Give every parent partition its own deterministic seed.
  override def getPartitions: Array[Partition] = {
    val rg = new Random(seed)
    firstParent[Array[Double]].partitions.map(x => new SeededPartition(x, rg.nextInt))
  }

  override def getPreferredLocations(split: Partition): Seq[String] =
    firstParent[Array[Double]].preferredLocations(split.asInstanceOf[SeededPartition].prev)

  override def compute(splitIn: Partition, context: TaskContext): Iterator[Array[Double]] = {
    val split = splitIn.asInstanceOf[SeededPartition]
    val rand = new Random(split.seed)
    if (isTraining) {
      // Training rows: the random draw falls outside [lower, upper).
      firstParent[Array[Double]].iterator(split.prev, context).filter { _ =>
        val z = rand.nextDouble()
        z < lower || z >= upper
      }
    } else {
      // Test rows: the random draw falls inside [lower, upper).
      firstParent[Array[Double]].iterator(split.prev, context).filter { _ =>
        val z = rand.nextDouble()
        lower <= z && z < upper
      }
    }
  }
}

def randomSplit(rdd: RDD[Array[Double]], numSplits: Int, trainingSize: Double,
    seed: Long): Iterator[(RDD[Array[Double]], RDD[Array[Double]])] = {
  require(0 < trainingSize && trainingSize < 1)
  val rg = new Random(seed)
  (1 to numSplits).map(_ => rg.nextInt).map(z =>
    (new RandomSplitRDD(rdd, z, 0, 1.0 - trainingSize, true),
     new RandomSplitRDD(rdd, z, 0, 1.0 - trainingSize, false))).toIterator
}
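Even calling this helper takes ceremony. A minimal usage sketch, assuming an existing RDD[Array[Double]] named rows (the variable name is an assumption for illustration):

// Produce one train/test pair with a 70/30 split.
val (trainingRdd, testRdd) = randomSplit(rows, numSplits = 1, trainingSize = 0.7, seed = 42L).next()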
WHY: Can’t It Be Even Simpler?
What Happiness Looks Like, vs. the page of RDD code above:

table = load("housePrices")
table.dropNA()
table.train.glm("price", "sf,zip,beds,baths")
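This is the gap DDF aims to close. As a rough illustration of the idea (a sketch under assumptions, not the actual DDF implementation), a thin wrapper can hide the RandomSplitRDD plumbing from the previous slide behind a table-like handle; the Table class, the load helper, and the glm stub below are all hypothetical:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical table handle; glm is only a stub so the shape of the API is visible.
class Table(val rows: RDD[Array[Double]]) {
  // Drop rows containing NaN, mirroring table.dropNA above.
  def dropNA(): Table = new Table(rows.filter(r => !r.exists(_.isNaN)))

  // Hide the RandomSplitRDD plumbing: fixed 70/30 split, keep only the training side.
  def train: Table = {
    val (trainingRdd, _) = randomSplit(rows, numSplits = 1, trainingSize = 0.7, seed = 42L).next()
    new Table(trainingRdd)
  }

  // Stub: a real implementation would fit a generalized linear model here.
  def glm(target: String, features: String): Unit =
    println(s"fit glm: $target ~ $features on ${rows.count()} rows")
}

// Hypothetical loader for a numeric CSV stored on HDFS.
def load(name: String)(implicit sc: SparkContext): Table =
  new Table(sc.textFile(name).map(_.split(",").map(_.toDouble)))

// Usage mirroring the slide:
//   val table = load("housePrices")
//   table.dropNA().train.glm("price", "sf,zip,beds,baths")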
Bottom-Up (HDFS, MapReduce, DAG engine) vs. Top-Down (FP, monad) thinking: “we begin to dream / find a solution to fight this evil.”
Design Principles: Ease & Simplicity, Power & Speed, Sophistication
Example architecture: Web Browser, R-Studio, and Python front-ends (Business Analyst, Data Scientist, Data Engineer) each talk to the server through an API client (DDFClient, PAClient, PIClient) on top of HDFS.
Demo deployment diagram: one client, one master, and four workers. Airline dataset, 12 GB, 123M rows.
DDF Offers:

1. Table-Like Abstraction on Top of Big Data
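To make the abstraction concrete, here is a minimal sketch (not the actual DDF interface) of what a table-like handle over a distributed dataset might expose; the TableLike trait, RddTable class, and their member names are assumptions for illustration:

import org.apache.spark.rdd.RDD

// Hypothetical "table" view over distributed rows:
// users see columns and row counts, not partitions and shuffles.
trait TableLike {
  def columnNames: Seq[String]
  def numRows: Long
  def project(columns: String*): TableLike   // SELECT col1, col2, ...
  def sample(fraction: Double): TableLike    // small slice for eyeballing
}

// One possible backing: named columns over an RDD of numeric rows.
class RddTable(names: Seq[String], rows: RDD[Array[Double]]) extends TableLike {
  def columnNames: Seq[String] = names
  def numRows: Long = rows.count()
  def project(columns: String*): TableLike = {
    val idx = columns.map(names.indexOf(_))
    new RddTable(columns, rows.map(r => idx.map(i => r(i)).toArray))
  }
  def sample(fraction: Double): TableLike =
    new RddTable(names, rows.sample(withReplacement = false, fraction))
}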
2. Native R Data.frame Experience
ddf.getSummary()
ddf.getFiveNum()
ddf.binning("distance", 10)
ddf.scale(SCALE.MINMAX)
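For intuition about what a call like getFiveNum() has to do under the hood, here is a hedged sketch (not DDF's implementation) that computes a five-number summary (min, Q1, median, Q3, max) for one numeric column held in a Spark RDD; the exact-quantile, sort-based approach is a simplifying assumption:

import org.apache.spark.rdd.RDD

// Exact five-number summary of one column, by sorting and rank lookup.
// Fine for a sketch; a real engine would use approximate quantiles to avoid a full sort.
def fiveNum(column: RDD[Double]): (Double, Double, Double, Double, Double) = {
  val n = column.count()
  require(n > 0, "empty column")
  // zipWithIndex after sortBy gives each value a global rank.
  val ranked = column.sortBy(identity).zipWithIndex().map(_.swap)  // (rank, value)
  def at(q: Double): Double = ranked.lookup(((n - 1) * q).toLong).head
  (at(0.0), at(0.25), at(0.5), at(0.75), at(1.0))
}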
3. Focus on Analytics, Not MapReduce

Real-Time Scoring:
manager.loadModel(uri)
lm.predict(point)

Data Wrangling:
ddf.dropNA()
ddf.transform("speed = distance/duration")

Model Building & Validation:
ddf.lm(0.1, 10)
ddf.roc(testddf)
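As a concrete reading of the scoring snippet above, here is a minimal sketch (assumed for illustration, not the DDF API) of what a loaded linear model's predict(point) amounts to: a dot product of learned weights with the feature vector, plus an intercept:

// Hypothetical linear model as produced by something like ddf.lm(...):
// predict(point) is just weights · features + intercept.
case class LinearModel(weights: Array[Double], intercept: Double) {
  def predict(point: Array[Double]): Double = {
    require(point.length == weights.length, "feature/weight length mismatch")
    weights.zip(point).map { case (w, x) => w * x }.sum + intercept
  }
}

// Usage sketch: a model with two coefficients scoring one incoming point.
val lm = LinearModel(weights = Array(0.8, -1.2), intercept = 3.0)
val score = lm.predict(Array(10.0, 2.5))  // 0.8*10 - 1.2*2.5 + 3.0 = 8.0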
4. Seamlessly Integrate with External ML Libraries

Data + Algorithm = Intelligence
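As an illustration of this integration point (a sketch under assumptions, not DDF's actual bridge), the idea is that the table abstraction can hand its rows to an external library in whatever shape that library expects; here the target is Spark MLlib's RDD[LabeledPoint], and the toLabeledPoints helper is hypothetical:

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionModel, LinearRegressionWithSGD}

// Hypothetical bridge: expose a table's rows (label in column 0) as MLlib LabeledPoints.
def toLabeledPoints(rows: RDD[Array[Double]]): RDD[LabeledPoint] =
  rows.map(r => LabeledPoint(r(0), Vectors.dense(r.drop(1))))

// Hand the data to the external algorithm; MLlib does the training.
def trainWithMLlib(rows: RDD[Array[Double]]): LinearRegressionModel =
  LinearRegressionWithSGD.train(toLabeledPoints(rows), numIterations = 100)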
5. Share DDFs Seamlessly & Efficiently

“Can I see your data?” Just share a URI: ddf://adatao/airline
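One way to picture URI-based sharing (purely a sketch: the registry below is hypothetical, reuses the TableLike sketch from feature 1, and ignores access control and cross-process transport) is a name service that resolves ddf:// URIs to live table handles:

import scala.collection.concurrent.TrieMap

// Hypothetical in-process registry mapping ddf:// URIs to table handles.
// A real system would resolve names across users and machines, with permissions.
object DDFRegistry {
  private val tables = TrieMap.empty[String, TableLike]

  def publish(uri: String, table: TableLike): Unit = {
    require(uri.startsWith("ddf://"), s"not a ddf URI: $uri")
    tables.put(uri, table)
  }

  def resolve(uri: String): Option[TableLike] = tables.get(uri)
}

// Usage sketch: the owner publishes once, a colleague resolves by name.
//   DDFRegistry.publish("ddf://adatao/airline", airlineTable)
//   val shared = DDFRegistry.resolve("ddf://adatao/airline")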
6. Work in Your Preferred Language …even Plain English!
In summary, DDF offers:
1. Table-Like Abstraction on Top of Big Data
2. Native R Data.frame Experience
3. Focus on Analytics, Not MapReduce
4. Seamlessly Integrate with External ML Libraries
5. Share DDFs Seamlessly & Efficiently
6. Work in Your Preferred Language
DDF is to Big Data as SQL+R is to Small Data.
www.adatao.com/ddf
DATA INTELLIGENCE FOR ALL
We’re Open Sourcing DDF!!!
1. Accept Core Committers
2. Clean up & Prep
3. Open up for public access
