Distributed DataFrame (DDF)
Simplifying Big Data
For The Rest Of Us
Christopher Nguyen, PhD
Co-Founder & CEO
DATA INTELLIGENCE FOR ALL
Agenda
1. Motivation: Problem & Solution
2. DDF Features & Benefits
3. Demo
• Former Engineering Director of Google Apps (Google Founders’ Award)
• Former Professor and Co-Founder of the Computer Engineering program at HKUST
• PhD Stanford, BS U.C. Berkeley Summa cum Laude
Christopher Nguyen, PhD
Adatao Inc.
Co-Founder & CEO
The Small Data world vs. the Big Data tool landscape: RHadoop, HDFS, MapReduce, Hive, HBase, Pig, SummingBird, Crunch, Storm, Scalding, Cascading
import scala.util.Random
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Helper partition the slide's code assumes (not shown on the slide):
// the parent partition plus a per-partition random seed.
class SeededPartition(val prev: Partition, val seed: Int) extends Partition {
  override val index: Int = prev.index
}

class RandomSplitRDD(prev: RDD[Array[Double]], seed: Long, lower: Double,
    upper: Double, isTraining: Boolean)
  extends RDD[Array[Double]](prev) {

  // Give every parent partition its own deterministic seed.
  override def getPartitions: Array[Partition] = {
    val rg = new Random(seed)
    firstParent[Array[Double]].partitions.map(x => new SeededPartition(x, rg.nextInt))
  }

  override def getPreferredLocations(split: Partition): Seq[String] =
    firstParent[Array[Double]].preferredLocations(split.asInstanceOf[SeededPartition].prev)

  override def compute(splitIn: Partition, context: TaskContext): Iterator[Array[Double]] = {
    val split = splitIn.asInstanceOf[SeededPartition]
    val rand = new Random(split.seed)
    if (isTraining) {
      // Training rows: the random draw falls outside [lower, upper).
      firstParent[Array[Double]].iterator(split.prev, context).filter { _ =>
        val z = rand.nextDouble()
        z < lower || z >= upper
      }
    } else {
      // Test rows: the random draw falls inside [lower, upper).
      firstParent[Array[Double]].iterator(split.prev, context).filter { _ =>
        val z = rand.nextDouble()
        lower <= z && z < upper
      }
    }
  }
}

def randomSplit(rdd: RDD[Array[Double]], numSplits: Int, trainingSize: Double,
    seed: Long): Iterator[(RDD[Array[Double]], RDD[Array[Double]])] = {
  require(0 < trainingSize && trainingSize < 1)
  val rg = new Random(seed)
  (1 to numSplits).map(_ => rg.nextInt).map(z =>
    (new RandomSplitRDD(rdd, z, 0, 1.0 - trainingSize, true),
     new RandomSplitRDD(rdd, z, 0, 1.0 - trainingSize, false))).toIterator
}
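Even calling this helper takes ceremony. A minimal usage sketch, assuming an existing RDD[Array[Double]] named rows (the variable name is an assumption for illustration):

// Produce one train/test pair with a 70/30 split.
val (trainingRdd, testRdd) = randomSplit(rows, numSplits = 1, trainingSize = 0.7, seed = 42L).next()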
WHY: Can’t It Be Even Simpler?
What Happiness Looks Like, vs. the page of RDD code above:

table = load("housePrices")
table.dropNA()
table.train.glm("price", "sf,zip,beds,baths")
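This is the gap DDF aims to close. As a rough illustration of the idea (a sketch under assumptions, not the actual DDF implementation), a thin wrapper can hide the RandomSplitRDD plumbing from the previous slide behind a table-like handle; the Table class, the load helper, and the glm stub below are all hypothetical:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical table handle; glm is only a stub so the shape of the API is visible.
class Table(val rows: RDD[Array[Double]]) {
  // Drop rows containing NaN, mirroring table.dropNA above.
  def dropNA(): Table = new Table(rows.filter(r => !r.exists(_.isNaN)))

  // Hide the RandomSplitRDD plumbing: fixed 70/30 split, keep only the training side.
  def train: Table = {
    val (trainingRdd, _) = randomSplit(rows, numSplits = 1, trainingSize = 0.7, seed = 42L).next()
    new Table(trainingRdd)
  }

  // Stub: a real implementation would fit a generalized linear model here.
  def glm(target: String, features: String): Unit =
    println(s"fit glm: $target ~ $features on ${rows.count()} rows")
}

// Hypothetical loader for a numeric CSV stored on HDFS.
def load(name: String)(implicit sc: SparkContext): Table =
  new Table(sc.textFile(name).map(_.split(",").map(_.toDouble)))

// Usage mirroring the slide:
//   val table = load("housePrices")
//   table.dropNA().train.glm("price", "sf,zip,beds,baths")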
Bottom-Up (HDFS, MapReduce, DAG engine) vs. Top-Down (FP, monad) thinking: “we begin to dream / find a solution to fight this evil.”
Design Principles: Ease & Simplicity, Power & Speed, Sophistication
Example architecture: Web Browser, R-Studio, and Python front-ends (Business Analyst, Data Scientist, Data Engineer) each talk to the server through an API client (DDFClient, PAClient, PIClient) on top of HDFS.
Demo deployment diagram: one client, one master, and four workers. Airline dataset, 12 GB, 123M rows.
DDF Offers:

1. Table-Like Abstraction on Top of Big Data
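To make the abstraction concrete, here is a minimal sketch (not the actual DDF interface) of what a table-like handle over a distributed dataset might expose; the TableLike trait, RddTable class, and their member names are assumptions for illustration:

import org.apache.spark.rdd.RDD

// Hypothetical "table" view over distributed rows:
// users see columns and row counts, not partitions and shuffles.
trait TableLike {
  def columnNames: Seq[String]
  def numRows: Long
  def project(columns: String*): TableLike   // SELECT col1, col2, ...
  def sample(fraction: Double): TableLike    // small slice for eyeballing
}

// One possible backing: named columns over an RDD of numeric rows.
class RddTable(names: Seq[String], rows: RDD[Array[Double]]) extends TableLike {
  def columnNames: Seq[String] = names
  def numRows: Long = rows.count()
  def project(columns: String*): TableLike = {
    val idx = columns.map(names.indexOf(_))
    new RddTable(columns, rows.map(r => idx.map(i => r(i)).toArray))
  }
  def sample(fraction: Double): TableLike =
    new RddTable(names, rows.sample(withReplacement = false, fraction))
}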
2. Native R Data.frame Experience
ddf.getSummary()
ddf.getFiveNum()
ddf.binning("distance", 10)
ddf.scale(SCALE.MINMAX)
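For intuition about what a call like getFiveNum() has to do under the hood, here is a hedged sketch (not DDF's implementation) that computes a five-number summary (min, Q1, median, Q3, max) for one numeric column held in a Spark RDD; the exact-quantile, sort-based approach is a simplifying assumption:

import org.apache.spark.rdd.RDD

// Exact five-number summary of one column, by sorting and rank lookup.
// Fine for a sketch; a real engine would use approximate quantiles to avoid a full sort.
def fiveNum(column: RDD[Double]): (Double, Double, Double, Double, Double) = {
  val n = column.count()
  require(n > 0, "empty column")
  // zipWithIndex after sortBy gives each value a global rank.
  val ranked = column.sortBy(identity).zipWithIndex().map(_.swap)  // (rank, value)
  def at(q: Double): Double = ranked.lookup(((n - 1) * q).toLong).head
  (at(0.0), at(0.25), at(0.5), at(0.75), at(1.0))
}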
3. Focus on Analytics, Not MapReduce

Real-Time Scoring:
manager.loadModel(uri)
lm.predict(point)

Data Wrangling:
ddf.dropNA()
ddf.transform("speed = distance/duration")

Model Building & Validation:
ddf.lm(0.1, 10)
ddf.roc(testddf)
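As a concrete reading of the scoring snippet above, here is a minimal sketch (assumed for illustration, not the DDF API) of what a loaded linear model's predict(point) amounts to: a dot product of learned weights with the feature vector, plus an intercept:

// Hypothetical linear model as produced by something like ddf.lm(...):
// predict(point) is just weights · features + intercept.
case class LinearModel(weights: Array[Double], intercept: Double) {
  def predict(point: Array[Double]): Double = {
    require(point.length == weights.length, "feature/weight length mismatch")
    weights.zip(point).map { case (w, x) => w * x }.sum + intercept
  }
}

// Usage sketch: a model with two coefficients scoring one incoming point.
val lm = LinearModel(weights = Array(0.8, -1.2), intercept = 3.0)
val score = lm.predict(Array(10.0, 2.5))  // 0.8*10 - 1.2*2.5 + 3.0 = 8.0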
4. Seamlessly Integrate with External ML Libraries

Data + Algorithm = Intelligence
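As an illustration of this integration point (a sketch under assumptions, not DDF's actual bridge), the idea is that the table abstraction can hand its rows to an external library in whatever shape that library expects; here the target is Spark MLlib's RDD[LabeledPoint], and the toLabeledPoints helper is hypothetical:

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionModel, LinearRegressionWithSGD}

// Hypothetical bridge: expose a table's rows (label in column 0) as MLlib LabeledPoints.
def toLabeledPoints(rows: RDD[Array[Double]]): RDD[LabeledPoint] =
  rows.map(r => LabeledPoint(r(0), Vectors.dense(r.drop(1))))

// Hand the data to the external algorithm; MLlib does the training.
def trainWithMLlib(rows: RDD[Array[Double]]): LinearRegressionModel =
  LinearRegressionWithSGD.train(toLabeledPoints(rows), numIterations = 100)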
5. Share DDFs Seamlessly & Efficiently

“Can I see your data?” Just share a URI: ddf://adatao/airline
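One way to picture URI-based sharing (purely a sketch: the registry below is hypothetical, reuses the TableLike sketch from feature 1, and ignores access control and cross-process transport) is a name service that resolves ddf:// URIs to live table handles:

import scala.collection.concurrent.TrieMap

// Hypothetical in-process registry mapping ddf:// URIs to table handles.
// A real system would resolve names across users and machines, with permissions.
object DDFRegistry {
  private val tables = TrieMap.empty[String, TableLike]

  def publish(uri: String, table: TableLike): Unit = {
    require(uri.startsWith("ddf://"), s"not a ddf URI: $uri")
    tables.put(uri, table)
  }

  def resolve(uri: String): Option[TableLike] = tables.get(uri)
}

// Usage sketch: the owner publishes once, a colleague resolves by name.
//   DDFRegistry.publish("ddf://adatao/airline", airlineTable)
//   val shared = DDFRegistry.resolve("ddf://adatao/airline")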
6. Work in Your Preferred Language …even Plain English!
In summary, DDF offers:
1. Table-Like Abstraction on Top of Big Data
2. Native R Data.frame Experience
3. Focus on Analytics, Not MapReduce
4. Seamlessly Integrate with External ML Libraries
5. Share DDFs Seamlessly & Efficiently
6. Work in Your Preferred Language
DDF is to Big Data as SQL+R is to Small Data.
www.adatao.com/ddf
DATA INTELLIGENCE FOR ALL
We’re Open Sourcing DDF!!!
1. Accept Core Committers
2. Clean up & Prep
3. Open up for public access
