Data Preparation and Descriptive Statistics in SystemML
Outline
• Data	pre-processing	and	transformation
• Training/Testing/Cross	Validation
• Descriptive	statistics
I. Univariate	statistics
II. Bivariate	statistics
III. Stratified	statistics
Input	Data	Format
Input data
§ Rows: data points (aka records)
§ Columns: features (aka variables, attributes)
Feature types:
§ Scale (aka continuous), e.g., ‘Height’, ‘Weight’, ‘Salary’, ‘Temperature’
§ Categorical (aka discrete)
  § Nominal – no natural ranking, e.g., ‘Gender’, ‘Region’, ‘Hair color’
  § Ordinal – natural ranking, e.g., ‘Level of Satisfaction’
Example: The house data set
Data	Pre-Processing
Tabular	input	data	needs	to	be	transformed	into	a	matrix	– transform()	built-in	function
Categorical	features	need	special	treatment:
§ Recoding:	mapping	distinct	categories	into	consecutive	numbers	starting	from	1
§ Dummycoding (aka	one-hot-encoding,	 one-of-K	encoding)
Example (recoding):
  Zipcode    Zipcode (recoded)
  96334      1
  95123      2
  95141      3
  96334      1
Example (dummycoding):
  direction    dir_east  dir_west  dir_north  dir_south
  east         1         0         0          0
  west         0         1         0          0
  north        0         0         1          0
  south        0         0         0          1
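The two encodings in the example can be sketched in a few lines of Python (used here instead of DML so the snippet runs on its own). The helper names `recode` and `dummycode` are hypothetical, and assigning IDs in order of first appearance is an assumption of this sketch, not a statement about how SystemML orders categories.

```python
def recode(values):
    """Map distinct categories to consecutive integers starting from 1.

    IDs are handed out in order of first appearance (an assumption of
    this sketch). Returns the recoded column and the category -> ID map.
    """
    mapping = {}
    codes = []
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping) + 1
        codes.append(mapping[v])
    return codes, mapping


def dummycode(codes, num_categories):
    """Turn recoded IDs (1..num_categories) into one-hot rows."""
    return [[1 if c == k else 0 for k in range(1, num_categories + 1)]
            for c in codes]


# Recoding the Zipcode column from the example
zip_codes, zip_map = recode(["96334", "95123", "95141", "96334"])
print(zip_codes)  # [1, 2, 3, 1]

# Dummycoding the direction column: recode first, then expand to one-hot
dir_codes, dir_map = recode(["east", "west", "north", "south"])
dir_matrix = dummycode(dir_codes, len(dir_map))
for row in dir_matrix:
    print(row)
```

Dummycoding is applied to the recoded IDs rather than to the raw strings, which is why the transform specification later in the deck lists dummycoded columns among the recoded or binned ones.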
transform() Built-in	Function
transform() built-in	function	 supports:
§ Omitting	missing	values
§ Missing value imputation by global_mean (scale features), global_mode (categorical features), or constant (scale/categorical features)
§ Binning (equi-width)
§ Scaling (scale	features):	mean-subtraction,	z-score
§ Recoding
§ Dummycoding
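As a rough illustration of three of the listed operations (imputation by global mean, equi-width binning, z-score scaling), here is a stdlib-only Python sketch. The function names are hypothetical, missing values are represented as `None`, and details such as using the sample standard deviation or clamping the maximum into the last bin are assumptions of the sketch rather than documented transform() behaviour.

```python
import statistics


def impute_global_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in column]


def bin_equi_width(column, numbins):
    """Assign each value a bin ID in 1..numbins using equal-width bins."""
    lo, hi = min(column), max(column)
    width = (hi - lo) / numbins
    if width == 0:
        return [1] * len(column)
    # The maximum sits on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width) + 1, numbins) for v in column]


def z_score(column):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.stdev(column)
    return [(v - mean) / std for v in column]


filled = impute_global_mean([20.0, None, 30.0, 40.0])
bins = bin_equi_width([0.0, 1.0, 5.0, 9.0, 10.0], numbins=3)
scaled = z_score(filled)
print(filled)  # [20.0, 30.0, 30.0, 40.0]
print(bins)    # [1, 1, 2, 3, 3]
```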
Transform	Specification
§ Transformations operate on individual columns
§ All required transformations are specified in a JSON file
§ Property na.strings in the mtd file specifies missing values
Example:
data.csv.mtd
    {
      "data_type": "frame",
      "format": "csv",
      "sep": ",",
      "header": true,
      "na.strings": [ "NA", "" ]
    }
data.spec.json
    {
      "ids": true,
      "omit": [ 1, 4, 5, 6, 7, 8, 9 ],
      "impute": [
        { "id": 2, "method": "constant", "value": "south" },
        { "id": 3, "method": "global_mean" }
      ],
      "recode": [ 1, 2, 4, 5, 6, 7 ],
      "bin": [ { "id": 8, "method": "equi-width", "numbins": 3 } ],
      "dummycode": [ 2, 5, 6, 7, 8, 3 ]
    }
Combinations	of	Transformations
Signature	of	transform()
§ Invocation 1:
    output = transform (target = input,
                        spec = specification,
                        transformPath = "/path/to/metadata");
§ Resulting metadata: number of distinct values in categorical columns, list of distinct values with their recoded IDs, number of bins, bin width, etc.
§ An existing transformation can be applied to new data using the metadata generated in an earlier invocation
§ Invocation 2:
    output = transform (target = input,
                        transformPath = "/path/to/new_metadata",
                        applyTransformPath = "/path/to/metadata");
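The idea behind the second invocation, reusing metadata from an earlier run so that new data is encoded consistently, can be sketched in Python. The functions `fit_recode_map` and `apply_recode_map` are hypothetical stand-ins, the "metadata" is reduced to a single recode map, and mapping unseen categories to `None` is an assumption of this sketch (the slides do not say how transform() treats them).

```python
def fit_recode_map(column):
    """Build the 'metadata': each distinct category mapped to an ID from 1."""
    mapping = {}
    for v in column:
        mapping.setdefault(v, len(mapping) + 1)
    return mapping


def apply_recode_map(column, mapping):
    """Encode a column with an existing map; unseen categories become None."""
    return [mapping.get(v) for v in column]


train_dirs = ["east", "west", "north", "east"]
test_dirs = ["north", "east", "south"]

# First invocation: derive the metadata from the training column.
metadata = fit_recode_map(train_dirs)
train_codes = apply_recode_map(train_dirs, metadata)

# Second invocation: reuse the same metadata on new data.
test_codes = apply_recode_map(test_dirs, metadata)
print(train_codes)  # [1, 2, 3, 1]
print(test_codes)   # [3, 1, None]
```

Reusing the map keeps "north" encoded as 3 in both data sets; fitting a fresh map on the new data would have assigned it 1.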
Training/Testing
§ Pre-processing	training	and	testing	data	sets
§ Splitting	data	points	and	labels	– splitXY.dml and	splitXY-dummy.dml (hands-on)
§ Sampling	data	points	– sample.dml (hands-on)
§ Cross	Validation	– cv-linreg.dml (hands-on)
Pre-Processing Training and Testing Data
Training phase
    Train = read ("/user/ml/trainset.csv");
    Spec = read ("/user/ml/tf.spec.json", data_type = "scalar",
                 value_type = "String");
    trainD = transform (target = Train,
                        transformSpec = Spec,
                        transformPath = "/user/ml/train_tf_metadata");
    # Build a predictive model using trainD
    ...
Testing phase
    Test = read ("/user/ml/testset.csv");
    testD = transform (target = Test,
                       transformPath = "/user/ml/test_tf_metadata",
                       applyTransformPath = "/user/ml/train_tf_metadata");
    # Test the model using testD
    ...
Cross	Validation
K-fold Cross Validation:
1. Shuffle the data points
2. Divide the data points into 𝑘 folds of (roughly) the same size
3. For 𝑖 = 1, … , 𝑘:
• Train each model on all the data points that do not belong to fold 𝑖
• Test each model on all the examples in fold 𝑖 and compute the test error
4. Select the model with the minimum average test error over all 𝑘 folds
5. (Train the winning model on all the data points)
Example: 𝑘 = 5 (figure: in each round one fold is used for testing and the remaining folds for training)
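The procedure above can be sketched in plain Python. This is an illustration under assumptions, not the deck's cv-linreg.dml script: the two candidate "models" (a constant predictor and a least-squares line), the mean-squared-error metric, the toy data, and all function names are made up for the example.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split the indices into k folds of similar size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(xs, ys, fit, k=5, seed=0):
    """Average test mean-squared error of `fit` over k folds."""
    errors = []
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        predict = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        mse = sum((predict(xs[i]) - ys[i]) ** 2 for i in test_idx) / len(test_idx)
        errors.append(mse)
    return sum(errors) / k


def fit_mean(xs, ys):
    """Baseline model: always predict the training mean of y."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_line(xs, ys):
    """Least-squares line y = a + b * x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return lambda x: a + b * x


xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
scores = {name: cross_validate(xs, ys, fit)
          for name, fit in [("mean", fit_mean), ("line", fit_line)]}
best = min(scores, key=scores.get)  # step 4: lowest average test error
print(best)
```

Step 5 would then refit the winning model on all 20 points.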
Univariate	Statistics
Row  Name of Statistic            Scale  Category
1    Minimum                      +
2    Maximum                      +
3    Range                        +
4    Mean                         +
5    Variance                     +
6    Standard deviation           +
7    Standard error of mean       +
8    Coefficient of variation     +
9    Skewness                     +
10   Kurtosis                     +
11   Standard error of skewness   +
12   Standard error of kurtosis   +
13   Median                       +
14   Interquartile mean           +
15   Number of categories                +
16   Mode                                +
17   Number of modes                     +
Measure groups: central tendency measures, dispersion measures, shape measures, categorical measures
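A few rows of the table can be computed with Python's standard library, as a sketch. The function names are hypothetical, the variance uses the n - 1 denominator (an assumption), and skewness, kurtosis and the interquartile mean are left out because the slides do not give the exact formulas used.

```python
import math
import statistics
from collections import Counter


def scale_summary(values):
    """Selected statistics for a scale (continuous) feature."""
    n = len(values)
    mean = statistics.fmean(values)
    var = statistics.variance(values)  # sample variance, n - 1 denominator
    std = math.sqrt(var)
    return {
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "mean": mean,
        "variance": var,
        "std_dev": std,
        "std_error_of_mean": std / math.sqrt(n),
        "coeff_of_variation": std / mean,
        "median": statistics.median(values),
    }


def categorical_summary(values):
    """Statistics for a categorical feature."""
    counts = Counter(values)
    top = max(counts.values())
    modes = [cat for cat, freq in counts.items() if freq == top]
    return {"num_categories": len(counts),
            "mode": modes[0],
            "num_modes": len(modes)}


s = scale_summary([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
c = categorical_summary(["east", "west", "east", "north", "west"])
print(s["mean"], s["median"], s["range"])  # 5.0 4.5 7.0
print(c["num_categories"], c["num_modes"])  # 3 2
```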
Bivariate Statistics
Quantitative association between pairs of features
I. Scale-vs-Scale statistics
  § Pearson’s correlation coefficient
II. Nominal-vs-Nominal statistics
  § Pearson’s $\chi^2$
  § Cramér's $V$
III. Nominal-vs-Scale statistics
  § Eta statistic
  § $F$ statistic
IV. Ordinal-vs-Ordinal statistics
  § Spearman’s rank correlation coefficient
Scale-vs-Scale	Statistics	
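Pearson’s correlation coefficient
§ A measure of linear dependence between scale features
§ $\rho^2$ measures accuracy of $x_2 \sim x_1$

$\rho = \frac{\mathrm{Cov}(x_1, x_2)}{\sigma_{x_1} \sigma_{x_2}}, \qquad \rho \in [-1, +1]$

$1 - \rho^2 = \frac{\sum_{i=1}^{n} (x_{i,2} - \hat{x}_{i,2})^2}{\sum_{i=1}^{n} (x_{i,2} - \bar{x}_2)^2}$

The numerator is the Residual Sum of Squares (RSS) and the denominator is the Total Sum of Squares (TSS).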
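A small Python check of the two formulas above: it computes the correlation coefficient directly and, separately, RSS/TSS from a least-squares fit of the second feature on the first. The data and function names are invented for the example.

```python
import math


def pearson(x1, x2):
    """Covariance divided by the product of standard deviations."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    # The 1/n (or 1/(n-1)) factors cancel between numerator and denominator.
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1 = math.sqrt(sum((a - m1) ** 2 for a in x1))
    s2 = math.sqrt(sum((b - m2) ** 2 for b in x2))
    return cov / (s1 * s2)


def rss_over_tss(x1, x2):
    """Fit x2 ~ x1 by least squares and return RSS / TSS."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    slope = (sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
             / sum((a - m1) ** 2 for a in x1))
    intercept = m2 - slope * m1
    rss = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x1, x2))
    tss = sum((b - m2) ** 2 for b in x2)
    return rss / tss


x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]
rho = pearson(x1, x2)
ratio = rss_over_tss(x1, x2)
print(rho, 1 - rho ** 2, ratio)
```

The last two printed values agree up to floating-point error, which is the sense in which $\rho^2$ measures how well a linear fit explains the second feature.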
Nominal-vs-Nominal	Statistics
Pearson’s $\chi^2$
§ A measure of how much the frequencies of value pairs of two categorical features deviate from statistical independence
§ Under the independence assumption, Pearson’s $\chi^2$ is distributed approximately $\chi^2(d)$ with $d = (k_1 - 1)(k_2 - 1)$ degrees of freedom
§ $P$-value: $P = \Pr[\rho \ge \text{Pearson's } \chi^2]$, where $\rho \sim \chi^2(d)$ distribution
§ $P \to 0$ (rapidly) as features’ dependence increases, sensitive to $n$
§ Only measures the presence of dependence, not the strength of dependence

$\chi^2 = \sum_{a,b} \frac{(O_{a,b} - E_{a,b})^2}{E_{a,b}}$

where
$x_1$ with $k_1$ distinct categories
$x_2$ with $k_2$ distinct categories
$O_{a,b} = \#(a, b)$: observed frequencies
$E_{a,b} = \frac{\#a \, \#b}{n}$: expected frequencies for all pairs $(a, b)$
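The statistic and its degrees of freedom follow directly from the observed and expected frequencies; a minimal Python sketch with made-up data and a hypothetical function name is shown here. The P-value needs the chi-squared distribution function, which the standard library does not provide, so it is omitted; with SciPy installed, `scipy.stats.chi2.sf(chi2, dof)` would be one way to obtain it.

```python
from collections import Counter


def pearson_chi_squared(x1, x2):
    """Return (chi2, degrees_of_freedom) for two categorical columns."""
    n = len(x1)
    count_a = Counter(x1)               # marginal frequencies #a
    count_b = Counter(x2)               # marginal frequencies #b
    observed = Counter(zip(x1, x2))     # joint frequencies #(a, b)
    chi2 = 0.0
    for a in count_a:
        for b in count_b:
            expected = count_a[a] * count_b[b] / n
            chi2 += (observed[(a, b)] - expected) ** 2 / expected
    dof = (len(count_a) - 1) * (len(count_b) - 1)
    return chi2, dof


region = ["north", "north", "south", "south", "north", "south", "north", "south"]
owner = ["yes", "yes", "no", "no", "yes", "no", "no", "yes"]
chi2, dof = pearson_chi_squared(region, owner)
print(chi2, dof)  # 2.0 1
```

Every expected count here is 4 * 4 / 8 = 2, and the observed counts are 3, 1, 1, 3, so each of the four cells contributes 0.5.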
Nominal-vs-Nominal	Statistics
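Cramér's $V$
§ A measure for the strength of association between two categorical features
§ Under the independence assumption, $\chi^2_{\max} V^2$ (which equals Pearson’s $\chi^2$) is distributed approximately $\chi^2(d)$ with $d = (k_1 - 1)(k_2 - 1)$ degrees of freedom
§ $P$-value: $P = \Pr[\rho \ge \chi^2_{\max} V^2]$, where $\rho \sim \chi^2(d)$ distribution
§ $V \to 1$ (slowly) as features’ dependence increases; dividing by $\chi^2_{\max}$ makes it less sensitive to $n$ than Pearson’s $\chi^2$

$V = \sqrt{\frac{\text{Pearson's } \chi^2}{\chi^2_{\max}}}, \qquad \chi^2_{\max} = n \cdot \min\{k_1 - 1, k_2 - 1\}$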
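A self-contained Python sketch of Cramér's V, using the standard definition as the square root of Pearson's chi-squared divided by its maximum n * min{k1 - 1, k2 - 1}. The function name and the toy columns are invented; the sketch assumes both features have at least two categories, otherwise the maximum is zero.

```python
import math
from collections import Counter


def cramers_v(x1, x2):
    """Cramér's V for two categorical columns with >= 2 categories each."""
    n = len(x1)
    count_a, count_b = Counter(x1), Counter(x2)
    observed = Counter(zip(x1, x2))
    chi2 = 0.0
    for a in count_a:
        for b in count_b:
            expected = count_a[a] * count_b[b] / n
            chi2 += (observed[(a, b)] - expected) ** 2 / expected
    chi2_max = n * min(len(count_a) - 1, len(count_b) - 1)
    return math.sqrt(chi2 / chi2_max)


# Second column is a relabelling of the first: perfect association.
v_perfect = cramers_v(["a", "b", "c", "a", "b", "c"],
                      ["x", "y", "z", "x", "y", "z"])
# Every combination occurs equally often: no association.
v_independent = cramers_v(["a", "a", "b", "b"], ["x", "y", "x", "y"])
print(v_perfect, v_independent)
```

Duplicating every record doubles Pearson's chi-squared but leaves V unchanged, which is why V is read as a strength measure.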
Nominal-vs-Scale	Statistics
Eta statistic
§ A measure for the strength of association between a categorical feature and a scale feature
§ $\eta^2$ measures accuracy of $y \sim x$, similar to the $R^2$ statistic of linear regression

$\eta^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}[x_i])^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$

The numerator is RSS and the denominator is TSS, where
$x$: categorical
$y$: scale
$\hat{y}[x]$: average of $y_i$ among all records with $x_i = x$
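The definition translates almost line by line into Python: predict each record by the mean of its category, then compare the residual sum of squares with the total sum of squares. The function name and data are made up for this sketch.

```python
import math
from collections import defaultdict


def eta_squared(x, y):
    """x: categorical values, y: scale values. Returns 1 - RSS / TSS."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    # y_hat[x]: average of y among all records sharing category x
    group_mean = {cat: sum(v) / len(v) for cat, v in groups.items()}
    overall_mean = sum(y) / len(y)
    rss = sum((yi - group_mean[xi]) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - overall_mean) ** 2 for yi in y)
    return 1.0 - rss / tss


x = ["a", "a", "b", "b", "c", "c"]
y = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]
eta2 = eta_squared(x, y)
eta = math.sqrt(eta2)
print(eta2)  # RSS = 6, TSS = 70, so eta^2 = 64/70
```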
Nominal-vs-Scale	Statistics
$F$ statistic
§ A measure for the strength of association between a categorical feature and a scale feature
§ Assumptions ($x$ categorical, $y$ scale):
  § $y \sim \mathrm{Normal}(\mu, \sigma^2)$, same variance for all $x$
  § $x$ has small value domain with large frequency counts, $x_i$ non-random
  § All records are iid
§ Under the independence assumption, $F$ is distributed approximately $F(k - 1, n - k)$

$F = \frac{\sum_{x} \mathit{freq}(x) \, (\hat{y}[x] - \bar{y})^2 / (k - 1)}{\sum_{i=1}^{n} (y_i - \hat{y}[x_i])^2 / (n - k)} = \frac{\eta^2 (n - k)}{(1 - \eta^2)(k - 1)}$

The numerator is ESS (Explained Sum of Squares) divided by its degrees of freedom $k - 1$; the denominator is RSS divided by its degrees of freedom $n - k$.
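A short Python sketch that computes F from the sums of squares and checks it against the expression in terms of eta squared. It relies on TSS = ESS + RSS; the function name and data are invented for the example.

```python
from collections import defaultdict


def f_statistic(x, y):
    """Return (F, eta_squared, n, k) for categorical x and scale y."""
    n = len(y)
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    k = len(groups)
    overall_mean = sum(y) / n
    group_mean = {cat: sum(v) / len(v) for cat, v in groups.items()}
    # Explained sum of squares: category means around the overall mean
    ess = sum(len(v) * (group_mean[cat] - overall_mean) ** 2
              for cat, v in groups.items())
    # Residual sum of squares: records around their category mean
    rss = sum((yi - group_mean[xi]) ** 2 for xi, yi in zip(x, y))
    f = (ess / (k - 1)) / (rss / (n - k))
    eta2 = ess / (ess + rss)
    return f, eta2, n, k


x = ["a", "a", "b", "b", "c", "c"]
y = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]
f, eta2, n, k = f_statistic(x, y)
f_from_eta = eta2 * (n - k) / ((1 - eta2) * (k - 1))
print(f, f_from_eta)  # both 16 up to rounding: ESS = 64, RSS = 6
```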
Ordinal-vs-Ordinal	Statistics
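Spearman’s rank correlation coefficient
§ A measure for the strength of association between two ordinal features
§ Pearson’s correlation coefficient applied to the features with values replaced by their ranks
Example (tied values receive the average of their positions in sorted order; the two 8s occupy positions 4 and 5, so each gets rank 4.5):
  $x$     $r$
  8     4.5
  3     2
  11    6
  8     4.5
  5     3
  2     1

$\rho = \frac{\mathrm{Cov}(r_1, r_2)}{\sigma_{r_1} \sigma_{r_2}}, \qquad \rho \in [-1, +1]$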
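The ranking step, including average ranks for ties, can be sketched in Python and checked against the example column above (8, 3, 11, 8, 5, 2). Function names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        avg = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = avg
        start = end + 1
    return ranks


def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    sa = math.sqrt(sum((p - ma) ** 2 for p in a))
    sb = math.sqrt(sum((q - mb) ** 2 for q in b))
    return cov / (sa * sb)


def spearman(x1, x2):
    """Pearson's correlation applied to the ranks of both features."""
    return pearson(average_ranks(x1), average_ranks(x2))


ranks = average_ranks([8, 3, 11, 8, 5, 2])
print(ranks)  # [4.5, 2.0, 6.0, 4.5, 3.0, 1.0]
rho = spearman([1, 2, 3, 4], [10, 20, 25, 100])
```

Because only ranks enter the formula, any monotonically increasing relationship (such as the second call) yields a coefficient of 1 even when it is far from linear.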
Stratified Statistics
Stratified bivariate statistics measure association between pairs of features in the presence of a confounding categorical feature
Why stratification?
  Month                  Oct          Nov          Dec          Oct-Dec
  Customers (Millions)   0.6   1.4    1.4   0.6    3.0   1.0    5.0   3.0
  Promotions (0 or 1)    0     1      0     1      0     1      0     1
  Avg sales per 1000     0.4   0.5    0.9   1.0    2.5   2.6    1.8   1.3
A trend in each group is reversed and amplified when the groups are combined
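The reversal can be reproduced from the table's own numbers: within each month the promotion group sells slightly more per 1000 customers, yet the customer-weighted averages over October to December point the other way. The Python below is only a check of that arithmetic; the variable and function names are made up.

```python
# month -> {promotion flag: (customers in millions, avg sales per 1000)}
data = {
    "Oct": {0: (0.6, 0.4), 1: (1.4, 0.5)},
    "Nov": {0: (1.4, 0.9), 1: (0.6, 1.0)},
    "Dec": {0: (3.0, 2.5), 1: (1.0, 2.6)},
}


def pooled_average(promo):
    """Customer-weighted average sales over all months for one promo flag."""
    total_customers = sum(month[promo][0] for month in data.values())
    weighted_sales = sum(month[promo][0] * month[promo][1]
                         for month in data.values())
    return weighted_sales / total_customers


# Effect of the promotion inside each stratum (month)
per_month_effect = {m: round(v[1][1] - v[0][1], 10) for m, v in data.items()}
# Effect when the strata are ignored
pooled_effect = pooled_average(1) - pooled_average(0)

print(per_month_effect)          # {'Oct': 0.1, 'Nov': 0.1, 'Dec': 0.1}
print(round(pooled_effect, 10))  # -0.5
```

The pooled figures are dominated by December, when sales were high and most customers received no promotion, so month acts as the confounding feature.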
Stratified Statistics
Measures of association: correlation, slope, $P$-values, etc.
Assumptions:
• Values of the confounding feature $s$ group the records into strata; within each stratum all bivariate pairs are assumed free of confounding
• For each bivariate pair $(x, y)$, $y$ must be numerical and distributed normally given $x$
• A linear regression model for $y$ ($i$: stratum id): $y_{i,j} = \alpha_i + \beta x_{i,j} + \varepsilon_{i,j}$, with $\varepsilon_{i,j} \sim \mathrm{Normal}(0, \sigma^2)$
• $\sigma^2$ same across all strata
Computed statistics:
• $\bar{x}_i$, $\hat{\sigma}_{x_i}$, $\bar{y}_i$, $\hat{\sigma}_{y_i}$
• For $x \sim$ strata, $y \sim$ strata, $y \sim x$ NO strata, and $y \sim x$ AND strata
• $R^2$, slopes, std. error of slopes, $P$-values
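For the regression model with one intercept per stratum and a shared slope, the least-squares slope can be obtained by centering x and y within each stratum. The Python sketch below shows only that estimate; the function name and data are invented, and standard errors and P-values are not computed.

```python
from collections import defaultdict


def stratified_fit(strata, x, y):
    """Least-squares fit of y = alpha_i + beta * x, one alpha per stratum."""
    groups = defaultdict(list)
    for s, xi, yi in zip(strata, x, y):
        groups[s].append((xi, yi))
    sxy = sxx = 0.0
    means = {}
    for s, pairs in groups.items():
        mx = sum(px for px, _ in pairs) / len(pairs)
        my = sum(py for _, py in pairs) / len(pairs)
        means[s] = (mx, my)
        # Accumulate within-stratum cross products and squares
        sxy += sum((px - mx) * (py - my) for px, py in pairs)
        sxx += sum((px - mx) ** 2 for px, _ in pairs)
    beta = sxy / sxx
    alpha = {s: my - beta * mx for s, (mx, my) in means.items()}
    return alpha, beta


def pooled_slope(x, y):
    """Slope of y ~ x when the strata are ignored."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))


strata = ["A", "A", "A", "B", "B", "B"]
x = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
y = [12.0, 14.0, 16.0, 4.0, 6.0, 8.0]
alpha, beta = stratified_fit(strata, x, y)
print(alpha, beta)         # intercepts 10 and -10, shared slope 2
print(pooled_slope(x, y))  # negative: the trend reverses without strata
```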

Data preparation, training and validation using SystemML by Faraz Makari Manshadi