Data Preparation August 24, 2014
UC Irvine Week7_1
PRELIMINARY MODELING REPORT
Marta H Seoane
I. Introduction
The data for this report come from the Paralyzed Veterans of America (PVA), a not-for-profit organization that provides programs and services for US veterans with spinal cord injuries or disease. PVA's in-house database contains information on over 13 million donors, and the organization is one of the largest direct-mail fundraisers in the country. The data set was used in KDD-CUP-98, the Second International Knowledge Discovery and Data Mining Tools Competition, whose goal was to estimate the return from a direct mailing in order to maximize donation profits.[1]
The purpose of this report is to analyze information relevant to the organization's efforts to increase net revenue from renewals by lapsed donors. The donor data originated from three separate data sets that were subsequently merged. The integrated data set includes 36,205 analytical cases and 205 variables. The report is divided into three sections—Methods, Experimental Modeling, and Post Modeling Analysis. The first section describes the data preparation steps used to carry out the analysis, the second provides the results of an experimental modeling analysis using different algorithms and assumptions, and the third proposes steps for further analysis.
II. Methods
Data preparation is the sequence of activities that readies acquired data for analysis and modeling. Typically, data preparation accounts for 80% of all activities leading to model building, training, testing, and deployment. It is through this process that the 200+ variables in our data set were narrowed to the 17 used for modeling. The flow, or the order in which activities are performed, can be delineated in stages (Figure 1). I chose a wheel graph with arrows pointing in opposite directions for two reasons: clockwise, the arrows show the sequence of necessary steps in data preparation; counterclockwise, they convey its iterative nature. The intent is to portray a cycle that begins with the acquisition of entire data sets and ends with the selection of a few variables for modeling. To complete the cycle, the data goes through various stages where it is readied, massaged, transformed, narrowed, and evaluated. Following is a brief description of the typical stages in data preparation with selective references to our case study.[2]
Figure 1. Data Preparation as an Iterative Process
Data Acquisition, Data Description, and Data Integration
• The donor information originated from three separate data sets that were subsequently merged into a single “Integrated” data set using Statistica Data Miner (SDM) and Excel; a rough pandas analogue of this integration and description step is sketched after this list.
• The description of the data is an essential step to learn about the data set, its size, its variables, characteristics, potential, and challenges.
This is important in data mining and in more traditional types of analysis. A data dictionary contained a list of the original variables,
variable labels, names, types, and format. Histograms, scatterplots, and descriptive statistics provided a more intimate knowledge of
target and predictor variables. Correlation matrices helped identify collinearity—a necessary step to narrow down the selection of
potential predictors.
• Variable selection—the process of selecting the variables for modeling begins at this stage, and continues throughout the cycle.
• The activities, operations, and results were documented in a Data Description Report.
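As a rough illustration of this stage (not the Statistica/Excel workflow used in the study), the pandas sketch below merges three hypothetical donor files on a shared ID, prints a basic data description, and computes a correlation matrix to flag collinear predictors. The file names and the CONTROLN key are assumptions, not the actual source files.

```python
import pandas as pd

# Hypothetical file names standing in for the three original donor data sets.
files = ["donors_demographics.csv", "donors_giving_history.csv", "donors_promotions.csv"]
frames = [pd.read_csv(f) for f in files]

# Integrate on a shared donor identifier (CONTROLN, the KDD-Cup-98 ID, is assumed here).
integrated = frames[0]
for frame in frames[1:]:
    integrated = integrated.merge(frame, on="CONTROLN", how="inner")

# Data description: size, variable types, and summary statistics.
print(integrated.shape)                      # cases x variables
print(integrated.dtypes.value_counts())      # counts of variable types
print(integrated.describe(include="all").T.head(20))

# Correlation matrix of numeric predictors to flag highly collinear pairs.
corr = integrated.select_dtypes("number").corr()
pairs = corr.abs().stack()
print(pairs[(pairs > 0.9) & (pairs < 1.0)].sort_values(ascending=False).head(10))
```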
[1] http://kdd.ics.uci.edu/databases/kddcup98/kddcup98.htm — accessed 08/25/2014
[2] There are 10 steps in data preparation: 1. Data access and extraction; 2. Data integration; 3. Data cleansing; 4. Data conditioning; 5. Missing value imputation; 6. New variable derivation; 7. Variable selection; 8. Algorithm selection; 9. Preliminary model building; 10. Feedback. These steps are included in the stages in Figure 1.
[Figure 1: wheel diagram with segments Acquisition, Description, Integration, Quality Assessment, Transformation, Standardization, Balancing | Partitioning, and Modeling around a hub labeled Data Preparation]
Data Quality Assessment
• Data profiling and data cleansing are central to the assessment of data quality.
o Data profiling identifies the number of valid cases, examines measures of central tendency, searches for outliers, missing data, and miscodes, and helps determine what new variables might need to be created for the analysis, e.g. Valid N.
o Data cleansing corrects miscodes (e.g., negative zip codes), resolves inconsistencies in data entries such as gender, and eliminates redundancies (e.g., age vs. date of birth).
• Missing Values and Variable Recoding
o The tools in Statistica provide the means to search for valid data and missing cases—several of the variables in the donor data set were 99% missing. Decisions were made to impute missing values or, in some cases, to delete the variables. Recoding was used to remove outliers, fix incorrect zip codes, and change categorical values from text to numeric, e.g. Homeownr and Gender. Care must be taken when recoding variables, since recoding can increase the number of missing values if data values cannot be calculated.
• A data quality assessment report was generated in Statistica. The report listed, for each variable, the name, number of valid cases, coefficient of variation, and a measure of non-normality (Mean/Median). A minimal pandas analogue of this profiling step is sketched at the end of this section.
• In the initial stages of data preparation, the data quality assessment stage is limited in scope. During the Data Transformation stage,
quality is reassessed more rigorously by running a Data Health Check.
• A Data Quality Assessment Report documented the initial results generated in Statistica and the various steps taken in this stage.
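As a minimal sketch of the same profiling ideas (valid counts, coefficient of variation, and a Mean/Median non-normality measure) written in pandas rather than Statistica, the function below builds a per-variable quality table; the 99% missing-data cutoff is the one described above, and `integrated` is the merged frame from the earlier sketch.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-variable quality summary: valid N, percent missing, CV, and Mean/Median."""
    numeric = df.select_dtypes("number")
    report = pd.DataFrame({
        "valid_n": df.notna().sum(),
        "pct_missing": df.isna().mean() * 100,
    })
    # Coefficient of variation and Mean/Median ratio as a rough skewness indicator
    # (both stay NaN for non-numeric variables).
    report["coef_variation"] = numeric.std() / numeric.mean()
    report["mean_over_median"] = numeric.mean() / numeric.median()
    return report

# Example: drop variables that are almost entirely missing (99% or more missing).
report = quality_report(integrated)
integrated = integrated[report.index[report["pct_missing"] < 99]]
```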
Data Transformation and Variable Selection
• New variables and distribution metrics. New variables are created to assist in analysis and modeling, e.g. “Valid N” to count blanks or missing values, “Mean/Median” to assess normality, and “Random” to randomize the balanced data set prior to modeling. The distribution metrics, for example, rely on the newly created “Mean/Median” variable as a measure of non-normality to determine the extent to which a variable is skewed. Categorical variables, such as the dependent variable, need to be recoded from text to numeric so the algorithms can read them.
• Dummy variables are another form of data transformation, for example transforming the categorical variable Gender from text (M/F) to numeric (M=1, F=0). Dummy variables were also created for Homeownr, Domain, and MdMaude. Numerical variables can also be transformed, such as by combining values from different income variables. These recoding and imputation steps are illustrated in the sketch after this list.
• By now the data has undergone many changes through recoding, deletion, addition, derivation, and standardization. To assess the “health of the data,” a Data Health Check was run in Statistica to count valid cases and identify missing values. Missing values in the Income variable were imputed with the mean.
• Other types of transformation derive new variables from existing ones, such as data abstraction—a variable constructed from other data by arithmetic or logical operations, e.g. lag variables. Lag variables were not used in the donor data set.
• During this stage a list of predictor variables is selected.
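The sketch below illustrates the recoding, dummy-coding, and mean-imputation steps described in this list using pandas. The category codes shown (e.g. "M" for Gender, "H" for Homeownr) are assumptions for illustration, not necessarily the codes used in the report.

```python
import pandas as pd

# Recode categorical text values to numeric indicators (assumed codings).
integrated["GENDER_M"] = (integrated["GENDER"] == "M").astype(int)      # M=1, F=0
integrated["HOMEOWNR_H"] = (integrated["HOMEOWNR"] == "H").astype(int)  # homeowner flag

# One dummy column per DOMAIN category; drop_first avoids perfectly collinear dummies.
domain_dummies = pd.get_dummies(integrated["DOMAIN"], prefix="DOMAIN", drop_first=True)
integrated = pd.concat([integrated, domain_dummies], axis=1)

# Mean imputation for the income variable, as described above.
integrated["INCOME"] = integrated["INCOME"].fillna(integrated["INCOME"].mean())
```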
Data Standardization
• The idea behind standardization is to normalize the data to a common range, for example z-values: z = (value − mean) / standard deviation. The numerical variables were standardized using Statistica. Variable values often differ in range, and parametric statistics will bias results toward the variables with the wider range. Standardization solves this by reducing the numerical variables to a common scale, thus lowering the estimation error. During this process categorical variables are deleted, narrowing the total number of variables in the data set. Statistica offers All Variable Specs, and the Data Transformation group provides an automated Standardization routine. A minimal sketch of the calculation follows this list.
o Variable selection: only the predictors selected for modeling are standardized.
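A minimal z-score standardization sketch applying the formula above to the numeric predictors chosen for modeling; the predictor names are assumptions (scikit-learn's StandardScaler would give the same result).

```python
# z = (value - mean) / standard deviation, applied only to the selected numeric predictors.
numeric_predictors = ["INCOME", "AVGGIFT", "LASTGIFT"]   # assumed predictor names
means = integrated[numeric_predictors].mean()
stds = integrated[numeric_predictors].std()
integrated[numeric_predictors] = (integrated[numeric_predictors] - means) / stds
```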
Balancing, Partitioning, and Modeling
• Balancing, partitioning, and preliminary modeling are very much data preparation activities.
o Algorithms: Algorithms are instrumental in determining how the data is prepared. The algorithms used in the study were Boosted Trees, SANN, and Support Vector Machines; Statistica Data Miner Recipes provided additional results for C&RT.
o Balancing: Some algorithms require a balanced data set to operate properly. A data set can be balanced manually, automatically, or with weights. Balancing matters when a binary-coded target is unevenly distributed: in the donor study, over 80% of the target values were coded as 0 and about 20% as 1. The imbalance causes algorithms that learn case by case to bias predictions toward the common case. Balancing can be achieved by under-sampling (reducing the number of majority cases), over-sampling, or weighting. A sketch of under-sampling, case weighting, and the 60:20:20 partition follows this list.
o Some algorithms balance their data sets internally—Boosted Trees, SANN, CART. Others, such as Support Vector Machines (SVM) and Statistica Data Miner (SDM), accept case-by-case weights.
o Partitioning: training (to calculate weights), testing (for error evaluation between iterations), and validation (to calculate accuracy across all iterations; it is never used in training).
 SANN has a built-in partitioning option, set at 60:20:20.
 Boosted Trees and SVM required a splitting node in Statistica to create training and testing samples.
o Modeling: During preliminary modeling, the algorithms are configured to run on the training, testing, and validation subsets. Accuracy measurements are used to evaluate and compare the results from training and testing the unbalanced and balanced data sets, and to determine whether the models are ready for deployment.
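To make the balancing and partitioning choices concrete, the sketch below shows under-sampling to a 50:50 target mix, a 60:20:20 train/test/validation split, and per-case weights, using pandas and scikit-learn rather than the Statistica nodes used in the study; the weights of 15 and 2 are the ones reported in the next section, while the column names are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Under-sampling: keep all minority (TARGET_B=1) cases and an equal-size random
# sample of majority (TARGET_B=0) cases, then shuffle.
minority = integrated[integrated["TARGET_B"] == 1]
majority = integrated[integrated["TARGET_B"] == 0].sample(n=len(minority), random_state=42)
balanced = pd.concat([minority, majority]).sample(frac=1.0, random_state=42)

# 60:20:20 partition into training, testing, and validation subsets.
train, holdout = train_test_split(balanced, test_size=0.4, random_state=42,
                                  stratify=balanced["TARGET_B"])
test, valid = train_test_split(holdout, test_size=0.5, random_state=42,
                               stratify=holdout["TARGET_B"])

# Alternative to under-sampling: per-case weights for algorithms that accept them.
case_weights = integrated["TARGET_B"].map({1: 15, 0: 2})
```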
III. Experimental Results
Experimental Results (no balancing, manual balancing, balancing with case weights)
• Weight balancing: the case weight for Target_B=1 was set to 15 and for Target_B=0 to 2, weighting Target_B=1 cases 7.5 times as heavily as Target_B=0 cases.
• Use of a Weight variable in SDM
This section shows the results from running the algorithms on the unbalanced and balanced data sets. The workspace below displays the steps leading to the preliminary modeling. The nodes for the data sources represent the unbalanced and balanced data sets. Notice that the connectors for both sets link directly to the SANN Classification Automated Network Search (ANS) node and bypass the node that splits the input data into training and testing samples. The SANN algorithm performs a randomized 80:20 partition of the data set, training on 80% and testing on the rest; in addition, it provides an option to automate partitioning, set in this exercise at 60:20:20 for the training, testing, and validation subsets. In the other two models the data must traverse the Split Input Data node prior to partitioning to fit the BT and SVM algorithms. The workspace below displays the connections from the unbalanced and balanced data sets to the three types of algorithms.
Figure 2. Preliminary Modeling Workspace
A)
SANN: Unbalanced | SANN: Balanced [3]
Data Mining Recipe: Results for Neural Network—Balanced
For the test sample, the balanced SANN outperforms both the unbalanced data set and DMR.
[3] Manual balancing: the total number of records was reduced from 36,205 to 9,687—4,843 records each for Target=1 and Target=0 (under-sampling).
B)
BOOSTED TREES: Unbalanced
Accuracy for the unbalanced test sample is about 86%. The results for the test samples give the advantage to the balanced set: the accuracy value for Target=1 is 89%, and 92% for the DMR model below.
BOOSTED TREES: Balanced
Data Mining Recipes: Results for Boosted Trees—Balanced
C)
SUPPORT VECTOR MACHINE (SVM)
SVM is a supervised classification technique used in machine learning. SVM algorithms analyze and recognize patterns. It is a non-probabilistic method that assigns the data points in a training data set to one of two categories. The model is well suited for the study of the categorical Target_B variable, with categories 1 and 0. In the training sample, the data points are assigned to one side or the other of a boundary drawn in the feature space. The algorithm creates a “gap” (margin) between the categories and, during the testing phase, assigns “1” and “0” values according to the categories defined in the training set. The Statistica SVM model used for the analysis is Classification SVM Type 1 (C-SVM) with a Radial Basis Function kernel, the most popular choice. Cross-validation with SVM uses a form of resampling in which separate models are trained on different random samples of the data set.
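As a rough analogue of the C-SVM with RBF kernel described above (not the Statistica configuration itself), the scikit-learn sketch below trains a weighted RBF-kernel SVM on the standardized predictors from the earlier sketches and estimates accuracy by 5-fold cross-validation; the predictor names, hyperparameters, and class weights are assumptions.

```python
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Standardized predictors and target from the earlier sketches (names are assumptions).
X_train, y_train = train[numeric_predictors], train["TARGET_B"]

# C-SVM with a radial basis function kernel; class_weight stands in for case weighting.
svm = SVC(kernel="rbf", C=1.0, gamma="scale", class_weight={0: 2, 1: 15})
svm.fit(X_train, y_train)

# Accuracy on the held-out test partition.
print("test accuracy:", svm.score(test[numeric_predictors], test["TARGET_B"]))

# Cross-validation: separate models trained and scored on different random folds.
cv_scores = cross_val_score(svm, X_train, y_train, cv=5, scoring="accuracy")
print("5-fold CV accuracy:", cv_scores.mean())
```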
The results below show higher accuracy for the balanced data set, and higher performance for cross-validation relative to the balanced data set.
Support Vector Machine: unbalanced
Support Vector Machine: balanced
CROSS-VALIDATION with SVM
IV. Post Modeling Analysis
The variable DOMAIN, a potential predictor excluded from the final variable list, is worth further investigation. DOMAIN is a symbolic variable representing levels of urbanicity and socio-economic status. The histogram below reveals that three of the categories combined account for 40% of all cases. Further, the means-with-errors plot for DOMAIN and TARGET_D—the dollar amount of donations associated with responses to mailing requests—shows higher values for groups associated with affluent neighborhoods, regardless of geography, and lower values for the remaining groups, suggesting the possibility of two predictive models. Location analysis would enhance the predictive value of the models.
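A quick pandas sketch of the DOMAIN follow-up suggested here: category frequencies and mean TARGET_D with standard errors by DOMAIN group, which is the information behind the histogram and the means-with-errors plot. It assumes the raw DOMAIN and TARGET_D columns are still present in the integrated frame.

```python
# Share of cases by DOMAIN category (the information behind the histogram).
print(integrated["DOMAIN"].value_counts(normalize=True).round(3))

# Mean donation amount (TARGET_D) with standard errors by DOMAIN group
# (the information behind the means-with-errors plot).
summary = integrated.groupby("DOMAIN")["TARGET_D"].agg(["mean", "sem", "count"])
print(summary.sort_values("mean", ascending=False))
```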
Conclusion
This exercise illustrates the inherently iterative nature of data preparation. Data preparation begins with data collection and continues through several steps until it reaches the preliminary modeling stage; it is a repetitive process. Balancing the data set is one aspect of data preparation. The balanced and unbalanced data sets were run through three types of classification algorithms; the results for each indicate that balancing the data, whether by adding weights or by under-sampling, enhances model accuracy. In addition to running the three models, I relied on Statistica’s Data Mining Recipe (DMR) to compare the results for the balanced neural network with those of BT. The results are not strictly comparable because in DMR accuracy is assessed relative to the accuracies from the Boosted Trees and SVM algorithms. For example, notice in the examples above that the SANN algorithm trained 5 neural nets with outputs ranging from 8 to 13 (Net Name), whereas DMR trained one neural net (Net 10). There were also differences in the error functions and the output activation between the two methods. In terms of accuracy, DMR is a good automated approach and a useful evaluation tool to derive results without much data preparation. The post-modeling analysis recommends a deeper examination of the predictor DOMAIN: based on the earlier results, the data suggest the possibility of two predictive models based on the socio-economic characteristics of the donors’ neighborhoods.
END