Data Preparation August 24, 2014
UC Irvine Week7_1
PRELIMINARY MODELING REPORT
Marta H Seoane
I. Introduction
The data for this report come from the Paralyzed Veterans of America (PVA), a not-for-profit organization that provides programs and services for US veterans with spinal cord injuries or disease. PVA's in-house database contains information on over 13 million donors, and the organization is one of the largest direct-mail fundraisers in the country. The data set was used in KDD-CUP-98, the Second International Knowledge Discovery and Data Mining Tools Competition, whose goal was to estimate the return from a direct mailing in order to maximize donation profits.[1]
The purpose of this report is to analyze information relevant to the organization's efforts to increase net revenue from renewals by lapsed donors. The donor data originated from three separate data sets that were subsequently merged. The integrated data set includes 36,205 analytical cases and 205 variables. The report is divided into three sections—Methods, Experimental Modeling, and Post Modeling Analysis. The first section describes the data preparation steps used to carry out the analysis, the second provides the results of an experimental modeling analysis using different algorithms and assumptions, and the third proposes steps for further analysis.
II. Methods
Data preparation is the sequence of activities that readies acquired data for analysis and modeling. Typically, data preparation accounts for 80% of all activities leading to model building, training, testing, and deployment. It is through this process that the 200+ variables in our data set were narrowed to the 17 used for modeling. The flow, or the order in which activities are performed, can be delineated in stages (Figure 1). I chose a wheel graph with arrows pointing in opposite directions for two reasons: clockwise, the arrows show the sequence of necessary steps in data preparation; counterclockwise, they convey its iterative nature. The intent is to portray a cycle that begins with the acquisition of entire data sets and ends with the selection of a few variables for modeling. To complete the cycle, the data goes through various stages where it is readied, massaged, transformed, narrowed, and evaluated. Following is a brief description of the typical stages in data preparation with selective references to our case study.[2]
Figure 1. Data Preparation as an Iterative Process
Data Acquisition, Data Description, and Data Integration
• The donor information originated from three separate data sets that were subsequently merged into a single “Integrated” data set using Statistica Data Miner (SDM) and Excel; a rough pandas analogue of this integration and description step is sketched after this list.
• The description of the data is an essential step to learn about the data set, its size, its variables, characteristics, potential, and challenges.
This is important in data mining and in more traditional types of analysis. A data dictionary contained a list of the original variables,
variable labels, names, types, and format. Histograms, scatterplots, and descriptive statistics provided a more intimate knowledge of
target and predictor variables. Correlation matrices helped identify collinearity—a necessary step to narrow down the selection of
potential predictors.
• Variable selection—the process of selecting the variables for modeling begins at this stage, and continues throughout the cycle.
• The activities, operations, and results were documented in a Data Description Report.
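As a rough illustration of this stage (not the Statistica/Excel workflow used in the study), the pandas sketch below merges three hypothetical donor files on a shared ID, prints a basic data description, and computes a correlation matrix to flag collinear predictors. The file names and the CONTROLN key are assumptions, not the actual source files.

```python
import pandas as pd

# Hypothetical file names standing in for the three original donor data sets.
files = ["donors_demographics.csv", "donors_giving_history.csv", "donors_promotions.csv"]
frames = [pd.read_csv(f) for f in files]

# Integrate on a shared donor identifier (CONTROLN, the KDD-Cup-98 ID, is assumed here).
integrated = frames[0]
for frame in frames[1:]:
    integrated = integrated.merge(frame, on="CONTROLN", how="inner")

# Data description: size, variable types, and summary statistics.
print(integrated.shape)                      # cases x variables
print(integrated.dtypes.value_counts())      # counts of variable types
print(integrated.describe(include="all").T.head(20))

# Correlation matrix of numeric predictors to flag highly collinear pairs.
corr = integrated.select_dtypes("number").corr()
pairs = corr.abs().stack()
print(pairs[(pairs > 0.9) & (pairs < 1.0)].sort_values(ascending=False).head(10))
```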
[1] http://kdd.ics.uci.edu/databases/kddcup98/kddcup98.htm — accessed 08/25/2014
[2] There are 10 steps in data preparation: 1. Data access and extraction; 2. Data integration; 3. Data cleansing; 4. Data conditioning; 5. Missing value imputation; 6. New variable derivation; 7. Variable selection; 8. Algorithm selection; 9. Preliminary model building; 10. Feedback. These steps are included in the stages in Figure 1.
[Figure 1: wheel diagram with segments Acquisition, Description, Integration, Quality Assessment, Transformation, Standardization, Balancing | Partitioning, and Modeling around a hub labeled Data Preparation]
Data Quality Assessment
• Data profiling and data cleansing are central to the assessment of data quality.
o Data profiling identifies the number of valid cases, examines measures of central tendency, searches for outliers, missing data, and miscodes, and helps determine what new variables might need to be created for the analysis, e.g. Valid N.
o Data cleansing corrects miscodes (e.g., negative zip codes), resolves inconsistencies in data entries such as gender, and eliminates redundancies (e.g., age vs. date of birth).
• Missing Values and Variable Recoding
o The tools in Statistica provide the means to search for valid data and missing cases—several of the variables in the donor data set were 99% missing. Decisions were made to impute missing values or, in some cases, to delete the variables. Recoding was used to remove outliers, fix incorrect zip codes, and change categorical values from text to numeric, e.g. Homeownr and Gender. Care must be taken when recoding variables, since recoding can increase the number of missing values if data values cannot be calculated.
• A data quality assessment report was generated in Statistica. The report listed, for each variable, the name, number of valid cases, coefficient of variation, and a measure of non-normality (Mean/Median). A minimal pandas analogue of this profiling step is sketched at the end of this section.
• In the initial stages of data preparation, the data quality assessment stage is limited in scope. During the Data Transformation stage,
quality is reassessed more rigorously by running a Data Health Check.
• A Data Quality Assessment Report documented the initial results generated in Statistica and the various steps taken in this stage.
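As a minimal sketch of the same profiling ideas (valid counts, coefficient of variation, and a Mean/Median non-normality measure) written in pandas rather than Statistica, the function below builds a per-variable quality table; the 99% missing-data cutoff is the one described above, and `integrated` is the merged frame from the earlier sketch.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-variable quality summary: valid N, percent missing, CV, and Mean/Median."""
    numeric = df.select_dtypes("number")
    report = pd.DataFrame({
        "valid_n": df.notna().sum(),
        "pct_missing": df.isna().mean() * 100,
    })
    # Coefficient of variation and Mean/Median ratio as a rough skewness indicator
    # (both stay NaN for non-numeric variables).
    report["coef_variation"] = numeric.std() / numeric.mean()
    report["mean_over_median"] = numeric.mean() / numeric.median()
    return report

# Example: drop variables that are almost entirely missing (99% or more missing).
report = quality_report(integrated)
integrated = integrated[report.index[report["pct_missing"] < 99]]
```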
Data Transformation and Variable Selection
• New variables and distribution metrics. New variables are created to assist in analysis and modeling, e.g. “Valid N” to count blanks or missing values, “Mean/Median” to assess normality, and “Random” to randomize the balanced data set prior to modeling. The distribution metrics, for example, rely on the newly created “Mean/Median” variable as a measure of non-normality to determine the extent to which a variable is skewed. Categorical variables, such as the dependent variable, need to be recoded from text to numeric so the algorithms can read them.
• Dummy variables are another form of data transformation, for example transforming the categorical variable Gender from text (M/F) to numeric (M=1, F=0). Dummy variables were also created for Homeownr, Domain, and MdMaude. Numerical variables can also be transformed, such as by combining values from different income variables. These recoding and imputation steps are illustrated in the sketch after this list.
• By now the data has undergone many changes through recoding, deletion, addition, derivation, and standardization. To assess the “health of the data,” a Data Health Check was run in Statistica to count valid cases and identify missing values. Missing values in the Income variable were imputed with the mean.
• Other types of transformation derive new variables from existing ones, such as data abstraction—a variable constructed from other data by arithmetic or logical operations, e.g. lag variables. Lag variables were not used in the donor data set.
• During this stage a list of predictor variables is selected.
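The sketch below illustrates the recoding, dummy-coding, and mean-imputation steps described in this list using pandas. The category codes shown (e.g. "M" for Gender, "H" for Homeownr) are assumptions for illustration, not necessarily the codes used in the report.

```python
import pandas as pd

# Recode categorical text values to numeric indicators (assumed codings).
integrated["GENDER_M"] = (integrated["GENDER"] == "M").astype(int)      # M=1, F=0
integrated["HOMEOWNR_H"] = (integrated["HOMEOWNR"] == "H").astype(int)  # homeowner flag

# One dummy column per DOMAIN category; drop_first avoids perfectly collinear dummies.
domain_dummies = pd.get_dummies(integrated["DOMAIN"], prefix="DOMAIN", drop_first=True)
integrated = pd.concat([integrated, domain_dummies], axis=1)

# Mean imputation for the income variable, as described above.
integrated["INCOME"] = integrated["INCOME"].fillna(integrated["INCOME"].mean())
```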
Data Standardization
• The idea behind standardization is to normalize the data to a common range, for example z-values: z = (value − mean) / standard deviation. The numerical variables were standardized using Statistica. Variable values often differ in range, and parametric statistics will bias results toward the variables with the wider range. Standardization solves this by reducing the numerical variables to a common scale, thus lowering the estimation error. During this process categorical variables are deleted, narrowing the total number of variables in the data set. Statistica offers All Variable Specs, and the Data Transformation group provides an automated Standardization routine. A minimal sketch of the calculation follows this list.
o Variable selection: only the predictors selected for modeling are standardized.
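A minimal z-score standardization sketch applying the formula above to the numeric predictors chosen for modeling; the predictor names are assumptions (scikit-learn's StandardScaler would give the same result).

```python
# z = (value - mean) / standard deviation, applied only to the selected numeric predictors.
numeric_predictors = ["INCOME", "AVGGIFT", "LASTGIFT"]   # assumed predictor names
means = integrated[numeric_predictors].mean()
stds = integrated[numeric_predictors].std()
integrated[numeric_predictors] = (integrated[numeric_predictors] - means) / stds
```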
Balancing, Partitioning, and Modeling
• Balancing, partitioning, and preliminary modeling are very much data preparation activities.
o Algorithms: Algorithms are instrumental in determining how the data is prepared. The algorithms used in the study were Boosted Trees, SANN, and Support Vector Machines; Statistica Data Miner Recipes provided additional results for C&RT.
o Balancing: Some algorithms require a balanced data set to operate properly. A data set can be balanced manually, automatically, or with weights. Balancing matters when a binary-coded target is unevenly distributed: in the donor study, over 80% of the target values were coded as 0 and about 20% as 1. The imbalance causes algorithms that learn case by case to bias predictions toward the common case. Balancing can be achieved by under-sampling (reducing the number of majority cases), over-sampling, or weighting. A sketch of under-sampling, case weighting, and the 60:20:20 partition follows this list.
o Some algorithms balance their data sets internally—Boosted Trees, SANN, CART. Others, such as Support Vector Machines (SVM) and Statistica Data Miner (SDM), accept case-by-case weights.
o Partitioning: training (to calculate weights), testing (for error evaluation between iterations), and validation (to calculate accuracy across all iterations; it is never used in training).
 SANN has a built-in partitioning option, set at 60:20:20.
 Boosted Trees and SVM required a splitting node in Statistica to create training and testing samples.
o Modeling: During preliminary modeling, the algorithms are configured to run on the training, testing, and validation subsets. Accuracy measurements are used to evaluate and compare the results from training and testing the unbalanced and balanced data sets, and to determine whether the models are ready for deployment.
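To make the balancing and partitioning choices concrete, the sketch below shows under-sampling to a 50:50 target mix, a 60:20:20 train/test/validation split, and per-case weights, using pandas and scikit-learn rather than the Statistica nodes used in the study; the weights of 15 and 2 are the ones reported in the next section, while the column names are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Under-sampling: keep all minority (TARGET_B=1) cases and an equal-size random
# sample of majority (TARGET_B=0) cases, then shuffle.
minority = integrated[integrated["TARGET_B"] == 1]
majority = integrated[integrated["TARGET_B"] == 0].sample(n=len(minority), random_state=42)
balanced = pd.concat([minority, majority]).sample(frac=1.0, random_state=42)

# 60:20:20 partition into training, testing, and validation subsets.
train, holdout = train_test_split(balanced, test_size=0.4, random_state=42,
                                  stratify=balanced["TARGET_B"])
test, valid = train_test_split(holdout, test_size=0.5, random_state=42,
                               stratify=holdout["TARGET_B"])

# Alternative to under-sampling: per-case weights for algorithms that accept them.
case_weights = integrated["TARGET_B"].map({1: 15, 0: 2})
```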
III. Experimental Results
Experimental Results (no balancing, manual balancing, balancing with case weights)
• Weight balancing: the case weight for Target_B=1 was set to 15 and for Target_B=0 to 2, weighting Target_B=1 cases 7.5 times as heavily as Target_B=0 cases.
• Use of a Weight variable in SDM
This section shows the results from running the algorithms on the unbalanced and balanced data sets. The workspace below displays the steps leading to the preliminary modeling. The nodes for the data sources represent the unbalanced and balanced data sets. Notice that the connectors for both sets link directly to the SANN Classification Automated Network Search (ANS) node and bypass the node that splits the input data into training and testing samples. The SANN algorithm performs a randomized 80:20 partition of the data set, training on 80% and testing on the rest; in addition, it provides an option to automate partitioning, set in this exercise at 60:20:20 for the training, testing, and validation subsets. In the other two models the data must traverse the Split Input Data node prior to partitioning to fit the BT and SVM algorithms. The workspace below displays the connections from the unbalanced and balanced data sets to the three types of algorithms.
Figure 2. Preliminary Modeling Workspace
A)
SANN: Unbalanced | SANN: Balanced [3]
Data Mining Recipe: Results for Neural Network—Balanced
For the test sample, the balanced SANN outperforms both the unbalanced data set and DMR.
[3] Manual balancing: the total number of records was reduced from 36,205 to 9,687—4,843 records each for Target=1 and Target=0 (under-sampling).
B)
BOOSTED TREES: Unbalanced
Accuracy for the unbalanced test sample is about 86%. The results for the test samples give the advantage to the balanced set: the accuracy value for Target=1 is 89%, and 92% for the DMR model below.
BOOSTED TREES: Balanced
Data Mining Recipes: Results for Boosted Trees—Balanced
C)
SUPPORT VECTOR MACHINE (SVM)
SVM is a supervised classification technique used in machine learning. SVM algorithms analyze and recognize patterns. It is a non-probabilistic method that assigns the data points in a training data set to one of two categories. The model is well suited for the study of the categorical Target_B variable, with categories 1 and 0. In the training sample, the data points are assigned to one side or the other of a boundary drawn in the feature space. The algorithm creates a “gap” (margin) between the categories and, during the testing phase, assigns “1” and “0” values according to the categories defined in the training set. The Statistica SVM model used for the analysis is Classification SVM Type 1 (C-SVM) with a Radial Basis Function kernel, the most popular choice. Cross-validation with SVM uses a form of resampling in which separate models are trained on different random samples of the data set.
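As a rough analogue of the C-SVM with RBF kernel described above (not the Statistica configuration itself), the scikit-learn sketch below trains a weighted RBF-kernel SVM on the standardized predictors from the earlier sketches and estimates accuracy by 5-fold cross-validation; the predictor names, hyperparameters, and class weights are assumptions.

```python
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Standardized predictors and target from the earlier sketches (names are assumptions).
X_train, y_train = train[numeric_predictors], train["TARGET_B"]

# C-SVM with a radial basis function kernel; class_weight stands in for case weighting.
svm = SVC(kernel="rbf", C=1.0, gamma="scale", class_weight={0: 2, 1: 15})
svm.fit(X_train, y_train)

# Accuracy on the held-out test partition.
print("test accuracy:", svm.score(test[numeric_predictors], test["TARGET_B"]))

# Cross-validation: separate models trained and scored on different random folds.
cv_scores = cross_val_score(svm, X_train, y_train, cv=5, scoring="accuracy")
print("5-fold CV accuracy:", cv_scores.mean())
```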
The results below show higher accuracy for the balanced data set, and higher performance for cross-validation relative to the balanced data set.
Support Vector Machine: unbalanced
Support Vector Machine: balanced
CROSS-VALIDATION with SVM
IV. Post Modeling Analysis
The variable DOMAIN, a potential predictor excluded from the final variable list, is worth further investigation. DOMAIN is a symbolic variable representing levels of urbanicity and socio-economic status. The histogram below reveals that three of the categories combined account for 40% of all cases. Further, the means-with-errors plot for DOMAIN and TARGET_D—the dollar amount of donations associated with responses to mailing requests—shows higher values for groups associated with affluent neighborhoods, regardless of geography, and lower values for the remaining groups, suggesting the possibility of two predictive models. Location analysis would enhance the predictive value of the models.
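A quick pandas sketch of the DOMAIN follow-up suggested here: category frequencies and mean TARGET_D with standard errors by DOMAIN group, which is the information behind the histogram and the means-with-errors plot. It assumes the raw DOMAIN and TARGET_D columns are still present in the integrated frame.

```python
# Share of cases by DOMAIN category (the information behind the histogram).
print(integrated["DOMAIN"].value_counts(normalize=True).round(3))

# Mean donation amount (TARGET_D) with standard errors by DOMAIN group
# (the information behind the means-with-errors plot).
summary = integrated.groupby("DOMAIN")["TARGET_D"].agg(["mean", "sem", "count"])
print(summary.sort_values("mean", ascending=False))
```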
Conclusion
This exercise illustrates the inherently iterative nature of data preparation. Data preparation begins with data collection and continues through several steps until it reaches the preliminary modeling stage; it is a repetitive process. Balancing the data set is one aspect of data preparation. The balanced and unbalanced data sets were run through three types of classification algorithms; the results for each indicate that balancing the data, whether by adding weights or by under-sampling, enhances model accuracy. In addition to running the three models, I relied on Statistica’s Data Mining Recipe (DMR) to compare the results for the balanced neural network with those of BT. The results are not strictly comparable because in DMR accuracy is assessed relative to the accuracies from the Boosted Trees and SVM algorithms. For example, notice in the examples above that the SANN algorithm trained 5 neural nets with outputs ranging from 8 to 13 (Net Name), whereas DMR trained one neural net (Net 10). There were also differences in the error functions and the output activation between the two methods. In terms of accuracy, DMR is a good automated approach and a useful evaluation tool to derive results without much data preparation. The post-modeling analysis recommends a deeper examination of the predictor DOMAIN: based on the earlier results, the data suggest the possibility of two predictive models based on the socio-economic characteristics of the donors’ neighborhoods.
END