IDS 572 DATA MINING FOR BUSINESS
ASSIGNMENT 3 - FUNDRAISING 2
Kaushik Kompella
Jaideep Adusumelli
Sushaanth Srirangapathi
1. Modeling Partitioning - Partition the dataset into 60% training and 40% validation (set the seed
to 12345). In the last assignment, you developed decision tree, logistic regression, naïve Bayes,
random forest and boosted tree models. Now, develop support vector machine models for
classification. Examine different parameter values, as you see suitable. Report on what you
experimented with and what worked best.
How do you select the subset of variables to include in the model? What methods do you use to select variables that you feel should be included in the model(s)? Does variable selection make a difference?
Provide a comparative evaluation of performance of your best models from all techniques (including those from part 1, i.e. assignment 2). (Be sure NOT to include "TARGET_D" in your analysis.)
Data Selection: Missing Values
To focus our efforts on the set of variables that would provide the most meaningful information, we started by looking for variables with excessive numbers of missing values.
We chose to eliminate variables with >= 85% of all values missing, which removed a sizeable number of variables. Variables with that many missing values are not useful in predicting donor response: they carry very little information, well below even the low response rate of 5.1%.
Certain notable variables were:
Attribute eliminated    Missing values (out of 9999)
MAILCODE                9857
PVASTATE                9847
RECP3                   9728
RECPGVG                 9983
CHILD03                 9884
CHILD12                 9840
CHILD18                 9746
SOLP3                   9976
MAJOR                   9969
HOMEE                   9902
PHOTO                   9504
KIDSTUFF                9822
CARDS                   9869
PLATES                  9933
RDATE_3                 9947
RDATE_5                 9997
RDATE_6                 9896
RAMNT_3                 9947
RAMNT_4                 9954
RAMNT_5                 9997
RAMNT_6                 9896
RECSWEEP                9826
These variables do not retain enough observed values even relative to the data's 5.1% response rate. So even though they may contain important information, they cannot contribute much towards creating a good model, and we eliminated all of them.
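As an illustration, this cutoff rule amounts to the following pandas sketch (the file name matches our cleaning script in Appendix 1; the 85% threshold is the one described above):

import pandas as pd

df = pd.read_csv("pvaBal35Trg.csv", na_values=[" "], low_memory=False)

# Fraction of missing values per column; drop any column >= 85% missing.
missing_frac = df.isnull().mean()
to_drop = missing_frac[missing_frac >= 0.85].index.tolist()
df = df.drop(columns=to_drop)
print(len(to_drop), "columns dropped")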
We did not eliminate the variables whose values were coded as 'X' versus null. We transformed many of those attributes and discuss them in the missing value replacement section below.
Data Selection: Relevance
Next, we focused on including only variables with relevance to predicting donation probability. We did this by scanning the variable descriptions in the data dictionary and retaining only those that were relevant.
An example of a variable we chose to eliminate is DW2, which corresponds to the locale-specific percentage of single-unit detached structures. Eliminating this variable and others like it reduced the total number of variables we had to analyze.
In addition to the variables we found to be irrelevant, we noticed a great deal of overlap between certain variables. For instance, the 'Wealth1' and 'Wealth2' variables had very similar descriptions in the data dictionary, but Wealth2's description was more robust:
"Wealth rating uses median family income and population statistics from each area to index relative wealth within each state. The segments are denoted 0-9, with 9 being the highest income group and zero being the lowest. Each rating has a different meaning within each state."
"Wealth1", by contrast, was described simply as "wealth rating". Since the two descriptions largely coincide, there is substantial information overlap, which justified dropping the less informative of the two: "Wealth1".
We also found some variables which had no relationship with "TARGET_B" but did have a relationship with "TARGET_D". Some variables of this kind are:
Data Reduction: PCA
We experimented with creating many different sets of PCAs, but found that in many cases we were including variable sets with numerous missing values, or summary variables that already existed. We therefore created only three PCA groups that proved valuable: PCA_CENSUS_X, PCA_EMP_X and PCA_INTERESTS_X.
PCA_CENSUS_X covers the neighborhood census variables.
PCA_EMP_X covers employment information, i.e. whether individuals are army or military veterans, state government employees, federal government employees, etc.
PCA_INTERESTS_X covers the 18 'interest/hobby' variables.
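For illustration, a minimal sklearn sketch of how such a component set can be built (the six columns listed stand in for the full set of 18 interest/hobby variables, and four components is illustrative rather than the exact count we retained):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Subset standing in for the 18 interest/hobby variables.
interest_cols = ["COLLECT1", "VETERANS", "BIBLE", "CATLG", "PETS", "CDPLAY"]

X = StandardScaler().fit_transform(df[interest_cols].fillna(0))
pca = PCA(n_components=4)
scores = pca.fit_transform(X)
for i in range(scores.shape[1]):
    df["PCA_INTERESTS_%d" % (i + 1)] = scores[:, i]
print(pca.explained_variance_ratio_)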
Data Reduction: Random Forests
We ran random forests on all attributes remaining after the steps above and plotted the variable importance graphs. We then selected every variable with a "decrease in Gini index" of at least 0.0064, which gave us 68 variables, including the PCAs and mostly summary variables.
Most of the remaining variables were never used in a split at all, so we were able to eliminate them and improve the overall performance of our model.
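The selection itself ran in our modelling tool, but the rule is equivalent to the sklearn sketch below (sklearn normalizes its impurity-based importances, so the 0.0064 threshold is specific to the scale of the tool that produced it; X and y are the prepared predictors and TARGET_B):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, random_state=12345)
rf.fit(X, y)

# Keep variables whose mean decrease in Gini impurity meets the threshold.
keep = [col for col, imp in zip(X.columns, rf.feature_importances_)
        if imp >= 0.0064]
print(len(keep), "variables retained")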
Data Selection: Missing Value Replacement
We used two RapidMiner operators to transform the variables whose missing-value counts fell below our cutoff: Replace Missing Values and Map.
Replace Missing Values let us impute values where necessary, while the Map operator let us recode variables with encoded characters such as "MDMAUD". Coding these extraneous characters as null values gave us a more accurate picture of which variables actually contained far more missing values than a first glance suggested.
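In pandas terms, the Map plus Replace Missing Values combination looks roughly like this (assuming 'XXXX' is the placeholder used in MDMAUD, and with median imputation standing in for whichever replacement rule the operator applied):

import numpy as np

# Map: recode placeholder characters to proper missing values.
df["MDMAUD"] = df["MDMAUD"].replace("XXXX", np.nan)

# Replace Missing Values: impute the numeric columns.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())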
Data Selection: The Variables Included
CLUSTER DOMAINTYPE WEALTH1
INCOME PEPSTRFL MAXRAMNT
DOMAINSES HIT VIETVETS
WWIIVETS IC1 CARDPROM
MAXADATE CARDPM12 RAMNTALL
NGIFTALL CARDGIFT MINRAMNT
MINRDATE MAXRDATE LASTGIFT
FISTDATE TIMELAG AVGGIFT
CONTROLN TOTALDAYS RECENCYFREQ
LOWINCOME MEDINCOME HIGHINCOME
PCA_NEIGH_PER_1-5 PCA_NEIGH_MMM_1-4 PCA_EMP_1-4
Decision Trees
We chose the J48 version of the decision tree, as it gave us the most accurate results. The parameter that produced the best model was a threshold of 0.4; we enabled the Laplace smoothing parameter for J48 and chose an overall confidence of 0.35.
We initially removed from the model the PCA variable groups we had created. The following confusion matrices are for our J48 without these PCAs:
Accuracy of Training Data set for the best tree: 94.97
Prediction / Actual Actual Good Actual Bad Class precision
Predicted Good 3796 159 95.98
Predicted Bad 143 1901 93.00
Class recall 96.37 92.28
Accuracy of Test Data set for the best tree: 55.67
Prediction / Actual Actual Good Actual Bad Class precision
Predicted Good 1683 895 65.28
Predicted Bad 878 544 38.26
Class recall 65.72 37.80
Our optimal J48 model showed a 55.67% accuracy on the overall test data. The class precision for true 1's and predicted 1's was 38.26%, and the overall class recall was 37.80%. These test results are, however, well below the 94.97% training accuracy, indicating the tree still overfits the training data.
Logistic Regression
Once again, we ran the model with and without the PCAs we generated for the J48 model. In this case we also found that including the PCAs hurt overall model performance. Below is the confusion matrix associated with logistic regression.
Accuracy of Training Data set for the best logistic regression: 54.49
Prediction / Actual Actual Good Actual Bad Class precision
Predicted Good 1476 267 84.68
Predicted Bad 2463 1793 42.13
Class recall 37.47 87.04
Accuracy of Test Data set for the best logistic regression: 48.08
Prediction / Actual Actual Good Actual Bad Class precision
Predicted Good 814 330 71.15
Predicted Bad 1747 1109 38.83
Class recall 31.78 77.07
We then compared the training results from our logistic regression to the validation data and found our best LR model; the associated test-data confusion matrix is shown above.
We maintained an overall model accuracy of 48.08%. The class precision on the test data was very similar to that of the training data: we obtained a 38.83% precision with an overall class recall of 77.07%. We used the default parameters for logistic regression, as these gave us the highest overall accuracy.
Naïve Bayes
After running J48 and logistic regression, we attempted to find a better model using Naïve Bayes. We applied the same subset of variables, both with and without PCAs, to find the best Naïve Bayes model. Our best model again contained none of the PCAs but, as with the other models, contained all of the created variables described in the initial data preprocessing steps.
Accuracy of Training Data set for the best Naïve Bayes model: 63.68
Prediction / Actual Actual Good Actual Bad Class precision
Predicted Good 2888 1128 71.99
Predicted Bad 1051 932 47.00
Class recall 73.32 45.24
Accuracy of Test Data set for the best Naïve Bayes model: 59.35
Prediction / Actual Actual Good Actual Bad Class precision
Predicted Good 1745 810 68.30
Predicted Bad 816 629 43.53
Class recall 68.14 43.71
The overall model accuracy is 59.35%, with a class precision of 43.53% and a class recall of 43.71%.
We compared the accuracy of our Naïve Bayes model against the associated testing data; the resulting confusion matrix is shown above.
Here we see a rise in overall model accuracy on the validation data set, and similar class precision and class recall between testing and training data, suggesting we had found our best and most stable Naïve Bayes model.
Support Vector Machines
Initially we deployed an SVM model on our data but got a severe overfit. We were using a radial basis function kernel on default settings. In addition, we changed the threshold to 0.3, 0.5, and 0.65, but still saw a massive overfit on our training data.
Below is the confusion matrix associated with the SVM training data:
Accuracy of Training Data set for the best SVM: 59.21
Prediction / Actual Actual Good Actual Bad Class precision
Predicted Good 3064 1572 66.09
Predicted Bad 875 488 35.80
Class recall 77.79 23.69
Accuracy of Test Data set for the best SVM: 59.55
Prediction / Actual Actual Good Actual Bad Class precision
Predicted Good 2048 1105 64.95
Predicted Bad 513 334 39.43
Class recall 79.97 23.21
We analyzed many different kernel types, such as polynomial, dot, and Gaussian combination, and also experimented with the C penalty parameter, and determined that the support vector machine was not the algorithm best suited to our data; it did not generalize well.
We could generate an overall accuracy of 59.21% with this model, but with a lower class precision of 35.8%. The class recall was also worse than in all other models, at 23.69% on true 1's/predicted 1's.
Our test data had an overall accuracy of 59.55%, but still with a very low class precision of 39.43% and a class recall of 23.21%.
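A sketch of that parameter sweep, with sklearn standing in for the RapidMiner SVM operator (RapidMiner's 'dot' kernel corresponds to sklearn's linear kernel; X_train/X_test and y_train/y_test are assumed prepared and scaled):

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

for kernel in ["rbf", "poly", "linear"]:
    for C in [0.1, 1.0, 10.0]:
        svm = SVC(kernel=kernel, C=C, random_state=12345)
        svm.fit(X_train, y_train)
        print(kernel, C,
              accuracy_score(y_train, svm.predict(X_train)),
              accuracy_score(y_test, svm.predict(X_test)))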
KNN
We used the following parameters to generate our best KNN model:
K = 150; Measure: Mixed, Mixed-Measure: Mixed Euclidean
That is, a k-value of 150 with a threshold of 0.33, using the Mixed Euclidean measure as the distance metric.
Below is the confusion matrix of our training data for KNN:
Accuracy of Training Data set for the best KNN model: 54.53
Prediction / Actual Actual Good Actual Bad Class precision
Predicted Good 2004 793 71.65
Predicted Bad 1935 1267 39.57
Class recall 50.88 61.50
Accuracy of Test Data set for the best KNN model: 50.48
Prediction / Actual Actual Good Actual Bad Class precision
Predicted Good 1151 571 66.84
Predicted Bad 1410 868 38.10
Class recall 44.94 60.32
Using the parameters above, we obtained an overall accuracy of 54.53% on the training data. The class precision was lower than that of other models, at 39.57%; the class recall for predicted and true 1's was 61.5% overall.
Our testing confusion matrix gave an accuracy of 50.48%, with a class precision of 38.1% but an overall class recall of 60.32% on predicted 1's and true 1's.
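The same model can be sketched in sklearn (its Euclidean metric covers only numeric attributes, whereas RapidMiner's Mixed Euclidean also handles nominal ones, so this is an approximation); the 0.33 cutoff is applied to the predicted class-1 probability rather than the default 0.5:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=150, metric="euclidean")
knn.fit(X_train, y_train)

probs = knn.predict_proba(X_test)[:, 1]
preds = (probs >= 0.33).astype(int)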
Random Forest
Initially we used the same set of attributes we deployed in our SVM model and could only generate ~30% test accuracy against ~80% training accuracy; we allowed too many trees to build, and as a result the model overfit.
By limiting the depth of the trees, we generated a much higher accuracy on the test data. Below is our random forest confusion matrix for the training data:
Accuracy of Training Data set for the best random forest: 60.06
Prediction / Actual Actual Good Actual Bad Class precision
Predicted Good 2101 558 79.01
Predicted Bad 1838 1502 44.97
Class recall 53.34 72.91
Accuracy of Test Data set for the best random forest: 53.80
Prediction / Actual Actual Good Actual Bad Class precision
Predicted Good 1212 499 70.84
Predicted Bad 1349 940 41.07
Class recall 47.33 65.32
We obtained an overall accuracy of 60.06% on training data, with a class precision of 44.97% and a class recall of 72.91% for predicted and true 1's.
On testing data set we obtained a reasonably close accuracy of 53.8%. The class precision was 41.07%
with a class recall close to that of the training data at 65.32%.
Gradient Boosted Trees:
Accuracy of Training Data set for the best gradient boosted trees: 72.03
Prediction / Actual Actual Good Actual Bad Class precision
Predicted Good 2706 445 85.88
Predicted Bad 1233 1615 56.71
Class recall 68.70 78.40
Accuracy of Test Data set for the best gradient boosted trees: 55.62
Prediction / Actual Actual Good Actual Bad Class precision
Predicted Good 1428 642 68.99
Predicted Bad 1133 797 41.30
Class recall 55.76 55.39
The parameters used were:
Number of trees: 40, Maximal Depth: 5, Minimum Rows: 20, Minimum Split Improvement: 0.0025,
Number of Bins: 25, Learning Rate: 0.1, Sample Rate: 1, Distribution: Auto
From the matrices above we found that gradient boosted trees gave the best model we could obtain, with good accuracy and precision and high recall, which would help us maximize profit.
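For reference, an approximate sklearn translation of these parameters (min_samples_leaf stands in for "Minimum Rows"; "Number of Bins" and "Minimum Split Improvement" have no direct equivalent here; X_train/y_train as before):

from sklearn.ensemble import GradientBoostingClassifier

gbt = GradientBoostingClassifier(n_estimators=40, max_depth=5,
                                 min_samples_leaf=20, learning_rate=0.1,
                                 subsample=1.0, random_state=12345)
gbt.fit(X_train, y_train)
print(gbt.score(X_test, y_test))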
2. Our overall goal is to identify which individuals to target for maximum donations (profit). We
will try two approaches for this:
(i) using the response model, together with average donation and mailing costs information, to identify the most profitable individuals to target (Q 2.1 below)
(ii) develop a second model on TARGET_D, and combine this with the response model to identify the most profitable individuals to target (Q 2.2 below)
2.1 (a) What is the 'best' model for each method in Question 1 for maximizing revenue? Calculate the net profit for both the training and validation set based on the actual response rate (5.1%). We can calculate the net profit from given information - the expected donation, given that they are donors, is $13.00, and the total cost of each mailing is $0.68. Note: to calculate estimated net profit (on data with the 'natural' response rate of 5.1%), we will need to "undo" the effects of the weighted sampling, and calculate the net profit that reflects the actual response distribution of 5.1% donors and 94.9% non-donors.)
After choosing the best models from the methods above, we determined that our best model was Gradient Boosted Trees, based not just on accuracy but on the net profit obtained.
We needed to undo the effects of the weighted sampling, which we did as follows.
First, set the weighted profit and the weighted cost:
Weighted profit: $13.00 - $0.68 = $12.32
Weighted cost: ($0.68)
Then calculate the adjusted (maximum) profit per donor at the natural response rate:
(12.32 * 0.051) / (3499/9999) = 1.795
and the adjusted cost per non-donor:
(0.68 * 0.949) / (6500/9999) = 0.992
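The same adjustment as a small Python check (3499 donors and 6500 non-donors are the counts in our weighted sample of 9,999 records):

DONATION, MAIL_COST = 13.00, 0.68
n_donors, n_nondonors, n_total = 3499, 6500, 9999

# Rescale per-record profit and cost from the weighted sample back to the
# natural 5.1% / 94.9% response distribution.
adj_profit = (DONATION - MAIL_COST) * 0.051 / (n_donors / n_total)   # ~1.795
adj_cost = MAIL_COST * 0.949 / (n_nondonors / n_total)               # ~0.992
print(round(adj_profit, 3), round(adj_cost, 3))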
The lift analysis for the logistic regression gave a maximum profit of 306.679.
(b) Summarize the performance of the ‘best’ model from each method, in terms of net profit from
predicting donors in the validation dataset; at what cutoff is the best performance obtained?
Draw profit curves: Draw each model's net cumulative profit curve for the validation set onto a single graph. Are there any models that dominate?
Best Model: From your answers above, what do you think will be the “best” model to implement?
(What criteria do you use to determine ‘best’?)
Our best model is Gradient Boosted Trees. We used the max-profit lift criterion but also considered the highest class recall (55.39%) and class precision (41.30%). Our next best model, logistic regression, has a class precision of 38.83% and a class recall of 77.07%, but a higher lift curve. From a business standpoint it does not make sense to evaluate our best model on lift alone, since by that criterion the J48, which does not generalize well, would be chosen to predict donors.
The performance of all models is as follows:

Model            Profit (performance)
J-48             $105.50
Logistic         $257.63
Naïve Bayes      $319.58
Random Forests   $349.10
KNN              $159.34
SVM              $90.63
The cutoff we used in Gradient Boosted Trees was a confidence of 0.496901.
[Figure: Cumulative benefit curve of Decision Tree (validation set)]
According to the lift chart, the best model was J-48. However, Gradient Boosted Trees had the best chance of generalizing well to unseen data.
[Figure: Cumulative benefit curve for Gradient Boosted Trees]
[Figure: Cumulative benefit curve for Naive Bayes]
[Figure: untitled cumulative benefit curve]
2.2. (a) We will next develop a model for the donated amount (TARGET_D). Note that TARGET_D has values only for those individuals who are donors (that is, TARGET_D values are defined only for cases where TARGET_B is 1). What data will you use to develop a model for TARGET_D? (Non-donors, obviously, do not have any donation amount -- should you consider these as $0.0 donation, or impute missing values here? Should non-donors be included for developing the model to predict donation amount? Also, should cases with rare very large donation amounts be excluded? [Hah! – leading questions])
Develop a model for TARGET_D. What modeling method do you use (report on any one). Which
variables do you use? What variable selection methods do you use? Report on performance.
For the cases with rare, very large donation amounts, we will not exclude them, as we would lose out on predicting our most valuable donors; these donations might be large for reasons supported by the values of other variables, and we did not want to lose important information.
We will use a linear regression model to predict TARGET_D, with a 60-40 split validation on the data set.
Variable selection methods
After applying the same data cleaning and missing-value treatment used on the data set for TARGET_B, we looked at scatter plots of the independent variables against TARGET_D and eliminated variables that showed no significant trend and/or relationship with TARGET_D. Below are a few variables we selected for the TARGET_D regression model, with their scatter plots against TARGET_D.
Fig.1 RAMNT_18 vs TARGET_D
Fig.2 RAMNT_8 vs TARGET_D
Fig.3 LASTGIFT vs TARGET_D
Fig.4 RAMNTALL vs TARGET_D
Fig.5 HIT vs TARGET_D
As can be seen from the plots, the above variables had a significant relationship with TARGET_D, so we selected them for the linear regression model predicting TARGET_D.
Below are the scatter plots against TARGET_D for the variables we eliminated.
Fig.6 AGE vs TARGET_D
Fig.7 ADI vs TARGET_D
Fig.8 DMA vs TARGET_D
Fig.9 CARDPROM vs TARGET_D
These variables did not show any particular trend with the TARGET_D variable and hence were eliminated.
After the scatter plots, we had narrowed the list of variables from 481 to 261. We then used a correlation matrix to make the final selection of variables for modelling.
The highest correlation (R) of any independent variable with TARGET_D was 0.2, for LASTGIFT, followed by NGIFTALL, AVGGIFT, RAMNTALL and MINRAMNT. We chose all variables with an R value of at least 0.05 with TARGET_D, finalizing a list of 15 independent variables for the model. Below is the list of variables used in our best linear regression model, obtained through t-test feature selection.
The variables we got were:
HIT WEALTH_1 NUMPROM RAMNT_8 RAMNT_12
RAMNT_14 RAMNT_16 RAMNT_18 RAMNT_22 RAMNTALL
NGIFTALL MINRAMNT MAXRAMNT LASTGIFT AVGGIFT
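In pandas, the correlation screen described above reduces to roughly the following (donor-only rows, since TARGET_D is defined only for them):

# Correlation of each numeric predictor with TARGET_D among donors.
donors = df[df["TARGET_B"] == 1]
numeric = donors.select_dtypes("number").drop(columns=["TARGET_D", "TARGET_B"])
corr = numeric.corrwith(donors["TARGET_D"])

# Keep predictors with R >= 0.05; LASTGIFT tops the list at roughly 0.2.
selected = corr[corr >= 0.05].sort_values(ascending=False)
print(selected.head())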
Performance:
Our best model gave the following root-mean-squared error on validation:
Root-mean-squared error: 9.949 +/- 0.000
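Continuing the sketch above, the fit and its validation RMSE (selected_vars is the 15-variable list above; the split mirrors our 60-40 validation):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_tr, X_va, y_tr, y_va = train_test_split(
    donors[selected_vars], donors["TARGET_D"],
    train_size=0.6, random_state=12345)

lr = LinearRegression().fit(X_tr, y_tr)
rmse = np.sqrt(mean_squared_error(y_va, lr.predict(X_va)))
print(rmse)   # compare with the 9.949 reported above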
(b) How can you use the results from the response model together with results from the donation amount model to identify targets? (Hint: The response model estimates the probability of response p(y=1|x). The donation amount model estimates the conditional donation amount, E[amt | x, y=1]. The product of these gives …..?)
How do you identify individuals to target, when combining information from both models? [Sorted values of predicted donation amount? What threshold would you use? Or maybe you prefer to target all individuals with predicted amounts greater than $0.68 (why?), and/or …..]
Given the confidence probabilities of donors (TARGET_B = 1) from the classification model and the predicted donation amounts (TARGET_D) from the regression model, we multiply the two values to combine the models. The product of the class-1 confidence from the classification model and the donation predicted by the linear regression model gives the most probable donation from each responder.
With the classification model alone we only predict which individuals will respond to our promotion; the regression model adds the probable donation amount from those responders. Combining the two models therefore plays a very important role.
The modelling technique we used is Gradient Boosted Trees, which has an accuracy of approximately 50%, so we selected all individuals predicted to donate $2 or more. (With 50% accuracy we assume that, of every two predicted responders, one will actually respond, so that one individual must cover the promotional cost for both, downscaled to $0.9928 each, i.e. $1.9856 for two individuals.)
Hence, to cover our expenses, we need individuals who would donate a minimum of about $2.
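One way to operationalize this combination, assuming the fitted gbt response model and lr amount model sketched earlier and a prepared scoring matrix X_score:

import numpy as np

# Expected donation = P(respond) * E[amount | respond].
p_respond = gbt.predict_proba(X_score)[:, 1]
pred_amount = lr.predict(X_score)
expected = p_respond * pred_amount

# Mail everyone whose expected donation covers the adjusted cost (~$0.992);
# at ~50% accuracy this is close to the "predicted amount >= ~$2" rule above.
order = np.argsort(-expected)
mail = order[expected[order] >= 0.992]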
5. Testing – choose one model, either the one from 2.1 or 2.2 above, based on performance on the test data.
The file FutureFundraising.xls contains the attributes for future mailing candidates. Using your "best" model from Step 2, which of these candidates do you predict as donors and non-donors? List them in descending order of probability of being a donor / prediction of donation amount. What cutoff do you use? Submit this file (xls format) with your best model's predictions (prob of being a donor).
For scoring the FutureFundraising.xls file, the best model we used is Gradient Boosted Trees. After applying the model, the predictions were:
Number of donors: 10857
Number of non-donors: 9143
Cumulative profit: ~$24,604
The cutoff used to separate donors from non-donors is a confidence of predicting 1s of 0.496901.
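A sketch of the scoring step (model_vars is the predictor list the response model was trained on; recent pandas writes .xlsx, the submitted .xls having come from the modelling tool itself):

import pandas as pd

future = pd.read_excel("FutureFundraising.xls")
future["prob_donor"] = gbt.predict_proba(future[model_vars])[:, 1]
future["prediction"] = (future["prob_donor"] >= 0.496901).astype(int)
future.sort_values("prob_donor", ascending=False) \
      .to_excel("Future_Data_Prediction.xlsx", index=False)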
Note: the list of predictions using the best model is uploaded to Blackboard (Future_Data_Prediction.xls).
Appendix-1
Python code used for cleaning the data:
import numpy as np
import pandas as pd
drop_column = ["ODATEDW", "OSOURCE", "TCODE", "ZIP", "MAILCODE", "PVASTATE", "DOB", "NOEXCH", "CLUSTER",
               "AGEFLAG", "NUMCHLD", "HIT", "DATASRCE", "MALEMILI", "MALEVET", "VIETVETS", "WWIIVETS",
               "LOCALGOV", "STATEGOV", "FEDGOV", "GEOCODE", "HHP1", "HHP2",
               "DW1", "DW2", "DW3", "DW4", "DW5", "DW6", "DW7", "DW8", "DW9",
               "HV1", "HV2", "HV3", "HV4", "HU1", "HU2", "HU3", "HU4", "HU5",
               "HHD1", "HHD2", "HHD3", "HHD4", "HHD5", "HHD6", "HHD7", "HHD8", "HHD9", "HHD10",
               "HHD11", "HHD12", "HUR1", "HUR2", "RHP1", "RHP2", "RHP3", "RHP4",
               "HUPA1", "HUPA2", "HUPA3", "HUPA4", "HUPA5", "HUPA6", "HUPA7",
               "RP1", "RP2", "RP3", "RP4", "MSA", "ADI", "DMA", "MC1", "MC2", "MC3",
               "TPE1", "TPE2", "TPE3", "TPE4", "TPE5", "TPE6", "TPE7", "TPE8", "TPE9",
               "PEC1", "PEC2", "TPE10", "TPE11", "TPE12", "TPE13",
               "ANC1", "ANC2", "ANC3", "ANC4", "ANC5", "ANC6", "ANC7", "ANC8", "ANC9", "ANC10",
               "ANC11", "ANC12", "ANC13", "ANC14", "ANC15", "POBC1", "POBC2",
               "LSC1", "LSC2", "LSC3", "LSC4", "VOC1", "VOC2", "VOC3",
               "ADATE_2", "ADATE_3", "ADATE_4", "ADATE_5", "ADATE_6", "ADATE_7", "ADATE_8",
               "ADATE_9", "ADATE_10", "ADATE_11", "ADATE_12", "ADATE_13", "ADATE_14", "ADATE_15",
               "ADATE_16", "ADATE_17", "ADATE_18", "ADATE_19", "ADATE_20", "ADATE_21", "ADATE_22",
               "ADATE_23", "ADATE_24", "MAXADATE",
               "RDATE_3", "RDATE_4", "RDATE_5", "RDATE_6", "RDATE_7", "RDATE_8", "RDATE_9",
               "RDATE_10", "RDATE_11", "RDATE_12", "RDATE_13", "RDATE_14", "RDATE_15", "RDATE_16",
               "RDATE_17", "RDATE_18", "RDATE_19", "RDATE_20", "RDATE_21", "RDATE_22", "RDATE_23",
               "RDATE_24", "MINRDATE", "MAXRDATE", "LASTDATE", "FISTDATE", "NEXTDATE", "CONTROLN",
               "TARGET_D", "HPHONE_D", "RFA_2R", "RFA_2F", "RFA_2A", "MDMAUD_R", "MDMAUD_F",
               "MDMAUD_A", "CLUSTER2", "GEOCODE2", "MDMAUD"]

df = pd.read_csv(r"C:\Users\tyrion\Documents\IDS_572_notes\assign2\pvaBal35Trg.csv",
                 sep=',', na_values=[' '], low_memory=False)
df.drop(drop_column, axis=1, inplace=True)
list_string = []
non_list_string = []

# Numeric columns get their missing values filled with -1; string columns
# with missing values are collected in list_string for recoding below.
for c in df.columns:
    is_numeric = any(not isinstance(x, str) for x in df[c].dropna())
    if is_numeric:
        df[c].fillna(-1, inplace=True)
    elif df[c].isnull().values.any():
        list_string.append(c)
    else:
        non_list_string.append(c)
# Recode flag-style string columns: presence marks like 'X'/'Y' become 1,
# 'M'/'F' becomes 1/0, and HOMEOWNR's 'H' becomes 1.
for val in list_string:
    values = set(df[val].dropna().unique())
    if "X" in values:
        df[val] = df[val].replace({'X': 1}, regex=False)
        df[val].fillna(0, inplace=True)
    elif "M" in values:
        df[val].fillna(-1, inplace=True)
        df[val] = df[val].replace({'M': 1, 'F': 0}, regex=False)
    elif "Y" in values:
        df[val] = df[val].replace({'Y': 1}, regex=False)
        df[val].fillna(-1, inplace=True)
    if val == "HOMEOWNR":
        df[val].fillna(0, inplace=True)
        df[val] = df[val].replace({'H': 1}, regex=False)
new_attri = []

# Split each RFA_* code (e.g. 'A2G') into Recency / Frequency / Amount parts
# and recode the letter categories to integers.
for val in list_string:
    if val.startswith("RFA"):
        df[val].fillna("Z5Z", inplace=True)   # 'Z5Z' acts as a missing-code sentinel
        r = val + "_R"
        f = val + "_F"
        c = val + "_C"
        df[r] = df[val].str.extract('([FNALISZ])', expand=True)
        df[f] = df[val].str.extract(r'(\d)', expand=True)
        df[c] = df[val].str.extract(r'[a-zA-Z][\d]([a-zA-Z])', expand=True)
        df[r] = df[r].replace({'F': 0, 'N': 1, 'A': 2, 'L': 3, 'I': 4, 'S': 5, 'Z': 6}, regex=False)
        df[c] = df[c].replace({'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6, 'Z': 7}, regex=False)
        new_attri.append(r)
        new_attri.append(f)
        new_attri.append(c)
        del df[val]
val = "RFA_2"
df[val].fillna("Z5Z", inplace=True)
r = val + "_R"
f = val + "_F"
c = val + "_C"
df[r] = df[val].str.extract('([FNALISZ])',expand = True)
df[f] = df[val].str.extract('(d)', expand=True)
df[c] = df[val].str.extract('[a-zA-Z][d]([a-zA-Z])', expand=True)
"""for che in range(1,9999,1):
if df[val].iloc[che] == "Z5Z": print che
print df[f].iloc[128]
"""
df[r] = df[r].replace({'F': 0,'N': 1,'A': 2,'L': 3,'I': 4,'S': 5,'Z': 6}, regex=False)
df[c] = df[c].replace({'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6,'Z':7}, regex=False)
new_attri.append(r)
new_attri.append(f)
new_attri.append(c)
del df[val]
val = "DOMAIN"
df[val].fillna("Z5", inplace=True)
domain_att_1 = val + '_urban_level'
domain_att_2 = val + "_economic_status"
df[domain_att_1] = df[val].str.extract('([UCSTRZ])',expand = True)
df[domain_att_2] = df[val].str.extract('(d)', expand=True)
df[domain_att_1] = df[domain_att_1].replace({'U': 0,'C': 1,'S': 2,'T': 3,'R': 4,'Z':5}, regex=False)
# new_attri.append(domain_att_1)
# new_attri.append(domain_att_2)
del df[val]
# exporting the cleaned dataframe to csv
df.to_csv('cleaned_PVA_data.csv')

print(np.count_nonzero(df.columns.values))
pca_variables = ["CHILD03", "CHILD07", "CHILD12", "CHILD18", "MBCRAFT", "MBGARDEN", "MBBOOKS",
                 "MBCOLECT", "MAGFAML", "MAGFEM", "MAGMALE", "PUBGARDN", "PUBCULIN", "PUBHLTH",
                 "PUBDOITY", "PUBNEWFN", "PUBPHOTO", "PUBOPP", "COLLECT1", "VETERANS", "BIBLE",
                 "CATLG", "HOMEE", "PETS", "CDPLAY", "STEREO", "PCOWNERS", "PHOTO", "CRAFTS",
                 "FISHER", "GARDENIN", "BOATS", "WALKER", "KIDSTUFF", "CARDS", "PLATES", "LIFESRC",
                 "PEPSTRFL", "POP901", "POP902", "POP903", "POP90C1", "POP90C2", "POP90C3",
                 "POP90C4", "POP90C5",
                 "ETH1", "ETH2", "ETH3", "ETH4", "ETH5", "ETH6", "ETH7", "ETH8", "ETH9", "ETH10",
                 "ETH11", "ETH12", "ETH13", "ETH14", "ETH15", "ETH16",
                 "AGE901", "AGE902", "AGE903", "AGE904", "AGE905", "AGE906", "AGE907",
                 "CHIL1", "CHIL2", "CHIL3", "AGEC1", "AGEC2", "AGEC4", "AGEC5", "AGEC6", "AGEC7",
                 "CHILC1", "CHILC2", "CHILC3", "CHILC4", "CHILC5", "HHAGE1", "HHAGE2", "HHAGE3",
                 "HHN1", "HHN2", "HHN3", "HHN4", "HHN5", "HHN6", "MARR1", "MARR2", "MARR3", "MARR4",
                 "ETHC1", "ETHC2", "ETHC3", "ETHC4", "ETHC5", "ETHC6",
                 "HVP1", "HVP2", "HVP3", "HVP4", "HVP5", "HVP6",
                 "IC1", "IC2", "IC3", "IC4", "IC5", "IC6", "IC7", "IC8", "IC9", "IC10", "IC11",
                 "IC12", "IC13", "IC14", "IC15", "IC16", "IC17", "IC18", "IC19", "IC20", "IC21",
                 "IC22", "IC23", "HHAS1", "HHAS2", "HHAS3", "HHAS4",
                 "LFC1", "LFC2", "LFC3", "LFC4", "LFC5", "LFC6", "LFC7", "LFC8", "LFC9", "LFC10",
                 "OCC1", "OCC2", "OCC3", "OCC4", "OCC5", "OCC6", "OCC7", "OCC8", "OCC9", "OCC10",
                 "OCC11", "OCC12", "OCC13",
                 "EIC1", "EIC2", "EIC3", "EIC4", "EIC5", "EIC6", "EIC7", "EIC8", "EIC9", "EIC10",
                 "EIC11", "EIC12", "EIC13", "EIC14", "EIC15", "EIC16",
                 "OEDC1", "OEDC2", "OEDC3", "OEDC4", "OEDC5", "OEDC6", "OEDC7",
                 "EC1", "EC2", "EC3", "EC4", "EC5", "EC6", "EC7", "EC8",
                 "SEC1", "SEC2", "SEC3", "SEC4", "AFC1", "AFC2", "AFC3", "AFC4", "AFC5", "AFC6",
                 "VC1", "VC2", "VC3", "VC4",
                 "HC1", "HC2", "HC3", "HC4", "HC5", "HC6", "HC7", "HC8", "HC9", "HC10", "HC11",
                 "HC12", "HC13", "HC14", "HC15", "HC16", "HC17", "HC18", "HC19", "HC20", "HC21",
                 "MHUC1", "MHUC2", "AC1", "AC2",
                 "RAMNT_3", "RAMNT_4", "RAMNT_5", "RAMNT_6", "RAMNT_7", "RAMNT_8", "RAMNT_9",
                 "RAMNT_10", "RAMNT_11", "RAMNT_12", "RAMNT_13", "RAMNT_14", "RAMNT_15", "RAMNT_16",
                 "RAMNT_17", "RAMNT_18", "RAMNT_19", "RAMNT_20", "RAMNT_21", "RAMNT_22", "RAMNT_23",
                 "RAMNT_24", "RAMNTALL", "NGIFTALL", "CARDGIFT", "MINRAMNT", "MAXRAMNT", "TIMELAG",
                 "AVGGIFT"] + new_attri   # extend with the derived RFA columns, not nest them

print(len(pca_variables))
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler

# NORMALIZE DATA (scaling options we tried, left disabled):
# std_scale = preprocessing.StandardScaler().fit(df[pca_variables])
# df_std = std_scale.transform(df[pca_variables])
# minmax_scale = preprocessing.MinMaxScaler().fit(df[pca_variables])
# df_minmax = minmax_scale.transform(df[pca_variables])

"""
Disabled: manual PCA via the covariance matrix of the standardized data.
df1 = df[[c for c in df.columns if c in pca_variables]]
X_std = StandardScaler().fit_transform(df1)
mean_vec = np.mean(X_std, axis=0)
cov_mat = (X_std - mean_vec).T.dot((X_std - mean_vec)) / (X_std.shape[0] - 1)
print('Covariance matrix\n%s' % cov_mat)
cov_mat = np.cov(X_std.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
print('Eigenvectors\n%s' % eig_vecs)
print('\nEigenvalues\n%s' % eig_vals)
"""
Appendix-2
Process chart:
Fig.1 Classification modelling process in RapidMiner.
Fig.2 Regression modelling process in RapidMiner.

More Related Content

DOCX
Clustering Techniques Review
PPTX
Classification
PPTX
multiple linear regression
PPTX
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
PPT
Basic Descriptive Statistics
PDF
Visual Explanation of Ridge Regression and LASSO
PPTX
Machine Learning - Splitting Datasets
PPTX
Data clustring
Clustering Techniques Review
Classification
multiple linear regression
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
Basic Descriptive Statistics
Visual Explanation of Ridge Regression and LASSO
Machine Learning - Splitting Datasets
Data clustring

What's hot (20)

PPTX
Nearest Neighbor Algorithm Zaffar Ahmed
PPTX
Spectral clustering
PPTX
Basics of Educational Statistics (Descriptive statistics)
PDF
Community detection in graphs
PDF
Mean shift and Hierarchical clustering
PDF
Normal Distribution.pdf
PPTX
Unsupervised learning (clustering)
PDF
Ordinal logistic regression
PPT
Descriptive statistics
PPTX
Knn 160904075605-converted
PDF
Visualizing Data Using t-SNE
PDF
Decision tree lecture 3
PPTX
Birch Algorithm With Solved Example
PPTX
Dimension Reduction: What? Why? and How?
PDF
Data Science - Part V - Decision Trees & Random Forests
PDF
DMTM Lecture 15 Clustering evaluation
PPT
K means Clustering Algorithm
PPTX
Loan default prediction with machine language
PDF
Team project - Data visualization on Olist company data
PDF
Data Science interview questions of Statistics
Nearest Neighbor Algorithm Zaffar Ahmed
Spectral clustering
Basics of Educational Statistics (Descriptive statistics)
Community detection in graphs
Mean shift and Hierarchical clustering
Normal Distribution.pdf
Unsupervised learning (clustering)
Ordinal logistic regression
Descriptive statistics
Knn 160904075605-converted
Visualizing Data Using t-SNE
Decision tree lecture 3
Birch Algorithm With Solved Example
Dimension Reduction: What? Why? and How?
Data Science - Part V - Decision Trees & Random Forests
DMTM Lecture 15 Clustering evaluation
K means Clustering Algorithm
Loan default prediction with machine language
Team project - Data visualization on Olist company data
Data Science interview questions of Statistics
Ad

Similar to Classification modelling review (20)

PDF
Data Mining using SAS
DOCX
Predictive Modelling & Market-Basket Analysis.
PDF
German credit score shivaram prakash
PDF
Predictive modeling
PDF
Internship project report,Predictive Modelling
PDF
Preprocessing of Low Response Data for Predictive Modeling
PPTX
DataAnalyticsIntroduction and its ci.pptx
PPTX
Influence of the Event Rate on Discrimination Abilities of Bankruptcy Predict...
DOCX
Big Data Analytics Tools..DS_Store__MACOSXBig Data Analyti.docx
PPTX
Predicting Employee Attrition
DOCX
Data Analytics Using R - Report
PPTX
The 8 Step Data Mining Process
PPTX
BMDSE v1 - Data Scientist Deck
PPTX
Informs presentation new ppt
PDF
Preliminary Modeling Report
PPTX
Predict Backorder on a supply chain data for an Organization
PDF
IRJET- Performance Evaluation of Various Classification Algorithms
PDF
IRJET- Performance Evaluation of Various Classification Algorithms
PPTX
JamieStainer ATA SCIEnCE path finder.pptx
PPTX
AI AND DATA SCIENCE generative data scinece.pptx
Data Mining using SAS
Predictive Modelling & Market-Basket Analysis.
German credit score shivaram prakash
Predictive modeling
Internship project report,Predictive Modelling
Preprocessing of Low Response Data for Predictive Modeling
DataAnalyticsIntroduction and its ci.pptx
Influence of the Event Rate on Discrimination Abilities of Bankruptcy Predict...
Big Data Analytics Tools..DS_Store__MACOSXBig Data Analyti.docx
Predicting Employee Attrition
Data Analytics Using R - Report
The 8 Step Data Mining Process
BMDSE v1 - Data Scientist Deck
Informs presentation new ppt
Preliminary Modeling Report
Predict Backorder on a supply chain data for an Organization
IRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification Algorithms
JamieStainer ATA SCIEnCE path finder.pptx
AI AND DATA SCIENCE generative data scinece.pptx
Ad

Recently uploaded (20)

PPTX
STUDY DESIGN details- Lt Col Maksud (21).pptx
PPTX
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
PDF
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
PPTX
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
PDF
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
PPTX
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
PPTX
Business Ppt On Nestle.pptx huunnnhhgfvu
PPTX
Database Infoormation System (DBIS).pptx
PPTX
Introduction to Knowledge Engineering Part 1
PPTX
advance b rammar.pptxfdgdfgdfsgdfgsdgfdfgdfgsdfgdfgdfg
PPTX
Supervised vs unsupervised machine learning algorithms
PDF
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
PPTX
Acceptance and paychological effects of mandatory extra coach I classes.pptx
PDF
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
PPTX
Introduction-to-Cloud-ComputingFinal.pptx
PDF
Foundation of Data Science unit number two notes
PDF
.pdf is not working space design for the following data for the following dat...
PPTX
oil_refinery_comprehensive_20250804084928 (1).pptx
PPT
Reliability_Chapter_ presentation 1221.5784
STUDY DESIGN details- Lt Col Maksud (21).pptx
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
Business Ppt On Nestle.pptx huunnnhhgfvu
Database Infoormation System (DBIS).pptx
Introduction to Knowledge Engineering Part 1
advance b rammar.pptxfdgdfgdfsgdfgsdgfdfgdfgsdfgdfgdfg
Supervised vs unsupervised machine learning algorithms
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
Acceptance and paychological effects of mandatory extra coach I classes.pptx
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
Introduction-to-Cloud-ComputingFinal.pptx
Foundation of Data Science unit number two notes
.pdf is not working space design for the following data for the following dat...
oil_refinery_comprehensive_20250804084928 (1).pptx
Reliability_Chapter_ presentation 1221.5784

Classification modelling review

  • 1. IDS 572 DATA MINING FOR BUSINESS ASSIGNMENT 3 - FUNDRAISING 2 Kaushik Kompella Jaideep Adusumelli Sushaanth Srirangapathi
  • 2. 1. Modeling Partitioning - Partition the dataset into 60% training and 40% validation (set the seed to 12345). In the last assignment, you developed decision tree, logistic regression, naïve Bayes, random forest and boosted tree models. Now, develop support vector machine models for classification. Examine different parameter values, as you see suitable. Report on what you experimented with and what worked best. Howdo you select the subset ofvariables to include in the model? What methods do you use to select variables that you feelshouldbe includedinthe model(s)? Doesvariable selectionmake adifference? Provide a comparative evaluation ofperformance ofyour bestmodels from all techniques (including those from part 1, i.e. assignment 2) (Be sure NOT to include “TARGET−D” in your analysis.) Data Selection: Missing Values In order to focusour effortson the set of variablesthatwouldprovide the mostmeaningful information we decided to start by looking for variables with excessive amounts of missing values. We sawan advantage inchoosing to eliminatevariableswith>=85% of all valuesmissing.Thishelpedus to eliminate anumberof variables.We chose thisroute because we knew that includingthese variables withgreaternumbersof missingvalueswouldnotbe useful inpredictingdonorresponse,because there isverylittle informationcontainedwithinthose variables,wellbelow eventhe low responserate of 5.1%. Certain notable variables were: Attributes Eliminated Number of Missing values (out of 9999 values) MAILCODE 9857 PVASTATE 9847 RECP3 9728 RECPGVG 9983 CHILD03 9884 CHILD12 9840 CHILD18 9746 SOLP3 9976 MAJOR 9969 HOMEE 9902 PHOTO 9504 KIDSTUFF 9822 CARDS 9869 PLATES 9933 RDATE_3 9947 RDATE_5 9997 RDATE_6 9896 RAMNT_3 9947 RAMNT_4 9954 RAMNT_5 9997 RAMNT_6 9896 RECSWEEP 9826
  • 3. These Variables did not even have enough number of values for the actual data which has only 5.1% response rate. Sothese variableseventhoughcontainveryimportantinformationcannotcontribute well towards creating a good model. Hence we eliminated all these variables. We did not eliminate those variables that were coded as X and null values. We transformed many of those attributes and will discuss those in the ‘transform’ section. Data Selection: Relevance Next, we chose to focus on only including variables which wouldhave relevance on predictingdonation probability.We didthisbyscanningthroughthe variabledescriptionsinthe datadictionarytoonlyretain those that were relevant. An example of a variable thatwe chose to eliminate wasDW2. Thisvariable correspondedtothe locale- specific percentage of single-unit detached structures. Eliminating these variable and other variables similar to this allowed us to lessen the total number of variables that we analyzed. In addition to those variables that we found to be irrelevant, we noticedthat there was a great deal of overlap between certain variables. For instance, the ‘Wealth1’ and ‘Wealth2’ variables had very similar descriptions in the data dictionary. However, we noticed that Wealth2 contained a more robust description of the variable – “Wealth rating uses median family income and population statistics from each area to index relative wealthwithineachstate.The segmentsare denoted0-9,with9beingthe highestincomegroupandzero being the lowest. Each rating has a different meaning within each state”. Whereas “Wealth1” simply contained the description: wealth rating. Since “Wealth1” has a similar description to “Wealth2” we know that there is a lot of information overlap, which allows us to justify dropping the less informative of the two: “Wealth1”. We also found some variables which had no relationship with “TARGET_B” but have a relationship with “TARGET_D”. Some variables of this kind are: Data Reduction: PCA We experimented with creating many different sets of PCAs but found that in many cases we were including many variable sets that had numerous missing values, or had summary variables that were alreadyexisting.Therefore,we onlycreated3PCAsthat were valuable PCA_CENSUS_X,PCA_EMP_Xand PCA_INTERESTS_X. PCA_CENSUS_X included census variables for neighborhoods PCA_EMP_X included Employment information of the people i.e. Whether they are army or military veterans, state government employees or federal government employees, etc. PCA_INTERESTS_X included the 18 ‘interest/hobby’ based variables.
  • 4. Data Reduction: Random Forests We ran random forests and plotted the variable importance graphs for all the variables for all the remaining attributes after the above processes were carried out. And then we selected all those variables which had the “Decrease in Gini Index” more than or equal to 0.0064 which gave us 68 variables. These included PCAs and mostly summary variables. Most of the variables were not at all included in the split and we were able to eliminate them to improve the overall performance of our model. Data Selection: Missing Value Replacement We utilized two rapid miner operators in our process to ensure that those variables that had missing valuesbelowourcutoff were transformed.The twooperatorswe usedwere the Replace MissingValues and the Map operators. Replace MissingValuesallowedustoimpute the valueswhennecessarywhile the Mapoperatorallowed us to recode those variables that had encoded characters such as “MDMAUD”. This allowed us to code the extraneous characters as null values giving us a more accurate picture of which variables actually contained many more missing values than originally depicted in our first glance.
  • 5. Data Selection: The Variables Included CLUSTER DOMAINTYPE WEALTH1 INCOME PEPSTRFL MAXRAMNT DOMAINSES HIT VIETVETS WWIIVETS IC1 CARDPROM MAXADATE CARDPM12 RAMNTALL NGIFTALL CARDGIFT MINRAMNT MINRDATE MAXRDATE LASTGIFT FISTDATE TIMELAG AVGGIFT CONTROLN TOTALDAYS RECENCYFREQ LOWINCOME MEDINCOME HIGHINCOME PCA_NEIGH_PER_1-5 PCA_NEIGH_MMM_1-4 PCA_EMP_1-4
  • 6. Decision Trees We chose the J48 version of the decision tree as we were able to obtain the most accurate results. The parameter that allowed us to build the best model was .4 as a threshold. We selected Laplace smoothing parameter for the J48 and chose an overall confidence of .35. We created two main PCA’s which we removed from the model initially. The following confusion matrix is our J48 without these three PCAs: Accuracy of Training Data set for the best tree: 94.97 Prediction / Actual Actual Good Actual Bad Class precision Predicted Good 3796 159 95.98 Predicted Bad 143 1901 93.00 Class recall 96.37 92.28 Accuracy of Test Data set for the best tree: 55.67 Prediction / Actual Actual Good Actual Bad Class precision Predicted Good 1683 895 65.28 Predicted Bad 878 544 38.26 Class recall 65.72 37.80 Our optimal J48model showed a55.67% accuracy onthe overall testdata.The classprecisionfortrue 1’s and predicted 1’s was 38.26% and overall class recall was 37.80%. These results were close to training data accuracy so it implies that we reached optimal model. Logistic Regression Once again,we ran the model withandwithoutthe PCAsthatwe generatedforthe J48 model.Inthis case we alsofoundthatthe inclusionof PCA’swashurting overallmodelperformance. Below isthe confusionmatrix associatedwithlogisticregression. Accuracy of Training Data set for the best tree: 54.49 Prediction / Actual Actual Good Actual Bad Class precision Predicted Good 1476 267 84.68 Predicted Bad 2463 1793 42.13 Class recall 37.47 87.04 Accuracy of Test Data set for the best tree: 48.08 Prediction / Actual Actual Good Actual Bad Class precision Predicted Good 814 330 71.15 Predicted Bad 1747 1109 38.83 Class recall 31.78 77.07
  • 7. We thencomparedthe trainingresultsfromourLogisticRegressiontothe validationdataandfoundour best LR model. The associated test data confusion matrix is seen above. We could maintainan overall model accuracyof 48.08%. The class precisionfromthe testdata was very similartothatof the training data.We obtained a38.83% precisionwithanoverallclassrecall of 77.07%. We used default parameters for the Logistic regression as these gave us the highest overall accuracy. Naïve Bayes After running J48 and Logistic regression we attempted to find better model using Naïve Bayes. We applied same subset of variables both with and without PCAsto find best Naïve Bayes model. Our best model again didnot contain any of the PCA’sbut as with the othermodels,containedall of our created variables described in the initial data preprocessing steps. Accuracy of Training Data set for the best tree: 63.68 Prediction / Actual Actual Good Actual Bad Class precision Predicted Good 2888 1128 71.99 Predicted Bad 1051 932 47.00 Class recall 73.32 45.24 Accuracy of Test Data set for the best tree: 59.35 Prediction / Actual Actual Good Actual Bad Class precision Predicted Good 1745 810 68.30 Predicted Bad 816 629 43.53 Class recall 68.14 43.71 The overall model accuracy is 59.35% with a class precision of 43.53% and a class recall of 43.71% We compared the accuracy of our Naïve Bayes model with the associated testing data. The resulting confusion matrix is as above. In this case we see a rise in overall model accuracy on the validation data set. Also, we noticed similar class precisionandclass recall betweenourtestingand trainingdata suggestingwe hadfound our best, and most stable Naïve Bayes model. Support Vector Machines Initially we tried to deploy SVMmodel on our data but were getting perfect over fit. We were using a kernel type radial basedfunctionondefaultsettings.Inaddition,we changedthe thresholdto.3, .5, and .65 but were still getting massive over fit on our training data.
  • 8. Below is the confusion matrix associated with the SVMtraining data: Accuracy of Training Data set for the best tree: 59.21 Prediction / Actual Actual Good Actual Bad Class precision Predicted Good 3064 1572 66.09 Predicted Bad 875 488 35.80 Class recall 77.79 23.69 Accuracy of Test Data set for the best tree: 59.55 Prediction / Actual Actual Good Actual Bad Class precision Predicted Good 2048 1105 64.95 Predicted Bad 513 334 39.43 Class recall 79.97 23.21 Afteranalyzingmany differentkernel typessuchas- polynomial,dot,andGaussiancombination.We also experimentedwiththe Cpenaltyparametersand determinedthatthe supportvectormachinewasn’tthe best suited algorithm to our data and did not generalize well. We could generate an overall accuracyof 59.21% onthismodel,butwithalowerclassprecisionof 35.8%. The class recall was also worse than all other models with a 23.69% recall on true 1’s/predicted 1’s. Our testdatahad an overall accuracy of 59.55% butstill withverylow classprecisionof 39.43% and class recall of 23.21%. KNN We usedthe followingparameterstogenerate ourbestKNN model: K = 150; Measure:Mixed,Mixed-Measure:MixedEucledian Thisincluded usingak-value of 150, withthresholdof .33 alongwithmixedmeasure of Euclideanasa distance metric. Belowisthe confusionmatrix of ourtrainingdatafor KNN: Accuracy of Training Data set for the best tree: 54.53 Prediction / Actual Actual Good Actual Bad Class precision Predicted Good 2004 793 71.65 Predicted Bad 1935 1267 39.57 Class recall 50.88 61.50
  • 9. Accuracy of Test Data set for the best tree: 50.48 Prediction / Actual Actual Good Actual Bad Class precision Predicted Good 1151 571 66.84 Predicted Bad 1410 868 38.10 Class recall 44.94 60.32 Using the parameters above we obtained an overall accuracy of 54.3% on the training data. The class precisionwas lowerthan that of other modelsat 39.57%. The class recall for predictedand true 1’s was 61.5% overall. Our testing confusionmatrix gave an accuracy of 50.48%, with a class precision of 38.1%, but an overall class recall of 60.32% on predicted 1’s and true 1’s. Random Forest Initially we usedthe same setof attributesthatwe deployedinourSVMmodel andwere onlyable to generate a~30% Test accuracy but witha ~80% Trainingdata accuracy. We allowedtoomanytreesto buildandas a resultthismodel wasoverfit. By limitingdepthof the trees,we could generate amuchhigheraccuracy ontest data.Below isour RandomForestconfusionmatrix forourtrainingdata: Accuracy of Training Data set for the best tree: 60.06 Prediction / Actual Actual Good Actual Bad Class precision Predicted Good 2101 558 79.01 Predicted Bad 1838 1502 44.97 Class recall 53.34 72.91 Accuracy of Test Data set for the best tree: 53.80 Prediction / Actual Actual Good Actual Bad Class precision Predicted Good 1212 499 70.84 Predicted Bad 1349 940 41.07 Class recall 47.33 65.32 We obtainedanaccuracy of 60.06% overall on trainingdata,butwitha classprecisionof 44.97% and a classrecall 72.91% forpredictedandtrue 1’s On testing data set we obtained a reasonably close accuracy of 53.8%. The class precision was 41.07% with a class recall close to that of the training data at 65.32%.
  • 10. Gradient Boosted Trees: Accuracy of Training Data set for the best tree: 72.03 Prediction / Actual Actual Good Actual Bad Class precision Predicted Good 2706 445 85.88 Predicted Bad 1233 1615 56.71 Class recall 68.70 78.40 Accuracy of Test Data set for the best tree: 55.62 Prediction / Actual Actual Good Actual Bad Class precision Predicted Good 1428 642 68.99 Predicted Bad 1133 797 41.30 Class recall 55.76 55.39 The parameters used were: Number of trees: 40, Maximal Depth: 5, Minimum Rows: 20, Minimum Split Improvement: 0.0025, Number of Bins: 25, Learning Rate: 0.1, Sample Rate: 1, Distribution: Auto From the above matrix we found that gradient boosted trees were the best model we could get with optimal accuracy and precision and high recall which would help us to maximize the profits.
  • 11. 2. Our overall goal is to identify which individuals to target for maximum donations (profit). We will try two approaches for this: (i) using the response model,togetherwith average donation and mailingcosts information,to identifythe most profitable individualstotarget (Q 2.1 below) (ii) developa secondmodel on TARGET_D, and combine this with the response model to identifythe most profitable individualstotarget (Q 2.2 below) 2.1 (a) What is the ‘best’ model for each method in Question 1 for maximizing revenue? Calculate the net profit for both the training and validation set based on the actual response rate (5.1%). We can calculate the net profit from given information - the expected donation, given that they are donors, is $13.00, and the total cost of each mailing is $0.68. Note: to calculate estimated net profit (on data with the ‘natural’ response rate of5.1%), we will need to “undo” the effects ofthe weighted sampling, and calculate the net profit that reflects the actual response distribution of 5.1% donors and 94.9% non-donors.) Afterchoosingthe bestmodelsfromthe above methods,we coulddeterminethatourbestmodel was the GradientBoosted Trees.Thisis notjust basedonthe accuracy itself butthe netprofitwe got. We neededtoundothe effectsof the weightedsampling whichwe didinthe followingway: Firstly, adjustthe weightedprofit andthe weightedcost. Weightedprofit:13-$0.68 = $12.32 WeightedCost= ($0.68) Afterthis,we usedthe followingformulatocalculate maximumprofit: (12.32*.051)/(3499/9999) = 1.795 Next,we usedthe followingformulatocalculate the adjustedcost: (($0.68)*0.9491)/(6500/9999) = .992 The liftwe achievedforthe logisticregressionwas 306.679 as max profit.
  • 12. (b) Summarize the performance of the ‘best’ model from each method, in terms of net profit from predicting donors in the validation dataset; at what cutoff is the best performance obtained? Draw profit curves:Draweachmodel’s netcumulative profitcurve forthe validation setontoasingle graph. Are there any models that dominate? Best Model: From your answers above, what do you think will be the “best” model to implement? (What criteria do you use to determine ‘best’?) Our bestmodel is GradientBoostedTrees.We usedmax liftcriteriabutalsoconsideredthe highestclass recall (55.39%) and class precision (41.30%). Our next best model (“Logistic Regression”) has a class precision of 38.83%, and a class recall of 77.07%, but a higher lift curve. It does not make sense from a businessstandpointtoevaluate ourbestmodelonthe criteriaof liftalone because thenthe J48wouldn’t generalize well to predict donors accurately. The performance of all models are as follows: Model Profit (Performance) J-48 105.5 $ Logistic 257.63 $ Naïve Bayes 319.58 $ Random Forests 349.10 $ KNN 159.34 $ SVM 90.63 $ The cutoff we usedin GradientBoostedTrees wasa confidence of: 0.496901. 0 50 100 150 200 250 300 1 29 57 85 113 141 169 197 225 253 281 309 337 365 393 421 449 477 505 533 561 589 617 645 673 Cummulative Benefit curveof Decision Tree
  • 13. The best model that was J-48 according to the lift chart. However, Gradient Boosted Trees had the best chance to generalize well on unseen data. -150 -100 -50 0 50 100 150 1 29 57 85 113 141 169 197 225 253 281 309 337 365 393 421 449 477 505 533 561 589 617 645 673 Cummulative benefit curvefor Gradient Boosted Trees -50 0 50 100 150 200 1 29 57 85 113 141 169 197 225 253 281 309 337 365 393 421 449 477 505 533 561 589 617 645 673 Cummulative Benefit Curvefor Naive Bayes -50 0 50 100 150 200 1 29 57 85 113 141 169 197 225 253 281 309 337 365 393 421 449 477 505 533 561 589 617 645 673 Chart Title
  • 14. 2.2. (a) We will next develop a model for the donated amount (TARGET_D). Note that TARGET_D has valuesonlyforthose individualswhodonors(thatis,TARGET_D valuesare definedonlyforcaseswhere TARGET_B is 1). What data will you use to developa model for TARGET_D? (Non-donors,obviously,do not have any donationamount -- shouldyou considerthese as $0.0 donation,or impute missingvalues here? Should non-donors be included for developing the model to predict donation amount? Also, should cases with rare very large donation amounts be excluded? [Hah! – leading questions] Develop a model for TARGET_D. What modeling method do you use (report on any one). Which variables do you use? What variable selection methods do you use? Report on performance. For the caseswithrare verylarge donationamounts,we will notexclude thesecasesaswe will lose out on predictingourmostvaluable donors.Andthesedonationsmightbe large because of asupportive reasonfromthe valuesof othervariables.Hence we didnotwanttolose ImportantInformation. We will use linearRegressionmodelforpredictedTARGET_Dwitha splitvalidationof 60-40 on the data set. Variable selectionmethods Afterapplyingthe datacleaningandmissingvalue treatmentasappliedondatasetforTARGET_B, we lookedatthe scatter plotsof the independentvariablesagainstTARGET_Dandeliminatedthose variablesthatdid notshowany significanttrendand/orrelationshipwithTARGET_D. Givenbelow are a fewvariablesthatwe have selectedforregressionmodellingforTarget-Dandtheirscatterplotswith Target D. Fig.1 RAMNT_18 vs TARGET_D
  • 15. Fig.2 RAMNT_8 vs TARGET_D Fig.3 LASTGIFT vs TARGET_D Fig.4 RAMNTALL vs TARGET_D
  • 16. Fig.5 HIT vs TARGET_D As can be seenfromthe plots,the above variableshadasignificantrelationshipwithTARGET_D,hence we selectedthese variablesforthe linearregressionmodellingfor the predictionof TARGET_D Variable. Givenbeloware the scatterplotsof variable withTARGET_D for the oneswhichwe eliminated. Fig.1 AGE vs TARGET_D
  • 17. Fig.2 ADI vs TARGET_D
Fig.3 DMA vs TARGET_D
Fig.4 CARDPROM vs TARGET_D
These variables did not show any particular trend with the TARGET_D variable and hence were eliminated.
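This screening step is easy to script. The sketch below is a minimal illustration: it assumes a cleaned export that still retains TARGET_B and TARGET_D (the Appendix-1 script drops TARGET_D, so the file name here is hypothetical) and restricts the data to donors before plotting.
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical cleaned file that still carries both target columns
df = pd.read_csv("cleaned_PVA_data_with_targets.csv")
donors = df[df["TARGET_B"] == 1]      # TARGET_D is defined only for donors

# Visually screen each candidate predictor against the donation amount
for var in ["RAMNT_18", "RAMNT_8", "LASTGIFT", "RAMNTALL", "HIT", "AGE"]:
    donors.plot.scatter(x=var, y="TARGET_D", alpha=0.4,
                        title=var + " vs TARGET_D")
    plt.show()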
  • 18. After the scatter plots, we narrowed the list of variables down from 481 to 261. We then used a correlation matrix to make the final selection of variables for modelling. The highest correlation (R) value of any independent variable with TARGET_D was 0.2, from LASTGIFT, followed by NGIFTALL, AVGGIFT, RAMNTALL and MINRAMNT. We chose all the variables having an R value of >= 0.05 with TARGET_D, and finalized a list of 15 independent variables to run the model. Here is the list of variables used in our best linear regression model, obtained through t-test feature selection:
HIT, WEALTH_1, NUMPROM, RAMNT_8, RAMNT_12, RAMNT_14, RAMNT_16, RAMNT_18, RAMNT_22, RAMNTALL, NGIFTALL, MINRAMNT, MAXRAMNT, LASTGIFT, AVGGIFT
Performance: our best model gave the following values for root mean squared error and squared correlation.
Root mean squared error: 9.949 +/- 0.000
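A minimal sketch of this correlation filter plus the regression fit, assuming the donors-only frame from the previous sketch; the 0.05 correlation cutoff and the 60-40 split follow the report, but the code is illustrative of the technique rather than our actual RapidMiner process.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# donors: cleaned, donors-only DataFrame with numeric predictors and TARGET_D
X = donors.drop(columns=["TARGET_B", "TARGET_D"])
r = X.corrwith(donors["TARGET_D"]).abs()
selected = r[r >= 0.05].index.tolist()    # keep predictors with |R| >= 0.05

X_tr, X_va, y_tr, y_va = train_test_split(
    X[selected], donors["TARGET_D"], train_size=0.6, random_state=12345)

model = LinearRegression().fit(X_tr, y_tr)
rmse = mean_squared_error(y_va, model.predict(X_va)) ** 0.5
print("Validation RMSE: %.3f" % rmse)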
  • 19. (b) How can you use the results from the response model together with results from the donation amount model to identify targets? (Hint: The response model estimates the probability of response p(y=1|x). The donation amount model estimates the conditional donation amount, E[amt | x, y=1]. The product of these gives ...?) How do you identify individuals to target when combining information from both models? [Sorted values of predicted donation amount? What threshold would you use? Or maybe you prefer to target all individuals with predicted amounts greater than $0.68 (why?), and/or ...]
Given the confidence probabilities of donors (TARGET_B = 1) from the classification model and the predicted donation amounts (TARGET_D) from the regression model, we multiply the two values to combine the two models. The product of the class-1 confidence from the classification model and the donation predicted by the linear regression model gives us the expected donation from each responder. The classification model only predicts which individuals are going to respond to our promotion; the regression model, on the other hand, gives us the probable donation amount from the responders identified by the classification model. Combining the two models therefore plays a very important role.
The modelling technique we used is Gradient Boosted Trees, which has an accuracy of approximately 50%; hence we selected all those individuals predicted to donate $2 or more. Since the accuracy is about 50%, we assume that of every 2 predicted responders one will actually respond, so we want that one individual to cover the promotional charges, which were downscaled to $0.9928 per mailing, i.e. $1.9856 for two individuals. Hence, in order to cover our expenses, we need individuals who would donate a minimum amount of ~$2.
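A sketch of this combination rule, assuming a frame of scored prospects with hypothetical columns p1 (the classifier's confidence for class 1) and pred_amount (the regression prediction); the column names are placeholders, not the exact RapidMiner output.
import pandas as pd

# prospects: one row per individual, scored by both models (assumed)
# Expected donation per prospect: p(y=1|x) * E[amt | x, y=1]
prospects["expected_donation"] = prospects["p1"] * prospects["pred_amount"]

# The rule argued above: among predicted responders, keep those whose
# predicted gift covers the ~$2 of promotional cost for two mailings.
targets = prospects[(prospects["p1"] >= 0.496901) &
                    (prospects["pred_amount"] >= 2.0)]
targets = targets.sort_values("expected_donation", ascending=False)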
  • 20. 5. Testing -- choose one model, either the one from 2.1 or 2.2 above, based on performance on the test data. The file FutureFundraising.xls contains the attributes for future mailing candidates. Using your "best" model from Step 2, which of these candidates do you predict as donors and non-donors? List them in descending order of probability of being a donor / prediction of donation amount. What cutoff do you use? Submit this file (xls format) with your best model's predictions (prob of being a donor).
For testing on the FutureFundraising.xls file, the best model we used is Gradient Boosted Trees. After applying the model, the predictions are as follows:
Number of donors: 10857
Number of non-donors: 9143
Cumulative profit: ~$24604
The cutoff used to predict donors/non-donors is a confidence of predicting 1s of 0.496901.
Note: The list of predictions using the best model is uploaded to Blackboard (Future_Data_Prediction.xls).
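The scoring and export step can be sketched as follows, where future is assumed to hold the identically cleaned FutureFundraising records and p1 the model's class-1 confidence (hypothetical column names, not the exact RapidMiner output).
import pandas as pd

CUTOFF = 0.496901                  # same confidence cutoff as on validation

# future: DataFrame of scored mailing candidates (assumed)
future["prediction"] = future["p1"].ge(CUTOFF).map({True: "donor",
                                                    False: "non-donor"})
ranked = future.sort_values("p1", ascending=False)   # descending P(donor)
print(ranked["prediction"].value_counts())           # donor / non-donor counts
ranked.to_excel("Future_Data_Prediction.xls", index=False)  # needs an xls writer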
  • 21. Appendix-1
Python code used for cleaning the data:
import numpy as np
import pandas as pd

# Columns dropped during data selection: IDs, dates, codes, and
# neighbourhood-level variables judged irrelevant or redundant.
drop_column = [
    "ODATEDW", "OSOURCE", "TCODE", "ZIP", "MAILCODE", "PVASTATE", "DOB",
    "NOEXCH", "CLUSTER", "AGEFLAG", "NUMCHLD", "HIT", "DATASRCE",
    "MALEMILI", "MALEVET", "VIETVETS", "WWIIVETS", "LOCALGOV", "STATEGOV",
    "FEDGOV", "GEOCODE", "HHP1", "HHP2",
    "DW1", "DW2", "DW3", "DW4", "DW5", "DW6", "DW7", "DW8", "DW9",
    "HV1", "HV2", "HV3", "HV4", "HU1", "HU2", "HU3", "HU4", "HU5",
    "HHD1", "HHD2", "HHD3", "HHD4", "HHD5", "HHD6", "HHD7", "HHD8",
    "HHD9", "HHD10", "HHD11", "HHD12",
    "HUR1", "HUR2", "RHP1", "RHP2", "RHP3", "RHP4",
    "HUPA1", "HUPA2", "HUPA3", "HUPA4", "HUPA5", "HUPA6", "HUPA7",
    "RP1", "RP2", "RP3", "RP4", "MSA", "ADI", "DMA", "MC1", "MC2", "MC3",
    "TPE1", "TPE2", "TPE3", "TPE4", "TPE5", "TPE6", "TPE7", "TPE8", "TPE9",
    "PEC1", "PEC2", "TPE10", "TPE11", "TPE12", "TPE13",
    "ANC1", "ANC2", "ANC3", "ANC4", "ANC5", "ANC6", "ANC7", "ANC8", "ANC9",
    "ANC10", "ANC11", "ANC12", "ANC13", "ANC14", "ANC15",
    "POBC1", "POBC2", "LSC1", "LSC2", "LSC3", "LSC4", "VOC1", "VOC2", "VOC3",
    "ADATE_2", "ADATE_3", "ADATE_4", "ADATE_5", "ADATE_6", "ADATE_7",
    "ADATE_8", "ADATE_9", "ADATE_10", "ADATE_11", "ADATE_12", "ADATE_13",
    "ADATE_14", "ADATE_15", "ADATE_16", "ADATE_17", "ADATE_18", "ADATE_19",
    "ADATE_20", "ADATE_21", "ADATE_22", "ADATE_23", "ADATE_24", "MAXADATE",
    "RDATE_3", "RDATE_4", "RDATE_5", "RDATE_6", "RDATE_7", "RDATE_8",
    "RDATE_9", "RDATE_10", "RDATE_11", "RDATE_12", "RDATE_13", "RDATE_14",
    "RDATE_15", "RDATE_16", "RDATE_17", "RDATE_18", "RDATE_19", "RDATE_20",
    "RDATE_21", "RDATE_22", "RDATE_23", "RDATE_24",
    "MINRDATE", "MAXRDATE", "LASTDATE", "FISTDATE", "NEXTDATE", "CONTROLN",
    "TARGET_D", "HPHONE_D", "RFA_2R", "RFA_2F", "RFA_2A",
    "MDMAUD_R", "MDMAUD_F", "MDMAUD_A", "CLUSTER2", "GEOCODE2", "MDMAUD"]

# Blank cells are read in as NaN
df = pd.read_csv(r"C:\Users\tyrion\Documents\IDS_572_notes\assign2\pvaBal35Trg.csv",
                 sep=',', na_values=[' '], low_memory=False)
df.drop(drop_column, axis=1, inplace=True)

list_string = []
non_list_string = []

# Fill missing values in numeric columns with -1; collect string columns
# that still contain NaNs for the recoding steps below.
for c in df.columns:
    if pd.api.types.is_numeric_dtype(df[c]):
        df[c].fillna(-1, inplace=True)
    elif df[c].isnull().values.any():
        list_string.append(c)
    else:
        non_list_string.append(c)

# Recode flag-style string columns depending on which marker each contains:
# "X" -> 1 / NaN -> 0;  "M"/"F" -> 1/0 with NaN -> -1;  "Y" -> 1 / NaN -> -1.
for val in list_string:
    values = df[val].dropna().unique().tolist()
    if "X" in values:
        df[val] = df[val].replace({'X': 1}, regex=False)
        df[val].fillna(0, inplace=True)
    elif "M" in values:
        df[val].fillna(-1, inplace=True)
        df[val] = df[val].replace({'M': 1, 'F': 0}, regex=False)
    elif "Y" in values:
        df[val] = df[val].replace({'Y': 1}, regex=False)
        df[val].fillna(-1, inplace=True)
    if val == "HOMEOWNR":
        df[val].fillna(0, inplace=True)
        df[val] = df[val].replace({'H': 1}, regex=False)

# Split RFA promotion codes (e.g. "A2F") into Recency / Frequency /
# Category components and map the code letters to ordinal numbers.
def split_rfa(frame, val):
    frame[val].fillna("Z5Z", inplace=True)
    r, f, c = val + "_R", val + "_F", val + "_C"
    frame[r] = frame[val].str.extract(r'([FNALISZ])', expand=True)
    frame[f] = frame[val].str.extract(r'(\d)', expand=True)
    frame[c] = frame[val].str.extract(r'[a-zA-Z]\d([a-zA-Z])', expand=True)
    frame[r] = frame[r].replace(
        {'F': 0, 'N': 1, 'A': 2, 'L': 3, 'I': 4, 'S': 5, 'Z': 6}, regex=False)
    frame[c] = frame[c].replace(
        {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6, 'Z': 7},
        regex=False)
    del frame[val]
    return [r, f, c]

new_attri = []
for val in list_string:
    if val.startswith("RFA"):
        new_attri.extend(split_rfa(df, val))
# RFA_2 has no missing values, so it never enters list_string and is
# handled separately.
new_attri.extend(split_rfa(df, "RFA_2"))

# Split DOMAIN (e.g. "C2") into urbanicity level and socio-economic status.
val = "DOMAIN"
df[val].fillna("Z5", inplace=True)
df["DOMAIN_urban_level"] = df[val].str.extract(r'([UCSTRZ])', expand=True)
df["DOMAIN_economic_status"] = df[val].str.extract(r'(\d)', expand=True)
df["DOMAIN_urban_level"] = df["DOMAIN_urban_level"].replace(
    {'U': 0, 'C': 1, 'S': 2, 'T': 3, 'R': 4, 'Z': 5}, regex=False)
del df[val]

# Exporting the dataframe to csv
df.to_csv('cleaned_PVA_data.csv')

# NORMALIZE DATA
print(np.count_nonzero(df.columns.values))

# Candidate variables for scaling / PCA
pca_variables = [
    "CHILD03", "CHILD07", "CHILD12", "CHILD18", "MBCRAFT", "MBGARDEN",
    "MBBOOKS", "MBCOLECT", "MAGFAML", "MAGFEM", "MAGMALE", "PUBGARDN",
    "PUBCULIN", "PUBHLTH", "PUBDOITY", "PUBNEWFN", "PUBPHOTO", "PUBOPP",
    "COLLECT1", "VETERANS", "BIBLE", "CATLG", "HOMEE", "PETS", "CDPLAY",
    "STEREO", "PCOWNERS", "PHOTO", "CRAFTS", "FISHER", "GARDENIN", "BOATS",
    "WALKER", "KIDSTUFF", "CARDS", "PLATES", "LIFESRC", "PEPSTRFL",
    "POP901", "POP902", "POP903",
    "POP90C1", "POP90C2", "POP90C3", "POP90C4", "POP90C5",
    "ETH1", "ETH2", "ETH3", "ETH4", "ETH5", "ETH6", "ETH7", "ETH8", "ETH9",
    "ETH10", "ETH11", "ETH12", "ETH13", "ETH14", "ETH15", "ETH16",
    "AGE901", "AGE902", "AGE903", "AGE904", "AGE905", "AGE906", "AGE907",
    "CHIL1", "CHIL2", "CHIL3",
    "AGEC1", "AGEC2", "AGEC4", "AGEC5", "AGEC6", "AGEC7",
    "CHILC1", "CHILC2", "CHILC3", "CHILC4", "CHILC5",
    "HHAGE1", "HHAGE2", "HHAGE3",
    "HHN1", "HHN2", "HHN3", "HHN4", "HHN5", "HHN6",
    "MARR1", "MARR2", "MARR3", "MARR4",
    "ETHC1", "ETHC2", "ETHC3", "ETHC4", "ETHC5", "ETHC6",
    "HVP1", "HVP2", "HVP3", "HVP4", "HVP5", "HVP6",
    "IC1", "IC2", "IC3", "IC4", "IC5", "IC6", "IC7", "IC8", "IC9", "IC10",
    "IC11", "IC12", "IC13", "IC14", "IC15", "IC16", "IC17", "IC18", "IC19",
    "IC20", "IC21", "IC22", "IC23",
    "HHAS1", "HHAS2", "HHAS3", "HHAS4",
    "LFC1", "LFC2", "LFC3", "LFC4", "LFC5", "LFC6", "LFC7", "LFC8", "LFC9",
    "LFC10",
    "OCC1", "OCC2", "OCC3", "OCC4", "OCC5", "OCC6", "OCC7", "OCC8", "OCC9",
    "OCC10", "OCC11", "OCC12", "OCC13",
    "EIC1", "EIC2", "EIC3", "EIC4", "EIC5", "EIC6", "EIC7", "EIC8", "EIC9",
    "EIC10", "EIC11", "EIC12", "EIC13", "EIC14", "EIC15", "EIC16",
    "OEDC1", "OEDC2", "OEDC3", "OEDC4", "OEDC5", "OEDC6", "OEDC7",
    "EC1", "EC2", "EC3", "EC4", "EC5", "EC6", "EC7", "EC8",
    "SEC1", "SEC2", "SEC3", "SEC4",
    "AFC1", "AFC2", "AFC3", "AFC4", "AFC5", "AFC6",
    "VC1", "VC2", "VC3", "VC4",
    "HC1", "HC2", "HC3", "HC4", "HC5", "HC6", "HC7", "HC8", "HC9", "HC10",
    "HC11", "HC12", "HC13", "HC14", "HC15", "HC16", "HC17", "HC18", "HC19",
    "HC20", "HC21",
    "MHUC1", "MHUC2", "AC1", "AC2",
    "RAMNT_3", "RAMNT_4", "RAMNT_5", "RAMNT_6", "RAMNT_7", "RAMNT_8",
    "RAMNT_9", "RAMNT_10", "RAMNT_11", "RAMNT_12", "RAMNT_13", "RAMNT_14",
    "RAMNT_15", "RAMNT_16", "RAMNT_17", "RAMNT_18", "RAMNT_19", "RAMNT_20",
    "RAMNT_21", "RAMNT_22", "RAMNT_23", "RAMNT_24",
    "RAMNTALL", "NGIFTALL", "CARDGIFT", "MINRAMNT", "MAXRAMNT", "TIMELAG",
    "AVGGIFT"] + new_attri
print(len(pca_variables))

from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler

# std_scale = preprocessing.StandardScaler().fit(df[pca_variables])
# df_std = std_scale.transform(df[pca_variables])
# minmax_scale = preprocessing.MinMaxScaler().fit(df[pca_variables])
# df_minmax = minmax_scale.transform(df[pca_variables])

# Exploratory PCA scaffolding, left unused:
"""
X_std = StandardScaler().fit_transform(df[pca_variables])
mean_vec = np.mean(X_std, axis=0)
cov_mat = (X_std - mean_vec).T.dot(X_std - mean_vec) / (X_std.shape[0] - 1)
print('Covariance matrix \n%s' % cov_mat)
cov_mat = np.cov(X_std.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
print('Eigenvectors \n%s' % eig_vecs)
print('\nEigenvalues \n%s' % eig_vals)
"""
  • 24. Appendix-2
Process charts:
Fig.1 Classification modelling process in RapidMiner.
Fig.2 Regression modelling process in RapidMiner.