1. Categorical Data Using the Chi-Squared Test
2. Pearson's Correlation Coefficient for Numeric Data
3. Principal Component Analysis for Numeric Data
4. Feature Importance with Random Forests for Both Categorical and Numeric Data

Let's get started!
Data and Imports
For our demonstration today, we will use the Bank Marketing UCI dataset, which one can find
on Kaggle. This dataset contains information about Bank customers in a marketing campaign,
and it contains a target
variable that one can utilize in a classification model. The dataset is released into the public domain under the CC0 1.0 Public Domain dedication and is free to use.
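The imports aren't shown here, so below is a minimal set covering the code used in the rest of the walkthrough; the read_csv call is a placeholder, assuming the Kaggle CSV has been downloaded locally.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# load the Bank Marketing dataset (file name/path is an assumption)
df = pd.read_csv('bank.csv')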
Feature Selection for Categorical Values
According to Wikipedia:
The Chi-Squared test is a statistical test applied to categorical data to evaluate how likely it is
that any observed difference between the sets arose by chance.
You apply the Chi-Squared test when both your feature data and your target data are categorical, e.g., in classification problems.
Note: While this dataset contains a mix of categorical and numeric values, we'll isolate the categorical values to demonstrate how you would apply the Chi-Squared test. A better method for this dataset, using Feature Importances to select features across both categorical and numeric types, is described below.
We'll start by selecting only the columns that are categorical, or of type object in Pandas. Pandas stores text as objects, so you should verify that these columns actually hold categorical values before simply relying on the object type.
# get categorical data
cat_data = df.select_dtypes(include=['object'])
We can then isolate the features and the target values. The target variable, y, is the last column in the DataFrame, so we can use Python's slicing to separate them into X and y.
X = cat_data.iloc[:, :-1].values
y = cat_data.iloc[:, -1].values
Next, we have two helper functions. These functions use the OrdinalEncoder for the X data and the LabelEncoder for the y data. As the name implies, the OrdinalEncoder converts each categorical value into an integer.
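The bodies of these two functions aren't shown here; a minimal sketch of what they likely look like, assuming the names prepare_inputs and prepare_targets (prepare_targets is called below, and prepare_inputs is inferred from the X_train_enc and X_test_enc variables used later):

# encode the categorical feature columns as integers
def prepare_inputs(X_train, X_test):
    oe = OrdinalEncoder()
    oe.fit(X_train)
    X_train_enc = oe.transform(X_train)
    X_test_enc = oe.transform(X_test)
    return X_train_enc, X_test_enc

# encode the target labels as integers
def prepare_targets(y_train, y_test):
    le = LabelEncoder()
    le.fit(y_train)
    y_train_enc = le.transform(y_train)
    y_test_enc = le.transform(y_test)
    return y_train_enc, y_test_enc

# split into train and test sets (the 0.33 test size is an assumption based on the shapes printed later)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)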
# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
Next is a function that will help us select the best features, using the Chi-Squared test with SelectKBest. We can start by setting the argument k='all', which runs the test across all features; later we can call it with a specific number of features.
def select_features(X_train, y_train, X_test, k_value='all'):
    fs = SelectKBest(score_func=chi2, k=k_value)
    fs.fit(X_train, y_train)
    X_train_fs = fs.transform(X_train)
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs, fs
We can start by printing off the scores for each feature.
# feature selection
X_train_fs, X_test_fs, fs = select_features(X_train_enc, y_train_enc, X_test_enc)

# what are the scores for the features
names = []
values = []
for i in range(len(fs.scores_)):
    print('Feature %d: %f' % (i, fs.scores_[i]))
    names.append(cat_data.columns[i])
    values.append(fs.scores_[i])
chi_list = zip(names, values)

# plot the scores
plt.figure(figsize=(10, 4))
sns.barplot(x=names, y=values)
plt.xticks(rotation=90)
plt.show()
Here, we see that contact has the largest score while marital, default, and month have the
lowest. Overall it looks like there are about 5 features that are worth considering. We'll use the
SelectKBest method to select the top 5 features.
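The call with a specific k isn't shown here; re-running the helper with k_value=5 (matching the five features we decided to keep) would look like this:

# re-run feature selection, keeping only the top 5 scoring features
X_train_fs, X_test_fs, fs = select_features(X_train_enc, y_train_enc, X_test_enc, k_value=5)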
fs.get_feature_names_out()
array(['x0', 'x4', 'x5', 'x6', 'x8'], dtype=object)
And finally, we can print the shape of the X_train_fs and X_test_fs data and see that the second
dimension is 5 for the selected features.
print(X_train_fs.shape)
print(X_test_fs.shape)
(3029, 5)
(1492, 5)
Feature Selection for Numeric Values
When dealing with pure numeric data, there are two methods that I prefer to use. The first is
Pearson's Correlation Coefficient and the second is Principal Component Analysis or PCA.
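The detailed walkthrough for these two methods isn't shown here, so below is a minimal sketch of the usual approach: compute the pairwise Pearson correlations on the numeric columns and plot them as a heatmap, then standardize the data and check how much variance each principal component explains. The column selection, figure size, and styling are assumptions rather than the author's exact code.

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# isolate the numeric columns
num_data = df.select_dtypes(include=['number'])

# Pearson's Correlation Coefficient: pairwise correlations plotted as a heatmap
corr = num_data.corr(method='pearson')
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.yticks(rotation=45)
plt.show()

# PCA: standardize first, then inspect how much variance each component explains
X_scaled = StandardScaler().fit_transform(num_data)
pca = PCA()
pca.fit(X_scaled)
print(pca.explained_variance_ratio_)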
Pearson Correlation of Features (heatmap figure)
Feature Importance with Random Forests
Random Forests make it easy to rank features by their relative contribution to the overall performance of a model. This method also works with other tree ensembles, such as Extra Trees and Gradient Boosting, for classification and regression.
A bonus of this method is that it fits well into the overall flow of building a model. Let's walk through how it works, starting by getting all the columns from our DataFrame. Note that we can utilize both categorical and numeric data in this case.
print(df.columns)
Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays', 'previous',
'poutcome', 'y'], dtype='object')
When reading the documentation for this dataset, you'll notice that the duration column is something we shouldn't use to train our model, so we'll manually remove it from our list of columns.
Duration: last contact duration, in seconds (numeric). Important note: this attribute highly
affects the output target (e.g., if duration=0, then y='no'). Yet, the duration is not known before a
call is performed. Also, after the end of the call y is obviously known. Thus, this input should only
be included for benchmark purposes and should be discarded if the intention is to have a
realistic predictive model.
Additionally, we can remove any other features that might not be useful if we'd like. For our
example, we'll keep all of them aside from duration.
# Remove columns from the list that are not relevant.
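The code that builds the feature list (targets) and the column_trans preprocessor referenced in the pipeline below isn't shown here; a minimal sketch, assuming an OrdinalEncoder so that each original column keeps a single importance value (the exact preprocessing the author used may differ):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# keep every column except the target and the leaky duration field
targets = [col for col in df.columns if col not in ['duration', 'y']]

# encode categorical columns as integers and pass numeric columns through unchanged
cat_cols = df[targets].select_dtypes(include=['object']).columns.tolist()
column_trans = ColumnTransformer(
    [('encode', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), cat_cols)],
    remainder='passthrough')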
# Create a random forest classifier for feature importance
clf = RandomForestClassifier(random_state=42, n_jobs=6, class_weight='balanced')

pipeline = Pipeline([('prep', column_trans),
                     ('clf', clf)])
Next, we'll split our data into training and test sets and fit our model to the training data.
# Split the data into 30% test and 70% training
X_train, X_test, y_train, y_test = train_test_split(df[targets], df['y'],
                                                    test_size=0.3, random_state=0)
pipeline.fit(X_train, y_train)
We can inspect the feature_importances_ attribute of the classifier to see the output. Note how you reference the classifier in the pipeline by its name, 'clf', similar to accessing a dictionary in Python.

pipeline['clf'].feature_importances_
array([0.12097191, 0.1551929 , 0.10382712, 0.04618367, 0.04876248,
       0.02484967, 0.11530121, 0.15703306, 0.10358275, 0.04916597,
       0.05092775, 0.02420151])
Next, let's display these sorted by the greatest importance and their cumulative importance.
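The code that produces this view isn't shown here; a minimal sketch, assuming the per-column importances line up with the targets feature list (the feat_imp name is illustrative):

import pandas as pd

# rank features by importance and track the running (cumulative) total
feat_imp = pd.DataFrame({'feature': targets,
                         'importance': pipeline['clf'].feature_importances_})
feat_imp = feat_imp.sort_values('importance', ascending=False)
feat_imp['cumulative'] = feat_imp['importance'].cumsum()
print(feat_imp)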
    feature     importance  cumulative
6   housing     0.078803    0.623837
3   education   0.072885    0.696722
4   default     0.056480    0.753202
12  pdays       0.048966    0.802168
8   contact     0.043289    0.845457
7   loan        0.037978    0.883436
14  poutcome    0.034298    0.917733
10  month       0.028382    0.946116
5   balance     0.028184    0.974300
11  campaign    0.021657    0.995957
9   day         0.004043    1.000000
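The selection loop referenced in the next paragraph isn't shown here; a minimal sketch, assuming a feature is kept when its importance clears a fixed threshold (the 0.05 cutoff is illustrative):

# keep only the features whose importance clears the threshold
included_feats = []
for feature, importance in zip(targets, pipeline['clf'].feature_importances_):
    if importance > 0.05:
        included_feats.append(feature)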
Finally, based on that loop, let's print out the features we've selected overall. With this analysis we removed about 50% of the features from our model, and we can see which ones have the highest impact!
print('Most Important Features:')
print(included_feats)
print('Number of Included Features =', len(included_feats))
Most Important Features:
['age', 'job', 'marital', 'education', 'default', 'housing', 'previous']
Number of Included Features = 7
Conclusion
Thank you for reading! You can find all the code for this article on GitHub.