1. Categorical Data Using the Chi-Squared Test
2. Pearson's Correlation Coefficient for Numeric Data
3. Principal Component Analysis for Numeric Data
4. Feature Importance with Random Forests for Both Categorical and Numeric Data

Let's get started!
Data and Imports
For our demonstration today, we will use the Bank Marketing UCI dataset, which one can find
on Kaggle. This dataset contains information about Bank customers in a marketing campaign,
and it contains a target
variable that one can utilize in a classification model. The dataset is released into the public domain under the CC0 1.0 Public Domain dedication and is free to use.
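The imports aren't shown here, so below is a minimal set covering the code used in the rest of the walkthrough; the read_csv call is a placeholder, assuming the Kaggle CSV has been downloaded locally.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# load the Bank Marketing dataset (file name/path is an assumption)
df = pd.read_csv('bank.csv')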
Feature Selection for Categorical Values
According to Wikipedia:
The Chi-Squared test is a statistical test applied to categorical data to evaluate how likely it is
that any observed difference between the sets arose by chance.
You apply the Chi-Squared test when both your feature data and your target data are categorical, e.g., in classification problems.
Note: While this dataset contains a mix of categorical and numeric values, we'll isolate the categorical values to demonstrate how you would apply the Chi-Squared test. A better method for this dataset, using Feature Importances to select features across both categorical and numeric types, is described below.
We'll start by selecting only the columns that are categorical, or of type object in Pandas. Pandas stores text as objects, so you should verify that these columns actually hold categorical values before simply relying on the object type.
# get categorical data
cat_data = df.select_dtypes(include=['object'])
We can then isolate the features and the target values. The target variable, y, is the last column in the DataFrame, so we can use Python's slicing to separate them into X and y.
X = cat_data.iloc[:, :-1].values
y = cat_data.iloc[:, -1].values
Next, we have two helper functions. These functions use the OrdinalEncoder for the X data and the LabelEncoder for the y data. As the name implies, the OrdinalEncoder converts each categorical value into an integer.
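The bodies of these two functions aren't shown here; a minimal sketch of what they likely look like, assuming the names prepare_inputs and prepare_targets (prepare_targets is called below, and prepare_inputs is inferred from the X_train_enc and X_test_enc variables used later):

# encode the categorical feature columns as integers
def prepare_inputs(X_train, X_test):
    oe = OrdinalEncoder()
    oe.fit(X_train)
    X_train_enc = oe.transform(X_train)
    X_test_enc = oe.transform(X_test)
    return X_train_enc, X_test_enc

# encode the target labels as integers
def prepare_targets(y_train, y_test):
    le = LabelEncoder()
    le.fit(y_train)
    y_train_enc = le.transform(y_train)
    y_test_enc = le.transform(y_test)
    return y_train_enc, y_test_enc

# split into train and test sets (the 0.33 test size is an assumption based on the shapes printed later)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)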
# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)
Next is a function that will help us select the best features, using the Chi-Squared test with SelectKBest. We can start by setting the argument k='all', which runs the test across all features; later we can call it with a specific number of features.
def select_features(X_train, y_train, X_test, k_value='all'):
    fs = SelectKBest(score_func=chi2, k=k_value)
    fs.fit(X_train, y_train)
    X_train_fs = fs.transform(X_train)
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs, fs
We can start by printing off the scores for each feature.
# feature selection
X_train_fs, X_test_fs, fs = select_features(X_train_enc, y_train_enc, X_test_enc)

# what are the scores for the features
names = []
values = []
for i in range(len(fs.scores_)):
    print('Feature %d: %f' % (i, fs.scores_[i]))
    names.append(cat_data.columns[i])
    values.append(fs.scores_[i])
chi_list = zip(names, values)

# plot the scores
plt.figure(figsize=(10, 4))
sns.barplot(x=names, y=values)
plt.xticks(rotation=90)
plt.show()
Here, we see that contact has the largest score while marital, default, and month have the
lowest. Overall it looks like there are about 5 features that are worth considering. We'll use the
SelectKBest method to select the top 5 features.
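The call with a specific k isn't shown here; re-running the helper with k_value=5 (matching the five features we decided to keep) would look like this:

# re-run feature selection, keeping only the top 5 scoring features
X_train_fs, X_test_fs, fs = select_features(X_train_enc, y_train_enc, X_test_enc, k_value=5)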
fs.get_feature_names_out()
array(['x0', 'x4', 'x5', 'x6', 'x8'], dtype=object)
And finally, we can print the shape of the X_train_fs and X_test_fs data and see that the second
dimension is 5 for the selected features.
print(X_train_fs.shape)
print(X_test_fs.shape)
(3029, 5)
(1492, 5)
Feature Selection for Numeric Values
When dealing with pure numeric data, there are two methods that I prefer to use. The first is
Pearson's Correlation Coefficient and the second is Principal Component Analysis or PCA.
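The detailed walkthrough for these two methods isn't shown here, so below is a minimal sketch of the usual approach: compute the pairwise Pearson correlations on the numeric columns and plot them as a heatmap, then standardize the data and check how much variance each principal component explains. The column selection, figure size, and styling are assumptions rather than the author's exact code.

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# isolate the numeric columns
num_data = df.select_dtypes(include=['number'])

# Pearson's Correlation Coefficient: pairwise correlations plotted as a heatmap
corr = num_data.corr(method='pearson')
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.yticks(rotation=45)
plt.show()

# PCA: standardize first, then inspect how much variance each component explains
X_scaled = StandardScaler().fit_transform(num_data)
pca = PCA()
pca.fit(X_scaled)
print(pca.explained_variance_ratio_)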
Pearson Correlation of Features (heatmap figure)
Feature Importance with Random Forests
Random Forests make it easy to rank features by their relative contribution to the overall performance of a model. This method also works with other tree ensembles, such as Extra Trees and Gradient Boosting, for classification and regression.
A bonus of this method is that it fits well into the overall flow of building a model. Let's walk through how it works, starting by getting all the columns from our DataFrame. Note that we can utilize both categorical and numeric data in this case.
print(df.columns)
Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays', 'previous',
'poutcome', 'y'], dtype='object')
When reading the documentation for this dataset, you'll notice that the duration column is something we shouldn't use to train our model, so we'll manually remove it from our list of columns.
Duration: last contact duration, in seconds (numeric). Important note: this attribute highly
affects the output target (e.g., if duration=0, then y='no'). Yet, the duration is not known before a
call is performed. Also, after the end of the call y is obviously known. Thus, this input should only
be included for benchmark purposes and should be discarded if the intention is to have a
realistic predictive model.
Additionally, we can remove any other features that might not be useful if we'd like. For our
example, we'll keep all of them aside from duration.
# Remove columns from the list that are not relevant.
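The code that builds the feature list (targets) and the column_trans preprocessor referenced in the pipeline below isn't shown here; a minimal sketch, assuming an OrdinalEncoder so that each original column keeps a single importance value (the exact preprocessing the author used may differ):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# keep every column except the target and the leaky duration field
targets = [col for col in df.columns if col not in ['duration', 'y']]

# encode categorical columns as integers and pass numeric columns through unchanged
cat_cols = df[targets].select_dtypes(include=['object']).columns.tolist()
column_trans = ColumnTransformer(
    [('encode', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), cat_cols)],
    remainder='passthrough')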
# Create a random forest classifier for feature importance
clf = RandomForestClassifier(random_state=42, n_jobs=6, class_weight='balanced')

pipeline = Pipeline([('prep', column_trans),
                     ('clf', clf)])
Next, we'll split our data into training and test sets and fit our model to the training data.
# Split the data into 30% test and 70% training
X_train, X_test, y_train, y_test = train_test_split(df[targets], df['y'],
                                                    test_size=0.3, random_state=0)
pipeline.fit(X_train, y_train)
We can inspect the feature_importances_ attribute of the classifier to see the output. Note how you reference the classifier in the pipeline by its name, 'clf', similar to accessing a dictionary in Python.

pipeline['clf'].feature_importances_
array([0.12097191, 0.1551929 , 0.10382712, 0.04618367, 0.04876248,
       0.02484967, 0.11530121, 0.15703306, 0.10358275, 0.04916597,
       0.05092775, 0.02420151])
Next, let's display these sorted by the greatest importance and their cumulative importance.
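The code that produces this view isn't shown here; a minimal sketch, assuming the per-column importances line up with the targets feature list (the feat_imp name is illustrative):

import pandas as pd

# rank features by importance and track the running (cumulative) total
feat_imp = pd.DataFrame({'feature': targets,
                         'importance': pipeline['clf'].feature_importances_})
feat_imp = feat_imp.sort_values('importance', ascending=False)
feat_imp['cumulative'] = feat_imp['importance'].cumsum()
print(feat_imp)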
    feature     importance  cumulative
6   housing     0.078803    0.623837
3   education   0.072885    0.696722
4   default     0.056480    0.753202
12  pdays       0.048966    0.802168
8   contact     0.043289    0.845457
7   loan        0.037978    0.883436
14  poutcome    0.034298    0.917733
10  month       0.028382    0.946116
5   balance     0.028184    0.974300
11  campaign    0.021657    0.995957
9   day         0.004043    1.000000
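The selection loop referenced in the next paragraph isn't shown here; a minimal sketch, assuming a feature is kept when its importance clears a fixed threshold (the 0.05 cutoff is illustrative):

# keep only the features whose importance clears the threshold
included_feats = []
for feature, importance in zip(targets, pipeline['clf'].feature_importances_):
    if importance > 0.05:
        included_feats.append(feature)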
Finally, based on that loop, let's print out the features we've selected overall. With this analysis we removed about 50% of the features from our model, and we can see which ones have the highest impact!
print('Most Important Features:')
print(included_feats)
print('Number of Included Features =', len(included_feats))
Most Important Features:
['age', 'job', 'marital', 'education', 'default', 'housing', 'previous']
Number of Included Features = 7
Conclusion
Thank you for reading! You can find all the code for this article on GitHub.