EvoFeat: Genetic Programming-based Feature Engineering Approach to Tabular Data Classification

Hengzhe Zhang, Qi Chen, Bing Xue, Yan Wang, Aimin Zhou, Mengjie Zhang
Victoria University of Wellington
06/06/2023

Table of Contents
1 Introduction
2 Related Work
3 Preliminaries
4 The Proposed Algorithm
5 Experiments

Introduction
Tabular Data Learning: Widely used in recommendation systems [1] and advertising [2].
Goal: Capture the relationship between explanatory variables $\{x_1, \dots, x_m\}$ and a response variable $y$.
Dataset Structure: $\{(\{x^1_1, \dots, x^1_m\}, y^1), \dots, (\{x^n_1, \dots, x^n_m\}, y^n)\}$, where $n$ is the number of instances.
Challenge
Linear models assume linear relationships.
Decision trees assume axis-parallel decision boundaries.
Real-world data often violates these assumptions.
[1] Ruoxi Wang et al., Proceedings of the Web Conference 2021 (2021)
[2] Haizhi Yang et al., Proceedings of the 30th ACM International Conference on Information & Knowledge Management (2021)

Feature Engineering Techniques
Manual Design: Based on domain knowledge.
Kernel Methods: Use kernel tricks to transform data into higher dimensions.
Deep Learning: Leverages neural networks to learn features automatically [1].
Limitations
Manual Design: Labor-intensive.
Kernel Methods: Hard to integrate with tree-based methods.
Deep Learning: Requires large datasets; its effectiveness on small, heterogeneous datasets is debatable [2].
[1] Jianxun Lian et al., Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2018)
[2] Yury Gorishniy et al., Advances in Neural Information Processing Systems (2021)

Motivation
Objective: Feature construction using genetic programming (GP).
GP Advantages: Gradient-free, interpretable, and flexible.
Hypothesis: GP-based feature engineering can outperform both traditional and
deep learning methods on tabular data.
Our Approach: EvoFeat
Constructs nonlinear features with GP.
Enhances ensemble learning models.
Uses cross-validation and feature importance for evaluation.

Related Work
Beam Search Methods:
▶ Greedy, lacks strong mechanisms to prevent overfitting.
Deep Learning Methods:
▶ Effectiveness in comparison to tree-based methods is still debated [1].
1

Beam Search Methods
Iterative Feature Generation:
▶ Starts with low-order features.
▶ Generates higher-order features based on important low-order features [1].
Evaluation:
▶ Uses logistic regression accuracy or XGBoost feature importance.
▶ Sole reliance on training loss can lead to overfitting.
Key Limitation
The lack of effective mechanisms to prevent overfitting restricts feature
construction capabilities.
[1] Yuanfei Luo et al., Proceedings of the 25th ACM SIGKDD International Conference on Knowledge

Deep Learning Methods
High-Order Feature Construction:
▶ Cross Network in DCN.
▶ Field-wise feature cross in xDeepFM [1].
▶ Attention mechanism in AutoInt [2].
Effectiveness
Their advantage over fully connected neural networks is debatable [3].
Comprehensive comparisons with XGBoost are lacking [4].
[1] Jianxun Lian et al., Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2018)
[2] Weiping Song et al., Proceedings of the 28th ACM International Conference on Information and Knowledge Management (2019)
3
4

Evolutionary Feature Construction
Single Learner:
▶ Traditionally focused on simple learners such as single decision trees [1].
▶ Gap in enhancing state-of-the-art algorithms.
Ensemble-based Feature Construction:
▶ Promising results in regression [2].
▶ Requires adaptation for tabular classification.
Notes
Adapting evolutionary feature construction techniques for classification involves:
Adapting loss functions.
Using logistic regression models as base learners.
[1] Binh Tran, Bing Xue, and Mengjie Zhang, Pattern Recognition (2019)
[2] Hengzhe Zhang, Aimin Zhou, and Hu Zhang, IEEE Transactions on Evolutionary Computation (2021)

Feature Engineering Process
Feature Initialization:
▶ Construct initial features based on domain knowledge or randomly.
Feature Evaluation:
▶ Evaluate features using cross-validation and calculate feature importance.
Feature Improvement:
▶ Discard ineffective features and replace them with new ones derived from
important features.
Feature engineering workflow.

Feature Evaluation and Improvement
Cross-Validation:
▶ Evaluates generalization performance.
Feature Importance:
▶ Identifies useful features.
▶ Risky to rely solely on feature importance.
Key Insight
Constructing multiple sets of features and evaluating them using cross-validation
can provide better insights into their generalization capabilities.

Feature Representation
Symbolic Trees:
▶ Each individual has k GP trees representing k new features.
Tree Structure:
▶ Non-leaf nodes: Functions (e.g., +, −, ∗, log, sin).
▶ Leaf nodes: Original Features.
Base Learners:
▶ Decision trees or logistic regression models.
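As a rough illustration of this representation, the sketch below encodes one GP tree as a nested tuple and evaluates it on a single instance. The tuple encoding, the `FUNCTIONS` table, the helper names, and the protected logarithm are assumptions made for this example, not details taken from EvoFeat.

```python
import math

# Hypothetical function set; the protected log is an assumption to avoid domain errors.
FUNCTIONS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "log": lambda a: math.log(abs(a) + 1e-12),
    "sin": math.sin,
}


def evaluate(tree, x):
    """Evaluate one tree on one instance x (a list of original feature values)."""
    if isinstance(tree, int):      # leaf node: index of an original feature
        return x[tree]
    name, *children = tree         # non-leaf node: (function name, child, ...)
    return FUNCTIONS[name](*(evaluate(child, x) for child in children))


def construct_features(individual, x):
    """An individual holds k trees and therefore yields k constructed features."""
    return [evaluate(tree, x) for tree in individual]


# An individual with k = 2 trees: x0 + x1 * x1 and sin(x0 - x1).
individual = [("add", 0, ("mul", 1, 1)), ("sin", ("sub", 0, 1))]
features = construct_features(individual, [1.0, 2.0])
```

The constructed features would then be passed to the individual's base learner in place of, or alongside, the original columns.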

Algorithm Framework
Initialization
Randomly initialize N individuals, each with k symbolic trees.
Evaluation
Evaluate individuals using cross-validation loss.
Calculate feature importance for each feature.
Selection
Use lexicase selection [1] to select parent individuals based on cross-validation losses.
Generation
Generate new individuals using self-competitive crossover and guided mutation [2].
Archive Update
Update archive with top-performing models using reduced-error pruning [3].
[1] William La Cava et al., Evolutionary Computation (2019)
[2] Hengzhe Zhang et al., IEEE Transactions on Evolutionary Computation (2023)
[3] Rich Caruana et al., Proceedings of the Twenty-First International Conference on Machine Learning (2004)

Feature Initialization
Initialization Strategy:
▶ Ramped-half-and-half for symbolic trees.
▶ Half full trees, half random depth.
Base Learner Assignment:
▶ Randomly assign a decision tree or logistic regression model.
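A minimal sketch of this initialization, assuming nested-tuple trees with integer leaves (feature indices). The function set, the 0.3 early-stop probability in grow mode, the depth range, and all helper names are illustrative assumptions rather than EvoFeat's settings.

```python
import random

FUNCTIONS = {"add": 2, "sub": 2, "mul": 2, "log": 1, "sin": 1}  # name -> arity (assumed set)


def random_tree(depth, n_features, rng, full):
    """Build a tree whose depth is exactly `depth` (full) or at most `depth` (grow)."""
    # Grow mode may stop early at a terminal; full mode stops only at the depth limit.
    if depth == 0 or (not full and rng.random() < 0.3):
        return rng.randrange(n_features)            # leaf: original feature index
    name = rng.choice(sorted(FUNCTIONS))
    children = tuple(random_tree(depth - 1, n_features, rng, full)
                     for _ in range(FUNCTIONS[name]))
    return (name,) + children


def ramped_half_and_half(n_trees, n_features, rng, min_depth=1, max_depth=3):
    """Alternate full and grow trees while ramping the depth limit."""
    span = max_depth - min_depth + 1
    trees = []
    for i in range(n_trees):
        depth = min_depth + (i // 2) % span         # each depth gets a full and a grow tree
        trees.append(random_tree(depth, n_features, rng, full=(i % 2 == 0)))
    return trees


def init_individual(k, n_features, rng):
    """One individual: k trees plus a randomly assigned base learner."""
    return {"trees": ramped_half_and_half(k, n_features, rng),
            "learner": rng.choice(["decision_tree", "logistic_regression"])}


def depth_of(tree):
    return 0 if isinstance(tree, int) else 1 + max(depth_of(c) for c in tree[1:])
```

Calling `init_individual(k, n_features, random.Random(seed))` N times would give a population of N individuals with k trees each.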

Feature Selection
Three selection operators in EvoFeat:
▶ Base Learner Selection
▶ Individual Selection: Lexicase Selection
▶ Feature Selection: Softmax Selection

Base Learner Selection
Divide population into two subgroups (decision trees, logistic regression).
Random mating probability (rmp = 0.5):
▶ 50%: Select parents from different subgroups.
▶ 50%: Select parents from the same subgroup.
Inspired by multitask GP [1].
[1] Fangfang Zhang et al., IEEE Transactions on Cybernetics (2021)
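The subgroup logic on this slide can be sketched as follows. The dictionary layout, the `pick_parents` name, and the uniform random choice within a subgroup are assumptions for illustration (in EvoFeat the choice within a subgroup is made by lexicase selection); both subgroups are assumed to be non-empty.

```python
import random


def pick_parents(population, rng, rmp=0.5):
    """population: list of dicts with a 'learner' key equal to 'DT' or 'LR'."""
    groups = {"DT": [], "LR": []}
    for individual in population:
        groups[individual["learner"]].append(individual)

    first_kind = rng.choice(["DT", "LR"])
    if rng.random() < rmp:
        # With probability rmp, the second parent comes from the other subgroup.
        second_kind = "LR" if first_kind == "DT" else "DT"
    else:
        # Otherwise both parents come from the same subgroup.
        second_kind = first_kind

    # Placeholder: uniform sampling stands in for the real within-group selection.
    return rng.choice(groups[first_kind]), rng.choice(groups[second_kind])
```

With `rmp = 0.5`, about half of all matings exchange material between decision-tree and logistic-regression individuals.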

Individual Selection: Lexicase Selection
Selects individuals based on a vector of cross-validation losses, one for each
instance.
Constructs filters based on each loss value [1]:
$$\tau_j = \min_i L^i_j + \epsilon_j \quad (1)$$
Where:
▶ $\tau_j$ is the threshold,
▶ $L^i_j$ is the loss of the $i$-th individual on the $j$-th instance,
▶ $\epsilon_j$ is the median absolute deviation.
[1] William La Cava et al., Evolutionary Computation (2019)
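A minimal sketch of one lexicase selection event using the threshold from Eq. (1). It assumes that the median absolute deviation is computed once per instance over the whole population and that instances are visited in random order; the function names are hypothetical.

```python
import random
from statistics import median


def mad(values):
    """Median absolute deviation of a list of numbers."""
    m = median(values)
    return median([abs(v - m) for v in values])


def lexicase_select(losses, rng):
    """losses[i][j] is the loss of individual i on instance j; returns one index."""
    n_individuals, n_instances = len(losses), len(losses[0])
    # Assumption: epsilon_j is computed once over the whole population.
    eps = [mad([losses[i][j] for i in range(n_individuals)])
           for j in range(n_instances)]

    candidates = list(range(n_individuals))
    order = list(range(n_instances))
    rng.shuffle(order)                      # instances act as filters in random order
    for j in order:
        tau = min(losses[i][j] for i in candidates) + eps[j]
        candidates = [i for i in candidates if losses[i][j] <= tau]
        if len(candidates) == 1:
            break
    return rng.choice(candidates)           # ties are broken at random
```

Because each selection event uses a different instance order, individuals that are strong on different subsets of instances can all be selected as parents.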

Softmax Selection
Select features based on importance values $\{\theta_1, \dots, \theta_k\}$.
Uses the softmax function:
$$P(\theta_i) = \frac{e^{\theta_i/T}}{\sum_{j=1}^{k} e^{\theta_j/T}} \quad (2)$$
Good features are sampled by $P(\theta_i)$, bad features by $P(-\theta_i)$.
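Eq. (2) and the two sampling rules can be sketched as below. The temperature default, the max-subtraction trick for numerical stability, and the function names are assumptions for illustration; the temperature is assumed to be positive.

```python
import math
import random


def softmax(values, temperature=1.0):
    """Softmax with temperature T, as in Eq. (2)."""
    # Subtracting the maximum keeps exp() stable without changing the result.
    peak = max(v / temperature for v in values)
    exps = [math.exp(v / temperature - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_good_and_bad(importances, rng, temperature=1.0):
    """Return (index of a likely-good feature, index of a likely-bad feature)."""
    indices = range(len(importances))
    good = rng.choices(indices, weights=softmax(importances, temperature))[0]
    # Negating the importances turns the least important features into the likeliest picks.
    negated = [-v for v in importances]
    bad = rng.choices(indices, weights=softmax(negated, temperature))[0]
    return good, bad
```

A smaller temperature makes the sampling greedier, while a larger one moves it toward uniform sampling.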

Offspring Generation: Self-Competitive Crossover
Self-Competitive Crossover:
▶ Transfers beneficial material from good features to bad features.
▶ Biased crossover: only bad features are modified, so top-performing features are preserved [1].
[1] Su Nguyen et al., IEEE Transactions on Cybernetics (2021)
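A minimal sketch of the biased crossover idea, assuming nested-tuple trees with integer leaves. The indices of the good and bad features are passed in directly here (in EvoFeat they would come from softmax selection), and all helper names are hypothetical.

```python
import random


def subtree_paths(tree, path=()):
    """Yield the path (tuple of child positions) of every node in the tree."""
    yield path
    if isinstance(tree, tuple):
        for k, child in enumerate(tree[1:], start=1):
            yield from subtree_paths(child, path + (k,))


def get_subtree(tree, path):
    for k in path:
        tree = tree[k]
    return tree


def replace_subtree(tree, path, new):
    """Return a copy of `tree` with the node at `path` replaced by `new`."""
    if not path:
        return new
    k = path[0]
    return tree[:k] + (replace_subtree(tree[k], path[1:], new),) + tree[k + 1:]


def self_competitive_crossover(features, good_idx, bad_idx, rng):
    """Copy a random subtree of the good feature into the bad feature only."""
    donor, receiver = features[good_idx], features[bad_idx]
    material = get_subtree(donor, rng.choice(list(subtree_paths(donor))))
    target = rng.choice(list(subtree_paths(receiver)))
    child = list(features)                  # the parent's feature list is left untouched
    child[bad_idx] = replace_subtree(receiver, target, material)
    return child
```

Only the entry at `bad_idx` changes, so the donor feature and every other feature survive unchanged in the offspring.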

Feature Importance
Decision Tree:
▶ Calculated by the total reduction of Gini impurity contributed by each feature ϕ.
Logistic Regression:
▶ Calculated by the absolute value of the model coefficients.
▶ Features are standardized to ensure equal influence on the coefficients.

Offspring Generation: Guided Mutation
Guided Mutation:
▶ Replaces a subtree with a randomly generated subtree.
▶ Uses a guided probability vector for terminal variable selection.
▶ The probability vector corresponds to the terminal usage of archived individuals.
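One way to build such a probability vector is to count how often each original feature appears as a terminal in the archived individuals. The sketch below assumes nested-tuple trees with integer leaves and adds additive smoothing so that unused features keep a nonzero probability; the smoothing and the function names are assumptions, not part of the original description.

```python
import random
from collections import Counter


def terminals(tree):
    """Yield the feature indices used at the leaves of a nested-tuple tree."""
    if isinstance(tree, int):
        yield tree
    else:
        for child in tree[1:]:
            yield from terminals(child)


def terminal_probabilities(archive, n_features, smoothing=1.0):
    """archive: list of individuals, each a list of trees."""
    counts = Counter(t for individual in archive
                     for tree in individual
                     for t in terminals(tree))
    # Assumption: additive smoothing keeps every feature reachable by mutation.
    weights = [counts[i] + smoothing for i in range(n_features)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_terminal(probabilities, rng):
    """Pick the feature index for a new leaf in the randomly generated subtree."""
    return rng.choices(range(len(probabilities)), weights=probabilities)[0]
```

Features that archived individuals use often are then more likely to appear in newly generated subtrees.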

Feature Evaluation
Cross-Validation:
▶ Partition the training set into five folds.
▶ Train on four folds, validate on one fold.
Loss Function:
▶ Cross entropy:
$$-\sum_{c \in C} p_c \log(q_c) \quad (3)$$
▶ Where $p_c$ is the true probability of class $c$ and $q_c$ is the predicted probability.
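A small sketch of the evaluation step: five-fold index generation plus a per-instance cross-entropy loss, written with the conventional leading minus sign so that lower is better. The shuffling, the `eps` guard against log(0), and the function names are assumptions for illustration; stratified splitting is not shown.

```python
import math
import random


def cross_entropy(p, q, eps=1e-12):
    """Per-instance loss -sum_c p_c * log(q_c); eps guards against log(0)."""
    return -sum(pc * math.log(max(qc, eps)) for pc, qc in zip(p, q))


def five_fold_indices(n, rng, n_folds=5):
    """Yield (train_idx, valid_idx) pairs; each instance is validated exactly once."""
    order = list(range(n))
    rng.shuffle(order)
    folds = [order[f::n_folds] for f in range(n_folds)]
    for f in range(n_folds):
        train = [i for g in range(n_folds) if g != f for i in folds[g]]
        yield train, folds[f]


# One-hot true label for class 0 and a uniform two-class prediction.
loss = cross_entropy([1.0, 0.0], [0.5, 0.5])
```

Collecting the validation loss of every instance across the five folds gives the per-instance loss vector that lexicase selection operates on.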

Experiments
Objective: Compare EvoFeat with popular machine learning and deep learning
methods.
Datasets: 130 datasets from DIGEN and PMLB benchmarks.
▶ DIGEN [1]: A total of 40 diverse synthetic datasets generated using genetic programming.
▶ PMLB [2]: Collection of real-world datasets from OpenML.
  Focus on classification tasks with more than 200 instances.
  A total of 90 datasets selected, where the product of the number of instances and the number of features is less than $10^5$ due to memory constraints.
[1] https://github.com/EpistasisLab/digen
[2] https://github.com/EpistasisLab/pmlb

Experimental Settings
Evaluation Protocol:
▶ 80% training, 20% testing.
▶ 5-fold cross-validation on the training set for parameter tuning.
▶ Repeat experiments with 30 random seeds.
Hyperparameter Tuning:
▶ Use Heteroscedastic Evolutionary Bayesian Optimization (HEBO) [1] for tuning baseline algorithms.
[1] Alexander I Cowen-Rivers et al., Journal of Artificial Intelligence Research (2022)
The detailed parameter space is shown in the paper.
Below is an example parameter space for tuning.
Parameter Space of FT-Transformer
Hyperparameter       Range
Attention Dropout    Uniform[0, 0.5]
Residual Dropout     Uniform[0, 0.2]
FFN Dropout          Uniform[0, 0.5]
FFN Factor           Uniform[2/3, 8/3]
Token Dimension      UniformInt[64, 512]
Layers               UniformInt[1, 4]
Learning Rate        UniformLog[1e-4, 1e-1]
Weight Decay         UniformLog[1e-6, 1e-3]

Baseline Algorithms
Machine Learning:
▶ XGBoost [1], LightGBM [2], Random Forest (RF), Decision Tree (DT), Logistic Regression (LR), K-Nearest Neighbors (KNN).
Deep Learning:
▶ Multilayer Perceptron (MLP), ResNet, DCN V2 [3], FT-Transformer [4].
[1] Tianqi Chen and Carlos Guestrin, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016)
[2] Guolin Ke et al., Advances in Neural Information Processing Systems (2017)
3
4

Large-Scale Experiments
Comparison:
▶ Evaluate EvoFeat against traditional and deep learning methods.
Results:
▶ EvoFeat outperforms state-of-the-art methods in average accuracy.
▶ Demonstrates significant improvements in predictive performance.
[Figure: axes are Algorithm and Balanced Accuracy]
(a) Balanced testing accuracy.
[Figure: axes are Algorithm and Percentage Improvement]
(b) Improvement in accuracy.
Training Time:
▶ EvoFeat has comparable training time to a fine-tuned LightGBM.
▶ EvoFeat is much faster than a fine-tuned FT-Transformer.
[Figure: axes are Algorithm and Time (s)]
Training Time (seconds).

Comparison with Traditional Methods
Baselines: XGBoost, LightGBM, RF, DT, LR, KNN.
Results:
▶ EvoFeat achieves the best accuracy.
▶ Significant improvements over XGBoost and LightGBM.
Statistical results of balanced testing accuracy on 90 PMLB and 40 DIGEN datasets.
            XGBoost     LightGBM    RF          LR          KNN         EvoFeat
DT          0/48/82     2/47/81     0/43/87     60/36/34    60/27/43    0/34/96
XGBoost     -           13/107/10   43/79/8     72/50/8     107/16/7    4/67/59
LightGBM    -           -           45/75/10    74/42/14    107/15/8    5/72/53
RF          -           -           -           73/47/10    102/20/8    7/62/61
LR          -           -           -           -           54/13/63    7/44/79
KNN         -           -           -           -           -           3/15/112

Comparison with Deep Learning Methods
Baselines: MLP, ResNet, DCN V2, FT-Transformer.
Results:
▶ Deep learning methods perform comparably to RF.
▶ EvoFeat outperforms these deep learning methods significantly.
Statistical results of balanced testing accuracy on 90 PMLB and 40 DIGEN datasets.
                  ResNet      DCN V2      FT-Transformer    EvoFeat
MLP               18/96/16    9/118/3     10/76/44          4/33/93
ResNet            -           8/99/23     46/73/11          3/32/95
DCN V2            -           -           45/79/6           2/35/93
FT-Transformer    -           -           -                 4/34/92
EvoFeat           -           -           -                 -

Ablation Studies
Objective: Validate improvements from heterogeneous base learners and
feature importance-guided search.
Components:
▶ Heterogeneous base learners: Compare EvoFeat with different combinations of
base learners.
▶ Feature importance-guided search: Evaluate the effectiveness of feature
importance-guided operators.

Base Learners
Objective: Compare heterogeneous base learners (DT+LR) with single base
learners (DT, LR).
Results:
▶ DT+LR achieves better average performance.
▶ Significant improvements over single learners.
Comparison of balanced testing accuracy across different base learners on 90 PMLB
datasets.
      LR                    DT+LR
DT    12(+)/47(∼)/31(-)     0(+)/62(∼)/28(-)
LR    -                     5(+)/70(∼)/15(-)
[Figure: axes are Dataset and Values; legend (Model): DT, LR, DT+LR]
Balanced testing accuracy with different base learners.

Feature Importance-Guided Search
Objective: Evaluate the effectiveness of feature importance-guided operators.
Methods:
▶ Compare random crossover and mutation (Random) with softmax-based
self-competitive crossover and guided mutation (SS+GM).
Results:
▶ Feature importance-guided search achieves better performance.
Comparison of balanced testing accuracy across different selection operators on 40 DIGEN
datasets.
         SC+GM                GM                   Random
SS+GM    12(+)/26(∼)/2(-)     5(+)/34(∼)/1(-)      12(+)/28(∼)/0(-)
SC+GM    -                    0(+)/30(∼)/10(-)     5(+)/30(∼)/5(-)
GM       -                    -                    5(+)/35(∼)/0(-)
[Figure: axes are Dataset and Values; legend (Model): SS+GM, SC+GM, GM, Random]
Balanced testing accuracy with different selection operators.

Conclusion
Summary:
▶ EvoFeat outperforms state-of-the-art methods.
▶ Heterogeneous base learners and feature importance-guided search improve
performance.
Future Work:
▶ Investigate modularization techniques for improved interpretability.
▶ Use diversity optimization to enhance ensemble performance.