EvoFeat: Genetic Programming-based Feature Engineering Approach to Tabular Data Classification
Hengzhe Zhang, Qi Chen, Bing Xue, Yan Wang, Aimin Zhou, Mengjie Zhang
Victoria University of Wellington
06/06/2023
Table of Contents
1 Introduction
2 Related Work
3 Preliminaries
4 The Proposed Algorithm
5 Experiments
Introduction
Tabular Data Learning: Widely used in recommendation systems [1] and advertising [2].
Goal: Capture the relationship between explanatory variables $\{x_1, \dots, x_m\}$ and a response variable $y$.
Dataset Structure: $\{(\{x_1^1, \dots, x_m^1\}, y^1), \dots, (\{x_1^n, \dots, x_m^n\}, y^n)\}$, where $n$ is the number of instances.
Challenge
Linear models assume linear relationships.
Decision trees assume axis-parallel decision boundaries.
Real-world data often violates these assumptions.
[1] Ruoxi Wang et al., Proceedings of the Web Conference 2021 (2021)
[2] Haizhi Yang et al., Proceedings of the 30th ACM International Conference on Information & Knowledge Management (2021)
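A minimal sketch of the challenge above, on a hypothetical four-point XOR-style dataset (not from the slides): no single axis-parallel threshold on x1 or x2 does better than chance, while one threshold on the constructed feature x1 * x2 separates the classes. The helper name `best_threshold_accuracy` is an assumption made for illustration.

```python
# Toy XOR-style data (hypothetical): the class depends on the sign of x1 * x2.
points = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
labels = [1, 0, 0, 1]  # 1 when both inputs share a sign


def best_threshold_accuracy(values, labels):
    """Best accuracy of a single axis-parallel split on one feature."""
    best = 0.0
    for t in sorted(set(values)):
        for positive_side in (True, False):
            # Predict class 1 on one side of the threshold, class 0 on the other.
            preds = [int((v > t) == positive_side) for v in values]
            acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
            best = max(best, acc)
    return best


acc_x1 = best_threshold_accuracy([p[0] for p in points], labels)
acc_x2 = best_threshold_accuracy([p[1] for p in points], labels)
# Constructed nonlinear feature: the product of the two original features.
acc_product = best_threshold_accuracy([p[0] * p[1] for p in points], labels)
print(acc_x1, acc_x2, acc_product)
```

A single split on either original feature stays at 50% here, which is the kind of gap that constructed features are meant to close.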
Feature Engineering Techniques
Manual Design: Based on domain knowledge.
Kernel Methods: Use kernel tricks to transform data into higher dimensions.
Deep Learning: Leverages neural networks to learn features automatically [1].
Limitations
Manual Design: Labor-intensive.
Kernel Methods: Hard to integrate with tree-based methods.
Deep Learning: Requires large datasets; effectiveness on small, heterogeneous datasets is debatable [2].
[1] Jianxun Lian et al., Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2018)
[2] Yury Gorishniy et al., Advances in Neural Information Processing Systems (2021)
Motivation
Objective: Feature construction using genetic programming (GP).
GP Advantages: Gradient-free, interpretable, and flexible.
Hypothesis: GP-based feature engineering can outperform both traditional and
deep learning methods on tabular data.
Our Approach: EvoFeat
Constructs nonlinear features with GP.
Enhances ensemble learning models.
Uses cross-validation and feature importance for evaluation.
Related Work
Beam Search Methods:
▶ Greedy, lacks strong mechanisms to prevent overfitting.
Deep Learning Methods:
▶ Effectiveness in comparison to tree-based methods is still debated [1].
[1] Yury Gorishniy et al., Advances in Neural Information Processing Systems (2021)
Beam Search Methods
Iterative Feature Generation:
▶ Starts with low-order features.
▶ Generates higher-order features based on important low-order features [1].
Evaluation:
▶ Uses logistic regression accuracy or XGBoost feature importance.
▶ Sole reliance on training loss can lead to overfitting.
Key Limitation
The lack of effective mechanisms to prevent overfitting restricts feature
construction capabilities.
[1] Yuanfei Luo et al., Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2019)
Deep Learning Methods
High-Order Feature Construction:
▶ Cross Network in DCN.
▶ Field-wise feature cross in xDeepFM [1].
▶ Attention mechanism in AutoInt [2].
Effectiveness
Effectiveness over fully connected NN is debatable [3].
Lack of comprehensive studies comparing with XGBoost [4].
[1] Jianxun Lian et al., Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2018)
[2] Weiping Song et al., Proceedings of the 28th ACM International Conference on Information and Knowledge Management (2019)
[3] Ruoxi Wang et al., Proceedings of the Web Conference 2021 (2021)
[4] Yury Gorishniy et al., Advances in Neural Information Processing Systems (2021)
Evolutionary Feature Construction
Single Learner:
▶ Traditionally, more focus on simple learners like single decision trees [1].
▶ Gap in enhancing state-of-the-art algorithms.
Ensemble-based Feature Construction:
▶ Promising results in regression [2].
▶ Requires adaptation for tabular classification.
Notes
Adapting evolutionary feature construction techniques for classification involves:
Adapting loss functions.
Using logistic regression models as base learners.
[1] Binh Tran, Bing Xue, and Mengjie Zhang, Pattern Recognition (2019)
[2] Hengzhe Zhang, Aimin Zhou, and Hu Zhang, IEEE Transactions on Evolutionary Computation (2021)
Preliminaries
Feature Engineering Process
Feature Initialization:
▶ Construct initial features based on domain knowledge or randomly.
Feature Evaluation:
▶ Evaluate features using cross-validation and calculate feature importance.
Feature Improvement:
▶ Discard ineffective features and replace them with new ones derived from
important features.
[Figure: Feature engineering workflow.]
Feature Evaluation and Improvement
Cross-Validation:
▶ Evaluates generalization performance.
Feature Importance:
▶ Identifies useful features.
▶ Risky to rely solely on feature importance.
Key Insight
Constructing multiple sets of features and evaluating them using cross-validation
can provide better insights into their generalization capabilities.
The Proposed Algorithm
Feature Representation
Symbolic Trees:
▶ Each individual has k GP trees representing k new features.
Tree Structure:
▶ Non-leaf nodes: Functions (e.g., +, −, ∗, log, sin).
▶ Leaf nodes: Original Features.
Base Learners:
▶ Decision trees or logistic regression models.
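The representation above can be sketched as follows. This is an illustrative encoding only: trees are nested tuples, leaves are strings such as "x0", and the function set and the protected `log` are assumptions, not the exact primitives used by EvoFeat.

```python
import math

# A GP tree is either a leaf "x<index>" or a tuple (function_name, child, ...).
FUNCTIONS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "log": lambda a: math.log(abs(a) + 1.0),  # protected log (assumption)
    "sin": math.sin,
}


def evaluate(tree, row):
    """Evaluate one symbolic tree on one instance (a list of original features)."""
    if isinstance(tree, str):
        return row[int(tree[1:])]
    name, *children = tree
    return FUNCTIONS[name](*(evaluate(child, row) for child in children))


def construct_features(individual, row):
    """An individual holds k trees and yields k constructed features."""
    return [evaluate(tree, row) for tree in individual]


individual = [("mul", "x0", "x1"), ("sub", ("sin", "x2"), "x0")]  # k = 2
row = [2.0, 3.0, 0.0]
features = construct_features(individual, row)
print(features)
```

The constructed feature vector is what the base learner attached to the individual would be trained on.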
Algorithm Framework
Initialization
Randomly initialize N individuals, each with k symbolic trees.
Evaluation
Evaluate individuals using cross-validation loss.
Calculate feature importance for each feature.
Selection
Use lexicase selection [1] to select parent individuals based on cross-validation losses.
Generation
Generate new individuals using self-competitive crossover and guided mutation [2].
Archive Update
Update archive with top-performing models using reduce-error pruning [3].
[1] William La Cava et al., Evolutionary Computation (2019)
[2] Hengzhe Zhang et al., IEEE Transactions on Evolutionary Computation (2023)
[3] Rich Caruana et al., Proceedings of the Twenty-First International Conference on Machine Learning (2004)
Feature Initialization
Initialization Strategy:
▶ Ramped-half-and-half for symbolic trees.
▶ Half full trees, half random depth.
Base Learner Assignment:
▶ Randomly assign decision tree or logistic regression model.
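A rough sketch of ramped half-and-half initialization, using the same nested-tuple encoding as a stand-in for GP trees. The primitive set, the depth range 1 to 3, and the 0.3 early-stop probability in grow mode are illustrative assumptions.

```python
import random

FUNCTIONS = {"add": 2, "sub": 2, "mul": 2, "sin": 1}  # name -> arity (assumed set)
TERMINALS = ["x0", "x1", "x2"]


def grow_tree(rng, depth, full):
    """'full' expands every branch to the depth limit; otherwise branches may stop early."""
    if depth == 0 or (not full and rng.random() < 0.3):
        return rng.choice(TERMINALS)
    name = rng.choice(sorted(FUNCTIONS))
    return (name, *(grow_tree(rng, depth - 1, full) for _ in range(FUNCTIONS[name])))


def ramped_half_and_half(rng, n_trees, min_depth=1, max_depth=3):
    trees = []
    for i in range(n_trees):
        depth = min_depth + i % (max_depth - min_depth + 1)  # ramp over depth limits
        trees.append(grow_tree(rng, depth, full=(i % 2 == 0)))  # alternate full / grow
    return trees


def depth_of(tree):
    return 0 if isinstance(tree, str) else 1 + max(depth_of(c) for c in tree[1:])


rng = random.Random(0)
trees = ramped_half_and_half(rng, n_trees=6)
learners = [rng.choice(["DT", "LR"]) for _ in trees]  # random base learner assignment
```

Mixing full and variable-depth trees gives the initial population a spread of feature complexities.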
Feature Selection
Three selection operators in EvoFeat:
▶ Base Learner Selection
▶ Individual Selection: Lexicase Selection
▶ Feature Selection: Softmax Selection
Base Learner Selection
Divide population into two subgroups (decision trees, logistic regression).
Random mating probability (rmp = 0.5):
▶ 50%: Select parents from different subgroups.
▶ 50%: Select parents from the same subgroup.
Inspired by multitask GP [1].
[1] Fangfang Zhang et al., IEEE Transactions on Cybernetics (2021)
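The subgroup mating rule can be sketched as below. The dictionary layout and the function name `pick_parents` are hypothetical, and uniform random choice stands in for the lexicase selection that would pick parents inside the chosen subgroup.

```python
import random


def pick_parents(rng, population, rmp=0.5):
    """population: list of dicts with a 'learner' key, either 'DT' or 'LR'."""
    groups = {"DT": [], "LR": []}
    for ind in population:
        groups[ind["learner"]].append(ind)
    first = rng.choice(population)
    same = groups[first["learner"]]
    other = groups["LR" if first["learner"] == "DT" else "DT"]
    # With probability rmp mate across subgroups, otherwise stay within the subgroup.
    pool = other if (rng.random() < rmp and other) else same
    second = rng.choice(pool)
    return first, second


population = [{"id": i, "learner": "DT" if i % 2 == 0 else "LR"} for i in range(10)]
parent_a, parent_b = pick_parents(random.Random(0), population, rmp=0.5)
```

With rmp = 0.5, about half of the matings exchange material between decision-tree and logistic-regression individuals.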
Individual Selection: Lexicase Selection
Selects individuals based on a vector of cross-validation losses, one for each instance.
Constructs filters based on each loss value [1]:
$$\tau_j = \min_i L_j^i + \epsilon_j \quad (1)$$
Where:
▶ $\tau_j$ is the threshold,
▶ $L_j^i$ is the loss of the $i$-th individual on the $j$-th instance,
▶ $\epsilon_j$ is the median absolute deviation of the losses on the $j$-th instance.
[1] William La Cava et al., Evolutionary Computation (2019)
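A compact sketch of the filtering in Eq. (1), assuming the threshold is applied to instances visited in random order and that the median absolute deviation is computed once over the whole population. The function names are illustrative.

```python
import random
from statistics import median


def mad(values):
    """Median absolute deviation of a list of losses."""
    m = median(values)
    return median([abs(v - m) for v in values])


def epsilon_lexicase(rng, losses):
    """losses[i][j]: loss of individual i on instance j. Returns a selected index."""
    n_cases = len(losses[0])
    # epsilon_j: median absolute deviation of the population's losses on instance j
    eps = [mad([row[j] for row in losses]) for j in range(n_cases)]
    candidates = list(range(len(losses)))
    cases = list(range(n_cases))
    rng.shuffle(cases)
    for j in cases:
        tau = min(losses[i][j] for i in candidates) + eps[j]  # Eq. (1)
        candidates = [i for i in candidates if losses[i][j] <= tau]
        if len(candidates) == 1:
            break
    return rng.choice(candidates)


losses = [[0.0, 0.0, 0.0], [1.0, 5.0, 9.0], [2.0, 6.0, 10.0]]
selected = epsilon_lexicase(random.Random(0), losses)
```

Because each call shuffles the instances, individuals that do well on different subsets of instances can all be selected as parents.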
Softmax Selection
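Select features based on importance values $\{\theta_1, \dots, \theta_k\}$.
Uses softmax function:
$$P(\theta_i) = \frac{e^{\theta_i / T}}{\sum_{j=1}^{k} e^{\theta_j / T}} \quad (2)$$
where $T$ is the temperature.
Good features are sampled by $P(\theta_i)$, bad features by $P(-\theta_i)$.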
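Eq. (2) can be sketched as follows. The temperature value 0.1 and the importance values are made up for illustration; the slides do not state the temperature used.

```python
import math
import random


def softmax(values, temperature=1.0):
    scaled = [v / temperature for v in values]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_good_and_bad(rng, importances, temperature=1.0):
    """Sample a good feature by P(theta) and a bad feature by P(-theta)."""
    idx = list(range(len(importances)))
    good = rng.choices(idx, weights=softmax(importances, temperature))[0]
    bad = rng.choices(idx, weights=softmax([-v for v in importances], temperature))[0]
    return good, bad


importances = [0.05, 0.10, 0.60, 0.25]  # hypothetical importance values
p_good = softmax(importances, temperature=0.1)
p_bad = softmax([-v for v in importances], temperature=0.1)
good, bad = pick_good_and_bad(random.Random(0), importances, temperature=0.1)
```

A lower temperature concentrates the sampling on the most (or least) important feature, while a higher one makes the choice closer to uniform.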
Offspring Generation: Self-Competitive Crossover
Self-Competitive Crossover:
▶ Transfers beneficial material from good features to bad features.
▶ Biased crossover, only modifies bad features, preserving good features [1].
▶ Ensures top-performing features are preserved.
[1] Su Nguyen et al., IEEE Transactions on Cybernetics (2021)
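One possible reading of the operator, sketched on nested-tuple trees: a random subtree of the good feature is copied into a random position of the bad feature, and the good feature itself is left unchanged. The helper names and the uniform choice of crossover points are assumptions.

```python
import random


def subtrees(tree, path=()):
    """Yield (path, subtree) for every node; a path is a tuple of child indices."""
    yield path, tree
    if not isinstance(tree, str):
        for k, child in enumerate(tree[1:], start=1):
            yield from subtrees(child, path + (k,))


def replace_at(tree, path, new):
    """Return a copy of tree with the node at path replaced by new."""
    if not path:
        return new
    k = path[0]
    return tree[:k] + (replace_at(tree[k], path[1:], new),) + tree[k + 1:]


def self_competitive_crossover(rng, individual, good, bad):
    """Copy material from the good feature into the bad one; only `bad` changes."""
    _, donor = rng.choice(list(subtrees(individual[good])))
    target_path, _ = rng.choice(list(subtrees(individual[bad])))
    child = list(individual)
    child[bad] = replace_at(individual[bad], target_path, donor)
    return child


individual = [("mul", "x0", "x1"), ("add", "x2", ("sin", "x0")), "x1"]
offspring = self_competitive_crossover(random.Random(0), individual, good=0, bad=2)
```

The indices `good` and `bad` would come from the softmax selection over feature importance values.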
Feature Importance
Decision Tree:
▶ Calculated by the total reduction of Gini impurity contributed by each feature ϕ.
Logistic Regression:
▶ Calculated by the absolute value of the model coefficients.
▶ Features are standardized to ensure equal influence on the coefficients.
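For the decision-tree case, the quantity being summed can be sketched for a single split. The data and threshold are hypothetical. A tree's importance for a feature accumulates such weighted reductions over every node that splits on that feature.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_reduction(values, labels, threshold):
    """Weighted impurity decrease from splitting on value <= threshold."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    children = sum(len(part) / n * gini(part) for part in (left, right) if part)
    return gini(labels) - children


labels = [0, 0, 1, 1]
informative = gini_reduction([0.1, 0.2, 0.8, 0.9], labels, threshold=0.5)
uninformative = gini_reduction([0.1, 0.8, 0.2, 0.9], labels, threshold=0.5)
print(informative, uninformative)
```

A constructed feature that separates the classes earns the full impurity reduction, while one unrelated to the label earns none.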
Offspring Generation: Guided Mutation
Guided Mutation:
▶ Replaces a subtree with a randomly generated subtree.
▶ Uses a guided probability vector for terminal variable selection.
▶ The probability vector corresponds to the terminal usage of archived individuals.
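A sketch of how such a probability vector might be built from an archive of nested-tuple trees. The additive smoothing term, which keeps unused variables selectable, is an assumption and not something stated on the slide.

```python
import random
from collections import Counter


def terminals(tree):
    """Yield every leaf (original variable) used in a tree."""
    if isinstance(tree, str):
        yield tree
    else:
        for child in tree[1:]:
            yield from terminals(child)


def terminal_probabilities(archive, variables, smoothing=1.0):
    """Usage frequency of each original variable across archived individuals."""
    counts = Counter(
        leaf for individual in archive for tree in individual for leaf in terminals(tree)
    )
    total = sum(counts[v] + smoothing for v in variables)
    return [(counts[v] + smoothing) / total for v in variables]


archive = [[("mul", "x0", "x1"), "x0"], [("add", "x0", ("sin", "x2"))]]
variables = ["x0", "x1", "x2", "x3"]
probs = terminal_probabilities(archive, variables)
# When mutation generates a new subtree, its leaves are drawn from this distribution.
new_leaf = random.Random(0).choices(variables, weights=probs)[0]
```

Variables that appear often in archived individuals are more likely to be reused in newly generated subtrees.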
Feature Evaluation
Cross-Validation:
▶ Partition the training set into five folds.
▶ Train on four folds, validate on one fold.
Loss Function:
▶ Cross entropy:
$$-\sum_{c \in C} p_c \log(q_c) \quad (3)$$
▶ Where $p_c$ is the true probability and $q_c$ is the predicted probability of class $c$.
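The evaluation step can be sketched as below, assuming one-hot true labels and the conventional leading minus sign, so that the loss of an instance is the negative log of the probability assigned to its true class. The interleaved fold assignment, without shuffling or stratification, is a simplification.

```python
import math


def k_fold_indices(n, k=5):
    """Yield (train, validation) index lists; each instance is validated exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, val in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val


def cross_entropy(true_class, predicted, eps=1e-12):
    """-sum_c p_c log q_c with a one-hot p, i.e. -log q of the true class."""
    return -math.log(max(predicted[true_class], eps))


splits = list(k_fold_indices(10, k=5))
loss_confident = cross_entropy(0, [0.9, 0.1])  # true class predicted with 0.9
loss_wrong = cross_entropy(1, [0.9, 0.1])      # true class predicted with 0.1
print(len(splits), loss_confident, loss_wrong)
```

Collecting the validation loss of every instance across the five folds gives the per-instance loss vector that lexicase selection operates on.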
Experiments
Objective: Compare EvoFeat with popular machine learning and deep learning
methods.
Datasets: 130 datasets from DIGEN and PMLB benchmarks.
▶ DIGEN [1]: A total of 40 diverse synthetic datasets generated using genetic programming.
▶ PMLB [2]: Collection of real-world datasets from OpenML.
Focus on classification tasks with more than 200 instances.
A total of 90 datasets selected where the product of the number of instances and the number of features is less than $10^5$, due to memory constraints.
[1] https://github.com/EpistasisLab/digen
[2] https://github.com/EpistasisLab/pmlb
Experimental Settings
Evaluation Protocol:
▶ 80% training, 20% testing.
▶ 5-fold cross-validation on the training set for parameter tuning.
▶ Repeat experiments with 30 random seeds.
Hyperparameter Tuning:
▶ Use Heteroscedastic Evolutionary Bayesian Optimization (HEBO) [1] for tuning baseline algorithms.
[1] Alexander I Cowen-Rivers et al., Journal of Artificial Intelligence Research (2022)
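A small sketch of the outer protocol and of the balanced accuracy metric reported in the results, taken here as the mean of per-class recall. The unstratified shuffle and the helper names are assumptions, and model fitting is left out.

```python
import random


def train_test_split(n, seed, test_fraction=0.2):
    """Shuffle instance indices with the given seed and cut off the test portion."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(round(n * (1 - test_fraction)))
    return idx[:cut], idx[cut:]


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        members = [p for t, p in zip(y_true, y_pred) if t == c]
        recalls.append(sum(p == c for p in members) / len(members))
    return sum(recalls) / len(recalls)


# One 80/20 split per random seed, 30 seeds in total.
splits = [train_test_split(100, seed) for seed in range(30)]
score = balanced_accuracy([0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 1, 0])
print(score)
```

In the example, plain accuracy would be 5/6, while balanced accuracy is 0.75 because the minority class is only half recalled.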
The detailed parameter space is shown in the paper.
Below is an example parameter space for tuning.
Parameter Space of FT-Transformer
Hyperparameter     Range
Attention Dropout  Uniform[0, 0.5]
Residual Dropout   Uniform[0, 0.2]
FFN Dropout        Uniform[0, 0.5]
FFN Factor         Uniform[2/3, 8/3]
Token Dimension    UniformInt[64, 512]
Layers             UniformInt[1, 4]
Learning Rate      UniformLog[1e-4, 1e-1]
Weight Decay       UniformLog[1e-6, 1e-3]
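To make the three range types concrete, the space can be written as a plain dictionary and sampled at random, with the FFN factor range taken as [2/3, 8/3]. Random sampling is only a stand-in here, since HEBO proposes configurations with Bayesian optimization, and the key names are illustrative.

```python
import math
import random

SPACE = {
    "attention_dropout": ("uniform", 0.0, 0.5),
    "residual_dropout": ("uniform", 0.0, 0.2),
    "ffn_dropout": ("uniform", 0.0, 0.5),
    "ffn_factor": ("uniform", 2 / 3, 8 / 3),
    "token_dimension": ("uniform_int", 64, 512),
    "layers": ("uniform_int", 1, 4),
    "learning_rate": ("uniform_log", 1e-4, 1e-1),
    "weight_decay": ("uniform_log", 1e-6, 1e-3),
}


def sample(rng, space):
    """Draw one configuration from the search space."""
    config = {}
    for name, (kind, low, high) in space.items():
        if kind == "uniform":
            config[name] = rng.uniform(low, high)
        elif kind == "uniform_int":
            config[name] = rng.randint(low, high)  # inclusive on both ends
        else:  # uniform_log: uniform in log space
            config[name] = math.exp(rng.uniform(math.log(low), math.log(high)))
    return config


config = sample(random.Random(0), SPACE)
```

Sampling the learning rate and weight decay in log space spreads trials evenly across orders of magnitude.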
Baseline Algorithms
Machine Learning:
▶ XGBoost [1], LightGBM [2], Random Forest (RF), Decision Tree (DT), Logistic Regression (LR), K-Nearest Neighbors (KNN).
Deep Learning:
▶ Multilayer Perceptron (MLP), ResNet, DCN V2 [3], FT-Transformer [4].
[1] Tianqi Chen and Carlos Guestrin, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016)
[2] Guolin Ke et al., Advances in Neural Information Processing Systems (2017)
[3] Ruoxi Wang et al., Proceedings of the Web Conference 2021 (2021)
[4] Yury Gorishniy et al., Advances in Neural Information Processing Systems (2021)
Large-Scale Experiments
Comparison:
▶ Evaluate EvoFeat against traditional and deep learning methods.
Results:
▶ EvoFeat outperforms state-of-the-art methods in average accuracy.
▶ Demonstrates significant improvements in predictive performance.
[Figure (a): Balanced testing accuracy. x-axis: Algorithm (LR, KNN, DT, DCN V2, MLP, ResNet, RF, FT-Transformer, LightGBM, XGBoost, EvoFeat); y-axis: Balanced Accuracy.]
[Figure (b): Improvement in accuracy. x-axis: Algorithm (XGBoost, LightGBM, FT-Transformer, RF, ResNet, MLP, DCN V2, DT, KNN, LR); y-axis: Percentage Improvement.]
Training Time:
▶ EvoFeat has comparable training time to a fine-tuned LightGBM.
▶ EvoFeat is much faster than a fine-tuned FT-Transformer.
[Figure: Training time (seconds). x-axis: Algorithm (LR, DT, RF, MLP, DCN V2, LightGBM, EvoFeat, KNN, ResNet, FT-Transformer, XGBoost); y-axis: Time (s).]
Comparison with Traditional Methods
Baseline: XGBoost, LightGBM, RF, DT, LR, KNN.
Results:
▶ EvoFeat achieves the best accuracy.
▶ Significant improvements over XGBoost and LightGBM.
Statistical results of balanced testing accuracy on 90 PMLB and 40 DIGEN datasets.
          XGBoost  LightGBM   RF        LR        KNN       EvoFeat
DT        0/48/82  2/47/81    0/43/87   60/36/34  60/27/43  0/34/96
XGBoost   —        13/107/10  43/79/8   72/50/8   107/16/7  4/67/59
LightGBM  —        —          45/75/10  74/42/14  107/15/8  5/72/53
RF        —        —          —         73/47/10  102/20/8  7/62/61
LR        —        —          —         —         54/13/63  7/44/79
KNN       —        —          —         —         —         3/15/112
Comparison with Deep Learning Methods
Baseline: MLP, ResNet, DCN V2, FT-Transformer.
Results:
▶ Deep learning methods perform comparably to RF.
▶ EvoFeat outperforms these deep learning methods significantly.
Statistical results of balanced testing accuracy on 90 PMLB and 40 DIGEN datasets.
                ResNet    DCN V2    FT-Transformer  EvoFeat
MLP             18/96/16  9/118/3   10/76/44        4/33/93
ResNet          —         8/99/23   46/73/11        3/32/95
DCN V2          —         —         45/79/6         2/35/93
FT-Transformer  —         —         —               4/34/92
EvoFeat         —         —         —               —
Ablation Studies
Objective: Validate improvements from heterogeneous base learners and
feature importance-guided search.
Components:
▶ Heterogeneous base learners: Compare EvoFeat with different combinations of
base learners.
▶ Feature importance-guided search: Evaluate the effectiveness of feature
importance-guided operators.
Base Learners
Objective: Compare heterogeneous base learners (DT+LR) with single base
learners (DT, LR).
Results:
▶ DT+LR achieves better average performance.
▶ Significant improvements over single learners.
Comparison of balanced testing accuracy across different base learners on 90 PMLB datasets.
    LR                 DT+LR
DT  12(+)/47(∼)/31(-)  0(+)/62(∼)/28(-)
LR  —                  5(+)/70(∼)/15(-)
[Figure: Balanced testing accuracy with different base learners. x-axis: Dataset (wine_quality_white, wine_quality_red, auto, balance_scale, sonar, ionosphere, saheart, prnn_fglass, heart_statlog, heart_h); y-axis: Values; legend: Model (DT, LR, DT+LR).]
Feature Importance-Guided Search
Objective: Evaluate the effectiveness of feature importance-guided operators.
Methods:
▶ Compare random crossover and mutation (Random) with softmax-based
self-competitive crossover and guided mutation (SS+GM).
Results:
▶ Feature importance-guided search achieves better performance.
Comparison of balanced testing accuracy across different selection operators on 40 DIGEN datasets.
       SC+GM             GM                Random
SS+GM  12(+)/26(∼)/2(-)  5(+)/34(∼)/1(-)   12(+)/28(∼)/0(-)
SC+GM  —                 0(+)/30(∼)/10(-)  5(+)/30(∼)/5(-)
GM     —                 —                 5(+)/35(∼)/0(-)
[Figure: Balanced testing accuracy with different selection operators. x-axis: Dataset (ten DIGEN datasets, names not recoverable); y-axis: Values; legend: Model (SS+GM, SC+GM, GM, Random).]
Conclusion
Summary:
▶ EvoFeat outperforms state-of-the-art methods.
▶ Heterogeneous base learners and feature importance-guided search improve
performance.
Future Work:
▶ Investigate modularization techniques for improved interpretability.
▶ Use diversity optimization to enhance ensemble performance.
36 / 36

More Related Content

PDF
P0126557 report
PDF
Parallel Evolutionary Algorithms for Feature Selection in High Dimensional Da...
PDF
Genetic Programming for Evolutionary Feature Construction
PPT
Topic_6
PDF
Xin Yao: "What can evolutionary computation do for you?"
PDF
P0126557 slides
PPTX
Feature Engineering
PPT
PPSN 2004 - 3rd session
P0126557 report
Parallel Evolutionary Algorithms for Feature Selection in High Dimensional Da...
Genetic Programming for Evolutionary Feature Construction
Topic_6
Xin Yao: "What can evolutionary computation do for you?"
P0126557 slides
Feature Engineering
PPSN 2004 - 3rd session

Similar to EvoFeat: Genetic Programming-based Feature Engineering Approach to Tabular Data Classification (20)

PDF
Multivariate decision tree
PPTX
Ensemble of Heterogeneous Flexible Neural Tree for the approximation and feat...
PDF
Disease Classification using ECG Signal Based on PCA Feature along with GA & ...
PPTX
Cubesat challenge considerations deep dive
PDF
777777777777777777777777777777777777777.pdf
PDF
Dynamic Feature Induction: The Last Gist to the State-of-the-Art
PDF
A genetic algorithm approach for predicting ribonucleic acid sequencing data ...
PDF
Genetic Programming-based Evolutionary Feature Construction for Heterogeneous...
PPTX
Modern classification techniques
PDF
Leave one out cross validated Hybrid Model of Genetic Algorithm and Naïve Bay...
PDF
Ajas11 alok
PDF
ENHANCED BREAST CANCER RECOGNITION BASED ON ROTATION FOREST FEATURE SELECTIO...
PDF
Kq2418061809
PDF
Diagnosing cancer with Computational Intelligence
PDF
Art of Feature Engineering for Data Science with Nabeel Sarwar
PDF
Evolutionary (deep) neural network
PDF
How to easily find the optimal solution without exhaustive search using Genet...
PPTX
05 -- Feature Engineering (Text).pptxiuy
PDF
IRJET- Survey of Feature Selection based on Ant Colony
Multivariate decision tree
Ensemble of Heterogeneous Flexible Neural Tree for the approximation and feat...
Disease Classification using ECG Signal Based on PCA Feature along with GA & ...
Cubesat challenge considerations deep dive
777777777777777777777777777777777777777.pdf
Dynamic Feature Induction: The Last Gist to the State-of-the-Art
A genetic algorithm approach for predicting ribonucleic acid sequencing data ...
Genetic Programming-based Evolutionary Feature Construction for Heterogeneous...
Modern classification techniques
Leave one out cross validated Hybrid Model of Genetic Algorithm and Naïve Bay...
Ajas11 alok
ENHANCED BREAST CANCER RECOGNITION BASED ON ROTATION FOREST FEATURE SELECTIO...
Kq2418061809
Diagnosing cancer with Computational Intelligence
Art of Feature Engineering for Data Science with Nabeel Sarwar
Evolutionary (deep) neural network
How to easily find the optimal solution without exhaustive search using Genet...
05 -- Feature Engineering (Text).pptxiuy
IRJET- Survey of Feature Selection based on Ant Colony
Ad

Recently uploaded (20)

PPTX
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
PPTX
Welding lecture in detail for understanding
PDF
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
PPTX
UNIT 4 Total Quality Management .pptx
PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
PDF
Well-logging-methods_new................
PPTX
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
PPTX
Lecture Notes Electrical Wiring System Components
PPTX
Construction Project Organization Group 2.pptx
PPTX
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
PDF
Embodied AI: Ushering in the Next Era of Intelligent Systems
PDF
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
PPTX
Internet of Things (IOT) - A guide to understanding
PPTX
CYBER-CRIMES AND SECURITY A guide to understanding
PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
PPTX
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
PPTX
additive manufacturing of ss316l using mig welding
PPTX
Sustainable Sites - Green Building Construction
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PPT
Project quality management in manufacturing
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
Welding lecture in detail for understanding
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
UNIT 4 Total Quality Management .pptx
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
Well-logging-methods_new................
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
Lecture Notes Electrical Wiring System Components
Construction Project Organization Group 2.pptx
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
Embodied AI: Ushering in the Next Era of Intelligent Systems
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
Internet of Things (IOT) - A guide to understanding
CYBER-CRIMES AND SECURITY A guide to understanding
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
additive manufacturing of ss316l using mig welding
Sustainable Sites - Green Building Construction
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
Project quality management in manufacturing
Ad

EvoFeat: Genetic Programming-based Feature Engineering Approach to Tabular Data Classification

  • 1. EvoFeat: Genetic Programming based Fea- ture Engineering Approach to Tabular Data Classification Hengzhe Zhang, Qi Chen, Bing Xue, Yan Wang, Aimin Zhou, Mengjie Zhang Victoria University of Wellington 06/06/2023
  • 2. Table of Contents 1 Introduction 2 Related Work 3 Preliminaries 4 The Proposed Algorithm 5 Experiments 1 36
  • 4. Introduction Tabular Data Learning: Widely used in recommendation systems 1 and advertising 2. Goal: Capture the relationship between explanatory variables {x1, . . . , xm} and a response variable y. Dataset Structure: {({x1 1, . . . , x1 m}, y1), . . . , ({xn 1 , . . . , xn m}, yn)}, where n is the number of instances. Challenge Linear models assume linear relationships. Decision trees assume axis-parallel decision boundaries. Real-world data often violates these assumptions. 1 Ruoxi Wang et al., Proceedings of the Web Conference 2021 (2021) 2 Haizhi Yang et al., Proceedings of the 30th ACM International Conference on Information & Knowledge Management (2021) 2 36
  • 5. Feature Engineering Techniques Manual Design: Based on domain knowledge. Kernel Methods: Use kernel tricks to transform data into higher dimensions. Deep Learning: Leverages neural networks to learn features automatically. 1. Limitations Manual Design: Labor-intensive. Kernel Methods: Hard to integrate with tree-based methods. Deep Learning: Requires large datasets, effectiveness debatable for small, heterogeneous datasets 2. 1 Jianxun Lian et al., Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2018) 2 Yury Gorishniy et al., Advances in Neural Information Processing Systems (2021) 3 36
  • 6. Motivation Objective: Feature construction using genetic programming (GP). GP Advantages: Gradient-free, interpretable, and flexible. Hypothesis: GP-based feature engineering can outperform both traditional and deep learning methods on tabular data. Our Approach: EvoFeat Constructs nonlinear features with GP. Enhances ensemble learning models. Uses cross-validation and feature importance for evaluation. 4 36
  • 8. Related Work Beam Search Methods: ▶ Greedy, lacks strong mechanisms to prevent overfitting. Deep Learning Methods: ▶ Effectiveness in comparison to tree-based methods is still debated 1 . 1 Yury Gorishniy et al., Advances in Neural Information Processing Systems (2021) 5 36
  • 9. Beam Search Methods Iterative Feature Generation: ▶ Starts with low-order features. ▶ Generates higher-order features based on important low-order features 1 . Evaluation: ▶ Uses logistic regression accuracy or XGBoost feature importance. ▶ Sole reliance on training loss can lead to overfitting. Key Limitation The lack of effective mechanisms to prevent overfitting restricts feature construction capabilities. 1 Yuanfei Luo et al., Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2019) 6 36
  • 10. Deep Learning Methods High-Order Feature Construction: ▶ Cross Network in DCN. ▶ Field-wise feature cross in xDeepFM 1 . ▶ Attention mechanism in AutoInt 2 . Effectiveness Effectiveness over fully connected NN is debatable 3. Lack of comprehensive studies comparing with XGBoost 4. 1 Jianxun Lian et al., Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2018) 2 Weiping Song et al., Proceedings of the 28th ACM International Conference on Information and Knowledge Management (2019) 3 Ruoxi Wang et al., Proceedings of the Web Conference 2021 (2021) 4 Yury Gorishniy et al., Advances in Neural Information Processing Systems (2021) 7 36
  • 11. Evolutionary Feature Construction Single Learner: ▶ Traditionally, more focus on simple learners like single decision trees 1 . ▶ Gap in enhancing state-of-the-art algorithms. Ensemble-based Feature Construction: ▶ Promising results in regression 2 . ▶ Requires adaptation for tabular classification. Notes Adapting evolutionary feature construction techniques for classification involves: Adapting loss functions. Using logistic regression models as base learners. 1 Binh Tran, Bing Xue, and Mengjie Zhang, Pattern Recognition (2019) 2 Hengzhe Zhang, Aimin Zhou, and Hu Zhang, IEEE Transactions on Evolutionary Computation (2021) 8 36
  • 13. Feature Engineering Process Feature Initialization: ▶ Construct initial features based on domain knowledge or randomly. Feature Evaluation: ▶ Evaluate features using cross-validation and calculate feature importance. Feature Improvement: ▶ Discard ineffective features and replace them with new ones derived from important features. Feature engineering workflow. 9 36
  • 14. Feature Evaluation and Improvement Cross-Validation: ▶ Evaluates generalization performance. Feature Importance: ▶ Identifies useful features. ▶ Risky to rely solely on feature importance. Key Insight Constructing multiple sets of features and evaluating them using cross-validation can provide better insights into their generalization capabilities. 10 36
  • 16. Feature Representation Symbolic Trees: ▶ Each individual has k GP trees representing k new features. Tree Structure: ▶ Non-leaf nodes: Functions (e.g., +, −, ∗, log, sin). ▶ Leaf nodes: Original Features. Base Learners: ▶ Decision trees or linear regression models. 11 36
  • 17. Algorithm Framework Initialization Randomly initialize N individuals, each with k symbolic trees. Evaluation Evaluate individuals using cross-validation loss. Calculate feature importance for each feature. 12 36
  • 18. Algorithm Framework Selection Use lexicase selection 1 to select parent individuals based on cross-validation losses. Generation Generate new individuals using self-competitive crossover and guided mutation 2. Archive Update Update archive with top-performing models using reduce-error pruning 3. 1 William La Cava et al., Evolutionary Computation (2019) 2 Hengzhe Zhang et al., IEEE Transactions on Evolutionary Computation (2023) 3 Rich Caruana et al., Proceedings of the Twenty-First International Conference on Machine Learning (2004) 13 36
  • 19. Feature Initialization Initialization Strategy: ▶ Ramped-half-and-half for symbolic trees. ▶ Half full trees, half random depth. Base Learner Assignment: ▶ Randomly assign decision tree or linear regression model. 14 36
  • 20. Feature Selection Three selection operators in EvoFeat: ▶ Base Learner Selection ▶ Individual Selection: Lexicase Selection ▶ Feature Selection: Softmax Selection 15 36
  • 21. Base Learner Selection Divide population into two subgroups (decision trees, logistic regression). Random mating probability (rmp = 0.5): ▶ 50%: Select parents from different subgroups. ▶ 50%: Select parents from the same subgroup. Inspired by multitask GP 1 1 Fangfang Zhang et al., IEEE Transactions on Cybernetics (2021) 16 36
  • 22. Individual Selection: Lexicase Selection Selects individuals based on a vector of cross-validation losses, one for each instance. Constructs filters based on each loss value 1: τj = min i Li j + ϵj, (1) Where: ▶ τj is the threshold, ▶ Li j is the loss of the i-th individual on the j-th instance, ▶ ϵj is the median absolute deviation. 1 William La Cava et al., Evolutionary Computation (2019) 17 36
  • 23. Softmax Selection Select features based on importance values {θ1, . . . , θk}. Uses softmax function: P(θi) = eθi/T Pk j=1 eθj/T , (2) Good features are sampled by P(θi), bad features by P(−θi). 18 36
  • 24. Offspring Generation: Self-Competitive Crossover Self-Competitive Crossover: ▶ Transfers beneficial material from good features to bad features. ▶ Biased crossover, only modifies bad features, preserving good features 1 . ▶ Ensures top-performing features are preserved. 1 Su Nguyen et al., IEEE Transactions on Cybernetics (2021) 19 36
  • 25. Feature Importance Decision Tree: ▶ Calculated by the total reduction of Gini impurity contributed by each feature ϕ. Logistic Regression: ▶ Calculated by the absolute value of the model coefficients. ▶ Features are standardized to ensure equal influence on the coefficients. 20 36
  • 26. Offspring Generation: Guided Mutation Guided Mutation: ▶ Replaces a subtree with a randomly generated subtree. ▶ Uses a guided probability vector for terminal variable selection. ▶ The probability vector corresponds to the terminal usage of archived individuals. 21 36
  • 27. Feature Evaluation Cross-Validation: ▶ Partition the training set into five folds. ▶ Train on four folds, validate on one fold. Loss Function: ▶ Cross entropy: X c∈C pc ∗ log(qc), (3) ▶ Where pc is the true probability, qc is the predicted probability. 22 36
  • 29. Experiments Objective: Compare EvoFeat with popular machine learning and deep learning methods. Datasets: 130 datasets from DIGEN and PMLB benchmarks. ▶ DIGEN 1 : A total of 40 diverse synthetic datasets generated using genetic programming. ▶ PMLB 2 : Collection of real-world datasets from OpenML. Focus on classification tasks with more than 200 instances. A total of 90 datasets selected where the product of the number of instances and the number of features is less than 105 due to memory constraints. 1 https://github.com/EpistasisLab/digen 2 https://github.com/EpistasisLab/pmlb 23 36
  • 30. Experimental Settings Evaluation Protocol: ▶ 80% training, 20% testing. ▶ 5-fold cross-validation on the training set for parameter tuning. ▶ Repeat experiments with 30 random seeds. Hyperparameter Tuning: ▶ Use Heteroscedastic Evolutionary Bayesian Optimization (HEBO) 1 for tuning baseline algorithms. 1 Alexander I Cowen-Rivers et al., Journal of Artificial Intelligence Research (2022) 24 36
  • 31. Experimental Settings The detailed parameter space is shown in the paper. Below is an example parameter space for tuning. Parameter Space of FTTransformer Hyperparameter Range Attention Dropout Uniform[0,0.5] Residual Dropout Uniform[0,0.2] FFN Dropout Uniform[0,0.5] FFN Factor Uniform[2 3 ,8 3 ] Token Dimension UniformInt[64,512] Layers UniformInt[1,4] Learning Rate UniformLog[1e-4,1e-1] Weight Decay UniformLog[1e-6,1e-3] 25 36
  • 32. Baseline Algorithms Machine Learning: ▶ XGBoost 1 , LightGBM 2 , Random Forest (RF), Decision Tree (DT), Logistic Regression (LR), K-Nearest Neighbors (KNN). Deep Learning: ▶ Multilayer Perceptron (MLP), ResNet, DCN V2 3 , FT-Transformer 4 . 1 Tianqi Chen and Carlos Guestrin, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016) 2 Guolin Ke et al., Advances in Neural Information Processing Systems (2017) 3 Ruoxi Wang et al., Proceedings of the Web Conference 2021 (2021) 4 Yury Gorishniy et al., Advances in Neural Information Processing Systems (2021) 26 36
  • 33. Large-Scale Experiments Comparison: ▶ Evaluate EvoFeat against traditional and deep learning methods. Results: ▶ EvoFeat outperforms state-of-the-art methods in average accuracy. ▶ Demonstrates significant improvements in predictive performance. / 5 . 1 1 ' 7 ' 1 9 0 / 3 5 H V 1 H W 5 ) ) 7 7 U D Q V I R U P H U / L J K W * % 0 ; * % R R V W ( Y R ) H D W $OJRULWKP %DODQFHG$FFXUDF (a) Balanced testing accuracy. ; * % R R V W / L J K W * % 0 ) 7 7 U D Q V I R U P H U 5 ) 5 H V 1 H W 0 / 3 ' 1 9 ' 7 . 1 1 / 5 $OJRULWKP 3HUFHQWDJH,PSURYHPHQW (b) Improvement in accuracy. 27 36
  • 34. Large-Scale Experiments Training Time: ▶ EvoFeat has comparable training time to a fine-tuned LightGBM. ▶ EvoFeat is much faster than a fine-tuned FT-Transformer. / 5 ' 7 5 ) 0 / 3 ' 1 9 / L J K W * % 0 ( Y R ) H D W . 1 1 5 H V 1 H W ) 7 7 U D Q V I R U P H U ; * % R R V W $OJRULWKP 7LPHV
  • 36. Comparison with Traditional Methods Baseline: XGBoost, LightGBM, RF, DT, LR, KNN. Results: ▶ EvoFeat achieves the best accuracy. ▶ Significant improvements over XGBoost and LightGBM.
    Statistical results of balanced testing accuracy on 90 PMLB and 40 DIGEN datasets.
                 XGBoost     LightGBM    RF          LR          KNN         EvoFeat
    DT           0/48/82     2/47/81     0/43/87     60/36/34    60/27/43    0/34/96
    XGBoost      —           13/107/10   43/79/8     72/50/8     107/16/7    4/67/59
    LightGBM     —           —           45/75/10    74/42/14    107/15/8    5/72/53
    RF           —           —           —           73/47/10    102/20/8    7/62/61
    LR           —           —           —           —           54/13/63    7/44/79
    KNN          —           —           —           —           —           3/15/112
  • 37. Comparison with Deep Learning Methods Baseline: MLP, ResNet, DCN V2, FT-Transformer. Results: ▶ Deep learning methods perform comparably to RF. ▶ EvoFeat outperforms these deep learning methods significantly.
    Statistical results of balanced testing accuracy on 90 PMLB and 40 DIGEN datasets.
                      ResNet      DCN V2      FT-Transformer    EvoFeat
    MLP               18/96/16    9/118/3     10/76/44          4/33/93
    ResNet            —           8/99/23     46/73/11          3/32/95
    DCN V2            —           —           45/79/6           2/35/93
    FT-Transformer    —           —           —                 4/34/92
    EvoFeat           —           —           —                 —
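Each cell in the two comparison tables sums to 130 (90 PMLB plus 40 DIGEN datasets). Reading the three numbers as wins/ties/losses of the row method against the column method is an assumption, consistent with the (+)/(∼)/(-) notation in the ablation tables. The slides do not name the statistical test, so the sketch below assumes a Wilcoxon signed-rank test at the 0.05 level over paired runs; `win_tie_loss` is a hypothetical helper built on NumPy and SciPy.

```python
import numpy as np
from scipy.stats import wilcoxon


def win_tie_loss(scores_a, scores_b, alpha=0.05):
    """Count datasets where A is significantly better than, tied with, or worse than B.

    scores_a, scores_b: arrays of shape (n_datasets, n_runs), paired by run.
    """
    wins = ties = losses = 0
    for a, b in zip(np.asarray(scores_a), np.asarray(scores_b)):
        if np.allclose(a, b):
            # Identical scores: the test is undefined, so count a tie.
            ties += 1
            continue
        p_value = wilcoxon(a, b).pvalue
        if p_value >= alpha:
            ties += 1  # no significant difference
        elif np.mean(a) > np.mean(b):
            wins += 1
        else:
            losses += 1
    return wins, ties, losses


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy scores: 5 datasets, 30 runs each, method A slightly better on average.
    a = rng.normal(0.82, 0.02, size=(5, 30))
    b = rng.normal(0.80, 0.02, size=(5, 30))
    print(win_tie_loss(a, b))
```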
  • 38. Ablation Studies Objective: Validate improvements from heterogeneous base learners and feature importance-guided search. Components: ▶ Heterogeneous base learners: Compare EvoFeat with different combinations of base learners. ▶ Feature importance-guided search: Evaluate the effectiveness of feature importance-guided operators.
  • 39. Base Learners Objective: Compare heterogeneous base learners (DT+LR) with single base learners (DT, LR). Results: ▶ DT+LR achieves better average performance. ▶ Significant improvements over single learners.
    Comparison of balanced testing accuracy across different base learners on 90 PMLB datasets.
                 LR                   DT+LR
    DT           12(+)/47(∼)/31(-)    0(+)/62(∼)/28(-)
    LR           —                    5(+)/70(∼)/15(-)
  • 40. Base Learners [Figure: balanced testing accuracy with different base learners (Model: DT, LR, DT+LR); x-axis Dataset (wine_quality_white, wine_quality_red, auto, balance_scale, sonar, ionosphere, saheart, prnn_fglass, heart_statlog, heart_h), y-axis Values.]
  • 41. Feature Importance-Guided Search Objective: Evaluate the effectiveness of feature importance-guided operators. Methods: ▶ Compare random crossover and mutation (Random) with softmax-based self-competitive crossover and guided mutation (SS+GM). Results: ▶ Feature importance-guided search achieves better performance.
    Comparison of balanced testing accuracy across different selection operators on 40 DIGEN datasets.
                 SC+GM                GM                   Random
    SS+GM        12(+)/26(∼)/2(-)     5(+)/34(∼)/1(-)      12(+)/28(∼)/0(-)
    SC+GM        —                    0(+)/30(∼)/10(-)     5(+)/30(∼)/5(-)
    GM           —                    —                    5(+)/35(∼)/0(-)
  • 42. Feature Importance-Guided Search [Figure: balanced testing accuracy with different selection operators (Model: SS+GM, SC+GM, GM, Random); x-axis Dataset (ten DIGEN datasets, labels not recoverable), y-axis Values.]
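The general idea behind importance-guided operators can be illustrated with softmax sampling: features with higher importance are picked more often, while low-importance features keep a nonzero chance. The sketch below shows only this generic sampling step under assumed names (`softmax`, `pick_feature`, a `temperature` parameter); it is not the EvoFeat crossover or mutation operator, whose details are given in the paper.

```python
import math
import random


def softmax(values, temperature=1.0):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(values)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp((v - peak) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def pick_feature(importances, rng, temperature=1.0):
    """Sample a feature index with probability softmax(importance)."""
    probs = softmax(importances, temperature)
    return rng.choices(range(len(importances)), weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    importances = [0.1, 0.5, 2.0]  # illustrative values, not from the paper
    counts = [0, 0, 0]
    for _ in range(1000):
        counts[pick_feature(importances, rng)] += 1
    print(counts)
```

A lower temperature concentrates the choice on the most important features; a higher one approaches uniform random selection.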
  • 43. Conclusion Summary: ▶ EvoFeat outperforms state-of-the-art methods. ▶ Heterogeneous base learners and feature importance-guided search improve performance. Future Work: ▶ Investigate modularization techniques for improved interpretability. ▶ Use diversity optimization to enhance ensemble performance.
  • 44. Thanks for listening! Email: Hengzhe.zhang@ecs.vuw.ac.nz GitHub Project: https://github.com/hengzhe-zhang/EvolutionaryForest/