Leakage in Meta Modeling And Its Connection to HCC Target-Encoding - Mathias Müller, Data Scientist, H2O.ai
Mathias Müller
faron@h2o.ai
kaggle.com/mmueller - github.com/Far0n
Leakage in Meta Modeling &
Its Connection to HCC Target-Encoding
Background
• Born & raised in Berlin
• Diplom in Computer Science from Humboldt University of Berlin
• Joined H2O two months ago
• Data Scientist
• Development of Driverless AI
Leakage
“Data Leakage is the creation of unexpected additional
information in the training data, allowing a model or machine
learning algorithm to make unrealistically good predictions.”
kaggle.com/wiki/leakage
• Many different sources
  • ID-Leaks
  • Leaking future information into the past
  • Validating models on already seen data
  • Leaking target information into feature matrices
  • Feedback loops / adaptive data analysis
  • …
• The damage caused varies from case to case
Meta Modeling / Stacking
Leakage in Meta Modeling
• Suppose a 3-fold split of our training data into (A,B,C)
• Create out-of-fold predictions (A2,B2,C2):
  • train((A,B)) followed by predict(C) to get C2
  • train((A,C)) followed by predict(B) to get B2
  • train((B,C)) followed by predict(A) to get A2
Base Level    1st Meta Level    Leaked Target Information
AB -> C2      A2B2 -> C3        (BC)(AC) -> C3
AC -> B2      A2C2 -> B3        (BC)(AB) -> B3
BC -> A2      B2C2 -> A3        (AC)(AB) -> A3
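
The "Leaked Target Information" column can be derived mechanically. The following is a minimal sketch, not the speaker's code: it only tracks which folds' targets were seen by the base models behind each out-of-fold prediction. The function names are hypothetical.

```python
FOLDS = ("A", "B", "C")

def base_training_folds(held_out):
    """Folds whose targets a base model sees when it predicts `held_out`."""
    return frozenset(f for f in FOLDS if f != held_out)

def leaked_into_meta(held_out):
    """Trace the targets behind the meta features used to predict `held_out`.

    The meta model that predicts e.g. C3 is trained on A2 and B2. Each of
    those was produced by a base model trained on all folds except its own.
    """
    meta_train = [f for f in FOLDS if f != held_out]
    sources = {f + "2": base_training_folds(f) for f in meta_train}
    # Leakage: the held-out fold's targets already shaped the meta features.
    leaked = held_out in frozenset().union(*sources.values())
    return sources, leaked

for fold in FOLDS:
    sources, leaked = leaked_into_meta(fold)
    readable = {name: "".join(sorted(folds)) for name, folds in sources.items()}
    print(fold + "3", readable, "leak:", leaked)
```

For C3 the trace gives A2 from (BC) and B2 from (AC), so the targets of C have already influenced the features the meta model is trained on, even though C is supposed to be unseen.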
HCC Target Encoding
• In general, tree-based models such as XGBoost, LightGBM, and random forests (RF) struggle with non-ordinal, high-cardinality categorical (HCC) features
• The order of the mapped HCC values determines the number of splits required to get “useful” data partitions
• Idea: Replace HCC values with their likelihoods to get a “good order”
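
To illustrate why the order matters, here is a small sketch with made-up data (the city names and the `clicked` target are hypothetical, not from the talk). It maps each category value to its mean target, which is the likelihood for a binary target.

```python
from collections import defaultdict

def likelihood_map(categories, targets):
    """Map each category value to the mean of its targets."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, target in zip(categories, targets):
        sums[value] += target
        counts[value] += 1
    return {value: sums[value] / counts[value] for value in counts}

# Hypothetical toy data: a categorical feature and a binary target.
cities = ["berlin", "paris", "rome", "berlin", "oslo", "paris", "oslo", "rome"]
clicked = [1, 0, 1, 1, 0, 0, 0, 1]

mapping = likelihood_map(cities, clicked)
# Sorting by likelihood puts values with similar targets next to each other.
ordered = sorted(mapping, key=mapping.get)
print(mapping)
print(ordered)
```

With alphabetical integer codes (berlin, oslo, paris, rome) the targets alternate 1, 0, 0, 1, so a tree needs two splits to separate them. After likelihood encoding a single threshold at 0.5 is enough. Note that this naive version lets every row contribute its own target to its encoding.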
K-Fold Target Encoding - Example
• We want to replace the categorical values blue and red with their likelihoods in a k-fold cross-validated fashion:

X     y  Fold  Split    Likelihood                 X_lhood_cv
blue  1  A     BC -> A  p(y | X = blue) = 0        0
red   1  A     BC -> A  p(y | X = red) = 1.0       1
blue  0  B     AC -> B  p(y | X = blue) = 0.5      0.5
blue  0  B     AC -> B  p(y | X = blue) = 0.5      0.5
blue  0  C     AB -> C  p(y | X = blue) = 0.333    0.333
red   1  C     AB -> C  p(y | X = red) = 1.0       1
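
The X_lhood_cv values of this example can be reproduced with a few lines of plain Python. This is a sketch under one assumption that the slides do not address: a value that never occurs outside its own fold is encoded as `None`.

```python
# The six rows of the example: (X, y, fold).
rows = [
    ("blue", 1, "A"),
    ("red", 1, "A"),
    ("blue", 0, "B"),
    ("blue", 0, "B"),
    ("blue", 0, "C"),
    ("red", 1, "C"),
]

def kfold_target_encode(rows):
    """Encode each row with the mean target of its value in the other folds."""
    encoded = []
    for value, _, fold in rows:
        # Targets of rows with the same value that are outside this row's fold.
        others = [y for (v, y, f) in rows if f != fold and v == value]
        # Assumption: None when the value is unseen in the other folds.
        encoded.append(sum(others) / len(others) if others else None)
    return encoded

x_lhood_cv = kfold_target_encode(rows)
print([round(v, 3) for v in x_lhood_cv])  # [0.0, 1.0, 0.5, 0.5, 0.333, 1.0]
```

Each row's own target never enters its own encoding, but the targets of every other fold do.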
Recap: Leakage in Meta Modeling
Base Level    1st Meta Level    Leaked Target Information
AB -> C2      A2B2 -> C3        (BC)(AC) -> C3
AC -> B2      A2C2 -> B3        (BC)(AB) -> B3
BC -> A2      B2C2 -> A3        (AC)(AB) -> A3
K-Fold Target Encoding - Example
• The X_lhood_cv values are essentially “out-of-fold” predictions of a maximum likelihood estimator
• Using X_lhood_cv as a feature is much the same procedure as stacking
• It has the same leakage issue, but fails more often than stacking strong models because the estimator is not regularized
Counter-Measures
• Using a fixed holdout set to calculate likelihoods / to generate out-of-fold predictions
  • Loss of training data at later stages
• Using a 2-fold scheme with a fixed seed
  • Not ideal regarding the bias-variance tradeoff
• Adding noise to likelihoods / out-of-fold predictions
  • Hard to get the noise level right (heavily dataset dependent)
• Avoiding target leakage by nested cross-validation
  • Order of magnitude higher complexity: O(k) => O(k_outer * k_inner)
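
The noise counter-measure can be sketched as follows. The slides do not say which kind of noise to use. Gaussian noise, the clipping to [0, 1], and the default `noise_level` are assumptions made for this illustration, and `add_noise` is a hypothetical helper.

```python
import random

def add_noise(likelihoods, noise_level=0.05, seed=0):
    """Perturb likelihood encodings with zero-mean Gaussian noise.

    Assumptions: Gaussian noise, and results clipped to [0, 1] so they
    still read as probabilities.
    """
    rng = random.Random(seed)  # fixed seed keeps the encoding reproducible
    noisy = []
    for p in likelihoods:
        value = p + rng.gauss(0.0, noise_level)
        noisy.append(min(1.0, max(0.0, value)))
    return noisy

encoded = [0.0, 1.0, 0.5, 0.5, 1 / 3, 1.0]
print(add_noise(encoded, noise_level=0.05))
```

Choosing `noise_level` is the hard part: too little leaves the leak in place, too much destroys the signal, and the right value depends heavily on the dataset.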
Nested Cross Validation
Base Level      1st Meta Level    No Leaked Target Information
AB -> C2        A2B2 -> C3        (B)(A) -> C3
  A -> B2
  B -> A2
AC -> B2        A2C2 -> B3        (C)(A) -> B3
  A -> C2
  C -> A2
BC -> A2        B2C2 -> A3        (C)(B) -> A3
  B -> C2
  C -> B2
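
Applied to target encoding, the nested scheme could look like the sketch below. It reuses the six blue/red rows from the K-fold example. It is an illustration, not the speaker's implementation: the fallback to the mean target of the outer training folds for unseen values is an assumption, and the function names are hypothetical.

```python
FOLDS = ("A", "B", "C")
rows = [
    ("blue", 1, "A"), ("red", 1, "A"),
    ("blue", 0, "B"), ("blue", 0, "B"),
    ("blue", 0, "C"), ("red", 1, "C"),
]

def mean_target(rows, value, folds, prior):
    """Mean target of `value` within `folds`, or `prior` if it is unseen."""
    ys = [y for (v, y, f) in rows if f in folds and v == value]
    return sum(ys) / len(ys) if ys else prior

def nested_encode(rows, held_out):
    """Encodings for the round in which fold `held_out` is validated."""
    train_folds = [f for f in FOLDS if f != held_out]
    train_ys = [y for (_, y, f) in rows if f in train_folds]
    # Assumption: unseen values fall back to the outer training mean.
    prior = sum(train_ys) / len(train_ys)
    encoded = []
    for value, _, fold in rows:
        if fold == held_out:
            source = set(train_folds)           # e.g. AB -> C
        else:
            source = set(train_folds) - {fold}  # inner folds, e.g. B -> A
        encoded.append(mean_target(rows, value, source, prior))
    return encoded

print(nested_encode(rows, "C"))
```

When C is held out, the encodings of the rows in A and B are computed from B and A only, so changing the targets in C cannot change any training feature. The price is one full encoding pass per outer fold.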
Thank you for your attention!
Any Questions?
