Leakage in Meta Modeling And Its Connection to HCC Target-Encoding - Mathias Müller, Data Scientist, H2O.ai

Mathias Müller
faron@h2o.ai
kaggle.com/mmueller - github.com/Far0n
Leakage in Meta Modeling &
Its Connection to HCC Target-Encoding

Background
• Born & raised in Berlin
• Diplom in Computer Science from Humboldt University of Berlin
• Joined H2O two months ago
• Data Scientist
• Development of Driverless AI

Leakage
“Data Leakage is the creation of unexpected additional
information in the training data, allowing a model or machine
learning algorithm to make unrealistically good predictions.”
kaggle.com/wiki/leakage
• Many different sources
  • ID leaks
  • Leaking future information into the past
  • Validating models on already seen data
  • Leaking target information into feature matrices
  • Feedback loops / adaptive data analysis
  • …
• The damage caused varies from case to case

Leakage in Meta Modeling
• Suppose a 3-fold split of our training data into (A,B,C)
• Creation of out-of-fold predictions (A2,B2,C2)
  • train((A,B)) followed by predict(C) to get C2
  • train((A,C)) followed by predict(B) to get B2
  • train((B,C)) followed by predict(A) to get A2
Base Level    1st Meta Level    Leaked Target Information
AB -> C2      A2B2 -> C3        (BC)(AC) -> C3
AC -> B2      A2C2 -> B3        (BC)(AB) -> B3
BC -> A2      B2C2 -> A3        (AC)(AB) -> A3
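The third column of the table can be reproduced by tracking which folds' targets went into each out-of-fold prediction. The following is only a minimal bookkeeping sketch: the names `base_sources` and `meta_training_sources` are hypothetical helpers introduced here, and no actual model is trained.

```python
folds = ["A", "B", "C"]

# Base level: the out-of-fold prediction for a fold comes from a model trained
# on the other folds, so it carries their target information (BC -> A2, ...).
base_sources = {
    held_out: "".join(f for f in folds if f != held_out) for held_out in folds
}

def meta_training_sources(validation_fold):
    """Folds whose targets influenced the meta-level training features
    when `validation_fold` is held out at the 1st meta level."""
    train_folds = [f for f in folds if f != validation_fold]
    return {f: base_sources[f] for f in train_folds}

leaks = {}
for fold in folds:
    sources = meta_training_sources(fold)
    # Leak: the held-out fold's targets are already baked into the features
    # the meta model is trained on.
    leaks[fold] = any(fold in s for s in sources.values())
    print(fold, sources, "leaked" if leaks[fold] else "clean")
```

For the held-out fold C, the meta model trains on A2 and B2, which were produced by models fitted on (B,C) and (A,C). Both have already seen C's targets.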

HCC Target Encoding
• In general, tree-based models like XGBoost, LightGBM, RF, etc.
struggle with (non-ordinal) high-cardinality categorical (HCC)
features
• The order of the mapped HCC values determines the number
of splits required to get “useful” data partitions
• Idea: Replace HCC values by their likelihoods to get a “good
order”

K-Fold Target Encoding - Example
• We want to replace the categorical values blue and red by their
likelihoods in a k-fold cross-validated fashion:

X     y  Fold  AB -> C                  AC -> B                  BC -> A                  X_lhood_cv
blue  1  A                                                       p(y | X = blue) = 0      0
red   1  A                                                       p(y | X = red) = 1.0     1
blue  0  B                              p(y | X = blue) = 0.5                             0.5
blue  0  B                              p(y | X = blue) = 0.5                             0.5
blue  0  C     p(y | X = blue) = 0.333                                                    0.333
red   1  C     p(y | X = red) = 1.0                                                       1
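The k-fold encoding of this example can be sketched in plain Python. The function name `kfold_target_encode` is hypothetical. The fallback for categories unseen outside a row's fold (the mean of all out-of-fold targets) is an assumption added here and is not needed for this data.

```python
X = ["blue", "red", "blue", "blue", "blue", "red"]
y = [1, 1, 0, 0, 0, 1]
fold = ["A", "A", "B", "B", "C", "C"]

def kfold_target_encode(X, y, fold):
    """Replace each category by the mean target of the same category,
    computed only on rows that are NOT in the row's own fold."""
    encoded = []
    for x_i, f_i in zip(X, fold):
        same_category = [t for x, t, f in zip(X, y, fold) if x == x_i and f != f_i]
        # Assumed fallback: category not present outside the fold.
        out_of_fold = [t for t, f in zip(y, fold) if f != f_i]
        pool = same_category if same_category else out_of_fold
        encoded.append(sum(pool) / len(pool))
    return encoded

X_lhood_cv = kfold_target_encode(X, y, fold)
print([round(v, 3) for v in X_lhood_cv])  # [0.0, 1.0, 0.5, 0.5, 0.333, 1.0]
```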

Recap: Leakage in Meta Modeling
Base Level    1st Meta Level    Leaked Target Information
AB -> C2      A2B2 -> C3        (BC)(AC) -> C3
AC -> B2      A2C2 -> B3        (BC)(AB) -> B3
BC -> A2      B2C2 -> A3        (AC)(AB) -> A3

K-Fold Target Encoding - Example
X     y  Fold  AB -> C                  AC -> B                  BC -> A                  X_lhood_cv
blue  1  A                                                       p(y | X = blue) = 0      0
red   1  A                                                       p(y | X = red) = 1.0     1
blue  0  B                              p(y | X = blue) = 0.5                             0.5
blue  0  B                              p(y | X = blue) = 0.5                             0.5
blue  0  C     p(y | X = blue) = 0.333                                                    0.333
red   1  C     p(y | X = red) = 1.0                                                       1
• X_lhood_cv values are basically “out-of-fold” predictions of
a maximum likelihood estimator
• Using X_lhood_cv as a feature is pretty much the same
procedure as stacking
• Same leakage issue, but it fails more often than stacking
strong models because there is no regularization
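One way to see the leakage on the example data: suppose the next-level model is validated on fold C. If only C's targets are changed, the encoded features of folds A and B change with them, so the training features already depend on the validation targets. This is an illustrative sketch. The helper `encode` and the altered targets `y_changed` are made up for the demonstration.

```python
X = ["blue", "red", "blue", "blue", "blue", "red"]
y = [1, 1, 0, 0, 0, 1]
fold = ["A", "A", "B", "B", "C", "C"]

def encode(X, y, fold):
    """K-fold target encoding: mean target of the same category outside the row's fold."""
    out = []
    for x_i, f_i in zip(X, fold):
        pool = [t for x, t, f in zip(X, y, fold) if x == x_i and f != f_i]
        out.append(sum(pool) / len(pool))
    return out

baseline = encode(X, y, fold)

# Change ONLY the targets of fold C (rows 4 and 5), the would-be validation fold.
y_changed = y[:4] + [1, 0]
changed = encode(X, y_changed, fold)

# Features of the training folds A and B (rows 0-3) react to C's targets ...
print(baseline[:4], "->", changed[:4])
# ... while C's own features stay the same, since they never used C's targets.
print(baseline[4:], "->", changed[4:])
```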

Counter-Measures
• Using a fixed holdout set to calculate likelihoods / to generate
out-of-fold predictions
  • Loss of training data at later stages
• Using a 2-fold scheme with a fixed seed
  • Not ideal regarding the bias-variance tradeoff
• Adding noise to likelihoods / out-of-fold predictions
  • Hard to get the noise level right (heavily dataset-dependent)
• Avoiding target leakage by nested cross-validation
  • Order of magnitude higher complexity: O(k) => O(k_outer * k_inner)
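As a rough illustration of the noise counter-measure, the sketch below perturbs encoded likelihoods with additive uniform noise. The slides do not specify a noise distribution or level, so the uniform distribution, `noise_level=0.05`, and the helper name `add_noise` are all assumptions.

```python
import random

def add_noise(values, noise_level=0.05, seed=0):
    """Add uniform noise in [-noise_level, +noise_level] to each value.
    Distribution and level are assumptions; the right level is dataset dependent."""
    rng = random.Random(seed)  # fixed seed keeps the features reproducible
    return [v + rng.uniform(-noise_level, noise_level) for v in values]

X_lhood_cv = [0.0, 1.0, 0.5, 0.5, 1 / 3, 1.0]
noisy = add_noise(X_lhood_cv, noise_level=0.05, seed=42)
print([round(v, 3) for v in noisy])
```

Note that the perturbed values can fall slightly outside [0, 1]. Whether that matters depends on the downstream model.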

Nested Cross Validation
Base Level                     1st Meta Level    No Leaked Target Information
AB -> C2, A -> B2, B -> A2     A2B2 -> C3        (B)(A) -> C3
AC -> B2, A -> C2, C -> A2     A2C2 -> B3        (C)(A) -> B3
BC -> A2, B -> C2, C -> B2     B2C2 -> A3        (C)(B) -> A3
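The nested scheme can be written down as a training plan per held-out fold: one outer fit that predicts the held-out fold, plus inner fits that produce the meta-level training features without ever touching the held-out fold. This is a bookkeeping sketch only. `nested_plan` is a hypothetical helper and no model is trained.

```python
folds = ["A", "B", "C"]

def nested_plan(validation_fold):
    """Return (train_folds, predicted_fold) pairs for one outer split."""
    outer_train = [f for f in folds if f != validation_fold]
    # Outer fit, e.g. AB -> C2
    plan = [("".join(outer_train), validation_fold)]
    # Inner fits use only the outer training folds, e.g. B -> A2 and A -> B2
    for held_out in outer_train:
        inner_train = "".join(f for f in outer_train if f != held_out)
        plan.append((inner_train, held_out))
    return plan

total_fits = 0
leaked = {}
for fold in folds:
    plan = nested_plan(fold)
    total_fits += len(plan)
    # The meta-level training features come from the inner fits only.
    leaked[fold] = any(fold in train for train, _ in plan[1:])
    print(fold, plan, "leaked" if leaked[fold] else "clean")

print("model fits:", total_fits)
```

For three folds this plan contains nine base-level fits instead of three, which is the extra cost the counter-measures slide refers to.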

Thank you for your attention!
Any Questions?
