Dealing with imbalanced data in RTB
Yuya Kanemoto
Table of contents
1. Introduction
2. Methods
2.1 Re-sampling
2.2 Cost-sensitive learning
3. Tools in practice
4. References
Introduction
A classifier predicts the class labels of new data after training. In real-world data sets the proportion of class labels in the training data is often imbalanced, and imbalanced data makes training a classifier difficult. This is the case for the Real-Time Bidding (RTB) framework in online advertising, and there are several ways to deal with the problem and improve the performance of the classifier.
Methods: Re-sampling
Re-sampling deals with imbalanced data by balancing the proportion of class labels:
• Under-sampling the majority class
• Over-sampling the minority class
• Combining over- and under-sampling
• Creating ensembles of balanced sets
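For illustration, random under-sampling of the majority class can be sketched in a few lines of plain Python; the function name and toy data below are ours, not from the slides:

```python
import random

def undersample_majority(X, y, seed=0):
    """Randomly drop majority-class examples until both classes are equal in size."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    major, minor = (neg, pos) if len(neg) > len(pos) else (pos, neg)
    kept = sorted(rng.sample(major, len(minor)) + minor)
    return [X[i] for i in kept], [y[i] for i in kept]

# toy data: 6 negatives, 2 positives -> balanced 2 vs. 2 after re-sampling
X = [[float(i)] for i in range(8)]
y = [0, 0, 0, 0, 0, 0, 1, 1]
X_res, y_res = undersample_majority(X, y)
```

In practice, libraries such as imbalanced-learn provide under-, over-, and combined samplers as well as balanced-ensemble methods.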
Methods: Calibration after re-sampling
There are several ways to calibrate the output probability of a classifier after re-sampling:
• Isotonic regression
  minimize Σ_i w_i (y_i − ŷ_i)²
  subject to ŷ_min = ŷ_1 ≤ ŷ_2 ≤ … ≤ ŷ_n = ŷ_max
• Calibration factor for negative under-sampling
  q = p / (p + (1 − p)/w)
• q: calibrated probability
• p: prediction in under-sampling space
• w: under-sampling rate
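Both calibration approaches above can be sketched in plain Python; the function names are ours, and w is assumed to be the fraction of negatives kept during under-sampling:

```python
def calibrate_undersampled(p, w):
    """Correct a prediction p made in the negative-under-sampling space,
    where w is the under-sampling rate: q = p / (p + (1 - p) / w)."""
    return p / (p + (1.0 - p) / w)

def isotonic_fit(y, w=None):
    """Weighted isotonic regression via Pool Adjacent Violators: returns the
    non-decreasing sequence minimizing sum_i w_i (y_i - yhat_i)^2."""
    if w is None:
        w = [1.0] * len(y)
    blocks = []  # each block: [weighted mean, total weight, count]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # pool adjacent blocks while monotonicity is violated
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, n1 + n2])
    fitted = []
    for m, _, n in blocks:
        fitted.extend([m] * n)
    return fitted

# keeping 10% of negatives, a raw prediction of 0.5 calibrates down sharply:
q = calibrate_undersampled(0.5, 0.1)  # ~0.0909
```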
Methods: Calibration after re-sampling
• Probability calibration should be done on new data not used for model fitting
• Logistic regression returns well-calibrated predictions by default, as it directly optimizes log-loss
Cost-sensitive learning
                   Actual negative   Actual positive
Predict negative   C(0, 0)           C(0, 1)
Predict positive   C(1, 0)           C(1, 1)
• Cost-sensitive learning takes the misclassification costs into consideration
• C(i, j) is the cost of predicting class i when the true class is j (1 = positive, 0 = negative)
• The expected cost of classifying an instance x into class i is
  R(i|x) = Σ_j P(j|x) C(i, j)
• The classifier classifies an instance x as positive if and only if
  P(0|x) C(1, 0) ≤ P(1|x) C(0, 1), assuming C(0, 0) = C(1, 1) = 0
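The expected-cost rule can be written down directly, with 1 = positive, 0 = negative, and cost values of our own choosing for illustration:

```python
def expected_cost(i, p_pos, C):
    """R(i|x) = sum_j P(j|x) * C[(i, j)], where C[(i, j)] is the cost of
    predicting class i when the true class is j."""
    P = {0: 1.0 - p_pos, 1: p_pos}
    return sum(P[j] * C[(i, j)] for j in (0, 1))

def classify(p_pos, C):
    """Predict the class with the lower expected cost."""
    return 1 if expected_cost(1, p_pos, C) <= expected_cost(0, p_pos, C) else 0

# a false negative costs 10x a false positive; correct predictions cost 0
C = {(0, 0): 0, (1, 1): 0, (1, 0): 1, (0, 1): 10}
# even a 20% positive probability already makes "positive" the cheaper call:
# R(1|x) = 0.8 * 1 = 0.8 vs. R(0|x) = 0.2 * 10 = 2.0
```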
Cost-sensitive learning types: Thresholding
• The thresholding method modifies the default threshold of 0.5 for labeling the class so that it reflects the costs:
  p* = C(1, 0) / (C(1, 0) + C(0, 1))
• The classifier classifies an instance x as positive if P(1|x) ≥ p*
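A minimal sketch of thresholding in the same cost notation (our function names, illustrative costs):

```python
def cost_threshold(c_fp, c_fn):
    """p* = C(1,0) / (C(1,0) + C(0,1)); predict positive when P(1|x) >= p*."""
    return c_fp / (c_fp + c_fn)

def classify_thresholded(p_pos, c_fp, c_fn):
    return 1 if p_pos >= cost_threshold(c_fp, c_fn) else 0

# with C(1,0) = 1 and C(0,1) = 10 the threshold drops from 0.5 to about 0.091,
# matching the expected-cost decision rule on the previous slide
```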
Cost-sensitive learning types: Sampling
• The re-sampling methods described above can be considered a form of cost-sensitive learning
• Positive and negative examples are sampled in the ratio
  p(1)·FN : p(0)·FP
• p(1) and p(0) are the prior probabilities of positive and negative examples in the original training set; FN and FP are the costs of a false negative and a false positive
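The sampling ratio is a one-liner; the priors and costs below are illustrative values of our choosing:

```python
def cost_sampling_ratio(p1, c_fn, c_fp):
    """Return the (positive, negative) sampling weights p(1)*FN : p(0)*FP."""
    return (p1 * c_fn, (1.0 - p1) * c_fp)

# 1% positives with a false-negative cost 100x the false-positive cost:
# the classes should be sampled at roughly a 1:1 ratio
r_pos, r_neg = cost_sampling_ratio(0.01, 100, 1)  # ~ (1.0, 0.99)
```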
Cost-sensitive learning types: Weighting
• The weighting method assigns a normalized weight to each instance according to the misclassification costs
• This can be seen as a variant of the sampling method: giving an example a high weight (a rare class with high cost) is equivalent to duplicating that example, i.e. sampling it more often
• Unlike the sampling method, the weighting method can utilize all of the data
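A quick numerical check of the duplication argument, on hypothetical toy numbers: an integer instance weight in the loss contributes exactly like that many duplicated copies of the example.

```python
import math

def weighted_log_loss(y, p, w):
    """Negative log-likelihood with a per-instance weight on each term."""
    return -sum(wi * (yi * math.log(pi) + (1 - yi) * math.log(1 - pi))
                for yi, pi, wi in zip(y, p, w))

# one positive with weight 3 vs. the same positive duplicated 3 times
a = weighted_log_loss([1, 0], [0.9, 0.2], [3, 1])
b = weighted_log_loss([1, 1, 1, 0], [0.9, 0.9, 0.9, 0.2], [1, 1, 1, 1])
# a and b agree (up to floating point)
```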
Tools in practice: XGBoost
• Balance the positive and negative weights via scale_pos_weight if you care only about the ranking order of your predictions
  • typically set to sum(negative instances) / sum(positive instances)
• Use AUC for evaluation; Utility [Chapelle O 2015] can also be considered as a metric in RTB
• If you care about predicting the right probability, you cannot re-balance the data
  • setting the parameter max_delta_step to a finite number (like 1) will help convergence
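The two regimes above can be sketched as XGBoost parameter dictionaries; the label counts are a toy example, and the actual training call is omitted:

```python
def scale_pos_weight(y):
    """sum(negative instances) / sum(positive instances)."""
    n_pos = sum(1 for label in y if label == 1)
    return (len(y) - n_pos) / n_pos

y_train = [0] * 990 + [1] * 10  # toy labels with a 1% positive rate

# Case 1: only the ranking order matters -> re-balance and evaluate with AUC
params_ranking = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "scale_pos_weight": scale_pos_weight(y_train),  # 99.0 here
}

# Case 2: calibrated probabilities matter -> do not re-balance the data
params_probability = {
    "objective": "binary:logistic",
    "max_delta_step": 1,  # helps convergence on heavily imbalanced data
}
```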
References
• Chapelle, O. (2015). Offline Evaluation of Response Prediction in Online Advertising Auctions.
• Chen, T. et al. (2016). XGBoost: A Scalable Tree Boosting System.
• He, X. et al. (2014). Practical Lessons from Predicting Clicks on Ads at Facebook.
• Ling, C. et al. (2008). Cost-Sensitive Learning and the Class Imbalance Problem.
• Vasile, F. et al. (2016). Cost-Sensitive Learning for Utility Optimization in Online Advertising Auctions.
