31 August 2021 ©TejamoyGhosh – Data Science ATG - New Delhi, India
 • The problem with logistic regression when the event rate is low
 • Ways out
 • How to apply them in SAS
 • How to apply them in R
Analyst 1: I’m in some trouble. My manager wants me to build a logistic regression model, but I
have only a 2% event rate in my data. Logistic regression won’t be a good choice here: the
maximum-likelihood (ML) estimates will be biased.
Analyst 2: Not necessarily. It’s the total count of events, rather than their percentage, that matters.
How many cases do you have for the rarer event, and how big is your dataset?
Analyst 1: We’ve got about 1,800 events in a dataset of about 100,000 cases, so a less-than-2%
scenario.
Analyst 2: Hmm. With that many cases for the rarer event, you can very well use logistic
regression. There are methods to address such skewed or sparse data situations.
Analyst 1: Wow, really? Please tell me more!
Analyst 2: There are a couple of alternatives. One is ‘exact’ logistic regression, which is meant
for samples that are too small for the usual maximum-likelihood estimation. The other option,
and the one that fits your scenario, is the penalized-likelihood estimation method. It has the
advantage of being computationally less demanding than exact logistic regression.
 • Low event rate / rare event:
 • In the current context, this refers to a binary outcome (response/no-response, good/bad,
default/no-default, purchase/no-purchase, etc.) where one of the two outcomes occurs far less
often than the other
▪ Suppose that in a sample of 1,000 applicants for a position only 20 are selected: being selected
is the rare event, with a low event rate of 2%
▪ Suppose that in a sample of 100,000 purchases from an online retailer about 1,800 are returned by
the customer: goods being returned is the rare event, with a low event rate of 1.8%
 • Some real-life examples:
▪ Chargebacks in credit card transactions
▪ Goods returned in online retailing
 • Why is this a problem for logistic regression, when the outcome is still binary?
 • The problem lies in the estimation method: the usual maximum-likelihood method is susceptible
to ‘small-sample bias’, and this bias depends strongly on the count (as opposed to the
percentage) of the rarer event
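To make the count-versus-percentage point concrete, the short Python sketch below (not part of the original slides) takes the two examples above and compares their event rates with a rough measure of how much information the rare class carries. The formula sqrt(1/events + 1/non-events) is the usual large-sample approximation to the standard error of the log-odds in an intercept-only model. It is used here only as a proxy for precision, not as the bias itself, and the function name `event_summary` is illustrative.

```python
import math


def event_summary(events, total):
    """Return the event rate and an approximate standard error of the log-odds."""
    non_events = total - events
    rate = events / total
    # Large-sample approximation for an intercept-only logistic model.
    se_log_odds = math.sqrt(1.0 / events + 1.0 / non_events)
    return rate, se_log_odds


examples = [
    ("applicants selected", 20, 1_000),
    ("purchases returned", 1_800, 100_000),
]

for label, events, total in examples:
    rate, se = event_summary(events, total)
    print(f"{label}: events={events}, rate={rate:.1%}, approx. SE of log-odds={se:.3f}")
```

Both examples have an event rate of about 2%, yet the approximate standard error is roughly ten times larger with 20 events than with 1,800, which is why the count of rare events is the quantity to look at.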
 • With a small sample and/or very unbalanced binary data (for example, just 20 events in
a sample of 1,000), ‘exact’ logistic regression is to be used
 • Exact logistic regression provides an alternative to the maximum-likelihood method for
making inferences about the parameters of the logistic regression model
 • The method is based on the appropriate exact conditional distributions of the sufficient statistics
for the parameters of interest, so its estimates do not depend on asymptotic results
 • It is useful for analyzing small or unbalanced binary data with covariates
 • It is usually very computationally intensive
 • If, however, you have a larger count of the rarer event, say 1,000 (even better, 2,000)
in a sample of 100,000, with the same low event rate (1% to 2%), you can use
logistic regression, with the estimation done by the ‘penalized likelihood’ method
(also called Firth’s penalized likelihood approach, after its inventor)
 • Although it is mentioned here only in the small-sample/rare-event context, this method
more generally addresses separation, small sample sizes, and bias of the parameter estimates
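The effect of the penalty can be seen in the simplest possible case, an intercept-only logistic model. The Python sketch below is an illustration under that assumption, not the general fitting algorithm that SAS or R uses. In this case the Firth-penalized log-likelihood is the ordinary log-likelihood plus 0.5 * log(n * p * (1 - p)), and it is maximized at p = (events + 0.5) / (n + 1). The function names are hypothetical.

```python
import math


def penalized_loglik(beta, events, total):
    """Firth-penalized log-likelihood for an intercept-only logistic model."""
    p = 1.0 / (1.0 + math.exp(-beta))
    loglik = events * math.log(p) + (total - events) * math.log(1.0 - p)
    # Penalty: 0.5 * log of the Fisher information, which is n * p * (1 - p) here.
    penalty = 0.5 * math.log(total * p * (1.0 - p))
    return loglik + penalty


def firth_intercept(events, total):
    """Closed-form maximizer of penalized_loglik (intercept-only case)."""
    p = (events + 0.5) / (total + 1.0)
    return math.log(p / (1.0 - p))


# With zero events the ML estimate of the intercept is minus infinity,
# while the penalized estimate stays finite.
print("0 events in 20:", firth_intercept(0, 20))

# With many events the penalty barely moves the estimate.
ml_large = math.log(1_800 / 98_200)
print("1,800 events in 100,000: ML =", ml_large, "Firth =", firth_intercept(1_800, 100_000))
```

With covariates there is no closed form and the penalized likelihood has to be maximized iteratively, but the behaviour is the same: estimates stay finite under sparse data or separation, and they approach the ML estimates as the number of events grows.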
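Proc Logistic Data = YourRareEventData descending; /* descending reverses the response level order */
Freq CellCount; /* CellCount holds the number of observations each row represents */
model RareEvent = X1 X2;
Exact X1 / estimate = both; /* exact inference for X1; both = parameter and odds-ratio estimates */
Run;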
 • You can add other options to control what appears in your output
 • The EXACT statement after the MODEL statement and the FREQ statement are
the key differences here
 • An alternative events/trials syntax:
Proc Logistic Data = YourRareEventData;
model RareEvent / CellCount = X1 X2; /* RareEvent = number of events, CellCount = number of trials per row */
Exact X1 / estimate = both;
Run;
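The idea behind the inference requested by the EXACT statement can be sketched in the simplest case, a single binary covariate. Conditional on the total number of events, the number of events in the X1 = 1 group (the sufficient statistic for the X1 coefficient) follows a hypergeometric distribution when that coefficient is zero, so a p-value can be computed without any asymptotic approximation. The Python sketch below illustrates this with made-up counts and hypothetical function names. It is not what PROC LOGISTIC runs internally.

```python
from math import comb


def conditional_pmf(n_x1, n_x0, total_events):
    """Null distribution of the event count in the X1 = 1 group,
    conditional on the total number of events (hypergeometric)."""
    lowest = max(0, total_events - n_x0)
    highest = min(n_x1, total_events)
    denom = comb(n_x1 + n_x0, total_events)
    return {
        t: comb(n_x1, t) * comb(n_x0, total_events - t) / denom
        for t in range(lowest, highest + 1)
    }


def exact_p_value(events_x1, n_x1, events_x0, n_x0):
    """Two-sided p-value: total probability of all outcomes that are
    no more likely than the observed one."""
    pmf = conditional_pmf(n_x1, n_x0, events_x1 + events_x0)
    p_observed = pmf[events_x1]
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in pmf.values() if p <= p_observed * (1 + 1e-9))


# Made-up data: 4 events among 10 cases with X1 = 1, 0 events among 10 with X1 = 0.
print(exact_p_value(events_x1=4, n_x1=10, events_x0=0, n_x0=10))
```

Enumerating the conditional distribution is cheap for one binary covariate, but the number of configurations grows quickly with more covariates and larger samples, which is why exact logistic regression is computationally intensive.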
Proc Logistic Data = YourRareEventData;
class CategoricalVbl1 CategoricalVbl2 / param=ref; /* reference-cell coding */
Model Y = CategoricalVbl1 CategoricalVbl2 X1 X2 / firth; /* FIRTH requests penalized-likelihood estimation */
run;
 • You can add other options to control what appears in your output
 • The FIRTH option in the MODEL statement is the key here
 • Package required:
 • ‘elrm’
 • This package implements (approximate) exact inference for binomial logistic regression models in R
 • Package:
 • ‘logistf’
 • This package fits Firth’s bias-reduced logistic regression, with confidence intervals for the
parameter estimates based on the penalized profile likelihood
 • Another package, ‘penalized’, fits penalized generalized linear models and other penalized
regression models
31 August 2021 ©Arup Guha - Indian Institute of ForeignTrade - New Delhi, India
Data Sciences ATG
EDUCATION
Econometrics,
Statistics, Economics
Vanderbilt, Cincinnati,
Indian Statistical
Institute, Jawaharlal
Nehru University
Research Scholars
Journal Articles
Free Solutions to Challenging Data Problems
EXPERIENCE
18 years combined,
Marketing analytics, Risk
analytics, Financial
analytics, Analytic
Solution & Tools
development, Analytics
CoE set-up, Advanced
Analytics Training
EXISTING/SERVED
CLIENTS
A large Global Beverage
company, A small insurance
company,
A renowned business
school, A large Global HR &
Compensation Consulting
Group, A large Global IT
Research group, A third
party analytics vendor, A
mid sized analytics
consulting
EXPERTISE
Predictive modelling,
Segmentation, Market
research, Clickstream data
analysis, Forecasting,
Financial Time Series,
Simulation, Bayesian
econometrics, Machine
Learning Techniques,
Decision Trees,
SAS, SPSS, R, Octave,
Stata, Eviews, Matlab,
Maxima, Netlogo
What we don’t do…
Quick and dirty back of the
envelope calculation
Use jargon presentations with little
impact on your problem
Hide that we are stumped
What we do…
FREE analytics help to stuck
analysts and consultants
Customized analytics solutions to
institutes and companies
FREE snapshot to companies
considering entering analytics
Apply analytics in non-traditional
areas including films & education
FREE data analysis help to
students researchers and faculty

Logistic regression with low event rate (rare events)
