What is Logistic Regression?

Logistic regression is a supervised statistical learning technique used for classification. Instead of predicting a continuous value, it estimates the probability that an instance belongs to a given category (e.g. “yes” vs. “no” or “spam” vs. “not spam”). In practice, logistic regression fits a linear model to the inputs and then applies a logistic (sigmoid) function to map that output into the range [0,1]. The final probability can be thresholded (often at 0.5) to decide class membership. In other words, logistic regression answers questions like “what is the chance of default?” or “what is the probability a patient has a disease?” rather than giving a raw numeric forecast. Its name comes from the logistic function it uses – but unlike linear regression, its focus is on classification and probability estimation.


[Figure: Logistic Regression Concept]

The Logistic (Sigmoid) Function and Model Formula

At the core of logistic regression is the sigmoid (logistic) function, which maps any real-valued score to a probability in the interval (0, 1):

σ(z) = 1 / (1 + e^(−z))

In a logistic regression model, we first compute a linear score

z = b + w₁x₁ + w₂x₂ + ⋯ + wₙxₙ,

where the xᵢ are input features, the wᵢ are learned coefficients, and b is a bias (intercept). Applying the sigmoid to this score yields the predicted probability of the positive class, p = σ(z).
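
To make the formulas concrete, here is a minimal sketch of the model's forward pass in plain NumPy. The weights, bias, and feature values are made-up numbers for illustration only:

import numpy as np

def sigmoid(z):
    # Map a real-valued score to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned parameters and one input example
w = np.array([0.8, -1.2, 0.5])   # coefficients w1..w3
b = 0.1                          # bias (intercept)
x = np.array([2.0, 1.0, 3.0])    # feature values x1..x3

z = b + np.dot(w, x)   # linear score
p = sigmoid(z)         # predicted probability of the positive class
label = int(p >= 0.5)  # threshold at 0.5 for a hard class decision
print(f"z = {z:.2f}, p = {p:.3f}, class = {label}")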

Model Training (Maximum Likelihood)

The model parameters (weights and bias) are typically found by maximum likelihood estimation (MLE). Concretely, we choose the coefficients that make the observed class labels most probable under the logistic model. Equivalently, we minimize the logistic (cross-entropy) loss. Unlike ordinary least squares, there is no closed-form solution for these parameters. Instead, training uses iterative optimization (e.g. gradient descent or Newton’s method) to adjust the weights step by step until the likelihood (or equivalently the log-likelihood) converges. Modern software libraries (scikit-learn, etc.) handle this under the hood. Practitioners should note that perfect separation of classes or extreme multicollinearity can cause convergence issues. Regularization or alternative solvers are common fixes if the training algorithm fails to converge.
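
As a rough illustration of what the optimizer does under the hood, here is a minimal batch gradient-descent sketch for the unregularized cross-entropy loss. Production libraries use more sophisticated solvers (e.g. L-BFGS in scikit-learn), so this toy loop only shows the mechanics:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=2000):
    # Minimize mean cross-entropy loss with plain batch gradient descent
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)               # current predicted probabilities
        error = p - y                        # gradient of the loss w.r.t. the score z
        w -= lr * (X.T @ error) / n_samples  # gradient step for the weights
        b -= lr * error.mean()               # gradient step for the bias
    return w, b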


[Figure: Maximum Likelihood Estimation & Confusion Matrix depicting all four scenarios]

Assumptions of Logistic Regression

Logistic regression makes fewer assumptions than linear regression, but there are still conditions to consider. Key assumptions include:

  • Binary (or categorical) outcome: The dependent variable should be binary (for basic logistic regression). (Multinomial or ordinal versions handle >2 classes.)
  • Linearity in the log-odds: The log-odds (logit) of the outcome must be a linear combination of the predictors. In other words, each feature has an additive effect on the log-odds of the outcome.
  • Independent observations: Each training example should be independent of the others (no serial correlation).
  • No (or little) multicollinearity: Features should not be highly correlated with each other. (High collinearity inflates variance of estimates.)
  • No extreme outliers or complete separation: Very large outliers or a predictor that perfectly separates the classes can unduly influence the model or prevent convergence.

Logistic regression does not assume normally distributed errors or constant variance (homoscedasticity) of residuals. However, a reasonably large sample size is advised for stable estimates, and one should check the linearity-of-logit assumption, for example by plotting empirical log-odds against each continuous predictor, or by adding polynomial or spline terms if the relationship is curved. A quick way to eyeball this check is sketched below.
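
A minimal sketch of that check, assuming pandas and NumPy are available: bin one continuous feature into quantiles, compute the empirical log-odds of the positive class within each bin, and plot the result; an approximately straight line supports the assumption. The helper binned_log_odds is our own illustration, not a library function.

import numpy as np
import pandas as pd

def binned_log_odds(x, y, n_bins=10):
    # Empirical log-odds of y = 1 within quantile bins of x
    df = pd.DataFrame({"x": x, "y": y})
    df["bin"] = pd.qcut(df["x"], q=n_bins, duplicates="drop")
    grouped = df.groupby("bin", observed=True)
    rate = grouped["y"].mean().clip(1e-6, 1 - 1e-6)  # clamp to avoid log(0)
    centers = grouped["x"].mean()
    return centers.to_numpy(), np.log(rate / (1 - rate)).to_numpy()

# Plot the returned bin centers against the log-odds (e.g. with matplotlib);
# a roughly linear trend is consistent with the linearity-of-logit assumption.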

Evaluating Logistic Regression

Since logistic regression is a classification model, it is evaluated using classification metrics rather than regression metrics. Common evaluation measures include:

  • Accuracy: The overall fraction of correct predictions. This is easy to compute but can be misleading if classes are imbalanced.
  • Precision (Positive Predictive Value): The proportion of instances predicted positive that are actually positive. Precision answers: “Of all positive predictions, how many are correct?”.
  • Recall (Sensitivity): The proportion of actual positives that the model correctly identifies. Recall answers: “Of all real positive cases, how many did we catch?”.
  • ROC Curve and AUC: The Receiver Operating Characteristic (ROC) curve plots the true positive rate against the false positive rate across all classification thresholds. The Area Under the ROC Curve (AUC) is a single-number summary of this plot: 0.5 corresponds to chance-level ranking and 1.0 to perfect separation. Equivalently, the AUC is the probability that the model ranks a randomly chosen positive example higher than a randomly chosen negative example. Higher AUC means better overall discrimination.

Other metrics (F1-score, confusion matrix, etc.) are also used as needed. The choice of metric often depends on the application: for example, if false negatives are very costly (e.g. missing a disease), one may prioritize high recall.
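
To make these metrics concrete, here is a minimal sketch using scikit-learn. The labels and scores below are made-up illustration values, not output from a real model:

from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# Hypothetical ground-truth labels, hard predictions, and predicted probabilities
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_hat = [0, 1, 1, 1, 0, 0, 1, 0]
p_hat = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1, 0.7, 0.3]

print("Precision:", precision_score(y_true, y_hat))
print("Recall:", recall_score(y_true, y_hat))
print("F1-score:", f1_score(y_true, y_hat))
print("ROC AUC:", roc_auc_score(y_true, p_hat))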


[Figure: Evaluation Metrics]

Sample implementation of Logistic Regression in Python

Example data: Iris dataset

# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = (iris.target == 2).astype(int)  # 1 if Virginica, else 0 (binary classification)

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Instantiate and train the logistic regression model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # Probability of class 1 (Virginica)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_prob))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Recently, I built a machine learning model that detects whether a financial transaction is fraudulent, using logistic regression, and deployed it as a Streamlit app.

Link for code: https://drive.google.com/file/d/1o6qtXsPsB8ClJiNTWEHplLaDAszw2TPF/view?usp=sharing

Real-World Applications

Logistic regression is simple yet powerful, and it appears across many industries. Below are just a few examples:

  • Healthcare: Predicting patient outcomes and disease risk. For example, hospitals use logistic models to estimate the probability a patient will develop a complication or be readmitted after surgery. By including risk factors (age, vitals, lab results, medical history, etc.), the model gives a probability score to guide monitoring and intervention. Logistic regression also underpins many epidemiological studies, quantifying how risk factors (smoking, cholesterol, etc.) influence disease probability.
  • Finance: Credit scoring and fraud detection. Banks routinely use logistic regression to estimate the probability that a borrower will default on a loan. The model combines predictors like income, credit history, debt levels, and employment into a risk score (a probability of default). Because logistic regression outputs an easily interpretable probability, institutions can set flexible approval thresholds and explain decisions to regulators. It is also used in fraud detection to flag transactions with high likelihood of fraud.
  • Marketing and Customer Analytics: Churn prediction and targeting. Companies use logistic regression to predict whether a customer will churn (cancel a service). Inputs might include usage frequency, engagement metrics, demographics, and past interactions. The resulting probability helps marketing teams intervene, for example by offering a discount to a customer at high risk of churning. Logistic models are also used for email/web click-through prediction, customer segmentation, and estimating the likelihood that a campaign will lead to a sale.

Beyond these, logistic regression finds use in manufacturing (e.g. estimating probability of a defective product), insurance (claim risk), HR (employee attrition), and many other domains. Its combination of speed, interpretability, and well-understood behavior keeps it in regular use in industry.

Conclusion

Logistic regression is a foundational classification method that strikes a balance between simplicity and power. It is easy to implement and explain (even to non-technical stakeholders), yet it provides meaningful probability scores and handles real-valued or categorical features. Whether you’re just starting in data science or looking for a reliable baseline model, understanding logistic regression is invaluable. Feel free to comment here if you have any additional insights or thoughts.

