What is Logistic Regression?
Logistic regression is a supervised statistical learning technique used for classification. Instead of predicting a continuous value, it estimates the probability that an instance belongs to a given category (e.g. “yes” vs. “no” or “spam” vs. “not spam”). In practice, logistic regression fits a linear model to the inputs and then applies a logistic (sigmoid) function to map that output into the range [0,1]. The final probability can be thresholded (often at 0.5) to decide class membership. In other words, logistic regression answers questions like “what is the chance of default?” or “what is the probability a patient has a disease?” rather than giving a raw numeric forecast. Its name comes from the logistic function it uses – but unlike linear regression, its focus is on classification and probability estimation.
The Logistic (Sigmoid) Function and Model Formula
At the core of logistic regression is the sigmoid (logistic) function.
In a logistic regression model, we first compute a linear score
z=b+w1x1+w2x2+⋯+wNxN, where the xi are input features, the wi are learned coefficients, and b is a bias (intercept).
Model Training (Maximum Likelihood)
The model parameters (weights and bias) are typically found by maximum likelihood estimation (MLE). Concretely, we choose the coefficients that make the observed class labels most probable under the logistic model. Equivalently, we minimize the logistic (cross-entropy) loss. Unlike ordinary least squares, there is no closed-form solution for these parameters. Instead, training uses iterative optimization (e.g. gradient descent or Newton’s method) to adjust the weights step by step until the likelihood (or equivalently the log-likelihood) converges. Modern software libraries (scikit-learn, etc.) handle this under the hood. Practitioners should note that perfect separation of classes or extreme multicollinearity can cause convergence issues. Regularization or alternative solvers are common fixes if the training algorithm fails to converge.
Assumptions of Logistic Regression
Logistic regression makes fewer assumptions than linear regression, but there are still conditions to consider. Key assumptions include:
Logistic regression does not assume normally distributed errors or constant variance (homoscedasticity) of residuals. However, a reasonably large sample size is advised for stable estimates, and one should check the linearity-of-logit assumption (e.g. via scatter plots or adding interaction terms if needed).
Evaluating Logistic Regression
Since logistic regression is a classification model, it is evaluated using classification metrics rather than regression metrics. Common evaluation measures include:
Other metrics (F1-score, confusion matrix, etc.) are also used as needed. The choice of metric often depends on the application: for example, if false negatives are very costly (e.g. missing a disease), one may prioritize high recall.
Sample code implementation of Logistic Regression using Python
Example data: Iris dataset
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = (iris.target == 2).astype(int) # 1 if Virginica, else 0 (binary classification)
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Instantiate and train the logistic regression model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1] # Probability of class 1 (Virginica)
# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
Recently, I created a Machine Learning model for detecting whether a financial transaction is fraudulent or not, using Logistic Regression and deployed in Streamlit App.
Link for code: https://drive.google.com/file/d/1o6qtXsPsB8ClJiNTWEHplLaDAszw2TPF/view?usp=sharing
Real-World Applications
Logistic regression is simple yet powerful, and it appears across many industries. Below are just a few examples:
Beyond these, logistic regression finds use in manufacturing (e.g. estimating probability of a defective product), insurance (claim risk), HR (employee attrition), and many other domains. Its combination of speed, interpretability, and well-understood behavior keeps it in regular use in industry.
Conclusion
Logistic regression is a foundational classification method that strikes a balance between simplicity and power. It is easy to implement and explain (even to non-technical stakeholders), yet it provides meaningful probability scores and handles real-valued or categorical features. Whether you’re just starting in data science or looking for a reliable baseline model, understanding logistic regression is invaluable. Feel free to comment here if you have any additional insights or thoughts.