Linear Regression: A Comprehensive Explanation
1. Definition & Purpose
What it is: Linear regression is a supervised learning algorithm used for predicting a continuous target variable based on one or more independent features.
Goal: Model the linear relationship between input variables (features) and a single output variable (target) by fitting a linear equation to observed data.
2. Key Concepts
a) Equation:
Simple Linear Regression (one feature):
Y = β₀ + β₁X + ϵ
Y: Target variable.
X: Feature.
β₀: Intercept (value of Y when X = 0).
β₁: Slope (change in Y per unit change in X).
ϵ: Error term (residuals).
b) Multiple Linear Regression (multiple features):
Y = β₀ + β₁X₁ + β₂X₂ + ⋯ + βₙXₙ + ϵ
c) Best-Fit Line: The line that minimizes the sum of squared residuals (differences between predicted and actual values).
d) Residuals:
Residual = Y_actual − Y_predicted
Squaring residuals ensures all errors are positive and penalizes larger errors.
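To make this concrete, here is a minimal NumPy sketch (with made-up values) of computing residuals and the sum of squared residuals:

```python
import numpy as np

# Made-up actual targets and predictions from some fitted line
y_actual = np.array([3.0, 5.0, 7.5, 9.0])
y_predicted = np.array([3.2, 4.8, 7.0, 9.5])

residuals = y_actual - y_predicted   # Residual = Y_actual - Y_predicted
ssr = np.sum(residuals ** 2)         # squaring makes every error positive

print(residuals)  # [-0.2  0.2  0.5 -0.5]
print(ssr)        # ~0.58
```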
3. How It Works: Step-by-Step
a) Data Preparation:
Features (X): Independent variables (e.g., head size, age).
Target (Y): Dependent variable (e.g., brain weight).
Split data into training (80%) and testing (20%) sets.
b) Model Training:
i. Objective: Find coefficients (β₀, β₁, …) that minimize the cost function (Mean Squared Error).
ii. Cost Function (Mean Squared Error): MSE = (1/n) Σ (Yᵢ − Ŷᵢ)²
iii. Closed-Form Solution (Normal Equation): β = (XᵀX)⁻¹XᵀY, where:
iv. X: Feature matrix (with a column of 1s for the intercept).
v. Y: Target vector.
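As a quick illustration, the normal equation can be applied directly with NumPy; a minimal sketch with made-up data (variable names are illustrative):

```python
import numpy as np

# Made-up training data: one feature (e.g., head size) and a target (e.g., brain weight)
x = np.array([3500.0, 3700.0, 3900.0, 4100.0, 4300.0])
y = np.array([1200.0, 1260.0, 1310.0, 1380.0, 1440.0])

# Feature matrix with a leading column of 1s for the intercept
X = np.column_stack([np.ones_like(x), x])

# Normal equation: beta = (X^T X)^(-1) X^T y
# (solve is used instead of an explicit matrix inverse for numerical stability)
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)  # [beta_0 (intercept), beta_1 (slope)]
```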
c) Prediction: Use the learned coefficients to predict new values: Ŷ = β₀ + β₁X₁ + ⋯ + βₙXₙ.
d) Evaluation:
i. R-squared (R²): Measures the proportion of variance in Y explained by X: R² = 1 − SS_res / SS_tot.
ii. Adjusted R²: Adjusts R² for the number of predictors to avoid overfitting.
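A minimal sketch of computing both metrics by hand with NumPy (made-up values; n is the number of samples, p the number of predictors):

```python
import numpy as np

# Made-up actual and predicted values
y_actual = np.array([1200.0, 1260.0, 1310.0, 1380.0, 1440.0])
y_pred = np.array([1205.0, 1255.0, 1320.0, 1375.0, 1435.0])

ss_res = np.sum((y_actual - y_pred) ** 2)            # residual sum of squares
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot

n, p = len(y_actual), 1  # n samples, p predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(r2, adj_r2)
```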
4. Assumptions
Linearity: Relationship between features and target is linear.
Independence: Residuals are uncorrelated.
Homoscedasticity: Residuals have constant variance.
Normality: Residuals are normally distributed.
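A residual plot is a common way to eyeball the linearity and homoscedasticity assumptions: the points should scatter randomly around zero with roughly constant spread. A minimal sketch with made-up residuals standing in for a trained model's output:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up fitted values and residuals (a real check would use model output)
rng = np.random.default_rng(0)
fitted = np.linspace(1200, 1450, 50)
residuals = rng.normal(0, 20, size=50)  # well-behaved: mean 0, constant variance

plt.scatter(fitted, residuals)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```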
5. Practical Example
Problem: Predict brain weight (Y) from head size (X).
Steps:
Load Data: Ensure no missing values or outliers.
Split Data: Train on 80%, test on 20%.
Reshape Data: Convert features to 2D arrays (required by libraries like sklearn).
Train Model: Fit LinearRegression to training data.
Predict: Use the model to predict test data.
Evaluate: Calculate R² and visualize predictions vs. actual values.
6. Code Implementation (Python)
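The original code isn't reproduced here, so below is a minimal scikit-learn sketch of the workflow described in section 5. The file name and column names are assumptions for illustration:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Load data (file and column names are illustrative)
data = pd.read_csv("headbrain.csv")
X = data["Head Size(cm^3)"].values.reshape(-1, 1)  # sklearn expects 2D features
y = data["Brain Weight(grams)"].values

# Split into 80% training / 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the test set and evaluate
y_pred = model.predict(X_test)
print("Intercept (beta_0):", model.intercept_)
print("Slope (beta_1):", model.coef_[0])
print("R^2:", r2_score(y_test, y_pred))

# Visualize predictions vs. actual values
plt.scatter(X_test, y_test, label="Actual")
plt.plot(X_test, y_pred, color="red", label="Predicted")
plt.xlabel("Head size (cm^3)")
plt.ylabel("Brain weight (g)")
plt.legend()
plt.show()
```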
7. Strengths & Limitations
Strengths: Simple and interpretable; fast to train; works well when the relationship between features and target is approximately linear.
Limitations: Assumes linearity, so it fails on non-linear relationships; sensitive to outliers; cannot handle categorical features directly (requires encoding; see the sketch below).
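On the last point, one-hot encoding is the usual workaround for categorical features; a quick sketch with a made-up column:

```python
import pandas as pd

# Made-up data with one numeric and one categorical feature
df = pd.DataFrame({"head_size": [3500, 3700, 3900], "sex": ["M", "F", "M"]})

# One-hot encode the categorical column so linear regression can use it
df_encoded = pd.get_dummies(df, columns=["sex"], drop_first=True)
print(df_encoded)  # head_size plus a numeric sex_M indicator column
```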
All in all, linear regression is a foundational algorithm for predicting continuous outcomes. It models relationships using a linear equation, optimizing coefficients to minimize prediction errors. While simple, understanding its assumptions and limitations is crucial for effective application. Tools like R² and adjusted R² help evaluate performance, and libraries like scikit-learn streamline implementation.
For more on this, check out this link: Linear Regression - sample code, whiteboard, summary