1. Decision Tree Ensembles
• Tree-based models for supervised learning
• Single decision trees are unstable and may lack accuracy
• Ensemble methods combine multiple trees for better performance
2. Why Ensembles?
• High variance in decision trees: small data changes → big prediction changes (see the sketch after this list)
• Ensemble = Combine weak learners (trees) into a strong model
• Trade-off: Interpretability ↓, Accuracy ↑
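To make the variance point concrete, here is a minimal sketch (assuming scikit-learn; the dataset is synthetic, purely for illustration) that fits two single trees on bootstrap resamples of the same data and measures how often their predictions disagree:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Synthetic data, purely for illustration
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

preds = []
for seed in range(2):
    # Fit a fresh tree on a bootstrap resample of the same training data
    Xb, yb = resample(X, y, random_state=seed)
    tree = DecisionTreeClassifier(random_state=0).fit(Xb, yb)
    preds.append(tree.predict(X))

# Fraction of points on which the two trees disagree: the "small data
# changes → big prediction changes" problem that ensembles average away
print(f"Trees disagree on {np.mean(preds[0] != preds[1]):.0%} of points")
```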
3. Bagging (Bootstrap Aggregating)
• Randomly sample data (with replacement) to create bootstrap samples
• Train a decision tree on each sample
• Average (regression) or vote (classification) for final prediction
• Out-of-bag (OOB) samples = built-in validation set
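A minimal bagging sketch, assuming scikit-learn ≥ 1.2 (where the base-learner argument is named estimator) and illustrative hyperparameter values; oob_score=True scores the ensemble on its out-of-bag samples, giving the built-in validation estimate mentioned above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # one deep tree per bootstrap sample
    n_estimators=100,                    # number of bootstrap samples / trees
    oob_score=True,                      # evaluate on the out-of-bag samples
    random_state=0,
).fit(X, y)

# Built-in validation estimate, no separate hold-out set needed
print(f"OOB accuracy: {bag.oob_score_:.3f}")
```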
4. Random Forest
• Bagging + randomness at each split (use random subset of features)
• More diverse trees → stronger ensemble
• Handles categorical & continuous variables
• Shows variable importance
• Relatively robust to outliers; some implementations also handle missing data
• Limitation: Interpretability ↓ compared to a single tree
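A minimal random-forest sketch under the same assumptions (scikit-learn, synthetic data, illustrative settings); max_features="sqrt" is the per-split feature subsampling, and feature_importances_ gives the variable importance listed above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,     # number of trees
    max_features="sqrt",  # random subset of features tried at each split
    random_state=0,
).fit(X, y)

# Impurity-based variable importance, one value per feature
for i, imp in enumerate(rf.feature_importances_):
    print(f"feature {i}: {imp:.3f}")
```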
5. Boosted Trees
• Sequentially build trees, each learning from previous errors
• Types: AdaBoost, Gradient Boosting, XGBoost
• Typically more accurate than bagging/random forest
• Prone to overfitting; requires careful tuning
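A minimal gradient-boosting sketch using scikit-learn's GradientBoostingClassifier (hyperparameter values are illustrative, not recommendations); each shallow tree is fit to the errors left by the trees before it:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=200,    # trees added sequentially
    learning_rate=0.05,  # shrink each tree's contribution
    max_depth=3,         # shallow trees, each correcting previous errors
    random_state=0,
).fit(X_tr, y_tr)

print(f"Validation accuracy: {gb.score(X_va, y_va):.3f}")
```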
6. Key Hyperparameters
• Number of trees (B)
• Tree depth (splits per tree)
• Learning rate (for boosting)
• Minimum split size
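These four hyperparameters map directly onto scikit-learn arguments; a minimal grid-search sketch (the grid values are illustrative) tunes them by cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

param_grid = {
    "n_estimators": [100, 300],    # number of trees (B)
    "max_depth": [2, 4],           # tree depth / splits per tree
    "learning_rate": [0.05, 0.1],  # boosting learning rate
    "min_samples_split": [2, 10],  # minimum split size
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0), param_grid, cv=5
).fit(X, y)

print(search.best_params_)
```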
7. Evaluation & Overfitting
• Use validation data (not training data) to tune hyperparameters
• Too few splits: Underfitting
• Too many splits: Overfitting
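The underfitting/overfitting trade-off can be read off a validation curve; a minimal sketch (assuming scikit-learn and synthetic data) sweeps tree depth for a single tree:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

depths = [1, 2, 4, 8, 16]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5,
)

# Low scores on both at depth 1 → underfitting; a large train/validation
# gap at high depth → overfitting
for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"max_depth={d:2d}  train={tr:.3f}  validation={va:.3f}")
```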
8. Model Assessment
• Metrics: Misclassification rate, AUC, confusion matrix
• Random forest provides feature importance
• Always check whether the model beats a naive baseline (e.g., always predicting the majority class)
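A minimal assessment sketch covering all three metrics plus a naive baseline (scikit-learn's DummyClassifier predicting the majority class); data and model settings are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
naive = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)

print("Misclassification rate:", 1 - rf.score(X_te, y_te))
print("AUC:", roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))
print("Confusion matrix:\n", confusion_matrix(y_te, rf.predict(X_te)))
print("Naive baseline accuracy:", naive.score(X_te, y_te))  # the bar to beat
```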
9. Comparison Table
Method        | How trees are built                | Accuracy | Interpretability
Bagging       | Bootstrapped samples, deep trees   | Moderate | Medium
Random Forest | Bootstrapping + feature randomness | High     | Lower
Boosted Trees | Sequential, error-correcting       | Highest  | Lowest
10. Takeaways
• Ensembles improve tree-based models significantly
• Random Forest: Good default choice for accuracy & feature importance
• Boosted Trees: Best for accuracy but needs more tuning
• Always evaluate on validation data and check for overfitting