16. Bagging: Bootstrap Aggregating: wisdom of the crowd
1. Sample records with replacement (aka "bootstrap" the training data)
Sampling is the process of selecting a subset of items from a larger collection.
Bootstrap = sampling with replacement: after a data point is drawn it is put back, so it can appear more than once in the same sample (and in other drawn samples as well).
2. Fit an overgrown (deep, unpruned) tree to each resampled data set
3. Average the predictions (for classification, take a majority vote)
17. Bagging: Bootstrap Aggregating: wisdom of the crowd
As we add more trees... our average prediction error decreases
18. Random Forest
• A random forest is a collection of decision trees. Each tree estimates a classification, and this estimate is called a "vote". We consider the vote from every tree and choose the most-voted classification (majority voting).
• Random forests follow the same bagging process as bagged decision trees, but each time a split is to be performed, the search for the split variable is limited to a random subset of m of the p attributes (variables or features), aka split-attribute randomization:
• classification trees: m = √p
• regression trees: m = p/3
• m is commonly referred to as mtry (a code sketch follows after this slide)
• Random Forests produce many unique trees.
19. Bagging vs Random Forest
• Bagging introduces randomness into the rows of the data.
• Random forest introduces randomness into the rows and columns of the data.
• Combined, this provides a more diverse set of trees that almost always lowers our prediction error (see the sketch below).
[Chart: split-attribute randomization vs. prediction error]