Feature Engineering on Forest Cover Type Data with Decision Trees
CHANDANA T L
NAVAMI K
PRUTHVI H R
NISHA K K
BIJU R MOHAN
Problem Statement
• To determine the predominant tree species (forest cover type) in a 30 m × 30 m cell of land using cartographic variables.
• Forest cover type takes one of 7 class labels representing the tree species (lodgepole pine, spruce/fir, ponderosa pine, Douglas-fir, aspen, cottonwood/willow and krummholz).
Data Description
Attribute Name                        Description
Elevation                             Elevation in meters
Aspect                                Aspect in degrees azimuth
Slope                                 Slope in degrees
Horizontal_Distance_To_Hydrology      Horizontal distance to nearest surface water features
Vertical_Distance_To_Hydrology        Vertical distance to nearest surface water features
Horizontal_Distance_To_Roadways       Horizontal distance to nearest roadways
Hillshade_9am                         Hillshade index at 9am, summer solstice
Hillshade_Noon                        Hillshade index at noon, summer solstice
Hillshade_3pm                         Hillshade index at 3pm, summer solstice
Horizontal_Distance_To_Fire_Points    Horizontal distance to nearest wildfire ignition points
Wilderness Area (4 binary columns)    Wilderness area designation
Soil_Type (40 binary columns)         Soil type designation
Data Pre-processing
Base case: with no changes made to the original dataset, the model reached an accuracy of 68.44% on the test set.
Experiment 1
  Feature description: Soil types 7 and 15 are missing from the training set
  Feature modification: Replaced s7 and s15 in the test set with s4 (there were 105 samples with soil type s7)
  Reason: The relevance of these features appeared to be very low according to the attribute usage of the C5.0 tree built from the test set
  Result: Improved to 69.13%
Experiment 2
  Feature description: Soil Type (qualitative), 40 binary columns
  Feature modification: Generalization of Soil Type to 11 columns
  Reason: Generalization is based on the ELU codes of the soil types
  Result: Decreased to 67.32%
Experiment 3
  Feature description: Soil Type (qualitative), 40 binary columns
  Feature modification: Generalization based on geologic and climatic zones
  Reason: Generalization is based on the ELU codes of the soil types
  Result: Decreased further to 66.95%
Data Pre-processing (continued)
Experiment 4
  Feature description: Wilderness Area (qualitative), 4 binary columns
  Feature modification: Generalized to one categorical feature
  Reason: To reduce the number of features used, given the lower relevance of this attribute
  Result: Improved to 68.89%
Experiment 5
  Feature description: Missing values in Hillshade_3pm
  Feature modification: Replaced those values with the mean of all Hillshade_3pm values
  Reason: To remove the outliers
  Result: Improved to 69.17%
Experiment 6
  Feature description: Soil Type (qualitative), 40 binary columns
  Feature modification: Generalized to one categorical feature
  Reason: To reduce the number of features used, given the lower relevance of this attribute
  Result: Improved to 69.13%
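The modifications in experiments 4 to 6 can be illustrated with a small sketch. It is not the authors' code: the column names (`Wilderness_Area1` to `Wilderness_Area4`), the helper names, and the use of `None` to mark a missing Hillshade_3pm value are all assumptions made for illustration, since the slides do not say how the columns are named or how missing values are encoded.

```python
def collapse_one_hot(row, prefix, count):
    """Collapse `count` binary columns into one categorical value.

    Returns the 1-based index of the first column set to 1, or 0 if none is set.
    Column names of the form prefix + index are an assumption.
    """
    for i in range(1, count + 1):
        if row.get(f"{prefix}{i}", 0) == 1:
            return i
    return 0


def impute_mean(values):
    """Replace missing entries (assumed to be None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing to average over, so return the values unchanged.
        return list(values)
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


# Hypothetical sample: the third wilderness area column is the active one.
sample_row = {
    "Wilderness_Area1": 0,
    "Wilderness_Area2": 0,
    "Wilderness_Area3": 1,
    "Wilderness_Area4": 0,
}
wilderness_area = collapse_one_hot(sample_row, "Wilderness_Area", 4)

# Hypothetical Hillshade_3pm column with one missing value.
hillshade_3pm = impute_mean([100, None, 140])
```

The same `collapse_one_hot` helper would apply to the 40 soil type columns by changing the prefix and count.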
Feature Engineering
Description of the new features for the Forest Cover dataset
Attribute Name           Description
Wilderness Area          Wilderness area designation
Soil Type                Soil type designation
Aspect                   Aspect in degrees azimuth
Highwater                Indicator of whether Vertical_Distance_To_Hydrology is positive or negative
EVDtH                    Elevation - Vertical_Distance_To_Hydrology
EHDtH                    Elevation - Horizontal_Distance_To_Hydrology * 0.2
Distance_To_Hydrology    (Horizontal_Distance_To_Hydrology^2 + Vertical_Distance_To_Hydrology^2)^(1/2)
Hydro_Road_2             Horizontal_Distance_To_Hydrology - Horizontal_Distance_To_Roadways
Fire_Road_1              Horizontal_Distance_To_Fire_Points + Horizontal_Distance_To_Roadways
Fire_Road_2              Horizontal_Distance_To_Fire_Points - Horizontal_Distance_To_Roadways
Hydro_Road_1             Horizontal_Distance_To_Hydrology + Horizontal_Distance_To_Roadways
Hydro_Fire_1             Horizontal_Distance_To_Hydrology + Horizontal_Distance_To_Fire_Points
Hydro_Fire_2             Horizontal_Distance_To_Hydrology - Horizontal_Distance_To_Fire_Points
Tested on the C5.0 decision tree with no boosting, the new feature set improved the accuracy to 69.22%.
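The derived features listed above could be computed per sample roughly as follows. This is a minimal sketch, not the authors' implementation: the function name and the sample values are made up, and the encoding of Highwater (1 when the vertical distance is negative, 0 otherwise) is an assumption, because the slides only say that it indicates the sign.

```python
import math


def engineer_features(row):
    """Return the derived features for one sample given as a dict of raw attributes."""
    elevation = row["Elevation"]
    h_hydro = row["Horizontal_Distance_To_Hydrology"]
    v_hydro = row["Vertical_Distance_To_Hydrology"]
    h_road = row["Horizontal_Distance_To_Roadways"]
    h_fire = row["Horizontal_Distance_To_Fire_Points"]
    return {
        # Assumption: 1 marks a negative vertical distance, 0 otherwise.
        "Highwater": 1 if v_hydro < 0 else 0,
        "EVDtH": elevation - v_hydro,
        "EHDtH": elevation - h_hydro * 0.2,
        # Euclidean distance to the nearest surface water feature.
        "Distance_To_Hydrology": math.hypot(h_hydro, v_hydro),
        "Hydro_Road_1": h_hydro + h_road,
        "Hydro_Road_2": h_hydro - h_road,
        "Fire_Road_1": h_fire + h_road,
        "Fire_Road_2": h_fire - h_road,
        "Hydro_Fire_1": h_hydro + h_fire,
        "Hydro_Fire_2": h_hydro - h_fire,
    }


# Hypothetical sample values, chosen only to make the arithmetic easy to follow.
sample = {
    "Elevation": 2596,
    "Horizontal_Distance_To_Hydrology": 30,
    "Vertical_Distance_To_Hydrology": -40,
    "Horizontal_Distance_To_Roadways": 510,
    "Horizontal_Distance_To_Fire_Points": 6279,
}
features = engineer_features(sample)
```

The sums and differences of distances give a tree a single split on a combined quantity that it would otherwise have to approximate with several axis-aligned splits.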
Performance Evaluation of Different Decision Trees
Performance metric          C4.5     C5.0     CART
Accuracy (%)                80.78    91.11    79.45
Size of the tree (nodes)    2111     772      951
Area Under Curve (AUC)      0.92     0.88     0.94
• Pruning:
• Lowering the confidence parameter of the C5.0 decision tree from 0.25 to 0.15 was found to improve the accuracy to 69.25%.
• Ensemble techniques:
• Random forest gave an improved accuracy of 77.24%.
• C5.0 with the boosting parameter trials = 10 gave an improved accuracy of 76.02%.
Conclusion
• The forest cover data of the Roosevelt National Forest of northern Colorado was used to evaluate the performance of various decision tree algorithms.
• Among the decision trees, C5.0 was found to give the highest accuracy.
• The feature engineering techniques applied to the dataset showed an improvement over the original dataset.
• Random forest and boosted C5.0 improved the accuracy by roughly 10% (to 77.24% and 76.02% respectively), which shows that ensemble techniques can enhance the performance of decision trees considerably.
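The size of the ensemble gain can be checked with a short calculation. The sketch below assumes the gain is measured against the 68.44% base case reported under Data Pre-processing; the slides do not state which baseline the "10%" figure refers to, so this is one possible reading. The function and variable names are illustrative.

```python
BASE_CASE_ACCURACY = 68.44  # base case from the pre-processing slide (assumed baseline)

ensemble_accuracy = {
    "Random forest": 77.24,
    "C5.0 with boosting (trials = 10)": 76.02,
}


def gain(accuracy, baseline=BASE_CASE_ACCURACY):
    """Return (gain in percentage points, relative gain in percent) over the baseline."""
    points = accuracy - baseline
    relative = 100.0 * points / baseline
    return round(points, 2), round(relative, 1)


for name, accuracy in ensemble_accuracy.items():
    points, relative = gain(accuracy)
    print(f"{name}: +{points} percentage points ({relative}% relative)")
```

Under that assumption the gains are 8.8 and 7.58 percentage points, or about 12.9% and 11.1% in relative terms.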
