KEY FACTORS AND JURISDICTION RECOMMENDATIONS
NewMet Bootcamp – Fall 2015
Final presentation
Ankoor Bhagat
Overview
• Definitions
• Data Sources
• Data Cleaning
• EDA
• Feature Engineering
• Modeling
• Recommendations
Note: Data and code available at: https://github.com/ankoorb/NPO-Project
Definitions
• Objective – Identify factors affecting the garbage production rate
and make market targeting recommendations
• Annual Per Capita Disposal Rate (PPD, pounds per person per day) – Calculated as
Disposal Tons x 2000 lbs / Population / 365
• 50% Target Per Capita Disposal Rate (PPD) – Calculation
• Use the jurisdiction-specific average of the 2003-2006 Per Capita Generation
Rates
• Divide the average Per Capita Generation Rate by 2 to get the amount a
jurisdiction would have disposed of at exactly 50% diversion
• Indicators –
• Primary – Population of Jurisdiction (Per Resident Disposal)
• Secondary – Jurisdiction Industry Employment (Per Employee Disposal)
• Judging Criteria – To meet the 50% goal, a jurisdiction must
dispose of no more than its 50% Per Capita Disposal
Target
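The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the function names and the sample numbers are hypothetical, and only the formulas come from the slide.

```python
LBS_PER_TON = 2000
DAYS_PER_YEAR = 365


def per_capita_disposal_ppd(disposal_tons, population):
    """Annual per capita disposal rate in pounds per person per day."""
    return disposal_tons * LBS_PER_TON / population / DAYS_PER_YEAR


def target_ppd(generation_rates_2003_2006):
    """50% target: half of the mean 2003-2006 per capita generation rate."""
    average = sum(generation_rates_2003_2006) / len(generation_rates_2003_2006)
    return average / 2


def meets_goal(annual_ppd, target):
    """A jurisdiction meets the goal if it disposes of no more than its target."""
    return annual_ppd <= target


# Hypothetical jurisdiction (numbers invented for illustration)
annual = per_capita_disposal_ppd(disposal_tons=36500, population=100000)
target = target_ppd([12.0, 12.4, 11.6, 12.0])
print(annual, target, meets_goal(annual, target))
```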
[Figure: Point Arena and Santa Monica]
Data Sources
• Disposal Rates: California Disposal Progress Report, Years
2007 to 2013: http://www.calrecycle.ca.gov/LGCentral/Reports/jurisdiction/diversiondisposal.aspx
• CalRecycle Program Data: Program Counts by Status, Year
and Jurisdiction Data (2007-2013): CalRecycle
• Crime Data: Criminal Justice Statistics Center Statistics –
Crimes and Clearances (2005-2014): https://oag.ca.gov/crime/cjsc/stats/crimes-clearances
• Solar Data: California Solar Initiative – Working Dataset:
https://www.californiasolarstatistics.ca.gov/data_downloads/
• California City Area – Wikipedia:
https://en.wikipedia.org/wiki/List_of_cities_and_towns_in_California
• Building Construction Permits by Jurisdiction (2007-2013):
State of the Cities Data Systems Database: http://socds.huduser.gov/permits/
• Voter Registration Data: California Report of Registration
(2007-2013): http://www.sos.ca.gov/elections/report-registration/
Data Cleaning
• Changed data structure – row to column / column to row
• Filtered data to keep years 2007 through 2013
• String manipulation
• str to int/float
• Removed unnecessary characters: , $ - * N 2,500- 250,000+
• Jurisdiction names: upper case to lower case, removed "-" (manual spelling
changes so names match during the merging step)
• Renamed columns – original names were very long; kept a Key-ID data dictionary
• Replaced NaN with median values and, in some cases, with 0
• Merged data
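The cleaning steps above might look roughly like this in pandas. This is a sketch under assumptions: the column names and sample values are hypothetical, and replacing "-" with a space (rather than deleting it outright) is one reading of the slide.

```python
import pandas as pd

# Hypothetical raw extract; names and values are invented for illustration
raw = pd.DataFrame({
    "Jurisdiction": ["Santa Monica", "Point-Arena", "SANTA MONICA "],
    "Median Income": ["$72,271", "*", "N"],
})


def clean_name(name):
    """Lower-case a jurisdiction name and drop hyphens so merge keys match."""
    return name.strip().lower().replace("-", " ")


clean = raw.copy()
clean["Jurisdiction"] = clean["Jurisdiction"].map(clean_name)

# Strip thousands separators and currency signs, then convert to numbers.
# Placeholder markers such as "*" and "N" become NaN via errors="coerce".
income = clean["Median Income"].str.replace(r"[,$]", "", regex=True)
clean["Median Income"] = pd.to_numeric(income, errors="coerce")

# Fill missing values with the column median (the slides use 0 in some cases)
clean["Median Income"] = clean["Median Income"].fillna(clean["Median Income"].median())
print(clean)
```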
Data Cleaning
• Initial stats – 378 jurisdictions and 946 features
• After removing the 2007 to 2012 data – 378 jurisdictions and
380 features
• After feature engineering – 378 jurisdictions and 45 features
(more on this later)
EDA (2013 Data)
• Histograms
• Joint Distributions
• Pearson Correlation
EDA (2013 Data)
Feature Engineering
• Ethnic Diversity Index
• Voter Registration Rate
• Republican to Democratic Ratio
• Major Crime to Minor Crime Ratio
• Percent Violent Crime
• Total Crime/1000 Inhabitants
• Crime Index
• Theil Index (Household Income)
• Mean Logarithmic Deviation (Household Income)
• Household Income Ratio
• Income Index
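Two of the inequality features listed above have standard textbook definitions, sketched here. The slides do not show how the project computed them (Census household income is usually binned, which would need an extra step), so treat this as an illustration on raw income values with hypothetical function names.

```python
import math


def theil_index(incomes):
    """Theil T index: mean of (x/mu) * ln(x/mu) over all incomes."""
    mean = sum(incomes) / len(incomes)
    return sum((x / mean) * math.log(x / mean) for x in incomes) / len(incomes)


def mean_log_deviation(incomes):
    """Mean logarithmic deviation (Theil L): mean of ln(mu/x)."""
    mean = sum(incomes) / len(incomes)
    return sum(math.log(mean / x) for x in incomes) / len(incomes)


# Both measures are 0 for perfectly equal incomes and grow with inequality
equal = [50000, 50000, 50000]
unequal = [20000, 50000, 200000]
print(theil_index(equal), theil_index(unequal))
print(mean_log_deviation(equal), mean_log_deviation(unequal))
```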
Feature Engineering
• Per Capita Income Index
• Travel Time Index
• Median Income Index
• Male to Female Median Earning Income Index
• Residential Solar Units/Person
• Residential Solar Units/Household
• and a lot more…
Feature Engineering
• Some Plots
Modeling
• Calculated the difference between Target Residential PPD and Annual
Residential PPD
• Discretized the difference by quantiles. Intervals and labels:
• [-1.9, 1.3] – Low
• (1.3, 2.4] – Fair
• (2.4, 3.6] – Good
• (3.6, 7052.8] – Excellent
• Clustering – looked at groups of jurisdictions that have similar
social, economic, housing, demographic, and political
characteristics
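Quantile-based labels like the ones above can be produced with `pandas.qcut`. The slides do not say which function was used, so this is an assumption, and the sample differences are hypothetical.

```python
import pandas as pd

# Hypothetical values of (Target Residential PPD - Annual Residential PPD);
# a larger positive difference means disposal is further below the target
diff = pd.Series([-1.9, 0.5, 1.3, 1.8, 2.4, 3.0, 3.6, 10.0])

labels = ["Low", "Fair", "Good", "Excellent"]

# q=4 splits the data at its quartiles, so each label gets an equal share
performance = pd.qcut(diff, q=4, labels=labels)
print(pd.DataFrame({"difference": diff, "performance": performance}))
```

Because the bin edges are quartiles of the data, a single extreme value (such as the 7052.8 upper bound on the slide) only stretches the top interval and does not change the other labels.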
Modeling
• K-Means Clustering – checked silhouette scores (scores were closer
to 0, which suggests overlapping clusters)
• Used Linear Discriminant Analysis on the labels from
K-Means to visually inspect clusters – n_clusters = 4 chosen
• Decision Tree to check feature importance
• Random Forest to check feature importance and
accuracy
• Cross-validation – 10 folds
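The workflow above could be sketched with scikit-learn as follows. Synthetic data stands in for the 378 jurisdictions, and using the K-Means cluster labels as the Random Forest target is an assumption; the slides do not state what the classifiers predicted.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import silhouette_score
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the jurisdiction feature matrix (378 rows)
X, _ = make_blobs(n_samples=378, n_features=10, centers=4, random_state=0)
X = StandardScaler().fit_transform(X)

# Compare silhouette scores across candidate cluster counts
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Fit the chosen model and project with LDA for a 2-D visual check
cluster_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, cluster_labels)

# Random Forest: 10-fold cross-validated accuracy and feature importances
forest = RandomForestClassifier(n_estimators=100, random_state=0)
cv_accuracy = cross_val_score(forest, X, cluster_labels, cv=10)
forest.fit(X, cluster_labels)
importances = forest.feature_importances_

print(scores)
print(cv_accuracy.mean())
```

A silhouette score near 1 indicates well-separated clusters, near 0 indicates overlapping clusters, and negative values indicate points that are likely assigned to the wrong cluster.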
Modeling
• It was still not clear what each cluster meant
• Selected features with feature importance > 0.05
• Plotted PCA biplots for the different clusters
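The selection and biplot steps above can be illustrated with NumPy alone. The feature names, importance values, and data are hypothetical; only the 0.05 threshold comes from the slide. A biplot draws the `scores` as points and the `loadings` as arrows, one per feature.

```python
import numpy as np

# Hypothetical feature names and importances (e.g. from a Random Forest)
feature_names = ["crime_index", "theil_index", "voter_reg_rate", "solar_per_household"]
importances = np.array([0.40, 0.04, 0.35, 0.21])

# Keep only features whose importance exceeds the 0.05 threshold
keep = importances > 0.05
selected = [name for name, flag in zip(feature_names, keep) if flag]

# Random data standing in for one cluster's jurisdictions
rng = np.random.default_rng(0)
X = rng.normal(size=(50, len(feature_names)))[:, keep]

# PCA via SVD of the centered data
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U[:, :2] * S[:2]        # jurisdiction coordinates on PC1 and PC2
loadings = Vt[:2].T              # one row per selected feature
explained = S**2 / np.sum(S**2)  # share of variance per component

print(selected)
print(explained[:2])
```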
Recommendations
• Good Performers
• Recommended Jurisdictions
Thank You!
