Towards the modelling and mapping of
active questing ticks using random forest
Irene GARCIA-MARTI
Raúl ZURITA-MILLA
Session B3: GeoHealth
Spatial Statistics Conference
5th July 2017
Context
• Re-emergence of vector-borne diseases (VBD):
  • Global & socio-economic changes
  • VBD: 23 diseases (including Zika)
• Changes + ticks:
  • Longer tick season
  • Increase in tick populations
  • Northward + elevation expansion
  • New suitable habitats
Source: WHO
Data
• Since 2006:
  • Trained volunteers sample 17 locations in NL on a monthly basis
  • Count ticks in their different life stages (larvae, nymphs, adults)
  • First citizen science project of its kind!
Source: WUR
Data
Habitat:
• Coniferous
• Deciduous
• Bushlands
Questing ticks in the day:
• 2006 – 2014
• ~3,000 observations
• Two transects per site
Data
From “Lyme disease: The ecology of a complex system”
R.Ostfeld (2012)
Objectives
• Multivariate regression with Random Forest:
  • Goals:
    • Predict tick dynamics in space and time
    • Assess feature importance to predict tick dynamics
  • Data-driven approach:
    • Does not assume any distribution
    • Fits non-linear natural processes
    • Deals with high-dimensionality
Methodology
Weather:
• T, P, EV, RH, SD, VP
• KNMI, 1 km, daily
Remote sensing:
• Veg. indices
• MODIS, 500 m, 8-day
Ecology:
• Tick habitat, mast years, land cover
Methodology
2. Modelling w/ Random Forest:
• Ensemble learning model
  • Forest of decision trees (DT)
  • Trained w/ bootstrap samples
• Characteristics:
  • Perturbation:
    • Ensure diversity
    • Trees have high variance
  • Random selection of features
  • Variable importance
Tree T
Forest of decision trees
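Why diversity matters for high-variance trees can be made explicit with a standard textbook identity (not taken from the slides). Assuming B identically distributed trees, each with prediction variance \sigma^2 and pairwise correlation \rho (all three symbols are introduced here for illustration), the variance of the averaged prediction is:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
```

Adding trees shrinks only the second term, so the bootstrap samples and the random selection of features, which lower \rho, are what reduce the first.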
Methodology
Ensemble Learning
Type of learner:
• Weak: DT
• Strong: Bayesian
Growing scheme:
• Bagging
• Boosting
Combination rule:
• Algebraic combiner
• Voting-based method
+ RSF = Random Forest
Methodology
1) Draw a random bootstrap sample from the dataset (bagging)
2) Grow a tree estimator on the selected samples (DT)
• Select n features at random (RSM)
• Find the best split on the selected n features (splitting)
• Grow the tree estimator: child nodes (branching)
3) Repeat 1-2 for a number of tree estimators
4) Combine the predictions by averaging (combining)
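The four steps above can be sketched in plain Python. This is a minimal illustration using only the standard library, not the authors' implementation: the function names, the depth limit, and the toy data (a made-up humidity and temperature column with a step-shaped target) are all assumptions introduced here.

```python
import random
from statistics import mean


def best_split(rows, targets, feature_ids):
    """Return the (feature, threshold) pair with the lowest summed squared error."""
    best, best_sse = None, float("inf")
    for f in feature_ids:
        # Candidate thresholds: every distinct value except the smallest,
        # so both sides of the split are non-empty.
        for thr in sorted({r[f] for r in rows})[1:]:
            left = [t for r, t in zip(rows, targets) if r[f] < thr]
            right = [t for r, t in zip(rows, targets) if r[f] >= thr]
            m_left, m_right = mean(left), mean(right)
            sse = (sum((t - m_left) ** 2 for t in left)
                   + sum((t - m_right) ** 2 for t in right))
            if sse < best_sse:
                best, best_sse = (f, thr), sse
    return best


def grow_tree(rows, targets, n_features, rng, depth=0, max_depth=4):
    """Grow one regression tree; a leaf is a number, a node is a 4-tuple."""
    if depth >= max_depth or len(set(targets)) == 1:
        return mean(targets)
    # Step 2a: select n features at random (random subspace method).
    feature_ids = rng.sample(range(len(rows[0])), n_features)
    # Step 2b: find the best split on the selected features.
    split = best_split(rows, targets, feature_ids)
    if split is None:  # the sampled features are constant in this node
        return mean(targets)
    f, thr = split
    left = [(r, t) for r, t in zip(rows, targets) if r[f] < thr]
    right = [(r, t) for r, t in zip(rows, targets) if r[f] >= thr]
    # Step 2c: grow the child nodes recursively.
    return (f, thr,
            grow_tree([r for r, _ in left], [t for _, t in left],
                      n_features, rng, depth + 1, max_depth),
            grow_tree([r for r, _ in right], [t for _, t in right],
                      n_features, rng, depth + 1, max_depth))


def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf value."""
    while isinstance(node, tuple):
        f, thr, left, right = node
        node = left if row[f] < thr else right
    return node


def fit_forest(rows, targets, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):  # Step 3: repeat for a number of trees
        # Step 1: bootstrap sample (draw with replacement, same size as the data).
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(grow_tree([rows[i] for i in idx],
                                [targets[i] for i in idx], n_features, rng))
    return forest


def predict_forest(forest, row):
    # Step 4: combine the tree predictions by averaging.
    return mean([predict_tree(tree, row) for tree in forest])


# Made-up toy data: (humidity, temperature) rows, high activity when humid.
rows = [(h, (h * 7) % 11 + 10) for h in range(40, 100, 5)]
targets = [10.0 if h >= 70 else 1.0 for h, _ in rows]
forest = fit_forest(rows, targets, n_trees=25, n_features=1, seed=42)
dry = predict_forest(forest, (45, 15))
humid = predict_forest(forest, (90, 15))
print(dry, humid)
```

In practice a library implementation would replace this sketch; it is only meant to show where bagging, the random feature selection, and the averaging enter.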
Methodology
• In general, ML algorithms do not handle time well:
  • Multiple weather conditions yield the same counts of tick activity
  • This hampers the learning process of the algorithm
3. Temporal Z-scores:
• Re-scaling the dependent variable
• Descriptive statistics in the long time-series to find the range
• Trees can distinguish between conditions that yield low tick activity
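One way to read the re-scaling step is sketched below. The slides do not give the exact formula, so this assumes a conventional z-score computed per sampling site from the mean and standard deviation of that site's whole time-series; the function name and the toy counts are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def temporal_z_scores(observations):
    """Re-scale tick counts per site: z = (count - site mean) / site std.

    `observations` is a list of (site, count) pairs in time order.
    Assumption: the long-term statistics are computed per site.
    """
    by_site = defaultdict(list)
    for site, count in observations:
        by_site[site].append(count)
    stats = {site: (mean(counts), pstdev(counts))
             for site, counts in by_site.items()}
    z_scores = []
    for site, count in observations:
        mu, sigma = stats[site]
        # A site with constant counts has no spread; use 0.0 for it.
        z_scores.append((count - mu) / sigma if sigma > 0 else 0.0)
    return z_scores


# Two made-up sites with very different absolute tick counts.
observations = [("A", 2), ("A", 4), ("A", 6), ("B", 20), ("B", 40), ("B", 60)]
z = temporal_z_scores(observations)
print(z)
```

Under this assumption, a low-activity site and a high-activity site end up on the same scale, so the regression target describes how unusual a count is for its site rather than its absolute size.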
Results
Top ten most important features:
• Percent of tick activity captured by each feature alone
• Short-term vs long-term weather events
• Atmospheric water levels seem to be prominent predictors of tick activity
#   Feature             Time unit  Importance
1   Evapotranspiration  365        15%
2   Relative humidity   30         11%
3   Min. Temp.          365        7%
4   Evapotranspiration  2          5%
5   Precipitation       3          4%
6   Evapotranspiration  5          3%
7   Relative humidity   365        3%
8   Min. Temp.          90         2%
9   Precipitation       365        2%
10  Land Cover          -          2%
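The "Time unit" column suggests that each weather variable enters the model aggregated over several look-back periods. The sketch below assumes the unit is days and the aggregate is a trailing mean over a daily series; neither is stated on the slide, and the function name and the humidity values are made up for illustration.

```python
def trailing_mean(series, day, window):
    """Mean of a daily series over the `window` days ending at index `day`.

    Assumption: windows are counted in days and truncated at the series start.
    """
    start = max(0, day - window + 1)
    values = series[start:day + 1]
    return sum(values) / len(values)


# Made-up daily relative humidity (%), one value per day.
humidity = [80, 82, 78, 90, 86, 70, 75]

# Short look-back windows taken from the table; longer ones (30, 90, 365)
# would need a correspondingly longer series.
features = {f"rh_{w}d": trailing_mean(humidity, day=6, window=w)
            for w in (2, 3, 5)}
print(features)
```

Building short and long windows for the same variable is what would let a model weigh short-term against long-term weather events.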
Results
Predicted tick activity
June 1st, 2014
• The trained model is applied to each forested pixel of the Netherlands
• Interpretation:
  • Provinces of Drenthe and Groningen presented high tick activity on that day
  • Randstad area presented the lowest tick activity
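Mapping amounts to running the fitted model once per forest pixel. A minimal sketch, assuming the predictors are stored as a grid of feature tuples alongside a boolean forest mask; `predict_map`, `toy_model`, and the 2x2 grid are hypothetical stand-ins, not the authors' raster pipeline.

```python
def predict_map(feature_grid, forest_mask, predict):
    """Apply `predict` to every forest pixel; non-forest pixels stay None."""
    return [
        [predict(features) if is_forest else None
         for features, is_forest in zip(feature_row, mask_row)]
        for feature_row, mask_row in zip(feature_grid, forest_mask)
    ]


def toy_model(features):
    """Hypothetical stand-in for the trained Random Forest."""
    humidity, temperature = features
    return 0.1 * humidity + 0.05 * temperature


# Made-up 2x2 raster: (humidity, temperature) per pixel and a forest mask.
feature_grid = [[(80, 15), (60, 18)],
                [(90, 12), (70, 20)]]
forest_mask = [[True, False],
               [True, True]]

activity = predict_map(feature_grid, forest_mask, toy_model)
print(activity)
```

Masking first keeps the model from being applied to land cover it was never trained on.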
Conclusions
• Case of data fusion of volunteered data and environmental data:
  • Volunteered dataset is very noisy, but we could model transects with high and low activity together
  • Predict tick activity in space and time
  • Assess the feature importance to predict tick activity
• Added time-awareness to a regression with Random Forest
Citizen science initiatives allow the crowdsourced monitoring of environmental phenomena and can produce geospatial data collections to serve scientific workflows
Thanks for your attention

