Phyteuma vagneri
Phyteuma confusum
MODELING PRESENT AND PROSPECTIVE DISTRIBUTION OF PHYTEUMA GENUS IN
CARPATHIAN REGION WITH MACHINE LEARNING TECHNIQUES USING OPEN CLIMATIC AND
SOIL DATA
Assoc. Prof. Dr. Alexander Mkrtchian
Main problems in species distribution modeling
Problem | Solution
Data of different formats, different quality, duplicated data | Data preprocessing
Presence-only data available | Introducing randomly distributed background data as a surrogate for absence data
Data distribution violates normality assumption | Using non-parametric methods (machine learning)
Spatially autocorrelated data | Explicit methods to account for spatial autocorrelation; rigorous testing of modeling results on test sets
Main theoretical works
• Mackey B. G., Nix H. A., Hutchinson M. F., Macmahon J. P., Fleming P. M. (1988). Assessing representativeness of places for conservation reservation and heritage listing // Environmental Management, 12 (4), pp. 501–514
• Franklin J. (1995). Predictive vegetation mapping: geographic modelling of biospatial patterns in relation to environmental gradients // Progress in Physical Geography, 19, pp. 474–499
• Guisan A., Zimmermann N. E. (2000). Predictive habitat distribution models in ecology // Ecological Modelling, 135, pp. 147–186
• Elith J., Graham C. H., Anderson R. P. et al. (2006). Novel methods improve prediction of species’ distributions from occurrence data // Ecography, 29, pp. 129–151
• Elith J., Leathwick J. R., Hastie T. (2008). A working guide to boosted regression trees // Journal of Animal Ecology, 77, pp. 802–813
• Elith J., Phillips S. J., Hastie T., Dudík M., Chee Y. E., Yates C. J. (2011). A statistical explanation of MaxEnt for ecologists // Diversity and Distributions, 17, pp. 43–57
• https://rspatial.org/raster/sdm/index.html
Main types of models for species distribution modeling
• BIOCLIM – based on the «ecological envelope» principle, estimating the probability of species occurrence from the position of a site in the attributes hyperspace
• Maxent – based on minimizing the relative entropy of probability densities in the attributes hyperspace
• Generalized linear models (GLM)
• Generalized additive models (GAM)
• Artificial neural networks (ANN)
• Random forest (RF)
• Bootstrap aggregation (bagging)
• Boosted regression trees (BRT)
• Support vector machines (SVM)
From Elith J., Graham C. H., Anderson R. P. et al. (2006). Novel methods improve prediction of species’ distributions from occurrence data // Ecography, 29, pp. 129–151
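The envelope idea behind BIOCLIM can be illustrated with a minimal sketch. It assumes one common formulation (percentile-based, similar to the one described for the `bioclim` function of the R dismo package); the function name `bioclim_score` and the toy values are hypothetical and are not taken from the slides. Each predictor value at a site is converted to its percentile among the values observed at occurrence points, both tails are treated alike, and the most limiting predictor sets the score.

```python
from bisect import bisect_left, bisect_right


def bioclim_score(presence_env, site):
    """Envelope suitability in [0, 1] for a single site.

    presence_env: list of rows, one per occurrence point,
                  each row a list of predictor values.
    site:         list of predictor values at the location to score.
    """
    n = len(presence_env)
    score = 1.0
    for j, value in enumerate(site):
        # Sorted values of predictor j at the known occurrences.
        ref = sorted(row[j] for row in presence_env)
        # Empirical percentile of the site value (ties count half).
        pct = (bisect_left(ref, value) + bisect_right(ref, value)) / (2.0 * n)
        # Fold around the median: the 10th and 90th percentiles are equivalent.
        tail = min(pct, 1.0 - pct)
        # The most limiting predictor determines the final score.
        score = min(score, 2.0 * tail)
    return score


# Toy data: two predictors observed at five occurrence points.
presence = [[1, 10], [2, 20], [3, 30], [4, 40], [5, 50]]
central = bioclim_score(presence, [3, 30])   # median on both predictors
marginal = bioclim_score(presence, [3, 10])  # at the edge on predictor 2
outside = bioclim_score(presence, [3, 99])   # beyond the observed range
```

A site at the median of every predictor receives the highest score, while a site outside the observed range on any single predictor receives zero, which is what makes the method an envelope.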
• Model specifics and selected features have to be meaningful from an ecological viewpoint
• The predicted spatial pattern should look plausible from an expert viewpoint
• Residuals on the test set have to be small and random
• The quality, size and spatial extent of the data, together with an appropriate selection of features, usually matter more than the choice of a particular model and the tuning of its parameters
Data obtained from GBIF for the Carpathian region contained 148 records of the Phyteuma genus in total. After removing duplicates, 80 records were kept. Twice as many points (160) were then designated as background, placed at random across the geographic space of the study area. Six data points (3 observed and 3 simulated background points) were subsequently removed because of gaps in the predictor data, leaving 77 presence and 157 background points.
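The preprocessing described above can be sketched as follows. This is only an illustration under stated assumptions: the slides do not say how duplicates were defined or how the study area was delimited, so the sketch treats records with the same rounded coordinates as duplicates and uses a rectangular bounding box in place of the real study area. The function names, the rounding precision and the bounding box coordinates are all hypothetical.

```python
import random


def deduplicate(records, decimals=5):
    """Drop records whose rounded (lon, lat) was already seen, keeping order."""
    seen = set()
    unique = []
    for lon, lat in records:
        key = (round(lon, decimals), round(lat, decimals))
        if key not in seen:
            seen.add(key)
            unique.append((lon, lat))
    return unique


def sample_background(n_points, bbox, seed=0):
    """Draw n_points uniformly in lon/lat from bbox = (lon_min, lat_min, lon_max, lat_max).

    Uniform sampling in degrees is not equal-area; that is ignored here.
    """
    lon_min, lat_min, lon_max, lat_max = bbox
    rng = random.Random(seed)  # fixed seed makes the draw reproducible
    return [
        (rng.uniform(lon_min, lon_max), rng.uniform(lat_min, lat_max))
        for _ in range(n_points)
    ]


# Hypothetical occurrence records with one repeated location.
records = [(24.5, 48.2), (24.5, 48.2), (23.9, 48.6), (25.1, 47.8)]
presences = deduplicate(records)

# Twice as many background points as presences, as in the slides.
hypothetical_bbox = (22.0, 45.0, 27.0, 50.0)
background = sample_background(2 * len(presences), hypothetical_bbox)
```

Points that fall on cells with missing predictor values would then be dropped from both sets, which is the step that removed six points in the study.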
Data on climatic conditions were derived from the WorldClim database (about 1 km² spatial resolution).
It provides layers of 19 bioclimatic variables, derived from monthly temperature and precipitation values so as to be biologically meaningful. 17 of the 19 bioclimatic variables were selected as predictor variables for SDM.
For data on soil conditions, SoilGrids digital maps of soil properties were used (250 m spatial resolution).
Of the 11 available physical and chemical soil properties, 4 were chosen as the most suitable predictors: soil acidity, organic carbon stock, cation exchange capacity, and total nitrogen. Of the six standard depth intervals available, the 15–30 cm interval was chosen as the most appropriate for the purpose.
• ML methods examined: Maxent, Random Forest (RF), Artificial Neural Networks (ANN), Boosted Regression Trees (BRT)
• Model performance assessment and parameter tuning: AUC and TSS criteria calculated on testing data with 6-fold cross-validation
• Tools used: R programming language and software environment, SDMtune package
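The two criteria and the fold split named above can be written out in a few lines. The study itself used the SDMtune package in R; the Python sketch below is not that implementation, only a plain restatement of the standard definitions, with hypothetical function names and toy scores. AUC is computed as the probability that a randomly chosen presence point scores higher than a randomly chosen background point, and TSS as sensitivity + specificity − 1 at a given threshold (the threshold of 0.45 is an arbitrary assumption).

```python
import random


def auc(labels, scores):
    """Share of (presence, background) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def tss(labels, scores, threshold):
    """True skill statistic: sensitivity + specificity - 1 at the threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn) + tn / (tn + fp) - 1.0


def fold_indices(n_samples, k=6, seed=0):
    """Shuffle sample indices and deal them into k folds of near-equal size."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


# Toy example: 1 = presence, 0 = background, with model scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
toy_auc = auc(labels, scores)
toy_tss = tss(labels, scores, threshold=0.45)

# Six folds over 234 points (148 records reduced to 80, plus 160 background, minus 6).
folds = fold_indices(234, k=6)
```

Each fold serves once as the testing set while the model is fitted on the other five, and the per-fold metrics are then averaged. A purely random split like this one does not address the spatial autocorrelation mentioned earlier in the slides; spatially blocked folds would be a stricter alternative.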
Method | AUC (training) | AUC (testing) | TSS (training) | TSS (testing)
Maxent (ME) | 0.8162 | 0.796 (0.0138) | 0.6114 | 0.5938 (0.0252)
Artificial neural networks (ANN) | 0.9437 | 0.9369 (0.005) | 0.7508 | 0.7891 (0.018)
Random forest (RF) | 1 | 0.9321 (0.0076) | 1 | 0.7851 (0.0261)
Boosted regression trees (BRT) | 0.964 | 0.9304 (0.0062) | 0.8145 | 0.775 (0.0162)
Performance metrics of different SDM methods calculated on the training dataset and with the aforementioned testing procedure. For the testing case, standard deviations are given in parentheses.
Model | AUC (training) | AUC (testing) | TSS (training) | TSS (testing)
Full set of 21 variables | 0.9437 | 0.9369 (0.005) | 0.7508 | 0.7891 (0.018)
Reduced set of 6 variables | 0.94 | 0.9381 (0.0036) | 0.7384 | 0.7893 (0.0132)
Performance metrics of the ANN SDM method calculated with the full and the reduced sets of predictive variables
The ANN method appears to perform best in this case
Relative occurrence probabilities (left) and predicted habitat (right) of Phyteuma
Predicted occurrence probabilities (left) and habitat area (right) of Phyteuma for the year 2050, based on RCP 4.5 climate projections

