Predictive association between trait data and eco-geographic data for Nordic barley landraces (Gatersleben, 2010-05-12)

A Lifeboat to the Gene Pool
Predictive association between trait data and eco-geographic data for identification of trait properties useful for improvement of food crops
Vavilov Seminar at IPK Gatersleben
May 12, 2010 - Dag Endresen, NordGen

Topics: Utilization of genetic diversity

Domestication bottleneck: unlocking genetic potential from the wild
[Figure: wild tomato and tomato; teosinte and corn (maize)]

Crop Genetic Diversity
[Figure labels: Crop Wild Relatives, Traditional landraces, Modern cultivars]
Genetic bottlenecks during crop domestication and during modern plant breeding. The circles represent allelic variation. The funnels represent allelic variation of genes found in the crop wild relatives, but gradually lost during domestication, traditional cultivation and modern plant breeding.
Illustration based on: Tanksley, Steven D. and Susan R. McCouch 1997. Seed Banks and Molecular Maps: Unlocking Genetic Potential from the Wild. Science 277 (5329), 1063 (22 August 1997). doi:10.1126/science.277.5329.1063

Plant Genetic Resources for Crop Improvement
* Primitive crops and traditional landraces are an important source of novel traits for improvement of modern crops.
* Landraces are often not well described for the economically valuable traits.
* Identification of novel crop traits will often be the result of a larger field trial screening project (thousands of individual plants).
* Large-scale field trials are very costly in both land area and human working hours.

Challenges for improved utilization of genetic resources for crop improvement:
* Large gene bank collections
* Limited screening capacity

A needle in a haystack
* Scientists and plant breeders want a few hundred germplasm accessions to evaluate for a particular trait.
* How does the scientist select a small subset likely to have the useful trait?
* Example: More than 560 000 wheat accessions in genebanks worldwide.
Slide adapted from a slide by Ken Street, ICARDA (FIGS team)

Core collection subset
* The scientist or the breeder needs a smaller subset to cope with the field screening experiments.
* A common approach is to create a so-called core collection.
* Sir Otto H. Frankel (1900-1998) proposed (1984) a limited set, established from an existing collection, with minimum similarity between its entries.
* The core collection is of limited size and chosen to represent the genetic diversity of a large collection.

Core subset selection
Given that the trait property you are looking for is relatively rare, perhaps as rare as a unique allele for one single landrace cultivar, getting what you want is largely a question of LUCK!
Slide adapted from a slide by Ken Street, ICARDA (FIGS team)

FIGS analysis method
Focused Identification of Germplasm Strategy

Focused Identification of Germplasm Strategy
Objective of this method: Explore climate data as a prediction model for “computer pre-screening” of crop traits BEFORE full scale field trials.
Identification of landraces with a higher probability of holding an interesting trait property.

Climate effect during the cultivation process
Primitive cultivated crops are shaped by local climate and humans
* Wild relatives are shaped by the environment
* Traditional cultivated crops (landraces) are shaped by climate and humans
* Modern cultivated crops are mostly shaped by humans (plant breeders)
* Perhaps future crops are shaped in the molecular laboratory…?

Predictive pattern between eco-geography and trait
The predictive pattern between the eco-geography and the traits can of course also have other sources than adaptation.
During traditional cultivation the farmer will also select for and introduce germplasm for improved suitability of the landrace to the local conditions.

FIGS selection method
Assumption: the climate at the original source location, where the landrace was developed during long-term traditional cultivation, is correlated to the trait score.
Aim: to build a computer model explaining the crop trait score (dependent variables) from the climate data (independent variables).
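The aim stated on this slide can be illustrated with the simplest possible case: one trait explained from one climate variable by ordinary least squares. This is only a sketch under stated assumptions. The talk itself uses multivariate methods, and the numbers, the variable names and the helper functions `fit_line` and `predict` below are hypothetical illustration material, not data or code from the study.

```python
# Hypothetical illustration data (not from the study): mean July temperature
# (deg C) at the collecting site and an observed trait score (days to heading).
climate = [12.0, 13.5, 14.0, 15.5, 16.0, 17.5]
trait = [70.0, 67.0, 66.5, 63.0, 62.5, 59.0]


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(climate, trait)


def predict(temperature):
    """Predicted trait score for a site with the given climate value."""
    return intercept + slope * temperature


# "Computer pre-screening": score a site that has climate data but no trait data.
print(round(predict(15.0), 2))
```

With many climate variables and several traits the same idea needs a multivariate method, but the roles stay the same: the climate data are the independent variables and the trait scores are the dependent variables.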

We combine three datasets
* Landrace samples (genebank seed accessions)
* Trait observations (experimental design) - High cost data
* Climate data (for the landrace location of origin) - Low cost data
The accession identifier (accession number) provides the bridge to the crop trait observations.

The longitude, latitude coordinates for the original collecting site of the accessions (landraces) provide the bridge to the environmental data.
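The two bridges described above (accession number to trait data, coordinates to climate data) amount to a pair of joins. The following is a minimal sketch of that idea; the accession identifiers, coordinates, trait scores, climate values and the `combine` helper are all hypothetical and only show the shape of the data.

```python
# Hypothetical accessions with collecting-site coordinates (None = not georeferenced).
accessions = {
    "ACC-1": {"lat": 55.9, "lon": 12.8},
    "ACC-2": {"lat": 59.7, "lon": 10.8},
    "ACC-3": {"lat": None, "lon": None},
}

# Hypothetical trait observations, keyed by accession identifier.
trait_scores = {"ACC-1": 62.0, "ACC-2": 68.5, "ACC-3": 65.0}

# Hypothetical climate values, keyed by (latitude, longitude).
climate_by_site = {
    (55.9, 12.8): {"tmin_jan": -2.0, "prec_jul": 60.0},
    (59.7, 10.8): {"tmin_jan": -7.5, "prec_jul": 75.0},
}


def combine(accessions, trait_scores, climate_by_site):
    """Join traits (via accession id) and climate (via coordinates)."""
    rows = []
    for acc_id, site in accessions.items():
        key = (site["lat"], site["lon"])
        # Accessions lacking trait data or a georeferenced site cannot be used.
        if acc_id not in trait_scores or key not in climate_by_site:
            continue
        rows.append({"accession": acc_id,
                     "trait": trait_scores[acc_id],
                     **climate_by_site[key]})
    return rows


rows = combine(accessions, trait_scores, climate_by_site)
for row in rows:
    print(row)
```

Accessions without usable coordinates drop out of the combined table, which is why georeferencing the collecting sites matters for this method.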

1. Genetic resources, genebank collections
[Photos: Lima, Peru; Alnarp, Sweden; Svalbard; Benin]
More than 7.4 million genebank accessions in more than 1 400 genebanks worldwide.

2. Trait data, descriptive crop data
[Photos: Field trials, Gatersleben, Germany; Potato, Priekuli, Latvia; Faba bean, Finland; Linnés äpple; Forage crops, Dotnuva, Lithuania; Radish (S. Jeppson)]
* Powdery mildew, Blumeria graminis
* Leaf spots, Ascochyta sp.
* Yellow rust, Puccinia striiformis
* Black stem rust, Puccinia graminis
http://barley.ipk-gatersleben.de

3. Climate data – WorldClim
The climate data can be extracted from the WorldClim dataset. http://www.worldclim.org/
Data from weather stations worldwide are combined to a continuous surface layer. Climate data for each landrace is extracted from this surface layer.
* Precipitation: 20 590 stations
* Temperature: 7 280 stations
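Extracting a value from a continuous surface layer comes down to finding the grid cell that contains the collecting site. The sketch below assumes a tiny regular latitude/longitude grid with a known north-west corner and cell size. The grid values, the geometry and the function names `cell_index` and `extract` are hypothetical and do not reflect the actual WorldClim file layout.

```python
import math


def cell_index(lat, lon, north, west, cell_size):
    """Row/column of the grid cell containing (lat, lon).

    Rows count southwards from the northern edge, columns eastwards
    from the western edge.
    """
    row = math.floor((north - lat) / cell_size)
    col = math.floor((lon - west) / cell_size)
    return row, col


def extract(grid, lat, lon, north, west, cell_size):
    """Value of the surface layer at (lat, lon), or None outside the grid."""
    row, col = cell_index(lat, lon, north, west, cell_size)
    if 0 <= row < len(grid) and 0 <= col < len(grid[0]):
        return grid[row][col]
    return None


# Hypothetical 3 x 4 layer (e.g. July precipitation in mm), 1-degree cells,
# north-west corner at 60 N, 10 E.
july_precipitation = [
    [70.0, 72.0, 75.0, 78.0],
    [65.0, 66.0, 68.0, 71.0],
    [60.0, 61.0, 63.0, 66.0],
]

value = extract(july_precipitation, lat=58.5, lon=12.3,
                north=60.0, west=10.0, cell_size=1.0)
print(value)
```

In practice a GIS library would read the raster and handle the geometry; the lookup logic is the same.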

FIGS – Focused Identification of Germplasm Strategy
FIGS selection is a new method to predict crop traits of primitive cultivated material from climate variables by using multivariate statistical methods.

What is FIGS
http://www.figstraitmine.org/
Focused Identification of Germplasm Strategy
Origin of Concept (1980s): Wheat and barley landraces from marine soils in the Mediterranean region provided genetic variation for tolerance to boron toxicity.
[Map labels: Mediterranean region, South Australia]
Slide made by Michael Mackay 1995

FIGS
The FIGS technology takes much of the guesswork out of choosing which accessions are most likely to contain the specific characteristics being sought by plant breeders to improve plant productivity across numerous challenging environments. http://www.figstraitmine.org/
FIGS salinity set

Slide made by Michael Mackay 1995

Ecological Niche Modeling
Species Distribution Models
* The fundamental ecological niche of an organism was formalized by G. E. Hutchinson in 1957 as a multidimensional hypervolume defining the ecological conditions that allow a species to exist.
* A computer model of the occurrence localities together with associated environmental conditions such as rainfall, temperature, day length etc., provides an approximation of the fundamental niche.
* Popular software implementations for modeling the ecological niche include openModeller, MaxEnt, BioCLIM, DesktopGARP, etc.
[Photo: George Evelyn Hutchinson (1903 – 1991)]

Data for the simulation model
Training set
* For the initial calibration or training step.
Calibration set
* Further calibration, tuning step.
* Often cross-validation on the training set is used to reduce the consumption of raw data.
Test set
* For the model validation or goodness of fit testing.
* New external data, not used in the model calibration.
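The partitioning described above can be sketched as follows: a test set is held back first, and the remaining training samples are divided into folds for cross-validation instead of setting aside a separate calibration set. The sample identifiers, the 30 % test fraction, the number of folds and the helper names are assumptions made for illustration only.

```python
import random


def split_train_test(ids, test_fraction, seed):
    """Shuffle the sample ids and hold back a fraction as external test set."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def k_folds(ids, k):
    """Divide the training ids into k disjoint cross-validation folds."""
    return [ids[i::k] for i in range(k)]


# Hypothetical identifiers for 14 samples.
sample_ids = [f"S{i:02d}" for i in range(1, 15)]

train, test = split_train_test(sample_ids, test_fraction=0.3, seed=1)
folds = k_folds(train, k=5)

print("training:", train)
print("test:    ", test)
for i, fold in enumerate(folds):
    # Each fold is left out once for tuning while the others calibrate the model.
    print(f"fold {i}: {fold}")
```

The test samples are never touched during calibration or tuning, so they can serve as the new external data that the validation step requires.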

A model of the real world
Validation step
* No model can ever be absolutely correct
* A simulation model can only be an approximation
* A model is always created for a specific purpose
Apply the model
* The simulation model is applied to make predictions based on new fresh data
* Take care to avoid extrapolation problems

Model validation
* Residual analysis (RMSE)
* Pearson Product-Moment Correlation Coefficient (r)
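Both validation measures named above are short to compute. The sketch below uses the standard definitions of the root mean square error and the Pearson correlation coefficient; the observed and predicted values are hypothetical and chosen so that every prediction is off by the same amount.

```python
import math


def rmse(observed, predicted):
    """Root mean square error of the residuals (observed - predicted)."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical validation data: every prediction is 0.5 too high.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]

print(rmse(observed, predicted))       # 0.5
print(pearson_r(observed, predicted))  # 1.0
```

The example shows why the two measures complement each other: a constant bias leaves r at 1, while the RMSE still reports the size of the error.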

Residuals (validate model fit)
The residuals are the distances between the model predictions and the reference (validation) values.
[Figure: Example of a bad model calibration]
Calibration step: Cross-validation indicates the appropriate model complexity.
Be aware of over-fitting! NB! Model validation!
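The over-fitting warning can be made concrete with a toy example. The sketch below deliberately swaps in a k-nearest-neighbour regression as a stand-in model, because its complexity is controlled by a single number; the talk itself uses N-PLS, where the corresponding choice is the number of components. The data and all function names are hypothetical.

```python
import math


def knn_predict(train, x_new, k):
    """Mean trait value of the k training points closest to x_new."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x_new))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def rmse(pairs):
    """Root mean square error over (observed, predicted) pairs."""
    return math.sqrt(sum((o - p) ** 2 for o, p in pairs) / len(pairs))


def calibration_rmse(data, k):
    """Error when each point is predicted by a model that has seen it."""
    return rmse([(y, knn_predict(data, x, k)) for x, y in data])


def loo_cv_rmse(data, k):
    """Leave-one-out cross-validation: each point is predicted without itself."""
    pairs = []
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        pairs.append((y, knn_predict(rest, x, k)))
    return rmse(pairs)


# Hypothetical data: a linear trend with alternating +1 / -1 noise.
data = [(float(x), x + (1.0 if x % 2 == 0 else -1.0)) for x in range(10)]

candidates = [1, 2, 3, 4]
cv_errors = {k: loo_cv_rmse(data, k) for k in candidates}
best_k = min(cv_errors, key=cv_errors.get)

for k in candidates:
    print(k, round(calibration_rmse(data, k), 3), round(cv_errors[k], 3))
print("complexity chosen by cross-validation: k =", best_k)
```

With k = 1 the calibration residuals are exactly zero because the model memorizes the noise, yet the cross-validated error is clearly above zero. Only the cross-validated error is a fair guide to model complexity.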

Morphological traits in Nordic Barley landraces
Field observations by Agnese Kolodinska Brantestam (2005)
Multi-way N-PLS data analysis, Dag Endresen (2009)
[Map labels: Priekuli (L), Bjorke (N), Landskrona (S)]

Landrace origin locations (georeferencing)
From a total of 19 landrace accessions included in the dataset, only 4 included geo-referenced coordinates in the NordGen SESTO database. 10 accessions were geo-referenced from the reported place name and descriptions of the original gathering site included in SESTO and other sources. For 5 accessions there was not enough information available to locate the original gathering location.
Illustration (right side): Example of georeferencing for NGB9529, landrace reported as originating from Lyderupgaard, using KRAK.dk and maps.google.com

Landrace origin locations (gathering sites)

Multi-way analysis with PLS Toolbox and MATLAB

3-way cube model (climate data, X)
3-way cube: 14 landraces (location of origin) x 12 monthly means x 3 climate variables
Climate data (mode 3): Minimum temperature, Maximum temperature, Precipitation … (many more layers can be added)
2-way array (bi-linear): 14 samples x 36 variables (Jan, Feb, Mar, … for each of Min. temperature, Max. temperature and Precipitation)
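The relation between the 3-way cube and the 2-way array on this slide is a plain rearrangement: the 12 monthly columns of each climate variable are laid side by side, giving 3 x 12 = 36 variables per landrace. The sketch below uses placeholder labels instead of real climate values so that the layout is visible; the `unfold` helper is hypothetical.

```python
N_SAMPLES, N_MONTHS, N_VARS = 14, 12, 3  # dimensions stated on the slide

# Placeholder cube indexed as cube[sample][month][variable]; each entry is a
# label recording its own position instead of a real climate value.
cube = [[[f"s{s}-m{m}-v{v}" for v in range(N_VARS)]
         for m in range(N_MONTHS)]
        for s in range(N_SAMPLES)]


def unfold(cube):
    """Unfold a samples x months x variables cube to samples x (variables*months).

    Column order follows the slide: all months of variable 0 (min. temperature),
    then all months of variable 1 (max. temperature), then precipitation.
    """
    n_vars = len(cube[0][0])
    return [[sample[m][v] for v in range(n_vars) for m in range(len(sample))]
            for sample in cube]


X2 = unfold(cube)
print(len(X2), "samples x", len(X2[0]), "variables")
print(X2[0][:3])     # Jan, Feb, Mar of the first variable
print(X2[0][12:15])  # Jan, Feb, Mar of the second variable
```

A bi-linear method works on the unfolded array, while a multi-way method such as N-PLS or PARAFAC keeps the cube intact and models months and variables as separate modes.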

3-way cube model (trait data, Y)
3-way cube: 14 samples, landraces (x2) x 6 crop traits x 6 year + location
Mode 2 (Traits)
* Heading days
* Ripening days
* Length of plant
* Harvest index
* Volumetric weight
* Grain weight (tgw)
Mode 3 (experiment site)
* Latvia, 2002
* Latvia, 2003
* Norway, 2002
* Norway, 2003
* Sweden, 2002
* Sweden, 2003
2-way array (bi-linear): 14 samples (x2) x 36 variables (6 traits for each of Bjørke (N) 2002, Bjørke (N) 2003, Landskrona (S) 2002, Landskrona (S) 2003, Priekuli (Lv) 2002, Priekuli (Lv) 2003)

Trait scores (pre-processing)
Here: Across mode 2 (traits)
Auto-scaling is a combination of mean centering and variance scaling.

Mean centering removes the absolute intensity so that the model does not focus on the variables with the highest numerical values (intensity).

Scaling makes the relative distribution of values (range spread) more equal between variables.

After auto-scaling all variables have a mean of zero and a standard deviation of one.

The objective is to help the model to separate the relevant information from the noise.
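Auto-scaling as described above takes only a few lines. The sketch below scales two hypothetical trait columns with very different magnitudes; the values are invented, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption, since the slides do not say which variant was applied.

```python
import statistics


def autoscale(column):
    """Mean-center a variable and scale it to unit standard deviation."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)  # sample standard deviation (assumption)
    return [(value - mean) / sd for value in column]


# Hypothetical trait columns on very different numerical scales.
plant_length_cm = [85.0, 92.0, 78.0, 101.0, 95.0]
harvest_index = [0.41, 0.38, 0.45, 0.36, 0.40]

scaled_length = autoscale(plant_length_cm)
scaled_index = autoscale(harvest_index)

for name, column in [("length", scaled_length), ("harvest index", scaled_index)]:
    print(name,
          round(statistics.mean(column), 6),
          round(statistics.stdev(column), 6))
```

After scaling, both traits carry equal weight in the model even though plant length is numerically about two hundred times larger than harvest index.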

Trait dataset - outlier
Outlier: NGB6300, replicate 2 from Priekuli 2003 (LYR122)
The influence plot (residuals against leverage) shows sample NGB6300 (FRO) observed at Priekuli in 2003 (replicate 2) with a very high leverage - well separated from the “data cloud”. After looking into the raw data (shown in a table on the slide), this observation point was removed as an outlier (set to NaN).

PARAFAC split-half, trait data (3-way)
PARAFAC split-half (mode 1) analysis:
The two PARAFAC models, each calibrated from one of two independent split-half subsets, both converge to the same solution.
The PARAFAC 3-way method thus produces a stable model for this dataset.
