Predicting Phenotype from Multi-Scale Genomic
and Environment Data using Neural Networks
and Knowledge Graphs: An Introduction to the
NSF GenoPhenoEnvo Project
Anne E Thessen, Michael Behrisch, Emily J Cain, Remco Chang, Bryan Heidorn,
Pankaj Jaiswal, David LeBauer, Ab Mosca, Monica C Munoz-Torres, Arun Ross,
Tyson Swetnam
Acknowledgements
● NSF Ideas Lab
● NSF Award 1940330 Harnessing
the Data Revolution
● Translational and Integrative
Sciences Lab OSU (tislab.org)
● Two new members: Ishita Debnath
(MSU) and Ryan Bartelme (UA)
Predicting Phenotype from Genes and Environments
● G + E = P only works
in the simplest
systems, if at all
● There’s a lot we
don’t know about
how genes are
translated into
phenotypes
● How do phenotypes
affect the
ecosystem?
How can we predict phenotype given an organism’s environmental conditions and
genomic endowment?
● State-of-the-art statistical
modeling has led to many
insights, but has been applied
to very controlled systems.
● Getting the phenotype is only
part of the answer.
● Can we use the predictive
model to reveal hidden
processes? Critical variables?
Machine Learning for Results and Process
● Pros
○ Capable of coping with non-linearity
in biological systems
○ Find hidden relationships
● Cons
○ Output is opaque
○ Not enough curated data available
○ Data that are available are not “ML ready”
GenoPhenoEnvo Project
● Goal: Develop a machine learning framework capable of predicting
phenotypes based on multi-scale data about genes and environments.
○ Leverage existing, well-structured, cross-species reference data about genes and phenotypes
○ Provide interactive data visualizations for examining and interpreting the “black-box” behavior
of ML models and their results
○ Realize a new model for relating phenotypes, genetic endowment, and environmental
characteristics
● Just started Oct 1
● Now in Year 1 Q2
Training Data
Year 1
● TERRA-REF sorghum
● Heavily controlled and measured
environment data
● Thorough genotyping and
phenotyping
● Knowledge graph links
● How do we prepare these data to
be ML ready?
Year 2
● TERRA-REF wheat
● NEON, EOS
● Citizen science phenology
What is a Knowledge Graph?
How Can Knowledge Graphs Help?
DOI: 10.1126/scitranslmed.3009262
How Can Knowledge Graphs Help?
● Constrain ML and prioritize results
● Quality control - sanity check
● Integrate heterogeneous data
○ Manage terminology
○ Manage scale and granularity
● Find new relationships
● Fill in data gaps with inferencing
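As a rough illustration of "fill in data gaps with inferencing", the sketch below propagates a gene-phenotype association up a small class hierarchy. The triples, predicate names, and the `gene_A` identifier are hypothetical placeholders, not content from the project's knowledge graph, and a real graph would normally use ontology identifiers and a reasoner rather than this hand-rolled traversal.

```python
from collections import defaultdict

# Hypothetical triples (subject, predicate, object); not taken from the project.
TRIPLES = [
    ("decreased plant height", "subclass_of", "plant height phenotype"),
    ("plant height phenotype", "subclass_of", "whole plant phenotype"),
    ("gene_A", "associated_with", "decreased plant height"),
]


def superclasses(term, triples):
    """Return every ancestor of `term` reachable through subclass_of edges."""
    parents = defaultdict(set)
    for subj, pred, obj in triples:
        if pred == "subclass_of":
            parents[subj].add(obj)
    found, stack = set(), [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def infer_associations(triples):
    """Propagate each association up to the broader phenotype classes."""
    inferred = set()
    for subj, pred, obj in triples:
        if pred == "associated_with":
            for ancestor in superclasses(obj, triples):
                inferred.add((subj, pred, ancestor))
    # Report only facts that were not already stated explicitly.
    return inferred - set(triples)


new_facts = infer_associations(TRIPLES)
for fact in sorted(new_facts):
    print(fact)
```

The same traversal could serve as a sanity check: an association whose phenotype term has no path to any known class is a candidate for a terminology problem.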
Training Data - Genomic
● Use VCF files and phenotype data from
TERRA-REF
● SNP to phenotype associations with p values and
effect sizes (GWAS)
● Manhattan plot for all phenotype data
List 1: TERRA-REF data for Y1
Gene Information (G)
● Sorghum whole genome
● Sorghum genotypes[219]
Phenotype Information (P)
● Emergence Date
● End of Season Biomass
● End of Season Height
● Flowering Date
Environment Information (E)
● Soil moisture
● Air temperature
● PAR (Irradiance)
● Wind speed
● Humidity
● Precipitation and Irrigation
● Fertilizer inputs
By M. Kamran Ikram et al - Ikram MK et al (2010) Four Novel Loci (19q13, 6q24, 12q24, and
5q14) Influence the Microcirculation In Vivo. PLoS Genet. 2010 Oct 28;6(10):e1001184.
doi:10.1371/journal.pgen.1001184.g001, CC BY 2.5,
https://commons.wikimedia.org/w/index.php?curid=18056138
*example
GWAS results can be combined
with the knowledge graph results to
reduce input variables for ML
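One way GWAS output might be used to reduce input variables is to keep only SNPs whose p values pass a significance threshold. The sketch below assumes a Bonferroni correction and uses made-up SNP identifiers, p values, and effect sizes; the slides do not say which threshold or correction the project applies. The `neg_log10_p` value is the quantity plotted on the y-axis of a Manhattan plot.

```python
import math

# Hypothetical GWAS summary rows: (snp_id, p_value, effect_size).
GWAS_ROWS = [
    ("snp_001", 3.0e-9, 0.42),
    ("snp_002", 0.20, 0.01),
    ("snp_003", 1.0e-6, -0.15),
    ("snp_004", 4.0e-3, 0.05),
]


def select_snps(rows, alpha=0.05):
    """Keep SNPs that pass a Bonferroni threshold (alpha / number of tests)."""
    threshold = alpha / len(rows)
    return [
        {"snp": snp, "neg_log10_p": -math.log10(p), "effect": effect}
        for snp, p, effect in rows
        if p < threshold
    ]


selected = select_snps(GWAS_ROWS)
for row in selected:
    print(row["snp"], round(row["neg_log10_p"], 2), row["effect"])
```

In a real analysis the number of tests would be the number of SNPs in the VCF-derived genotype matrix, so the threshold would be far stricter than in this four-row example.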
Training Data - Phenomic
● Days to flowering
● Growing Degree Days to flowering
● Days to flag leaf emergence
● Growing Degree Days to flag leaf emergence
● Canopy height
● End of season canopy height
● Above ground dry biomass at harvest
Phenotypes can be
represented as
categories or
integers (10 cm or
decreased height?)
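The choice between a number ("10 cm") and a category ("decreased height") can be made concrete with a small sketch that stores both. The cultivar names, heights, reference height, and 10 cm tolerance below are all hypothetical values chosen for illustration, and the category labels are placeholders rather than ontology terms used by the project.

```python
# Hypothetical end-of-season heights in cm, keyed by cultivar name.
HEIGHTS_CM = {"cultivar_a": 310, "cultivar_b": 295, "cultivar_c": 180}


def as_category(height_cm, reference_cm, tolerance_cm=10):
    """Map a measured height to a qualitative label relative to a reference."""
    if height_cm < reference_cm - tolerance_cm:
        return "decreased height"
    if height_cm > reference_cm + tolerance_cm:
        return "increased height"
    return "normal height"


REFERENCE_CM = 300  # assumed reference height, not a project value

# Keep the numeric value next to the derived category so either can feed a model.
encoded = {
    name: {"height_cm": height, "category": as_category(height, REFERENCE_CM)}
    for name, height in HEIGHTS_CM.items()
}
print(encoded)
```

Keeping the numeric value preserves resolution for regression, while the category is the form that links most directly to phenotype terms in a knowledge graph.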
Training Data - Environmental
● Air temperature
● Relative humidity
● Precipitation
● Wind speed and direction
● Growing degree days
● Cumulative precipitation
Data from weather station and gantry
Abstracted to daily average, min, and max
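A minimal sketch of the daily abstraction step, followed by one common way to derive growing degree days (GDD) from the daily minimum and maximum. The readings are invented, and both the 10 °C base temperature and the simple (min + max) / 2 averaging method are assumptions; the slides do not state which GDD definition the project uses.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sub-daily air temperature readings: (ISO timestamp, degrees C).
READINGS = [
    ("2019-06-01T00:00", 18.0),
    ("2019-06-01T12:00", 32.0),
    ("2019-06-01T18:00", 28.0),
    ("2019-06-02T00:00", 20.0),
    ("2019-06-02T12:00", 36.0),
]


def daily_summary(readings):
    """Collapse timestamped readings to per-day mean, min, and max."""
    by_day = defaultdict(list)
    for timestamp, value in readings:
        by_day[timestamp[:10]].append(value)  # first 10 chars are YYYY-MM-DD
    return {
        day: {"mean": mean(values), "min": min(values), "max": max(values)}
        for day, values in sorted(by_day.items())
    }


def cumulative_gdd(summary, base_c=10.0):
    """Accumulate GDD per day as max(0, (Tmin + Tmax) / 2 - base)."""
    total, cumulative = 0.0, {}
    for day, stats in summary.items():
        total += max(0.0, (stats["min"] + stats["max"]) / 2 - base_c)
        cumulative[day] = total
    return cumulative


summary = daily_summary(READINGS)
gdd = cumulative_gdd(summary)
print(summary)
print(gdd)
```

The same `daily_summary` pattern would apply to the other weather variables; cumulative precipitation is a running sum of daily totals rather than a thresholded average.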
Machine Learning - Preliminary
1. Regression Models
2. Simple Neural Networks
3. Deep Neural Networks
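For the first rung of this ladder, a regression baseline can be as small as a one-variable least-squares fit. The sketch below predicts canopy height from cumulative growing degree days using invented numbers; the choice of predictor and the data are assumptions for illustration only, and the project's actual models are not described in the slides.

```python
from statistics import mean

# Hypothetical training pairs: cumulative GDD -> canopy height (cm).
GDD = [100.0, 200.0, 300.0, 400.0]
HEIGHT_CM = [22.0, 41.0, 62.0, 79.0]


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def predict(gdd_value, slope, intercept):
    """Predicted canopy height for a given cumulative GDD."""
    return slope * gdd_value + intercept


slope, intercept = fit_line(GDD, HEIGHT_CM)
print(slope, intercept, predict(250.0, slope, intercept))
```

A baseline like this gives a reference error that the neural network models would need to beat to justify their added opacity.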
Year 2 - Expanding to Ecosystems (Preliminary)
Leverage data and resources from
multiple NSF supported programs:
● Analyze genetic and remote
sensing data from NEON
● Utilize the CyVerse Data Science
Workbench
● XSEDE for ML computations
List 2: Observational & EOS data for Y2
Gene Information (G)
● Cottonwood whole genome
● Cottonwood genotypes
● Sorghum whole genome
● Sorghum genotypes
● Wheat whole genome
● Wheat genotypes
Environment Information (E)
● Soil moisture (e.g., SMAP)
● Precipitation (Daymet)
● Air temperature (Daymet)
● PAR (NARR)
● Soil Type (USDA)
Phenotype Information (P)
● Leafing date
● Flowering date
● Breaking leaf buds (#)
● Ripe fruits (#)
● Increasing leaf size (Y/N)
● Falling leaves (Y/N)
● Colored leaves (Y/N)
● Flowers or buds (#)
● Open flowers (#)
● Pollen release (Y/N)
● Recent fruit/seed drop
(Y/N)
● Fruits (#)
● NDVI (EOS)
GenoPhenoEnvo Project Information
● Join our Google Group
● Watch our GitHub Repo
github.com/genophenoenvo
● Search Twitter hashtag
#GenoPhenoEnvo
● Visit the project web page
● Anne E Thessen
● annethessen@gmail.com
Questions?
