Vivek Krishnakumar¹, Chia-Yi Cheng¹, Maria Kim¹, Erik Ferlanti¹, Irina Belyaeva¹, Seth
Schobel¹, Sergio Contrino³, Matthew R. Hanlon², Walter Moreira², Steve Mock², Joe Stubbs²,
Agnes P. Chan¹, Jason R. Miller¹, Matthew W. Vaughn², Gos Micklem³, Christopher D. Town¹

¹J. Craig Venter Institute, Rockville, MD, USA; ²Texas Advanced Computing Center, Austin, TX, USA; ³Cambridge University, Cambridge, UK
Araport, the Arabidopsis Information Portal (https://www.araport.org), is an open-access, online resource for the Arabidopsis research community funded by the NSF and BBSRC. Since its inception in late 2013, the goal of
Araport has been to provide users with a “one-stop-shop” through data federation. Araport exposes a searchable index of TAIR10 genomic data as well as additional datasets from UniProt (protein), BAR (expression), EPIC-CoGe
(epigenomics), IntAct (interaction networks), ATTED-II (co-expression), PubMed (literature), and other diverse and geographically dispersed resources using a combination of warehousing and state-of-the-art web technologies.
Araport incorporates and integrates software from GMOD including InterMine, JBrowse, GBrowse, WebApollo, Tripal, and Chado. Araport has inherited from TAIR the responsibility of providing continued access to up-to-date
structural and functional annotation for the Col-0 genome. Later this year, the Araport11 annotation update will be released, including over 1,000 novel protein-coding gene loci and ~50k splice variants derived from ~28k gene
loci using 11 tissue-specific bins of RNA-seq datasets spanning over 100 SRA accessions, as well as various classes of non-coding RNA.
Araport: Data Integration for the Arabidopsis Research Community
Araport (https://www.araport.org)
“One-stop-shop” for Arabidopsis data
ThaleMine report pages present a comprehensive set of data integrated from a variety of sources.
The report below shows up-to-date information about EMBRYO DEFECTIVE 2770, such as GO
annotation(s), publications, array-based expression, protein–protein interactions, metabolic pathways,
and homologs in other plant species.
[Figure: Araport11 Annotation Pipeline. Flowchart labels, main path: 113 SRA accessions; binned by 11 tissue/organ; TopHat alignment to TAIR10; genome-guided Trinity assembly; de novo Trinity assembly; concatenating de novo and genome-guided assemblies for each tissue/organ; 11 transcriptomes assembled by PASA; annotation update by PASA; consolidating 11 transcriptomes; re-indexing updated gene models; Araport11 protein-coding gene. Other labels: NCBI and MAKER-P assembly; appending novel transcripts to TAIR10; augmented TAIR10; UniProt protein; protein alignment; unique models; novel transcribed regions; literature; filtering; novel loci.]
The JBrowse genome viewer presents users with data organized into hierarchical and faceted track list(s).
The genomic region shown below displays the features in the vicinity of EMBRYO DEFECTIVE 2770,
highlighting the Col-0 methylation data retrieved on-the-fly from EPIC-CoGe, Paired-end analysis of
TSS (PEAT) peaks, TDNA-seq-based insertion sites, and 1001 Genomes variants alongside the updated
Araport11 annotation set.
| Category | TAIR10 | Araport11 | Description |
| Long intergenic noncoding RNA (lincRNA) | | 2,708 | The 2,708 intergenic transcripts were detected by tiling array and confirmed by RNA-seq (Liu et al., 2012). |
| Natural antisense transcript (NAT) | | 2,980 | Li et al. (2013) identified 1,490 NAT pairs in whole root samples using strand-specific RNA-seq followed by computational analysis (NASTIseq). |
| microRNA (miRNA) | 177 | 427 | miRBase 21 |
| Small nucleolar RNA (snoRNA) | 71 | 287 | Sherstnev et al. (2012) incorporated data from TAIR, PlantDB, Chen and Wu (2009), and Kim et al. (2010) and annotated 287 snoRNAs. |
| tRNA | 689 | 689 | |
| Small nuclear RNA (snRNA) | 13 | 13 | |
| Small RNA | | 24,575 | We used ShortStack (Axtell, 2013), a tool designed for annotation of small RNA genes, to analyze public data sets (Law et al., 2013). ShortStack was able to recapitulate >99% of the siRNA clusters reported by Law et al. (2013), which were based on the TAIR8 genome. We ran ShortStack in 'de novo discovery mode', supplemented with TAIR10 and miRBase 21 as the reference, and identified 24,575 non-miRNA, non-hairpin small RNA loci. |
| rRNA | 15 | 15 | |
| Other RNA | 394 | | |
| Total | 1,359 | 31,681 | |
Araport11 protein-coding gene annotation: The TAIR10 annotation was supplemented with novel transcripts from NCBI
and MAKER-P assemblies and used as the reference annotation set. RNA-seq reads from SRA were grouped into 11
tissue/organ types and assembled with Trinity; tissue-specific transcriptomes were reconstructed from a hybrid of de novo
and genome-guided assemblies. The PASA-based annotation update was performed independently for each tissue group
to avoid constructing chimeric transcripts, and the 11 transcriptomes were consolidated using a custom Python script
that collapses isoforms differing in terminal UTR length. Around 300 UniProt protein records inconsistent with TAIR10
were evaluated, filtered, and appended to the PASA-updated set. Additional novel transcripts extracted from PASA
and the literature were used to further quantify novel loci. Updated gene models and novel loci that are part of Araport11 will be
re-indexed with appropriate locus and isoform identifiers and released for community review.
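
The custom consolidation script itself is not shown on the poster. As a rough illustration only, the idea of collapsing isoforms that differ in terminal UTR length could be sketched as below. The `Isoform` record, the grouping key (strand, intron chain, CDS span) and the rule of keeping the longest representative are all assumptions made for this sketch, not the authors' implementation.

```python
from collections import defaultdict
from typing import List, NamedTuple, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive genomic coordinates


class Isoform(NamedTuple):
    """Hypothetical transcript record used only for this sketch."""
    name: str
    strand: str
    exons: Tuple[Interval, ...]  # sorted by genomic start
    cds: Interval                # genomic span of the coding sequence


def intron_chain(exons: Tuple[Interval, ...]) -> Tuple[Interval, ...]:
    """Return the introns implied by consecutive exons."""
    return tuple(
        (exons[i][1] + 1, exons[i + 1][0] - 1) for i in range(len(exons) - 1)
    )


def collapse_terminal_utr_variants(isoforms: List[Isoform]) -> List[Isoform]:
    """Group isoforms sharing strand, intron chain and CDS span.

    Within such a group only the outer ends of the first and last exon
    can differ, i.e. the terminal UTR length. Assumption: the isoform
    with the longest genomic span is kept as the representative.
    """
    groups = defaultdict(list)
    for iso in isoforms:
        groups[(iso.strand, intron_chain(iso.exons), iso.cds)].append(iso)
    return [
        max(group, key=lambda iso: iso.exons[-1][1] - iso.exons[0][0])
        for group in groups.values()
    ]


# Toy coordinates, invented for illustration.
t1 = Isoform("t1", "+", ((100, 200), (300, 400), (500, 650)), (150, 600))
t2 = Isoform("t2", "+", ((80, 200), (300, 400), (500, 700)), (150, 600))   # longer UTRs
t3 = Isoform("t3", "+", ((100, 200), (300, 420), (500, 650)), (150, 600))  # different intron

kept = collapse_terminal_utr_variants([t1, t2, t3])
print([iso.name for iso in kept])  # ['t2', 't3']
```

Here `t1` and `t2` share every splice junction and the same CDS, so they collapse into one model, while `t3` uses a different splice site and is retained as a separate isoform.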

Statistics: Araport11 updated 80.3% (28,429/35,385) of TAIR10 protein-coding gene models, of which 3.3% (933/28,429) have an
altered CDS and 88.2% (25,079/28,429) have altered UTRs. A total of 1,162 new loci and 14,880 new gene models were
added. 38.3% of protein-coding genes (18% in TAIR10) now have additional splice variants. Overall, the Araport11
pre-release contains 28,565 protein-coding gene loci encompassing 50,265 gene models.
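
The percentages above can be reproduced from the raw counts. The short check below is a sketch: the variable names are made up here, and the denominator for the CDS figure (the 28,429 updated models) is inferred from the reported 3.3%.

```python
def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


tair10_models = 35_385   # TAIR10 protein-coding gene models
updated_models = 28_429  # models updated in Araport11
cds_altered = 933        # updated models with an altered CDS
utr_altered = 25_079     # updated models with altered UTRs

print(pct(updated_models, tair10_models))  # 80.3
print(pct(cds_altered, updated_models))    # 3.3
print(pct(utr_altered, updated_models))    # 88.2

# Average number of gene models per protein-coding locus in the pre-release.
loci, models = 28_565, 50_265
print(round(models / loci, 2))             # 1.76
```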
Araport11 non-coding RNA annotation
Publications
1.  Araport: the Arabidopsis Information Portal. Nucleic Acids Research (2014) doi: 10.1093/nar/gku1200
2.  The Arabidopsis Information Portal: An Application Platform for Data Discovery. Proceedings of the 9th Gateway
Computing Environments Workshop (2014) doi: 10.1109/GCE.2014.10
Acknowledgements
We thank the NCBI RefSeq team and the Mark Yandell lab for sharing the TAIR10 re-annotation data, the authors of the RNA-seq
data sets used in our coding and non-coding RNA annotation, and Michael Axtell (PSU) and Ho-Ming Chen (Academia
Sinica) for helpful discussions.

Araport Data Integration - 2015 UMD Minisymposium
