using_webbased_tools.ppt

Bioinformatics Resources and Tools
on the Web: A Primer
Joel H. Graber
Center for Advanced Biotechnology
Boston University

Outline
• Introduction: What is bioinformatics?
• The basics
– The five sites that all biologists should know
• Some examples
– Using the tools in a somewhat less-than-naïve manner
• Questions/comments are welcome at all points
• Much of this material comes from the Boston
University course: BF527 Bioinformatic
Applications (http://matrix.bu.edu/BF527/)

Examples of Bioinformatics
• Database interfaces
– GenBank/EMBL/DDBJ, Medline, SwissProt, PDB, …
• Sequence alignment
– BLAST, FASTA
• Multiple sequence alignment
– Clustal, MultAlin, DiAlign
• Gene finding
– Genscan, GenomeScan, GeneMark, GRAIL
• Protein Domain analysis and identification
– Pfam, BLOCKS, ProDom, …
• Pattern Identification/Characterization
– Gibbs Sampler, AlignACE, MEME
• Protein Folding prediction
– PredictProtein, SWISS-MODEL

Things to know and remember about
using web server-based tools
• You are using someone else’s computer
• You are (probably) getting a reduced set of
options or capacity
• Servers are great for sporadic or proof-of-
principle work, but for intensive work, the
software should be obtained and run locally

Five websites that all biologists
should know
• NCBI (The National Center for Biotechnology Information)
– http://www.ncbi.nlm.nih.gov/
• EBI (The European Bioinformatics Institute)
– http://www.ebi.ac.uk/
• The Canadian Bioinformatics Resource
– http://www.cbr.nrc.ca/
• SwissProt/ExPASy (Swiss Bioinformatics Resource)
– http://expasy.cbr.nrc.ca/sprot/
• PDB (The Protein Databank)
– http://www.rcsb.org/PDB/

NCBI (http://www.ncbi.nlm.nih.gov/)
• Entrez interface to databases
– Medline/OMIM
– GenBank/GenPept/Structures
• BLAST server(s)
– Five-plus flavors of BLAST
• Draft Human Genome
• Much, much more…

EBI (http://www.ebi.ac.uk/)
• SRS database interface
– EMBL, SwissProt, and many more
• Many server-based tools
– ClustalW, DALI, …

SwissProt (http://expasy.cbr.nrc.ca/sprot/)
• Curation!!!
– Error rate in the information is greatly reduced in
comparison to most other databases.
• Extensive cross-linking to other data sources
• SwissProt is the ‘gold-standard’ by which
other databases can be measured, and is the
best place to start if you have a specific
protein to investigate

A few more resources to be aware of
• Human Genome Working Draft
– http://genome.ucsc.edu/
• TIGR (The Institute for Genomic Research)
– http://www.tigr.org/
• Celera
– http://www.celera.com/
• (Model) Organism specific information:
– Yeast: http://genome-www.stanford.edu/Saccharomyces/
– Arabidopsis: http://www.tair.org/
– Mouse: http://www.jax.org/
– Fruitfly: http://www.fruitfly.org/
– Nematode: http://www.wormbase.org/
• Nucleic Acids Research Database Issue
– http://nar.oupjournals.org/ (First issue every year)

Example 1: Searching a new
genome for a specific protein
• Specific problem: We want to find the closest
match in C. elegans of D. melanogaster protein
NTF1, a transcription factor
• First: understand the different forms of BLAST

The different versions of BLAST
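As a quick reference, the five classic BLAST programs differ in the type of query and database they accept, and in which side is translated before comparison. A minimal sketch in Python; the dictionary layout and the helper name choose_blast are illustrative and not part of the original talk.

```python
# Query and database types for the five classic BLAST programs.
# "translated" lists which side is translated in all six reading frames
# before the comparison is done at the protein level.
BLAST_PROGRAMS = {
    "blastn":  {"query": "nucleotide", "database": "nucleotide", "translated": ()},
    "blastp":  {"query": "protein",    "database": "protein",    "translated": ()},
    "blastx":  {"query": "nucleotide", "database": "protein",    "translated": ("query",)},
    "tblastn": {"query": "protein",    "database": "nucleotide", "translated": ("database",)},
    "tblastx": {"query": "nucleotide", "database": "nucleotide", "translated": ("query", "database")},
}


def choose_blast(query_type, database_type, compare_as_protein=True):
    """Return the program names that fit this query/database pairing.

    compare_as_protein only matters for nucleotide vs. nucleotide searches,
    where either blastn (direct) or tblastx (both sides translated) applies.
    """
    matches = []
    for name, spec in BLAST_PROGRAMS.items():
        if spec["query"] != query_type or spec["database"] != database_type:
            continue
        if query_type == database_type == "nucleotide":
            if compare_as_protein != bool(spec["translated"]):
                continue
        matches.append(name)
    return matches


print(choose_blast("protein", "protein"))
print(choose_blast("protein", "nucleotide"))
```

The two calls correspond to the two searches used in this example: protein query against predicted proteins, then protein query against translated genomic nucleotide sequence.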

1st Step: Search the proteins
• blastp is used to search for C. elegans
proteins that are similar to NTF1
• Two reasonable hits are found, but the hits
have suspicious characteristics
– besides the fact that they weren’t included in the
complete genome!

2nd Step: Search the nucleotides
• tblastn is used to search translated C. elegans
nucleotide sequences for similarity to NTF1
• Now we have only one hit
– How are they related?
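
tblastn can find protein-level matches that the predicted proteins miss because it translates the nucleotide database in all six reading frames, independent of any gene annotation. The sketch below shows only that translation step, using the standard genetic code; the function names are illustrative, and real BLAST does far more (seeding, extension, statistics).

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map frame labels (+1, +2, +3, -1, -2, -3) to translated peptides."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, peptide in six_frame_translations("ATGGCCTAA").items():
    print(frame, peptide)
```

Because every frame is searched, an exon that was left out of a predicted protein can still be aligned to the query.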

Conclusion: Incorrect gene
prediction/annotation
• The two predicted proteins have essentially
identical annotation
• The protein-protein alignments are disjoint
and consecutive on the protein
• The protein-nucleotide alignment includes
both protein-protein alignments in the proper
order
• Why/how does this happen?
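
The reasoning above can be written as a simple check on alignment coordinates: two protein hits that are disjoint and consecutive on the query, plus one nucleotide hit that spans both, suggest a single gene that was split by the prediction. A rough sketch; the Hit type, the max_gap tolerance, and the coordinates are hypothetical and are not the actual NTF1 results.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    subject: str  # name of the matched database entry
    q_start: int  # first aligned query residue (1-based, inclusive)
    q_end: int    # last aligned query residue (inclusive)


def overlap(a, b):
    """Number of query residues covered by both hits."""
    return max(0, min(a.q_end, b.q_end) - max(a.q_start, b.q_start) + 1)


def looks_like_split_gene(protein_hits, nucleotide_hit, max_gap=20):
    """True if two blastp hits tile the query and one tblastn hit spans both.

    max_gap is an assumed tolerance (in residues) for unaligned query
    positions between the two protein hits.
    """
    first, second = sorted(protein_hits, key=lambda h: h.q_start)
    disjoint = overlap(first, second) == 0
    consecutive = 0 <= second.q_start - first.q_end - 1 <= max_gap
    covered = (
        nucleotide_hit.q_start <= first.q_start
        and nucleotide_hit.q_end >= second.q_end
    )
    return disjoint and consecutive and covered


# Hypothetical coordinates, for illustration only.
blastp_hits = [Hit("predicted_B", 245, 500), Hit("predicted_A", 1, 240)]
tblastn_hit = Hit("genomic_region", 1, 500)
print(looks_like_split_gene(blastp_hits, tblastn_hit))
```

In practice a tblastn result is often reported as several segments (one per exon), so the single spanning hit used here is a simplification.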

Final(?) Check: Gene prediction
• Genscan is the best available ab initio gene
predictor
– http://genes.mit.edu/GENSCAN.html
• Genscan’s prediction spans both protein-
protein alignments, reinforcing our conclusion
of a bad prediction

Ab initio vs. similarity vs. hybrid
models for gene finding
• Ab initio: The gene looks like the average of
many genes
– Genscan, GeneMark, GRAIL…
• Similarity: The gene looks like a specific
known gene
– Procrustes,…
• Hybrid: A combination of both
– GenomeScan (http://genes.mit.edu/genomescan/)

A similar example: Fruitfly homolog
of mRNA localization protein VERA
• Same procedure as just described
– tblastn search with BLOSUM45 produces an unexpected exon
• Conclusion: Incomplete (as opposed to incorrect)
annotation
– We have verified the existence of the rare isoform through RT-PCR

Another example: Find all genes with
PDZ domains
• Multiple methods are possible
• The ‘best’ method will depend on many things
– How much do you know about the domain?
– Do you know the exact extent of the domain?
– How many examples do you expect to find?

Some possible methods if the domain
is a known domain:
• SwissProt
– text search capabilities
– good annotation of known domains
– crosslinks to other databases (domains)
• Databases of known domains:
– BLOCKS (http://blocks.fhcrc.org/)
– Pfam (http://pfam.wustl.edu/)
– Others (ProDom, ProSite, DOMO,…)

Determination of the nature of
conservation in a domain
• For new domains, multiple alignment is your
best option
– Global: ClustalW
– Local: DiAlign
– Hidden Markov Model: HMMER
• For known domains, this work has largely
been done for you
– BLOCKS
– Pfam

If you have a protein and want to
search it against known domains
• Search/Analysis tools
– Pfam
– BLOCKS
– PredictProtein
(http://cubic.bioc.columbia.edu/predictprotein/predictprotein.html)

Different representations of
conserved domains
• BLOCKS
– Gapless regions
– Often multiple blocks for one domain
• Pfam
– Statistical model, based on HMM
– Since gaps are allowed, most domains have only
one Pfam model
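
To make the gapless representation concrete, the sketch below turns a block of equal-length segments into per-column log-odds scores and slides it along a sequence with no gaps allowed. It is a toy illustration, not the BLOCKS scoring method: the uniform background, the pseudocount, the function names, and the four-residue example block are all assumptions. A profile HMM, as used by Pfam, additionally models insertions and deletions, which is not shown here.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(block, pseudocount=1.0):
    """Per-column log2-odds scores against a uniform background."""
    width = len(block[0])
    if any(len(segment) != width for segment in block):
        raise ValueError("a gapless block needs segments of equal length")
    background = 1.0 / len(ALPHABET)
    total = len(block) + pseudocount * len(ALPHABET)
    profile = []
    for col in range(width):
        counts = Counter(segment[col] for segment in block)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile


def best_window(profile, sequence):
    """Score every gapless window; return (best score, 0-based start)."""
    width = len(profile)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the block")
    scored = [
        (
            # residues outside ALPHABET score 0.0 (neutral)
            sum(profile[i].get(sequence[start + i], 0.0) for i in range(width)),
            start,
        )
        for start in range(len(sequence) - width + 1)
    ]
    return max(scored)


# Toy block, for illustration only.
toy_block = ["GLGF", "GLGF", "GIGF", "GLGY"]
profile = build_profile(toy_block)
score, start = best_window(profile, "MKTAGLGFSIVE")
print(start, round(score, 2))
```

Because no gaps are allowed, a domain with variable-length loops needs several such blocks, which is why BLOCKS often has multiple blocks per domain.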

Conclusions
• We have only touched small parts of the
elephant
• Intelligent trial and error is often your best
tool
• Keep up with the main five sites, and you’ll
have a pretty good idea of what is happening
and available
