2. FASTA format
Definition:
The FASTA format is a text-based format for representing nucleotide
or protein sequences. It is commonly used in bioinformatics for
storing and exchanging sequence information between databases and
software.
Purpose:
The primary purpose of this format is to facilitate data manipulation and
analysis, allowing researchers to easily access and process biological
sequences.
3. Structure and components
A typical FASTA file contains sequence identifiers and the
sequences themselves. The first line of each record begins with a
'>' symbol, followed by the identifier and an optional description.
Subsequent lines contain the actual sequence data, which can
be split across multiple lines for long sequences. The format
is designed to be simple and human-readable, ensuring easy
compatibility with a range of bioinformatics tools.
4. Structure of a FASTA file:
1. Header line:
Starts with a '>' (greater-than) symbol followed by an identifier and
an optional description.
2. Sequence lines:
One or more lines containing the actual sequence (DNA, RNA, or
protein), without spaces or numbers.
5. Example
For a DNA sequence:
>sequence1 description
ATGCGTAATAGCTAGCTAGCTAATCG
CGATCGATCGATCGTAGCTAGCTA
For a protein sequence:
>protein1 hypothetical protein
MKWVTFISLLFLFSSAYSRGVFRRDTHKSEIAHRFKDLG
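The record layout described above (a '>' header line followed by one or more sequence lines) can be read with a few lines of Python. This is a minimal sketch rather than a full parser: the function names parse_fasta and _make_record are hypothetical, and it assumes the identifier is the first whitespace-delimited word of the header.

```python
def _make_record(header, chunks):
    # The identifier is the first word of the header; anything after the
    # first space is treated as the free-text description.
    identifier, _, description = header.partition(" ")
    return identifier, description.strip(), "".join(chunks)


def parse_fasta(text):
    """Parse FASTA text into a list of (identifier, description, sequence)."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append(_make_record(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are joined back into one string.
            chunks.append(line)
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


example = """>sequence1 description
ATGCGTAATAGCTAGCTAGCTAATCG
CGATCGATCGATCGTAGCTAGCTA
>protein1 hypothetical protein
MKWVTFISLLFLFSSAYSRGVFRRDTHKSEIAHRFKDLG
"""

for identifier, description, sequence in parse_fasta(example):
    print(identifier, "|", description, "|", len(sequence), "residues")
```

The two sequence lines of the DNA record are joined into a single string, which is how most tools treat a multi-line FASTA record.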
6. Notes:
1- A sequence can be split into multiple lines for readability.
2- There is no strict limit on line length, though 60-80 characters per
line is common.
3- Used in many bioinformatics tools for tasks like alignment
(e.g., BLAST), sequence assembly, and annotation.
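The line-length convention in note 2 can be applied when writing a record. The sketch below wraps a sequence at a chosen width; format_fasta is a hypothetical helper, and the default of 60 is just one value from the common 60-80 range.

```python
def format_fasta(identifier, sequence, description="", width=60):
    """Return one FASTA record with the sequence wrapped at `width` characters."""
    if width < 1:
        raise ValueError("width must be a positive integer")
    # rstrip() drops the trailing space when there is no description.
    header = f">{identifier} {description}".rstrip()
    # Slice the sequence into fixed-width chunks, one per output line.
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header, *chunks]) + "\n"


# A made-up 160-base sequence, used only to show the wrapping.
record = format_fasta("sequence1", "ACGT" * 40, description="description")
print(record, end="")
```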
7. Application in bioinformatics:
The FASTA format plays a vital role in various aspects of
bioinformatics, including sequence alignment, similarity searches,
and genome assembly.
It is commonly used in tools such as BLAST (Basic Local
Alignment Search Tool), which allows researchers to compare their
sequences against a large database to identify homologous
sequences and functional elements. Additionally, the format is used
as input for phylogenetic analysis software, where it helps
determine evolutionary relationships between species.
8. Methods for Accessing FASTA Data
Accessing FASTA data can be achieved through several methods,
including command-line tools such as wget or curl, which can
download files directly from public databases. Web-based APIs are
also popular, enabling researchers to retrieve sequences
programmatically using specific queries. For larger projects, local
databases can be maintained from bulk downloads to ensure quick
access to frequently used sequences.
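As one illustration of the web-API route, NCBI's E-utilities expose an efetch endpoint that can return a record as FASTA text. The sketch below builds such a request with the Python standard library; the helper names are hypothetical, the accession is only an example, and the nuccore database is an assumption that suits nucleotide records.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, db="nuccore"):
    """Build an efetch URL that asks for one record as plain-text FASTA."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EFETCH_URL}?{urlencode(params)}"


def fetch_fasta(accession, db="nuccore", timeout=30):
    """Download the record and return it as a string (needs network access)."""
    with urlopen(build_efetch_url(accession, db), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Only the URL is printed here; call fetch_fasta(...) to actually download.
print(build_efetch_url("NM_007294.4"))
```

The same URL can be passed to wget or curl. For repeated queries, NCBI asks clients to keep the request rate low, so a bulk download is usually the better choice for large projects.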
9. Tools for processing FASTA files
Various bioinformatics software tools are designed for processing
FASTA files, allowing the manipulation and analysis of sequence
data. Examples include SeqKit, a command-line toolkit for FASTA/Q
file processing, and Bioconductor packages in R that provide
extensive functionality for data analysis.
Additionally, Python libraries such as Biopython offer
comprehensive capabilities for parsing, searching, and analysing
FASTA sequences, facilitating the extraction of meaningful insights.
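To give a feel for the kind of per-sequence summary such tools produce, here is a small standard-library sketch that reports length and GC content. It does not use SeqKit, Bioconductor, or Biopython; the function names are hypothetical and the input is assumed to be a mapping of identifier to sequence that has already been read from a FASTA file.

```python
def gc_content(sequence):
    """Fraction of G and C among the A, C, G, T/U bases; 0.0 if none are present."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    # Ambiguity codes such as N are left out of the denominator.
    total = gc + seq.count("A") + seq.count("T") + seq.count("U")
    return gc / total if total else 0.0


def summarize(records):
    """Map each identifier to its sequence length and rounded GC fraction."""
    return {
        name: {"length": len(seq), "gc": round(gc_content(seq), 3)}
        for name, seq in records.items()
    }


records = {"sequence1": "ATGCGTAATAGCTAGCTAGCTAATCGCGATCGATCGATCGTAGCTAGCTA"}
for name, stats in summarize(records).items():
    print(name, stats)
```

For real projects the dedicated tools are usually preferable, since they also handle compressed files, very large inputs, and format edge cases.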
10. Integration with bioinformatics databases:
FASTA files can be effectively integrated with major
bioinformatics databases such as NCBI, Ensembl, and
UniProt. These databases allow users to upload FASTA files
for sequence search, annotation, and comparative analysis.
The interoperability of the FASTA format ensures that data can be
cross-referenced with genomic annotations, protein structures,
and molecular functions, enhancing the overall
effectiveness of bioinformatics research.
11. Example for a specific species / Analysis related to FASTA format
Here is a FASTA format example for a gene from Homo sapiens (human): the BRCA1 gene, which is
commonly analysed in cancer research.
Example: BRCA1 (partial sequence)
>NM_007294.4 Homo sapiens BRCA1, DNA repair associated (BRCA1), mRNA
ATGAAAAGCTCAGAGGAGGAAGAGGAAAGGAGGAAGAGGAGGAGGAAGAGGAAGAGGAAGAGGAA
AGAGGAGGAAGAGGAAGAGGAAGAGGAAGAGGAAGAGGAAGAGGAAGAGGAAGAGGAAGAGGAAG
TGTTTATTTTTTATTTTGATTTTTTTTTTTGAGACAGAGTCTCGCTCTGTCGCCCAGGCTGGAGT
GCAGTGGCACGATCTTGGCTCACTGCAAGCTCCGCCTCCCAGGTTCAAGCAATTCTCCTGCCTCA
GCCTCCCGAGTAGCTGGGACTACAGGCACCCGCCACCACGCCTGGCTAATTTTTGTATTTTTAGT
AGAGATAGGGTTTCACCATGTTGGCCAGGCTGGTCTCAAACTCCTGACCTCGTGATCCACCCGC
12. Example (continued)
How this is used:
● This FASTA file could be input to tools like BLAST, Clustal
Omega, or MAFFT to compare the sequence against
others.
● Researchers use this for mutation analysis, gene
expression studies, or to design primers for PCR.
● The accession number NM_007294.4 tells you this is an
NCBI RefSeq mRNA sequence.
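The accession in a RefSeq-style header can be picked apart programmatically. The following is a rough sketch: parse_refseq_header is a hypothetical helper, the regular expression only covers the two-letter-prefix form used by RefSeq, and the prefix table lists just a few common cases.

```python
import re

# A few common RefSeq accession prefixes and the molecule type they indicate.
REFSEQ_PREFIXES = {
    "NM_": "mRNA",
    "NR_": "non-coding RNA",
    "NP_": "protein",
    "NC_": "complete genomic molecule",
    "XM_": "predicted mRNA",
    "XP_": "predicted protein",
}

HEADER_PATTERN = re.compile(
    r">?(?P<accession>[A-Z]{2}_\d+)\.(?P<version>\d+)\s*(?P<description>.*)"
)


def parse_refseq_header(header):
    """Split a RefSeq-style FASTA header into accession, version, and description."""
    match = HEADER_PATTERN.match(header)
    if match is None:
        raise ValueError(f"not a RefSeq-style header: {header!r}")
    accession = match["accession"]
    return {
        "accession": accession,
        "version": int(match["version"]),
        "molecule": REFSEQ_PREFIXES.get(accession[:3], "unknown"),
        "description": match["description"].strip(),
    }


header = ">NM_007294.4 Homo sapiens BRCA1, DNA repair associated (BRCA1), mRNA"
print(parse_refseq_header(header))
```

The number after the dot is the record version, so NM_007294.4 is the fourth version of accession NM_007294.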
13. Alignment of sequences in FASTA format
There are several ways to align or analyse a sequence in FASTA format,
depending on your goals (e.g., comparing with other sequences, identifying
mutations, or finding similar sequences).
Here's a breakdown of the most common methods:
● BLAST (Basic Local Alignment Search Tool)
Purpose:
Find regions of similarity between your sequence and others in a
database.
14. BLAST (continued)
How:
a. Go to
https://blast.ncbi.nlm.nih.gov/Blast.cgi
b. Choose a tool (e.g., nucleotide BLAST if your sequence is DNA).
c. Paste your FASTA sequence into the query box.
d. Choose the database (e.g., human genome or all organisms).
e. Click BLAST and view results such as similar sequences, percent identity, E-values, etc.
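The percent identity reported in step e can be understood with a small calculation. This sketch is not BLAST itself: it assumes two sequences that are already aligned (same length, with '-' marking gaps) and counts identical positions over the alignment length. The function name and the example strings are made up for illustration.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of alignment columns where both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    # A column counts as a match only if the residues agree and are not gaps.
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap)
    return 100.0 * matches / len(aligned_a)


query = "ATGC-TA"
subject = "ATGCGTT"
print(f"{percent_identity(query, subject):.1f}% identity")
```

Here five of the seven columns match. The E-value is a separate statistic that also depends on the database size, so it cannot be derived from the two sequences alone.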
16. MSA (continued)
How:
a. Collect the multiple FASTA sequences you want to align.
b. Paste them into the tool.
c. Run the alignment.
d. Download or view the alignment file, often used for evolutionary
analysis or primer design.
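Once an alignment has been downloaded, a common first look is to see which columns are identical in every sequence, since conserved stretches are candidates for primer sites. The sketch below assumes the aligned sequences have already been read into a dictionary; the three short sequences are invented, and conservation_line is a hypothetical helper that mimics the '*' marker line some alignment viewers show.

```python
def conservation_line(aligned):
    """Return '*' for columns identical in all sequences and ' ' elsewhere."""
    seqs = list(aligned.values())
    if not seqs:
        return ""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    marks = []
    # zip(*seqs) walks the alignment column by column.
    for column in zip(*seqs):
        conserved = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if conserved else " ")
    return "".join(marks)


alignment = {
    "seq1": "ATGCC-TA",
    "seq2": "ATGCCGTA",
    "seq3": "ATGACGTA",
}
for name, seq in alignment.items():
    print(f"{name}  {seq}")
print(f"      {conservation_line(alignment)}")
```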
17. ● Primer design / Mutation analysis
Tools:
○ Primer-BLAST
○ SnapGene Viewer
○ Benchling
● You can input your FASTA sequence and look for:
○ Specific mutations (SNPs, insertions, deletions)
○ Potential primers
○ Protein translation
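Protein translation, the last item above, is simple enough to sketch directly. The code below builds the standard genetic code from its usual compact string form (bases ordered T, C, A, G) and translates a DNA sequence in reading frame 1. It is a minimal illustration, assuming the standard code applies; unknown codons are reported as 'X' and a trailing partial codon is ignored.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA (or RNA) in frame 1; '*' marks a stop codon."""
    seq = dna.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # drop any trailing partial codon
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3)
    )


# First 12 bases of the DNA example from slide 5.
print(translate("ATGCGTAATAGC"))
```

Comparing the translation of a reference and a variant sequence is one simple way to see whether a SNP changes the encoded amino acid.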
18. Command-line tools (for bioinformatics scripting)
If you are comfortable with coding or scripting:
● Use Biopython (a Python library) to parse and analyse FASTA files.
● Use tools like MAFFT, BLAST+, or SAMtools in a terminal for bulk analysis.
Conclusion:
In summary, the FASTA format is critical for the representation and
manipulation of biological sequence data in bioinformatics. Its simplicity
facilitates wide-ranging applications.
19. Conclusion (continued)
Various methods and tools allow for effective data retrieval and processing.
Integration with prominent bioinformatics databases further enhances the format's
utility, making it indispensable to the global bioinformatics community.