Biopython
     Karin Lagesen

karin.lagesen@bio.uio.no
ConcatFasta.py
Create a script that has the following:
  function get_fastafiles(dirname)
     gets all the files in the directory, checks if they are fasta
       files (end in .fsa), returns list of fasta files
     hint: you need os.path to create full relative file names
  function concat_fastafiles(filelist, outfile)
     takes a list of fasta files, opens and reads each of them,
       writes them to outfile
  if __name__ == "__main__":
     do what needs to be done to run script
Remember imports!
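One possible solution sketch for the two functions described above, using only the standard library. The suffix parameter and the newline handling between files are assumptions added for illustration; the exercise itself only asks for files ending in .fsa.

```python
import os


def get_fastafiles(dirname, suffix=".fsa"):
    """Return the paths of all files in dirname whose names end in suffix."""
    fastafiles = []
    for name in sorted(os.listdir(dirname)):
        # os.path.join builds the full relative path, as the hint suggests
        path = os.path.join(dirname, name)
        if os.path.isfile(path) and name.endswith(suffix):
            fastafiles.append(path)
    return fastafiles


def concat_fastafiles(filelist, outfile):
    """Write the contents of every file in filelist to outfile, in order."""
    with open(outfile, "w") as out:
        for fastafile in filelist:
            with open(fastafile) as handle:
                content = handle.read()
            out.write(content)
            # keep records apart if a file lacks a trailing newline
            if content and not content.endswith("\n"):
                out.write("\n")
```

In a script, the `if __name__ == "__main__":` block would read the directory name and output filename (for instance from sys.argv) and call `concat_fastafiles(get_fastafiles(dirname), outfile)`.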
Object oriented programming
Biopython is object-oriented
Some knowledge helps understand how
 biopython works
OOP is a way of organizing data and
 methods that work on them in a coherent
 package
OOP helps structure and organize the code
Classes and objects
A class:
  is a user defined type
  is a mold for creating objects
  specifies how an object can contain and
    process data
  represents an abstraction or a template for how
    an object of that class will behave
An object is an instance of a class
All objects have a type – shows which class
  they were made from
Attributes and methods
Classes specify two things:
  attributes – data holders
  methods – functions for this class
Attributes are variables that will contain the
 data that each object will have
Methods are functions that an object of that
 class will be able to perform
Class and object example
Class: MySeq
MySeq has:
   attribute length
   method translate
An object of the class MySeq is created like this:
   myseq = MySeq("ATGGCCG")
Get sequence length:
   myseq.length
Get translation:
   myseq.translate()
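A minimal sketch of what such a class could look like in plain Python. MySeq is the hypothetical class from the slide, not a Biopython class; the stored sequence attribute and the codon table helper are assumptions made for illustration, and only the letters A, C, G and T are handled.

```python
# Standard genetic code, built from the conventional TCAG ordering
_BASES = "TCAG"
_AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
                "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
_CODON_TABLE = {
    a + b + c: _AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}


class MySeq:
    """Toy DNA sequence class (hypothetical, not part of Biopython)."""

    def __init__(self, sequence):
        self.sequence = sequence.upper()
        self.length = len(self.sequence)   # attribute: data held by the object

    def translate(self):
        """Method: translate complete codons, ignoring trailing bases."""
        end = self.length - self.length % 3
        codons = (self.sequence[i:i + 3] for i in range(0, end, 3))
        return "".join(_CODON_TABLE[codon] for codon in codons)


myseq = MySeq("ATGGCCG")   # instantiate an object of the class
print(myseq.length)        # 7
print(myseq.translate())   # MA
```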
Summary
An object has to be instantiated, i.e.
 created, to exist
Every object has a certain type, i.e. is of a
 certain class
The class decides which attributes and
 methods an object has
Attributes and methods are accessed
 using . after the object variable name
Biopython
Package that assists with processing
 biological data
Consists of several modules – some with
 common operations, some more
 specialized
Website: biopython.org
Working with sequences
Biopython has many ways of working with
  sequence data
Components for today:
  Alphabet
  Seq
  SeqRecord
  SeqIO
Other useful classes for working with alignments,
 blast searches and results etc are also available,
 not covered today
Class Alphabet
Every sequence needs an alphabet
CCTTGGCC – DNA or protein?
Biopython contains several alphabets
  DNA
  RNA
  Protein
  the three above with IUPAC codes
  ...and others
Can all be found in the Bio.Alphabet package
  (Bio.Alphabet was removed in Biopython 1.78; the
  examples here need an older version)
Alphabet example
Go to freebee
Do module load python (necessary to find biopython
  modules) – start python
NOTE: have to import Alphabets to use them
  >>> import Bio.Alphabet
  >>> Bio.Alphabet.ThreeLetterProtein.letters
  ['Ala', 'Asx', 'Cys', 'Asp', 'Glu', 'Phe', 'Gly', 'His', 'Ile', 
  'Lys', 'Leu', 'Met', 'Asn', 'Pro', 'Gln', 'Arg', 'Ser', 'Thr', 
  'Sec', 'Val', 'Trp', 'Xaa', 'Tyr', 'Glx']
  >>> from Bio.Alphabet import IUPAC
  >>> IUPAC.IUPACProtein.letters
  'ACDEFGHIKLMNPQRSTVWY'
  >>> IUPAC.unambiguous_dna.letters
  'GATC'
  >>> 
Packages, modules and classes
What happens here?
   >>> from Bio.Alphabet import IUPAC
   >>> IUPAC.IUPACProtein.letters


Bio and Alphabet are packages
    packages contain modules
IUPAC is a module
    a module is a file with python code
IUPAC module contains class IUPACProtein and
  other classes specifying alphabets
IUPACProtein has attribute letters
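The same package / module / class layering can be inspected with a standard library package. This is only an analogy, assuming email.mime.text as a stand-in for Bio.Alphabet.IUPAC so that it runs without Biopython installed.

```python
import inspect

import email.mime              # email and email.mime are packages
from email.mime import text    # text is a module inside email.mime

# Packages are modules that have a __path__, so they can contain modules
print(hasattr(email, "__path__"), hasattr(email.mime, "__path__"))  # True True
print(inspect.ismodule(text), hasattr(text, "__path__"))            # True False

# The module contains a class; objects of that class have methods
print(inspect.isclass(text.MIMEText))                               # True
msg = text.MIMEText("hello")
print(msg.get_content_type())                                       # text/plain
```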
Seq
Represents one sequence with its alphabet
Methods:
  translate()
  transcribe()
  complement()
  reverse_complement()
  ...
Using Seq

Create object:
>>> from Bio.Seq import Seq
>>> import Bio.Alphabet
>>> seq = Seq("CCGGGTT", Bio.Alphabet.IUPAC.unambiguous_dna)
>>> seq
Seq('CCGGGTT', IUPACUnambiguousDNA())
Use methods:
>>> seq.transcribe()
Seq('CCGGGUU', IUPACUnambiguousRNA())
>>> seq.translate()
Seq('PG', IUPACProtein())
New object, different alphabet:
>>> seq = Seq("CCGGGUU", Bio.Alphabet.IUPAC.unambiguous_rna)
>>> seq.transcribe()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/site/VERSIONS/python-2.6.2/lib/python2.6/site-packages/Bio/Seq.py",
 line 830, in transcribe
    raise ValueError("RNA cannot be transcribed!")
ValueError: RNA cannot be transcribed!
>>> seq.translate()
Seq('PG', IUPACProtein())
>>> 
Alphabet dictates which methods make sense
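How a class can use its alphabet to decide which operations are allowed can be sketched in plain Python. SimpleSeq and its kind argument are hypothetical names for illustration; this is not how Biopython implements Seq.

```python
class SimpleSeq:
    """Toy sequence that remembers whether it is DNA or RNA (hypothetical)."""

    def __init__(self, data, kind):
        if kind not in ("DNA", "RNA"):
            raise ValueError("kind must be 'DNA' or 'RNA'")
        self.data = data
        self.kind = kind

    def transcribe(self):
        """Return the RNA version of a DNA sequence (T replaced by U)."""
        if self.kind == "RNA":
            # same idea as the traceback in the session above
            raise ValueError("RNA cannot be transcribed!")
        return SimpleSeq(self.data.replace("T", "U"), "RNA")

    def __repr__(self):
        return "SimpleSeq(%r, %r)" % (self.data, self.kind)


dna = SimpleSeq("CCGGGTT", "DNA")
rna = dna.transcribe()
print(rna)                      # SimpleSeq('CCGGGUU', 'RNA')
try:
    rna.transcribe()
except ValueError as err:
    print("Error:", err)        # Error: RNA cannot be transcribed!
```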
Seq as a string
Most string methods work on Seqs
If string is needed, do str(seq)
>>> seq = Seq('CCGGGTTAACGTA',Bio.Alphabet.IUPAC.unambiguous_dna)
>>> seq[:5]
Seq('CCGGG', IUPACUnambiguousDNA())
>>> len(seq)
13
>>> seq.lower()
Seq('ccgggttaacgta', DNAAlphabet())
>>> print(seq)
CCGGGTTAACGTA
>>> list(seq)
['C', 'C', 'G', 'G', 'G', 'T', 'T', 'A', 'A', 'C', 'G', 'T', 'A']
>>> mystring = str(seq)
>>> print(mystring)
CCGGGTTAACGTA
How to check what class or type an object is from:
>>> type(seq)
<class 'Bio.Seq.Seq'>
>>> type(mystring)
<type 'str'>
>>> 
MutableSeq
Seqs are immutable, just as strings are
If a mutable sequence is needed, convert to MutableSeq
Allows in-place changes
  >>> seq[0] = 'T'
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  TypeError: 'Seq' object does not support item assignment
  >>> mut_seq = seq.tomutable()
  >>> seq
  Seq('CCGGGTTAACGTA', IUPACUnambiguousDNA())
  >>> mut_seq[0] = 'T'
  >>> mut_seq
  MutableSeq('TCGGGTTAACGTA', IUPACUnambiguousDNA())
  >>> mut_seq.complement()
  >>> mut_seq
  MutableSeq('AGCCCAATTGCAT', IUPACUnambiguousDNA())
  >>> 
Notice: complement() changed the MutableSeq object in place!
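The difference between immutable and mutable objects can be tried with built-in types alone. In this sketch a str stands in for Seq and a list of letters stands in for MutableSeq; that pairing is an analogy, not the Biopython implementation.

```python
seq = "CCGGGTTAACGTA"        # str is immutable, like Seq

try:
    seq[0] = "T"             # not allowed: raises TypeError
except TypeError as err:
    print("TypeError:", err)

mut_seq = list(seq)          # mutable copy, playing the role of MutableSeq
mut_seq[0] = "T"             # in-place change works
print("".join(mut_seq))      # TCGGGTTAACGTA
print(seq)                   # CCGGGTTAACGTA (the original is untouched)
```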
SeqRecord
Seq contains the sequence and alphabet
But sequences often come with a lot more
SeqRecord = Seq + metadata
Main attributes:
   id – name or identifier
   seq – seq object containing the sequence
SeqRecord is a class found inside the Bio.SeqRecord module
Existing sequence:
>>> seq
Seq('CCGGGTTAACGTA', IUPACUnambiguousDNA())
>>> from Bio.SeqRecord import SeqRecord
>>> seqRecord = SeqRecord(seq, id='001')
>>> seqRecord
SeqRecord(seq=Seq('CCGGGTTAACGTA', IUPACUnambiguousDNA()), 
id='001', name='<unknown name>', description='<unknown description>', 
dbxrefs=[])
>>> 
SeqRecord attributes
From the biopython webpages:
Main attributes:

id - Identifier such as a locus tag (string)
seq - The sequence itself (Seq object or similar)

Additional attributes:

name - Sequence name, e.g. gene name (string)
description - Additional text (string)
dbxrefs - List of database cross references (list of strings)
features - Any (sub)features defined (list of SeqFeature objects)
annotations - Further information about the whole sequence (dictionary)
      Most entries are strings, or lists of strings.
letter_annotations - Per letter/symbol annotation (restricted dictionary). This holds
      Python sequences (lists, strings or tuples) whose length matches that of the
      sequence. A typical use would be to hold a list of integers representing
      sequencing quality scores, or a string representing the secondary structure.
SeqRecords in practice...
>>> from Bio.SeqRecord import SeqRecord
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import DNAAlphabet
>>> seqRecord = SeqRecord(Seq('GCAGCCTCAAACCCCAGCTG', 
... DNAAlphabet()), id = 'NM_005368.2', name = 'NM_005368', 
... description = 'Myoglobin var 1',
... dbxrefs = ['GeneID:4151', 'HGNC:6915'])
>>> seqRecord.annotations['note'] = 'Information goes here'
>>> seqRecord
SeqRecord(seq=Seq('GCAGCCTCAAACCCCAGCTG', DNAAlphabet()),
id='NM_005368.2', name='NM_005368', description='Myoglobin var 1', 
dbxrefs=['GeneID:4151', 'HGNC:6915'])
>>> seqRecord.annotations
{'note': 'Information goes here'}
>>> 
SeqIO
How to get sequences in and out of files
Retrieves sequences as SeqRecords, can
 write SeqRecords to files
Reading:
  parse(filehandle, format)
  returns a generator that gives SeqRecords
Writing:
  write(SeqRecord(s), filehandle, format)

  NOTE: examples in this section from http://biopython.org/wiki/SeqIO
SeqIO formats
List: http://biopython.org/wiki/SeqIO
Some examples:
  fasta
  genbank
  several fastq-formats
  ace
Note: a format might be readable but not
 writable depending on biopython version
Reading a file
        from Bio import SeqIO
        handle = open("example.fasta", "r")
        for record in SeqIO.parse(handle, "fasta"):
            print(record.id)
        handle.close()




SeqIO.parse returns a SeqRecord iterator
An iterator will give you the next element the
 next time it is called – compare to
 readline()
Useful because if a file contains many
 records, we avoid putting all into memory
 all at once
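The iterator idea can be tried without Biopython by writing a small generator. The function read_fasta below is a hypothetical, simplified stand-in for SeqIO.parse that yields (identifier, sequence) tuples instead of SeqRecords; it only illustrates that one record is produced per request.

```python
import io


def read_fasta(handle):
    """Yield (identifier, sequence) pairs one record at a time."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            # the identifier is the first word after ">"
            fields = line[1:].split()
            identifier = fields[0] if fields else ""
            chunks = []
        elif line and identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


# io.StringIO stands in for an open file so the example runs as is
handle = io.StringIO(">seq1 first\nACGT\nGGCC\n>seq2\nTTAA\n")
records = read_fasta(handle)
print(next(records))   # ('seq1', 'ACGTGGCC')
print(next(records))   # ('seq2', 'TTAA')
```

Only the record currently being built is held in memory, which is why this style scales to files with many records.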
Exercise
Use mb.gbk, found in Karin's folder
Use the SeqIO methods to
   read in the file
   print the id of each of the records
   print the first 10 nucleotides of each record
 >>> from Bio import SeqIO
 >>> fh = open("mb.gbk", "r")
 >>> for record in SeqIO.parse(fh, "genbank"):
 ...     print(record.id)
 ...     print(record.seq[:10])
 ... 
 NM_005368.2
 GCAGCCTCAA
 XM_001081975.2
 CCTCTCCCCA
 NM_001164047.1
 TAGCTGCCCA
 >>> 
SeqRecord lists and dictionaries
To get everything as a list:
  handle = open("example.fasta", "r")
  records = list(SeqIO.parse(handle, "fasta"))
  handle.close()

To get everything as a dictionary:
  handle = open("example.fasta", "r")
  record_dict = SeqIO.to_dict(SeqIO.parse(handle, "fasta"))
  handle.close()

But: avoid both if at all possible, since they load
 every record into memory at once
Writing files
   from Bio import SeqIO
   sequences = ... # add code here
   output_handle = open("example.fasta", "w")
   SeqIO.write(sequences, output_handle, "fasta")
   output_handle.close()

Note: sequences is here a list of SeqRecords
Can write any iterable containing
 SeqRecords to a file
Can also write a single SeqRecord
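What a fasta writer does with an iterable of records can be sketched in plain Python. The function write_fasta, its (identifier, sequence) tuples and the width parameter are assumptions for illustration; like SeqIO.write, the sketch returns the number of records written.

```python
import io


def write_fasta(records, handle, width=60):
    """Write (identifier, sequence) pairs to handle; return the record count."""
    count = 0
    for identifier, sequence in records:
        handle.write(">%s\n" % identifier)
        # split long sequences into lines of at most width letters
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")
        count += 1
    return count


# io.StringIO stands in for an open output file so the example runs as is
out = io.StringIO()
n = write_fasta([("seq1", "ACGTGGCC"), ("seq2", "TTAA")], out, width=4)
print(n)               # 2
print(out.getvalue())
```

Because the function only loops over records, it accepts a list, a generator or any other iterable.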
seq_length.py
Write script that reads a file containing genbank
 sequences and writes out name and sequence
 length
Should have
  Function sequence_length(inputfile)
      Open file
      Per seqRecord in input:
           Print name, length of sequence
      Close file
  if __name__ == "__main__":
      Get input from command line:
           inputfile
Modifications
Figure out how to:
  print the description of each genbank entry
  print which annotations each entry has
  print the taxonomy for each entry
Description:
  seqRecord.description
Annotations:
  seqRecord.annotations.keys()
Taxonomy:
  seqRecord.annotations['taxonomy']
tag_fasta.py
Create script that takes a file containing fasta
  sequences, adds a tag at the front of the name and
  writes it out to a new file
Should have
  Function change_name(seqRecord, tag)
     Change name, return seqRecord
  Function read_write_fasta(tag, input, output)
     Per seqRecord in input:
         Change name
         Write to output
  if __name__ == "__main__":
     Get input from command line:
         Tag, input file, output file
Optional homework: convert.py
Create a script that:
  takes input filename, input file type, output
    filename and output file type
  converts input file to output file type and writes it
    to output file
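Reading the four command-line inputs could be done with argparse from the standard library, as in the sketch below. The argument names are hypothetical; the conversion itself is left out and would combine the SeqIO.parse and SeqIO.write calls shown earlier.

```python
import argparse


def build_parser():
    """Command-line interface for a hypothetical convert.py."""
    parser = argparse.ArgumentParser(
        description="Convert a sequence file from one format to another.")
    parser.add_argument("infile", help="input filename")
    parser.add_argument("informat", help="input file type, e.g. genbank")
    parser.add_argument("outfile", help="output filename")
    parser.add_argument("outformat", help="output file type, e.g. fasta")
    return parser


# Passing a list keeps the example independent of the real command line;
# a script would call build_parser().parse_args() with no arguments.
args = build_parser().parse_args(["mb.gbk", "genbank", "mb.fasta", "fasta"])
print(args.infile, args.informat, args.outfile, args.outformat)
```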

Editor's Notes

  • #11: Show webpage for alphabets. Each alphabet is a separate class