2015 bioinformatics python_io_wim_vancriekinge

FBW
20-10-2015
Wim Van Criekinge

Overview
What is Python ?
Why Python 4 Bioinformatics ?
How to Python
IDE: Eclipse & PyDev / Athena
Code Sharing: Git(hub)
Strings
Regular expressions

Python
• Programming languages are overrated
– If you are going into bioinformatics you probably
learn/need multiple
– If you know one you know 90% of a second
• Choice does matter but it matters far less than people think it
does
• Why Python?
– Lets you start useful programs asap
– Build-in libraries – incl BioPython
– Free, most platforms, widely (scientifically) used
• Versus Perl?
– Incredibly similar
– Consistent syntax, indentation

Version 2.7 and 3.4 on athena.ugent.be

GitHub: Hosted GIT
• Largest open source git hosting site
• Public and private options
• User-centric rather than project-centric
• http://github.ugent.be (use your Ugent
login and password)
– Accept invitation from Bioinformatics-I-
2015
URI:
– https://github.ugent.be/Bioinformatics-I-
2015/Python.git

Run Install.py (is BioPython installed ?)
import pip
import sys
import platform
import webbrowser
print ("Python " + platform.python_version()+ " installed
packages:")
installed_packages = pip.get_installed_distributions()
installed_packages_list = sorted(["%s==%s" % (i.key, i.version)
for i in installed_packages])
print(*installed_packages_list,sep="n")

Control Structures
if condition:
statements
[elif condition:
statements] ...
else:
statements
while condition:
statements
for var in sequence:
statements
break
continue

Lists
• Flexible arrays, not Lisp-like linked
lists
• a = [99, "bottles of beer", ["on", "the",
"wall"]]
• Same operators as for strings
• a+b, a*3, a[0], a[-1], a[1:], len(a)
• Item and slice assignment
• a[0] = 98
• a[1:2] = ["bottles", "of", "beer"]
-> [98, "bottles", "of", "beer", ["on", "the", "wall"]]
• del a[-1] # -> [98, "bottles", "of", "beer"]

Dictionaries
• Hash tables, "associative arrays"
• d = {"duck": "eend", "water": "water"}
• Lookup:
• d["duck"] -> "eend"
• d["back"] # raises KeyError exception
• Delete, insert, overwrite:
• del d["water"] # {"duck": "eend", "back": "rug"}
• d["back"] = "rug" # {"duck": "eend", "back":
"rug"}
• d["duck"] = "duik" # {"duck": "duik", "back":
"rug"}

Regex.py
text = 'abbaaabbbbaaaaa'
pattern = 'ab'
for match in re.finditer(pattern, text):
s = match.start()
e = match.end()
print ('Found "%s" at %d:%d' % (text[s:e], s, e))

Find the answer in ultimate-sequence.txt ?
>ultimate-sequence
ACTCGTTATGATATTTTTTTTGAACGTGAAAATACT
TTTCGTGCTATGGAAGGACTCGTTATCGTGAAGT
TGAACGTTCTGAATGTATGCCTCTTGAAATGGA
AAATACTCATTGTTTATCTGAAATTTGAATGGGA
ATTTTATCTACAATGTTTTATTCTTACAGAACAT
TAAATTGTGTTATGTTTCATTTCACATTTTAGTA
GTTTTTTCAGTGAAAGCTTGAAAACCACCAAGA
AGAAAAGCTGGTATGCGTAGCTATGTATATATA
AAATTAGATTTTCCACAAAAAATGATCTGATAA
ACCTTCTCTGTTGGCTCCAAGTATAAGTACGAAA
AGAAATACGTTCCCAAGAATTAGCTTCATGAGT
AAGAAGAAAAGCTGGTATGCGTAGCTATGTATA
TATAAAATTAGATTTTCCACAAAAAATGATCTG
ATAA
Question 2

AA1 =
{'UUU':'F','UUC':'F','UUA':'L','UUG':'L','UCU':'S','
UCC':'S','UCA':'S','UCG':'S','UAU':'Y','UAC':'Y','UA
A':'*','UAG':'*','UGU':'C','UGC':'C','UGA':'*','UGG':
'W','CUU':'L','CUC':'L','CUA':'L','CUG':'L','CCU':'P',
'CCC':'P','CCA':'P','CCG':'P','CAU':'H','CAC':'H','CA
A':'Q','CAG':'Q','CGU':'R','CGC':'R','CGA':'R','CGG'
:'R','AUU':'I','AUC':'I','AUA':'I','AUG':'M','ACU':'T','
ACC':'T','ACA':'T','ACG':'T','AAU':'N','AAC':'N','AAA'
:'K','AAG':'K','AGU':'S','AGC':'S','AGA':'R','AGG':'R',
'GUU':'V','GUC':'V','GUA':'V','GUG':'V','GCU':'A','G
CC':'A','GCA':'A','GCG':'A','GAU':'D','GAC':'D','GA
A':'E','GAG':'E','GGU':'G','GGC':'G','GGA':'G','GGG
':'G' }
Hint: Use Dictionaries

Hint 2: Translations
Python way:
tab = str.maketrans("ACGU","UGCA")
sequence = sequence.translate(tab)[::-1]

17
Reading Files
name = open("filename")
– opens the given file for reading, and returns a file object
name.read() - file's entire contents as a string
name.readline() - next line from file as a string
name.readlines() - file's contents as a list of lines
– the lines from a file object can also be read using a for loop
>>> f = open("hours.txt")
>>> f.read()
'123 Susan 12.5 8.1 7.6 3.2n
456 Brad 4.0 11.6 6.5 2.7 12n
789 Jenn 8.0 8.0 8.0 8.0 7.5n'

18
File Input Template
• A template for reading files in Python:
name = open("filename")
for line in name:
statements
>>> input = open("hours.txt")
>>> for line in input:
... print(line.strip()) # strip() removes n
123 Susan 12.5 8.1 7.6 3.2
456 Brad 4.0 11.6 6.5 2.7 12
789 Jenn 8.0 8.0 8.0 8.0 7.5

19
Writing Files
name = open("filename", "w")
name = open("filename", "a")
– opens file for write (deletes previous contents), or
– opens file for append (new data goes after previous data)
name.write(str) - writes the given string to the file
name.close() - saves file once writing is done
>>> out = open("output.txt", "w")
>>> out.write("Hello, world!n")
>>> out.write("How are you?")
>>> out.close()
>>> open("output.txt").read()
'Hello, world!nHow are you?'

Question 3. Swiss-Knife.py
• Using a database as input ! Parse
the entire Swiss Prot collection
– How many entries are there ?
– Average Protein Length (in aa and
MW)
– Relative frequency of amino acids
• Compare to the ones used to construct
the PAM scoring matrixes from 1978 –
1991

Question 3: Getting the database
Uniprot_sprot.dat.gz – 528Mb
(on Github onder Files)
Unzipped 2.92 Gb !
http://www.ebi.ac.uk/uniprot/download-center

Amino acid frequencies
1978 1991
L 0.085 0.091
A 0.087 0.077
G 0.089 0.074
S 0.070 0.069
V 0.065 0.066
E 0.050 0.062
T 0.058 0.059
K 0.081 0.059
I 0.037 0.053
D 0.047 0.052
R 0.041 0.051
P 0.051 0.051
N 0.040 0.043
Q 0.038 0.041
F 0.040 0.040
Y 0.030 0.032
M 0.015 0.024
H 0.034 0.023
C 0.033 0.020
W 0.010 0.014
Second step: Frequencies of Occurence

Extra Questions
• How many records have a sequence of length 260?
• What are the first 20 residues of 143X_MAIZE?
• What is the identifier for the record with the
shortest sequence? Is there more than one record
with that length?
• What is the identifier for the record with the
longest sequence? Is there more than one record
with that length?
• How many contain the subsequence "ARRA"?
• How many contain the substring "KCIP-1" in the
description?

Question 4
• Program your own prosite parser !
• Download prosite pattern database
(prosite.dat)
• Automatically generate >2000 search
patterns, and search in sequence set
from question 1

2015 bioinformatics python_io_wim_vancriekinge

More Related Content

What's hot (20)

Viewers also liked (20)

Similar to 2015 bioinformatics python_io_wim_vancriekinge (20)

More from Prof. Wim Van Criekinge (20)

Recently uploaded (20)

2015 bioinformatics python_io_wim_vancriekinge