Working with Dictionaries and Lists/Sets
Modules you can use:
argparse, re, os, collections, sys
General Guidelines (Steps 1-6)
You have more flexibility to implement your own function names and logic in these programs.
The data files you need for this assignment can be obtained from:
HUGO_genes.txt, chr21_genes.txt
Create an output directory inside your assignment4 directory called "OUTPUT" for result files, so
that they will not mix with your programs. Output from your programs will be here!
Pay close attention to how you'll run the run_lints.sh script (see below)
Your program must implement command line options for the infiles it must open, but for
testing purposes it should run by default, so if no command line option is passed at the command
line, the program will still run. This will help in the grading of your program.
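One way to satisfy the "runs by default" requirement is to give each argparse option a default value. The sketch below is only an illustration: the -i flag comes from Exercise 1, while the function name get_cli_args, the long option --infile, and the default path are assumptions you can change.

```python
import argparse


def get_cli_args(argv=None):
    """Parse command line options; fall back to a default infile when none is given."""
    parser = argparse.ArgumentParser(
        description="Open a gene file and look up gene descriptions")
    parser.add_argument(
        "-i", "--infile",
        dest="infile",
        type=str,
        default="chr21_genes.txt",  # assumed default so the program runs with no options
        help="Path to the file to open")
    # argv=None makes argparse read sys.argv[1:]; passing a list is handy for testing
    return parser.parse_args(argv)


# Simulate running the program with no command line options at all
args = get_cli_args([])
print(args.infile)
```

Calling get_cli_args() with no argument in a real program reads the options typed at the command line.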
Create a Python Module called io_utils.py. Put this io_utils.py Module in a subdirectory
named assignment4 inside your assignment4 top-level directory (see the tree below). Anytime a
file needs to be opened (read or write) in your programs in this assignment, the program should
call on this module's function get_filehandle. You can then use io_utils.get_filehandle by doing
this at the top of your programs:
from assignment4 import io_utils
# I can then use the module's get_filehandle() function by:
fh_in = io_utils.get_filehandle(infile1, "r") # note the function call "io_utils.get_filehandle()"
You can also import this way (either way is acceptable, but I will assume in this assignment you
did it the way below; I think this is better for PyCharm):
from assignment4.io_utils import get_filehandle
# I can then use the module's get_filehandle() function by:
fh_in = get_filehandle(infile1, "r") # note the function call "get_filehandle()"
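The assignment only fixes the function name get_filehandle and the (file, mode) call shown above. What the function does when a file cannot be opened is not stated here, so the following is a minimal sketch under the assumption that it should report the problem on STDERR and re-raise the exception.

```python
import sys


def get_filehandle(file=None, mode=None):
    """Open `file` in `mode` and return the open filehandle.

    Sketch only: prints a short message to STDERR and re-raises when the
    file cannot be opened (OSError) or the mode is invalid (ValueError).
    """
    try:
        return open(file, mode)
    except OSError:
        # Covers a missing file, a directory path, or missing permissions
        print(f"Could not open the file: {file} for mode '{mode}'", file=sys.stderr)
        raise
    except ValueError:
        # open() raises ValueError for an unknown mode string such as "z"
        print(f"Wrong open mode: '{mode}' for file: {file}", file=sys.stderr)
        raise
```

Re-raising keeps the decision about whether to exit in the calling program, which also makes the function easy to unit test.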
Your final submission must have the following files in bold and must use this directory structure.
Make sure you put a blank __init__.py where I've denoted below, e.g.: assignment4/assignment4
(see below)
Make sure you have __init__.py only in assignment4/assignment4 folder, the tests folder, the unit
folder, and not in the assignment4 main folder.
Information on Source files
The chr21_genes.txt file lists genes from human chromosome 21, in their order along the
chromosome, as described in Hattori et al. (Nature 405, 311-319). For
each gene, the file gives the gene symbol, description and category. The fields are separated by
tabs. You will need to get the meaning of each category. You can find these meanings in the
original paper, under the "Gene categories" section. Create a file named
chr21_genes_categories.txt that stores this information in tab-separated fields:
This will be used in program #2
The HUGO_genes.txt file lists all human genes having an official symbol approved by the HUGO
Gene Nomenclature Committee (some have probably changed by now).
For each gene, the file gives its symbol and description, separated by a TAB character.
Exercises
1. Write a program (call it gene_names_from_chr21.py) that asks the user to enter a gene
symbol and then prints the description for that gene based on data from the chr21_genes.txt file.
The program should give an error message if the entered symbol is not found in the table (the user
should not have to worry about case, i.e. it will be a case-insensitive search). The program
should continue to ask the user for genes until "quit" or "exit" is given (case-insensitive). Make
sure to prompt the user to enter "quit" to end the program. Use Dictionaries to solve this
problem. HINT: Feel free to use a Dictionary of Dictionaries, but it is not required.
HINT: First read the entire text file into a Dictionary that maps the association between gene
symbol and description. Once again, make sure to use a Dictionary.
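The hint above can be sketched as follows. This is one possible shape, not the required solution: the function names, the prompt and message wording, and the two sample rows are all made up for illustration, and the sketch assumes any header row has already been skipped. Storing upper-cased symbols as keys is what makes the search case-insensitive.

```python
def build_gene_dict(lines):
    """Map upper-cased gene symbol -> description from tab-separated lines."""
    genes = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed lines
        genes[fields[0].upper()] = fields[1]
    return genes


def run_prompt(genes, ask=input, show=print):
    """Keep asking for gene symbols until the user types quit or exit."""
    while True:
        query = ask("Enter gene name of interest. Type quit to exit: ").strip()
        if query.lower() in ("quit", "exit"):
            show("Thanks for querying the data.")
            break
        description = genes.get(query.upper())
        if description is None:
            show(f"Not a valid gene name: {query}")
        else:
            show(f"{query} found! Here is the description:\n{description}")


# Invented sample rows in the symbol<TAB>description<TAB>category layout
sample = ["ABC1\tHypothetical gene one\t1.1\n", "Xyz2\tHypothetical gene two\t2.1\n"]
genes = build_gene_dict(sample)
print(genes.get("xyz2".upper()))
```

In the real program the lines would come from the filehandle returned by get_filehandle, and run_prompt(genes) would be called with its default ask=input.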
Remember to have these command line options:
$ python3 gene_names_from_chr21.py -i chr21_genes.txt
Output from this program should just go to <STDOUT>:
2. Write a program (call it find_common_cats.py) that counts how many genes are in each
category (1.1, 1.2, 2.1 etc.) based on data from the chr21_genes.txt file. The program should print
the results to an output file so that categories are arranged in ascending order (call the output
file OUTPUT/categories.txt). Read the paper to see what the categories represent and include
this as part of your output (this will be input from chr21_genes_categories.txt). Use Dictionaries to
solve this problem. HINT: Feel free to use a Dictionary of Dictionaries, but it is not required.
Note: you will notice that one gene has no category information. That's due to missing data in the
file. JUST IGNORE THIS GENE!
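A rough sketch of the counting step, assuming the category is the third tab-separated field and that any header row has already been skipped. The function names, the sample rows, the category descriptions, and the report column layout are invented for illustration; use the layout your assignment specifies.

```python
from collections import defaultdict


def count_categories(lines):
    """Count genes per category, ignoring genes with no category value."""
    counts = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or not fields[2].strip():
            continue  # missing category: ignore this gene
        counts[fields[2].strip()] += 1
    return dict(counts)


def format_report(counts, descriptions):
    """Build tab-separated report rows with categories in ascending order."""
    rows = ["Category\tOccurrence\tDescription"]
    # Plain string sorting is assumed to be adequate for labels like "1.1" or "2.1"
    for category in sorted(counts):
        rows.append(f"{category}\t{counts[category]}\t{descriptions.get(category, '')}")
    return "\n".join(rows)


# Invented rows; GENE_D has an empty category field and is skipped
sample = ["GENE_A\tdesc\t1.1\n", "GENE_B\tdesc\t2.1\n",
          "GENE_C\tdesc\t1.1\n", "GENE_D\tdesc\t\n"]
counts = count_categories(sample)
print(format_report(counts, {"1.1": "placeholder text", "2.1": "placeholder text"}))
```

The descriptions dictionary would be built from chr21_genes_categories.txt in the same way the gene file is read.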
Remember to have these command line options:
$ python3 find_common_cats.py -i1 chr21_genes.txt -i2 chr21_genes_categories.txt
Output to the file (OUTPUT/categories.txt) from this program:
Note <Occurrence Here> is a number
3. Write a program (call it intersection_of_gene_names.py) that finds all gene symbols that
appear both in the chr21_genes.txt file and in the HUGO_genes.txt file. These gene symbols
should be printed to a file in alphabetical order (you can hard code the output file
OUTPUT/intersection_output.txt). The program should also print on the terminal how many
common gene symbols were found. Use Lists or Sets to solve the problem. It is fine to use a
temporary Dictionary to find the intersection of two Lists, but this can be simplified with Sets. Note:
HUGO_genes.txt could have some duplicate entries.
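To illustrate why Sets simplify this, here is a small sketch. It assumes the gene symbol is the first tab-separated column and that any header row has already been skipped; the function name and the sample symbols are placeholders. Adding symbols to a set removes duplicate entries automatically, and the & operator gives the intersection.

```python
def first_column_set(lines):
    """Collect the unique, non-empty values of the first tab-separated column."""
    symbols = set()
    for line in lines:
        symbol = line.rstrip("\n").split("\t")[0].strip()
        if symbol:
            symbols.add(symbol)
    return symbols


# Placeholder data; the second list repeats GENE_B to mimic duplicate entries
chr21_lines = ["GENE_B\tdesc\t1.1\n", "GENE_A\tdesc\t2.1\n", "GENE_C\tdesc\t1.1\n"]
hugo_lines = ["GENE_B\tdesc\n", "GENE_B\tdesc\n", "GENE_A\tdesc\n", "GENE_Z\tdesc\n"]

chr21_symbols = first_column_set(chr21_lines)
hugo_symbols = first_column_set(hugo_lines)
common = sorted(chr21_symbols & hugo_symbols)  # alphabetical order for the output file

print(f"Number of common gene symbols found: {len(common)}")
```

Writing "\n".join(common) to the filehandle for OUTPUT/intersection_output.txt would complete the program.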
Remember to have these command line options:
$ python3 intersection_of_gene_names.py -i1 chr21_genes.txt -i2 HUGO_genes.txt
# the N's below are integers and are bolded for illustration only
Number of unique gene names in chr21_genes.txt: N
Number of unique gene names in HUGO_genes.txt: N
Number of common gene symbols found: N
Output stored in OUTPUT/intersection_output.txt
STDOUT is shown above, and the actual output of the intersection goes to the file
(OUTPUT/intersection_output.txt) from this program:
If you implemented intersection_of_gene_names.py correctly, this program can take any gene
file that has the gene symbol in the first column (even if it's the only column).
Additional examples: hgnc_complete_set_reduced.txt and gene_age.txt
$ python3 intersection_of_gene_names.py -i1 hgnc_complete_set_reduced.txt -i2 HUGO_genes.txt
Number of unique gene names in hgnc_complete_set_reduced.txt: 43547
Number of unique gene names in HUGO_genes.txt: 11815
Number of common gene symbols found: 8654
Output stored in OUTPUT/intersection_output.txt
$ python3 intersection_of_gene_names.py -i1 gene_age.txt -i2 chr21_genes.txt
Number of unique gene names in gene_age.txt: 307
Number of unique gene names in chr21_genes.txt: 285
Number of common gene symbols found: 4
Output stored in OUTPUT/intersection_output.txt
You must solve exercises 1 and 2 by using Dictionaries, and exercise 3 by using Lists or Sets.
