Bio Scope

BioScope
Advanced Search Grammar Tool for identification of Functional
Noncoding Elements

Principal Investigator - Hariharane Ramasamy
Sanjeev Mishra
Tulasi Ravuri
Summary
The completion of several genomic sequences has provided the
motivation for development of a tool that can aid in locating
and analyzing transcription factor binding sites (TFBS)
responsible for regulating gene transcription. TFBS are
short sequences, 4-20 bases in length, often located near the genes
they regulate. These sequences occur in groups or modules, also
called enhancers or cis-regulatory modules (CRM). CRM contain one
or more TFBS and interact with a specific combination of
transcription factors to regulate gene expression. Such
sequences are often abundant near the genes they regulate. The
goal of developmental biologists is to understand how these CRM
are organized in a genome and how they regulate genes. Laboratory
methods for locating CRM are often laborious
and time consuming. Hence computational methods have become an
invaluable tool. The success of computational methods depends on
how well they can be utilized in a lab environment. Several
computational tools exist to locate motifs in a genomic
sequence. These tools fall into two categories. Tools in the first
category employ statistical and probabilistic methods
using known motifs and the frequencies of codons in a genomic
sequence. Although some motifs have been discovered using these
tools, they often yield many false positives. Tools in the
second category employ fundamental principles of the
combinatorial logic underlying the occurrence of enhancers /
cis-regulatory modules (CRM). It is believed that genes with
similar temporal and spatial expression patterns are controlled
by similar CRM. Experimental biologists who are
knowledgeable about CRM occurrences need an efficient tool to
locate them by applying combinatorial knowledge such as
counts of binding site occurrences within a specified width,
logical combinations of one or more binding sites, orientation,
and more. Such tools should be efficient, scalable, and fast. The
aim of this proposal is to build such tools.

1 Introduction
Several genomes including the human and the mouse genomes have
been sequenced close to completion. In this post-genomic era, it
is imperative that researchers are equipped with novel
methodologies that enable them to rapidly and
accurately identify, annotate and functionally characterize
genes. Thus, mining genomics and proteomics data using
computational approaches appears to be the most practical way to extract
information from these resources in a short time frame. The
transcriptional regulation of a gene depends on the concerted
action of multiple transcription factors that bind to cis-
regulatory modules located in the vicinity of the gene. Cis-
regulatory modules are regulatory elements that occur close to
each other and control the spatial and temporal expression of
genes. The regulatory language that the genome uses to dictate
transcriptional dynamics can be revealed by identifying these
cis-regulatory elements. Often these elements are conserved
evolutionarily across organisms, with few mutations and
without losing their functional value. Knowledge of these motifs
may help drive discovery of similar genes in other closely
related organisms. The availability of accurate models along
with useful search methods with enhanced sensitivity and
specificity will be the first step in being able to detect
putative regulatory elements in a genome-wide manner.

2 Background

The identification of regulatory sequences and their location in
a genome is an important step in understanding gene
expression. Genes that have similar expression are believed to
have similar regulatory logic. Such genes are governed by unique
combinatorial transcriptional codes known as cis-acting
regulatory modules (CRMs) or enhancers. CRMs are oligonucleotide
sequences that act together to activate or suppress a gene.
Several past studies have examined
the behavior of enhancers and their role in developmental
biology. The experiments performed to study the expression of
a gene at a given developmental stage are often time consuming
and laborious. Computational tools are often sought by
biologists to scan the whole genome for better candidate
selection of these regulatory regions.

Several computational methods exist to predict regulatory
motif sequences. The motifs are over-represented near the genes
they regulate. Using prior knowledge and position-based
probabilities, several tools were built to predict new
regulatory motifs. CisAnalyst, developed by Berman et al. [4], has
been successfully applied to the fruit fly to find new clusters
using a purely computational approach. BioProspector uses a Gibbs
sampler to predict regulatory sequences. The main problems with
these tools are the presence of background noise and the
inability to differentiate between a true regulatory motif
and a false positive. In addition, variations in genomic
sequence across species further increase the noise. Although
computational methods have served well for purposes of finding
genes and even individual exons in genomic data, regulatory
element predictions have proven difficult.

Markstein et al. [3] developed a tool that lets biologists search using
prior knowledge of enhancers. The tool allows
biologists to input desired regular expressions over {A,T,G,C},
a gene name, and width and proximity constraints. However, the tool
is genome-specific and lacks some important
constraints such as distance to the next binding site, orientation
and order of the motifs, low-affinity sequences, variable-length
regular expressions, and user-defined overlap constraints.
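
Orientation, one of the constraints named above, can be handled by scanning both strands of the sequence. The following is a minimal sketch in Python, assuming motifs are plain regular expressions over A, C, G, T and that reverse-strand hits are found by searching the reverse complement. The function names are hypothetical and are not taken from any existing tool.

```python
import re

# Complement table for the four unambiguous bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_oriented(motif_regex, seq):
    """Return sorted (start, end, strand) hits of motif_regex on both strands.

    Coordinates are always reported on the forward strand, end exclusive.
    """
    n = len(seq)
    hits = [(m.start(), m.end(), "+") for m in re.finditer(motif_regex, seq)]
    for m in re.finditer(motif_regex, reverse_complement(seq)):
        # Map reverse-strand coordinates back onto the forward strand.
        hits.append((n - m.end(), n - m.start(), "-"))
    return sorted(hits)
```

Reporting every hit in forward-strand coordinates keeps later steps, such as width and distance checks, independent of the strand on which a motif was found.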

A brief survey of computational identification of regulatory
DNA is given by Papatsenko and Levine [2]. The
paper explains the need for computational tools and provides a
comparison of available tools without going into the specific
details of the algorithms. It also emphasizes the
need for fast and efficient computational tools.

3 Project Proposal

The project aims to provide the following:
1.restrictive search capabilities such as distance to the next
motif, orientation of the motif, low-affinity motifs, and order of
motif occurrence [5],
2.limited integrated information such as nearby genes/exons, gene
expression data, and annotation details around the target once it is
located [5],
3.interactive chain search, where a search for a target in an
organism can be linked to an intra-species or cross-species search,
4.scalable and efficient operation.

More importantly, our proposed module will be highly flexible,
allowing constant integration of newer genomes and at the same
time being a powerful tool that will allow the researcher to
search for complex gene clusters.

To that end we developed a software program that locates
regulatory regions more precisely, and with far more ease for
the researcher, than currently available programs. More
importantly, control over the program's results will be
given to the developmental biologist. The tool is well suited to a
lab environment.

3.1 Phase I Specific Aims
1.To develop a web-based module that allows the researcher to
search for cis-regulatory elements. The tool will take motifs
and search constraints as input, as shown in figure 1, and will display
results as shown in figures 2 and 3. The search feature of the
program will provide
◦the ability to enter 10 regular expressions using A,T,G,C
and the letters given in the table below
◦an option to allow self-overlap
◦the capacity to input a name for each motif
◦a box to specify a width constraint
◦the flexibility to input logical combinations of the motifs typed
in the first item, such as (2A and 2B) or (A or B or C)
◦the ability to disallow overlap across the motifs typed in the first
item
◦a field to enter the name of a gene that must lie within a specified
distance once a cluster is found using the above rules
◦a name under which to save the results. The name can be used in
SuperCluster
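
The width constraint and a count rule such as (2A and 2B) can be sketched as a window test over motif hit positions. This is an illustrative Python sketch only: it assumes hits have already been located, represents "2A and 2B" as a dictionary of minimum counts, and handles AND-combinations only. The function name and data layout are hypothetical.

```python
def find_clusters(hits, min_counts, width):
    """Return (start, end) spans no longer than `width` that satisfy min_counts.

    hits       -- dict: motif name -> list of (start, end) positions, end exclusive
    min_counts -- dict: motif name -> minimum number of hits required,
                  e.g. {"A": 2, "B": 2} for the rule "2A and 2B"
    width      -- maximum cluster length in bases
    """
    # Keep only motifs named in the rule, ordered by start position.
    events = sorted(
        (s, e, name) for name in min_counts for s, e in hits.get(name, [])
    )
    clusters = []
    for i, (win_start, _, _) in enumerate(events):
        counts = {name: 0 for name in min_counts}
        win_end = win_start
        for s, e, name in events[i:]:
            if s >= win_start + width:
                break  # every later hit starts outside the window
            if e > win_start + width:
                continue  # this hit sticks out of the window
            counts[name] += 1
            win_end = max(win_end, e)
            if all(counts[n] >= k for n, k in min_counts.items()):
                clusters.append((win_start, win_end))
                break  # shortest span from this start has been found
    return clusters
```

For example, with hits `{"A": [(10, 16), (30, 36)], "B": [(20, 25), (40, 45)]}`, the rule `{"A": 2, "B": 2}` and a width of 50, the sketch reports the single span `(10, 45)`. One span is tried per starting hit, so overlapping clusters are possible; an OR-rule could be evaluated by running the test once per alternative.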

Letter  Bases
B       C,G,T
D       A,G,T
H       A,C,T
K       G,T
M       A,C
N       A,C,G,T
R       A,G
S       C,G
V       A,C,G
W       A,T
Y       C,T
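
The letter codes in the table above are the standard IUPAC nucleotide ambiguity codes, so a motif written with them can be expanded into an ordinary regular expression. The sketch below is one possible way to do this in Python and also illustrates the self-overlap option. The function names and the `allow_self_overlap` parameter are assumptions made for illustration, not the tool's actual interface.

```python
import re

# Standard IUPAC nucleotide ambiguity codes (Y = C or T).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "B": "CGT", "D": "AGT", "H": "ACT", "K": "GT", "M": "AC",
    "N": "ACGT", "R": "AG", "S": "CG", "V": "ACG", "W": "AT", "Y": "CT",
}


def iupac_to_regex(motif):
    """Expand each ambiguity letter of an IUPAC motif into a character class."""
    parts = []
    for ch in motif.upper():
        if ch not in IUPAC:
            raise ValueError(f"not an IUPAC nucleotide code: {ch!r}")
        bases = IUPAC[ch]
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return "".join(parts)


def find_motif(motif, seq, allow_self_overlap=False):
    """Return (start, end) hits of an IUPAC motif in seq, end exclusive."""
    pattern = iupac_to_regex(motif)
    if allow_self_overlap:
        # A zero-width lookahead lets the scan resume one base after each hit.
        regex = re.compile(f"(?=({pattern}))")
        return [(m.start(1), m.end(1)) for m in regex.finditer(seq)]
    return [(m.start(), m.end()) for m in re.finditer(pattern, seq)]
```

With self-overlap disabled, each base belongs to at most one hit of a given motif; with it enabled, a motif such as `ANA` is reported at every position where it fits.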

4 Summary: Significance of proposed work
The tool will also provide integration and maintenance features,
including:
1. Updating to new versions of genomic sequences when they become
available from the public site.
2. Rerunning earlier queries and automatically notifying users via
email of new results.
3. Integration with Gene Ontology information and other useful
databases as advised by biologists.
4. A workflow-like tool that takes a query run on one
organism and applies it to another organism with a single keystroke.
5. Storage and maintenance of results.

5 Commercialization Strategy
After the Phase I launch, every person who visits the site will be
asked to fill in a profile, including the purpose of the visit,
before being given access to the program. Visitors will
also be asked for feedback, which will be collected and
used as leads in preparing the BioRegulatory Appliance in Phase II.

6 KEY PERSONNEL

1)Hariharane Ramasamy is pursuing his PhD in Computer Science at
Illinois Institute of Technology, IL, and has more than 15
years of experience in developing applied computational tools
for biomedical engineering. A few relevant tools include:
•a motif search system for genomic sequences that
displays the results graphically on the screen along with the
sequence annotation,
•a surveillance system to detect novel sequences,
•a program that calculates the digest of peptides for
user-input proteins and also performs differential combination
of post-translational modifications along with pI/Mw calculations,
•pattern-induced multiple alignment using properties of amino
acids,
•a new extended genetic algorithm for 3D lattice simulation of
protein folding using conflicting criteria,
•simulation of human stand-sit movement using a 3-link stick figure
model.

Sanjeev Mishra
Sanjeev Mishra is a seasoned professional with about 20 years
of industry experience. He spent half of that time in
startups in the fields of business activity management, business
intelligence, and mobile application and management platforms,
and the other half in research and development. He holds one US
patent. Sanjeev is passionate about biking, hiking, running,
meditation and gardening. Sanjeev holds a master's degree in
Physics from DBS College Dehradun, India.

Tulasi Ravuri
Tulasi Ravuri is an experienced software engineering manager
with 23 years of experience at several Silicon Valley companies
such as Unisys, Novell, McAfee, DoCoMo Labs and others. Throughout
his broad career he has helped bring several products to market.
His most recent work is on a Life Sciences Regulatory Compliance
and Administration software suite used by universities such as
Stanford, Berkeley and Harvard; pharma companies such as GSK;
hospitals such as Palo Alto Medical Foundation; and government.
He advises several software companies and is an advocate of open
source software. He has an MSCS from University of Louisiana and a
BS (Chemical Engg.) from Andhra University, India.

7 Consultants
In Phase I, the following help will be used to guide the program
to Phase II:
1. Two student interns for refining the search and gathering
data on the abilities of the program
2. A consultant for designing the user interface and graphics display

8 Prior Support
The proposal has no prior or current support.

References cited
[1] Marc S. Halfon, Yonatan Grad, George M. Church, Alan M.
Michelson, Computation-Based Discovery of Related
Transcriptional Regulatory Modules and Motifs Using an
Experimentally Validated Combinatorial Model. Howard Hughes
Medical Institute and Department of
Medicine, Brigham and Women's Hospital, Linköping University,
Sweden.
[2] Dmitri Papatsenko, Michael Levine, Computational
identification of regulatory DNAs underlying animal development.
Nature Methods, Vol. 2 No. 7:529-534, 2005.
[3] Markstein, M., Markstein, P., Markstein, V., Levine, M.S.,
Genome-wide analysis of clustered Dorsal binding sites
identifies putative target genes in the Drosophila embryo.
Proc. Natl Acad. Sci. USA, Vol. 99:763-768, 2002.
[4] Benjamin P. Berman, Barret D. Pfeiffer, Todd R. Laverty,
Steven L. Salzberg, Gerald M. Rubin, Michael B. Eisen and Susan E.
Celniker, Computational identification of developmental
enhancers: conservation and function of transcription factor
binding-site clusters in Drosophila melanogaster and Drosophila
pseudoobscura. Genome Biology, Vol. 5:R81, 2004.
[5] Alan M. Michelson, Deciphering genetic regulatory codes: A
challenge for functional genomics. PNAS, Vol. 99 No. 2, 546-548,
2002.
[6] Matthias Harbers, Piero Carninci, Tag-based approaches for
transcriptome research and genome annotation. Nature Methods,
Vol. 2, No. 7, 499-502, 2005.
[7] Yueyi Liu, Liping Wei, Serafim Batzoglou, Douglas L. Brutlag,
Jun S. Liu and X. Shirley Liu, A suite of web-based programs to
search for transcriptional regulatory motifs. Nucleic Acids
Research, Vol. 32 Web Server Issue, 2004.
[8] Mike P. Liang, Olga G. Troyanskaya, Alain Laederach, Douglas
L. Brutlag, and Russ B. Altman, Computational Functional Genomics.
IEEE Signal Processing Magazine, 2004.

Budget

Description                          Expense Amount for 6 months
Salary for Principal Investigator    $36,000
Salary for Software engineer         $30,000
Salary for 2 student interns         $24,000
Salary for Biology consultant        $24,000
Hardware and Software cost (4)       $24,000
Internet & Cloud hosting services    $12,000
Miscellaneous expenses               $6,000
Office rent & expenses               $15,000
Travel                               $5,000
Total Cost                           $176,000

Figure 1: Input web form to search the genomic sequence using
user defined constraints

Figure 2: Results summary

Figure 3: Detailed results display for

Figure 4: Flow chart describing the flow of the algorithm

Figure 5: Diagram describing the Phase I flow

Appendix

The ultimate goal is to build a self-contained BioRegulatory
appliance that supports automatic updates of the genomic
sequences, reruns old queries on the new sequences, and informs
users of new results, thereby saving an enormous amount of time for
developmental biologists who depend on computers to locate
the target.

Phase II Plan

Specific Aims - To enhance the available module, Biocis, so that
it is user friendly and easy for a
researcher to navigate. Phase II will also aim to create a workflow module
that will allow easy storage and retrieval of data from
disparate sources and will integrate with useful information.
The Phase II features will include
1.Advanced regular expression search tool for genomic sequences
that uses prebuilt index positions for words of 4 bases (AAAA,
AAAG, ..., GCGC, ..., TTTT) to locate the motifs.
2.Advanced multithreaded server tool to perform fast parallel
search of the motif sequences.
3.Advanced caching in memory/disk and database to avoid repeated
searches of previously searched sequences.
4.Automated daemon process to get new releases, rerun the
saved searches, and inform scientists via email of new results.
5.Link to the Gene Ontology database, which provides gene function
information.
6.Cross-species ortholog results from existing public annotated
databases.
7.Simple statistical tools to look at the motif occurrences across
the whole genome for interesting results.
8.Creation of the BioRegulatory software package and a plan for
designing a spec for the BioRegulatory Appliance.
9.A SuperCluster tool which will perform a search similar to that
in Aim I. The inputs A-J are the names of the searches performed in Aim
I. The tool will help support the theory that clusters of
enhancers act together in regulating a gene. A sample input form is
shown in figure 6.
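
The prebuilt index of 4-base words in item 1 can be illustrated with a small sketch. It assumes the index maps each 4-base word to the positions where it starts, and that a motif containing a literal 4-base run uses that run as a seed so that only indexed positions are tested. The function names and the seed/offset interface are hypothetical; the proposal does not specify how the index is queried.

```python
import re
from collections import defaultdict

K = 4  # word length of the prebuilt index (AAAA ... TTTT)


def build_index(seq, k=K):
    """Map every k-base word in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def seeded_search(regex, seed, seed_offset, seq, index):
    """Find matches of `regex` that contain the literal `seed` at `seed_offset`.

    Only positions listed in the index for `seed` are tested, so the
    sequence is not scanned from end to end.
    """
    pattern = re.compile(regex)
    hits = []
    for pos in index.get(seed, []):
        start = pos - seed_offset
        if start < 0:
            continue  # the motif would begin before the sequence does
        m = pattern.match(seq, start)  # anchored match at the candidate start
        if m:
            hits.append((m.start(), m.end()))
    return hits
```

The index is built once per genome and reused across queries. A motif with no unambiguous 4-base run would need another strategy, such as expanding an ambiguous word into all of its concrete 4-base words.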

3.1.2 Phase III
Phase III will concentrate on

•Creating a sound computing infrastructure. The infrastructure
requires writing a separate server to provide the
search and caching capabilities. The search module will not be run
via a web server, as in some of the existing tools. Every request
to perform a search on the web server means the whole genome
sequence will be read into memory. The length of a genomic sequence
varies from 1 megabyte to 200 megabytes. If the
number of users on the system grows, the system will run out of
memory, thus imposing a limit on the number of users. Using a
web server to preload the data during startup is not advisable.
Hence a separate server that performs the search for any generic
genome sequence is needed. The caching in Phase I is achieved at
two levels: memory and disk.
•Adding more features to the query, creating
continuity in search.

For example, once a search is performed, the results will display
genes along with their orthologs in other species. The same
enhancer search can then be run immediately on the species that
has the closest orthologs. Phase III will also look at improving
the performance of the BioRegulatory appliance.
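
The two-level caching (memory, then disk) mentioned above for the separate search server can be sketched as follows. This is a minimal Python illustration under stated assumptions: results are JSON-serializable lists, the cache key combines a genome identifier with the query text, and the class and function names are hypothetical.

```python
import hashlib
import json
import os


class TwoLevelCache:
    """Look up search results in memory first, then on disk."""

    def __init__(self, cache_dir):
        self.memory = {}
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, key):
        # Hash the key so that any query text yields a safe file name.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, digest + ".json")

    def get(self, key):
        if key in self.memory:
            return self.memory[key]
        path = self._path(key)
        if os.path.exists(path):
            with open(path, encoding="utf-8") as fh:
                value = json.load(fh)
            self.memory[key] = value  # promote to the memory level
            return value
        return None

    def put(self, key, value):
        self.memory[key] = value
        with open(self._path(key), "w", encoding="utf-8") as fh:
            json.dump(value, fh)


def cached_search(cache, genome_id, query, search_fn):
    """Run search_fn(genome_id, query) only if no cached result exists."""
    key = f"{genome_id}|{query}"
    result = cache.get(key)
    if result is None:
        result = search_fn(genome_id, query)
        cache.put(key, result)
    return result
```

The disk level survives a server restart, so a repeated query avoids reloading and rescanning the genome. A production server would also need eviction and invalidation when a new genome release arrives, which this sketch omits.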

Figure 6: SuperCluster - Web form for user input
