BioScope
  Advanced Search Grammar Tool for identification of Functional
                        Noncoding Elements

          Principal Investigator - Hariharane Ramasamy
                         Sanjeev Mishra
                          Tulasi Ravuri
Summary
The completion of several genomic sequences has provided the
motivation for development of a tool that can aid in locating
and analyzing transcription factor binding sites (TFBS)
responsible for regulating gene transcription. TFBS are
short sequences, 4-20 bases in length, often located near the genes
they regulate. These sequences occur in groups or modules also
called enhancer or cisRegulatory modules (CRM). CRM contain one
or more TFBS and interact with a specific combination of
transcription factors to regulate gene expression. Such
sequences are often abundant near the genes they regulate. The
goal of developmental biologists is to understand how these CRM
are organized in a genome and how they regulate genes. Laboratory
methods for locating CRM are often laborious
and time consuming. Hence computational methods have become an
invaluable tool. The success of computational methods depends on
how well they can be utilized in a lab environment. Several
computational tools exist to locate motifs in a genomic
sequence. These tools fall into two categories. Tools in the first
category employ statistical and probabilistic methods
using known motifs and the frequencies of codons in a genomic
sequence. Although some motifs have been discovered using these
tools, they often yield many false positives. Tools in the
second category employ fundamental principles of the
combinatorial logic underlying the occurrence of the enhancers /
cisRegulatory modules (CRM). It is believed that genes with
similar temporal and spatial expression patterns are controlled
by similar CRM. The experimental biologists who are
knowledgeable about CRM occurrences need an efficient tool to
locate them by applying the combinatorial knowledge such as
counts of the binding site occurrences within a specified width,
logical combination of one or more binding sites, orientation
and more. The tools should be efficient, scalable, and fast. The
aim of this proposal is to build such tools.

1 Introduction
Several genomes including the human and the mouse genomes have
been sequenced close to completion. In this post-genomic era, it
is imperative that researchers are equipped with novel
methodologies that will enable them to rapidly and
accurately identify, annotate and functionally characterize
genes. Thus, mining of genomics and proteomics data using
computational approaches seems to be the superior way to extract
information from these resources in a short time frame. The
transcriptional regulation of a gene depends on the concerted
action of multiple transcription factors that bind to cis-
regulatory modules located in the vicinity of the gene. Cis-
regulatory modules are regulatory elements that occur close to
each other and control the spatial and temporal expression of
genes. The regulatory language that the genome uses to dictate
transcriptional dynamics can be revealed by identifying these
cis-regulatory elements. Often these elements are conserved
across organisms through evolution, with few mutations and
without loss of functional value. Knowledge of these motifs
may help drive discovery of similar genes in other closely
related organisms. The availability of accurate models along
with useful search methods with enhanced sensitivity and
specificity will be the first step in being able to detect
putative regulatory elements in a genome-wide manner.

2 Background

The identification of regulatory sequences and their location in
a genome is an important step in understanding gene
expression. Genes that have similar expression are believed to
have similar regulatory logic. Such genes are governed by unique
combinatorial transcriptional codes known as cis-acting
regulatory modules (CRMs) or enhancers. CRMs are oligonucleotide
sequences that act together to activate or suppress the gene. In
the past, several studies have been performed in understanding
the behavior of enhancers and their role in developmental
biology. The experiments performed to study the expression of
a gene at a developmental stage are often time consuming
and laborious. Biologists therefore often turn to computational
tools to scan the whole genome and select better candidates
for these regulatory regions.

Several computational methods exist to predict regulatory
motif sequences. The motifs are over-represented near the genes
they regulate. Using prior knowledge and position-based
probabilities, several tools were built to predict new
regulatory motifs. CisAnalyst, developed by Berman et al. [4], has
been successfully applied to the fruit fly to find new clusters
using a purely computational approach. BioProspector uses a Gibbs
sampler to predict regulatory sequences. The main problems with
these tools are the presence of background noise and the
inability to differentiate between a true regulatory motif
and a false positive. Besides, the variation in genomic
sequence across species further increases the noise. Although
computational methods have served well for purposes of finding
genes and even individual exons in genomic data, regulatory
element predictions have proven difficult.

Markstein et al. [3] developed a tool that lets biologists search
using prior knowledge of enhancers. The tool allows
biologists to input desired regular expressions over {A,T,G,C},
a gene name, width, and proximity constraints. However, the tool
is genome-specific and lacks some important
constraints such as distance to the next binding site, orientation
and order of the motifs, low-affinity sequences, variable-length
regular expressions, and user-defined overlap constraints.

A brief survey of computational identification of regulatory
DNA is given by Dmitri Papatsenko and Michael Levine [2]. The
paper explains the need for computational tools and provides a
comparison of available tools without going into the specific
details of the algorithms. The article also emphasizes the
need for fast and efficient computational tools.

3 Project Proposal

The project aims to provide the following:
1. Restrictive search capabilities such as distance to the next
motif, orientation of the motif, low-affinity motifs, and order of
motif occurrence [5].
2. Limited integrated information such as nearby genes/exons, gene
expression data, and annotation details around the target once it is
located [5].
3. Interactive chain search, where a search for a target in one
organism can be linked to an intra-species or cross-species search.
4. Scalable and efficient operation.
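
The orientation and distance constraints in item 1 can be illustrated with a minimal Python sketch. It is not the project's implementation: the function names (`reverse_complement`, `find_sites`, `ordered_pairs`) are hypothetical, the sequence is assumed to contain only uppercase A, C, G, T, and a hit on the reverse strand is modelled simply as a match of the reverse complement on the forward strand.

```python
import re

# Translation table for complementing uppercase DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(site):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return site.translate(COMPLEMENT)[::-1]


def find_sites(sequence, site):
    """Return sorted (start, strand) hits of an exact site.

    '+' means the site matches as typed; '-' means its reverse
    complement matches, i.e. the site lies on the opposite strand.
    """
    hits = []
    for strand, pattern in (("+", site), ("-", reverse_complement(site))):
        # The lookahead reports overlapping occurrences as well.
        for match in re.finditer("(?=" + re.escape(pattern) + ")", sequence):
            hits.append((match.start(), strand))
    return sorted(hits)


def ordered_pairs(hits_a, len_a, hits_b, max_gap):
    """Pair each A hit with every B hit that starts after A ends,
    at most max_gap bases downstream (order and distance constraint)."""
    pairs = []
    for a_start, a_strand in hits_a:
        a_end = a_start + len_a
        for b_start, b_strand in hits_b:
            gap = b_start - a_end
            if 0 <= gap <= max_gap:
                pairs.append(((a_start, a_strand), (b_start, b_strand), gap))
    return pairs
```

A caller could restrict results to one orientation by filtering on the strand field, or require a fixed motif order by choosing which hit list is passed as `hits_a`.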


More importantly, the proposed module will be highly flexible,
allowing continual integration of newer genomes while
remaining a powerful tool that allows the researcher to
search for complex gene clusters.

To that end we developed a software program that will locate
the regulatory region more precisely, and with far more ease for
the researcher, than currently available programs. More
importantly, control over the results of the program will be
given to the developmental biologist. The tool is well suited to a
lab environment.
3.1 Phase I Specific Aims
1. To develop a web-based module that allows the researcher to
search for cis-regulatory elements. The tool will accept motifs
and search constraints as shown in Figure 1 and will display
results as shown in Figures 2 and 3. The search feature of the
program will provide
        ◦the ability to enter up to 10 regular expressions using
        A,T,G,C and the letters given in the table below
        ◦an option to allow self-overlap
        ◦a field to enter a name for each motif
        ◦a box to specify the width constraint
        ◦the flexibility to enter logical combinations of the motifs
        typed in the first item, such as (2A and 2B), (A or B or C)
        ◦the ability to disallow overlap across the motifs typed in
        the first item
        ◦a field to enter the name of a gene that must lie within a
        specified distance once a cluster is found using the above rules
        ◦a name under which to save the results, which can be used in
        SuperCluster
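
The width constraint and logical combinations such as (2A and 2B) can be sketched as a windowed count over motif hit positions. This is a hedged Python illustration only: `count_in_window` and `find_clusters` are hypothetical names, hit positions are assumed to be precomputed and sorted, and for simplicity a hit is counted when its start position falls inside the window, not when the whole site does.

```python
from bisect import bisect_left


def count_in_window(positions, start, width):
    """Count sorted positions p with start <= p < start + width."""
    return bisect_left(positions, start + width) - bisect_left(positions, start)


def find_clusters(hits, width, rule):
    """Return (window_start, counts) for every window that satisfies rule.

    hits  : dict mapping a motif name to a sorted list of hit starts
    width : window width in bases
    rule  : function taking {name: count} and returning True or False
    Windows are anchored at each hit position, so every reported
    cluster begins at an actual binding site.
    """
    anchors = sorted({p for positions in hits.values() for p in positions})
    clusters = []
    for start in anchors:
        counts = {name: count_in_window(positions, start, width)
                  for name, positions in hits.items()}
        if rule(counts):
            clusters.append((start, counts))
    return clusters
```

Under these assumptions the combination (2A and 2B) would be expressed as `lambda c: c["A"] >= 2 and c["B"] >= 2`, and (A or B or C) as `lambda c: c["A"] >= 1 or c["B"] >= 1 or c["C"] >= 1`.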


       Letter       Bases
       B            C,G,T
       D            A,G,T
       H            A,C,T
       K            G,T
       M            A,C
       N            A,C,G,T
       R            A,G
       S            C,G
       V            A,C,G
       W            A,T
       Y            C,T
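
As an illustration of how the letters in the table could drive a search, the following Python sketch translates a motif into a regular expression and optionally reports self-overlapping hits. The function names are hypothetical, Y is taken as C or T (its standard IUPAC meaning), and only the strand as given is scanned.

```python
import re

# Ambiguity letters from the table, plus the four plain bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "B": "CGT", "D": "AGT", "H": "ACT", "K": "GT", "M": "AC",
    "N": "ACGT", "R": "AG", "S": "CG", "V": "ACG", "W": "AT", "Y": "CT",
}


def iupac_to_regex(motif):
    """Translate a motif such as 'GGNW' into a regex such as 'GG[ACGT][AT]'."""
    parts = []
    for letter in motif.upper():
        bases = IUPAC[letter]  # raises KeyError for letters outside the table
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


def find_motif(sequence, motif, allow_self_overlap=True):
    """Return 0-based start positions of motif hits in sequence."""
    pattern = iupac_to_regex(motif)
    if allow_self_overlap:
        # A zero-width lookahead lets the scan resume one base after
        # each hit, so overlapping occurrences are all reported.
        pattern = "(?=" + pattern + ")"
    return [m.start() for m in re.finditer(pattern, sequence.upper())]
```

With self-overlap allowed, the motif AA is found three times in AAAA (positions 0, 1, 2); with it disallowed, only twice (positions 0 and 2).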


4 Summary: Significance of proposed work
The tool will also provide integration and maintenance features,
including
1. Updating to new versions of genomic sequences when they become
available from the public site.
2. Rerunning saved searches and automatically notifying users by
email of new results.
3. Integration with Gene Ontology information and other useful
databases as advised by biologists.
4. A workflow-like tool that takes a query run on one
organism and applies it to another organism with a single keystroke.
5. Storage and maintenance of results.

5 Commercialization Strategy
After the Phase I launch, every person who visits the site will be
asked to complete a profile, including the purpose of the visit,
before being given access to the program. Visitors will
also be asked for feedback, which will be collected and
used as leads to prepare the BioRegulatory Appliance in Phase II.

6 KEY PERSONNEL

1) Hariharane Ramasamy is pursuing a PhD in Computer Science at
Illinois Institute of Technology, IL, and has more than 15
years of experience in developing applied computational tools
for biomedical engineering. A few relevant projects include
•Implemented a motif search system for genomic sequences that
displays the results graphically on the screen along with the
sequence annotation.
•Developed a surveillance system to detect novel sequences.
•Developed a program that calculates the digest of peptides for
user-input proteins and also performs differential combination
of post-translational modifications along with pI/Mw calculations.
•Pattern-induced multiple alignment using properties of amino
acids.
•New extended genetic algorithm for 3D lattice simulation of
protein folding using conflicting criteria.
•Simulation of human stand-sit movement using a 3-link stick figure
model.

Sanjeev Mishra
Sanjeev Mishra is a seasoned professional with about 20 years
of industry experience. He has spent half of his career in
startups in the fields of business activity management, business
intelligence, and mobile application and management platforms,
and the other half in research and development. He has been
awarded one US patent. Sanjeev is passionate about biking, hiking,
running, meditation and gardening. Sanjeev holds a master's degree in
Physics from DBS College Dehradun, India.


Tulasi Ravuri
Tulasi Ravuri is an experienced software engineering manager
with 23 years of experience at several Silicon Valley companies
such as Unisys, Novell, McAfee, DoCoMo Labs and others. Throughout
his broad career he has helped bring several products to market.
His most recent work is on a Life Sciences Regulatory Compliance
and Administration software suite used by universities such as
Stanford, Berkeley, and Harvard; pharma companies such as GSK;
hospitals such as Palo Alto Medical Foundation; and government.
He advises several software companies and is an advocate of open
source software. He has an MSCS from the University of Louisiana and
a BS (Chemical Engg.) from Andhra University, India.


7 Consultants
In Phase I, the following help will be used to guide the program
to Phase II:
1. Two student interns for refining the search and gathering
data on the abilities of the program
2. A consultant for designing the user interface and graphics display

8 Prior Support
The proposal has no prior or current support.

References cited
[1] Marc S. Halfon, Yonatan Grad, George M. Church, Alan M.
Michelson, Computation-Based Discovery of Related
Transcriptional Regulatory Modules and Motifs Using an
Experimentally Validated Combinatorial Model. Howard Hughes
Medical Institute and Department of
Medicine, Brigham and Women's Hospital, Linköping University,
Sweden.
[2] Dmitri Papatsenko, Michael Levine, Computational
identification of regulatory DNAs underlying animal development.
Nature Methods, Vol. 2 No. 7:529-534, 2005.
[3] Markstein, M., Markstein, P., Markstein, V., Levine, M.S.,
Genome-wide analysis of clustered Dorsal binding sites
identifies putative target genes in the Drosophila embryo.
Proc. Natl Acad. Sci. USA, Vol. 99:763-768, 2002.
[4] Benjamin P. Berman, Barret D. Pfeiffer, Todd R. Laverty,
Steven L. Salzberg, Gerald M. Rubin, Michael B. Eisen and Susan E.
Celniker, Computational identification of developmental
enhancers: conservation and function of transcription factor
binding-site clusters in Drosophila melanogaster and Drosophila
pseudoobscura. Genome Biology, Vol. 5:R81, 2004.
[5] Alan M. Michelson, Deciphering genetic regulatory codes: A
challenge for functional genomics. PNAS, Vol. 99 No. 2, 546-548,
2002.
[6] Matthias Harbers, Piero Carninci, Tag-based approaches for
transcriptome research and genome annotation. Nature Methods,
Vol. 2, No. 7, 499-502, 2005.
[7] Yueyi Liu, Liping Wei, Serafim Batzoglou, Douglas L. Brutlag,
Jun S. Liu and X. Shirley Liu, A suite of web-based programs to
search for transcriptional regulatory motifs. Nucleic Acids
Research, Vol. 32 Web Server Issue, 2004.
[8] Mike P. Liang, Olga G. Troyanskaya, Alain Laederach, Douglas
L. Brutlag, and Russ B. Altman, Computational Functional Genomics.
IEEE Signal Processing Magazine, 2004.

Budget


   Description                          Expense Amount for 6 months
   Salary for Principal Investigator    $36,000
   Salary for Software engineer         $30,000
   Salary for 2 student interns         $24,000
   Salary for Biology consultant        $24,000
   Hardware and Software cost (4)       $24,000
   Internet & Cloud hosting services    $12,000
   Miscellaneous expenses               $6,000
   Office rent & expenses               $15,000
   Travel                               $5,000
   Total Cost                           $176,000
Figure 1: Input web form to search the genomic sequence using
user-defined constraints
Figure 2: Results summary
Figure 3: Detailed results display for
Figure 4: Flow chart describing the flow of the algorithm
Figure 5: Diagram describing the Phase I flow
Appendix

 The ultimate goal is to build a self-contained BioRegulatory
appliance that supports automatic updates of the genomic
sequences, reruns old queries on the new sequences, and informs
users of new results, thereby saving an enormous amount of time for
developmental biologists who depend on computers to locate
their targets.

Phase II Plan

Specific Aims - To enhance the available module, Biocis, so that
it is user friendly and easy for a researcher to navigate.
Phase II will also aim to create a workflow module
that will allow easy storage and retrieval of data from
disparate sources and will integrate with useful information.
The Phase II features will include
1. An advanced regular expression search tool for genomic sequences
that uses prebuilt indexes of the positions of all 4-base words
(AAAA, AAAG, ..., GCGC, ..., TTTT) to locate the motifs.
2. An advanced multithreaded server tool to perform fast parallel
searches of the motif sequences.
3. Advanced caching in memory, on disk, and in a database to avoid
repeating previous searches.
4. An automated daemon process to fetch new releases, rerun the
saved searches, and inform scientists by email of new results.
5. A link to the Gene Ontology database, which provides gene function
information.
6. Cross-species ortholog results from existing public annotated
databases.
7. Simple statistical tools to look at the motif occurrences across
the whole genome, starting from the interesting results.
8. Creation of the BioRegulatory software package and a plan for
designing a specification for the BioRegulatory Appliance.
9. A SuperCluster tool that will perform a search similar
to that in Aim I.
10. The inputs A-J of the SuperCluster tool are the names of
searches performed in Aim I. The tool will help support the theory
that clusters of enhancers act together in regulating a gene. A
sample input form is shown in Figure 6.
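
The prebuilt 4-base index in item 1 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the planned design: `build_index` and `find_exact` are hypothetical names, the index is held in memory as a dictionary, and only exact A/C/G/T motifs of at least four bases are handled. A regular expression search would additionally have to enumerate the 4-base seeds that its first positions can match.

```python
from collections import defaultdict

K = 4  # word length used by the index


def build_index(sequence, k=K):
    """Map every k-base word to the sorted list of its start positions."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


def find_exact(sequence, index, motif, k=K):
    """Locate an exact motif by seeding with its first k bases.

    Only positions listed under the seed word are verified, so the
    full sequence is never rescanned.
    """
    if len(motif) < k:
        raise ValueError("motif must be at least %d bases long" % k)
    seed = motif[:k]
    return [p for p in index.get(seed, []) if sequence.startswith(motif, p)]
```

The index is built once per genome release and reused across queries, which is where the saving over a fresh scan per request would come from.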



3.1.2 Phase III
Phase III will focus on the following:
•Creating a sound computing infrastructure. The infrastructure
requires writing a separate server to provide the
search and caching capabilities. The search module will not be run
inside a web server as in some of the existing tools, because every
request to perform a search on a web server means that the whole
genome sequence is read into memory. Genomic sequences
vary from 1 megabyte to 200 megabytes in length. As the
number of users on the system grows, the system will run out of
memory, which imposes a limit on the number of users. Using a
web server to preload the data during startup is not advisable.
Hence a separate server that performs the search for any generic
genome sequence is needed. The caching in Phase I is achieved at
two levels: memory and disk.
•Adding more features to the query and creating
continuity in search.

For example, once a search has been performed, the results will
display genes along with their orthologs in other species. The same
enhancer search can then be run immediately on the species that
has the closest orthologs. Phase III will also look at improving
the performance of the BioRegulatory appliance.
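
The two-level caching (memory, then disk) mentioned for the search server could look roughly like the Python sketch below. The class name `TwoLevelCache`, the JSON file format, and the use of a hashed query string as the file name are all assumptions made for illustration; the memory level is unbounded here, whereas a real server would need an eviction policy.

```python
import hashlib
import json
import os


class TwoLevelCache:
    """Look up a search result in memory first, then on disk."""

    def __init__(self, directory):
        self.memory = {}
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, key):
        # Hash the query string so any key maps to a safe file name.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return os.path.join(self.directory, digest + ".json")

    def get(self, key, compute):
        """Return the cached value for key, calling compute() on a miss."""
        if key in self.memory:
            return self.memory[key]
        path = self._path(key)
        if os.path.exists(path):
            with open(path) as handle:
                value = json.load(handle)
        else:
            value = compute()
            with open(path, "w") as handle:
                json.dump(value, handle)
        self.memory[key] = value
        return value
```

A key might combine the genome release, the motif list, and the constraints, so that a new genome release naturally misses the cache and triggers a fresh search.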
Figure 6: SuperCluster - Web form for user input

More Related Content

PPTX
2017 amp benchmarking_poster_justin
ODP
Bioc strucvariant seattle_11_09
PDF
2017 agbt benchmarking_poster
PDF
Giab ashg 2017
PPTX
Tools for Using NIST Reference Materials
PPTX
AI in Bioinformatics
PPTX
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511
PPTX
Genome in a Bottle- reference materials to benchmark challenging variants and...
2017 amp benchmarking_poster_justin
Bioc strucvariant seattle_11_09
2017 agbt benchmarking_poster
Giab ashg 2017
Tools for Using NIST Reference Materials
AI in Bioinformatics
GIAB Benchmarks for SVs and Repeats for stanford genetics sv 200511
Genome in a Bottle- reference materials to benchmark challenging variants and...

What's hot (20)

PDF
Usual Questions with Unusual Answers: Application of Multi-class Supervised A...
PPTX
Aug2015 Ali Bashir and Jason Chin Pac bio giab_assembly_summary_ali3
PPTX
ASHG 2015 Genome in a bottle
PPTX
GIAB-GRC workshop oct2015 giab introduction 151005
PPTX
Aug2013 illumina platinum genomes
PPTX
GIAB Technical Germline Benchmark roadmap discussion
PDF
Ga4gh 2019 - Assuring data quality with benchmarking tools from GIAB and GA4GH
PPTX
171017 giab for giab grc workshop
PPTX
171114 best practices for benchmarking variant calls justin
PPTX
161115 precision fda giab
PPTX
Giab jan2016 analysis team breakout SNP indel update zook
PPTX
Giab jan2016 intro and update 160128
PDF
2017 agbt giab_poster
PPTX
Giab jan2016 analysis team breakout summary
PPTX
170120 giab stanford genetics seminar
PPTX
Jan2016 bina giab
PDF
Giab agbt small_var_2020
PPTX
GIAB for AMP GeT-RM Forum
PPTX
Aug2015 Giab nist integration methods
PPTX
Genome in a Bottle
Usual Questions with Unusual Answers: Application of Multi-class Supervised A...
Aug2015 Ali Bashir and Jason Chin Pac bio giab_assembly_summary_ali3
ASHG 2015 Genome in a bottle
GIAB-GRC workshop oct2015 giab introduction 151005
Aug2013 illumina platinum genomes
GIAB Technical Germline Benchmark roadmap discussion
Ga4gh 2019 - Assuring data quality with benchmarking tools from GIAB and GA4GH
171017 giab for giab grc workshop
171114 best practices for benchmarking variant calls justin
161115 precision fda giab
Giab jan2016 analysis team breakout SNP indel update zook
Giab jan2016 intro and update 160128
2017 agbt giab_poster
Giab jan2016 analysis team breakout summary
170120 giab stanford genetics seminar
Jan2016 bina giab
Giab agbt small_var_2020
GIAB for AMP GeT-RM Forum
Aug2015 Giab nist integration methods
Genome in a Bottle
Ad

Similar to Bio Scope (20)

PPTX
2013 nas-ehs-data-integration-dc
PDF
BITS: Basics of sequence databases
PDF
A Review of Various Methods Used in the Analysis of Functional Gene Expressio...
PPTX
Bioinformatic tool for Annotation of gene
PPTX
Informal presentation on bioinformatics
PPTX
DNA Sequence Data in Big Data Perspective
PDF
call for papers, research paper publishing, where to publish research paper, ...
PPT
Bioinformatics MiRON
PPTX
Cool Informatics Tools and Services for Biomedical Research
PPTX
Allelic Imbalance for Pre-capture Whole Exome Sequencing
PPTX
Bioinformatics_1_ChenS.pptx
PPTX
BTC 506 Gene Identification using Bioinformatic Tools-230302130331.pptx
PPTX
Bioinformatic, and tools by kk sahu
PPT
SooryaKiran Bioinformatics
PDF
Design and development of learning model for compression and processing of d...
PDF
Comparative analysis of dynamic programming
PDF
Comparative analysis of dynamic programming algorithms to find similarity in ...
PPTX
Talk at Bioinformatics Open Source Conference, 2012
PPTX
CT Brown - Doing next-gen sequencing analysis in the cloud
PPTX
2012 hpcuserforum talk
2013 nas-ehs-data-integration-dc
BITS: Basics of sequence databases
A Review of Various Methods Used in the Analysis of Functional Gene Expressio...
Bioinformatic tool for Annotation of gene
Informal presentation on bioinformatics
DNA Sequence Data in Big Data Perspective
call for papers, research paper publishing, where to publish research paper, ...
Bioinformatics MiRON
Cool Informatics Tools and Services for Biomedical Research
Allelic Imbalance for Pre-capture Whole Exome Sequencing
Bioinformatics_1_ChenS.pptx
BTC 506 Gene Identification using Bioinformatic Tools-230302130331.pptx
Bioinformatic, and tools by kk sahu
SooryaKiran Bioinformatics
Design and development of learning model for compression and processing of d...
Comparative analysis of dynamic programming
Comparative analysis of dynamic programming algorithms to find similarity in ...
Talk at Bioinformatics Open Source Conference, 2012
CT Brown - Doing next-gen sequencing analysis in the cloud
2012 hpcuserforum talk
Ad

Recently uploaded (20)

PDF
Assigned Numbers - 2025 - Bluetooth® Document
PPTX
Chapter 5: Probability Theory and Statistics
PDF
Video forgery: An extensive analysis of inter-and intra-frame manipulation al...
PDF
CloudStack 4.21: First Look Webinar slides
PDF
Hybrid model detection and classification of lung cancer
PPTX
O2C Customer Invoices to Receipt V15A.pptx
PDF
Getting Started with Data Integration: FME Form 101
PDF
Unlock new opportunities with location data.pdf
PDF
A review of recent deep learning applications in wood surface defect identifi...
PDF
Five Habits of High-Impact Board Members
PDF
Hybrid horned lizard optimization algorithm-aquila optimizer for DC motor
PDF
Developing a website for English-speaking practice to English as a foreign la...
DOCX
search engine optimization ppt fir known well about this
PDF
A novel scalable deep ensemble learning framework for big data classification...
PDF
Hindi spoken digit analysis for native and non-native speakers
PDF
Zenith AI: Advanced Artificial Intelligence
PDF
Transform Your ITIL® 4 & ITSM Strategy with AI in 2025.pdf
PPTX
Final SEM Unit 1 for mit wpu at pune .pptx
PDF
From MVP to Full-Scale Product A Startup’s Software Journey.pdf
PDF
WOOl fibre morphology and structure.pdf for textiles
Assigned Numbers - 2025 - Bluetooth® Document
Chapter 5: Probability Theory and Statistics
Video forgery: An extensive analysis of inter-and intra-frame manipulation al...
CloudStack 4.21: First Look Webinar slides
Hybrid model detection and classification of lung cancer
O2C Customer Invoices to Receipt V15A.pptx
Getting Started with Data Integration: FME Form 101
Unlock new opportunities with location data.pdf
A review of recent deep learning applications in wood surface defect identifi...
Five Habits of High-Impact Board Members
Hybrid horned lizard optimization algorithm-aquila optimizer for DC motor
Developing a website for English-speaking practice to English as a foreign la...
search engine optimization ppt fir known well about this
A novel scalable deep ensemble learning framework for big data classification...
Hindi spoken digit analysis for native and non-native speakers
Zenith AI: Advanced Artificial Intelligence
Transform Your ITIL® 4 & ITSM Strategy with AI in 2025.pdf
Final SEM Unit 1 for mit wpu at pune .pptx
From MVP to Full-Scale Product A Startup’s Software Journey.pdf
WOOl fibre morphology and structure.pdf for textiles

Bio Scope

  • 1. BioScope Advanced Search Grammar Tool for identification of Functional Noncoding Elements Principal Investigator - Hariharane Ramasamy Sanjeev Mishra Tulasi Ravuri Summary The completion of several genomic sequences has provided the motivation for development of a tool that can aid in locating and analyzing transcription factor binding sites (TFBS) responsible for regulating the gene transcritption. TFBS are short sequences 4-20 in length, and often located near the genes they regulate. These sequences occur in groups or modules also called enhancer or cisRegulatory modules (CRM). CRM contain one or more TFBS and interact with a specific combination of transcription factors to regulate gene expression. Such sequences are often abundant near the genes they regulate. The goal of developmental biologists is to understand how these CRM are organized in a genome, and regulate the gene. Laboratory methods, that are performed to locate CRM, are often laborious and time consuming. Hence computational methods have become an invaluable tool. The success of computational methods depends on how well they can be utilized in a lab environment. Several computational tools exist to locate motifs in a genomic sequence. These tools fall under two categories. The first category tools employ statistical and probabilistic methods using known motifs and the frequencies of codons in a genomics sequence. Although some motifs have been discovered using these tools, often they yield more false positives. Tools in the second category employ fundamental principles of the combinatorial logic underlying the occurrence of the enhancers / cisRegulatory modules (CRM). It is believed that genes with similar temporal and spatial expression patterns are controlled by similar CRM. 
The experimental biologists who are knowledgeable about CRM occurrences need an efficient tool to locate them by applying the combinatorial knowledge such as counts of the binding site occurrences within a specified width, logical combination of one of more binding sites, orientation and more. The tools should be efficient, scalable, and fast. The aim of this proposal is to build such tools. 1 Introduction Several genomes including the human and the mouse genomes have been sequenced close to completion. In this post-genomic era, it is imperative that researchers are equipped with novel methodologies that will facilitate them to rapidly and
  • 2. accurately identify, annotate and functionally characterize genes. Thus, mining of genomics and proteomics data using computational approaches seems to be the superior way to extract information from these resources in a short time frame. The transcriptional regulation of a gene depends on the concerted action of multiple transcription factors that bind to cis- regulatory modules located in the vicinity of the gene. Cis- regulatory modules are regulatory elements that occur close to each other and control the spatial and temporal expression of genes. The regulatory language that the genome uses to dictate transcriptional dynamics can be revealed by identifying these cis-regulatory elements. Often these elements are transferred evolutionarily across organisms with little mutations but without losing their functional value. Knowledge of these motifs may help drive discovery of similar genes in other closely related organisms. The availability of accurate models along with useful search methods with enhanced sensitivity and specificity will be the first step in being able to detect putative regulatory elements in a genome-wide manner. 2 Background The identification of regulatory sequences and their location in a genome is an important step in understanding the gene expression. Genes that have similar expression are believed to have similar regulatory logic. Such genes are governed by unique combinatorial transcriptional codes known as cis-acting regulatory modules (CRMs) or enhancers. CRMs are oligonucleotide sequences that act together to activate or suppress the gene. In the past, several studies have been performed in understanding the behavior of enhancers and their role in developmental biology. The experiments, performed to study the expression of the gene in a developmental stage, are often time consuming, and laborious. 
Computational tools are often sought by biologists to scan the whole genome for better candidate selection of these regulatory regions. Several computational methods exist to predict the regulatory motif sequences. The motifs are overly represented near the gene they transcribe. Using the earlier knowledge and position based probabilities, several tools were built to predict new regulatory motifs. CisAnalyst, developed by Berman et. al., has been successfully applied for fruitfly to find new clusters using a purely computational approach. Bioprospector uses Gibb sampler to predict regulatory sequences. The main problem with these tools are the presence of background noise and the inability to differentiate between a true regulatory motif
  • 3. versus a false positive. Besides, the variations in genomic sequence across species further increases the noise. Although computational methods have served well for purposes of finding genes and even individual exons in genomic data, regulatory element predictions have proven difficult. Markstein [1] developed a tool for biologists to search using the previous knowledge of enhancers. The tool allows the biologists to input desired regular expressions using {A,T,G,C}, gene name, width, and proximity constraints. However, the tool is genome-specific and does not contain some important constraints like distance to the next binding site, orientation and order of the motifs, low affinity sequences, variable length regular expression, and user-defined overlap constraints. A brief survey for computational identification of regulatory DNA is described in Dmitri Papatsenko and Michael Levine. The paper elucidates the need for computational tools providing a comparison of available tools without going into the specific details of the algorithms. The article however emphasizes the need for a fast and efficient computational tools. 3 Project Proposal The project aims to provide the following : 1.restrictive search capabilities like distance to the next motif, orientation of the motif, low affinity motif, order of motif occurrence [5], 2.limited integrated information like nearby genes/exons, gene expression data, annotation details around the target once it is located [5], 3.interactive chain search where a search for a target on an organism can be linked to intra species or cross species search. 4.Scalable, and efficient More importantly, our proposed module will be highly flexible, allowing constant integration of newer genomes and at the same time being a powerful tool that will allow the researcher to search for complex gene clusters. 
To that end we developed a software program that will more precisely locate the regulatory region with far more ease for the researcher than programs that are currently available. The control, more importantly, of the result of the program will be given to developmental biologist. The tool is very ideal for a lab environment.
  • 4. 3.1 Phase I Specific Aims 1.To develop a web-based module that allows the researcher to search for cisregulatory elements. The tool will input motif and search constraints as mentioned in figure 1 and will display results as shown in figure 2 and 3. The search feature of the program will provide ◦ability to enter 10 regular expressions using A,T,G,C and letters given in the table below. ◦an option to allow self overlap ◦capacity to input a name for the motif ◦a box to specify width constraint ◦flexibility to input logical combination of motifs typed in (1) such as (2A and 2B), (A or B or C) ◦ability to disallow overlap across motifs type in first item. ◦To type name of the gene within a specified distance once a cluster is found using the above rules ◦a name to save the results. The name will/can be used in SuperCluster Letter Codon B C,G,T D A,G,T H A,C,T K G,T M A,C N A,C,G,T R A,G S C,G V A,C,G W A,T Y C, 4 Summary: Significance of proposed work The tool will also provide integration and maintenance that include 1. Update to new versions of genomics sequences when they are available from the public site. 2. Rerun the program on old results and inform automatically via email on new results. 3. Integrate with Gene Ontology information and other useful databases as advised by biologists.
  • 5. 4. Provide a workflow-like tool that takes a query run on one organism and applies it to another organism with a single keystroke.
5. Store and maintain results.

5 Commercialization Strategy
After the Phase I launch, every visitor to the site will be asked to fill in a profile, along with the purpose of the visit, before gaining access to the program. Visitors will also be asked for feedback, which will be collected and used as leads in preparing the BioRegulatory Appliance in Phase II.

6 KEY PERSONNEL
1) Hariharane Ramasamy is pursuing his PhD in Computer Science at Illinois Institute of Technology, IL, and has more than 15 years of experience in developing applied computational tools for biomedical engineering. A few relevant tools include
• a motif search system for genomic sequences that displays the results graphically on the screen along with the sequence annotation,
• a surveillance system to detect novel sequences,
• a program that calculates the digest of peptides for user-input proteins and also performs differential combination of post-translational modifications along with pI/Mw calculations,
• pattern-induced multiple alignment using properties of amino acids,
• a new extended genetic algorithm for 3D lattice simulation of protein folding using conflicting criteria,
• a simulation of human stand-sit movement using a 3-link stick figure model.

Sanjeev Mishra
Sanjeev Mishra is a seasoned professional with about 20 years of industry experience. He spent half of his career in startups in the fields of business activity management, business intelligence, and mobile application and management platforms, and the other half in research and development. He has been awarded one US patent. Sanjeev is passionate about biking, hiking, running, meditation and gardening. He holds a master's degree in Physics from DBS College Dehradun, India.

Tulasi Ravuri
Tulasi Ravuri is an experienced software engineering manager with 23 years of experience at several Silicon Valley companies such as Unisys, Novell, McAfee, DoCoMo Labs and others. Through
  • 6. his broad career he has helped bring several products to market. His most recent work is on a Life Sciences Regulatory Compliance and Administration software suite used by universities such as Stanford, Berkeley, and Harvard; pharma companies such as GSK; hospitals such as Palo Alto Medical Foundation; and government. He advises several software companies and is an advocate of open source software. He has an MSCS from University of Louisiana and a BS (Chemical Engg.) from Andhra University, India.

7 Consultants
In Phase I, the following help will be used to guide the program to Phase II:
1. Two student interns to refine the search and gather data on the abilities of the program.
2. A consultant to design the user interface and graphics display.

8 Prior Support
The proposal has no prior or current support.

References cited
[1] Marc S. Halfon, Yonatan Grad, George M. Church, Alan M. Michelson, Computation-Based Discovery of Related Transcriptional Regulatory Modules and Motifs Using an Experimentally Validated Combinatorial Model. Howard Hughes Medical Institute and Department of Medicine, Brigham and Women's Hospital, Linköping University, Sweden.
[2] Dimitri Papatsenko, Michael Levine, Computational identification of regulatory DNAs underlying animal development. Nature Methods, Vol. 2 No. 7:529-534, 2005.
[3] Markstein, M., Markstein, P., Markstein, V., Levine, M.S., Genome-wide analysis of clustered Dorsal binding sites identifies putative target genes in the Drosophila embryo. Proc. Natl Acad. Sci. USA, Vol. 99:763-768, 2002.
[4] Benjamin P. Berman, Barret D. Pfeiffer, Todd R. Laverty, Steven L. Salzberg, Gerald M. Rubin, Michael B. Eisen and Susan E. Celniker, Computational identification of developmental enhancers: conservation and function of transcription factor binding-site clusters in Drosophila melanogaster and Drosophila pseudoobscura. Genome Biology, Vol. 5:R81, 2004.
[5] Alan M. Michelson, Deciphering genetic regulatory codes: A challenge for functional genomics. PNAS, Vol. 99 No. 2, 546-548, 2002.
[6] Matthias Harbers, Piero Carninci, Tag-based approaches for transcriptome research and genome annotation. Nature Methods, Vol. 2 No. 7, 499-502, 2005.
  • 7. [7] Yueyi Liu, Liping Wei, Serafim Batzoglou, Douglas L. Brutlag, Jun S. Liu and X. Shirley Liu, A suite of web-based programs to search for transcriptional regulatory motifs. Nucleic Acids Research, Vol. 32 Web Server Issue, 2004.
[8] Mike P. Liang, Olga G. Troyanskaya, Alain Laederach, Douglas L. Brutlag, and Russ B. Altman, Computational Functional Genomics. IEEE Signal Processing Magazine, 2004.

Budget Description
Expense                               Amount for 6 months
Salary for Principal Investigator     $36,000
Salary for Software engineer          $30,000
Salary for 2 student interns          $24,000
Salary for Biology consultant         $24,000
Hardware and Software cost (4)        $24,000
Internet & Cloud hosting services     $12,000
Miscellaneous expenses                $6,000
Office rent & expenses                $15,000
Travel                                $5,000
Total Cost                            $176,000
  • 8. Figure 1: Input web form to search the genomic sequence using user defined constraints
  • 9. Figure 2: Results summary
Figure 3: Detailed results display for
  • 10. Figure 4: Flow chart describing the flow of the algorithm
  • 11. Figure 5: Diagram describing the Phase I flow
  • 12. Appendix
The ultimate goal is to build a self-contained BioRegulatory Appliance that supports automatic updates of the genomic sequences, reruns old queries on the new sequences, and informs users of new results, thereby saving an enormous amount of time for developmental biologists who depend on computers to locate their targets.

Phase II Plan
Specific Aims - To enhance the available module, Biocis, so that it is user friendly and easy for a researcher to navigate. Phase II will also aim to create a workflow module that allows easy storage and retrieval of data from disparate sources and integrates it with other useful information. The Phase II features will include
1. An advanced regular expression search tool for genomic sequences that uses prebuilt index positions for all sequences of 4 bases (AAAA, AAAG, ..., GCGC, ..., TTTT) to locate the motifs.
2. An advanced multithreaded server tool to perform fast parallel search of the motif sequences.
3. Advanced caching in memory, on disk, and in a database to avoid repeating previous searches.
4. An automated daemon process to fetch new releases, rerun the saved searches, and inform scientists of new results by email.
5. A link to the Gene Ontology database, which provides gene function information.
6. Cross-species ortholog results from existing public annotated databases.
7. Simple statistical tools to look at the motif occurrences across the whole genome, starting from the interesting results.
8. Creation of the BioRegulatory software package and a plan for designing a spec for the BioRegulatory Appliance.
9. A SuperCluster tool that performs a search similar to that in Aim I.
10. The inputs A-J are the names of searches performed in Aim I. The tool will help support the theory that clusters of enhancers act together in regulating a gene. A sample input form is shown in Figure 6.

3.1.2 Phase III
  • 13. • Creating a sound computing infrastructure. The infrastructure requires writing a separate server to provide the search and caching capabilities. The search module will not run inside a web server, as some existing tools do, because every search request to a web server means that the whole genome sequence is read into memory. Genomic sequences vary from 1 megabyte to 200 megabytes in length. As the number of users grows, the system will run out of memory, which imposes a limit on the number of users. Preloading the data in the web server at startup is not advisable either. Hence a separate server that performs the search for any generic genome sequence is needed. Caching in Phase I is achieved at two levels: memory and disk.
• Phase III will concentrate on adding more features to the query and creating continuity in search. For example, once a search is performed, the result will display genes along with their orthologs in other species. The same enhancer search can then be performed immediately on the species that has the closest orthologs. Phase III will also look at improving the performance of the BioRegulatory Appliance.
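The prebuilt index over 4-base sequences listed among the Phase II features might work along the following lines. This is only a sketch under stated assumptions: the names `build_index` and `search` are hypothetical, the letter codes are assumed to be the standard IUPAC ambiguity letters, and the first four letters of the motif are used as the seed that is looked up in the index before the full motif is verified at each candidate position.

```python
import re
from collections import defaultdict
from itertools import product

K = 4  # seed length, matching the 4-base index named in the plan

# Assumption: standard IUPAC ambiguity letters and the bases they stand for.
AMBIGUITY = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "B": "CGT", "D": "AGT", "H": "ACT", "K": "GT",
    "M": "AC", "N": "ACGT", "R": "AG", "S": "CG",
    "V": "ACG", "W": "AT", "Y": "CT",
}


def build_index(sequence, k=K):
    """Map every k-base substring to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


def search(sequence, index, motif, k=K):
    """Locate a motif by looking up its first k letters in the index."""
    if len(motif) < k:
        raise ValueError("motif must be at least k letters long")
    pattern = re.compile("".join("[" + AMBIGUITY[ch] + "]" for ch in motif))
    # Expand ambiguous letters in the seed into every concrete k-mer.
    seeds = ("".join(bases) for bases in product(*(AMBIGUITY[ch] for ch in motif[:k])))
    hits = []
    for seed in seeds:
        for pos in index.get(seed, ()):
            # Verify the whole motif only at positions the index proposes.
            if pattern.match(sequence, pos):
                hits.append(pos)
    return sorted(hits)
```

The index is built once per genome, so each query touches only the positions where the seed occurs instead of scanning the whole sequence. A highly ambiguous seed such as NNNN expands to all 256 entries, in which case a less ambiguous stretch of the motif would be a better seed.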
  • 14. Figure 6: SuperCluster - Web form for user input
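The two-level caching (memory, then disk) mentioned in the appendix for the search server can be sketched as follows. The class `TwoLevelCache`, the helper `cached_search`, the JSON file layout, and the `genome|motif` cache key are illustrative assumptions and not part of the proposal.

```python
import hashlib
import json
import os


class TwoLevelCache:
    """Look up saved search results in memory first, then on disk."""

    def __init__(self, directory):
        self.directory = directory
        self.memory = {}
        os.makedirs(directory, exist_ok=True)

    def _path(self, key):
        # Hash the key so that any query string yields a safe file name.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return os.path.join(self.directory, digest + ".json")

    def get(self, key):
        if key in self.memory:
            return self.memory[key]
        path = self._path(key)
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as handle:
                value = json.load(handle)
            self.memory[key] = value  # promote the entry to the memory level
            return value
        return None

    def put(self, key, value):
        self.memory[key] = value
        with open(self._path(key), "w", encoding="utf-8") as handle:
            json.dump(value, handle)


def cached_search(cache, genome_name, motif, run_search):
    """Run `run_search` only when neither cache level holds the result."""
    key = genome_name + "|" + motif
    result = cache.get(key)
    if result is None:
        result = run_search(genome_name, motif)
        cache.put(key, result)
    return result
```

Because the disk level outlives the process, a restarted server can answer repeated queries without searching the genome again. Results must be JSON-serializable (for example, lists of positions), and an entry should be discarded when a new genome release replaces the sequence it was computed from.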