Data mining in Bioinformatics:
Data Mining is the process of automatic discovery of novel and understandable models and
patterns from large amounts of data involving methods at the intersection of machine learning,
statistics and database systems. Bioinformatics is the science of storing, analyzing, and utilizing
information from biological data such as sequences, molecules, gene expressions, and
pathways. Development of novel data mining methods will play a fundamental role in
understanding these rapidly expanding sources of biological data.
Data mining is an interdisciplinary subfield of computer science and statistics with an overall
goal to extract information from a large set of data and transform the information into a
comprehensible structure for further use. Data mining is the analysis step of the "knowledge
discovery in databases" process or KDD (Fig.1). Aside from the raw analysis step, it also
involves database and data management aspects, data pre-processing, model and inference
considerations, interestingness metrics, complexity considerations, post-processing of
discovered structures, visualization, and online updating.
Fig. 1: The process of KDD and the steps involved.
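The KDD steps named above can be sketched as a chain of small functions. This is only a toy
illustration, not a real mining method: the function names, the choice of 3-mers as features, and
the thresholds are all assumptions made for the example.

```python
from collections import Counter

def select(records, min_length=8):
    """Selection: keep only records long enough to analyse (threshold is arbitrary)."""
    return [r for r in records if len(r) >= min_length]

def preprocess(records):
    """Pre-processing: normalise case and drop records with ambiguous bases."""
    cleaned = [r.upper() for r in records]
    return [r for r in cleaned if set(r) <= set("ACGT")]

def transform(records, k=3):
    """Transformation: represent each sequence by its k-mer counts."""
    return [Counter(r[i:i + k] for i in range(len(r) - k + 1)) for r in records]

def mine(profiles, min_support=2):
    """Mining: keep k-mers that occur in at least min_support sequences."""
    support = Counter()
    for profile in profiles:
        support.update(profile.keys())
    return {kmer: n for kmer, n in support.items() if n >= min_support}

def evaluate(patterns, top=3):
    """Evaluation: rank patterns by support, then alphabetically."""
    return sorted(patterns.items(), key=lambda kv: (-kv[1], kv[0]))[:top]

def kdd_pipeline(records):
    """Run the hypothetical steps in order on a list of DNA strings."""
    return evaluate(mine(transform(preprocess(select(records)))))
```

Calling kdd_pipeline on a handful of short DNA strings returns the most widely shared 3-mers;
each step can be swapped for a more realistic one without changing the overall flow.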
Data mining approaches seem ideally suited to bioinformatics, where enormous volumes of data
are deposited continuously. The extensive databases of biological information create both
challenges and opportunities for developing novel data mining methods. The workshop on Data
Mining in Bioinformatics (BIOKDD) has been held every year since 2001, with the goal of
encouraging KDD researchers worldwide to take on the numerous challenges that
bioinformatics offers.
The difference between data analysis and data mining is that data analysis is used to test models
and hypotheses on a dataset, e.g., analyzing the effectiveness of a marketing campaign,
regardless of the amount of data; in contrast, data mining uses machine learning and statistical
models to uncover hidden patterns in a large volume of data.
BOTMT:604
Bioinformatics and Biophysics
Prepared By-
Dr. Sangeeta Das.
Assistant Professor, Department of Botany, Bahona College, Jorhat, Assam, India.
Data mining Tools in Bioinformatics:
Various tools for data mining are used in bioinformatics. The following are the tools for
nucleotide sequence analysis:
1. BLAST:
The Basic Local Alignment Search Tool (BLAST) compares gene and protein sequences
against others in public databases. It now comes in several types, including PSI-BLAST,
PHI-BLAST, and BLAST 2 Sequences. Specialized BLASTs are also available for human,
microbial, malaria, and other genomes, as well as for vector contamination, immunoglobulins,
and tentative human consensus sequences.
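To make the idea of "local alignment" concrete, the sketch below computes a Smith-Waterman local
alignment score for two short sequences. This is not BLAST's algorithm (BLAST uses a faster
heuristic search); it only illustrates the kind of score a local alignment optimises. The scoring
values (match 2, mismatch -1, gap -2) are arbitrary assumptions.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between strings a and b."""
    prev = [0] * (len(b) + 1)          # scores for the previous row of the matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Two sequences that share only a short stretch still score well on that stretch, which is why
local alignment suits database searches where only part of a query matches.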
2. Electronic PCR:
This tool searches a target DNA sequence for sequence tagged sites (STSs) that have
been used as landmarks in various types of genomic maps. It compares the query sequence
against data in NCBI’s UniSTS, a unified, non-redundant view of STSs from a wide range of
sources.
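The principle behind an STS search can be sketched as follows: an STS is defined by a primer pair
and an expected product size, so a hit is a place where the forward primer and the reverse
complement of the reverse primer both occur at the right distance. This is a simplified sketch
with exact matching only; the function names and the tolerance parameter are assumptions for
illustration and do not reflect how Electronic PCR is actually implemented.

```python
def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_sts(template, forward, reverse, expected_size, tolerance=0):
    """Return (start, end) pairs of products whose size is within tolerance."""
    template = template.upper()
    forward = forward.upper()
    rc_reverse = reverse_complement(reverse.upper())
    hits = []
    start = template.find(forward)
    while start != -1:
        # Look for the reverse primer site downstream of the forward primer.
        site = template.find(rc_reverse, start + len(forward))
        while site != -1:
            end = site + len(rc_reverse)
            if abs((end - start) - expected_size) <= tolerance:
                hits.append((start, end))
            site = template.find(rc_reverse, site + 1)
        start = template.find(forward, start + 1)
    return hits
```

A call such as find_sts(template, "ACGTAC", "TCTGAA", 18) reports where an 18-base product
bounded by those two primers would be amplified from the template.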
3. Entrez:
Entrez, the Global Query Cross-Database Search System, is a federated search engine, or
web portal, that allows users to search many discrete health sciences databases at the National
Center for Biotechnology Information (NCBI) website. The name "Entrez" (meaning "Come
in" in French) was chosen to reflect the spirit of welcoming the public to search the content
available from the National Library of Medicine (NLM).
Entrez Global Query is an integrated search and retrieval system that provides access to all
databases simultaneously with a single query string and user interface. Entrez can efficiently
retrieve related sequences, structures, and references. The Entrez system can provide views of
gene and protein sequences and chromosome maps. Some textbooks are also available online
through the Entrez system. Entrez searches databases such as PubMed, PubMed Central,
Site Search, Books, Online Mendelian Inheritance in Man (OMIM), the Nucleotide
sequence database (GenBank), the Protein sequence database, Genome Project, UniGene, NLM
Catalog, etc.
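Entrez can also be queried from a program through NCBI's E-utilities, a web interface that these
notes do not otherwise cover. The sketch below only builds an ESearch request URL and does not
send it; the helper name build_esearch_url and the example search term are assumptions made for
illustration.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def build_esearch_url(database, term, retmax=20):
    """Build an ESearch URL for one Entrez database (hypothetical helper)."""
    params = {
        "db": database,      # Entrez database name, e.g. "pubmed" or "nucleotide"
        "term": term,        # the query string, as typed in the Entrez search box
        "retmax": retmax,    # maximum number of record IDs to return
        "retmode": "json",   # ask for a JSON response
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"
```

The resulting URL can be opened in a browser or fetched with any HTTP client; NCBI asks users to
follow its usage guidelines when sending many requests.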
Each Entrez Gene record encapsulates a wide range of information for a given gene and
organism. When possible, the information includes results of analyses that have been done on
the sequence data. The amount and type of information presented depend on what is available
for a particular gene and organism, and may include:
(1) a graphic summary of the genomic context, intron/exon structure, and flanking genes
(2) a link to a graphic view of the mRNA sequence, which in turn shows biological features such
as CDS, SNPs, etc.
(3) links to gene ontology and phenotypic information
(4) links to corresponding protein sequence data and conserved domains
(5) links to related resources, such as mutation databases
Entrez Gene is the successor to LocusLink.
4. Model Maker:
It allows users to view the evidence (mRNAs, ESTs, and gene predictions) that was aligned to
assembled genomic sequence to build a gene model, and to edit the model by selecting or
removing putative exons. Model Maker is accessible from sequence maps that were analyzed
at NCBI and displayed in Map Viewer.
5. ORF (Open Reading Frame) Finder:
ORF Finder identifies all possible ORFs in a DNA sequence by locating the standard and
alternative start and stop codons. The deduced amino acid sequences can then be used to
BLAST against GenBank. ORF Finder is also packaged in the sequence submission software
Sequin.
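The search for ORFs can be sketched as a scan of all six reading frames for a start codon
followed in frame by a stop codon. This is a simplified sketch, not ORF Finder's implementation:
it assumes the standard stop codons and ATG as the only start codon (alternative start codons are
ignored), and the min_codons threshold is an arbitrary choice.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(dna, min_codons=3):
    """Return (strand, frame, start, end, orf) tuples.

    Coordinates refer to the strand that was scanned; min_codons counts
    the start and stop codons as part of the ORF.
    """
    orfs = []
    forward = dna.upper()
    for strand, seq in (("+", forward), ("-", reverse_complement(forward))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first in-frame start codon
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, seq[start:i + 3]))
                    start = None                   # look for the next ORF
    return orfs
```

Each reported ORF could then be translated and used as a protein query, as described above.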
6. SAGEmap:
It is a tool for performing statistical tests designed specifically for differential-type analyses of
SAGE (Serial Analysis of Gene Expression) data. The data include SAGE libraries generated
by individual labs as well as those generated by the Cancer Genome Anatomy Project (CGAP),
which have been submitted to Gene Expression Omnibus (GEO). Gene expression profiles that
compare the expression in different SAGE libraries are also available on the Entrez GEO
Profiles pages. It is possible to enter a query sequence in the SAGEmap resource to determine
which SAGE tags are in the sequence, then map to the associated SAGE tag records and view the
expression of those tags in different CGAP SAGE libraries.
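Finding candidate SAGE tags in a query sequence can be sketched as follows. The sketch assumes
the common SAGE protocol in which the anchoring enzyme NlaIII cuts at CATG and each tag is the
10 bases immediately downstream; these values, and the function name, are assumptions that the
notes themselves do not state. It lists every candidate tag, whereas an experiment normally
yields the tag next to the 3'-most site of a transcript.

```python
def sage_tags(sequence, anchor="CATG", tag_length=10):
    """Return (site_position, tag) for every anchoring site with a full-length tag."""
    sequence = sequence.upper()
    tags = []
    pos = sequence.find(anchor)
    while pos != -1:
        tag_start = pos + len(anchor)
        tag = sequence[tag_start:tag_start + tag_length]
        if len(tag) == tag_length:       # skip sites too close to the 3' end
            tags.append((pos, tag))
        pos = sequence.find(anchor, pos + 1)
    return tags
```

The tags returned this way are the strings one would look up among tag records to see in which
libraries they were observed.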
7. Spidey:
It aligns one or more mRNA sequences to a single genomic sequence. Spidey will try to
determine the exon/intron structure, returning one or more models of the genomic structure,
including the genomic/mRNA alignments for each exon.
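The task of placing an mRNA on a genomic sequence can be illustrated with a toy exon mapper. It
is not Spidey's method: it assumes the mRNA matches the genomic sequence exactly, and it greedily
takes the longest matching stretch each time. The function name and the min_exon parameter are
assumptions for the example.

```python
def map_exons(mrna, genomic, min_exon=4):
    """Return genomic (start, end) intervals covering the mRNA, or None if mapping fails."""
    mrna, genomic = mrna.upper(), genomic.upper()
    exons = []
    m_pos, g_pos = 0, 0
    while m_pos < len(mrna):
        best_len, best_at = 0, -1
        length = min_exon
        # Extend the candidate exon while it still occurs downstream of the last exon.
        while m_pos + length <= len(mrna):
            at = genomic.find(mrna[m_pos:m_pos + length], g_pos)
            if at == -1:
                break
            best_len, best_at = length, at
            length += 1
        if best_len == 0:
            return None                  # no exact match of at least min_exon bases
        exons.append((best_at, best_at + best_len))
        m_pos += best_len
        g_pos = best_at + best_len
    return exons
```

The gaps between consecutive intervals correspond to introns in this simplified model.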
8. VecScreen:
It is a tool for identifying segments of a nucleic acid sequence that may be of vector, linker, or
adapter origin prior to sequence analysis or submission. VecScreen was developed to combat
the problem of vector contamination in public sequence databases.
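The idea of screening for vector contamination can be sketched with a simple exact k-mer
comparison: any stretch of the query built from k-mers that also occur in a known vector sequence
is flagged. This is only an illustration of the principle and not how VecScreen works internally;
the word size k=8 and the example vector fragment used in testing are assumptions.

```python
def vector_segments(query, vector, k=8):
    """Return (start, end) intervals of the query covered by k-mers found in the vector."""
    query, vector = query.upper(), vector.upper()
    vector_kmers = {vector[i:i + k] for i in range(len(vector) - k + 1)}
    flagged = [False] * len(query)
    for i in range(len(query) - k + 1):
        if query[i:i + k] in vector_kmers:
            for j in range(i, i + k):
                flagged[j] = True
    # Merge runs of flagged positions into half-open intervals.
    segments = []
    start = None
    for i, is_vector in enumerate(flagged + [False]):
        if is_vector and start is None:
            start = i
        elif not is_vector and start is not None:
            segments.append((start, i))
            start = None
    return segments
```

Flagged intervals would be trimmed from the sequence before analysis or submission.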
