Genomic Big Data
Management, Integration, and Mining
E. Weitschek (1,2)
(1) Department of Engineering, Uninettuno International University, Italy
(2) Institute of Systems Analysis and Computer Science, National Research Council, Italy
Joint work with P. Bertolazzi, G. Felici, F. Cumbo, G. Fiscon, E. Cappelli
2
Outline
• Growth of biological data
• Next generation sequencing
• Biological data sources
• Biological data management
• Biological data integration
• Big data bioinformatics
• Knowledge extraction
• Supervised Learning
• Biomedical applications
• Conclusions and future directions
3
Growth of biological data
• Advances in molecular biology, supported by computer science, have led to an
exponential growth of biological data
‒ it originated with the DNA sequencing method invented by Sanger in the late seventies
‒ the late nineties brought significant advances in sequence generation, e.g. the Human
Genome Project
‒ currently, the amount of genomic sequence data doubles about every 18 months
‒ GenBank: collection of all publicly available nucleotide sequences (160 M sequences)
4
Growth of biological data
• Advances in molecular biology, supported by computer science, have led to an
exponential growth of biological data
‒ today, next generation high-throughput data are collected from modern parallel
sequencing machines, and huge amounts of biological data are available from
public and private sources
‒ 10000 Human Genomes project (3000 Mbp)
‒ nowadays: the $1000 genome
• Very large data sets, generated by many different biological experiments,
need to be processed and analyzed automatically with computer science methods
5
DNA Sequencing
• DNA (deoxyribonucleic acid) is the hereditary material in almost all organisms
• DNA sequencing is the process of determining the order of nucleotides within a DNA
molecule
• It includes any method or technology used to determine the order of
the four bases: adenine (A), cytosine (C), guanine (G), and thymine (T)
• It originated with the DNA sequencing method invented by Sanger in the late seventies
• The late nineties brought significant advances in sequence generation techniques,
largely driven by massive projects such as the Human Genome Project
• High cost and long duration, e.g., 5 billion $ and 13 years for the Human Genome Project
6
Next Generation Sequencing (NGS)
• Today: next generation high-throughput data from modern parallel sequencing
machines
‒ Roche 454, Illumina, Applied Biosystems SOLiD,
Helicos HeliScope, Complete Genomics,
Pacific Biosciences SMRT, Ion Torrent
‒ Next generation sequencing (NGS) machines output a
large number of short DNA sequences, called reads
(in FASTQ format)
‒ They cannot read an entire genome one nucleotide at a time from
beginning to end
‒ Instead, they shred the genome and generate short reads
‒ Low cost per base ($1000 for a whole human genome)
‒ High speed (24 h to sequence a whole human genome)
‒ Large number of reads
‒ Problems: data storage and analysis, high costs for IT infrastructure
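To make the notion of reads in FASTQ format concrete, here is a minimal sketch of a reader for the common four-line-per-record layout (header starting with `@`, sequence, `+` separator, quality string). The function name `read_fastq` and the two example reads are made up for illustration; real files may need a dedicated library and handling of compressed input.

```python
from io import StringIO

def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:                      # end of input
            return
        sequence = handle.readline().rstrip()
        handle.readline()                   # '+' separator line, ignored here
        quality = handle.readline().rstrip()
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed record: {header!r}")
        yield header[1:], sequence, quality

# Two hypothetical reads, for illustration only.
EXAMPLE = "@read1\nACGTAC\n+\nIIIIII\n@read2\nTTGCA\n+\nIIIII\n"

reads = list(read_fastq(StringIO(EXAMPLE)))
total_bases = sum(len(seq) for _, seq, _ in reads)
print(len(reads), "reads,", total_bases, "bases")
```

Because the reader is a generator, it processes one record at a time, which matters when a single run produces tens of gigabytes of reads.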
7
Next Generation Sequencing (NGS)
• Data dimension, time and cost of Next Generation Sequencing
Seq type | Data | Price ($) | Time
Human Genome | 90 GB | 1000 | 1 day
Human Gene Expression | 9 GB | 500 | 12 h
Plant Genome | 150 GB | 2000 | 5 days
Bacterial Genome | 1 GB | 300 | 6 h
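The per-run figures in this table can be scaled up to see why storage and infrastructure are listed as problems. The sketch below is a rough back-of-the-envelope calculation, assuming linear scaling, decimal units (1 TB = 1000 GB), and runs executed one after another on a single machine. The function name is hypothetical, and the project size of 10000 samples is borrowed from the "10000 Human Genomes project" bullet earlier in the deck.

```python
# Per-run figures from the slide: (data in GB, price in $, time in hours).
RUNS = {
    "Human Genome": (90, 1000, 24),
    "Human Gene Expression": (9, 500, 12),
    "Plant Genome": (150, 2000, 5 * 24),
    "Bacterial Genome": (1, 300, 6),
}

def project_totals(seq_type, n_samples):
    """Scale one run linearly to n_samples runs (an illustrative assumption)."""
    gb, price, hours = RUNS[seq_type]
    return {
        "storage_tb": gb * n_samples / 1000,    # decimal terabytes
        "cost_usd": price * n_samples,
        "machine_days": hours * n_samples / 24,
    }

totals = project_totals("Human Genome", 10_000)
print(totals)
```

Under these assumptions, 10000 human genomes need about 900 TB of raw storage, before any copies or derived files.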
8
Biological data sources
• Several heterogeneous sources of biomedical data are available
• Sequence Read Archive
• The Gene Expression Omnibus
• NCBI
• ELIXIR
• The Cancer Genome Atlas (TCGA)
9
Biological data management
10
Biological data integration
• Challenge for the research community
• Allow everyone to store, organize, access, and analyze the information
available on the web and/or on private repositories
• Integration of data: providing unified access to heterogeneous and
independent data sources as if they were a single source
• Many solutions from the I.T. and from the bioinformatics community, e.g.
− Heterogeneous Database Systems
− Distributed Database Systems
− SRS
− NCBI Entrez
− Federated databases (BioKleisli)
− Multi-databases (TAMBIS)
− Mediator-based (Bio-DataServer)
− Data warehousing (BioWarehouse)
• Integration of clinical and genomic data
11
Bioinformatics
• New methods are needed that can extract relevant information from biological data
sets
• Effective and efficient computer science methods are needed to support the analysis
of complex biological data sets
• Modern biology is frequently combined with computer science, leading to
Bioinformatics
• Bioinformatics is a discipline in which biology and computer science merge in
order to design and develop efficient methods for analyzing biological data, for
supporting in vivo, in vitro, and in silico experiments, and for automatically solving
complex life science problems
• Bioinformatician: a computer scientist and biology domain expert who can tackle
life science problems with computer-aided methods
12
Big Data Bioinformatics
• Attention to Big Data in bioinformatics is steadily increasing,
in proportion to the growth of the amount of biological data obtained
through sequencing
• Handling this amount of data, recorded at different stages of
a person's life and stored for dynamic analysis studies, requires
scalable systems suited to collection, management, and analysis
• Biological Big Data Bases
13
The Cancer Genome Atlas (TCGA)
• Comprehensive genomic characterization and analysis of more than 30 cancer
types
• National Cancer Institute (NCI), National Human Genome Research Institute
(NHGRI), and National Institutes of Health (NIH)
• Aim: improve the ability to diagnose, treat, and prevent cancer
• A freely available platform to search, download, and analyze data sets
• 33 tumor types with more than 10000 patients
• Public data distributed under the open access paradigm
• Genomic experiments
– Copy Number Variation (CNV)
– DNA-methylation
– DNA-sequencing (whole genome, whole exome, mutations)
– Gene expression data (RNA-Seq V1, V2)
– MicroRNA sequencing
– Meta data (Clinical and Biospecimen)
• Contains more than 15 TB of genomic and clinical data, whose analysis and
interpretation pose great challenges to the bioinformatics community
14
TCGA2BED
Data integration from external dbs
15
Genomic data integration
[Figure: data set DNA-Methylation; data set RNA-sequencing]
Typical problem in Bioinformatics:
• More than 1000 samples (patients), 450 000 features (genes, sites, clinical
variables, proteins, …)
• Aim: distinguish healthy vs. diseased samples
• Not addressable by a classic machine learning algorithm
• Big Data solutions
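One way to picture the integration of the two data sets is as a join on the sample identifier, with each feature tagged by its source experiment. The sketch below is an illustration only, not the authors' pipeline: the function name, sample IDs, and values are made up, while the `_dMet` and `_rnaSeq` suffixes follow the feature names that appear in the classification rules later in the deck.

```python
def integrate(methylation, rna_seq):
    """Join two per-sample feature tables on sample ID, tagging features by source."""
    shared = sorted(set(methylation) & set(rna_seq))   # keep samples present in both
    merged = {}
    for sample in shared:
        row = {f"{gene}_dMet": v for gene, v in methylation[sample].items()}
        row.update({f"{gene}_rnaSeq": v for gene, v in rna_seq[sample].items()})
        merged[sample] = row
    return merged

# Hypothetical values; only the gene names are taken from the slides.
methylation = {
    "patient1": {"SPRY2": 0.61, "PAMR1": 0.12},
    "patient2": {"SPRY2": 0.40, "PAMR1": 0.33},
    "patient3": {"SPRY2": 0.58, "PAMR1": 0.20},
}
rna_seq = {
    "patient1": {"FIGF": 201.5},
    "patient2": {"FIGF": 15.2},
    "patient4": {"FIGF": 90.0},
}

matrix = integrate(methylation, rna_seq)
print(matrix)
```

Keeping only the samples measured in both experiments is a design choice of this sketch; at the scale quoted on the slide (more than 1000 samples and 450 000 features) the same join would normally run on a distributed or out-of-core system.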
16
Classification and supervised machine learning
• Aims: distinguish diseased from healthy samples, and predict the class of new samples
• Input: a training set (reference library) containing samples with a priori
known class membership
• Model building: based on this training set, the software computes the
classification model
• The classification model can be applied to a test set (query set), which
contains samples that require classification:
− query samples with unknown class membership, or
− samples that also have a priori known class membership, allowing verification of the
classifications
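The training, model building, and verification steps can be sketched with a deliberately simple classifier. The example assumes a single numeric feature per sample and uses a nearest-centroid model as a stand-in; it is not the method used by the authors, and all values and labels are made up.

```python
from statistics import mean

def build_model(training_set):
    """Compute one centroid (mean feature value) per class from labelled samples."""
    by_class = {}
    for value, label in training_set:
        by_class.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_class.items()}

def classify(model, value):
    """Assign the class whose centroid is closest to the sample value."""
    return min(model, key=lambda label: abs(model[label] - value))

# Training set: samples with a priori known class membership (hypothetical).
training_set = [(0.2, "control"), (0.3, "control"), (0.4, "control"),
                (0.8, "case"), (0.9, "case"), (1.0, "case")]
# Test set with known labels, so the classifications can be verified.
test_set = [(0.25, "control"), (0.85, "case"), (0.55, "control")]

model = build_model(training_set)
predictions = [classify(model, value) for value, _ in test_set]
accuracy = sum(p == label for p, (_, label) in zip(predictions, test_set)) / len(test_set)
print(predictions, accuracy)
```

The test labels are never seen during model building; they are used only afterwards to verify the predictions.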
17
Rule-based classification
A rule-based classifier is a technique for classifying samples by using a collection of
“if … then” rules, named logic formulas:
– Antecedent → Consequent
– (Condition_1) or (Condition_2) or … or (Condition_n) → Class
– Condition_i: (A_1 op v_1) and (A_2 op v_2) and … and (A_m op v_m)
– A = attribute; v = value; op = operator {=, ≠, <, >, ≤, ≥}
• Example of a logic classification formula:
“IF Aph1b<0.507 then the experimental sample is CONTROL”
• The logic formulas are evaluated, and the samples assigned to the correct
class, according to:
– Percentage split or cross validation sampling
– Accuracy
– F-measure
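The rule structure and its evaluation can be sketched as follows. A rule is represented as a disjunction (list) of conjunctions, each a list of (attribute, operator, value) triples, and it is scored with accuracy and F-measure. The representation, the function names, the sample values, and the "CASE" label are assumptions made for illustration; only the rule "Aph1b < 0.507 then CONTROL" comes from the slide.

```python
import operator

OPS = {"=": operator.eq, "≠": operator.ne, "<": operator.lt,
       ">": operator.gt, "≤": operator.le, "≥": operator.ge}

def matches(rule, sample):
    """True if any conjunction of the rule holds for the sample."""
    return any(all(OPS[op](sample[attr], value) for attr, op, value in conjunction)
               for conjunction in rule)

def evaluate(rule, rule_class, other_class, labelled_samples):
    """Return (accuracy, F-measure) of the rule, with rule_class as the positive class."""
    tp = fp = fn = correct = 0
    for sample, actual in labelled_samples:
        predicted = rule_class if matches(rule, sample) else other_class
        correct += predicted == actual
        tp += predicted == rule_class and actual == rule_class
        fp += predicted == rule_class and actual != rule_class
        fn += predicted != rule_class and actual == rule_class
    f_measure = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return correct / len(labelled_samples), f_measure

# "IF Aph1b < 0.507 then the experimental sample is CONTROL"
rule = [[("Aph1b", "<", 0.507)]]

# Hypothetical expression values and labels.
samples = [({"Aph1b": 0.31}, "CONTROL"), ({"Aph1b": 0.45}, "CONTROL"),
           ({"Aph1b": 0.62}, "CONTROL"), ({"Aph1b": 0.71}, "CASE"),
           ({"Aph1b": 0.88}, "CASE")]

accuracy, f_measure = evaluate(rule, "CONTROL", "CASE", samples)
print(accuracy, f_measure)
```

In practice the samples passed to `evaluate` would be the held-out part of a percentage split or the folds of a cross validation, not the data the rule was learned from.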
18
CAMUR
• Classifier with Alternative and Multiple Rule-based models (CAMUR)
• New method for classifying RNA-seq case-control samples, which is able to compute
multiple human readable classification models
• Aims of CAMUR:
1) To classify RNA-seq experiments
2) To extract several alternative and equivalent rule-based models,
which represent relevant sets of genes related to the case and control samples
• CAMUR extracts multiple classification models by adopting a feature elimination
technique and by iterating the classification procedure
• Prerequisite: gene expression normalization (RPKM or RSEM)
• Available at: http://dmb.iasi.cnr.it/camur.php
19
CAMUR: method
• CAMUR is based on:
1) a rule-based classifier (in this work, RIPPER)
2) an iterative feature elimination technique
3) a repeated classification procedure
4) an ad-hoc storage structure for the classification rules (CAMUR database)
• In brief, CAMUR:
• iteratively computes a rule-based classification model through the supervised
RIPPER algorithm,
• calculates the power set (or a partial combination) of the features present in the
rules,
• iteratively eliminates those combinations from the data set, and
• repeats the classification procedure until a stopping criterion is met:
– F-measure < threshold
– Maximum number of iterations reached
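The iterative procedure can be sketched as a loop over sets of features to exclude. This is a simplified illustration, not the CAMUR implementation: RIPPER is replaced by a stand-in learner that finds the best single rule of the form "feature >= threshold", so each model uses one feature and the power set step reduces to removing that feature. All function names, gene names, and values are hypothetical.

```python
from itertools import combinations

def power_set(features):
    """All non-empty subsets of the features that appear in a rule model."""
    items = sorted(features)
    return [frozenset(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

def f_measure(predicted, actual, positive):
    tp = sum(p == positive and a == positive for p, a in zip(predicted, actual))
    fp = sum(p == positive and a != positive for p, a in zip(predicted, actual))
    fn = sum(p != positive and a == positive for p, a in zip(predicted, actual))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def learn_rule(samples, labels, features, positive):
    """Stand-in for RIPPER: best single rule 'feature >= threshold => positive'."""
    best = None
    for feature in sorted(features):
        for threshold in sorted({s[feature] for s in samples}):
            predicted = [positive if s[feature] >= threshold else None for s in samples]
            score = f_measure(predicted, labels, positive)
            if best is None or score > best[0]:
                best = (score, feature, threshold)
    return best                                  # None when no features remain

def extract_models(samples, labels, positive, min_f=0.9, max_iterations=20):
    all_features = set(samples[0])
    queue = [frozenset()]                        # feature sets to exclude before learning
    seen = set(queue)
    models = []
    iterations = 0
    while queue and iterations < max_iterations:     # criterion: maximum iterations
        excluded = queue.pop(0)
        iterations += 1
        best = learn_rule(samples, labels, all_features - excluded, positive)
        if best is None or best[0] < min_f:          # criterion: F-measure < threshold
            continue                                 # this branch is not explored further
        score, feature, threshold = best
        models.append((feature, threshold, score))
        for subset in power_set({feature}):          # features present in the rule
            candidate = excluded | subset
            if candidate not in seen:
                seen.add(candidate)
                queue.append(candidate)
    return models

# Hypothetical data: geneA and geneB both separate the classes, geneC does not.
samples = [{"geneA": 9.0, "geneB": 8.0, "geneC": 1.0},
           {"geneA": 8.5, "geneB": 7.5, "geneC": 5.0},
           {"geneA": 1.0, "geneB": 2.0, "geneC": 4.0},
           {"geneA": 1.5, "geneB": 1.0, "geneC": 2.0}]
labels = ["tumoral", "tumoral", "normal", "normal"]

models = extract_models(samples, labels, positive="tumoral")
print(models)
```

On this toy data the loop returns two alternative models, one based on geneA and one on geneB, which mirrors the idea of extracting several equivalent rule-based models. For brevity the F-measure is computed on the training samples; a real run would use held-out data.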
20
Experimentation and results
21
Experimentation and results
22
Supervised model extraction
Classification models for breast cancer
(MAMDC2_dMet >= 6.63) and (ACACB_rnaSeq >= 887.80) => class=normal (19.0/3.0)
[ ] => class=tumoral (1102.0/1.0)
Correctly Classified Instances 98.11 %
Incorrectly Classified Instances 1.88 %
CAMUR: occurrences
Gene | Occurrences
FIGF_rnaSeq | 44
SPRY2_dMet | 37
SCN3A_rnaSeq | 25
PAMR1_dMet | 20
MMP11_rnaSeq | 20
CAMUR: rules
Class | Rule | Accuracy
Normal | (FIGF_rnaSeq >= 184.15) and (CLEC5A_dMet <= 5.44) || (TSHZ2_rnaSeq >= 471.04) and (DLGAP2_dMet >= 10.06) | 9.800
Normal | (SPRY2_dMet >= 0.55) and (CD300LG_rnaSeq >= 454.24) || (PAMR1_rnaSeq >= 712.17) and (PARP8_dMet >= 2.17) | 9.700
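A gene occurrence table like the one on this slide can be derived by counting how often each feature appears across the extracted rules. The sketch below applies that idea to the two rules shown here; counting each feature once per rule is an assumption, and the regular expression is a hypothetical helper rather than part of CAMUR.

```python
import re
from collections import Counter

# The two rules shown on the slide, as plain strings.
RULES = [
    "(FIGF_rnaSeq >= 184.15) and (CLEC5A_dMet <= 5.44) || "
    "(TSHZ2_rnaSeq >= 471.04) and (DLGAP2_dMet >= 10.06)",
    "(SPRY2_dMet >= 0.55) and (CD300LG_rnaSeq >= 454.24) || "
    "(PAMR1_rnaSeq >= 712.17) and (PARP8_dMet >= 2.17)",
]

# A feature name is the word that follows "(" and precedes a comparison operator.
FEATURE = re.compile(r"\(\s*(\w+)\s*(?:>=|<=|>|<|=)")

def count_occurrences(rules):
    """Count in how many rules each feature appears (once per rule)."""
    counts = Counter()
    for rule in rules:
        counts.update(set(FEATURE.findall(rule)))
    return counts

counts = count_occurrences(RULES)
print(counts.most_common())
```

With only these two rules every feature occurs once; the larger counts in the occurrence table would come from the full set of extracted models.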
23
Other applications on biomedical data
Aim: To extract relevant features from the ever-increasing amount of
biological data and to apply supervised learning to classify them
Biology Issue | Features | Software | Data source
Clinical patient classification | Clinical variables (blood, imaging, psychometric tests…) | DMB, Weka | Heterogeneous health care facilities
Gene Expression Analysis | Discretized gene expression profiles | Gela, CAMUR | TCGA, EBRI
DNA barcoding | Nucleotide sequences of DNA barcodes | Blog, Fasta2Weka | Barcode of Life Consortium
Polyoma/Rhino Viruses | Nucleotide sequences of Polyoma/Rhino viruses | DMB, MISSAL | Istituto Superiore di Sanità
EEG signal processing | Fourier coefficients extracted from EEG recordings | Matlab, Weka, DMB | IRCCS Centro di Neurolesi “Bonino-Pulejo” of Messina
Biomedical image processing | Oriented FAST and Rotated BRIEF (ORB) | Matlab, Weka, DMB | Alzheimer's Disease Neuroimaging Initiative
24
Conclusions and future directions
• Exponential growth of biomedical data
• Release of many public databases, data
collection and data management projects
• Data integration
• Supervised classification analysis
• Advanced systems for data integration
• New big data approaches
25
Acknowledgments
Emanuel Weitschek
Department of Engineering
Uninettuno International University
www.iasi.cnr.it/~eweitschek
emanuel@iasi.cnr.it
