BioWeka

BioWeka: Extending the Weka framework for Bioinformatics
Martin Szugat ([email_address])
http://www.bioweka.org

What is BioWeka?
- An extension to the Weka data mining framework for bioinformatics
- A framework for additional data mining tools in bioinformatics
- An open source project for the Weka/Bioinformatics community from the Ludwig-Maximilians-University, Munich (www.bio.ifi.lmu.de)

Agenda
- Introduction: Data mining & Biology
- Motivation: Comparability & Interoperability
- Solution: Extensibility & Standardization
- Foundation: Weka & Co.
- Implementation: Guidelines & Patterns
- Application: BioWeka & Eclat
- Conclusion: Prototypes & Experiments

KD, DM and ML
- Knowledge Discovery (KD): Process of finding unknown patterns in known data
- Data Mining (DM): Step in the KD process
  - Descriptive: Clustering, Association Rules
  - Predictive: Classification, Regression
- Machine Learning (ML): Methods for DM
  - Unsupervised → Descriptive Mining
  - Supervised → Predictive Mining

Knowledge Discovery Process
Data Selection and Preparation → Transformation and Reduction → Data Mining → Evaluation and Visualization

Mining biological data
- Clustering of gene expression data
  - Standard formats (CSV) and algorithms (k-Means)
- Classification of sequences
  - DNA/RNA: coding/non-coding, species
  - Proteins: function, structure or localization
- Data mining on strings not well supported
  - “Extraordinary” data formats, e.g. FASTA
  - Feature extraction: string → numbers
  - Classification based on alignment scores

Knowledge Discovery Experiment
Data Selection and Preparation → Transformation and Reduction → Data Mining → Evaluation and Visualization

Comparability & Interoperability
- Comparison with regard to performance
  - Find best combination of model, parameters and transformation
  - New method/data set vs. old method/data set
- Technical problems:
  - Different data formats: conversions
  - Different software interfaces: mappings, wrappers

Intermediate data format
[Diagram labels: Format A, Format B, Format C; Tool X, Tool Y, Tool Z; Mediator]

Unified interfaces
[Diagram: Loader → Data → Filter 1 … Filter n → Classifier]
- Loaders: FastaLoader, CSVLoader, …
- Filters: SymbolCounter, SymbolAnalyzer, …
- Classifiers: Alignment, SVM, …
→ Extendable and Customizable Execution Pipeline
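The loader/filter/classifier chain on this slide can be sketched in a few lines. This is only an illustration in Python, not BioWeka's or Weka's Java API: `run_pipeline`, `toy_loader`, `gc_content_filter` and `threshold_classifier` are hypothetical names, and the data is made up.

```python
from typing import Callable, Dict, Iterable, List

Instance = Dict[str, object]
Loader = Callable[[], Iterable[Instance]]
Filter = Callable[[List[Instance]], List[Instance]]
Classifier = Callable[[Instance], str]


def run_pipeline(loader: Loader, filters: List[Filter], classifier: Classifier) -> List[str]:
    """Load instances, pass them through every filter in order, then classify."""
    data = list(loader())
    for apply_filter in filters:
        data = apply_filter(data)
    return [classifier(instance) for instance in data]


def toy_loader() -> Iterable[Instance]:
    # Hypothetical in-memory stand-in for a file loader such as a FASTA reader.
    yield {"id": "s1", "sequence": "GGCC"}
    yield {"id": "s2", "sequence": "ATAT"}


def gc_content_filter(data: List[Instance]) -> List[Instance]:
    # Feature extraction: derive a numeric attribute from the sequence string.
    result = []
    for inst in data:
        seq = str(inst["sequence"])
        gc = sum(seq.count(c) for c in "GC") / len(seq)
        result.append({**inst, "gc": gc})
    return result


def threshold_classifier(inst: Instance) -> str:
    # Toy rule standing in for a trained model.
    return "gc-rich" if float(inst["gc"]) >= 0.5 else "at-rich"


labels = run_pipeline(toy_loader, [gc_content_filter], threshold_classifier)
```

Because every stage only depends on the shared instance representation, any loader, filter or classifier can be swapped without touching the others.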

Foundation: Weka
- Fortunately such a solution already exists: Weka (http://www.cs.waikato.ac.nz/ml/weka/)
- Open Source Software (GPL) written in Java
- Intermediate data format: ARFF
- Extendable: Classifier, Loader, Saver, Filter, …
- Extensive: > 70 Classifiers, > 40 Filters, …
- Interfaces: API, CLI, GUI

Tools & Libraries
- BioJava: Sequence handling
- JAligner: Smith-Waterman algorithm
- FoldRec: Secondary Structure Element Alignment
- BioJava-Ext: Needleman-Wunsch algorithm
- BLAST, PSI-BLAST
- AAindex: Amino acid properties
- InterProScan: Sequence Patterns

Implementation Guidelines
- Software for Users
  - Provide an easy-to-use GUI
  - Extend existing software well-known to the user
  - Make it extensible for additional software
- Software for Developers
  - Provide a well-documented API
  - Define interfaces for external extensions
  - Offer abstract base classes
  - Implement at least a simple class for each interface

BioWeka 0.4
- Weka extensions:
  - Converters: Load & save foreign file formats
  - Filters: Feature extraction & transformation
  - Classifiers: Alignment-based classifiers, ECLAT
- BioWeka specific:
  - Normalizers: Normalize feature vectors
  - Evaluators: Turn scores into likelihoods
  - …

Converters
- Sequence file formats: FASTA, EMBL, SwissProt, GenBank
  - Mappers map sequences & annotations into attributes
- XML-based file formats: InterProScan, ProML, MAGE-ML
  - Based on XSL stylesheets
- Gene expression data formats: TAV, MEV, Stanford, Spot, …
  - Customizable CSV loader
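As a rough illustration of what a sequence converter does, the following Python sketch reads FASTA records and writes them as ARFF with two string attributes. It is not BioWeka's converter: `parse_fasta` and `to_arff` are hypothetical helpers, and mapping only the identifier and the sequence is a simplifying assumption (the slide says mappers also handle annotations).

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def _quote(value: str) -> str:
    # ARFF string values are quoted; embedded quotes are backslash-escaped.
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def to_arff(records: Iterable[Tuple[str, str]], relation: str = "sequences") -> str:
    """Map each record to an ARFF row with an id and a sequence attribute."""
    lines = [
        f"@relation {relation}",
        "",
        "@attribute id string",
        "@attribute sequence string",
        "",
        "@data",
    ]
    for header, seq in records:
        parts = header.split()
        seq_id = parts[0] if parts else ""
        lines.append(f"{_quote(seq_id)},{_quote(seq)}")
    return "\n".join(lines)


fasta_text = ">seq1 test record\nACGT\nTT\n>seq2\nGG\n"
arff_text = to_arff(parse_fasta(fasta_text))
```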

Sequence filters
- Feature extraction:
  - Sequence properties: e.g. AAindex
  - Sequence composition: Codons, amino acids, etc.
  - Attribute normalization: counts → frequencies
- Transformation:
  - Translation pipelines: e.g. DNA to RNA to AA, reverse complement, stop codon termination
  - Frame shifter: e.g. generate open reading frames
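The feature-extraction and transformation steps named on this slide can be illustrated with a small Python sketch: codon counting, counts-to-frequencies normalization, reverse complement and frame shifting. The function names are hypothetical and the code is not taken from BioWeka; skipping codons that contain ambiguous symbols is an assumption.

```python
from itertools import product
from typing import Dict, List

CODONS: List[str] = ["".join(p) for p in product("ACGT", repeat=3)]
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Complement every base and reverse the strand."""
    return dna.upper().translate(_COMPLEMENT)[::-1]


def codon_counts(dna: str, frame: int = 0) -> Dict[str, int]:
    """Count non-overlapping codons starting at the given frame offset."""
    counts = dict.fromkeys(CODONS, 0)
    dna = dna.upper()
    for i in range(frame, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if codon in counts:  # assumption: skip codons with ambiguous symbols such as N
            counts[codon] += 1
    return counts


def to_frequencies(counts: Dict[str, int]) -> Dict[str, float]:
    """Attribute normalization: turn counts into relative frequencies."""
    total = sum(counts.values())
    if total == 0:
        return {codon: 0.0 for codon in counts}
    return {codon: n / total for codon, n in counts.items()}


def six_frames(dna: str) -> List[str]:
    """Frame shifting: three offsets on each strand."""
    rc = reverse_complement(dna)
    return [dna[f:] for f in range(3)] + [rc[f:] for f in range(3)]


frequencies = to_frequencies(codon_counts("ATGATGTAA"))
```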

Universal filters
- MultipleFilter: Build filter pipelines (→ FilteredClassifier + Trainable filters)
- Normalize: Normalization over a set of instances
- MergeSets: Merge two or more ARFF files
- Save: Export data set in a foreign file format
- SetClass: Set class attribute (→ Experimenter)

Alignment-based Classification
- Alignment methods (sequence → score):
  - 1 vs. 1: Local, global, secondary structure element alignment
  - 1 vs. m: BLAST, PSI-BLAST (WU-BLAST, etc.)
- Score evaluation (score → class probability):
  - Linear evaluators: Sum, max, average
  - Ranked evaluators: SimpleRankEvaluator
  - Meta evaluators: SimpleTransformingScoreEvaluator
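One plausible reading of the linear evaluators is: aggregate the alignment scores per class with sum, max or average, then normalize the aggregates so they add up to one. The Python sketch below follows that reading. It is an assumption, not BioWeka's implementation; `evaluate` and `average` are hypothetical names, and scores are assumed to be non-negative.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def average(scores: List[float]) -> float:
    return sum(scores) / len(scores)


def evaluate(
    scored: Iterable[Tuple[str, float]],
    aggregate: Callable[[List[float]], float],
) -> Dict[str, float]:
    """Turn (class label, alignment score) pairs into class probabilities."""
    per_class: Dict[str, List[float]] = defaultdict(list)
    for label, score in scored:
        per_class[label].append(score)
    aggregated = {label: aggregate(scores) for label, scores in per_class.items()}
    total = sum(aggregated.values())
    if total <= 0:
        # No usable evidence: fall back to a uniform distribution.
        return {label: 1.0 / len(aggregated) for label in aggregated}
    return {label: value / total for label, value in aggregated.items()}


# Scores of one query sequence against three labelled training sequences.
hits = [("plant", 30.0), ("plant", 10.0), ("fungus", 20.0)]
by_sum = evaluate(hits, sum)
by_max = evaluate(hits, max)
by_average = evaluate(hits, average)
```

The three evaluators disagree on the same hits, which is why precomputing the scores once and trying several evaluation schemes is attractive.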

Precomputed Alignments
- Precompute alignment scores → Try out different evaluation schemes
- AlignmentScorer filter: n sequences → n x n scoring matrix
  - Symmetric alignment: only about n^2/2 alignments have to be computed
- AlignmentScoreClassifier: based on evaluator
- Other: NN, SVM, …
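The saving from symmetry can be made concrete with a short Python sketch: only the upper triangle of the matrix is aligned and each score is mirrored. The scorer is a plain Smith-Waterman local alignment with a linear gap penalty; the scoring parameters and function names are illustrative assumptions, not those of JAligner or BioWeka's AlignmentScorer.

```python
from typing import Callable, List, Sequence


def smith_waterman_score(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def score_matrix(seqs: Sequence[str], scorer: Callable[[str, str], int]) -> List[List[int]]:
    """Fill an n x n matrix while calling the scorer only for j >= i."""
    n = len(seqs)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            matrix[i][j] = matrix[j][i] = scorer(seqs[i], seqs[j])
    return matrix


sequences = ["ACGT", "ACGA", "TTTT"]
scores = score_matrix(sequences, smith_waterman_score)
```

For n sequences this makes n(n+1)/2 scorer calls, diagonal included, instead of n^2.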

BioWeka distribution
- Documentation: Readme, Changelogs, API
- Libraries: BioWeka, BioJava, JAligner, …
- Source code: library & tests
- Data: AAindex database, Substitution matrices
- Stylesheets: ProML, MAGE-ML, InterProScan
- Patches: converter.pl (InterProScan)

BioWeka 0.4.1
- Batch scripts for Linux and Windows
- Integration of LibSVM via WLSVM (GPL)
- Integration of Weka-CG (GPL)
- Multifactor Dimensionality Reduction (MDR) filter
- More than 50 Weka components
- 247 Java classes with about 12800 lines of code
  - Majority are interfaces and abstract base classes → Extensibility

Application: ECLAT
- Friedel et al.: Support vector machines for separation of mixed plant-pathogen EST collections based on codon usage.
- Eclat: LibSVM & codon frequencies
- Reimplementation using standard Weka & BioWeka components
- Evaluation on the barley-blumeria set (1315/1902)

Training Eclat
[Diagram labels: Sequences, Codon frequencies, Normalization, SVM, Factors 1, Model 1; Frames, Codon frequencies, Normalization, SVM, Factors 2, Model 2]

Evaluating Eclat
[Diagram labels: Sequence, Frames, Normalization, SVM, Factors 2, Model 2, Correct Frame; Codon frequencies, Normalization, SVM, Factors 1, Model 1]

EclatClassifier/Filter
[Diagram labels: Filtered Classifier; SMO, Random Forest, Naive Bayes, JRip, J4.8; Multiple Filter; Translate, Terminator, Symbol Counter, Sum Normalizer, Remove/Copy, Normalize, MinMax Normalizer]
Parameters: newMin = -1.0, newMax = 1.0, pseudoCount = 1.0, alphabet = DNA, symbolWidth = 3.0
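The parameters on this slide (pseudoCount, newMin, newMax) suggest two normalization steps: a sum normalization of the codon counts and a min-max rescaling of each attribute. The Python sketch below shows the textbook form of both. How BioWeka applies the pseudocount is not stated on the slide, so adding it to every count before dividing is an assumption, as are the function names.

```python
from typing import List


def sum_normalize(counts: List[float], pseudo_count: float = 1.0) -> List[float]:
    """Turn counts into frequencies after adding a pseudocount to each entry."""
    total = sum(counts) + pseudo_count * len(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [(c + pseudo_count) / total for c in counts]


def min_max_normalize(column: List[float], new_min: float = -1.0, new_max: float = 1.0) -> List[float]:
    """Rescale one attribute over a set of instances to [new_min, new_max]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant attribute: map every value to the middle of the target range.
        return [(new_min + new_max) / 2.0 for _ in column]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (x - lo) * scale for x in column]


frequencies = sum_normalize([1.0, 3.0], pseudo_count=1.0)
rescaled = min_max_normalize([0.0, 5.0, 10.0])
```

The pseudocount keeps codons that never occur in a short sequence from getting a frequency of exactly zero, and the [-1, 1] range is a common input scaling for SVMs.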

Classifier Comparison
- 10 x Sampling with Stratification (2:1) → 10 x 3-fold CV

Accuracy [%] (SD)   SVM 1)       Random Forest   Naïve Bayes   JRip         J4.8
Eclat               92.9 (0.6)   88.9 (1.0)      87.8 (0.9)    85.9 (1.8)   82.2 (1.6)
BioWeka             93.1 (0.7)   88.2 (1.0)      87.1 (1.0)    84.8 (1.1)   81.5 (1.3)
Deviation           0.2          0.7             0.7           1.1          0.7

1) LibSVM vs. SMO
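Stratified sampling with a 2:1 ratio keeps the class proportions equal in the training and test parts. Each fold of a 3-fold cross-validation also trains on two thirds and tests on one third, which is presumably why the two schemes are set side by side. A minimal Python sketch with a hypothetical `stratified_split` helper and made-up labels:

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[str], train_fraction: float = 2 / 3, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split instance indices so every class keeps the same train/test ratio."""
    rng = random.Random(seed)
    by_class: Dict[str, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train: List[int] = []
    test: List[int] = []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


example_labels = ["barley"] * 6 + ["blumeria"] * 9
train_idx, test_idx = stratified_split(example_labels)
```

Repeating the split with ten different seeds gives the "10 x Sampling" scheme named on the slide.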

EclatFrameFinder/Classifier
[Diagram labels: EclatFrameClassifier, EclatFrameFinder, EclatFrameShifter, FrameShifter, EclatClassifier]

Frame Prediction
- 10 x Sampling (2:1) with Stratification
- Half of the sequences in the correct frame, half in an incorrect frame (randomly chosen)
- All frames incorrect or multiple frames correct → Eclat: Hyperplane margin, BioWeka: Random

Implementation   Accuracy   Standard deviation
Eclat            97.7       0.4
BioWeka          95.1       0.6
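The tie-breaking rule on this slide can be sketched as follows: if exactly one frame is predicted correct it is taken; otherwise the choice falls back either to the largest hyperplane margin (as stated for Eclat) or to a random pick (as stated for BioWeka). The Python function `pick_frame` is hypothetical, and treating the margin as a signed distance where larger means "more likely correct" is an assumption.

```python
import random
from typing import List, Optional, Sequence


def pick_frame(
    predicted_correct: Sequence[bool],
    margins: Optional[Sequence[float]] = None,
    rng: Optional[random.Random] = None,
) -> int:
    """Return the index of the frame chosen as the correct one."""
    candidates: List[int] = [i for i, ok in enumerate(predicted_correct) if ok]
    if len(candidates) == 1:
        return candidates[0]
    # Ambiguous case: several frames predicted correct, or none at all.
    pool = candidates or list(range(len(predicted_correct)))
    if margins is not None:
        return max(pool, key=lambda i: margins[i])  # margin-based choice
    rng = rng or random.Random()
    return rng.choice(pool)  # random choice


unique = pick_frame([False, True, False])
by_margin = pick_frame([True, True, False], margins=[0.2, 0.9, -1.0])
none_correct = pick_frame([False, False, False], margins=[-0.5, -0.1, -2.0])
by_chance = pick_frame([True, False, True], rng=random.Random(0))
```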

Eclat
[Diagram labels: Eclat, EclatFrameFinder, EclatClassifier]

Species discrimination
- Species discrimination & frame prediction
- 10 x 10-fold Cross-Validation

Implementation   Accuracy   Standard deviation
Eclat            93.1       na
BioWeka/SMO      91.3       1.0
BioWeka/LibSVM   92.0       1.4 (10-CV)

BioWeka’s Eclat
- Complete solution: 647 lines of code
- Integrated in the Weka workbench
- Reusability: EclatFilter, EclatFrameFinder, …
- Configurability: Configure Filter / Classifier
- Extensibility: Replace Filter / Classifier
- Runtime performance (10-CV): BioWeka 23 min. vs. Eclat 5 min.

Prototyping & Experiments
- Evaluating standard procedures without writing a single line of code using BioWeka
- BioWeka is good for rapid application development → Prototypes
- Experimenting with different data sets, filters, classifiers, etc. is easy within Weka
- Runtime performance is the weak point of (Bio)Weka

Web & Download Statistics
- Project site: sourceforge.net/projects/bioweka
  - SourceForge → Open Source Project (GNU GPL)
  - Forums, mailing lists, bug tracker, CVS, …
  - Downloads of BioWeka 0.4: > 110 (12/07/2005)
- Web site: www.bioweka.org
  - MediaWiki → Open Content (GNU FDL)
  - Project description, documentation, news, …
  - Main page hits: > 900 (12/07/2005)

Acknowledgements
- LMU members: Ralf Zimmer, Jan Gewehr, Caroline Friedel
- Other: Weka contributors, Mark Schreiber (BioJava), Andreas Dräger (NeedlemanWunsch), Ahmed Moustafa (JAligner), Joe White (MAGE-ML), many more …

Thanks for your attention! Questions?
http://www.bioweka.org
