Knowledge Discovery through Data Mining. C. Devakumar, Indian Council of Agricultural Research, New Delhi-110 012
Introduction Motivation: Why data mining? What is data mining? Data Mining: On what kind of data? Data mining functionality Classification of data mining systems Major issues in data mining
Why Data Mining? The explosive growth of data: from terabytes to petabytes. Data collection and data availability: automated data collection tools, database systems, the Web, computerized society. Major sources of abundant data: Business (Web, e-commerce, transactions, stocks, …); Science (remote sensing, bioinformatics, scientific simulation, …); Society and everyone (news, digital cameras, YouTube). We are drowning in data, but starving for knowledge! “Necessity is the mother of invention”: data mining, the automated analysis of massive data sets.
Evolution of Database Technology. 1960s: Data collection, database creation, IMS and network DBMS. 1970s: Relational data model, relational DBMS implementation. 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.); application-oriented DBMS (spatial, scientific, engineering, etc.). 1990s: Data mining, data warehousing, multimedia databases, and Web databases. 2000s: Stream data management and mining; data mining and its applications; Web technology (XML, data integration) and global information systems.
What is Data Mining? Many Definitions Non-trivial extraction of implicit, previously unknown and potentially useful information from data Exploration & analysis, by automatic or semi-automatic means, of  large quantities of data in order to discover meaningful patterns
What is not Data Mining? Look up phone number in phone directory Query a Web search engine for information
Origins of Data Mining: Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems. Traditional techniques may be unsuitable due to the enormity of data, the high dimensionality of data, and the heterogeneous, distributed nature of data. [Figure: data mining at the overlap of machine learning/pattern recognition, statistics/AI, and database systems]
Data + Interestingness criteria = Hidden patterns
Type of data + Type of interestingness criteria = Type of patterns
Type of Data Tabular  (Ex: Transaction data) Relational Multi-dimensional Spatial  (Ex: Remote sensing data) Temporal  (Ex: Log information)  Streaming  (Ex: multimedia, network traffic) Spatio-temporal  (Ex: GIS) Tree  (Ex: XML data) Graphs  (Ex: WWW, BioMolecular data) Sequence  (Ex: DNA, activity logs)  Text, Multimedia …
Type of Interestingness Frequency Rarity Correlation  Length of occurrence  (for sequence and temporal data) Consistency  Repeating / periodicity  “ Abnormal” behavior  Other patterns of interestingness…
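Frequency and correlation, the first criteria in this list, can be made concrete on transaction data. The sketch below is a minimal illustration and is not taken from the slides: it uses the standard support, confidence, and lift measures from association rule mining, and the function names and the toy `transactions` are assumptions chosen for the example.

```python
def support(transactions, itemset):
    """Frequency criterion: fraction of transactions containing every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """How often the consequent appears in transactions that contain the antecedent.

    Assumes the antecedent occurs at least once (otherwise this divides by zero).
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Correlation criterion: above 1 means the two sides co-occur more often
    than expected if they were independent, below 1 means less often."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)


# Hypothetical market-basket data, invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(transactions, {"bread", "milk"}))                  # 0.5
print(round(confidence(transactions, {"bread"}, {"milk"}), 3))   # 0.667
print(round(lift(transactions, {"bread"}, {"milk"}), 3))         # 0.889
```

Rarity can be read off the same `support` function by looking for itemsets whose support falls below a small threshold instead of above a large one.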
Statistics: Conceptual model (hypothesis) → Statistical reasoning → “Proof” (validation of hypothesis)
Data mining: Data → Mining algorithm based on interestingness → Pattern (model, rule, hypothesis) discovery
Explores Your Data Finds Patterns Performs Predictions
[Figure: business insight versus role of software. Stages: Presentation (passive), Exploration (interactive), Discovery (proactive). Tools shown: canned reporting, ad-hoc reporting, OLAP, data mining, predictive analysis]
Data Mining Tasks Prediction Methods Use some variables to predict unknown or future values of other variables. Description Methods Find human-interpretable patterns that describe the data.
Data Mining Tasks... Classification [Predictive] Regression [Predictive] Clustering [Descriptive] Association Rule Discovery [Descriptive] Semi-supervised Learning Semi-supervised Clustering Semi-supervised Classification
Challenges of Data Mining Scalability Dimensionality Complex and Heterogeneous Data Data Quality Data Ownership and Distribution Privacy Preservation Streaming Data
Data Mining and Business Intelligence  Increasing potential to support business decisions End User Business Analyst Data Analyst DBA Decision   Making Data Presentation Visualization Techniques Data Mining Information Discovery Data Exploration Statistical Summary, Querying, and Reporting Data Preprocessing/Integration, Data Warehouses Data Sources Paper, Files, Web documents, Scientific experiments, Database Systems
Data Mining: Confluence of Multiple Disciplines  Data Mining Database  Technology Statistics Machine Learning Pattern Recognition Algorithm Other Disciplines Visualization
Architecture: Typical Data Mining System data cleaning, integration, and selection Database or Data Warehouse Server Data Mining Engine Pattern Evaluation Graphical User Interface Knowledge-Base Database Data  Warehouse World-Wide Web Other Info Repositories
Multi-Dimensional View of Data Mining Data to be mined Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW Knowledge to be mined Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc. Multiple/integrated functions and mining at multiple levels Techniques utilized Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc. Applications adapted Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
Data Mining: Classification Schemes General functionality Descriptive data mining  Predictive data mining Different views lead to different classifications Data  view: Kinds of data to be mined Knowledge  view: Kinds of knowledge to be discovered Method  view: Kinds of techniques utilized Application  view: Kinds of applications adapted
Data Mining Functionalities Multidimensional concept description: Characterization and discrimination Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions Frequent patterns, association, correlation vs. causality Classification and prediction  Construct models (functions) that describe and distinguish classes or concepts for future prediction E.g., classify countries based on (climate), or classify cars based on (gas mileage) Predict some unknown or missing numerical values
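The last point, predicting an unknown or missing numerical value, can be sketched with an ordinary least-squares line fit. This is only one possible technique and is not prescribed by the slides; the function names and the sample numbers are assumptions made up for the illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the xs are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Predict the numerical value for a new x from a fitted (slope, intercept) pair."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical known observations; the value at x = 5 is treated as missing.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

model = fit_line(xs, ys)
print(model)               # (2.0, 1.0)
print(predict(model, 5))   # 11.0
```

Classification follows the same fit-then-predict pattern, except that the predicted quantity is a class label rather than a number.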
Data Mining Functionalities (2) Cluster analysis Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns Maximizing intra-class similarity & minimizing interclass similarity Outlier analysis Outlier: Data object that does not comply with the general behavior of the data Noise or exception? Useful in fraud detection, rare events analysis Trend and evolution analysis Trend and deviation: e.g., regression analysis Sequential pattern mining: e.g., digital camera → large SD memory Periodicity analysis Similarity-based analysis Other pattern-directed or statistical analyses
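As a rough illustration of outlier analysis, the sketch below flags values that lie far from the mean in units of standard deviation (a z-score test). The slides do not name a specific method, so the technique, the threshold of 2.0, and the sample readings are all assumptions chosen for the example.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds threshold.

    The z-score of v is |v - mean| / standard deviation; the threshold
    is an assumed cut-off, not a value taken from the slides.
    """
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical sensor readings with one value that does not fit the rest.
readings = [10, 11, 9, 10, 12, 10, 50]
print(find_outliers(readings))  # [50]
```

Whether a flagged value is noise to discard or an exception worth investigating (for example, a fraudulent transaction) is a domain decision that the test itself cannot make.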
Major Issues in Data Mining Mining methodology  Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web Performance: efficiency, effectiveness, and scalability Pattern evaluation: the interestingness problem Incorporation of background knowledge Handling noise and incomplete data Parallel, distributed and incremental mining methods Integration of the discovered knowledge with existing one: knowledge fusion  User interaction Data mining query languages and ad-hoc mining Expression and visualization of data mining results Interactive mining of knowledge at multiple levels of abstraction Applications and social impacts Domain-specific data mining & invisible data mining Protection of data security, integrity, and privacy
Bioinformatics, Computational Biology, Data Mining Bioinformatics is an interdisciplinary field about the information processing problems in computational biology and a unified treatment of the data mining methods for solving these problems. Computational Biology is about modeling real data and simulating unknown data of biological entities, e.g. Genomes (viruses, bacteria, fungi, plants, insects,…) Proteins and Proteomes Biological Sequences Molecular Function and Structure Data Mining is searching for knowledge in data Knowledge mining from databases Knowledge extraction Data/pattern analysis Data dredging Knowledge Discovery in Databases (KDD)
Problems in Bioinformatics Domain: Data production at the levels of molecules, cells, organs, organisms, populations. Integration of structure and function data, gene expression data, pathway data, phenotypic and clinical data, … Prediction of molecular function and structure. Computational biology: synthesis (simulations) and analysis (machine learning).
Tokenization splits a text document into a stream of words by removing all punctuation marks and by replacing tabs and other non-text characters with single white spaces. Filtering methods remove words like articles, conjunctions, prepositions, etc. Lemmatization methods try to map verb forms to the infinitive and nouns to their singular form. Stemming methods attempt to build the basic forms of words, for example, by stripping the plural 's' from nouns, the 'ing' from verbs, or other affixes. Additional linguistic preprocessing: N-grams individualization, which yields generic n-word sequences that do not necessarily correspond to an idiomatic use; Anaphora resolution, which identifies the relationship between a linguistic expression (anaphora) and its preceding phrase, thus determining the corresponding reference; Part-of-speech (POS) tagging, which determines the part-of-speech tag (noun, verb, adjective, etc.) for each term; Text chunking, which aims at grouping adjacent words in a sentence; Word Sense Disambiguation (WSD), which tries to resolve the ambiguity in the meaning of single words or phrases; Parsing, which produces a full parse tree of a sentence (subject, object, etc.).
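The first steps described above (tokenization, filtering, stemming, and n-grams) can be sketched in a few lines. This is a deliberately crude illustration, not the GATE pipeline used in the cited work: the stop-word list, the lowercasing step, and the suffix rules are assumptions, and a real stemmer or lemmatizer handles far more cases.

```python
import re

# Tiny illustrative stop-word list; real filters use much larger ones.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "on", "to", "is", "are"}


def tokenize(text):
    """Replace every run of non-letter characters with one space, then split.

    Lowercasing is an added assumption so that later lookups are case-insensitive.
    """
    return re.sub(r"[^A-Za-z]+", " ", text).lower().split()


def filter_stop_words(tokens):
    """Drop articles, conjunctions, prepositions and similar words."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip 'ing' from longer words, else a plural 's' (very rough rules)."""
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def ngrams(tokens, n):
    """Return all sequences of n adjacent tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


text = "The genes are binding to proteins in the cell."
tokens = tokenize(text)
content = filter_stop_words(tokens)
stems = [crude_stem(t) for t in content]

print(stems)             # ['gene', 'bind', 'protein', 'cell']
print(ngrams(stems, 2))  # [('gene', 'bind'), ('bind', 'protein'), ('protein', 'cell')]
```

The later steps (anaphora resolution, POS tagging, chunking, word sense disambiguation, parsing) need linguistic models and are not attempted here.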
Castellano, M. et al. A bioinformatics knowledge discovery in text application for grid computing. BMC Bioinformatics 2009, 10(Suppl 6):S23
BIOINFORMATICS ARCHITECTURE: The layered architecture consists of the GATE 4.0 toolkit for text mining, a middleware solution written with the Java API, the grid infrastructure middleware, and a physical layer that consists of a GNU/Linux operating system. GATE, an integrated development environment, was used for the text mining process. It operated on a collection of full-text scientific publications available on MEDLINE/PubMed (in PDF format).
Technology Platform
What is New in SQL Server 2008? Data Mining Enhancements. In addition to other new aspects of SQL Server: Enhanced Mining Structures; Easier to prepare and test your models; Models allow for cross-validation; Filtering; Algorithm Updates; Improved Time Series algorithm combining best of ARIMA and ARTXP; “What-If” analysis; Microsoft Data Mining Framework; Supplements CRISP-DM
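Cross-validation, mentioned above, is a general model-testing idea rather than something specific to SQL Server. The sketch below shows only the generic k-fold splitting step in plain Python; it does not use or imitate the SQL Server API, and the function name `k_fold_indices` is an assumption made for this illustration.

```python
def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k (train, test) pairs.

    Each index appears in exactly one test fold. Folds are contiguous and
    unshuffled here; in practice the data is usually shuffled first.
    """
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


for train, test in k_fold_indices(5, 2):
    print("train:", train, "test:", test)
# train: [3, 4] test: [0, 1, 2]
# train: [0, 1, 2] test: [3, 4]
```

A model would be fitted on each `train` part and scored on the matching `test` part, and the k scores averaged to estimate how well it generalizes.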
Microsoft DM Competitors: SAS, largest market share of DM, specialised product for traditional experts; SPSS (Clementine), strength in statistical analysis; IBM (Intelligent Miner), tied to DB2, interoperates with Microsoft through PMML; Oracle (10g), supports Java APIs; Angoss (KnowledgeSTUDIO), result visualisation, works with SQL Server; KXEN, supports OLAP and Excel
Data mining: Discovering interesting patterns from large amounts of data A natural evolution of database technology, in great demand, with wide applications Includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation Mining can be performed in a variety of information repositories Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc. Data mining systems and architectures Major issues in data mining;  Let’s mine for valuable gems of knowledge in our databases!
