Data Mining Lecture 1
Why Data Mining?
Negative Points
• It all sounds complicated. Why should I learn about Data Mining?
• What's wrong with just relational databases? Why would I want to go through these extra complicated steps?
• Isn't it expensive? It sounds like it takes a lot of skills, programming, computational time and storage space. Where's the benefit?
Positive Points
• Data Mining is not just a cute academic exercise
• It has very profitable real-world uses
• Practically all large companies and many governments perform data mining as part of their planning and analysis
Data Explosion
Data explosion problem
• Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
• We are drowning in data, but starving for knowledge!
Solution: Data Mining
• Extraction of interesting knowledge (rules, regularities, patterns, constraints) from large datasets
What is Data Mining?
Data mining (knowledge discovery in databases):
• Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
Alternative names and their “inside stories”:
• Data mining: a misnomer?
• Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archaeology, data dredging, information harvesting, business intelligence, etc.
What is not data mining?
• (Deductive) query processing
• Expert systems or small ML/statistical programs
Potential Applications
Database analysis and decision support
• Market analysis and management
  o Target marketing, customer relationship management, market basket analysis, cross selling, market segmentation
• Risk analysis and management
  o Forecasting, customer retention, improved underwriting, quality control, competitive analysis
• Fraud detection and management
Other Applications
• Text mining (news groups, email, documents) and Web analysis
• Intelligent query answering
Market Analysis
Where are the data sources for analysis?
• Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
Target marketing
• Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
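To make the idea of finding clusters of similar customers concrete, here is a minimal sketch using plain k-means. The slides do not name a clustering algorithm, so k-means is an assumption, as are the sample figures (income in thousands, monthly spend in hundreds) and the function name `kmeans`.

```python
from math import dist

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute centroids; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return clusters, centroids

# Hypothetical customers: (income in thousands, monthly spend in hundreds).
customers = [(20, 2), (22, 3), (21, 2.5), (80, 10), (85, 12), (82, 11)]

# Starting centroids are picked by hand here so the run is deterministic.
clusters, centroids = kmeans(customers, [customers[0], customers[3]])
for members, centre in zip(clusters, centroids):
    print(centre, members)
```

Each resulting cluster can be read as one "model" customer (its centroid) plus the real customers who resemble it.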
Determine customer purchasing patterns over time
• Conversion of a single to a joint bank account: marriage, etc.
Cross-market analysis
• Associations/correlations between product sales
• Prediction based on the association information
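Associations between product sales are commonly measured with support (how often two products sell together) and confidence (how often the second product appears given the first). The sketch below computes both for product pairs. The baskets, the `min_support` threshold and the function name `pair_rules` are illustrative assumptions, not taken from the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets (one set of products per transaction).
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]

def pair_rules(baskets, min_support=0.4):
    """Return {(antecedent, consequent): (support, confidence)} for item pairs."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # pair is too rare to be interesting
        rules[(a, b)] = (support, count / item_counts[a])
        rules[(b, a)] = (support, count / item_counts[b])
    return rules

rules = pair_rules(baskets)
for (a, b), (support, confidence) in sorted(rules.items()):
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```

A rule such as "butter -> bread" with high confidence is the kind of association that can then drive a prediction or a cross-selling offer.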
Corporations
Finance planning and asset evaluation
• Cash flow analysis and prediction
• Contingent claim analysis to evaluate assets
• Cross-sectional and time series analysis (financial ratio, trend analysis, etc.)
Resource planning
• Summarize and compare the resources and spending
Competition
• Monitor competitors and market directions
• Group customers into classes and set a class-based pricing procedure
• Set pricing strategy in a highly competitive market
Market Management
Customer profiling
• Data mining can tell you what types of customers buy what products (clustering or classification)
Identifying customer requirements
• Identifying the best products for different customers
• Use prediction to find what factors will attract new customers
Provides summary information
• Various multidimensional summary reports
• Statistical summary information (data central tendency and variation)
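As a small illustration of statistical summary information, the sketch below reports central tendency (mean, median) and variation (sample standard deviation) per customer segment. The segments, the amounts and the function name `summarise_by_segment` are made up for the example.

```python
import statistics
from collections import defaultdict

# Hypothetical purchase records: (customer segment, amount spent).
purchases = [
    ("student", 20), ("student", 30),
    ("family", 120), ("family", 80), ("family", 100),
]

def summarise_by_segment(records):
    """Group amounts by segment and summarise each group."""
    groups = defaultdict(list)
    for segment, amount in records:
        groups[segment].append(amount)
    return {
        segment: {
            "n": len(amounts),
            "mean": statistics.mean(amounts),      # central tendency
            "median": statistics.median(amounts),  # central tendency
            "stdev": statistics.stdev(amounts),    # variation (sample std. dev.)
        }
        for segment, amounts in groups.items()
    }

report = summarise_by_segment(purchases)
for segment, stats in report.items():
    print(segment, stats)
```

Grouping by more than one attribute (for example segment and region) would give the multidimensional flavour of report mentioned above.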
Fraud Detection
Applications
• Widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
Approach
• Use historical data to build models of fraudulent behaviour and use data mining to help identify similar instances
Examples
• Auto insurance: detect a group of people who stage accidents to collect on insurance
• Money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
• Medical insurance: detect professional patients and ring of doctors and ring of references
Detecting inappropriate medical treatment
• Australian Health Insurance Commission identified that in many cases blanket screening tests were requested (saving Australian $1m/yr)
Detecting telephone fraud
• Telephone call model: destination of the call, duration, time of day or week. Analyse patterns that deviate from an expected norm
• British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud
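One simple way to look for calls that deviate from an expected norm is a z-score test against a customer's own call history. The sketch below is only an illustration: the call records, the threshold of three standard deviations and the function name `flag_unusual_calls` are assumptions, and real fraud models would also use destination and time of day.

```python
import statistics

def flag_unusual_calls(history_minutes, new_calls, threshold=3.0):
    """Flag calls whose duration is more than `threshold` standard
    deviations away from the customer's historical mean duration."""
    mean = statistics.mean(history_minutes)
    stdev = statistics.stdev(history_minutes)
    if stdev == 0:
        # No variation in the history: anything different is unusual.
        return [call for call in new_calls if call["minutes"] != mean]
    return [
        call for call in new_calls
        if abs(call["minutes"] - mean) / stdev > threshold
    ]

# Hypothetical history of call durations (minutes) for one customer.
history = [3, 5, 4, 6, 2, 5, 4, 3]
new_calls = [
    {"destination": "local", "minutes": 5},
    {"destination": "premium-rate", "minutes": 95},
]

flagged = flag_unusual_calls(history, new_calls)
print(flagged)
```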
Retail
• Analysts estimate that 38% of retail shrink is due to dishonest employees
Other Applications
Sports
• IBM Advanced Scout analysed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat
Astronomy
• JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
Internet Web Surf-Aid
• IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behaviour, analyse the effectiveness of Web marketing, improve Web site organisation, etc.
KDD Process
Learning the application domain
• Relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and pre-processing (may take 60% of effort!)
Data reduction and transformation
• Find useful features, dimensionality/variable reduction, invariant representation
Choosing functions of data mining
• Summarisation, classification, regression, association, clustering
Choosing the mining algorithms
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
• Visualisation, transformation, removing redundant patterns, etc.
Use of discovered knowledge
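The steps above can be sketched as a tiny pipeline: select, clean, transform, mine, evaluate. Everything in this sketch is an illustrative assumption (the records, the field names, the min-max scaling and the trivial "high/low spender" pattern). It only shows how the stages feed into one another, not how a real KDD system is built.

```python
# Hypothetical raw records with a missing value and a duplicate.
raw = [
    {"customer": "a", "age": 34, "spend": 120.0, "note": "x"},
    {"customer": "b", "age": None, "spend": 80.0, "note": "y"},
    {"customer": "c", "age": 51, "spend": 300.0, "note": "z"},
    {"customer": "c", "age": 51, "spend": 300.0, "note": "z"},
    {"customer": "d", "age": 29, "spend": 60.0, "note": "w"},
]

def select(rows, fields):
    """Data selection: keep only the fields relevant to the goal."""
    return [{f: r[f] for f in fields} for r in rows]

def clean(rows):
    """Data cleaning: drop rows with missing values and exact duplicates."""
    seen, out = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if None in r.values() or key in seen:
            continue
        seen.add(key)
        out.append(r)
    return out

def transform(rows, field):
    """Transformation: min-max scale one numeric field to [0, 1]."""
    values = [r[field] for r in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    return [{**r, field + "_scaled": (r[field] - lo) / span} for r in rows]

def mine(rows, field):
    """Mining: a deliberately trivial pattern, high vs. low spenders."""
    groups = {"high": [], "low": []}
    for r in rows:
        groups["high" if r[field] >= 0.5 else "low"].append(r["customer"])
    return groups

def evaluate(groups, min_size=1):
    """Pattern evaluation: keep only groups with enough members."""
    return {name: m for name, m in groups.items() if len(m) >= min_size}

selected = select(raw, ["customer", "age", "spend"])
patterns = evaluate(mine(transform(clean(selected), "spend"), "spend_scaled"))
print(patterns)
```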
Business Intelligence
Architecture of DM System
DM Software Tools
• We can find several Data Mining tools
• Commercial and free (open source) software tools
• NeuroOnline, ARMiner, Tminer, Bayda, Cluto, YALE, Rainbow, MDR, C5, Clementine, etc.
RapidMiner
• Free version
• Commercial version
Data mining 1 - Introduction (cheat sheet - printable)
