University of Mannheim – Prof. Bizer: Data Mining Slide 1
Data Mining
Introduction to Data Mining
Hello
 Prof. Dr. Christian Bizer
 Professor for Information Systems V
 Research Interests:
• Data and Web Mining
• Web Data Integration
• Data Web Technologies
 Room: B6 - B1.15
 eMail: chris@informatik.uni-mannheim.de
Hello
 M. Sc. Wi-Inf. Anna Primpeli
 Graduate Research Associate
 Research Interests:
• Semantic Annotations in Web Pages
• Active Learning for Identity Resolution
• Product Data Integration
 Room: B6, 26, C 1.04
 eMail: anna@informatik.uni-mannheim.de
 Will teach the RapidMiner exercises and
will supervise student projects
Hello
 M. Sc. Wi-Inf. Ralph Peeters
 Graduate Research Associate
 Research Interests:
• Entity Matching using Deep Learning
• Product Data Integration
 Room: B6, 26, C 1.04
 eMail: ralph@informatik.uni-mannheim.de
 Will teach the Python exercises and will
supervise student projects.
Course Organisation
 Lecture
• introduces the principal methods of data mining
• discusses how to evaluate generated models
• presents practical examples of data mining applications
from the corporate and Web context
 Exercise Groups
• students experiment with the methods using RapidMiner or Python
 Project Work
• teams of six students realize a data mining project
• teams may choose their own data sets and tasks
(in addition, I will propose some suitable data sets and tasks)
• teams write a 10-page summary of their project and present the results
 Grading
• 75% written exam, 20% project report, 5% presentation of project results
Course Organisation
 Course Webpage
• provides up-to-date information, lecture slides, and exercise material
• https://www.uni-mannheim.de/dws/teaching/course-details/courses-for-master-candidates/ie-500-data-mining/
 Solutions to the Exercises
• ILIAS eLearning System, https://ilias.uni-mannheim.de/
 Time and Location
• Lecture: Wednesday, 10:15 - 11:45, A5, B144
• Exercise: Thursday, 10:15 - 11:45, Room B6, A104 (RapidMiner, Anna)
• Exercise: Thursday, 12:00 - 13:30, Room B6, A104 (Python, Ralph)
• Exercise: Thursday, 13:45 - 15:15, Room B6, A104 (Python, Ralph)
Lecture Contents
1. Introduction to Data Mining: What is Data Mining?, Tasks and Applications, The Data Mining Process
2. Cluster Analysis: K-means Clustering, Density-based Clustering, Hierarchical Clustering, Proximity Measures
3. Classification: Nearest Neighbor, Decision Trees, Model Evaluation, Rule Learning, Naïve Bayes, Neural Networks, Support Vector Machines
4. Regression: Linear Regression, Nearest Neighbor Regression, Regression Trees, Time Series
5. Association Analysis: Frequent Item Set Generation, Rule Generation, Interestingness Measures
6. Text Mining: Preprocessing Text, Feature Generation, Feature Selection, RapidMiner Text Extension
Schedule
Week | Wednesday | Thursday
12.02.2020 | Introduction to Data Mining | Exercise Preprocessing/Visualization
(week of 12.02.2020, additionally: Intro to Python, 15:30, A5, C 013)
19.02.2020 | Lecture Cluster Analysis | Exercise Cluster Analysis
26.02.2020 | Lecture Classification 1 | Exercise Classification
04.03.2020 | Lecture Classification 2 | Exercise Classification
11.03.2020 | Lecture Classification 3 | Exercise Classification
18.03.2020 | Lecture Regression | Exercise Regression
25.03.2020 | Lecture Association Analysis | Exercise Association Analysis
01.04.2020 | Lecture Text Mining | Exercise Text Mining
22.04.2020 | Group Formation for Student Projects (Attendance obligatory) | Preparation of Project Proposal
29.04.2020 | Feedback on Project Proposals | Project Work
06.05.2020 | Feedback on demand | Project Work
13.05.2020 | Feedback on demand | Project Work
20.05.2020 | Feedback on demand | Project Work
27.05.2020 | Presentation of project results | Presentation of project results
Deadlines
 Submission of project proposal
• Sunday, April 26th, 23:59
 Submission of final project report
• Wednesday, May 20th, 23:59
 Project presentations
• Wednesday, May 27th, and Thursday, May 28th
• everyone has to attend the presentations
Final Exam
▪ Date and Time: 8th June
▪ Room: tba
▪ Duration: 60 minutes
▪ Structure: 6 open questions
• The goal is to check whether you have understood the lecture content
• We try to cover all major chapters of the lecture: clustering, classification, regression, association analysis, text mining
• The questions require you to describe the ideas behind algorithms and methods (often: How do methods react to special patterns in the data?)
• The questions might require you to do some simple calculations, for which you need to know the most relevant formulas but do not need a calculator
Text Book for the Course
Pang-Ning Tan, Michael Steinbach, Vipin Kumar:
Introduction to Data Mining. 2nd Edition.
Pearson / Addison Wesley.
Lecture Videos and Screencasts
1. Video recordings of all lectures from FSS 2015
2. Step-by-step introduction to relevant RapidMiner features
3. Step-by-step solutions of the exercises
http://dws.informatik.uni-mannheim.de/en/teaching/lecture-videos/
Questions?
Outline: Introduction to Data Mining
1. What is Data Mining?
2. Tasks and Applications
3. The Data Mining Process
4. Data Mining Software
1. What is Data Mining?
▪ Large quantities of data are collected about all aspects of our lives
▪ This data contains interesting patterns
▪ Data Mining helps us to
1. discover these patterns and
2. use them for decision making across all areas of society, including
▪ Business and industry
▪ Science and engineering
▪ Medicine and biotech
▪ Government
▪ Individuals
Sloan Digital Sky Survey
≈ 200 GB/day
≈ 73 TB/year
Predict
• Type of sky object:
Star or galaxy?
“We are Drowning in Data...”
US Library of Congress
≈ 235 TB archived
≈ 40 Wikipedias
Discover
• Topic distributions
• Historic trends*
• Citation networks
“We are Drowning in Data...”
* Lansdall-Welfare et al.: Content analysis of 150 years of British periodicals. PNAS, 2017.
Facebook
• 4 petabytes of new data generated every day
• over 300 petabytes in Facebook's data warehouse
Predict
• Interests and behavior of over one billion people
“We are Drowning in Data...”
https://www.brandwatch.com/blog/facebook-statistics/
http://www.technologyreview.com/featuredstory/428150/what-facebook-knows/
“We are Drowning in Data...”
Predict
• Interests and
behavior of mankind
“We are Drowning in Data...”
Law enforcement agencies
collect unknown amounts of
data from various sources
• Cell phone calls
• Location data
• Web browsing behavior
• Credit card transactions
• Online profiles (Facebook)
• …
Predict
• Terrorist or not?
• Trustworthiness
“...but starving for knowledge!”
We are interested in the patterns, not the data itself!
Data Mining methods help us to
• discover interesting patterns in large quantities of data
• make decisions based on the patterns
[Figure labels: amount of data that is collected; amount of data that can be looked at by humans]
Definitions of Data Mining
 Definitions
 Data Mining methods
1. detect interesting patterns in large quantities of data
2. support human decision making by providing such patterns
3. predict the outcome of a future observation based on the patterns
"Non-trivial extraction of implicit, previously unknown, and potentially useful information from data."
"Exploration & analysis of large quantities of data in order to discover meaningful patterns."
Origins of Data Mining
▪ Data Mining combines ideas from statistics, machine learning, artificial intelligence, and database systems
▪ It tries to overcome shortcomings of traditional techniques concerning
• large amounts of data
• high dimensionality of data
• the heterogeneous and complex nature of data
• explorative analysis beyond the hypothesize-and-test paradigm
[Figure: diagram relating Data Mining to Machine Learning/AI, Statistics, and Database Systems]
Survey on Data Mining Application Fields
Source: KDnuggets online poll, 435 and 446 participants
https://www.kdnuggets.com/2019/03/poll-analytics-data-science-ml-applied-2018.html
2. Tasks and Applications
 Descriptive Tasks
• Goal: Find patterns in the data.
• Example: Which products are often bought together?
 Predictive Tasks
• Goal: Predict unknown values of a variable, given observations (e.g., from the past)
• Example: Will a person click on an online advertisement, given her browsing history?
 Machine Learning Terminology
• descriptive = unsupervised
• predictive = supervised
Data Mining Tasks
1. Cluster Analysis [Descriptive]
2. Classification [Predictive]
3. Regression [Predictive]
4. Association Analysis [Descriptive]
2.1 Cluster Analysis: Definition
▪ Given a set of data points, each having a set of attributes, and a similarity measure among them, find groups such that
• data points in one group are more similar to one another
• data points in separate groups are less similar to one another
▪ Similarity Measures
• Euclidean distance if attributes are continuous
• other task-specific similarity measures
▪ Goals
1. intra-cluster distances are minimized
2. inter-cluster distances are maximized
▪ Result
• A descriptive grouping of data points
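The two clustering goals can be checked numerically on a toy example. The following is a minimal sketch in plain Python, assuming hypothetical 2-D points that have already been assigned to two clusters; the data and the helper names `mean_intra_cluster_distance` and `mean_inter_cluster_distance` are illustrative and not taken from the lecture.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available from Python 3.8

# Hypothetical data points with continuous attributes, grouped into two clusters.
clusters = {
    "A": [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)],
    "B": [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)],
}

def mean_intra_cluster_distance(points):
    """Average Euclidean distance between all pairs of points in one cluster."""
    pairs = list(combinations(points, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

def mean_inter_cluster_distance(points_a, points_b):
    """Average Euclidean distance between points of two different clusters."""
    total = sum(dist(p, q) for p in points_a for q in points_b)
    return total / (len(points_a) * len(points_b))

intra = {name: mean_intra_cluster_distance(pts) for name, pts in clusters.items()}
inter = mean_inter_cluster_distance(clusters["A"], clusters["B"])

print("intra-cluster:", intra)
print("inter-cluster:", inter)
```

For a good grouping, the intra-cluster values should be small compared with the inter-cluster value, which is the case for this toy data.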
Cluster Analysis: Application 1
 Application area: Market segmentation
 Goal: Find groups of similar customers
• where a group may be conceived
as a marketing target to be reached
with a distinct marketing mix
 Approach:
1. collect information about customers
2. find clusters of similar customers
3. measure the clustering quality by observing buying patterns
after targeting customers with distinct marketing mixes
Cluster Analysis: Application 2
 Application area: Document Clustering
 Goal: Find groups of documents that are similar to each other
based on terms appearing in them
 Approach
1. identify frequently occurring terms in each document
2. form a similarity measure based on the frequencies of
different terms
 Application Example:
Grouping of articles
in Google News
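The document clustering approach can be illustrated with a small similarity computation. This is a minimal sketch in plain Python, assuming three hypothetical one-line documents and whitespace tokenization; cosine similarity over term frequencies is one common choice of similarity measure and is an assumption here, since the slide does not name a specific measure.

```python
from collections import Counter
from math import sqrt

# Hypothetical documents; real news articles would need proper preprocessing.
docs = {
    "d1": "stock market falls as bank shares drop",
    "d2": "bank shares rise and stock market recovers",
    "d3": "football team wins the cup final",
}

def term_frequencies(text):
    """Count how often each term occurs in a document."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    shared_terms = set(tf_a) & set(tf_b)
    dot = sum(tf_a[t] * tf_b[t] for t in shared_terms)
    norm_a = sqrt(sum(c * c for c in tf_a.values()))
    norm_b = sqrt(sum(c * c for c in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

tfs = {name: term_frequencies(text) for name, text in docs.items()}
sim_12 = cosine_similarity(tfs["d1"], tfs["d2"])
sim_13 = cosine_similarity(tfs["d1"], tfs["d3"])

print("d1 vs d2:", sim_12)
print("d1 vs d3:", sim_13)
```

The two finance documents share several terms and therefore score higher than the finance and sports pair, which share none. A clustering method would use such pairwise similarities to group the documents.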
2.2 Classification: Definition
▪ Goal: Previously unseen records should be assigned a class from a given set of classes as accurately as possible.
▪ Approach: given a collection of records (training set)
• each record contains a set of attributes
• one attribute is the class attribute (label) that should be predicted
▪ Find a model for predicting the class attribute as a function of the values of the other attributes
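To make the definition concrete, here is a minimal sketch of a nearest-neighbor classifier in plain Python. Nearest Neighbor is one of the classification methods listed in the lecture contents; the training records, the class labels "A" and "B", and the function name `predict` are hypothetical.

```python
from math import dist  # Euclidean distance, available from Python 3.8

# Hypothetical training set: (attribute values, class label).
training_set = [
    ((1.0, 1.2), "A"),
    ((0.8, 1.0), "A"),
    ((3.9, 4.1), "B"),
    ((4.2, 3.8), "B"),
]

def predict(record):
    """Assign the label of the closest training record (1-nearest neighbor)."""
    _, label = min(training_set, key=lambda item: dist(item[0], record))
    return label

# Previously unseen records are assigned one of the given classes.
print(predict((1.1, 0.9)))
print(predict((4.0, 4.0)))
```

Here the "model" is simply the stored training set plus the distance function; other methods from the lecture, such as decision trees, learn an explicit function of the attribute values instead.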
Classification: Example
▪ Training set: six example objects, three labeled "tree" and three labeled "not a tree" (images not reproduced)
▪ Learned model: "Trees are big, green plants without wheels."
Classification: Workflow
[Figure: classification workflow; the figure highlights the class/label attribute]
Classification: Application 1
 Application area: Fraud Detection
 Goal: Predict fraudulent cases in
credit card transactions.
 Approach:
1. Use credit card transactions and information
about account-holders as attributes
• When and where does a customer buy? What do they buy?
• How often do they pay on time? etc.
2. Label past transactions as fraudulent or fair transactions; this label forms the class attribute
3. Learn a model for the class attribute from the transactions
4. Use this model to detect fraud by observing credit card
transactions on an account
Classification: Application 2
 Application area: Direct Marketing
 Goal: Reduce cost of a mailing campaign
by targeting only the set of consumers
that are likely to buy a new product
 Approach:
1. Use data from a campaign introducing a similar product in the past
• we know which customers decided to buy and which decided otherwise
• this {buy, don’t buy} decision forms the class attribute
2. Collect various demographic, lifestyle, and company-interaction related
information about the customers
• age, profession, location, income, marital status, visits, logins, etc.
3. Use this information to learn a classification model
4. Apply model to decide which consumers to target
2.3 Regression
 Predict a value of a continuous variable
based on the values of other variables,
assuming a linear or nonlinear model of
dependency
 Examples:
• Predicting sales amounts of a new product based
on advertising expenditure
• Predicting the price of a house or car
• Predicting miles per gallon (MPG) of a car
as a function of its weight and horsepower
• Predicting wind velocities as a function of
temperature, humidity, air pressure, etc.
 Difference to classification: The predicted attribute is continuous,
while classification is used to predict nominal attributes (e.g. yes/no)
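As an illustration of the linear case, the sketch below fits a straight line by ordinary least squares in plain Python. The advertising and sales figures are made up for the example, and the names `fit_line` and `predict_sales` are hypothetical.

```python
from statistics import mean

# Hypothetical observations: advertising expenditure and resulting sales.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

def fit_line(xs, ys):
    """Ordinary least squares for a single input variable: y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_bar - slope * x_bar
    return slope, intercept

slope, intercept = fit_line(ad_spend, sales)

def predict_sales(spend):
    """Predict a continuous value for an unseen input."""
    return slope * spend + intercept

print(slope, intercept, predict_sales(6.0))
```

The prediction is a continuous number rather than a class label, which is the difference from classification noted above.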
2.4 Association Analysis: Definition
▪ Given a set of records, each of which contains some number of items from a given collection,
▪ discover frequent itemsets and produce association rules that predict the occurrence of an item based on the occurrences of other items
TID | Items
1 | Bread, Coke, Milk
2 | Beer, Bread
3 | Beer, Coke, Diaper, Milk
4 | Beer, Bread, Diaper, Milk
5 | Coke, Diaper, Milk
Frequent Itemsets
{Diaper, Milk, Beer}
{Milk, Coke}
Association Rules
{Diaper, Milk} --> {Beer}
{Milk} --> {Coke}
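The itemsets and rules above can be checked against the five transactions. The sketch below uses the standard support and confidence measures, which the slide itself does not define; the function names are illustrative.

```python
# The five transactions from the table above.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Share of transactions containing the antecedent that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"Diaper", "Milk", "Beer"}))       # 2 of 5 transactions
print(support({"Milk", "Coke"}))                 # 3 of 5 transactions
print(confidence({"Diaper", "Milk"}, {"Beer"}))  # 2 of the 3 {Diaper, Milk} transactions
print(confidence({"Milk"}, {"Coke"}))            # 3 of the 4 {Milk} transactions
```

Whether an itemset counts as "frequent" depends on a minimum support threshold chosen by the analyst; no threshold is given on the slide.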
Association Rule Discovery: Application 1
▪ Application area: Supermarket shelf management
▪ Goal: Identify items that are bought together by sufficiently many customers
▪ Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items
▪ A classic rule and its implications:
• if a customer buys diapers and milk, then they are likely to buy beer as well
• so, don't be surprised if you find six-packs stacked next to diapers!
• promote diapers to boost beer sales
• if selling diapers is discontinued, this will affect beer sales as well
▪ Application area: Sales Promotion
Association Rule Discovery: Application 2
▪ Application area: Inventory Management
▪ Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts to reduce the number of visits to consumer households
▪ Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns
Which Methods are Used in Practice?
Source: KDnuggets online poll, 833 votes, question: methods used last year for real-world app?
https://www.kdnuggets.com/2019/04/top-data-science-machine-learning-methods-2018-2019.html
3. The Data Mining Process
Source: Fayyad et al. (1996)
3.1 Selection and Exploration
 Selection
• What data is potentially useful for the
task at hand?
• What data is available?
• What do I know about the
quality of the data?
 Exploration / Profiling
• Get an initial understanding of the data
• Calculate basic summarization statistics
• Visualize the data
• Identify data problems such as
outliers, missing values,
duplicate records
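The exploration steps can be sketched on a tiny data set. The example below uses only the Python standard library and assumes hypothetical customer records; the outlier rule (more than five times the median) is an arbitrary rule of thumb chosen for illustration, not a method from the lecture.

```python
from statistics import mean, median, stdev

# Hypothetical records with one missing value, one duplicate, and one suspicious value.
records = [
    {"id": 1, "age": 34, "income": 42000},
    {"id": 2, "age": 29, "income": None},
    {"id": 3, "age": 41, "income": 51000},
    {"id": 3, "age": 41, "income": 51000},
    {"id": 4, "age": 38, "income": 950000},
    {"id": 5, "age": 45, "income": 47000},
]

# Basic summarization statistics over the values that are present.
incomes = [r["income"] for r in records if r["income"] is not None]
summary = {"mean": mean(incomes), "median": median(incomes), "stdev": stdev(incomes)}

# Records with at least one missing attribute value.
missing_ids = [r["id"] for r in records if any(v is None for v in r.values())]

# Exact duplicates: a record whose attribute values were all seen before.
seen, duplicate_ids = set(), []
for r in records:
    key = tuple(sorted(r.items()))
    if key in seen:
        duplicate_ids.append(r["id"])
    seen.add(key)

# Simple outlier flag (assumed rule of thumb): income above five times the median.
outlier_ids = [
    r["id"] for r in records
    if r["income"] is not None and r["income"] > 5 * summary["median"]
]

print(summary, missing_ids, duplicate_ids, outlier_ids)
```

In the exercises, the same checks would typically be done with pandas, which is part of the Anaconda distribution used in the course.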
3.2 Preprocessing and Transformation
 Transform data into a representation that is suitable for the chosen data
mining methods
• scales of attributes (nominal, ordinal, numeric)
• number of dimensions (represent relevant information using fewer attributes)
• amount of data (determines hardware requirements)
 Methods
• discretization and binarization
• feature subset selection / dimensionality reduction
• attribute transformation / text to term vector / embeddings
• aggregation, sampling
• integrate data from multiple sources
 Good data preparation is key to producing valid and reliable models
 Data integration and preparation is estimated to take 70-80% of the time
and effort of a data mining project
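Two of the listed methods, discretization and binarization, can be sketched in a few lines of plain Python. The bin boundaries, labels, and category values below are hypothetical, as are the function names `discretize` and `binarize`.

```python
def discretize(value, boundaries, labels):
    """Map a numeric value to an interval label; needs len(labels) == len(boundaries) + 1."""
    for boundary, label in zip(boundaries, labels):
        if value < boundary:
            return label
    return labels[-1]

def binarize(value, categories):
    """Turn a nominal value into one binary attribute per category (one-hot encoding)."""
    return [1 if value == category else 0 for category in categories]

# Hypothetical age bins: below 30, 30 to below 60, 60 and above.
age_labels = ["young", "middle", "senior"]
print(discretize(25, [30, 60], age_labels))
print(discretize(45, [30, 60], age_labels))
print(discretize(70, [30, 60], age_labels))

# Hypothetical nominal attribute with three possible values.
print(binarize("blue", ["red", "green", "blue"]))
```

Discretization turns a numeric attribute into an ordinal one, while binarization makes a nominal attribute usable by methods that expect numeric input.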
3.3 Data Mining
 Input: Preprocessed Data
 Output: Model / Patterns
1. Apply data mining method
2. Evaluate resulting model / patterns
3. Iterate
• experiment with different parameter settings
• experiment with multiple alternative methods
• improve preprocessing and feature generation
• increase amount or quality of training data
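The apply, evaluate, iterate loop can be sketched with a k-nearest-neighbor classifier and a small held-out validation set. All data points are hypothetical, the choice of method and of accuracy as the evaluation measure are assumptions made for the example, and the names `knn_predict` and `accuracy` are illustrative.

```python
from collections import Counter
from math import dist  # Euclidean distance, available from Python 3.8

# Hypothetical training data; the record at (4.6, 4.6) acts as label noise.
train = [
    ((1.0, 1.0), "A"), ((1.0, 2.0), "A"), ((2.0, 1.0), "A"),
    ((5.0, 5.0), "B"), ((5.0, 6.0), "B"), ((6.0, 5.0), "B"),
    ((4.6, 4.6), "A"),
]
# Hypothetical validation data that was not used for training.
validation = [((4.5, 4.5), "B"), ((1.5, 1.5), "A"), ((5.5, 5.5), "B")]

def knn_predict(record, k):
    """1. Apply the method: majority vote among the k closest training records."""
    neighbours = sorted(train, key=lambda item: dist(item[0], record))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

def accuracy(k):
    """2. Evaluate the resulting model on the validation records."""
    hits = sum(knn_predict(x, k) == y for x, y in validation)
    return hits / len(validation)

# 3. Iterate: experiment with different parameter settings.
results = {k: accuracy(k) for k in (1, 3, 5)}
best_k = max(results, key=results.get)

print(results, best_k)
```

With k = 1 the noisy training record misleads the classifier on one validation record; larger k values outvote it. In a real project the same loop would also cover alternative methods and preprocessing variants.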
3.4 Deployment
 Use model in the business context
 Keep iterating in order to maintain and improve model
CRISP-DM Process Model
How Do Data Scientists Spend Their Days?
Source: CrowdFlower Data Science Report 2016: http://visit.crowdflower.com/data-science-report.html
4. Data Mining Software
Source: KDnuggets online poll, 1800 votes
https://www.kdnuggets.com/2019/05/poll-top-data-science-machine-learning-platforms.html
RapidMiner
 Powerful data mining suite
 Visual modelling of data mining pipelines
 Commercial tool, offering educational licenses
Gartner 2018 Magic Quadrant for Advanced Analytics Platforms
Literature – RapidMiner
1. RapidMiner Documentation
• http://docs.rapidminer.com
• https://academy.rapidminer.com/catalog
2. Vijay Kotu, Bala Deshpande: Predictive Analytics and Data Mining: Concepts and Practice with RapidMiner. Morgan Kaufmann, 2014.
• covers theory and practical aspects using RapidMiner
3. Markus Hofmann, Ralf Klinkenberg: RapidMiner: Data Mining Use Cases and Business Analytics Applications. Chapman & Hall, 2013.
• explains, by means of case studies, how to use simple and advanced RapidMiner features
Python
We use the Anaconda Python distribution
 includes relevant packages, e.g.
• scikit-learn, pandas
• NumPy, Matplotlib
 includes Jupyter as
development environment
Literature – Python
1. Scikit-learn Documentation: https://scikit-learn.org/stable/user_guide.html
2. Aurélien Géron: Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow. 2nd Edition, O'Reilly, 2019
Literature for this Chapter
Pang-Ning Tan, Michael Steinbach, Vipin Kumar:
Introduction to Data Mining. 2nd Edition.
Pearson / Addison Wesley.
Chapter 1: Introduction
Chapter 2: Data

More Related Content

PDF
2 introductory slides
PDF
01datamining.pdf
PDF
DM course outlines.pdf
PPTX
BAS 250 Lecture 1
PPTX
Data mining
PPT
Data Mining- Unit-I PPT (1).ppt
PDF
UNIT2-Data Mining.pdf
PDF
lec1.pdf
2 introductory slides
01datamining.pdf
DM course outlines.pdf
BAS 250 Lecture 1
Data mining
Data Mining- Unit-I PPT (1).ppt
UNIT2-Data Mining.pdf
lec1.pdf

Similar to Introduction_Data_Mining_BasicConcepts.pdf (20)

PPT
Data Mining Xuequn Shang NorthWestern Polytechnical University
PPT
Data mininng trends
PPT
introduction to data minining and unit iii
PPT
Unit 1 (Chapter-1) on data mining concepts.ppt
PDF
Study of Data Mining Methods and its Applications
PPT
Data mining for business intelligence ch04 sharda
PPTX
Data Mining Application and Trends
PPT
DWDMUNIhjkuijhgfdswertyuuyhtgrertyuujhytrertyT1A.ppt
PPT
Data Mining: Concepts and techniques: Chapter 13 trend
DOC
MIS 542 Syllabus 08.doc
PDF
datamining-introduction.pdf
PPT
Chapter1_IntroductionIntroductionIntroduction.ppt
PPTX
Data mining introduction
PPTX
Introduction to-data-mining chapter 1
PDF
Overview of Data Mining
PDF
A Review On Data Mining From Past To The Future
PPT
Introduction of Data Mining - Concept and techniques
PPT
Chapter 1. Introduction.ppt
PPT
Chapter 1. Introduction
PDF
Data Mining and its detail processes with steps
Data Mining Xuequn Shang NorthWestern Polytechnical University
Data mininng trends
introduction to data minining and unit iii
Unit 1 (Chapter-1) on data mining concepts.ppt
Study of Data Mining Methods and its Applications
Data mining for business intelligence ch04 sharda
Data Mining Application and Trends
DWDMUNIhjkuijhgfdswertyuuyhtgrertyuujhytrertyT1A.ppt
Data Mining: Concepts and techniques: Chapter 13 trend
MIS 542 Syllabus 08.doc
datamining-introduction.pdf
Chapter1_IntroductionIntroductionIntroduction.ppt
Data mining introduction
Introduction to-data-mining chapter 1
Overview of Data Mining
A Review On Data Mining From Past To The Future
Introduction of Data Mining - Concept and techniques
Chapter 1. Introduction.ppt
Chapter 1. Introduction
Data Mining and its detail processes with steps
Ad

More from ssuser012286 (9)

PDF
introduction_Machine_Learning_Slides.pdf
PDF
Introduction_to_Machine_Learning_KUMAR.pdf
PDF
MachineLearning_Road to deep learning.pdf
PDF
IntrodProg_FLUXOGRAMAS_IntrodProgramcao.pdf
PDF
Diario daRepublicaPortuguesa_00260037.pdf
PDF
Diario daRepublicaPortuguesa_07170721.pdf
PDF
Diario daRepublicaPortuguesa_09190920.pdf
PDF
Diario daRepublicaPortuguesa_29152915.pdf
PDF
Diario daRepublicaPortuguesa_12321232.pdf
introduction_Machine_Learning_Slides.pdf
Introduction_to_Machine_Learning_KUMAR.pdf
MachineLearning_Road to deep learning.pdf
IntrodProg_FLUXOGRAMAS_IntrodProgramcao.pdf
Diario daRepublicaPortuguesa_00260037.pdf
Diario daRepublicaPortuguesa_07170721.pdf
Diario daRepublicaPortuguesa_09190920.pdf
Diario daRepublicaPortuguesa_29152915.pdf
Diario daRepublicaPortuguesa_12321232.pdf
Ad

Recently uploaded (20)

PPTX
IBA_Chapter_11_Slides_Final_Accessible.pptx
PPTX
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
PPT
Miokarditis (Inflamasi pada Otot Jantung)
PPTX
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
PPT
Reliability_Chapter_ presentation 1221.5784
PPT
ISS -ESG Data flows What is ESG and HowHow
PPT
Quality review (1)_presentation of this 21
PPTX
STUDY DESIGN details- Lt Col Maksud (21).pptx
PPTX
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx
PPTX
1_Introduction to advance data techniques.pptx
PPTX
mbdjdhjjodule 5-1 rhfhhfjtjjhafbrhfnfbbfnb
PPTX
Data_Analytics_and_PowerBI_Presentation.pptx
PDF
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
PDF
Galatica Smart Energy Infrastructure Startup Pitch Deck
PDF
Introduction to the R Programming Language
PPTX
ALIMENTARY AND BILIARY CONDITIONS 3-1.pptx
PDF
Business Analytics and business intelligence.pdf
PPTX
Introduction to machine learning and Linear Models
PDF
Mega Projects Data Mega Projects Data
PDF
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
IBA_Chapter_11_Slides_Final_Accessible.pptx
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
Miokarditis (Inflamasi pada Otot Jantung)
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
Reliability_Chapter_ presentation 1221.5784
ISS -ESG Data flows What is ESG and HowHow
Quality review (1)_presentation of this 21
STUDY DESIGN details- Lt Col Maksud (21).pptx
Microsoft-Fabric-Unifying-Analytics-for-the-Modern-Enterprise Solution.pptx
1_Introduction to advance data techniques.pptx
mbdjdhjjodule 5-1 rhfhhfjtjjhafbrhfnfbbfnb
Data_Analytics_and_PowerBI_Presentation.pptx
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
Galatica Smart Energy Infrastructure Startup Pitch Deck
Introduction to the R Programming Language
ALIMENTARY AND BILIARY CONDITIONS 3-1.pptx
Business Analytics and business intelligence.pdf
Introduction to machine learning and Linear Models
Mega Projects Data Mega Projects Data
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf

Introduction_Data_Mining_BasicConcepts.pdf

  • 1. University of Mannheim – Prof. Bizer: Data Mining Slide 1 Data Mining Introduction to Data Mining
  • 2. University of Mannheim – Prof. Bizer: Data Mining Slide 2 Hallo  Prof. Dr. Christian Bizer  Professor for Information Systems V  Research Interests: • Data and Web Mining • Web Data Integration • Data Web Technologies  Room: B6 - B1.15  eMail: chris@informatik.uni-mannheim.de
  • 3. University of Mannheim – Prof. Bizer: Data Mining Slide 3 Hallo  M. Sc. Wi-Inf. Anna Primpeli  Graduate Research Associate  Research Interests: • Semantic Annotations in Web Pages • Active Learning for Identity Resolution • Product Data Integration  Room: B6, 26, C 1.04  eMail: anna@informatik.uni-mannheim.de  Will teach the RapidMiner exercises and will supervise student projects
  • 4. University of Mannheim – Prof. Bizer: Data Mining Slide 4 Hallo  M. Sc. Wi-Inf. Ralph Peeters  Graduate Research Associate  Research Interests: • Entity Matching using Deep Learning • Product Data Integration  Room: B6, 26, C 1.04  eMail: ralph@informatik.uni-mannheim.de  Will teach the Python exercises and will supervise student projects.
  • 5. University of Mannheim – Prof. Bizer: Data Mining Slide 5 Course Organisation  Lecture • introduces the principle methods of data mining • discusses how to evaluate generated models • presents practical examples of data mining applications from the corporate and Web context  Exercise Groups • students experiment with the methods using RapidMiner or Python  Project Work • teams of six students realize a data mining project • teams may choose their own data sets and tasks (in addition, I will propose some suitable data sets and tasks) • teams write a 10 page summary about their project and present the results  Grading • 75% written exam, 20% project report, 5% presentation of project results
  • 6. University of Mannheim – Prof. Bizer: Data Mining Slide 6 Course Organisation  Course Webpage • provides up-to-date information, lecture slides, and exercise material • https://www.uni-mannheim.de/dws/teaching/course-details/courses-for- master-candidates/ie-500-data-mining/  Solutions to the Exercises • ILIAS eLearning System, https://ilias.uni-mannheim.de/  Time and Location • Lecture: • Wednesday, 10.15 - 11.45, A5, B144 • Exercise: • Thursday, 10.15 - 11.45 Room B6, A104 (RapidMiner, Anna) • Thursday, 12.00 - 13.30, Room B6, A104 (Python, Ralph) • Thursday, 13.45 - 15.15, Room B6, A104 (Python, Ralph)
  • 7. University of Mannheim – Prof. Bizer: Data Mining Slide 7 Lecture Contents 1. Introduction to Data Mining What is Data Mining? Tasks and Applications The Data Mining Process 2. Cluster Analysis K-means Clustering, Density-based Clustering, Hierarchical Clustering, Proximity Measures 3. Classification Nearest Neighbor, Decision Trees, Model Evaluation, Rule Learning, Naïve Bayes, Neural Networks, Support Vector Machines 4. Regression Linear Regression, Nearest Neighbor Regression, Regression Trees, Time Series 5. Association Analysis Frequent Item Set Generation, Rule Generation, Interestingness Measures 6. Text Mining Preprocessing Text, Feature Generation, Feature Selection, RapidMiner Text Extension
  • 8. University of Mannheim – Prof. Bizer: Data Mining Slide 8 Schedule Week Wednesday Thursday 12.02.2020 Introduction to Data Mining Intro to Python (15:30, A5, C 013) Exercise Preprocessing/Visualization 19.02.2020 Lecture Cluster Analysis Exercise Cluster Analysis 26.02.2020 Lecture Classification 1 Exercise Classification 04.03.2020 Lecture Classification 2 Exercise Classification 11.03.2020 Lecture Classification 3 Exercise Classification 18.03.2020 Lecture Regression Exercise Regression 25.03.2020 Lecture Association Analysis Exercise Association Analysis 01.04.2020 Lecture Text Mining Exercise Text Mining 22.04.2020 Group Formation for Student Projects (Attendance obligatory) Preparation of Project Proposal 29.04.2020 Feedback on Project Proposals Project Work 06.05.2020 Feedback on demand Project Work 13.05.2020 Feedback on demand Project Work 20.05.2020 Feedback on demand Project Work 27.05.2020 Presentation of project results Presentation of project results
  • 9. University of Mannheim – Prof. Bizer: Data Mining Slide 9 Deadlines  Submission of project proposal • Sunday, April 26th, 23:59  Submission of final project report • Wednesday, May 20th, 23:59  Project presentations • Wednesday May 27th, Thursday, May 28th • everyone has to attend the presentations
  • 10. University of Mannheim – Prof. Bizer: Data Mining Slide 10 Final Exam  Date and Time: 8th June  Room: tba  Duration: 60 minutes  Structure: 6 open questions that • Goal is to check whether you have understood the lecture content • we try to cover all major chapters of the lecture: clustering, classification, regression, association analysis, text mining • Require you to describe the ideas behind algorithms and methods • often: How do methods react to special pattern in the data? • Might require you to do some simple calculations for which • you need to know the most relevant formulas • you do not need a calculator
  • 11. University of Mannheim – Prof. Bizer: Data Mining Slide 11 Text Book for the Course Pang-Ning Tan, Michael Steinbach, Vipin Kumar: Introduction to Data Mining. 2nd Edition. Pearson / Addison Wesley.
  • 12. University of Mannheim – Prof. Bizer: Data Mining Slide 12 Lecture Videos and Screencasts 1. Video recordings of all lectures from FSS 2015 2. Step-by-step introduction to relevant RapidMiner features 3. Step-by-step solutions of the exercises http://dws.informatik.uni-mannheim.de/ en/teaching/lecture-videos/
  • 13. University of Mannheim – Prof. Bizer: Data Mining Slide 13 Questions?
  • 14. University of Mannheim – Prof. Bizer: Data Mining Slide 14 Outline: Introduction to Data Mining 1. What is Data Mining? 2. Tasks and Applications 3. The Data Mining Process 4. Data Mining Software
  • 15. University of Mannheim – Prof. Bizer: Data Mining Slide 15 1. What is Data Mining?  Large quantities of data are collected about all aspects of our lives  This data contains interesting patterns  Data Mining helps us to 1. discover these patterns and 2. use them for decision making across all areas of society, including  Business and industry  Science and engineering  Medicine and biotech  Government  Individuals
  • 16. University of Mannheim – Prof. Bizer: Data Mining Slide 16 Sloan Digital Sky Survey ≈ 200 GB/day ≈ 73 TB/year Predict • Type of sky object: Star or galaxy? “We are Drowning in Data...”
  • 17. “We are Drowning in Data...” US Library of Congress ≈ 235 TB archived ≈ 40 Wikipedias Discover • Topic distributions • Historic trends* • Citation networks * Lansdall-Welfare, et al.: Content analysis of 150 years of British periodicals. PNAS, 2017.
  • 18. “We are Drowning in Data...” Facebook • 4 petabytes of new data generated every day • over 300 petabytes in Facebook’s data warehouse Predict • Interests and behavior of over one billion people Sources: https://www.brandwatch.com/blog/facebook-statistics/ http://www.technologyreview.com/featuredstory/428150/what-facebook-knows/
  • 19. “We are Drowning in Data...” Predict • Interests and behavior of mankind
  • 20. “We are Drowning in Data...” Law enforcement agencies collect unknown amounts of data from various sources • Cell phone calls • Location data • Web browsing behavior • Credit card transactions • Online profiles (Facebook) • … Predict • Terrorist or not? • Trustworthiness
  • 21. “...but starving for knowledge!” We are interested in the patterns, not the data itself! Data Mining methods help us to • discover interesting patterns in large quantities of data • take decisions based on the patterns [Figure labels: amount of data that is collected; amount of data that can be looked at by humans]
  • 22. Definitions of Data Mining ▪ Definitions • Non-trivial extraction of implicit, previously unknown, and potentially useful information from data. • Exploration and analysis of large quantities of data in order to discover meaningful patterns. ▪ Data Mining methods 1. detect interesting patterns in large quantities of data 2. support human decision making by providing such patterns 3. predict the outcome of a future observation based on the patterns
  • 23. Origins of Data Mining ▪ Data Mining combines ideas from statistics, machine learning, artificial intelligence, and database systems ▪ Tries to overcome shortcomings of traditional techniques concerning • large amounts of data • high dimensionality of data • heterogeneous and complex nature of data • explorative analysis beyond the hypothesize-and-test paradigm [Figure labels: Machine Learning, AI; Statistics; Data Mining; Database Systems]
  • 24. Survey on Data Mining Application Fields Source: KDnuggets online poll, 435 and 446 participants https://www.kdnuggets.com/2019/03/poll-analytics-data-science-ml-applied-2018.html
  • 25. 2. Tasks and Applications ▪ Descriptive Tasks • Goal: Find patterns in the data. • Example: Which products are often bought together? ▪ Predictive Tasks • Goal: Predict unknown values of a variable, given observations (e.g., from the past) • Example: Will a person click an online advertisement, given her browsing history? ▪ Machine Learning Terminology • descriptive = unsupervised • predictive = supervised
  • 26. Data Mining Tasks 1. Cluster Analysis [Descriptive] 2. Classification [Predictive] 3. Regression [Predictive] 4. Association Analysis [Descriptive]
  • 27. 2.1 Cluster Analysis: Definition ▪ Given a set of data points, each having a set of attributes, and a similarity measure among them, find groups such that • data points in one group are more similar to one another • data points in separate groups are less similar to one another ▪ Similarity Measures • Euclidean distance if attributes are continuous • other task-specific similarity measures ▪ Goals 1. intra-cluster distances are minimized 2. inter-cluster distances are maximized ▪ Result • A descriptive grouping of data points
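The definition above can be illustrated with a minimal sketch. It assumes scikit-learn's k-means as the clustering method (the slide does not name one) and uses made-up 2-D points, so it is only one possible instance of the idea, based on Euclidean distance.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D data points forming two visually separate groups.
points = np.array([
    [1.0, 1.1], [1.2, 0.9], [0.8, 1.0],   # group near (1, 1)
    [5.0, 5.2], [5.1, 4.9], [4.9, 5.0],   # group near (5, 5)
])

# k-means assigns each point to the nearest of k centroids (Euclidean
# distance) and moves the centroids to reduce intra-cluster distances.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

# inertia_ is the sum of squared distances of points to their centroid.
print("cluster labels:", labels)
print("intra-cluster sum of squared distances:", kmeans.inertia_)
```

The number of clusters (here 2) is an input chosen by the analyst, not something the method discovers on its own.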
  • 28. Cluster Analysis: Application 1 ▪ Application area: Market segmentation ▪ Goal: Find groups of similar customers • where a group may be conceived as a marketing target to be reached with a distinct marketing mix ▪ Approach: 1. collect information about customers 2. find clusters of similar customers 3. measure the clustering quality by observing buying patterns after targeting customers with distinct marketing mixes
  • 29. Cluster Analysis: Application 2 ▪ Application area: Document Clustering ▪ Goal: Find groups of documents that are similar to each other based on terms appearing in them ▪ Approach 1. identify frequently occurring terms in each document 2. form a similarity measure based on the frequencies of different terms ▪ Application Example: Grouping of articles in Google News
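The two steps of this approach can be sketched in plain Python. The example assumes cosine similarity over raw term counts as the similarity measure (the slide leaves the measure open) and uses three invented headlines. A realistic pipeline would also remove stop words and weight terms, which is omitted here.

```python
import math
import re
from collections import Counter

def term_frequencies(text):
    """Step 1: count how often each lowercase word occurs in a document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(tf_a, tf_b):
    """Step 2: cosine of the angle between two term-frequency vectors."""
    shared = set(tf_a) & set(tf_b)
    dot = sum(tf_a[t] * tf_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical documents: two about finance, one about sports.
docs = [
    "The central bank raised interest rates again",
    "Interest rates rise as the bank fights inflation",
    "The football team won the cup final",
]
vectors = [term_frequencies(d) for d in docs]

sim_finance = cosine_similarity(vectors[0], vectors[1])
sim_mixed = cosine_similarity(vectors[0], vectors[2])
print("finance vs finance:", round(sim_finance, 3))
print("finance vs sports: ", round(sim_mixed, 3))
```

A clustering method can then group documents whose pairwise similarity is high.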
  • 30. 2.2 Classification: Definition ▪ Goal: Previously unseen records should be assigned a class from a given set of classes as accurately as possible. ▪ Approach: ▪ Given a collection of records (training set) • each record contains a set of attributes • one attribute is the class attribute (label) that should be predicted ▪ Find a model for predicting the class attribute as a function of the values of other attributes
  • 31. Classification: Example ▪ Training set: [images labeled] "tree" "tree" "tree" "not a tree" "not a tree" "not a tree" ▪ Learned model: "Trees are big, green plants without wheels."
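The toy example above can be sketched with a decision tree from scikit-learn. The three binary attributes (is_big, is_green, has_wheels) and the six training records are hypothetical encodings chosen to mirror the slide's learned model. The original training images are not available, so this is an illustration rather than the lecture's actual data.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical encoding: [is_big, is_green, has_wheels]
feature_names = ["is_big", "is_green", "has_wheels"]
X_train = [
    [1, 1, 0],  # tree
    [1, 1, 0],  # tree
    [1, 1, 0],  # tree
    [1, 1, 1],  # big green vehicle  -> not a tree
    [0, 1, 0],  # small green plant  -> not a tree
    [1, 0, 0],  # big grey building  -> not a tree
]
y_train = [1, 1, 1, 0, 0, 0]  # class attribute: 1 = "tree", 0 = "not a tree"

# Learn a model that predicts the class attribute from the other attributes.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Inspect the learned rules and classify a previously unseen record.
print(export_text(model, feature_names=feature_names))
unseen = [[0, 0, 1]]  # small, not green, has wheels
print("prediction for unseen record:", model.predict(unseen)[0])
```

The printed rules correspond to "big, green, and without wheels" being required for the class "tree".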
  • 32. Classification: Workflow [Figure label: Class/Label Attribute]
  • 33. Classification: Application 1 ▪ Application area: Fraud Detection ▪ Goal: Predict fraudulent cases in credit card transactions. ▪ Approach: 1. Use credit card transactions and information about account holders as attributes • When and where does a customer buy? What does he buy? • How often does he pay on time? etc. 2. Label past transactions as fraud or fair transactions; this forms the class attribute 3. Learn a model for the class attribute from the transactions 4. Use this model to detect fraud by observing credit card transactions on an account
  • 34. Classification: Application 2 ▪ Application area: Direct Marketing ▪ Goal: Reduce cost of a mailing campaign by targeting only the set of consumers that are likely to buy a new product ▪ Approach: 1. Use data from a campaign introducing a similar product in the past • we know which customers decided to buy and which decided otherwise • this {buy, don’t buy} decision forms the class attribute 2. Collect various demographic, lifestyle, and company-interaction related information about the customers • age, profession, location, income, marital status, visits, logins, etc. 3. Use this information to learn a classification model 4. Apply the model to decide which consumers to target
  • 35. 2.3 Regression ▪ Predict a value of a continuous variable based on the values of other variables, assuming a linear or nonlinear model of dependency ▪ Examples: • Predicting sales amounts of a new product based on advertising expenditure • Predicting the price of a house or car • Predicting miles per gallon (MPG) of a car as a function of its weight and horsepower • Predicting wind velocities as a function of temperature, humidity, air pressure, etc. ▪ Difference to classification: The predicted attribute is continuous, while classification is used to predict nominal attributes (e.g. yes/no)
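A minimal sketch of the first example, assuming a linear model of dependency and scikit-learn's LinearRegression. The advertising and sales figures are invented and lie exactly on a straight line so that the fitted coefficients are easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical past campaigns: advertising expenditure and resulting sales.
ad_spend = np.array([[10.0], [20.0], [30.0], [40.0]])   # one attribute
sales = np.array([80.0, 110.0, 140.0, 170.0])           # continuous target

# Fit sales = slope * ad_spend + intercept by least squares.
model = LinearRegression()
model.fit(ad_spend, sales)

slope = model.coef_[0]
intercept = model.intercept_
predicted_sales = model.predict(np.array([[50.0]]))[0]

print(f"sales = {slope:.2f} * ad_spend + {intercept:.2f}")
print("predicted sales for ad_spend = 50:", predicted_sales)
```

Unlike the classification case, the predicted value is a number on a continuous scale rather than one of a fixed set of classes.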
  • 36. 2.4 Association Analysis: Definition ▪ Given a set of records, each of which contains some number of items from a given collection, ▪ discover frequent itemsets and produce association rules that predict the occurrence of an item based on the occurrences of other items. Transactions (TID: Items): 1: Bread, Coke, Milk; 2: Beer, Bread; 3: Beer, Coke, Diaper, Milk; 4: Beer, Bread, Diaper, Milk; 5: Coke, Diaper, Milk. Frequent Itemsets: {Diaper, Milk, Beer}, {Milk, Coke}. Association Rules: {Diaper, Milk} --> {Beer}, {Milk} --> {Coke}.
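The five transactions above are enough to check the listed itemsets and rules by hand. The sketch below assumes the standard support and confidence measures, which the slide does not spell out: support is the fraction of transactions containing an itemset, and confidence of X --> Y is support(X and Y together) divided by support(X).

```python
# The five transactions from the slide, keyed by position (TID 1..5).
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """How often the consequent occurs in transactions with the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

for itemset in ({"Diaper", "Milk", "Beer"}, {"Milk", "Coke"}):
    print(sorted(itemset), "support =", support(itemset))

for lhs, rhs in (({"Diaper", "Milk"}, {"Beer"}), ({"Milk"}, {"Coke"})):
    print(sorted(lhs), "-->", sorted(rhs),
          "confidence =", round(confidence(lhs, rhs), 3))
```

Whether an itemset counts as frequent depends on a minimum support threshold chosen by the analyst.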
  • 37. Association Rule Discovery: Applications 1 ▪ Application area: Supermarket shelf management. ▪ Goal: To identify items that are bought together by sufficiently many customers ▪ Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items ▪ A classic rule and its implications: • if a customer buys diapers and milk, then he is likely to buy beer as well • so, don’t be surprised if you find six-packs stacked next to diapers! • promote diapers to boost beer sales • if selling diapers is discontinued, this will affect beer sales as well ▪ Application area: Sales Promotion
  • 38. Association Rule Discovery: Application 2 ▪ Application area: Inventory Management ▪ Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts to reduce the number of visits to consumer households ▪ Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns
  • 39. Which Methods are Used in Practice? Source: KDnuggets online poll, 833 votes, question: methods used last year for real-world app? https://www.kdnuggets.com/2019/04/top-data-science-machine-learning-methods-2018-2019.html
  • 40. 3. The Data Mining Process Source: Fayyad et al. (1996)
  • 41. 3.1 Selection and Exploration ▪ Selection • What data is potentially useful for the task at hand? • What data is available? • What do I know about the quality of the data? ▪ Exploration / Profiling • Get an initial understanding of the data • Calculate basic summarization statistics • Visualize the data • Identify data problems such as outliers, missing values, duplicate records
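The exploration steps above can be sketched with pandas. The customer table is invented, and the 1.5 x IQR rule used to flag outliers is one common heuristic assumed here, not a method named on the slide.

```python
import pandas as pd

# Hypothetical customer data with a missing value, a duplicate, and an outlier.
customers = pd.DataFrame({
    "age": [25, 32, 47, 32, None, 51],
    "income": [32000, 45000, 52000, 45000, 39000, 990000],
})

# Basic summarization statistics (count, mean, std, quartiles, min, max).
summary = customers.describe()

# Data problems: missing values per column and fully duplicated records.
missing_per_column = customers.isna().sum()
duplicate_rows = customers.duplicated().sum()

# Flag income outliers with the 1.5 * IQR rule (an assumed heuristic).
q1, q3 = customers["income"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (customers["income"] < q1 - 1.5 * iqr) | (customers["income"] > q3 + 1.5 * iqr)
outliers = customers[is_outlier]

print(summary)
print("missing values:\n", missing_per_column)
print("duplicate rows:", duplicate_rows)
print("income outliers:\n", outliers)
```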
  • 42. 3.2 Preprocessing and Transformation ▪ Transform data into a representation that is suitable for the chosen data mining methods • scales of attributes (nominal, ordinal, numeric) • number of dimensions (represent relevant information using fewer attributes) • amount of data (determines hardware requirements) ▪ Methods • discretization and binarization • feature subset selection / dimensionality reduction • attribute transformation / text to term vector / embeddings • aggregation, sampling • integrate data from multiple sources ▪ Good data preparation is key to producing valid and reliable models ▪ Data integration and preparation are estimated to take 70-80% of the time and effort of a data mining project
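Two of the listed methods, discretization and binarization, can be sketched with pandas. The column names, bin edges, and group labels below are hypothetical choices made only for illustration.

```python
import pandas as pd

# Hypothetical records with one numeric and one nominal attribute.
people = pd.DataFrame({
    "age": [18, 35, 70],
    "color": ["red", "blue", "red"],
})

# Discretization: map the numeric attribute onto ordinal intervals.
# Bins are right-inclusive: (0, 30], (30, 60], (60, 120].
people["age_group"] = pd.cut(
    people["age"],
    bins=[0, 30, 60, 120],
    labels=["young", "middle", "senior"],
)

# Binarization: one binary column per value of the nominal attribute.
one_hot = pd.get_dummies(people["color"], prefix="color")

print(people)
print(one_hot)
```

Which representation is suitable depends on the chosen data mining method, for example whether it can handle nominal attributes directly.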
  • 43. 3.3 Data Mining ▪ Input: Preprocessed Data ▪ Output: Model / Patterns 1. Apply data mining method 2. Evaluate resulting model / patterns 3. Iterate • experiment with different parameter settings • experiment with multiple alternative methods • improve preprocessing and feature generation • increase amount or quality of training data
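The apply, evaluate, iterate loop above can be sketched with scikit-learn's GridSearchCV, which tries several parameter settings and evaluates each with cross-validation. The Iris dataset, the decision tree, and the max_depth grid are assumptions picked for a small runnable example and are not taken from the lecture.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Input: preprocessed data (here the small Iris dataset bundled with scikit-learn).
X, y = load_iris(return_X_y=True)

# Iterate: experiment with different parameter settings of one method.
param_grid = {"max_depth": [1, 2, 3, 5]}

# Apply and evaluate: fit a tree per setting, score it with 5-fold cross-validation.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print("best parameter setting:", search.best_params_)
print("mean cross-validated accuracy:", round(search.best_score_, 3))
```

The same loop can be repeated with alternative methods or with improved features, comparing the cross-validated scores.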
  • 44. 3.4 Deployment ▪ Use model in the business context ▪ Keep iterating in order to maintain and improve model [Figure: CRISP-DM Process Model]
  • 45. How Do Data Scientists Spend Their Days? Source: CrowdFlower Data Science Report 2016: http://visit.crowdflower.com/data-science-report.html
  • 46. 4. Data Mining Software Source: KDnuggets online poll, 1800 votes https://www.kdnuggets.com/2019/05/poll-top-data-science-machine-learning-platforms.html
  • 47. RapidMiner ▪ Powerful data mining suite ▪ Visual modelling of data mining pipelines ▪ Commercial tool, offering educational licenses
  • 48. Gartner 2018 Magic Quadrant for Advanced Analytics Platforms
  • 49. Literature – RapidMiner 1. RapidMiner – Documentation • http://docs.rapidminer.com • https://academy.rapidminer.com/catalog 2. Vijay Kotu, Bala Deshpande: Predictive Analytics and Data Mining: Concepts and Practice with RapidMiner. Morgan Kaufmann, 2014. • covers theory and practical aspects using RapidMiner 3. Markus Hofmann, Ralf Klinkenberg: RapidMiner: Data Mining Use Cases and Business Analytics Applications. Chapman & Hall, 2013. • uses case studies to explain how to use simple and advanced RapidMiner features
  • 50. Python We use the Anaconda Python distribution ▪ includes relevant packages, e.g. • scikit-learn, pandas • NumPy, Matplotlib ▪ includes Jupyter as development environment
  • 51. Literature – Python 1. Scikit-learn Documentation: https://scikit-learn.org/stable/user_guide.html 2. Aurélien Géron: Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow. 2nd Edition, O’Reilly, 2019
  • 52. Literature for this Chapter Pang-Ning Tan, Michael Steinbach, Vipin Kumar: Introduction to Data Mining. 2nd Edition. Pearson / Addison Wesley. Chapter 1: Introduction Chapter 2: Data