Data mining
Motivation: Data Flood
• Data explosion problem
– Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories.
– We are drowning in data, but starving for knowledge!
• Solution: data warehousing and data mining
– Data warehousing and on-line analytical processing
– Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
Data mining (knowledge mining from data) is an area of research and practice focused on discovering novel patterns in data using algorithms and computers. It is good at finding the hidden patterns of a dataset by analyzing correlations among attribute values.
Today we have software that
can search through massive
data haystacks looking for lots
of interesting and usable
needles.
Data Mining Tasks
• Classification
• Regression
• Segmentation
• Association Analysis
• Anomaly detection
• Sequence Analysis
• Time-series Analysis
• Text categorization
• Advanced insights discovery
• Others
Data Mining Problems
• What other products are purchased together with a digital camera?
– Based on previous purchases (shopping cart)
– E.g., if a digital camera is purchased, flash memory, a battery, and a printer are also purchased.
→ Association Analysis
• Similar questions:
– What products should be recommended in on-line stores such as Amazon.com, movie rental, wireless themes, etc.?
– What items should be displayed together in a store?
– What genes appear together in toxic mushrooms?
Data Mining Problems (cont.)
• Is this student going to go to college?
– Based on Gender, ParentIncome, ParentEncouragement, IQ, etc.
– E.g., if ParentEncouragement=Yes and IQ>100, then College=Yes
→ Classification (prediction)
• Similar questions:
– Is this a spam email? (spam filtering)
– How good/bad is your credit? (credit scoring)
– Recognition of hand-written letters (pen recognition)
– What is this gene like? (bioinformatics)
– Does this person behave like a terrorist?
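The example rule on this slide (ParentEncouragement=Yes and IQ>100 implies College=Yes) can be written as a tiny R function. This is only an illustrative sketch: the function name predict_college and the sample data are hypothetical, and a real classifier would learn such a rule from training data rather than have it hard-coded.

```r
# Hypothetical hand-written classifier mirroring the slide's example rule:
# if ParentEncouragement = Yes and IQ > 100, then College = Yes.
predict_college <- function(parent_encouragement, iq) {
  ifelse(parent_encouragement == "Yes" & iq > 100, "Yes", "No")
}

# Made-up students, for illustration only
students <- data.frame(
  ParentEncouragement = c("Yes", "Yes", "No"),
  IQ = c(115, 95, 120)
)

# Apply the rule to every row and store the predicted class
students$College <- predict_college(students$ParentEncouragement, students$IQ)
print(students)
```

Only the first student satisfies both conditions, so only that row is labeled "Yes".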
Data Mining Problems (cont.)
• What is the age of a person?
– Based on Hobby, MaritalStatus, NumberOfChildren, Income,
HouseOwnership, NumberOfCars, …
– E.g., if MaritalStatus=Yes, Age = 20 + 4*NumberOfChildren + 0.0001*Income + …
→ Regression (prediction)
• Similar questions:
– What’s the sales amount of ice cream next month? (sales prediction)
– What’s the stock price of A next week? (stock prediction)
– What’s the income of a customer? (marketing)
– What’s the life-time of a software bug? (bug tracking)
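To make the regression idea concrete, the sketch below generates synthetic data from the two terms that are visible in the age formula on this slide and then lets lm() recover the coefficients. The data frame is made up, the terms that the slide elides are not reproduced, and the MaritalStatus condition is ignored for simplicity.

```r
# Synthetic data built from only the two terms visible on the slide
# (Age = 20 + 4*NumberOfChildren + 0.0001*Income); all values are made up.
people <- data.frame(
  NumberOfChildren = c(0, 1, 2, 3, 1, 0),
  Income = c(30000, 50000, 80000, 60000, 40000, 100000)
)
people$Age <- 20 + 4 * people$NumberOfChildren + 0.0001 * people$Income

# Fit a linear regression; because the data contain no noise, the fitted
# coefficients match the ones used to generate Age.
fit <- lm(Age ~ NumberOfChildren + Income, data = people)
print(round(coef(fit), 4))

# Predict the age of a new (hypothetical) person
new_person <- data.frame(NumberOfChildren = 2, Income = 50000)
predicted_age <- predict(fit, newdata = new_person)
print(predicted_age)
```

With real data the points would not lie exactly on the fitted plane, and the coefficients would be estimates rather than exact values.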
Data Mining Problems (cont.)
• Who are my Web visitors?
– Identify similar groups based on demographics and visiting patterns
– E.g., daily news readers, email users, shoppers, short-stayers, etc.
→ Segmentation (clustering)
• Similar questions:
– Identify groups of genes (bioinformatics)
– Identify groups of locations of Cholera incidents in London (spatial
data mining)
– Identify groups of customers of merchants (Amazon, eBay, MSN, Walmart, etc.) (target marketing)
– Identify groups of documents. (text categorization)
Data Mining Problems (cont.)
• Could this network packet be from a virus attack?
– Predict the likelihood of the network packet pattern
→ Anomaly detection (outlier detection)
• Similar questions:
– Are the hospital lab results normal? (adverse drug effect detection)
– Is this credit transaction fraudulent? (fraud detection)
– Does this person behave unusually, perhaps warranting a high-level security check?
Data mining and machine learning
• Machine learning focuses on creating computer algorithms
that can use pre-existing inputs to refine and improve their
own capabilities for dealing with future inputs.
• Machine learning is not exactly the same thing as data mining
and vice versa. Not all data mining techniques rely on what
researchers would consider machine learning.
• Machine learning is used in areas like robotics that we don’t commonly think of when we are thinking of data mining as such.
• Data mining is an area that has taken much of its inspiration
and techniques from machine learning (and some, also, from
statistics), but is put to different ends.
Data mining as a step in the process of knowledge discovery.
• 1. Data cleaning (to remove noise and inconsistent data).
• 2. Data integration (where multiple data sources may be
combined).
• 3. Data selection (where data relevant to the analysis task
are retrieved from the database).
• 4. Data transformation (where data are transformed or
consolidated into forms appropriate for mining by
performing summary or aggregation operations, for
instance).
• 5. Data mining (an essential process where intelligent
methods are applied in order to extract data patterns)
• 6. Pattern evaluation (to identify the truly interesting
patterns representing knowledge based on some
interestingness measures)
• 7. Knowledge presentation (where visualization and
knowledge representation techniques are used to present
the mined knowledge to the user)
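The seven steps above can be sketched end to end in a few lines of base R. Everything here is an assumption made for illustration: the two data frames, their column names, the deliberately trivial "pattern" (mean amount per region), and the threshold of 100 are all invented and stand in for real sources and real mining methods.

```r
# Two hypothetical data sources with made-up values
sales  <- data.frame(id = c(1, 2, 3, 4, 5),
                     amount = c(10, NA, 30, 40, 1000))
stores <- data.frame(id = c(1, 2, 3, 4, 5),
                     region = c("N", "N", "S", "S", "S"))

# 1. Data cleaning: drop rows with missing values
sales_clean <- sales[!is.na(sales$amount), ]
# 2. Data integration: combine the two sources on their shared key
combined <- merge(sales_clean, stores, by = "id")
# 3. Data selection: keep only the columns relevant to the task
selected <- combined[, c("region", "amount")]
# 4. Data transformation: aggregate to one summary row per region
by_region <- aggregate(amount ~ region, data = selected, FUN = mean)
# 5. Data mining: a deliberately trivial pattern, regions ranked by mean amount
pattern <- by_region[order(-by_region$amount), ]
# 6. Pattern evaluation: keep patterns above an arbitrary interestingness threshold
interesting <- pattern[pattern$amount > 100, ]
# 7. Knowledge presentation: show the result to the user
print(interesting)
```

The point of the sketch is the ordering of the steps, not the mining method; in practice step 5 would be a technique such as classification, clustering, or association analysis.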
According to this view, data mining is only one step in the entire process.
We agree that data mining is a step in the knowledge discovery process. However, in industry, in media, and in the database research milieu, the term data mining is becoming more popular than the longer term knowledge discovery from data.
Data mining
• Database, data warehouse, World Wide Web, or other information repository: This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.
• Database or data warehouse server: The database or data warehouse server is responsible for fetching the relevant data, based on the user’s data mining request.
• Knowledge base: This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies.
• Data mining engine: This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
• Pattern evaluation module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns.
• User interface: This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task.
Data mining typically consists of four processes:
1) Data preparation.
2) Exploratory data analysis.
3) Model development.
4) Interpretation of results.
• Step 1, data preparation, involves making sure that the data are organized in the right way, that missing data fields are filled in, that inaccurate data are located and repaired or deleted, and that data are "recoded" as necessary to make them amenable to the kind of analysis we have in mind.
• Step 2, exploratory data analysis, means getting to know the data using histograms and other visualization tools, and looking for preliminary hints that will guide our model choice. The exploration process also involves figuring out the right values for key parameters.
• Step 3, choosing and developing a model, is by far the most complex and most interesting of the activities of a data miner. It is here that you test out a selection of the most appropriate data mining techniques. Depending upon the structure of a dataset, there may be dozens of options, and choosing the most promising one has as much art in it as science.
• Step 4, the interpretation of results, focuses on making sense of what the data mining algorithm has produced. This is the most important step from the perspective of the data user, because this is where an actionable conclusion is formed.
"association rules mining"
Confidence: how frequently a particular pair occurs among all the times when the first item is present.
Support: the proportion of times that a particular pairing occurs across all shopping carts.
We can also evaluate a long list of these rules using a value called lift.
Lift: takes into account the support for a rule, but also gives more weight to rules where the left-hand side (LHS) and/or the right-hand side (RHS) occur less frequently. In other words, lift favors situations where the LHS and RHS are not abundant but where the relatively few occurrences always happen together. The larger the value of lift, the more "interesting" the rule may be.
We can get started with association rules mining very easily using the R package known as "arules" and the Groceries data set, which is ready to be analyzed. So we are skipping right to Step 2 of our four-step process, exploratory data analysis. Install and load the package with the following commands:
> install.packages("arules")
> library("arules")
You can make the Groceries data set ready with this command:
> data(Groceries)
Run the summary() function on Groceries so that we can see what is in there:
> summary(Groceries)
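Before reading the summary output, it may help to see the support, confidence, and lift measures defined earlier computed by hand. The sketch below uses base R only and five made-up shopping carts; the helper name support is hypothetical, and lift is computed with the standard definition (confidence divided by the support of the RHS), which is an assumption about what the slides intend.

```r
# Five made-up shopping carts (hypothetical data)
carts <- list(
  c("milk", "butter", "bread"),
  c("milk", "butter"),
  c("milk", "bread"),
  c("bread", "jam"),
  c("milk")
)

# Proportion of carts that contain every item in `items`
support <- function(items) {
  mean(sapply(carts, function(cart) all(items %in% cart)))
}

# Rule under study: if butter (LHS) then milk (RHS)
lhs <- "butter"
rhs <- "milk"

rule_support    <- support(c(lhs, rhs))            # 2 of 5 carts = 0.4
rule_confidence <- rule_support / support(lhs)     # 0.4 / 0.4 = 1
rule_lift       <- rule_confidence / support(rhs)  # 1 / 0.8 = 1.25

print(c(support = rule_support,
        confidence = rule_confidence,
        lift = rule_lift))
```

A lift above 1 means butter and milk appear together more often than they would if the two items were independent.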
Notes
• Groceries is an itemMatrix object in sparse format. It has a rectangular data structure with 9835 rows and 169 columns. It is called "sparse" because very few of these items exist in any given grocery basket. When an item appears in a basket, its cell contains a one; if an item is not in a basket, its cell contains a zero.
• Every cart has at least one item. The output also shows us which items occur in grocery baskets most frequently.
• Any non-zero amount of whole milk is represented by a one. Other data mining techniques could take advantage of knowing the exact amount of a product, but association rules mining does not need to know that amount.
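The zero/one layout described in these notes can be reproduced on a toy scale in base R. The three carts below are made up, and the result is an ordinary dense matrix; arules stores the same information in a sparse format, so this sketch only illustrates the layout, not the package's internal representation.

```r
# Three made-up carts (hypothetical data)
carts <- list(
  c("whole milk", "yogurt"),
  c("whole milk"),
  c("rolls", "yogurt", "soda")
)

# Every distinct item becomes one column
items <- sort(unique(unlist(carts)))

# One row per cart, one column per item; 1 if the item is in the cart, else 0.
# Quantities are ignored: any amount of an item is recorded as a single 1.
incidence <- t(sapply(carts, function(cart) as.integer(items %in% cart)))
colnames(incidence) <- items
print(incidence)

# Fraction of cells that are non-zero; for real basket data this is small,
# which is why a sparse representation pays off
density <- mean(incidence)
print(density)
```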
• The item "yogurt" appeared in 1372 out of 9835 rows, or about 14% of cases. So we can set the support parameter to somewhere around 10%-15% in order to get a manageable number of items.
• An item that occurs only very rarely in the grocery baskets is unlikely to be of much use to us in terms of creating meaningful rules. We want to focus our attention on items that occur with some meaningful frequency in the dataset.
> itemFrequencyPlot(Groceries, support=0.1)
[Figure: bar graph of the relative frequencies of items with support of at least 0.1]
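The arithmetic behind the support threshold can be checked directly. In the sketch below, the yogurt figures are the ones quoted above, while the item_freq vector is made up to show how a 10% cutoff discards rare items; it is not taken from the Groceries data.

```r
# Figures quoted on the slide: yogurt appears in 1372 of 9835 baskets
yogurt_support <- 1372 / 9835
print(round(yogurt_support, 3))

# Hypothetical relative item frequencies for a toy data set (made-up numbers)
item_freq <- c(milk = 0.26, yogurt = 0.14, soda = 0.17, caviar = 0.002)

# Keep only the items whose relative frequency reaches the support threshold
min_support <- 0.1
frequent <- item_freq[item_freq >= min_support]
print(sort(frequent, decreasing = TRUE))
```

Lowering min_support admits more items into the plot and, later, into rule generation; raising it keeps the output manageable at the cost of ignoring rarer items.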
The term "apriori" refers to the specific algorithm that R will use to scan the data set for appropriate rules. The Apriori algorithm is used for finding rules in transaction data.
• Rules are in the form "if LHS then RHS." Each rule states that when the thing or things on the left-hand side (LHS) occur, the thing on the right-hand side (RHS) occurs a certain percentage of the time.
• For example, if Milk and Butter occur together in 10% of the grocery carts (that is the "support"), and Milk (by itself, ignoring Butter) occurs in 25% of the carts, then the confidence of the Milk/Butter rule is 0.10/0.25 = 0.40.
> apriori(Groceries, parameter=list(support=0.005,
+ confidence=0.5))
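The core idea behind the Apriori algorithm is a level-wise search: an item set can only be frequent if all of its members are frequent, so candidates at each level are built from the survivors of the previous one. The base R sketch below shows just the first two levels on five made-up carts. It is a simplified illustration under those assumptions, not the implementation used by arules, and the names support, frequent_items, and frequent_pairs are hypothetical.

```r
# Five made-up shopping carts (hypothetical data)
carts <- list(
  c("milk", "butter", "bread"),
  c("milk", "butter"),
  c("milk", "bread"),
  c("bread", "jam"),
  c("milk")
)
min_support <- 0.3

# Proportion of carts that contain every item in `items`
support <- function(items) {
  mean(sapply(carts, function(cart) all(items %in% cart)))
}

# Level 1: keep the single items that reach the minimum support
items <- sort(unique(unlist(carts)))
frequent_items <- items[sapply(items, function(i) support(i) >= min_support)]

# Level 2: candidate pairs are built only from frequent single items,
# because a pair cannot be frequent if one of its members is not
pairs <- combn(frequent_items, 2, simplify = FALSE)
frequent_pairs <- Filter(function(p) support(p) >= min_support, pairs)

print(frequent_items)
print(frequent_pairs)

# Confidence of the rule "if butter then milk", from the same supports
conf_butter_milk <- support(c("butter", "milk")) / support("butter")
print(conf_butter_milk)
```

Here "jam" is pruned at level 1, so no pair containing it is ever counted; that pruning is what keeps the search tractable on data with many items.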
• The "minlen" and "maxlen" parameters also have sensible defaults: these refer to the minimum and maximum number of items in an item set that will be considered when generating rules.
• Obviously you can’t generate a rule unless you have at least one item in an item set.
We will examine ways of making sense of a large number of rules, but for now let’s agree that 15 is too many rules to examine. We will store the resulting rules in a data structure called ruleset:
> ruleset <- apriori(Groceries,
+ parameter=list(support=0.01, confidence=0.5))
The inspect() command
Notes
• Rules 7 and 8 have the highest level of lift: the fruits and vegetables involved in these two rules have a relatively low frequency of occurrence, but their support and confidence are both relatively high.
• Contrast these two rules with Rule 1, which also has high confidence but has low lift. The reason for this is that milk is a frequently occurring item, so there is not much novelty to that rule. On the other hand, the combination of fruits, root vegetables, and other vegetables suggests a need to find out more about customers whose carts may contain only vegetarian or vegan items.
• To get better insights, we can use a data visualization package to help explore this possibility.
• The R package called arulesViz has methods of visualizing the rule sets generated by apriori() that can help us examine a larger set of rules. First, install and load the arulesViz package:
> install.packages("arulesViz")
> library(arulesViz)
> ruleset <- apriori(Groceries, parameter=list(support=0.005, confidence=0.35))
This generates 357 rules.
> plot(ruleset)
Notes
• The lift is shown by the darkness of a dot that appears on the plot. The darker the dot, the closer the lift of that rule is to 4.0.
• The support of rules ranges from somewhere below 1% all the way up above 7%, but all of the rules with high lift seem to have support below 1%. On the other hand, there are rules with high lift and high confidence, which sounds quite positive.
Next, focus on a smaller set of rules that only have the very highest levels of lift:
> goodrules <- ruleset[quality(ruleset)$lift > 3.5]
Note that the use of square brackets with our data structure ruleset allows us to index only those elements that meet the condition.
> inspect(goodrules)
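The square-bracket indexing used for goodrules follows the same pattern as logical indexing on any R data frame. The sketch below mimics it with a plain data frame of made-up quality measures; the rule labels and numbers are hypothetical and are not output from the Groceries analysis.

```r
# Made-up quality measures for four hypothetical rules
rules <- data.frame(
  rule = c("A => B", "C => D", "E => F", "G => H"),
  support = c(0.012, 0.006, 0.007, 0.030),
  confidence = c(0.52, 0.61, 0.40, 0.45),
  lift = c(2.1, 3.8, 3.6, 1.7)
)

# The comparison yields one TRUE/FALSE value per rule
keep <- rules$lift > 3.5
print(keep)

# Indexing with that logical vector keeps only the rows marked TRUE
good <- rules[keep, ]
print(good)
```

The threshold of 3.5 is a judgment call: it is simply a convenient cutoff that leaves a handful of rules to read.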
Notes
• There seems to be evidence that shoppers are purchasing particular combinations of items that go together in recipes. The first three rules really seem like soup! Rules four and five seem like a fruit platter with dip.
• We might recommend that recipes be published along with coupons, and that popular recipes, such as homemade soup, have all of their ingredients grouped together in the store along with signs saying, "Mmmm, homemade soup!"
R Functions Used in This Chapter
• apriori() - Uses the algorithm of the same name to analyze a transaction data set and generate rules.
• itemFrequencyPlot() - Shows the relative frequency of commonly occurring items in the sparse occurrence matrix.
• inspect() - Shows the contents of the data object generated by apriori(), which holds the association rules.
• install.packages() - Downloads and installs a package from the CRAN repository.
• summary() - Provides an overview of the contents of a data structure.
REFERENCES
• Book: Introduction to Data Science
• Book: Data Mining: Concepts and Techniques, Second Edition
• Slides: Dr. Bassel Alkteeb
THANK YOU
