Data Mining : Lecture One Dr. Ahmed Alnasheri 1
DATA MINING
INTRODUCTION
What is data mining?
Applications and techniques
Course Description
• This course gives students an introduction to data mining and hands-on
experience with all phases of the data mining process, using real data and
modern tools. Topics include data formats and cleaning; prediction with
supervised and unsupervised learning, using Python and other tools; sound
evaluation methods; and data/knowledge visualization.
Course Objectives
• Provide a fundamental understanding of data mining in order to extract
hidden knowledge.
• Explore the different data mining tasks used to extract knowledge:
• Classification,
• Clustering,
• Association rule extraction, and
• Outlier detection.
• Practice the phases of a data mining project.
• Present the data in the early stages of a data mining project, as well as the
extracted knowledge.
• Introduce students to the latest hot topics in the data mining field.
• Strengthen teamwork.
Outline
• Why Data Mining?
• What is Data Mining?
• Knowledge Discovery in Databases (KDD) Process
• Data Mining Tasks
• Data Mining Functionalities
• Data Creates Value
“Data is the new oil. It’s valuable, but
if unrefined it cannot really be used.
It has to be changed into gas, plastic,
chemicals, etc to create a valuable
entity that drives profitable activity;
so must data be broken down,
analyzed for it to have value.”
Why Data Mining?
• The Explosive Growth of Data: from terabytes to petabytes
• Data collection and data availability
• Automated data collection tools, database systems, Web, computerized society.
• Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”
• Data mining → Automated analysis of massive data sets
Why data mining? “Benefits of Data Mining”
• Scientific point of view
• Scientists are in an unprecedented position where they can collect terabytes (TB) of information
• Examples: Sensor data, astronomy data, social network data, gene data
• We need the tools to analyze such data to get a better understanding of the world and
advance science and help people
• Commercial point of view
• Data has become the key competitive advantage of companies
• Examples: Facebook, Google, Amazon
• Being able to extract useful information out of the data is key for exploiting it
commercially.
• Scale (in data size and feature dimension)
• Why not use traditional analytic methods?
• Enormity of data, curse of dimensionality
• The amount and the complexity of data do not allow for manual processing of the
data. We need automated techniques.
Why Data mining?
• Every human, physical, or machine activity
generates data.
• Transaction data in stores, credit cards
• Scientific measurements
• DNA sequences, gene co-expression
• Health records, brain images, daily measurements
• The Web, Wikipedia, Facebook posts, Tweets, Online
Reviews
• Queries to Google, Clicks, Browsing behavior, Ads
• Facebook likes and comments, Twitter retweets
• The Web graph, Facebook friends, Twitter followers
• Movement data, Trajectories,
• Mobile use, telephone calls
• Wearable devices
• Machine and workflow monitoring
• Everybody collects data!
Why Data Mining?
• Data mining is an interdisciplinary field
• Data mining is an interdisciplinary subfield of computer science and
statistics with an overall goal to extract information (with intelligent
methods) from a data set and transform the information into a
comprehensible structure for further use.
• What are the origins of data mining?
Data Mining: Confluence of Multiple Disciplines
[Diagram: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithms, Visualization, and Other Disciplines]
Data Mining: Confluence of Multiple Disciplines
[Diagram: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithms, Visualization, and Distributed Computing]
What is data mining?
• After years of data mining there is still no unique answer to this question
• Data mining is knowledge discovery from data
• Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns
or knowledge from huge amounts of data.
• A tentative definition:
• Data mining is the use of efficient techniques for the analysis of very large collections of data and the extraction
of useful and possibly unexpected patterns in data.
What is data mining?
• Data Mining is:
• (1) The efficient discovery of previously unknown, valid, potentially useful,
understandable patterns in large datasets.
• (2) The analysis of (often large) observational data sets to find unsuspected relationships
and to summarize the data in novel ways that are both understandable and useful to the
data owner
What is data mining?
• What is the difference between data mining and a database query?
• Database query: we know exactly what we want
• Data mining: we only vaguely know what we are looking for
Data Mining
• In simple terms:
Data → Data Mining → Value
Knowledge Discovery in Databases (KDD) Process
• Data mining plays an essential role in the knowledge discovery process and
is highly dependent on data
The Data Mining Process
1. Understand the domain
2. Create a dataset:
• Select the interesting attributes
• Data cleaning and preprocessing
3. Choose the data mining task and the specific algorithm
4. Interpret the results, and possibly return to 2
Why Data Preprocessing?
• Data in the real world is dirty
• incomplete: lacking attribute values or lacking certain attributes of interest.
• noisy: containing errors or outliers.
• inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
• Quality decisions must be based on quality data.
• Data warehouse needs consistent integration of quality data.
• Required for both OLAP and Data Mining!
Why can Data be Incomplete?
• Attributes of interest are not available (e.g., customer information
for sales transaction data)
• Data were not considered important at the time of transactions, so
they were not recorded!
• Data were not recorded because of misunderstandings or malfunctions
• Data may have been recorded and later deleted!
• Missing/unknown values for some data
Data Cleaning
• Data cleaning tasks
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
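The three cleaning tasks listed on this slide can be sketched in a few lines of Python. This is only a minimal illustration: the records, the plausible age range, and the choice of the median as a fill value are all assumptions made for the example, not part of the lecture.

```python
from statistics import median

# Hypothetical records: (customer id, city, age). None marks a missing value.
records = [
    ("c1", "Sanaa", 34),
    ("c2", "sanaa ", 29),   # inconsistent spelling of the same city
    ("c3", "Aden", None),   # missing age
    ("c4", "ADEN", 31),
    ("c5", "Sanaa", 290),   # implausible age, probably a data-entry error
    ("c6", "Aden", 27),
    ("c7", "Sanaa", 38),
]

def clean(rows, low=0, high=120):
    # 1. Correct inconsistent data: normalise whitespace and case of city names.
    rows = [(cid, city.strip().title(), age) for cid, city, age in rows]
    # 2. Identify outliers: ages outside the assumed plausible range become missing.
    rows = [(cid, city, age if age is not None and low <= age <= high else None)
            for cid, city, age in rows]
    # 3. Fill in missing values with the median of the remaining valid ages.
    fill = median(age for _, _, age in rows if age is not None)
    return [(cid, city, age if age is not None else fill) for cid, city, age in rows]

cleaned = clean(records)
for row in cleaned:
    print(row)
```

Treating an outlier as a missing value and then imputing it is one simple policy among several; depending on the data, dropping the record or smoothing the value may be more appropriate.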
Types of data
• Numeric data: Each object is a point in a multidimensional space
• Categorical data: Each object is a vector of categorical values
• Set data: Each object is a set of values (with or without counts)
• Sets can also be represented as binary vectors, or vectors of counts
• Ordered sequences: Each object is an ordered sequence of values.
• Graph data
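As a small illustration of the remark that sets can be represented as binary vectors or vectors of counts, the sketch below converts one object (a shopping basket) into both forms. The vocabulary and the basket are hypothetical examples.

```python
from collections import Counter

# Hypothetical vocabulary of possible items, in a fixed order.
vocabulary = ["bread", "milk", "eggs", "tea"]

def to_binary_vector(items, vocab):
    # 1 if the item occurs in the object, 0 otherwise.
    present = set(items)
    return [1 if v in present else 0 for v in vocab]

def to_count_vector(items, vocab):
    # Number of times each vocabulary item occurs in the object.
    counts = Counter(items)
    return [counts[v] for v in vocab]

basket = ["milk", "bread", "milk", "tea"]
binary = to_binary_vector(basket, vocabulary)
counts = to_count_vector(basket, vocabulary)
print(binary)  # [1, 1, 0, 1]
print(counts)  # [1, 2, 0, 1]
```

Once objects are vectors of the same length, they can be treated as points in a multidimensional space, which is what most numeric mining algorithms expect.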
What can we do with data mining?
• Some examples:
• Frequent item sets and Association Rules extraction
• Clustering
• Classification
• Ranking
• Exploratory analysis
What can we do with data mining?
• You are the owner of a social network, and you have full access to
the social graph, what kind of information do you want to get out of
your graph?
What can we do with data mining?
• Suppose that you are the owner of a supermarket and you have
collected billions of market basket data. What information would
you extract from it and how would you use it?
• What if this was an online store?
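One possible answer to the supermarket question is to look for items that are frequently bought together. The sketch below counts the support of item pairs and the confidence of one rule on a handful of hypothetical baskets; the baskets, the 0.6 support threshold, and the function names are assumptions for illustration only.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "juice", "eggs"},
    {"milk", "diapers", "juice", "cola"},
    {"bread", "milk", "diapers", "juice"},
    {"bread", "milk", "diapers", "cola"},
]

def pair_supports(baskets):
    # Support of a pair = fraction of baskets that contain both items.
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: c / len(baskets) for pair, c in counts.items()}

def confidence(baskets, antecedent, consequent):
    # Confidence of the rule antecedent -> consequent:
    # among baskets with the antecedent, the fraction that also have the consequent.
    with_antecedent = [b for b in baskets if antecedent in b]
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

supports = pair_supports(baskets)
frequent = {pair: s for pair, s in supports.items() if s >= 0.6}
print(frequent)
print(confidence(baskets, "diapers", "juice"))  # 0.75
```

Counting every pair in every basket does not scale to billions of baskets; frequent-itemset algorithms exist precisely to prune this search.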
What can we do with data mining?
• Suppose you are a search engine and you have a toolbar log consisting of
• Pages browsed,
• Queries,
• Pages clicked,
• Ads clicked
• Each with a user id and a timestamp. What information would you like to
get out of the data?
Data Set: High School Students’ Grades
Data Mining Tasks
• Two main types of data mining tasks
• Descriptive: Characterize properties of the data in a target data set.
• Predictive: Perform induction on the current data in order to predict values
for new data.
Data Mining Tasks
1. Classification: learning a function that maps an item into one of a set of
predefined classes
2. Regression: learning a function that maps an item to a real value
3. Clustering: identify a set of groups of similar items
4. Dependencies and associations: identify significant dependencies between
data attributes
5. Summarization: find a compact description of the dataset or a subset of the
dataset
Data Mining Functionalities
• Data mining functionalities are used to specify the kind of patterns
to be found in data mining tasks.
The data is complex and interconnected
• Multiple types of data: database tables, text, time series, images,
videos, graphs, etc
• Spatial and temporal aspect
• Interconnected data of different types:
• From the mobile phone we can collect the location of the user, friendship
information, check-ins to venues, opinions through Twitter, status updates on
Facebook, images through cameras, and queries to search engines
Data creates value
Natural language
understanding is
driven by data
Data creates value
Precision/Personalized medicine:
Find the best treatment for patients
using their genotype and all data
that are related to them
Also: understanding drug side effects
through Google queries
Data creates value
Self-Driving Cars:
Car is the next computer. A
future of smart cars that can
drive themselves and learn
from data
Also: smart cities – urban
computing
Data creates value
Computers learn to play
games by observing data
Data creates value
Use of data for crisis management
Data creates value
• All major soccer and basketball teams use data mining to make
decisions.
The national team of Germany
had a special software for the
analysis of video.
They concluded that the
possession time per player
should be reduced.
Germany won the 2014 World Cup
Data creates value
James Harden defense
Putting it all together: The Data Mining Pipeline (LinkedIn)
[Diagram of three connected components]
• Data Pipeline: data tracking & logging, feature extraction, feature transformation, user modeling
• Online Serving System: candidates generation, multi-pass rankers, real-time feedback
• Model Fitting Pipeline (Hadoop): offline model fitting (cold-start model, daily/weekly), nearline model fitting (warm-start model, minutes/hourly)
• Also shown: online A/B test, model evaluation
Data Mining Example
• Suppose that you were creating the Yemeni Facebook.
• What kind of data would you collect and store?
Social network contacts
Interaction with contacts: messages, likes, replies, shares
Posts, content of posts
Interactions with feed: Clicks, Likes, Comments, Shares
Photos
Demographics: Age, City, etc
Ads seen, ads clicked
Products bought
Videos uploaded, videos consumed
and many more!
What would you do with this data?
Exploratory Analysis
• Make measurements to understand what the
data looks like
• Example: Posts
• How often do users post, how many posts per
user, when do they post, is there a correlation
between number of posts and number of friends,
etc.
• This is one of the first steps when collecting
data.
• Metrics: Deciding what to measure is important
• The example of the Web graph
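The measurements suggested for posts can be sketched as follows. The post log and friend counts are hypothetical, and the Pearson coefficient is just one possible way to quantify the correlation between posts and friends.

```python
from collections import Counter
from math import sqrt

# Hypothetical post log: one (user, hour of day) entry per post.
posts = [("u1", 9), ("u1", 21), ("u1", 22), ("u2", 21), ("u3", 8),
         ("u3", 21), ("u3", 22), ("u3", 23), ("u4", 12)]
# Hypothetical number of friends per user.
friends = {"u1": 150, "u2": 40, "u3": 300, "u4": 25}

posts_per_user = Counter(user for user, _ in posts)   # how many posts per user
posts_per_hour = Counter(hour for _, hour in posts)   # when do users post

def pearson(xs, ys):
    # Pearson correlation coefficient between two equally long lists.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

users = sorted(friends)
r = pearson([posts_per_user[u] for u in users], [friends[u] for u in users])
print(posts_per_user.most_common())
print(posts_per_hour.most_common(1))
print(round(r, 2))
```

With only four users the coefficient says very little; on real data one would also plot the distributions before trusting a single number.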
Exploiting similarities
• Consider the following data for six users:
• Number of times they have clicked on posts from these pages
• What conclusion can we draw?
   NBA  ESPN  Sports.com  MSNBC  NY Times  Wall Street  Politico
A  100    50          73     10         1            1         4
B  500   200         400     20        10            4         1
C   80   100          60      1         3            1         1
D    4     2           1     12        90          100        80
E    9     3           4      9       100           80        70
F    3     4           5     30       300          200       500
Exploiting similarities
• Two types of users and two types of pages
• Sports and politics
• Questions:
• How do we compute similarity?
• How do we group similar users? Clustering
   NBA  ESPN  Sports.com  MSNBC  NY Times  Wall Street  Politico
A  100    50          73     10         1            1         4
B  500   200         400     20        10            4         1
C   80   100          60      1         3            1         1
D    4     2           1     12        90          100        80
E    9     3           4      9       100           80        70
F    3     4           5     30       300          200       500
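One common way to answer "how do we compute similarity?" for click-count vectors is cosine similarity, which compares the direction of two vectors and ignores how active each user is overall. The sketch below applies it to the table from the slide and then groups users with a simple greedy rule. The 0.5 threshold and the greedy grouping are assumptions chosen for illustration; they stand in for a real clustering algorithm such as k-means.

```python
from math import sqrt

# Click counts from the slide, columns in the order:
# NBA, ESPN, Sports.com, MSNBC, NY Times, Wall Street, Politico
clicks = {
    "A": [100, 50, 73, 10, 1, 1, 4],
    "B": [500, 200, 400, 20, 10, 4, 1],
    "C": [80, 100, 60, 1, 3, 1, 1],
    "D": [4, 2, 1, 12, 90, 100, 80],
    "E": [9, 3, 4, 9, 100, 80, 70],
    "F": [3, 4, 5, 30, 300, 200, 500],
}

def cosine(u, v):
    # Cosine of the angle between two vectors: 1 means same direction.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def group_users(data, threshold=0.5):
    # Greedy grouping: put each user into the first group whose first member
    # is similar enough, otherwise start a new group.
    groups = []
    for user, vector in data.items():
        for group in groups:
            if cosine(vector, data[group[0]]) >= threshold:
                group.append(user)
                break
        else:
            groups.append([user])
    return groups

groups = group_users(clicks)
print(groups)  # [['A', 'B', 'C'], ['D', 'E', 'F']]
```

Note that user B clicks about five times as often as user A, yet the two are almost identical under cosine similarity, which is why it suits this kind of data better than plain Euclidean distance.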
Exploiting similarities
• What if we were missing this entry?
• Can we fill this value?
• Similar users like items similarly: Recommendation systems
   NBA  ESPN  Sports.com  MSNBC  NY Times  Wall Street  Politico
A  100    50          73     10         1            1         4
B  500   200         400     20        10            4         1
C   80   100         ???      1         3            1         1
D    4     2           1     12        90          100        80
E    9     3           4      9       100           80        70
F    3     4           5     30       300          200       500
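The idea that similar users like items similarly can be turned into a rough estimate of the missing entry. The sketch below is one simple user-based scheme, not the method of any particular recommender system: it measures how similar C is to every other user on the known columns, looks at how much each of those users clicks on Sports.com relative to their own average, and combines these ratios weighted by similarity. The function names and the scaling by the user's mean are assumptions made for this example.

```python
from math import sqrt

# Click counts from the slide; None marks the missing Sports.com entry of user C.
clicks = {
    "A": [100, 50, 73, 10, 1, 1, 4],
    "B": [500, 200, 400, 20, 10, 4, 1],
    "C": [80, 100, None, 1, 3, 1, 1],
    "D": [4, 2, 1, 12, 90, 100, 80],
    "E": [9, 3, 4, 9, 100, 80, 70],
    "F": [3, 4, 5, 30, 300, 200, 500],
}
SPORTS_COM = 2  # column index of the missing entry

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def predict(data, user, col):
    # Assumes `col` is the only missing entry in the whole table.
    def without_col(vector):
        return [x for i, x in enumerate(vector) if i != col]

    target = without_col(data[user])
    target_mean = sum(target) / len(target)
    weighted_sum = weight_total = 0.0
    for other, vector in data.items():
        if other == user or vector[col] is None:
            continue
        rest = without_col(vector)
        similarity = cosine(target, rest)
        # How strongly the other user clicks on this page relative to
        # their own average activity on the remaining pages.
        ratio = vector[col] / (sum(rest) / len(rest))
        weighted_sum += similarity * ratio
        weight_total += similarity
    return target_mean * weighted_sum / weight_total

prediction = predict(clicks, "C", SPORTS_COM)
print(round(prediction))
```

The estimate lands in the range typical of the sports-oriented users rather than the single-digit counts of D, E, and F, which is the qualitative behavior one would hope for; it is not expected to reproduce the original value exactly.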
Amazon Recommendations
• “People who have bought this also bought…”
• A huge breakthrough for Amazon
• Took advantage of the long tail
• A big breakthrough for data mining in general
Making predictions
• Filling the missing value can also be viewed as a prediction task
• Types of prediction tasks:
• Predicting a real value (e.g. number of clicks): Regression
• Predicting a YES/NO value (e.g., will the user click?): Binary classification
• Predicting over multiple classes (e.g., what is the topic of a post): Classification
• Can you think of prediction/classification tasks for your social
network?
• Ad click prediction
• Ad clickthrough prediction
• Like prediction
• Predict if a user will like a post over another: Learning to rank
• Predict if a post is offensive
• Predict if a photo contains nudity
Classification
• Classification process:
• Find features that describe an entity.
• Use examples of the classes you want to predict.
• Learn a model (function) that predicts the class of a new entity
• Classification is the engine behind the AI revolution
• Used in all systems that make decisions
• Became very powerful with Deep Learning
• Huge applications in vision
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
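The decision tree shown on this slide (split on Refund, then on marital status, then on taxable income) can be written as a small function and checked against the ten training records. The function name is hypothetical, and how to treat an income of exactly 80K is an assumption, since the slide only labels the branches "< 80K" and "> 80K".

```python
# Training records from the slide:
# (Tid, Refund, Marital Status, Taxable Income in thousands, Cheat)
records = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

def predict_cheat(refund, marital_status, taxable_income_k):
    # Root: Refund = Yes leads directly to the leaf NO.
    if refund == "Yes":
        return "No"
    # Second level: Married leads to the leaf NO.
    if marital_status == "Married":
        return "No"
    # Third level (Single or Divorced): split on taxable income.
    # Assumption: exactly 80K is treated like the "< 80K" branch.
    return "Yes" if taxable_income_k > 80 else "No"

correct = sum(
    predict_cheat(refund, status, income) == label
    for _, refund, status, income, label in records
)
accuracy = correct / len(records)
print(accuracy)  # 1.0
```

The tree fits all ten training records, which shows that it is consistent with the table but says nothing yet about how well it would classify new taxpayers.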
[Decision tree]
Refund?
  Yes → NO
  No → MarSt?
    Married → NO
    Single, Divorced → TaxInc?
      < 80K → NO
      > 80K → YES
Deep learning
• Machine learning systems that use neural networks with multiple
layers and are trained on very large quantities of data
• Able to learn complex representations and powerful models.
• Applications in recommendations, network analysis, text analysis, image
recognition, car driving, playing games…
• Require less feature engineering
The social graph
• Your Yemeni Facebook also has a social graph. What can you do
with this data?
What is the shortest path between two nodes?
Who is important and influential in the graph?
How does information spread in the network?
What becomes viral?
Will two users become friends in the future?
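The first question, the shortest path between two nodes, has a standard answer for unweighted graphs: breadth-first search. The sketch below runs it on a tiny hypothetical friendship graph; the names and edges are made up for the example.

```python
from collections import deque

# Hypothetical undirected friendship graph as an adjacency list.
graph = {
    "Alice": ["Bob", "Charlie"],
    "Bob": ["Alice", "Dana"],
    "Charlie": ["Alice", "Dana"],
    "Dana": ["Bob", "Charlie", "Eman"],
    "Eman": ["Dana"],
}

def shortest_path(graph, source, target):
    # Breadth-first search: nodes are visited in order of distance from the
    # source, so the first time we reach the target the path is shortest.
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph[node]:
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None  # target is not reachable from source

path = shortest_path(graph, "Alice", "Eman")
print(path)  # ['Alice', 'Bob', 'Dana', 'Eman']
```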
Node importance
• What is the most important node in this graph? Degree centrality and
closeness centrality
• The PageRank algorithm: A node is important if it is pointed to by other
important nodes.
• The idea of the basic PageRank algorithm is that the importance of a
node depends on the number and importance of the neighbor nodes
pointing to it.
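The basic PageRank idea can be sketched as a repeated redistribution of scores along links. The graph below is hypothetical, and the damping factor of 0.85 and the fixed number of iterations are conventional choices assumed for the example, not values taken from the lecture.

```python
def pagerank(links, damping=0.85, iterations=50):
    # links maps each node to the list of nodes it points to.
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives a small base score ...
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, outgoing in links.items():
            if outgoing:
                # ... plus an equal share of the score of each node pointing to it.
                share = damping * rank[node] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A node without outgoing links spreads its score over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank

# Hypothetical graph: a, b and d all point to c; c points to d.
links = {"a": ["c"], "b": ["c"], "c": ["d"], "d": ["c"]}
rank = pagerank(links)
for node, score in sorted(rank.items(), key=lambda item: -item[1]):
    print(node, round(score, 3))
```

Node c ends up with the highest score because it has the most incoming links, and d comes second even though only one node points to it, because that node is the important one.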
The Web as a graph
• When ranking pages, the
authoritativeness is
factored in the ranking.
• This is the idea that made
Google a success around
2000
• Today a lot more
information is used, like
clicks, browsing behavior,
etc
• Ranking of the pages is a
very complex task that
requires sophisticated
techniques
Friendship suggestions
• LinkedIn, Twitter, Facebook friendship suggestions
• Useful for the users to discover their friends, but also useful for the network in order
to grow, and increase engagement
• LinkedIn success story
• Triadic closure principle: Links are created in a way that usually closes a
triangle
• If both Bob and Charlie know Alice, then they are likely to meet at some point.
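The triadic closure principle suggests a simple friend-suggestion heuristic: recommend pairs of users who are not yet connected but share contacts, ranked by the number of common contacts. The sketch below is a minimal version on a hypothetical network; real systems combine many more signals.

```python
from collections import Counter
from itertools import combinations

# Hypothetical symmetric friendship lists.
friends = {
    "Alice": {"Bob", "Charlie", "Dana"},
    "Bob": {"Alice", "Dana"},
    "Charlie": {"Alice"},
    "Dana": {"Alice", "Bob"},
}

def suggest_friends(friends):
    # For every person, each pair of their contacts that is not yet connected
    # is an open triangle; count how many common contacts each pair has.
    common = Counter()
    for person, contacts in friends.items():
        for a, b in combinations(sorted(contacts), 2):
            if b not in friends[a]:
                common[(a, b)] += 1
    return common.most_common()

suggestions = suggest_friends(friends)
print(suggestions)
```

Here Bob and Charlie both know Alice but not each other, so the pair is suggested, exactly as in the example on the slide.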
What is Data Mining again?
• “Data mining is the analysis of (often large) observational data sets to
find unsuspected relationships and to summarize the data in novel ways
that are both understandable and useful to the data analyst” (Hand,
Mannila, Smyth)
• “Data mining is the discovery of models for data” (Rajaraman, Ullman)
• We can have the following types of models
• Models that explain the data (e.g., a single function)
• Models that predict the future data instances.
• Models that summarize the data
• Models that extract the most prominent features of the data.
• “Data Mining is the study of collecting, processing, analyzing, and
gaining useful insights from data” – Charu Aggarwal
The buzz around data
• Data Science: Data is useful to understand a process and improve it. All
organizations should have a data science team that analyses their data and
proposes improvements
• Focuses on more immediate applications and insights
• Big Data: Data appears everywhere. We should process it collectively and
interconnect it. We need infrastructure (cloud computing, cloud storage) to
do this
• More systems oriented
• AI/Machine Learning/Deep Learning: These have been around for a while but
now we have the data to learn more complex models that are significantly
more powerful
• More emphasis on scientific breakthroughs
New era of data mining
• Boundaries are becoming less clear
• Today data mining, machine learning, and AI are often used almost
synonymously. It is assumed that the algorithms should scale. It is clear that
statistical inference is used for building the models.
• Data is the engine for AI
• Data Mining touches everything related to data.
Which also has a dark side
• Are the algorithms making fair
and correct decisions?
• Do algorithms create filter
bubbles, echo chambers, and
promote misinformation? Are
they a threat to democracy?
• Surveillance capitalism
• Is AI a threat?
The Skills of a Data Miner – Data Scientist
It is a hard job
But also a rewarding one
"The success of companies
like Google, Facebook,
Amazon, and Netflix, not to
mention Wall Street firms and
industries from manufacturing
and retail to healthcare, is
increasingly driven by better
tools for extracting meaning
from very large quantities of
data. 'Data Scientist' is now
the hottest job title in Silicon
Valley." – Tim O'Reilly
Sexiest Job but…
Homework 1
• Read chapter one and answer the questions on pages 34 and 35, by hand.
• Data Scientist: The Sexiest Job of the 21st Century. PowerPoint Presentation.
• http://www.cse.msu.edu/~ptan/dmbook/software/

More Related Content

PPTX
datamining_Lecture_1(introduction).pptx
PPT
Data mining Introduction
PPT
Unit 1 (Chapter-1) on data mining concepts.ppt
PPT
hanjia chapter_1.ppt data mining chapter 1
PPT
Introduction of Data Mining - Concept and techniques
PPT
Chapter 1. Introduction.ppt
PPT
Chapter 01Intro.ppt full explanation used
PPTX
Introduction to-data-mining chapter 1
datamining_Lecture_1(introduction).pptx
Data mining Introduction
Unit 1 (Chapter-1) on data mining concepts.ppt
hanjia chapter_1.ppt data mining chapter 1
Introduction of Data Mining - Concept and techniques
Chapter 1. Introduction.ppt
Chapter 01Intro.ppt full explanation used
Introduction to-data-mining chapter 1

Similar to datamining-introduction.pdf (20)

PDF
UNIT2-Data Mining.pdf
PPT
01Intro.ppt data analytics r language slide 1
PPTX
DWDM 3rd EDITION TEXT BOOK SLIDES24.pptx
PPTX
dataminingintroductionpptpptpptptro.pptx
PPT
Introduction to data warehouse
PPT
3RD B.TECH-DATAMINING-INTRODUCTION-UNIT1 .ppt
PPT
01Intro.ppt data mining dahauuehuwhuwrwhrurhuqhuahura
PPT
DATA MINING: INTRODUCTION TO DATA MINING
PPT
Data Mining and Warehousing Concept and Techniques
PDF
Data mining chapter for students of university
PPTX
Lect 1 introduction
PPTX
Data mining concepts
PPT
Data Mining- Unit-I PPT (1).ppt
PDF
Lect 1 introduction
PPTX
Topic(1)-Intro data mining master ALEX.pptx
PPT
DWDMUNIhjkuijhgfdswertyuuyhtgrertyuujhytrertyT1A.ppt
PPTX
2 Data-mining process
PDF
2 introductory slides
PPT
Data Mining
PPT
01Intro(1).ppt Introduction In computer science
UNIT2-Data Mining.pdf
01Intro.ppt data analytics r language slide 1
DWDM 3rd EDITION TEXT BOOK SLIDES24.pptx
dataminingintroductionpptpptpptptro.pptx
Introduction to data warehouse
3RD B.TECH-DATAMINING-INTRODUCTION-UNIT1 .ppt
01Intro.ppt data mining dahauuehuwhuwrwhrurhuqhuahura
DATA MINING: INTRODUCTION TO DATA MINING
Data Mining and Warehousing Concept and Techniques
Data mining chapter for students of university
Lect 1 introduction
Data mining concepts
Data Mining- Unit-I PPT (1).ppt
Lect 1 introduction
Topic(1)-Intro data mining master ALEX.pptx
DWDMUNIhjkuijhgfdswertyuuyhtgrertyuujhytrertyT1A.ppt
2 Data-mining process
2 introductory slides
Data Mining
01Intro(1).ppt Introduction In computer science
Ad

Recently uploaded (20)

PDF
[EN] Industrial Machine Downtime Prediction
PPTX
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
PPTX
STERILIZATION AND DISINFECTION-1.ppthhhbx
PPTX
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
PDF
Fluorescence-microscope_Botany_detailed content
PPTX
Introduction to Knowledge Engineering Part 1
PDF
Introduction to the R Programming Language
PDF
Clinical guidelines as a resource for EBP(1).pdf
PDF
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
PDF
Mega Projects Data Mega Projects Data
PPTX
1_Introduction to advance data techniques.pptx
PDF
.pdf is not working space design for the following data for the following dat...
PDF
Business Analytics and business intelligence.pdf
PDF
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
PPT
Reliability_Chapter_ presentation 1221.5784
PDF
Galatica Smart Energy Infrastructure Startup Pitch Deck
PPT
Quality review (1)_presentation of this 21
PPTX
Data_Analytics_and_PowerBI_Presentation.pptx
PPTX
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
PPTX
Introduction to machine learning and Linear Models
[EN] Industrial Machine Downtime Prediction
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
STERILIZATION AND DISINFECTION-1.ppthhhbx
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
Fluorescence-microscope_Botany_detailed content
Introduction to Knowledge Engineering Part 1
Introduction to the R Programming Language
Clinical guidelines as a resource for EBP(1).pdf
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
Mega Projects Data Mega Projects Data
1_Introduction to advance data techniques.pptx
.pdf is not working space design for the following data for the following dat...
Business Analytics and business intelligence.pdf
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
Reliability_Chapter_ presentation 1221.5784
Galatica Smart Energy Infrastructure Startup Pitch Deck
Quality review (1)_presentation of this 21
Data_Analytics_and_PowerBI_Presentation.pptx
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
Introduction to machine learning and Linear Models
Ad

datamining-introduction.pdf

  • 1. Data Mining : Lecture One Dr. Ahmed Alnasheri 1 DATA MINING INTRODUCTION What is data mining? Applications and techniques
  • 2. Data Mining : Lecture One Dr. Ahmed Alnasheri 2 • This course has been designed to give students an introduction to data mining and hands on experience with all phases of the data mining process using real data and modern tools. It covers many topics such as data formats, and cleaning; make prediction using supervised and unsupervised learning using Python and other tools, and sound evaluation methods; and data/knowledge visualization. Course Description
  • 3. Data Mining : Lecture One Dr. Ahmed Alnasheri 3 • Providing the fundamental understanding of data mining in order to extract hidden knowledge. • Exploring the different data mining tasks to extract knowledge: • Classification, • Clustering, • Association Rules extraction, and • Outlier detection. • Practicing the data mining project phases • Presenting the data in the early stage of data mining projects as well as the extracted knowledge. • Provide the students the latest hot topics in data mining field. • Strengthen the team work Course Objectives
  • 4. Data Mining : Lecture One Dr. Ahmed Alnasheri 4 • Why Data Mining? • What is Data Mining ? • Knowledge Data Discovery (KDD) Process • Data Mining Task • Data Mining Function. • Data Creates Values Outline
  • 5. Data Mining : Lecture One Dr. Ahmed Alnasheri 5
  • 6. Data Mining : Lecture One Dr. Ahmed Alnasheri 6 “Data is the new oil. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc to create a valuable entity that drives profitable activity; so must data be broken down, analyzed for it to have value.”
  • 7. Data Mining : Lecture One Dr. Ahmed Alnasheri 7 • The Explosive Growth of Data: from terabytes to petabytes • Data collection and data availability • Automated data collection tools, database systems, Web, computerized society. • Major sources of abundant data • Business: Web, e-commerce, transactions, stocks, … • Science: Remote sensing, bioinformatics, scientific simulation, … • Society and everyone: news, digital cameras, YouTube • We are drowning in data, but starving for knowledge! • “Necessity is the mother of invention” • Data mining → Automated analysis of massive data sets Why Data mining?
  • 8. Data Mining : Lecture One Dr. Ahmed Alnasheri 8 Why data mining? “Benefits of Data Mining” • Scientific point of view • Scientists are at an unprecedented position where they can collect TB (tuberculosis ) of information • Examples: Sensor data, astronomy data, social network data, gene data • We need the tools to analyze such data to get a better understanding of the world and advance science and help people • Commercial point of view • Data has become the key competitive advantage of companies • Examples: Facebook, Google, Amazon • Being able to extract useful information out of the data is key for exploiting them commercially. • Scale (in data size and feature dimension) • Why not use traditional analytic methods? • Enormity of data, curse of dimensionality • The amount and the complexity of data does not allow for manual processing of the data. We need automated techniques.
  • 9. Data Mining : Lecture One Dr. Ahmed Alnasheri 9 Why Data mining? • Every human, physical, or machine activity generates data. • Transaction data in stores, credit cards • Scientific measurements • DNA sequences, gene co-expression • Health records, brain images, daily measurements • The Web, Wikipedia, Facebook posts, Tweets, Online Reviews • Queries to Google, Clicks, Browsing behavior, Ads • Facebook likes and comments, Twitter retweets • The Web graph, Facebook friends, Twitter followers • Movement data, Trajectories, • Mobile use, telephone calls • Wearable devices • Machine and workflow monitoring • Everybody collects data!
  • 10. Data Mining : Lecture One Dr. Ahmed Alnasheri 10 • Data mining is an interdisciplinary field Why Data mining? Data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information (with intelligent methods) from a data set and transform the information into a comprehensible structure for further use. What are the origins of data mining?
  • 11. Data Mining : Lecture One Dr. Ahmed Alnasheri 11 Data Mining: Confluence of Multiple Disciplines Data Mining Database Technology Statistics Machine Learning Pattern Recognition Algorithms Other Disciplines Visualization
  • 12. Data Mining : Lecture One Dr. Ahmed Alnasheri 12 Data Mining: Confluence of Multiple Disciplines Data Mining Database Technology Statistics Machine Learning Pattern Recognition Algorithms Other Disciplines Visualization
  • 13. Data Mining : Lecture One Dr. Ahmed Alnasheri 13 Data Mining: Confluence of Multiple Disciplines Data Mining Database Technology Statistics Machine Learning Pattern Recognition Algorithms Distributed Computing Visualization
  • 14. Data Mining : Lecture One Dr. Ahmed Alnasheri 14 • After years of data mining there is still no unique answer to this question • Data Mining Is a knowledge discovery from data • Extraction of interest (Non-trivial, Implicit, Previously unknown, and Potentially useful) pattern or knowledge from huge amount of data. • A tentative definition: • Data mining is the use of efficient techniques for the analysis of very large collections of data and the extraction of useful and possibly unexpected patterns in data. What is data mining ?
  • 15. Data Mining : Lecture One Dr. Ahmed Alnasheri 15 • Data Mining is: • (1) The efficient discovery of previously unknown, valid, potentially useful, understandable patterns in large datasets. • (2) The analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner What is data mining ?
  • 16. Data Mining : Lecture One Dr. Ahmed Alnasheri 16 What is data mining ? • What is the difference between data mining and Database Query? We know what exactly we want We vaguely know what we are looking for
  • 17. Data Mining : Lecture One Dr. Ahmed Alnasheri 17 Data Mining • In simple terms: Data Data Mining Value
  • 18. Data Mining : Lecture One Dr. Ahmed Alnasheri 18 Knowledge Data Discovery (KDD) Process • Data mining plays an essential role in the knowledge discovery process and highly dependent on data
  • 19. Data Mining : Lecture One Dr. Ahmed Alnasheri 19 The Data Mining Process 1. Understand the domain 2. Create a dataset: • Select the interesting attributes • Data cleaning and preprocessing 3. Choose the data mining task and the specific algorithm 4. Interpret the results, and possibly return to 2
  • 20. Why Data Preprocessing?
      • Data in the real world is dirty:
          • Incomplete: lacking attribute values or certain attributes of interest.
          • Noisy: containing errors or outliers.
          • Inconsistent: containing discrepancies in codes or names.
      • No quality data, no quality mining results!
          • Quality decisions must be based on quality data.
          • A data warehouse needs consistent integration of quality data.
      • Required for both OLAP and data mining!
  • 21. Why can Data be Incomplete?
      • Attributes of interest are not available (e.g., customer information for sales transaction data).
      • Data were not considered important at the time of the transactions, so they were not recorded.
      • Data were not recorded because of misunderstandings or malfunctions.
      • Data may have been recorded and later deleted.
      • Values are missing or unknown for some data.
  • 22. Data Cleaning
      • Data cleaning tasks:
          • Fill in missing values.
          • Identify outliers and smooth out noisy data.
          • Correct inconsistent data.
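The three cleaning tasks on this slide can be illustrated with a minimal sketch using only the Python standard library. The toy values, the choice of the median as the fill value, the 1.5 × IQR outlier rule, and the `GENDER_CODES` mapping are all assumptions made for illustration; the slides do not prescribe a specific method.

```python
from statistics import median, quantiles

# Hypothetical toy column: None marks a missing value, 250 is a likely entry error.
ages = [23, 25, None, 24, 26, 250, None, 22]

def fill_missing(values):
    """Fill in missing values with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def find_outliers(values):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (a common rule of thumb)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]

# Hypothetical mapping used to correct inconsistent codes.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}

def normalize_code(raw, mapping):
    """Map differently spelled codes to one canonical code; unknown codes pass through."""
    key = raw.strip().lower()
    return mapping.get(key, raw.strip())

filled = fill_missing(ages)
outliers = find_outliers(filled)
genders = [normalize_code(g, GENDER_CODES) for g in ["M", "male", " F", "Female", "m"]]
print(filled, outliers, genders)
```

The median is used here rather than the mean because a single extreme value such as 250 would pull the mean upward and distort the filled-in values.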
  • 23. Types of data
      • Numeric data: each object is a point in a multidimensional space.
      • Categorical data: each object is a vector of categorical values.
      • Set data: each object is a set of values (with or without counts).
          • Sets can also be represented as binary vectors, or vectors of counts.
      • Ordered sequences: each object is an ordered sequence of values.
      • Graph data
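As a small illustration of the set-data bullet, the sketch below turns sets of items into binary vectors and count vectors over a shared vocabulary. The baskets are hypothetical toy data.

```python
from collections import Counter

# Hypothetical baskets; the second one contains "milk" twice.
baskets = [["bread", "milk"], ["milk", "eggs", "milk"], ["bread"]]

# Fix one position per distinct item so that all vectors are comparable.
vocabulary = sorted({item for basket in baskets for item in basket})

def to_binary_vector(basket, vocabulary):
    """1 if the item occurs in the basket, 0 otherwise."""
    present = set(basket)
    return [1 if item in present else 0 for item in vocabulary]

def to_count_vector(basket, vocabulary):
    """Number of occurrences of each vocabulary item in the basket."""
    counts = Counter(basket)
    return [counts[item] for item in vocabulary]

binary = to_binary_vector(baskets[1], vocabulary)
counts = to_count_vector(baskets[1], vocabulary)
print(vocabulary, binary, counts)
```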
  • 24. What can we do with data mining?
      • Some examples:
          • Frequent itemsets and association rule extraction
          • Clustering
          • Classification
          • Ranking
          • Exploratory analysis
  • 25. What can we do with data mining?
      • You are the owner of a social network and you have full access to the social graph. What kind of information do you want to get out of your graph?
  • 26. What can we do with data mining?
      • Suppose that you are the owner of a supermarket and you have collected billions of market baskets. What information would you extract from the data, and how would you use it?
      • What if this was an online store?
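One possible answer to the supermarket question is to look for items that are frequently bought together. The sketch below counts the support of item pairs and computes the confidence of a rule such as {diapers} → {beer}. The five baskets and the minimum support of 3 are hypothetical, and the brute-force counting is only meant for small examples, not for billions of baskets.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support_counts(baskets, size):
    """Count in how many baskets each itemset of the given size appears."""
    counts = Counter()
    for basket in baskets:
        for itemset in combinations(sorted(basket), size):
            counts[itemset] += 1
    return counts

def confidence(baskets, antecedent, consequent):
    """Fraction of baskets containing the antecedent that also contain the consequent."""
    base = sum(1 for b in baskets if antecedent <= b)
    both = sum(1 for b in baskets if antecedent <= b and consequent <= b)
    return both / base if base else 0.0

pair_counts = support_counts(baskets, 2)
frequent_pairs = {pair for pair, n in pair_counts.items() if n >= 3}
print(sorted(frequent_pairs))
print(confidence(baskets, {"diapers"}, {"beer"}))
```

A store could use such rules for shelf placement or promotions; an online store could use them for "bought together" suggestions.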
  • 27. What can we do with data mining?
      • Suppose you are a search engine and you have a toolbar log consisting of:
          • Pages browsed
          • Queries
          • Pages clicked
          • Ads clicked
      • Each entry has a user id and a timestamp. What information would you like to get out of the data?
  • 28. Data Set: High School Students' Grades
  • 29. Data Mining Task
      • Two main types of data mining task:
          • Descriptive: characterize properties of the data in a target data set.
          • Predictive: perform induction on the current data in order to predict values for new data.
  • 30. Data Mining Task
      1. Classification: learning a function that maps an item into one of a set of predefined classes
      2. Regression: learning a function that maps an item to a real value
      3. Clustering: identifying a set of groups of similar items
      4. Dependencies and associations: identifying significant dependencies between data attributes
      5. Summarization: finding a compact description of the dataset or a subset of the dataset
  • 31. Data Mining Function
      • Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks.
  • 32. The data is complex and interconnected
      • Multiple types of data: database tables, text, time series, images, videos, graphs, etc.
      • Spatial and temporal aspects
      • Interconnected data of different types:
          • From a mobile phone we can collect the user's location, friendship information, check-ins to venues, opinions through Twitter, status updates on Facebook (FB), images through cameras, and queries to search engines.
  • 33. Data creates value
      • Natural language understanding is driven by data.
  • 34. Data creates value
      • Precision/personalized medicine: find the best treatment for patients using their genotype and all data that are related to them.
      • Also: understanding drug side effects through Google queries
  • 35. Data creates value
      • Self-driving cars: the car is the next computer. A future of smart cars that can drive themselves and learn from data.
      • Also: smart cities – urban computing
  • 36. Data creates value
      • Computers learn to play games by observing data.
  • 37. Data creates value
      • Use of data for crisis management
  • 38. Data creates value
      • All major soccer and basketball teams use data mining to make decisions. The German national team used special software for video analysis and concluded that the possession time per player should be reduced. Germany won the 2014 World Cup.
  • 39. Data creates value
      • James Harden defense
  • 40. Putting it all together: The Data Mining Pipeline (LinkedIn)
      • Diagram components, in the order they appear in the figure:
          • Data Pipeline: feature extraction, feature transformation, user modeling, data tracking & logging
          • Online Serving System: candidate generation, multi-pass rankers, real-time feedback
          • Model Fitting Pipeline (Hadoop): offline model fitting (cold-start model), nearline model fitting (warm-start model); labels: daily/weekly, minutes/hourly
          • Online A/B test, model evaluation
  • 41. Data Mining Example
      • Suppose that you were creating the Yemeni Facebook.
      • What kind of data would you collect and store?
          • Social network contacts
          • Interaction with contacts: messages, likes, replies, shares
          • Posts, content of posts
          • Interactions with feed: clicks, likes, comments, shares
          • Photos
          • Demographics: age, city, etc.
          • Ads seen, ads clicked
          • Products bought
          • Videos uploaded, videos consumed
          • And many more!
      • What would you do with this data?
  • 42. Exploratory Analysis
      • Make measurements to understand what the data looks like.
      • Example: posts
          • How often do users post, how many posts are there per user, when do they post, is there a correlation between the number of posts and the number of friends, etc.
      • This is one of the first steps when collecting data.
      • Metrics: deciding what to measure is important.
          • The example of the Web graph
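The measurements listed for the posts example can be sketched as follows. The post log, the friend counts, and the user ids are hypothetical; the correlation is the standard Pearson coefficient computed by hand to keep the example free of dependencies.

```python
from collections import Counter
from math import sqrt

# Hypothetical post log: (user id, hour of day at which the post was made).
posts = [("u1", 9), ("u1", 21), ("u1", 22), ("u2", 21), ("u3", 8), ("u3", 21)]
# Hypothetical number of friends per user.
friends = {"u1": 30, "u2": 10, "u3": 20}

posts_per_user = Counter(user for user, _ in posts)   # how many posts per user
posts_per_hour = Counter(hour for _, hour in posts)   # when do users post

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

users = sorted(friends)
correlation = pearson([posts_per_user[u] for u in users], [friends[u] for u in users])
busiest_hour, busiest_count = posts_per_hour.most_common(1)[0]
print(dict(posts_per_user), busiest_hour, correlation)
```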
  • 43. Exploiting similarities
      • Consider the following data for six users:
          • Number of times they have clicked on posts from these pages
      • What conclusion can we draw?

      User  NBA  ESPN  Sports.com  MSNBC  NY Times  Wall Street  Politico
      A     100    50          73     10         1            1         4
      B     500   200         400     20        10            4         1
      C      80   100          60      1         3            1         1
      D       4     2           1     12        90          100        80
      E       9     3           4      9       100           80        70
      F       3     4           5     30       300          200       500
  • 44. Exploiting similarities
      • Two types of users and two types of pages:
          • Sports and politics
      • Questions:
          • How do we compute similarity?
          • How do we group similar users? Clustering

      User  NBA  ESPN  Sports.com  MSNBC  NY Times  Wall Street  Politico
      A     100    50          73     10         1            1         4
      B     500   200         400     20        10            4         1
      C      80   100          60      1         3            1         1
      D       4     2           1     12        90          100        80
      E       9     3           4      9       100           80        70
      F       3     4           5     30       300          200       500
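One common way to answer both questions on this slide is cosine similarity, which compares the direction of two click vectors and ignores how active a user is overall. The sketch below applies it to the table above and then groups users greedily: a user joins the first group that contains a sufficiently similar member. The threshold of 0.5 and the greedy grouping are assumptions for illustration; this is not a full clustering algorithm such as k-means.

```python
from math import sqrt

# Click counts from the slide; columns are
# NBA, ESPN, Sports.com, MSNBC, NY Times, Wall Street, Politico.
CLICKS = {
    "A": [100, 50, 73, 10, 1, 1, 4],
    "B": [500, 200, 400, 20, 10, 4, 1],
    "C": [80, 100, 60, 1, 3, 1, 1],
    "D": [4, 2, 1, 12, 90, 100, 80],
    "E": [9, 3, 4, 9, 100, 80, 70],
    "F": [3, 4, 5, 30, 300, 200, 500],
}

def cosine(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def group_users(vectors, threshold=0.5):
    """Greedy grouping: join the first group with a member at least `threshold` similar."""
    groups = []
    for user in sorted(vectors):
        for group in groups:
            if any(cosine(vectors[user], vectors[m]) >= threshold for m in group):
                group.append(user)
                break
        else:
            groups.append([user])
    return groups

groups = group_users(CLICKS)
print(groups)
```

On this data the sports readers (A, B, C) and the politics readers (D, E, F) end up in separate groups, even though B and F click far more often than the others.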
  • 45. Exploiting similarities
      • What if we were missing this entry?
          • Can we fill in this value?
      • Similar users like items similarly: recommendation systems

      User  NBA  ESPN  Sports.com  MSNBC  NY Times  Wall Street  Politico
      A     100    50          73     10         1            1         4
      B     500   200         400     20        10            4         1
      C      80   100         ???      1         3            1         1
      D       4     2           1     12        90          100        80
      E       9     3           4      9       100           80        70
      F       3     4           5     30       300          200       500
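A minimal sketch of how the missing entry for user C might be estimated from similar users. The heuristic used here is an assumption, not the method of the slides: each other user's clicks on Sports.com are expressed relative to their clicks on the remaining pages, these ratios are averaged with cosine-similarity weights, and the result is scaled by C's own activity.

```python
from math import sqrt

# Click counts from the slide; None marks the missing Sports.com entry for user C.
CLICKS = {
    "A": [100, 50, 73, 10, 1, 1, 4],
    "B": [500, 200, 400, 20, 10, 4, 1],
    "C": [80, 100, None, 1, 3, 1, 1],
    "D": [4, 2, 1, 12, 90, 100, 80],
    "E": [9, 3, 4, 9, 100, 80, 70],
    "F": [3, 4, 5, 30, 300, 200, 500],
}
SPORTS_COM = 2  # column index of Sports.com

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def predict_missing(clicks, user, col):
    """Similarity-weighted estimate of clicks[user][col] (illustrative heuristic)."""
    def without_col(row):
        return [x for i, x in enumerate(row) if i != col]

    target = without_col(clicks[user])
    weighted_ratio, total_weight = 0.0, 0.0
    for other, row in clicks.items():
        if other == user or row[col] is None:
            continue
        rest = without_col(row)
        weight = cosine(target, rest)      # how similar is this user to the target?
        ratio = row[col] / sum(rest)       # clicks on the page relative to other clicks
        weighted_ratio += weight * ratio
        total_weight += weight
    return weighted_ratio / total_weight * sum(target)

predicted = predict_missing(CLICKS, "C", SPORTS_COM)
print(round(predicted))
```

The estimate is dominated by users A and B, whose click patterns resemble C's, so it lands in the range of a sports reader rather than near the single-digit counts of D, E, and F.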
  • 46. Amazon Recommendations
      • "People who have bought this also bought…"
      • A huge breakthrough for Amazon
          • Took advantage of the long tail
      • A big breakthrough for data mining in general
  • 47. Making predictions
      • Filling in the missing value can also be viewed as a prediction task.
      • Types of prediction tasks:
          • Predicting a real value (e.g., number of clicks): regression
          • Predicting a YES/NO value (e.g., will the user click?): binary classification
          • Predicting over multiple classes (e.g., what is the topic of a post?): classification
      • Can you think of prediction/classification tasks for your social network?
          • Ad click prediction
          • Ad click-through prediction
          • Like prediction
          • Predict whether a user will like one post over another: learning to rank
          • Predict whether a post is offensive
          • Predict whether a photo contains nudity
  • 48. Classification
      • Classification process:
          • Find features that describe an entity.
          • Use examples of the classes you want to predict.
          • Learn a model (function) that predicts the class.
      • Classification is the engine behind the AI revolution.
          • Used in all systems that make decisions
          • Became very powerful with deep learning
          • Huge applications in vision

      Tid  Refund  Marital Status  Taxable Income  Cheat
      1    Yes     Single          125K            No
      2    No      Married         100K            No
      3    No      Single          70K             No
      4    Yes     Married         120K            No
      5    No      Divorced        95K             Yes
      6    No      Married         60K             No
      7    Yes     Divorced        220K            No
      8    No      Single          85K             Yes
      9    No      Married         75K             No
      10   No      Single          90K             Yes

      Decision tree (figure):
          Refund = Yes → NO
          Refund = No → MarSt
              MarSt = Married → NO
              MarSt = Single, Divorced → TaxInc
                  TaxInc < 80K → NO
                  TaxInc > 80K → YES
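The decision tree on this slide is a function from features to a class, so it can be written directly as code and checked against the ten training records. This sketch hard-codes the tree from the figure rather than learning it. How to treat an income of exactly 80K is not specified on the slide; it is treated as "No" here as an assumption.

```python
# Training records from the slide:
# (Tid, Refund, Marital Status, Taxable Income in thousands, Cheat)
RECORDS = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

def predict_cheat(refund, marital_status, income_k):
    """Hand-coded version of the decision tree shown on the slide."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on taxable income.
    # Assumption: exactly 80K falls on the "No" side.
    return "Yes" if income_k > 80 else "No"

predictions = [predict_cheat(refund, status, income) for _, refund, status, income, _ in RECORDS]
labels = [cheat for *_, cheat in RECORDS]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(accuracy)
```

The tree classifies all ten training records correctly. That says nothing yet about new records, which is why evaluation on held-out data matters.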
  • 49. Deep learning
      • Machine learning systems that use neural networks with multiple layers and are trained on very large quantities of data
      • Able to learn complex representations and powerful models
      • Applications in recommendations, network analysis, text analysis, image recognition, car driving, playing games…
      • Require less feature engineering
  • 50. The social graph
      • Your Yemeni Facebook also has a social graph. What can you do with this data?
          • What is the shortest path between two nodes?
          • Who is important and influential in the graph?
          • How does information spread in the network? What becomes viral?
          • Will two users become friends in the future?
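The first question on this slide, the shortest path between two nodes, can be answered with breadth-first search when every friendship counts as one step. The sketch below uses a small hypothetical friendship graph; the names and edges are invented for illustration.

```python
from collections import deque

# Hypothetical undirected friendship graph as adjacency lists.
FRIENDS = {
    "Alice": ["Bob", "Charlie"],
    "Bob": ["Alice", "Dana"],
    "Charlie": ["Alice", "Dana"],
    "Dana": ["Bob", "Charlie", "Eve"],
    "Eve": ["Dana"],
    "Frank": [],  # isolated user
}

def shortest_path(graph, source, target):
    """Breadth-first search; returns one shortest path as a list, or None if unreachable."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:      # walk back to the source via parents
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbor in graph[node]:
            if neighbor not in parent:   # visit each node only once
                parent[neighbor] = node
                queue.append(neighbor)
    return None

path = shortest_path(FRIENDS, "Alice", "Eve")
print(path)
```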
  • 51. Node importance
      • What is the most important node in this graph?
      • Degree centrality and closeness centrality
      • The PageRank algorithm: a node is important if it is pointed to by other important nodes.
      • The idea of the basic PageRank algorithm is that the importance of a node depends on the number and importance of the neighbor nodes pointing to it.
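A minimal sketch of the basic PageRank idea described above, computed by repeated updates (power iteration), next to plain in-degree for comparison. The four-node graph, the damping factor of 0.85, and the fixed 100 iterations are assumptions chosen for illustration.

```python
# Hypothetical directed graph: node -> list of nodes it points to.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

def all_nodes(links):
    return sorted(set(links) | {t for targets in links.values() for t in targets})

def in_degree(links):
    """Degree-style importance: number of incoming links per node."""
    counts = {node: 0 for node in all_nodes(links)}
    for targets in links.values():
        for t in targets:
            counts[t] += 1
    return counts

def pagerank(links, damping=0.85, iterations=100):
    """Basic PageRank: each node passes its rank on, split evenly over its out-links."""
    nodes = all_nodes(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A node without out-links spreads its rank over all nodes.
                for t in nodes:
                    new_rank[t] += damping * rank[node] / n
        rank = new_rank
    return rank

rank = pagerank(LINKS)
degrees = in_degree(LINKS)
print(max(rank, key=rank.get), degrees)
```

In this graph A has only one incoming link, but that link comes from the highly ranked node C, so A ends up ranked well above B and D. In-degree alone would not show this.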
  • 52. The Web as a graph
      • When ranking pages, authoritativeness is factored into the ranking.
          • This is the idea that made Google a success around 2000.
      • Today a lot more information is used, such as clicks, browsing behavior, etc.
      • Ranking pages is a very complex task that requires sophisticated techniques.
  • 54. Friendship suggestions
      • LinkedIn, Twitter, Facebook friendship suggestions
      • Useful for users to discover their friends, but also useful for the network in order to grow and increase engagement
          • LinkedIn success story
      • Triadic closure principle: links are created in a way that usually closes a triangle.
          • If both Bob and Charlie know Alice, then they are likely to meet at some point.
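The triadic closure principle suggests a simple recommendation heuristic: propose people who are not yet friends but share many common friends. The sketch below counts common friends in a hypothetical graph. Production systems at the networks named above use far richer signals; this only illustrates the principle.

```python
# Hypothetical undirected friendship graph as sets of friends.
FRIENDS = {
    "Alice": {"Bob", "Charlie"},
    "Bob": {"Alice", "Dana"},
    "Charlie": {"Alice", "Dana"},
    "Dana": {"Bob", "Charlie"},
}

def suggest_friends(friends, user):
    """Rank non-friends by the number of friends they share with `user`."""
    scores = {}
    for friend in friends[user]:
        for candidate in friends[friend]:
            if candidate != user and candidate not in friends[user]:
                # Each shared friend is one open triangle that could close.
                scores[candidate] = scores.get(candidate, 0) + 1
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

suggestions_for_bob = suggest_friends(FRIENDS, "Bob")
print(suggestions_for_bob)
```

Bob and Charlie both know Alice and Dana, so Charlie is suggested to Bob with two common friends.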
  • 55. What is Data Mining again?
      • "Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data analyst" (Hand, Mannila, Smyth)
      • "Data mining is the discovery of models for data" (Rajaraman, Ullman)
          • We can have the following types of models:
              • Models that explain the data (e.g., a single function)
              • Models that predict future data instances
              • Models that summarize the data
              • Models that extract the most prominent features of the data
      • "Data Mining is the study of collecting, processing, analyzing, and gaining useful insights from data" – Charu Aggarwal
  • 56. The buzz around data
      • Data Science: Data is useful for understanding a process and improving it. All organizations should have a data science team that analyzes their data and proposes improvements.
          • Focuses on more immediate applications and insights
      • Big Data: Data appears everywhere. We should process it collectively and interconnect it. We need infrastructure (cloud computing, cloud storage) to do this.
          • More systems oriented
      • AI/Machine Learning/Deep Learning: These have been around for a while, but now we have the data to learn more complex models that are significantly more powerful.
          • More emphasis on scientific breakthroughs
  • 57. New era of data mining
      • Boundaries are becoming less clear.
          • Today data mining, machine learning, and AI are synonymous. It is assumed that the algorithms should scale. It is clear that statistical inference is used for building the models.
      • Data is the engine for AI.
      • Data mining touches everything related to data.
  • 58. Which also has a dark side
      • Are the algorithms making fair and correct decisions?
      • Do algorithms create filter bubbles and echo chambers, and promote misinformation? Are they a threat to democracy?
      • Surveillance capitalism
      • Is AI a threat?
  • 59. The Skills of a Data Miner – Data Scientist
      • It is a hard job.
  • 60. But also a rewarding one
      • "The success of companies like Google, Facebook, Amazon, and Netflix, not to mention Wall Street firms and industries from manufacturing and retail to healthcare, is increasingly driven by better tools for extracting meaning from very large quantities of data. 'Data Scientist' is now the hottest job title in Silicon Valley." – Tim O'Reilly
      • Sexiest Job but…
  • 61. Homework 1
      • Read chapter one and answer the questions on pages 34 and 35, by hand.
      • Data Scientist: The Sexiest Job of the 21st Century. PowerPoint presentation.
      • http://www.cse.msu.edu/~ptan/dmbook/software/