Advanced IS Design
Lecture 1
Web Mining
Overview
• Challenges in Web Mining
• Basics of Web Mining
• Classification of Web Mining
Web Mining
Web mining is the application of data mining techniques to automatically discover and extract information from Web data, including Web documents, hyperlinks between documents, and website usage logs.
• Web mining is a multidisciplinary field that draws on:
  - Data mining
  - Machine learning
  - Natural language processing
  - Statistics
  - Databases
  - Information retrieval, multimedia, etc.
Web mining challenges
• The Web has many unique characteristics, which make mining useful information and knowledge a fascinating and challenging task.
• The amount of information on the Web is huge and easily accessible.
• Information of almost all types exists on the Web, e.g., structured tables, text, and multimedia data.
• Much of the information on the Web is redundant. The same piece of information, or a variant of it, may appear on many pages.
• The Web is noisy. A Web page typically contains a mixture of many kinds of information, e.g., main content, advertisements, navigation panels, and copyright notices.
Web mining challenges
• The Web is dynamic. Information on the Web changes constantly, so keeping up with these changes and monitoring them are important issues.
• Above all, the Web is a virtual society. It is not only about data, information, and services, but also about interactions among people, organizations, and automated systems, i.e., communities.
Classification of Web Mining Techniques
• Web Structure Mining
• Web Usage Mining
• Web Content Mining
Web-Structure Mining
• Discovering useful knowledge from hyperlinks, which represent the structure of the Web.
• Link mining refers to data mining techniques that explicitly consider these links when building predictive or descriptive models of the linked data. Such models are used in applications including:
  - Search engines: discovering important Web pages.
  - Social network analysis: discovering communities of users who share common interests.
  - Citation analysis (co-citation and bibliographic coupling).
Web-Usage Mining
• Discovery of user access patterns from Web usage logs, which record user clickstreams.
• Clickstream
  - A clickstream is the record of what a user clicks on while browsing the Web. Whenever the user clicks anywhere on a web page, the action is logged on the client or on the Web server, and possibly by other sources as well.
Web-Usage Mining
• Clickstream analysis answers questions such as the following:
  - Which web page is the most common point of entry for users?
    - Are visitors entering through the gateway constructed by the website developers, or are they somehow bypassing the gateway and landing in the middle of the website?
  - In which order have the pages been viewed?
    - Is this page sequence what the developers expected, or are the users trying to tell us something about how the website should be structured?
  - Which other websites referred the users to your website?
    - Which referrer sites are providing the greatest number of referrals?
  - How many web pages are viewed in a typical visit?
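The questions above can be sketched as simple aggregations over a clickstream. This is a minimal illustration only: the record layout (session id, timestamp, page, referrer), the sample data, and the function name `summarize` are all assumptions made for the example, not part of the lecture.

```python
from collections import Counter, defaultdict

# Hypothetical, already-cleaned clickstream records:
# (session_id, timestamp_seconds, page, referrer)
clicks = [
    ("s1", 10, "/home", "search.example"),
    ("s1", 25, "/products", "/home"),
    ("s1", 40, "/cart", "/products"),
    ("s2", 12, "/products", "blog.example"),
    ("s2", 30, "/cart", "/products"),
    ("s3", 5, "/home", "search.example"),
]

def summarize(clicks):
    """Return entry-page counts, referrer counts, page paths, and mean pages per visit."""
    sessions = defaultdict(list)
    for session_id, ts, page, referrer in clicks:
        sessions[session_id].append((ts, page, referrer))

    entry_pages = Counter()   # most common point of entry
    referrers = Counter()     # who referred the visit (referrer of the first click)
    paths = {}                # order in which pages were viewed
    for session_id, events in sessions.items():
        events.sort()  # order clicks by timestamp
        _, first_page, first_referrer = events[0]
        entry_pages[first_page] += 1
        referrers[first_referrer] += 1
        paths[session_id] = [page for _, page, _ in events]

    mean_pages = sum(len(p) for p in paths.values()) / len(paths)
    return entry_pages, referrers, paths, mean_pages

entry_pages, referrers, paths, mean_pages = summarize(clicks)
print(entry_pages.most_common(1))  # [('/home', 2)]
print(referrers.most_common(1))    # [('search.example', 2)]
print(paths["s1"])                 # ['/home', '/products', '/cart']
print(mean_pages)                  # 2.0
```

Real logs would first need cleaning and session identification, which this sketch takes as given.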
Web-Usage Mining Benefits
• Restructure a website
• Extract user access patterns to target ads
• Count the number of accesses to individual files
• Predict user behavior based on previously learned rules and users’ profiles
Web-Usage Mining Techniques
• Data Preprocessing
  - Conversion of the raw data in usage logs into a form suitable for mining (e.g., data cleaning).
• Pattern Discovery
  - Uses algorithms and techniques from data mining, machine learning, statistics, and pattern recognition.
  - Common data mining techniques are association rule mining and sequential pattern mining.
• Pattern Analysis
  - Validation and interpretation of the mined patterns.
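The three stages can be illustrated with a small sketch. It assumes a hypothetical log layout (client IP, path, HTTP status), treats each client IP as one visit, and uses an arbitrary support threshold of 0.5; these are simplifications for the example, not a description of any particular tool.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical raw log records: (client_ip, requested_path, http_status)
raw_log = [
    ("10.0.0.1", "/home", 200),
    ("10.0.0.1", "/logo.png", 200),
    ("10.0.0.1", "/products", 200),
    ("10.0.0.2", "/home", 200),
    ("10.0.0.2", "/products", 200),
    ("10.0.0.2", "/cart", 200),
    ("10.0.0.3", "/home", 200),
    ("10.0.0.3", "/missing", 404),
]

# Assumed list of non-page resources to discard during cleaning.
IGNORED_SUFFIXES = (".png", ".jpg", ".gif", ".css", ".js")

def preprocess(log):
    """Data preprocessing: keep successful page requests, grouped per client."""
    sessions = defaultdict(set)
    for ip, path, status in log:
        if status != 200 or path.endswith(IGNORED_SUFFIXES):
            continue
        sessions[ip].add(path)
    return sessions

def discover_rules(sessions, min_support=0.5):
    """Pattern discovery: association rules X -> Y between pairs of pages.

    Returns tuples (X, Y, support, confidence).
    """
    n = len(sessions)
    page_counts = Counter(p for pages in sessions.values() for p in pages)
    pair_counts = Counter(
        pair
        for pages in sessions.values()
        for pair in combinations(sorted(pages), 2)
    )
    rules = []
    for (x, y), count in pair_counts.items():
        support = count / n
        if support >= min_support:
            rules.append((x, y, support, count / page_counts[x]))
            rules.append((y, x, support, count / page_counts[y]))
    return rules

sessions = preprocess(raw_log)
rules = discover_rules(sessions)

# Pattern analysis: inspect the rules, strongest confidence first.
for x, y, support, confidence in sorted(rules, key=lambda r: -r[3]):
    print(f"{x} -> {y}: support={support:.2f}, confidence={confidence:.2f}")
```

On this toy log, only the pair `/home` and `/products` reaches the threshold, so two rules are reported, one in each direction.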
Web Content Mining
• Discovering useful information or knowledge from Web page contents.
• Web content includes text, images, audio, video, metadata, and hyperlinks.
• Technologies normally used in Web content mining are NLP (Natural Language Processing) and IR (Information Retrieval).
Web Content Mining Applications
• Web information integration and schema matching (Lecture 2)
• Opinion extraction from online sources (Lecture 3)
• Knowledge synthesis (representation) (Lecture 4)
Social Network Analysis
CS583, Bing Liu, UIC 15
Social network analysis
• Social network analysis is the study of social entities (people in an organization, called actors) and their interactions and relationships.
• The interactions and relationships can be represented with a network or graph, in which
  - each vertex (or node) represents an actor, and
  - each link represents a relationship.
• From the network, we can study the properties of its structure and find various kinds of sub-graphs, e.g., communities formed by groups of actors.
• We study two types of social network analysis, centrality and prestige, which are closely related to hyperlink analysis and search on the Web.
Centrality
• Important or prominent actors are those that are extensively linked to or involved with other actors.
• A person with extensive contacts (links) or communications with many other people in the organization is considered more important than a person with relatively few contacts.
• The links can also be called ties. A central actor is one involved in many ties.
Centrality
Based on the varying notions of importance of
vertices or edges, different centrality measures
were developed:
1. Degree centrality
2. Betweenness centrality
3. Closeness centrality
Degree Centrality
Central actors are the most active actors, those that have the most links or ties with other actors. Let the total number of actors in the network be n.
• Undirected graph: the degree centrality of an actor i, denoted by C_D(i), is simply the node degree d(i) (the number of edges of the actor node), normalized by the maximum possible degree, n - 1:
  $$C_D(i) = \frac{d(i)}{n - 1}$$
• The value of this measure ranges between 0 and 1, as n - 1 is the maximum value of d(i).
• Directed graph: in this case, we need to distinguish the in-links of actor i (links pointing to i) from its out-links (links pointing out from i). The degree centrality is defined based on only the out-degree d_o(i) (the number of out-links):
  $$C_D'(i) = \frac{d_o(i)}{n - 1}$$
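As a minimal sketch of this definition, the function below divides each actor's degree (or out-degree) by n - 1. The dictionary-of-sets representation and both sample networks are hypothetical and chosen only for illustration; the network is assumed to have at least two actors.

```python
def degree_centrality(links):
    """Normalized degree centrality: degree divided by n - 1.

    `links` maps each actor to the set of actors it is linked to:
    neighbours for an undirected graph, out-link targets for a directed one.
    """
    n = len(links)
    return {actor: len(targets) / (n - 1) for actor, targets in links.items()}

# Hypothetical undirected star network: "hub" is tied to every other actor.
undirected = {
    "hub": {"x", "y", "z"},
    "x": {"hub"},
    "y": {"hub"},
    "z": {"hub"},
}

# Hypothetical directed network, listing out-links only.
directed = {
    "a": {"b", "c"},
    "b": {"c"},
    "c": set(),
}

print(degree_centrality(undirected)["hub"])  # 1.0, the maximum
print(degree_centrality(directed))           # {'a': 1.0, 'b': 0.5, 'c': 0.0}
```

The same function serves both cases because only the meaning of the link sets changes: all ties in the undirected case, out-links only in the directed case.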
Closeness Centrality
This view of centrality is based on closeness, or distance. The basic idea is that an actor i is central if it can easily interact with all other actors, that is, if its distance to all other actors is short. Thus, we can use shortest distances to compute this measure. Let the shortest distance from actor i to actor j be d(i, j) (measured as the number of links in a shortest path).
• Undirected graph: the closeness centrality C_C(i) of actor i is defined as
  $$C_C(i) = \frac{n - 1}{\sum_{j=1}^{n} d(i, j)}$$
• The value of this measure also ranges between 0 and 1, as n - 1 is the minimum value of the denominator, which is the sum of the shortest distances from i to all other actors.
• Directed graph: the same equation can be used for a directed graph, but the distance computation needs to take the directions of the links into account.
Closeness Centrality
• C_C(d) = 0.75
• d is at distance 1 from 4 nodes and at distance 2 from 2 nodes, so the sum of its shortest distances is
  $$\sum_{j \neq d} \mathrm{dist}(d, j) = 1 + 1 + 1 + 1 + 2 + 2 = 8$$
• Since there are 7 nodes in the network, the numerator of the closeness equation is n - 1 = 6, and the closeness centrality of d is 6/8 = 0.75.
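The calculation can be checked with a short breadth-first search. The slide's network figure is not available here, so the edge list below is a hypothetical 7-node graph chosen only so that d has four neighbours and two nodes at distance 2. The closeness formula used is (n - 1) divided by the sum of shortest distances, and the graph is assumed to be connected.

```python
from collections import deque

# Hypothetical 7-node undirected network (an assumption, not the slide's figure).
EDGES = [
    ("a", "b"), ("a", "c"), ("b", "c"),
    ("b", "d"), ("d", "e"), ("d", "f"), ("d", "g"),
]

def build_adjacency(edges):
    """Turn an undirected edge list into {node: set_of_neighbours}."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency

def shortest_distances(adjacency, source):
    """Breadth-first search: number of links on a shortest path from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def closeness_centrality(adjacency, i):
    """(n - 1) divided by the sum of shortest distances from i."""
    dist = shortest_distances(adjacency, i)
    n = len(adjacency)
    return (n - 1) / sum(dist.values())

adjacency = build_adjacency(EDGES)
print(sum(shortest_distances(adjacency, "d").values()))  # 8
print(closeness_centrality(adjacency, "d"))              # 0.75
```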
Betweenness Centrality
• If two non-adjacent actors j and k want to interact and actor i is on the path between j and k, then i may have some control over the interactions between j and k.
• Betweenness measures this control of i over other pairs of actors. Thus, if i is on the paths of many such interactions, then i is an important actor.
Betweenness Centrality (cont …)
• Undirected graph: let p_jk be the number of shortest paths between actor j and actor k.
• The betweenness of an actor i is defined as the number of shortest paths that pass through i, denoted p_jk(i), normalized by the total number of shortest paths p_jk and summed over all pairs of actors j and k:
  $$C_B(i) = \sum_{j<k} \frac{p_{jk}(i)}{p_{jk}}$$
Betweenness Centrality
• C_B(b) = 16, as all the shortest paths from any node in the set {a, c} to any node in the set {d, e, f, g} pass through b.
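A minimal sketch of the betweenness computation follows. The slide's figure is not available, so the edge list is a hypothetical graph in which b separates {a, c} from {d, e, f, g}. The code sums p_jk(i) / p_jk over unordered pairs j < k, which gives 8 on this graph. The value 16 quoted above would be consistent with counting every pair in both directions, although that reading is an assumption.

```python
from collections import deque
from itertools import combinations

# Hypothetical 7-node undirected network (an assumption, not the slide's figure).
EDGES = [
    ("a", "b"), ("a", "c"), ("b", "c"),
    ("b", "d"), ("d", "e"), ("d", "f"), ("d", "g"),
]

def build_adjacency(edges):
    """Turn an undirected edge list into {node: set_of_neighbours}."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency

def bfs_counts(adjacency, source):
    """Return (dist, paths): shortest distance and number of shortest paths from source."""
    dist = {source: 0}
    paths = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                paths[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                # every shortest path to u extends to a shortest path to v
                paths[v] += paths[u]
    return dist, paths

def betweenness(adjacency, i):
    """Sum over unordered pairs j < k (both different from i) of p_jk(i) / p_jk."""
    info = {node: bfs_counts(adjacency, node) for node in adjacency}
    dist_i, paths_i = info[i]
    others = sorted(node for node in adjacency if node != i)
    total = 0.0
    for j, k in combinations(others, 2):
        dist_j, paths_j = info[j]
        if k not in dist_j or i not in dist_j:
            continue  # j and k (or j and i) are not connected
        if dist_j[i] + dist_i[k] == dist_j[k]:
            # i lies on a shortest j-k path: count the paths through it
            total += paths_j[i] * paths_i[k] / paths_j[k]
    return total

adjacency = build_adjacency(EDGES)
unordered = betweenness(adjacency, "b")
print(unordered)      # 8.0 (pairs {a, c} x {d, e, f, g})
print(2 * unordered)  # 16.0 when each pair is counted in both directions
```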
THANK YOU