MINING CLIENT SIDE PARADATA FOR
      ADAPTIVE WEBPAGES
                       By
           Rami Shawkat Hatem Al-Salman


                     Advisor
Dr. Natheer Khasawneh


                    Co-Advisor
              Dr. Ahmad Al-Hammouri
Page  1
Contents


 Introduction.
 Server logs data.
 Clients data.
 Framework for collecting and mining client side data.
 Three case studies.
 Results and Discussions.

 Conclusions.

 Future Work.




Page  2
Introduction


 In recent years, a large number of websites have been published.


 Current web applications aim to interact with users through rich and
  dynamic content.


 In recent years, JavaScript has evolved to interact not only with the client
  side but also with the server side; thus, Asynchronous JavaScript and XML
  (AJAX) was introduced.


 Web personalization is applied by several websites.




Page  3
Web personalization


 Web personalization aims to adapt a website to each user’s specific
  environment, needs, and domain.


 Many websites use recommender systems to support web
  personalization.


 Webpages are personalized based on client preferences (e.g., interests,
  country, gender, etc.).




Page  4
AMAZON & Web personalization


 Amazon uses a recommender system that relies on the collaborative filtering
  technique to produce personal recommendations.


 Personal (client) recommendations are generated by computing the similarity
  between the client’s preferences and those of other clients.


 The collaborative filtering technique consists of three steps (a minimal
  sketch follows this list):
        Record the preferences of a group of clients.
        Choose a group of clients whose preferences are similar to the target
         client’s, using a similarity metric.
        Recommend options (e.g., products) to the target client.



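A minimal user-based collaborative-filtering sketch of the three steps above, in JavaScript. It is illustrative only: the rating layout, the cosine similarity, and the top-k neighbourhood are assumptions, not Amazon's actual (item-to-item) implementation.

    // Step 1: preferences are assumed to be objects mapping item -> rating.
    // Step 2: pick the k clients most similar to the target (cosine similarity).
    // Step 3: recommend items those neighbours rated but the target has not seen.
    function cosine(a, b) {
      var dot = 0, na = 0, nb = 0, item;
      for (item in a) {
        na += a[item] * a[item];
        if (item in b) dot += a[item] * b[item];
      }
      for (item in b) nb += b[item] * b[item];
      return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
    }

    function recommend(target, others, k) {
      var neighbours = others
        .map(function (u) { return { prefs: u, sim: cosine(target, u) }; })
        .sort(function (x, y) { return y.sim - x.sim; })
        .slice(0, k);
      var scores = {};
      neighbours.forEach(function (n) {
        for (var item in n.prefs) {
          if (!(item in target)) scores[item] = (scores[item] || 0) + n.sim * n.prefs[item];
        }
      });
      return Object.keys(scores).sort(function (a, b) { return scores[b] - scores[a]; });
    }

    // Example: recommend(clientA, [clientB, clientC], 2) returns item names
    // ordered by similarity-weighted score.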
Page  5
AMAZON as a real example




           (Figure: two Amazon recommendation panels, one based on the client's
            browsing history and one based on the preferences of people with a
            similar profile.)


Page  6
AMAZON as a real example




                   (Figure: Amazon recommendations based on the most recently
                    viewed items.)
Page  7
Server logs data


 A server log is a log file that contains vectors of data recorded by the web
  server.

 Analyzing server logs helps in understanding client behavior (e.g., the pages
  with the most and the least traffic).

  Example server log entry:

  Entry name   Server Log Info
  IP-Address   178.77.146.157
  date         [03/Jan/2011:15:20:06 -0800]
  request      "GET /default.ASPX HTTP/1.0"
  status       200
  bytes        8788
  referrer     http://www.just.edu.jo
  agent        "Mozilla/3.0 WebTV/1.2 (compatible; MSIE 2.0)"




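As a rough illustration of what one such log vector looks like in code, here is a hedged JavaScript sketch that parses a single Apache combined-format line into the fields listed above; the regular expression is a simplification, and real log layouts depend on the server configuration.

    // Parse one Apache "combined" access-log line into its fields.
    var LOG_LINE = /^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"/;

    function parseLogLine(line) {
      var m = LOG_LINE.exec(line);
      if (!m) return null;                       // line does not match the expected format
      return {
        ip: m[1],
        date: m[2],
        request: m[3],
        status: Number(m[4]),
        bytes: m[5] === '-' ? 0 : Number(m[5]),  // "-" means no body was sent
        referrer: m[6],
        agent: m[7]
      };
    }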
Page  8
Apache server access.log




Page  9
Clients data


 Client data is data recorded from the client's navigation over the elements of
  the visited webpage.

 Client data can capture the interactions between clients and the elements of
  the visited webpage.
             For example: the name, value, and time spent on a specific
              webpage element.

  Example client data entry:

  Entry name      Client Info
  Element name    DIV1
  Element value   Yes
  Spent time      156.77 seconds
  IP-Address      178.77.146.157
  date            [03/Jan/2011:15:20:06 -0800]
  request         "GET /default.ASPX HTTP/1.0"
  status          200
  bytes           8788
  referrer        http://www.just.edu.jo
  agent           "Mozilla/3.0 WebTV/1.2 (compatible; MSIE 2.0)"




Page  10
Clients data example




Page  11
Problem statement


 Most previous studies investigate server log data.


 These studies used Web Usage Mining (WUM) techniques for extracting
  knowledge from this data.


 Some tools and systems have been proposed for tracking client data.


 The previous studies related to client data have not shown the usefulness of
  client data.


 Unfortunately, until now there has been no complete framework that can both
  record and mine client log data.
Page  12
Motivations


 Some entries can be extracted from the client’s mouse movements over
  the visited Webpage.


 Extracting useful knowledge from client data will help in understanding
  clients’ behaviors and attitudes in a better way.


 Support clients with appropriate recommendations.


 Understanding clients’ behaviors and needs will improve the advertising of
  products on the WWW.




Page  13
Contributions


 Until now, there has been no complete framework that can both record and
  mine client data.
 Thus, the main contribution of this thesis is to build a complete framework
  that records clients’ events and applies WUM techniques to this data.
             We mainly show the usefulness of the clients’ data.
• We customize the clients’ data and then apply WUM techniques to it.
• We build three different web applications and then integrate our
  framework with them.
• We build a recommendation engine that is able to discover the
  clients’ patterns.
• We extract the useful information from the clients’ data.
             We generate a clients’ data model based on client data statistics.
Page  14
Framework for collecting and mining client side data


 We propose a framework to record and mine client-side data.
 Our framework consists of five consecutive phases:
             Session identification


             Events identification and recording.


             Events storing.


             Merging and exporting events.


             Web mining.

Page  15
Framework for collecting and mining client side data




Page  16
Session identification


 Once a client requests a webpage, a session id is assigned to that client.


 The session id is the number of milliseconds since midnight, 1 January 1970;
  this way, the session id assigned to each client is unique (see the sketch
  below).


 The generated session id is used to identify all recorded events that belong
  to the same user.


 The client's session is finished by a designated target button or link.




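A minimal JavaScript sketch of this step; the variable name and the idea of reusing the id for every event on the page are assumptions about the implementation, but the id itself is exactly the millisecond timestamp described above.

    // Assign one session id per page load: milliseconds since 1 Jan 1970 (UTC).
    var sessionId = String(new Date().getTime());   // e.g. "1294068006000"

    // Every event recorded on this page is tagged with the same sessionId,
    // so the server can later group all events that belong to one visit.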
Page  17
Events identification and recording


 We identify web elements and associated events.


 The client data, together with the session id, is transferred via an
  XMLHttpRequest AJAX call.


 With AJAX, transferring the data is a lightweight operation (clients do not
  notice when data is transferred to the server).


 Seven values are recorded: name, value, item time, session id, date, total
  mouse clicks, and Personalized.


 Personalized represents the web element that finishes the session.
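A hedged sketch of how such an event could be sent; the endpoint name and parameter names are assumptions, but the mechanism (an asynchronous XMLHttpRequest POST carrying the recorded values plus the session id) is the one described above.

    // Send one recorded event to the server without blocking the page.
    function sendEvent(name, value, itemTime, sessionId, totalClicks, personalized) {
      var xhr = new XMLHttpRequest();
      var params = 'name=' + encodeURIComponent(name) +
                   '&value=' + encodeURIComponent(value) +
                   '&itemTime=' + encodeURIComponent(itemTime) +
                   '&sessionId=' + encodeURIComponent(sessionId) +
                   '&date=' + encodeURIComponent(new Date().toUTCString()) +
                   '&totalClicks=' + encodeURIComponent(totalClicks) +
                   '&personalized=' + encodeURIComponent(personalized);
      xhr.open('POST', '/recordEvent.aspx', true);   // hypothetical endpoint, asynchronous
      xhr.setRequestHeader('Content-Type', 'application/x-www-form-urlencoded');
      xhr.send(params);
    }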
Page  18
Cont, Events identification and recording


 Our events are classified into two categories:
       Clickstream-based.
       Time based.


 In the clickstream-based category, the name and value of the clicked element
  are transferred.


 In the time-based category, the name, value, and time spent on the web
  element are transferred.




Page  19
Snapshot of clickstream-based data (Events storing)




Page  20
Snapshot of time-based data (Events storing)




Page  21
Merging and Exporting data


     The records are grouped per client session (session id).
     Our merging algorithm works as follows (see the sketch below):
      1. Load the list of session ids.
      2. For each session id:
            i.   If the data is clickstream-based, accumulate the sequence of
                 clicks.
            ii. If the data is time-based, accumulate the time spent over each
                element.


     The merged data is exported to another database table.
     The output of this phase is the input for the web mining phase.



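A JavaScript sketch of the merging step described above; the record field names are assumptions about the stored-event layout.

    // Group raw event records by session id and accumulate either the ordered
    // click sequence (clickstream-based) or the time spent per element (time-based).
    function mergeBySession(records, mode) {
      var sessions = {};
      records.forEach(function (r) {
        var s = sessions[r.sessionId] || (sessions[r.sessionId] = { clicks: [], times: {} });
        if (mode === 'clickstream') {
          s.clicks.push(r.name);                       // preserve click order
        } else {
          s.times[r.name] = (s.times[r.name] || 0) + r.spentTime;
        }
      });
      return sessions;                                 // exported to the merged table
    }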
Page  22
Snapshot of merging data in clickstream-based




Page  23
Snapshot of merging data in time-based




Page  24
Web Mining


 As in every data mining task, the process of Web Usage Mining consists
  of three steps:
      • Data preprocessing.
      • Pattern discovery and web mining.
      • Information and Pattern analysis.




Page  25
Data preprocessing


 Preprocessing (the data cleaning process) aims to remove irrelevant data
  and keep the consistent data.


 Preprocessing is performed based on thresholds.


 We mainly use two thresholds (see the sketch below):
            – The total session time.
            – The total number of visited elements.




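A sketch of the threshold-based cleaning in JavaScript; the field names and the direction of the comparisons (dropping sessions below the thresholds) are assumptions.

    // Keep only sessions that reach both thresholds; everything else is pruned.
    function cleanSessions(sessions, minElements, minSessionTime) {
      return sessions.filter(function (s) {
        return s.elements.length >= minElements && s.totalTime >= minSessionTime;
      });
    }

    // e.g. for the TinyMCE study described later: cleanSessions(allSessions, 10, 200)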
Page  26
Pattern discovery and web mining




Page  27
Information and Pattern analysis


 In most cases, analyzing the generated patterns and information allows us to
  understand client behavior in depth.


 The output of this step can be formulated in many forms.


 One of the most important forms is a generated model, which is usually
  extracted from the statistics (e.g., frequencies).




Page  28
Three case studies


    To validate the proposed framework, we integrated it with three different
     web applications.
    The three web applications are:
            1. Web-based editor control (TinyMCE).
            2. E-commerce web application.
            3. E-survey web application.
    The three web applications are hosted online.




Page  29
TinyMCE


 TinyMCE is a platform-independent, web-based JavaScript HTML editor
  control.
 We modified the TinyMCE source code to integrate the proposed framework
  with it.
 The events of TinyMCE belong to the general (clickstream-based) data
  category.
 We applied data mining to cluster and discover the clients’ sequence
  patterns.
 Finally, we classify the clustered output.




Page  30
Snapshot of TinyMCE




Page  31
Data Collection


 As a source of data, 60 students from JUST in the CPE 411 and CPE 311
  classes were asked to use our system.


 We asked the students to use TinyMCE to write an advertisement about
  JUST that encourages students from European Union (EU) countries to study
  at JUST.
 The click events are recorded.


 The events are merged in the general (clickstream-based) data mode.


 The merged data is the input for the data preprocessing step.


Page  32
Snapshot of merged data




Page  33
Data Preprocessing


     The collected data was preprocessed by removing invalid sequences.


     The invalid sequences were determined based on two thresholds:
            1. The number of clicked controls.
            2. The total session time spent in the sequence.
     Heuristically, we used 10 clicks as the first threshold and 200 seconds as
      the second threshold.


     The data preprocessing step reduced the total number of sequences to
      36 (24 sequences were removed).




Page  34
Clustering


 We separated the students’ sequences into clusters of similar clickstream
  sequences.
 We applied the K-means clustering technique with heuristically chosen
  numbers of clusters equal to two, three, and four.
 We used edit distance as the distance measure to calculate the similarity or
  dissimilarity between any two objects and the cluster mean (see the sketch
  below).
 The main goal of clustering is to label the students’ sequences.




                                            (Figure: the points represent the
                                             students’ sequences.)

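The edit (Levenshtein) distance used as the K-means dissimilarity measure, sketched in JavaScript; the sequences are assumed to be arrays of clicked-control names.

    // Classic dynamic-programming edit distance between two click sequences.
    function editDistance(a, b) {
      var d = [], i, j;
      for (i = 0; i <= a.length; i++) d[i] = [i];
      for (j = 0; j <= b.length; j++) d[0][j] = j;
      for (i = 1; i <= a.length; i++) {
        for (j = 1; j <= b.length; j++) {
          var cost = a[i - 1] === b[j - 1] ? 0 : 1;
          d[i][j] = Math.min(d[i - 1][j] + 1,          // deletion
                             d[i][j - 1] + 1,          // insertion
                             d[i - 1][j - 1] + cost);  // substitution
        }
      }
      return d[a.length][b.length];
    }

    // e.g. editDistance(['bold', 'italic'], ['bold', 'undo']) === 1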
Page  35
Pattern discovery


 The clustered sequences are used as input to the pattern discovery
  algorithm.
 We applied the Generalized Sequential Pattern (GSP) algorithm to extract the
  patterns from each cluster.
 GSP not only discovers the pattern sequences but also preserves the order
  of these patterns (a simplified sketch of its support counting follows below).
 The output of GSP is the top ten patterns for each cluster.
 These patterns are assigned later, in the classification step.




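A simplified JavaScript sketch of the support-counting idea GSP builds on; the full algorithm also generates candidate sequences level by level and prunes them, which is omitted here.

    // Does `sequence` contain `pattern` as an ordered (not necessarily contiguous) subsequence?
    function containsSubsequence(sequence, pattern) {
      var p = 0;
      for (var i = 0; i < sequence.length && p < pattern.length; i++) {
        if (sequence[i] === pattern[p]) p++;
      }
      return p === pattern.length;
    }

    // Fraction of sequences in a cluster that support a candidate pattern.
    function support(cluster, pattern) {
      var hits = cluster.filter(function (seq) {
        return containsSubsequence(seq, pattern);
      });
      return hits.length / cluster.length;
    }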
Page  36
Classification


 The output of the clustering step was used as input to the classification
  models.


 Total session time, the number of controls, and the clickstream sequence are
  used as the three features for our classification models (see the sketch below).


 The classification models are trained based on these features and data.


 We use two classifiers, Naive Bayes and Support Vector Machines.


 After the training phase, our classifiers were able to classify new clients into
  one of two, three, or four classes.
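A sketch of building the three features per merged session before training; the field names are assumptions about the merged-record layout.

    // Build the feature vector fed to the Naive Bayes and SVM classifiers.
    function toFeatureVector(session) {
      return {
        totalSessionTime: session.totalTime,       // seconds spent in the session
        numberOfControls: session.clicks.length,   // how many controls were clicked
        clickstream: session.clicks.join('>')      // ordered click sequence as one string
      };
    }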
Page  37
E-commerce system


 In the second case study, an E-commerce web application is built from
  scratch.
 We integrate our framework with it.
 Our E-commerce system offers two categories of products: cameras and
  mobiles.
 The main goal of this web application is to prove that the classification of
  similar clients can be done easily and directly.
 Each product has seven features.




Page  38
Snapshot of E-commerce system for Mobiles




Page  39
Snapshot of E-commerce system for Cameras




Page  40
Data Collection


 As a source of data we depend on three sources:
      • Students from JUST University.
      • Students from Heinrich-Heine University of Duesseldorf (Germany).
      • Social network websites (Facebook, Myspace, etc.).
 We record the events.
 The events are merged in a time-based mode.
 In the time-based mode, the times spent over each cell within a specific user
  session are aggregated.
 Based on our database statistics, 58 clients bought cameras and 54
  clients bought mobiles.




Page  41
Snapshot of merged data in time-based mode




Page  42
Data Preprocessing


 The total session time and the number of visited features are used as two
  thresholds.
 Based on our experiments, we set the total-session-time threshold to 20 and
  the number of visited features to 7.
 Based on these thresholds:
             – For the cameras data, 40 client transactions were pruned, leaving 18
               client transactions.
             – For the mobiles data, 35 client transactions were pruned, leaving 20
               client transactions.




Page  43
Classification


 In the time-based data mode, the classification models can be applied
  directly to the preprocessed data.
 Each client transaction is labeled by the buy-product button that was
  pressed (e.g., a client who bought camera #1).
 The aggregated times spent over 28 features (4 products × 7 features) are
  used as the main features.
 Our classification models are trained on the preprocessed time-based
  data.
 We use three classifiers: Naive Bayes, Support Vector Machines, and
  Decision Tree (C4.5 algorithm).




Page  44
E-survey


 In the third case study, an E-survey web application is built from scratch.
 We integrate our framework with it.
 E-survey is a simple web application that allows students to assess lecturers
  through both multiple-choice and essay questions.
 The main goal of E-survey is to understand students’ attitudes and
  behavior.
 The E-survey webpage consists of twelve questions (eleven multiple-choice
  questions and one essay question).
 Each multiple-choice question consists of four options (cannot do it at all,
  weak, good, and very good).




Page  45
Snapshot of E-Survey




Page  46
Data Collection


 As a source of data we depend on three sources:
       • Students from Yarmouk University (Accounting class).
       • Students from Jadara University (Computer Skills class).
       • Students from Philadelphia University (Design class).
 We record the events.
 The events are merged in the time-based mode.
 In the time-based mode, the times spent over each question within a specific
  user session are aggregated.
 Based on our database statistics, 101 students assessed their lecturers.
             – 37 students from Yarmouk University, 38 students from Philadelphia
               University, and 26 students from Jadara University.



Page  47
Data Preprocessing


 The total session time and the number of visited questions are used as
  two thresholds.
 Based on our experiments, we set the total-session-time threshold to 25 and
  the number of visited questions to 12.
 Based on these thresholds, 11 student transactions are discarded from the
  student database.
             – The remaining transactions are 90.




Page  48
Snapshot of preprocessed data




Page  49
Classification


 The aggregated times spent over the 12 questions are used as the 12 main
  features.
 In E-Survey, the recorded transactions are not labeled directly.
 Labeling is done by a flag question.
 Our classification models are trained on the preprocessed time-based data.
 We use three classifiers: Naive Bayes, Support Vector Machines, and
  Decision Tree (C4.5 algorithm).




Page  50
The student’s data model (exponential)

    (Figure: histogram of the Questions-Freq series; the y-axis is the number of
    questions (0–450) and the x-axis is the time in seconds (1–58).)


Page  51
Evaluation


 For evaluation purposes, we use three well-known measures commonly used
  in information retrieval: 1. precision, 2. recall, 3. F-measure (see the sketch
  below).


 The False Positive (FP) and False Negative (FN) counts are used to evaluate
  the errors of the classification models.
 For testing purposes, the classifiers are tested in two modes:
             – Training dataset method.
             – 5-fold cross-validation method.
 The training dataset method uses the same dataset for both training and
  testing.
 The 5-fold cross-validation method divides the dataset into subsets; one
  subset is used for testing and the remaining subsets for training.


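The three measures computed from true/false positive and negative counts; a small JavaScript sketch under the usual definitions.

    // precision = TP / (TP + FP), recall = TP / (TP + FN),
    // F-measure = harmonic mean of precision and recall.
    function evaluate(tp, fp, fn) {
      var precision = tp / (tp + fp);
      var recall = tp / (tp + fn);
      var fMeasure = 2 * precision * recall / (precision + recall);
      return { precision: precision, recall: recall, fMeasure: fMeasure };
    }

    // e.g. evaluate(8, 2, 2) -> { precision: 0.8, recall: 0.8, fMeasure: 0.8 }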
Page  52
5 folds cross-validation method



                                              (Figure: in each of the five runs, the
                                               green subsets are used for training
                                               and the red subset is used for
                                               testing.)




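A JavaScript sketch of the 5-fold split; assigning records to folds by index modulo k is an assumption (any disjoint partition works).

    // Split the dataset into k folds; each fold is used once for testing while
    // the remaining k-1 folds are used for training.
    function kFoldSplits(data, k) {
      var folds = [], f;
      for (f = 0; f < k; f++) {
        folds.push(data.filter(function (_, i) { return i % k === f; }));
      }
      return folds.map(function (testFold, idx) {
        return {
          test: testFold,
          train: [].concat.apply([], folds.filter(function (_, i) { return i !== idx; }))
        };
      });
    }

    // e.g. kFoldSplits(transactions, 5) yields five train/test pairs.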
Page  53
Results-TinyMCE



    (Figure: bar chart of the precision, recall, and F-measure values for NB and
    DT with 2, 3, and 4 clusters, using 5-fold cross-validation.)

 Page  54
Results-TinyMCE


    (Figure: bar chart of the false negative (FN) and false positive (FP) values
    for NB and DT with 2, 3, and 4 clusters, using 5-fold cross-validation.)

Page  55
Results E-Survey



    (Figure: bar chart of the precision, recall, and F-measure values for DT,
    Naïve Bayes, and SVM, evaluated on the training dataset and with 5-fold
    cross-validation.)



Page  56
Results E-Survey



    (Figure: bar chart of the false negative (FN) and false positive (FP) values
    for DT, Naïve Bayes, and SVM, evaluated on the training dataset and with
    5-fold cross-validation.)



 Page  57
Conclusions


 Client data is very useful.
 Client data is flexible enough to be mined.
 Client data can take multiple forms.
 Clustering can be used to label unlabeled client transactions.
 Classification is very practical on client data.
 Our complete framework will help to improve clients’ experiences.
 Our classification models show the ability to classify with a high accuracy
  rate.




Page  58
Future Work


 We look forward to dealing with more client data, such as x/y mouse
  coordinates.


 We are looking to develop new clustering and classification techniques that
  can deal efficiently with clients’ data.


 We will extract more knowledge from client data.




Page  59
Thank You

Page  60
Results for E-commerce camera’s


    (Figures: bar charts of the precision, recall, and F-measure values and of
    the FN/FP values for DT, Naïve Bayes, and SVM on the cameras data.)


Page  61
Snapshot of the generated tree from decision tree model for
                        camera’s category




Page  62
Results for E-commerce mobile’s


    (Figures: bar charts of the precision, recall, and F-measure values and of
    the FN/FP values for DT, Naïve Bayes, and SVM on the mobiles data.)



Page  63
Snap shot of the generated tree from decision tree model for
                        mobiles category




Page  64
Web applications links


 http://web-engineering.orgfree.com/
 http://easyshoping.orgfree.com/
 http://questions.orgfree.com/




Page  65
Machine learning Algorithms


 Naïve Bayes is a probabilistic model based on Bayes’ theorem.




    \Pr(C \mid F) = \frac{\Pr(F \mid C)\,\Pr(C)}{\Pr(F)}




Page  66
Machine learning Algorithms


 C4.5 is a supervised machine learning algorithm developed originally from
  the ID3 algorithm.
 C4.5 generates decision trees from a set of training data based on the
  concept of information entropy.




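For reference, the standard entropy and information-gain definitions that ID3/C4.5 build on (C4.5 additionally normalises the gain into a gain ratio); these are textbook definitions, not taken from the thesis.

    Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i

    Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)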
Page  67
Machine learning Algorithms


   SVM is a supervised machine learning
   algorithm. The main idea is to find a
   separating boundary called a hyperplane.

   The hyperplane separates the n-dimensional
   data completely into its two (or more)
   classes.




Page  68