1
Preprocessing on Web Log Data for Web Usage Mining
Shahid Rajaee Teacher Training University
Faculty of Computer Engineering
PRESENTED BY:
Amir Masoud Sefidian
2
Outline:
• Introduction
• Web Logs Files
• Phases of Web Usage Mining
• Steps of Data Preprocessing
• Data Cleaning
• User Identification
• Session Identification
• Path Completion
• Main references
3
Outline:
•Introduction
• Web Logs Files
• Phases of Web Usage Mining
• Steps of Data Preprocessing
• Data Cleaning
• User Identification
• Session Identification
• Path Completion
• Main references
4
Introduction
• The Web has been growing into a dominant platform for retrieving information and discovering
knowledge from web data.
• Web usage analysis (also called web usage mining, web log mining, or click stream analysis):
• The process of extracting useful knowledge from web server logs, database logs, user queries,
client-side cookies and user profiles in order to analyze web users’ behavior.
• Applies data mining techniques to log data to extract the behavior of users, which is used in
various applications.
5
Outline:
• Introduction
•Web Logs Files
• Phases of Web Usage Mining
• Steps of Data Preprocessing
• Data Cleaning
• User Identification
• Session Identification
• Path Completion
• Main references
6
Logs
TYPES OF WEB SERVER LOG FILES
• Access logs:
• Store information about which files are requested from the web server.
• Referrer logs:
• Store the URLs of web pages on other sites that link to the server’s pages.
• If a user reaches one of the server’s pages by clicking a link from another site, the URL of that site appears in
this log.
• Agent logs:
• Record information about the web clients that send requests to the web server: the browser type and the
platform, which determine what a user is able to access on a web site.
• Error logs:
• Store information about errors and failed requests of the web server.
Types of Web log file formats
• Common Log Format (CLF)
• W3C extended log file format
• Microsoft IIS (Internet Information Services) log file format
• NCSA Common log file format
7
Sources of Log Data For Web Usage Mining
Server side:
• All click streams are recorded in the web server log.
• Contains basic information, e.g. name and IP of the remote
host, date and time of the request, etc.
• The web server stores data regarding requests performed
by the client.
Client side:
• The client itself sends information about the user's
behavior to a repository.
• Done either with an ad-hoc browsing application or
through client-side applications running in standard browsers.
Proxy side:
• Proxy-level collection is an intermediary between server-level
and client-level collection.
• Proxy servers collect data of groups of users accessing
large groups of web servers.
We consider only the case of web server log data.
8
Outline:
• Introduction
• Web Logs Files
•Phases of Web Usage Mining
• Steps of Data Preprocessing
• Data Cleaning
• User Identification
• Session Identification
• Path Completion
• Main references
9
Phases of Web Usage Mining:
Data Preprocessing:
• Transforms the raw click stream data into a set of user
profiles.
• One of the most complex phases of the Web Usage Mining
process.
Pattern Discovery:
• Extracts information from preprocessed data.
• Data mining, statistics, machine learning and pattern
recognition are applied to web usage data to discover users'
access patterns on the web.
Pattern Analysis:
• Extracts the interesting patterns from the pattern discovery
process by eliminating irrelevant patterns.
• Involves:
• Validation: removing the irrelevant patterns.
• Interpretation: using visualization techniques to
interpret mathematical results for humans.
Our Focus
10
Outline:
• Introduction
• Web Logs Files
• Phases of Web Usage Mining
•Steps of Data Preprocessing
• Data Cleaning
• User Identification
• Session Identification
• Path Completion
• Conclusion
• Main references
11
Steps of Data Preprocessing
12
Outline:
• Introduction
• Web Logs Files
• Phases of Web Usage Mining
• Steps of Data Preprocessing
• Data Cleaning
• User Identification
• Session Identification
• Path Completion
• Conclusion
• Main references
13
Data Cleaning:
• Irrelevant or redundant log records are removed.
• Cleans accessorial resources embedded in HTML files, robot requests and error requests.
• Very little research focuses purely on web log cleaning.
Attributes Involved in Web Log Cleaning and Intrusion Detection:
• Multimedia (image, video and audio) Files:
• Categorized as useless files in web log preprocessing.
• Web log file size can be reduced to less than 50% of its original size by eliminating image requests.
• Web Robots Requests:
• Dramatically distort web site traffic statistics.
• These are not important from the mining perspective and hence must be removed.
• HTTP Status Codes:
• Log entries with an unsuccessful HTTP status code are usually eliminated during the web log cleaning process.
• The widely accepted definition of an unsuccessful HTTP status code is a code under 200 or over 299.
• For Intrusion Detection:
• [3] removes all log entries with status code 200.
• [2] argued that log entries with a 200-series status should be retained, as they may include
web attacks like SQL injection and XSS that were executed successfully.
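The attribute-based filters above can be sketched as a single predicate over parsed log entries. This is a minimal illustration only: the field names, the extension list, and the "bot"-in-agent test are assumptions, not part of any cited algorithm.

```python
# Minimal sketch of attribute-based web log cleaning (hypothetical field
# names; the rules mirror the attributes listed above).
MULTIMEDIA_EXT = {".jpg", ".gif", ".png", ".mp3", ".mp4", ".avi"}

def is_relevant(entry):
    """Keep an entry only if it passes the usual cleaning attributes."""
    # Drop multimedia (image/video/audio) requests.
    if any(entry["uri"].lower().endswith(ext) for ext in MULTIMEDIA_EXT):
        return False
    # Drop robot requests, identified here naively by the agent string.
    if "bot" in entry["agent"].lower():
        return False
    # Drop unsuccessful requests (status outside 200-299).
    if not (200 <= entry["status"] <= 299):
        return False
    # Drop accessorial resources (e.g. CSS) embedded in the HTML page.
    if entry["uri"].lower().endswith(".css"):
        return False
    return True

log = [
    {"uri": "/index.html", "agent": "Mozilla/5.0", "status": 200},
    {"uri": "/logo.gif",   "agent": "Mozilla/5.0", "status": 200},
    {"uri": "/index.html", "agent": "Googlebot",   "status": 200},
    {"uri": "/missing",    "agent": "Mozilla/5.0", "status": 404},
    {"uri": "/style.css",  "agent": "Mozilla/5.0", "status": 200},
]
cleaned = [e for e in log if is_relevant(e)]
```

Only the first entry survives: the others fall to the multimedia, robot, status-code and accessorial-resource rules respectively.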
14
Attributes Involved in Web Log Cleaning
• HTTP Methods:
• Few studies have included the HTTP method as an attribute in web log cleaning.
• In the LODAP data cleaning module, all log entries with an HTTP request method other than GET are
removed, as these are non-significant in web usage mining.
• For Intrusion Detection:
• One study proposed:
• HTTP requests with the POST method should be kept.
• Another proposed:
• Keep the log entries with HTTP GET and HEAD requests to obtain more accurate referrer information.
• Other Files:
• Log entries requesting accessorial resources (e.g. CSS files) embedded in HTML files should be removed.
15
Algorithm Design of Newest Methods:
• This method is used for data cleaning and intrusion detection from log files.
• A total of six cleaning conditions are applied:
First:
Logs with HTTP status code 200 are removed (the probability that web logs matching this criterion contain malicious
web attacks is almost zero).
Second:
• Web logs with multimedia file extensions are removed if
• The HTTP request in the web log is not HTTP POST and
• The HTTP status code is not in the 400 or 500 series.
• Web logs with 400- and 500-series status codes should be kept, as these may indicate malicious attempts.
• Users who trigger many web logs with HTTP error status codes are suspect.
• Commonly, to launch a web defacement attack, an attacker uses the HTTP POST method to replace part or all of the web
interface components.
Third:
• Legitimate web robot requests, like Googlebot, are removed.
• Specific IP addresses are included in a web robot IP whitelist.
• Web logs with web robot requests from whitelisted IP addresses are removed.
16
Algorithm Design of Newest Methods:
Fourth:
Remove web logs with legitimate file extensions (.css, .pdf, .txt and .doc) if:
The web log contains no 400- or 500-series HTTP status code and the HTTP method is not HTTP POST.
Fifth:
Web logs with the HTTP HEAD method (used in a web monitoring system) and a legitimate IP are removed.
A large number of HTTP HEAD requests may indicate malicious web robot activity.
Sixth:
Web logs with the HTTP POST method are removed if the posted file is legitimate.
For instance, a web log with a .svc file extension in the uri-stem and the HTTP POST method is legitimate:
a .svc file is a special content file that represents a Windows Communication Foundation (WCF) service hosted in IIS.
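The six conditions can be sketched as one filtering function that keeps only suspicious entries. This is an illustrative reading of the rules above; the field names, whitelist addresses and extension sets are assumptions.

```python
# Hypothetical constants; real deployments would populate these lists.
MULTIMEDIA_EXT = {".jpg", ".gif", ".png", ".mp3", ".mp4", ".avi"}
LEGIT_EXT = {".css", ".pdf", ".txt", ".doc"}
ROBOT_WHITELIST = {"66.249.66.1"}   # e.g. a known Googlebot address
MONITOR_IPS = {"10.0.0.5"}          # legitimate HTTP HEAD monitoring host

def error_status(code):
    return 400 <= code <= 599       # 400 and 500 series

def keep_for_intrusion_detection(e):
    """Return True if the entry should be kept for intrusion analysis."""
    ext = "." + e["uri"].rsplit(".", 1)[-1].lower() if "." in e["uri"] else ""
    if e["status"] == 200:                                  # condition 1
        return False
    if ext in MULTIMEDIA_EXT and e["method"] != "POST" \
            and not error_status(e["status"]):              # condition 2
        return False
    if e["ip"] in ROBOT_WHITELIST:                          # condition 3
        return False
    if ext in LEGIT_EXT and e["method"] != "POST" \
            and not error_status(e["status"]):              # condition 4
        return False
    if e["method"] == "HEAD" and e["ip"] in MONITOR_IPS:    # condition 5
        return False
    if e["method"] == "POST" and ext == ".svc":             # condition 6
        return False
    return True

log = [
    {"ip": "1.2.3.4",     "method": "GET", "uri": "/img/x.jpg",  "status": 404},
    {"ip": "1.2.3.4",     "method": "GET", "uri": "/index.html", "status": 200},
    {"ip": "66.249.66.1", "method": "GET", "uri": "/robots.txt", "status": 304},
]
suspicious = [e for e in log if keep_for_intrusion_detection(e)]
```

The 404 image request survives (error-status multimedia is kept as a possible attack probe), while the status-200 page and the whitelisted robot request are cleaned away.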
17
Implementation
Web log format:
Internet Information Services (IIS) log format.
Simulated attacks were carried out using three web vulnerability assessment tools:
Acunetix (run on Microsoft Windows),
Nikto and w3af (run on BackTrack GNOME).
An e-commerce site web server was configured to send web logs to the log collector
server via User Datagram Protocol (UDP).
Architectural Diagram for Simulation Attack and Web Log Collection
18
Comparison of existing frameworks:
[1]: Salama, S.E., Marie, M.I., El-Fangary, L.M., Helmy, Y.K. 2011
[2]: Patil, P., Patil, U. 2012
[3]: Yew Chuan Ong and Zuraini Ismail 2014
[1] and [2] considered only three file extensions (.jpg, .gif and .css).
Algorithm [3] defined a total of sixteen multimedia file extensions plus four other file extensions.
Comparison Factor              | [1]                         | [2] | [3]
Multimedia Files               | Yes                         | Yes | Yes
Web Robots Request             | No                          | No  | Yes
HTTP Status Code               | 200, 400 series, 500 series | 200 | 200, 400 series, 500 series
HTTP Method                    | No                          | GET | GET, POST, HEAD
Other Files                    | Yes                         | Yes | Yes
Number of Rules and Conditions | 2                           | 1   | 6
19
Evaluation of existing frameworks:
Evaluating cleaning capability:
• Size of the web log file in bytes.
• Number of web log entries, based on the total number of lines in the web log file.
Percentage of reduction = (total # of web log entries removed / total # of web log entries) × 100%
The higher the percentage of reduction, the better the cleaning capability.
Evaluating intrusion detection readiness:
False negative rate = total # of malicious requests removed / total # of malicious requests
The lower the false negative rate, the better the intrusion detection readiness.
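Worked through with illustrative counts (not the study's figures), the two metrics are straightforward ratios:

```python
def percent_reduction(removed, total):
    """Cleaning capability: share of log entries removed, in percent."""
    return removed / total * 100

def false_negative_rate(malicious_removed, malicious_total):
    """Intrusion-detection readiness: fraction of malicious requests
    wrongly removed during cleaning."""
    return malicious_removed / malicious_total

# Illustrative numbers only: 40,000 of 100,000 entries removed,
# of which 5 out of 1,000 malicious requests were lost.
reduction = round(percent_reduction(40_000, 100_000), 2)   # 40.0
fnr = false_negative_rate(5, 1_000)                        # 0.005
```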
Measuring Factor            | [1]     | [2]      | [3]
File Size Reduced (bytes)   | 6945603 | 32423581 | 18957149
Number of Entries Removed   | 52916   | 215616   | 153372
Percentage of Reduction (%) | 13.94   | 56.81    | 40.41
False Negative Rate         | 0.00144 | 0.15789  | 0.00531
Algorithm [3] has the second-highest percentage of reduction and the second-lowest false negative rate
compared to the other algorithms.
20
Outline:
• Introduction
• Web Logs Files
• Phases of Web Usage Mining
• Steps of Data Preprocessing
• Data Cleaning
• User Identification
• Session Identification
• Path Completion
• Main references
21
User Identification:
 Identify each distinct user.
 User identification is one way of introducing state into the stateless Web.
 A very complex task because of proxy servers and caches.
1. User identification by IP address:
“Each different IP address represents a different user.”
Problems:
• Several users can share the same IP address or computer (e.g. college, internet café).
• One user can have different IP addresses, since a user who accesses the Web from different machines gets a different IP address each time.
2. User identification using user registration data:
If users log in with their information, it is easy to identify them.
Usernames and passwords are also stored in the web log files.
Problems:
• These facilities are not available on every website, so this approach is not suitable for general web browsing.
• Many users do not register their information.
3. User identification using cookies:
Cookies are HTTP headers in string format.
Using cookies, we can extract the details of users and the resources they access.
Problems:
• Users can block the use of cookies.
• Users can delete cookies.
22
Two heuristics have been proposed to help identify unique users:
• (P. Pirolli, J. Pitkow, and R. Rao) and (K.R. Suneetha and R. Krishnamoorthi (2009)) proposed:
“Even if the IP address is the same, if the agent log shows a change in browser software or operating system, then
each different agent type for an IP address represents a different user.”
User 1: A → B → E → K → I → O → E → L
User 2: A → C → G → M → H → N
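The IP-plus-agent heuristic amounts to grouping requests by the (IP, user-agent) pair. A minimal sketch with assumed field names:

```python
from collections import defaultdict

def identify_users(entries):
    """Each distinct (IP, agent) pair is treated as a different user."""
    users = defaultdict(list)
    for e in entries:
        users[(e["ip"], e["agent"])].append(e["uri"])
    return users

log = [
    {"ip": "1.1.1.1", "agent": "Firefox/Linux", "uri": "/A"},
    {"ip": "1.1.1.1", "agent": "Firefox/Linux", "uri": "/B"},
    {"ip": "1.1.1.1", "agent": "IE/Windows",    "uri": "/A"},
]
users = identify_users(log)
# One IP, two agent strings: the heuristic yields two distinct users.
```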
23
Another heuristic (L. Chaofeng (2006) and V. Chitraa (2010)):
Use the access log in conjunction with the referrer log and site topology to construct browsing paths for each user:
“If a page is requested that is not directly reachable by a hyperlink from any of the pages visited by the user, assume that
there is another user with the same IP address.”
Following the referrer field along user 1’s path through the web site:
Unexpectedly, there is no referrer shown for the I.html request, and there is no direct link between K.html and I.html.
It is highly unlikely that the user who was traversing A → B → E → K then proceeded to I.
It is more likely that this request for I.html came from a third user, who accessed the page directly, probably by
entering the URL into the browser, using the same browser version and operating system:
User 1: A → B → E → K → E → L, User 2: A → C → G → M → H → N, User 3: I → O
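The topology check can be sketched as follows: a request not reachable by a hyperlink from any page an existing user has visited starts a new user. The site topology below is a hypothetical link map matching the example paths.

```python
# Hypothetical site topology: page -> set of pages it links to.
LINKS = {
    "A": {"B", "C"},
    "B": {"E"},
    "E": {"K", "L"},
    "I": {"O"},
}

def split_by_topology(path):
    """Split one IP's click stream into users via link reachability."""
    users = []
    for page in path:
        # Assign the page to the most recently active user who could
        # have reached it by a hyperlink; otherwise start a new user.
        for visited in reversed(users):
            if any(page in LINKS.get(p, set()) for p in visited):
                visited.append(page)
                break
        else:
            users.append([page])
    return users

result = split_by_topology(["A", "B", "E", "K", "I", "O", "E", "L"])
# [['A', 'B', 'E', 'K', 'E', 'L'], ['I', 'O']] -- I is unreachable from
# user 1's pages, so it starts a separate user, matching the example.
```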
24
• P. Yeng and Y. Zheng (2010) dedicated a method solely to user identification through inspired rules:
• Four constraints are used to identify users: IP address, agent
information, site topology and time information.
• Has low efficiency, but accuracy is increased significantly.
• Renáta Iváncsy and Sándor Juhász analyzed different user identification methods in
“Analysis of Web User Identification Methods”:
• Heuristics are not error-proof.
• Different heuristics must be selected depending on the situation and application.
25
Outline:
• Introduction
• Web Logs Files
• Phases of Web Usage Mining
• Steps of Data Preprocessing
• Data Cleaning
• User Identification
• Session Identification
• Path Completion
• Main references
26
Session Identification(Sessionization, Session Reconstruction):
Session Definition:
• The group of activities performed by a user from the moment he enters the website to the moment he leaves
it.
• A set of user clicks across web servers, usually referred to as a click stream.
• A sequence of web pages a user browses in a single access.
Session Identification:
• Grouping the different activities of a single user.
• The process of segmenting the access log of each user into individual access sessions.
Goals of Session Identification:
• Group the page accesses of each user into individual access sessions.
• Identify how much time each user has spent on the website.
• Each heuristic scans the user activity logs into which the web server log is partitioned.
Two general approaches:
• Time-oriented heuristic methods
• Navigation-oriented heuristic methods
27
Session Identification:
• Time-oriented heuristic methods:
A set of pages visited by a specific user is considered a single user session if the pages are requested within a
time interval not larger than a specified period.
First heuristic (session-duration-based):
The total session duration may not exceed a threshold θ.
t0: the timestamp of the first request in a constructed session S.
“The request with timestamp t is assigned to S iff t − t0 ≤ θ” (Liu, 2007).
θ = 30 min has been recommended from empirical findings (Spiliopoulou, Mobasher, Berendt, & Nakagawa,
2003).
Second heuristic (page-stay-time-based):
The time spent on a page may not exceed a threshold δ.
t1: the timestamp of the last request assigned to the constructed session S.
The next request, with timestamp t2, is assigned to S iff t2 − t1 ≤ δ (Liu, 2007).
A conservative threshold of δ = 10 min has been proposed to capture the time for loading
and studying the contents of a page (Spiliopoulou et al., 2003).
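Both time-oriented heuristics can be sketched in a few lines. Timestamps are in minutes for readability; the thresholds follow the recommended θ = 30 min and δ = 10 min.

```python
THETA = 30   # max total session duration, minutes
DELTA = 10   # max page-stay time, minutes

def sessionize(requests):
    """Split one user's (timestamp, page) stream into sessions using
    both the session-duration (theta) and page-stay (delta) rules."""
    sessions = []
    for t, page in requests:
        if sessions:
            t0 = sessions[-1][0][0]        # first request of current session
            t_prev = sessions[-1][-1][0]   # previous request
            if t - t0 <= THETA and t - t_prev <= DELTA:
                sessions[-1].append((t, page))
                continue
        sessions.append([(t, page)])       # start a new session
    return sessions

stream = [(0, "A"), (2, "B"), (5, "E"), (9, "K"), (45, "E"), (47, "L")]
sessions = sessionize(stream)
# The 36-minute gap after K violates both thresholds, so the stream
# splits into [A, B, E, K] and [E, L].
```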
28
Session Identification:
• Navigation-oriented heuristic methods:
Web users reach pages by following hyperlinks rather than by typing URLs.
Topology-based heuristic:
“If a web page is not connected with the previously visited pages in a session, it is considered
to belong to a different session.”
Referrer-based heuristic (Cooley et al. (1999)), based on the referrer information:
• The referrer of a requested page P should be a page already in the session (a previously
visited page); otherwise P is assigned to a different session.
• If the page has an empty referrer, it is likely to be the first page of a new session.
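The referrer-based rule can be sketched similarly. Each request carries the page and its referrer; an empty referrer opens a new session. This simplified variant checks only the most recent session, whereas a full implementation would consider all open sessions.

```python
def sessionize_by_referrer(requests):
    """requests: list of (page, referrer) pairs; '' = empty referrer.
    A request joins the current session only if its referrer was
    already visited in that session."""
    sessions = []
    for page, ref in requests:
        current = sessions[-1] if sessions else None
        if current is not None and ref and ref in current:
            current.append(page)
        else:
            sessions.append([page])   # empty or unknown referrer
    return sessions

clicks = [("A", ""), ("B", "A"), ("E", "B"), ("I", ""), ("O", "I")]
result = sessionize_by_referrer(clicks)
# [['A', 'B', 'E'], ['I', 'O']] -- the empty referrer on I opens a
# new session.
```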
29
Session Identification:
• Spiliopoulou et al. evaluate different heuristics in “A Framework for the Evaluation of Session
Reconstruction Heuristics in Web Usage Analysis”:
• Time-based methods are not reliable, because users may engage in other activities after
opening a web page.
• Referrer-based heuristics are more restrictive than topology-based heuristics, because there
are cases where a page request has an empty referrer.
• Different methods are used by different applications.
• Experiments showed that there is no best heuristic for all cases.
• Even for a simple application, two variations in the method of assessing reconstruction quality led
to significantly different precision scores among the heuristics.
• G. Shivaprasad, N.V. Subba Reddy, U. Dinesh Acharya and Prakash K. Aithal (2016) proposed:
• A combined technique based on both heuristics for session identification.
• Uses web topology and page-stay time.
30
Session Identification (time-oriented heuristic example):
For user 1, there is more than a 30-minute delay between the request for page K.html and
the second request for page E.html, so:
Session 1 (user 1): A → B → E → K
Session 2 (user 2): A → C → G → M → H → N
Session 3 (user 3): I → O
Session 4 (user 1): E → L
31
Outline:
• Introduction
• Web Logs Files
• Phases of Web Usage Mining
• Steps of Data Preprocessing
• Data Cleaning
• User Identification
• Session Identification
• Path Completion
• Main references
32
Path Completion :
• A critical phase of preprocessing.
• The number of URLs recorded in the log may be less than the real number:
• Some important page requests are not recorded in the server log due to proxy servers, use of the browser's
back button, and local caching.
• Definition:
• “The process of reconstructing the user’s navigation path by appending missed page requests (page
requests that are not recorded in the server log) in order to analyze the data properly within the identified
sessions.”
• Used to obtain the complete user access path.
33
Path Completion :
Methods similar to those used for user identification can be used for path
completion.
Heuristic methods based on the referrer log and site topology are employed.
Cooley, R., Mobasher, B., & Srivastava, J. (1999):
Missing pages are added as follows:
Check whether the requested page is directly linked to the last page:
If there is no link with the last page, check the recent request history.
If the page's referrer is found in the recent history, it is clear that the “Back” button
was used (the cached pages were not logged) until that page was reached.
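The back-button reconstruction can be sketched as follows: when a request's referrer is not the previous page but appears earlier in the path, the intermediate pages are re-inserted in reverse order until the referrer is reached. The session data below is hypothetical, shaped after the slide's session 2 example.

```python
def complete_path(session):
    """session: list of (page, referrer) pairs.
    Re-insert pages reached via the cached Back button."""
    path = []
    for page, ref in session:
        if path and ref and path[-1] != ref and ref in path:
            # Referrer is not the last page but was visited earlier:
            # the user hit "Back" (cached, so not logged). Rewind,
            # appending each revisited page, until the referrer.
            i = len(path) - 1
            while path[i] != ref:
                i -= 1
                path.append(path[i])
        path.append(page)
    return path

# Session 2 with referrers: H.html was reached from C.html, not M.html.
session2 = [("A", ""), ("C", "A"), ("G", "C"),
            ("M", "G"), ("H", "C"), ("N", "H")]
completed = complete_path(session2)
# ['A', 'C', 'G', 'M', 'G', 'C', 'H', 'N'] -- the two Back steps
# (M -> G -> C) are restored, as in the slide's example.
```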
34
Path Completion :
Considering session 2:
Session 2 (user 2): A → C → G → M → H → N
There is no direct link between page M.html and page H.html. Therefore, the user is presumed to have hit the
“Back” button on the browser twice.
The path completion process leads us to insert “→ G → C” into the session path for session 2:
Session 2 (user 2): A → C → G → M → G → C → H → N
35
Outline:
• Introduction
• Web Logs Files
• Phases of Web Usage Mining
• Steps of Data Preprocessing
• Data Cleaning
• User Identification
• Session Identification
• Path Completion
• Main references
36
Outline:
• Introduction
• Web Logs Files
• Phases of Web Usage Mining
• Steps of Data Preprocessing
• Data Cleaning
• User Identification
• Session Identification
• Path Completion
• Conclusion
• Main references
37
Main References:
1. Ong, Y.C., Ismail, Z.: Enhanced Web Log Cleaning Algorithm for Web Intrusion Detection. In: Recent
Advances in Information and Communication Technology, pp. 315–324. Springer International Publishing (2014)
2. Salama, S.E., Marie, M.I., El-Fangary, L.M., Helmy, Y.K.: Web Server Logs Preprocessing for Web Intrusion
Detection. Computer and Information Science 4, 123–133 (2011)
3. Patil, P., Patil, U.: Preprocessing of web server log file for web mining. World Journal of Science and Technology 2,
14–18 (2012)
4. Cooley, R., Mobasher, B., Srivastava, J.: Data Preparation for Mining World Wide Web Browsing Patterns. Journal
of Knowledge and Information Systems 1 (1999)
5. Das, R., Turkoglu, I.: Creating meaningful data from web logs for improving the impressiveness of a website by
using path analysis method. Expert Systems with Applications 36(3), 6635–6644 (2009)
6. Yeng, P., Zheng, Y.: Inspired Rule-Based User Identification. LNCS 6440, pp. 618–624 (2010)
7. Suneetha, K.R., Krishnamoorthi, R.: Identifying User Behavior by Analyzing Web Server Access Log File.
IJCSNS (2009)
8. …
Preprocessing of Web Log Data for Web Usage Mining
QUESTIONS?
More Related Content

PPTX
Database administrator
DOCX
Open source search engine
PDF
Data automation 101
PPTX
Explainable Machine Learning (Explainable ML)
PDF
Introduction to Neo4j - a hands-on crash course
PPTX
PDF
Unit 5
PDF
Cross-lingual Information Retrieval
Database administrator
Open source search engine
Data automation 101
Explainable Machine Learning (Explainable ML)
Introduction to Neo4j - a hands-on crash course
Unit 5
Cross-lingual Information Retrieval

What's hot (20)

PPTX
Natural Language Processing: Parsing
PPT
Web Usage Pattern
PDF
operating system structure
ODP
Data mining
PPT
File organization 1
PDF
Tutorial on Bias in Rec Sys @ UMAP2020
PPTX
Link analysis : Comparative study of HITS and Page Rank Algorithm
PPTX
Database , 8 Query Optimization
PPTX
Introduction to Text Mining
PPT
Introduction to parallel_computing
PPTX
Introduction to Machine Learning
PPT
File organization and indexing
PPT
Indexing and Hashing
PPTX
Web search vs ir
PPT
File organisation
PDF
Advance database system (part 3)
PDF
Lecture 1 introduction to parallel and distributed computing
PDF
The Evolution of Data Science
PDF
Confusion Matrix Explained
Natural Language Processing: Parsing
Web Usage Pattern
operating system structure
Data mining
File organization 1
Tutorial on Bias in Rec Sys @ UMAP2020
Link analysis : Comparative study of HITS and Page Rank Algorithm
Database , 8 Query Optimization
Introduction to Text Mining
Introduction to parallel_computing
Introduction to Machine Learning
File organization and indexing
Indexing and Hashing
Web search vs ir
File organisation
Advance database system (part 3)
Lecture 1 introduction to parallel and distributed computing
The Evolution of Data Science
Confusion Matrix Explained
Ad

Viewers also liked (20)

ODP
Personal Web Usage Mining
PDF
Web Usage Mining: A Survey on User's Navigation Pattern from Web Logs
PDF
PPT
Applying web mining application for user behavior understanding
PPTX
Web Mining Presentation Final
PPTX
imortance of w3c validation
PPTX
Learning to Classify Users in Online Interaction Networks
PDF
Knowledge discoverylaurahollink
PDF
Advance Clustering Technique Based on Markov Chain for Predicting Next User M...
PDF
Dotnet titles 2016 17
PPTX
Webmining ppt
PPT
A survey on web usage mining techniques
ODP
Data mining
PDF
Web Navigation Presentation
PPT
The FOCUS K3D Project
PPT
Knowledge discovery thru data mining
PPTX
Web mining
PPT
Fp growth algorithm
PDF
Customer Clustering For Retail Marketing
PPTX
Web mining (structure mining)
Personal Web Usage Mining
Web Usage Mining: A Survey on User's Navigation Pattern from Web Logs
Applying web mining application for user behavior understanding
Web Mining Presentation Final
imortance of w3c validation
Learning to Classify Users in Online Interaction Networks
Knowledge discoverylaurahollink
Advance Clustering Technique Based on Markov Chain for Predicting Next User M...
Dotnet titles 2016 17
Webmining ppt
A survey on web usage mining techniques
Data mining
Web Navigation Presentation
The FOCUS K3D Project
Knowledge discovery thru data mining
Web mining
Fp growth algorithm
Customer Clustering For Retail Marketing
Web mining (structure mining)
Ad

Similar to Preprocessing of Web Log Data for Web Usage Mining (20)

PDF
a novel technique to pre-process web log data using sql server management studio
PDF
Ijarcet vol-2-issue-7-2341-2343
PDF
Ijarcet vol-2-issue-7-2341-2343
PDF
Web Data mining-A Research area in Web usage mining
PDF
A Novel Method for Data Cleaning and User- Session Identification for Web Mining
PDF
Comparison of decision and random tree algorithms on
PPTX
Avtar's ppt
PDF
COMPARISON ANALYSIS OF WEB USAGE MINING USING PATTERN RECOGNITION TECHNIQUES
PDF
M0947679
PDF
Website Content Analysis Using Clickstream Data and Apriori Algorithm
PDF
IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS
PDF
IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS
PDF
IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS
PDF
Identifying the Number of Visitors to improve Website Usability from Educatio...
PPT
Web Servers
PPT
Web usage-mining
PPTX
Web usage mining
PPTX
Web usage mining
PDF
A Comparative Study of Recommendation System Using Web Usage Mining
PDF
C017231726
a novel technique to pre-process web log data using sql server management studio
Ijarcet vol-2-issue-7-2341-2343
Ijarcet vol-2-issue-7-2341-2343
Web Data mining-A Research area in Web usage mining
A Novel Method for Data Cleaning and User- Session Identification for Web Mining
Comparison of decision and random tree algorithms on
Avtar's ppt
COMPARISON ANALYSIS OF WEB USAGE MINING USING PATTERN RECOGNITION TECHNIQUES
M0947679
Website Content Analysis Using Clickstream Data and Apriori Algorithm
IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS
IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS
IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS
Identifying the Number of Visitors to improve Website Usability from Educatio...
Web Servers
Web usage-mining
Web usage mining
Web usage mining
A Comparative Study of Recommendation System Using Web Usage Mining
C017231726

Recently uploaded (20)

PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PDF
Spectral efficient network and resource selection model in 5G networks
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PDF
Empathic Computing: Creating Shared Understanding
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
Electronic commerce courselecture one. Pdf
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PPTX
MYSQL Presentation for SQL database connectivity
DOCX
The AUB Centre for AI in Media Proposal.docx
PDF
Network Security Unit 5.pdf for BCA BBA.
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
cuic standard and advanced reporting.pdf
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
Spectral efficient network and resource selection model in 5G networks
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
Empathic Computing: Creating Shared Understanding
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Advanced methodologies resolving dimensionality complications for autism neur...
Electronic commerce courselecture one. Pdf
Chapter 3 Spatial Domain Image Processing.pdf
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
Dropbox Q2 2025 Financial Results & Investor Presentation
MYSQL Presentation for SQL database connectivity
The AUB Centre for AI in Media Proposal.docx
Network Security Unit 5.pdf for BCA BBA.
The Rise and Fall of 3GPP – Time for a Sabbatical?
MIND Revenue Release Quarter 2 2025 Press Release
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
cuic standard and advanced reporting.pdf

Preprocessing of Web Log Data for Web Usage Mining

  • 1. 1 Preprocessing on Web Log Data for Web Usage Mining Shahid Rajaee Teacher Training University Faculty of Computer Engineering PRESENTED BY: Amir Masoud Sefidian
  • 2. 2 Outline: • Introduction • Web Logs Files • Phases of Web Usage Mining • Steps of Data Preprocessing • Data Cleaning • User Identification • Session Identification • Path Completion • Main references
  • 3. 3 Outline: •Introduction • Web Logs Files • Phases of Web Usage Mining • Steps of Data Preprocessing • Data Cleaning • User Identification • Session Identification • Path Completion • Main references
  • 4. 4 Introduction • Web has been growing as a dominant platform for retrieving information and discovering knowledge from web data. • Web usage analysis or web usage mining or web log mining or click stream analysis: • Process of extracting useful knowledge from web server logs, database logs, user queries, client side cookies and user profiles in order to analyze web users’ behavior. • Applies data mining techniques in log data to extract the behavior of users which is used in various applications.
  • 5. 5 Outline: • Introduction •Web Logs Files • Phases of Web Usage Mining • Steps of Data Preprocessing • Data Cleaning • User Identification • Session Identification • Path Completion • Main references
  • 6. 6 Logs TYPES OF WEB SERVER LOG FILES • Access logs: • It stores information about which files are requested from web server. • Referrer logs: • Stores information of the URLs of web pages on other sites that link to web pages. • If a user gets to one of the server‘s pages by clicking on a link from another site, the URL of that site will appear in this log. • Agent logs: • It records information about the web clients that sends requests to web server. Contain type of browser and the platform determines what a user is able to access on a web site. • Error logs: • It stores information about errors and failed requests of the web server. Types of Web log file formats • Common Log Format (CLF) • W3C extended log file format • Microsoft IIS (Internet Information Services) log file format • NCSA Common log file format
  • 7. 7 Sources of Log Data For Web Usage Mining Server side: • All the click streams are recorded into the web server log. • Contain basic information e.g. name and IP of the remote host, date and time of the request etc. • The web server stores data regarding request performed by the client. Client side: • The client itself which sends information to a repository regarding the users‘ behavior. • Done either with an ad-hoc browsing application or through client side application running standard browsers. Proxy side: • Proxy level collection is an intermediary between server level and client level. • Proxy servers collect data of groups of users accessing huge groups of web servers. We consider only the case of a Web Server Log data.
  • 8. 8 Outline: • Introduction • Web Logs Files •Phases of Web Usage Mining • Steps of Data Preprocessing • Data Cleaning • User Identification • Session Identification • Path Completion • Main references
  • 9. 9 Phases of Web Usage Mining: Data Preprocessing: • Transform the raw click stream data into a set of user profiles. • One of the most complex phase of the Web Usage Mining process. Pattern Discovery: • Extracting information from preprocessed data. • Data mining, statistics, machine learning and pattern recognition are applied to web usage data to discover user access patterns of the web. Pattern Analysis: • Extract the interesting patterns from the pattern discovery process by eliminating the irrelative patterns. • Involves : • Validation: remove the irrelative patterns • Interpretation : using visualization techniques to interpret mathematic results for humans. Our Focus
  • 10. 10 Outline: • Introduction • Web Logs Files • Phases of Web Usage Mining •Steps of Data Preprocessing • Data Cleaning • User Identification • Session Identification • Path Completion • Conclusion • Main references
  • 11. 11 Steps of Data Preprocessing
  • 12. 12 Outline: • Introduction • Web Logs Files • Phases of Web Usage Mining • Steps of Data Preprocessing • Data Cleaning • User Identification • Session Identification • Path Completion • Conclusion • Main references
• 13. 13 Data Cleaning:
• Irrelevant or redundant log records are removed.
• Cleans out accessorial resources embedded in HTML files, robot requests and error requests.
• Very little research focuses purely on web log cleaning.
Attributes Involved in Web Log Cleaning and Intrusion Detection:
• Multimedia (image, video and audio) Files:
• Categorized as useless files in web log preprocessing.
• Web log file size can be reduced to less than 50% of its original size by eliminating image requests.
• Web Robots Requests:
• Dramatically affect web site traffic statistics.
• These are not important from the mining perspective and hence must be removed.
• HTTP Status Codes:
• Log entries with unsuccessful HTTP status codes are usually eliminated during the web log cleaning process.
• The widely accepted definition of an unsuccessful HTTP status code is a code below 200 or above 299.
• For Intrusion Detection:
• [3] removes all log entries with status code 200.
• [2] argued that log entries with 200-series status codes should be retained, as they may include web attacks such as SQL injection and XSS that have been executed successfully.
• 14. 14 Attributes Involved in Web Log Cleaning
• HTTP Methods:
• Few studies have included the HTTP method as an attribute in web log cleaning.
• In the LODAP data cleaning module, all log entries with an HTTP request method other than GET are removed, as these are not significant in web usage mining.
• For Intrusion Detection:
• One study proposed that HTTP requests with the POST method should be kept.
• Another proposed keeping log entries with HTTP GET and HEAD requests to obtain more accurate referrer information.
• Other Files:
• Log entries requesting accessorial resources (e.g. CSS files) embedded in HTML files should be removed.
• 15. 15 Algorithm Design of the Newest Method:
• This method is used for data cleaning and intrusion detection from log files.
• A total of six cleaning conditions are applied:
First: Logs with HTTP status code 200 will be removed (the probability that web logs with this status contain malicious web attacks is almost zero).
Second:
• Web logs with multimedia file extensions will be removed if:
• The HTTP request method is not POST, and
• The HTTP status code is not in the 400 or 500 series.
• Web logs with 400- and 500-series status codes should be kept, as these may indicate malicious attempts.
• Users who trigger many web logs with HTTP error status codes are subject to suspicion.
• Commonly, to launch a web defacement attack, an attacker uses the HTTP POST method to replace part or all of the web interface components.
Third:
• Legitimate web robot requests, such as Googlebot, will be removed.
• Specific IP addresses are included in a web robot IP whitelist.
• Web logs with web robot requests from whitelisted IP addresses are removed.
• 16. 16 Algorithm Design of the Newest Method:
Fourth: Remove web logs with legitimate file extensions (.css, .pdf, .txt and .doc) if:
• The HTTP status code is not in the 400 or 500 series, and the HTTP method is not POST.
Fifth:
• Web logs with the HTTP HEAD method (used by web monitoring systems) from a legitimate IP address will be removed.
• A large number of HTTP HEAD requests may indicate malicious web robot activity.
Sixth:
• Web logs with the HTTP POST method will be removed if the posted file is legitimate.
• For instance, a web log with a .svc file extension in the uri-stem and the HTTP POST method is legitimate: a .svc file is a special content file representing a Windows Communication Foundation (WCF) service hosted in IIS.
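The six conditions above can be sketched as a single filter predicate. This is a minimal, hedged sketch: the log-entry field names, the extension lists, the robot IP whitelist and the monitoring-host set are illustrative assumptions, not values taken from the paper.

```python
# Sketch of the six cleaning conditions. Field names, extension lists
# and the whitelists below are illustrative assumptions.
MULTIMEDIA_EXTS = ('.jpg', '.gif', '.png', '.mp3', '.mp4', '.avi')
LEGIT_EXTS = ('.css', '.pdf', '.txt', '.doc')
ROBOT_IP_WHITELIST = {'66.249.66.1'}   # hypothetical whitelisted Googlebot IP
MONITOR_IPS = {'10.0.0.5'}             # hypothetical web-monitoring host
LEGIT_POST_EXTS = ('.svc',)            # e.g. WCF service endpoints

def is_error(status):
    """400- and 500-series responses are kept as possible attack traces."""
    return 400 <= status <= 599

def should_remove(entry):
    uri = entry['uri'].lower()
    status, method, ip = entry['status'], entry['method'], entry['ip']
    if status == 200:                                                   # 1st
        return True
    if uri.endswith(MULTIMEDIA_EXTS) and method != 'POST' and not is_error(status):
        return True                                                    # 2nd
    if ip in ROBOT_IP_WHITELIST:                                       # 3rd
        return True
    if uri.endswith(LEGIT_EXTS) and method != 'POST' and not is_error(status):
        return True                                                    # 4th
    if method == 'HEAD' and ip in MONITOR_IPS:                         # 5th
        return True
    if method == 'POST' and uri.endswith(LEGIT_POST_EXTS):             # 6th
        return True
    return False

# A failed request is kept for intrusion analysis; a 200 is cleaned away.
kept = not should_remove({'uri': '/login.php', 'status': 404,
                          'method': 'GET', 'ip': '1.2.3.4'})   # True
```

The order of the checks mirrors the order of the conditions on the slides; a real implementation would parse the IIS log lines into such dicts first.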
• 17. 17 Implementation
Web log format: Internet Information Services (IIS) log format.
Simulated attacks were carried out using three web vulnerability assessment tools:
• Acunetix (run on Microsoft Windows)
• Nikto and w3af (run on BackTrack GNOME)
An e-commerce web server is configured to send web logs to the log collector server via the User Datagram Protocol (UDP).
Architectural Diagram for Simulation Attack and Web Log Collection
• 18. 18 Comparison of existing frameworks:
[1]: Salama, S.E., Marie, M.I., El-Fangary, L.M., Helmy, Y.K. 2011
[2]: Patil, P., Patil, U. 2012
[3]: Yew Chuan Ong and Zuraini Ismail 2014
[1] and [2] considered only three file extensions (.jpg, .gif and .css); algorithm [3] defined a total of sixteen multimedia file extensions plus four other file extensions.

Comparison Factor              | [1]                         | [2]  | [3]
Multimedia Files               | Yes                         | Yes  | Yes
Web Robots Request             | No                          | No   | Yes
HTTP Status Code               | 200, 400 series, 500 series | 200  | 200, 400 series, 500 series
HTTP Method                    | No                          | GET  | GET, POST, HEAD
Other Files                    | Yes                         | Yes  | Yes
Number of Rules and Conditions | 2                           | 1    | 6
• 19. 19 Evaluation of existing frameworks:
Evaluating cleaning capability:
• Size of the web log file in bytes.
• Number of web log entries, based on the total number of lines in the web log file.
Percentage of reduction = (total # of web log entries removed / total # of web log entries) × 100%
The higher the percentage of reduction, the better the cleaning capability.
Evaluating intrusion detection readiness:
False negative rate = (total # of malicious requests removed) / (total # of malicious requests)
The lower the false negative rate, the better the intrusion detection readiness.

Measuring Factor            | [1]     | [2]      | [3]
File Size Reduced (bytes)   | 6945603 | 32423581 | 18957149
Number of Entries Removed   | 52916   | 215616   | 153372
Percentage of Reduction (%) | 13.94   | 56.81    | 40.41
False Negative Rate         | 0.00144 | 0.15789  | 0.00531

Algorithm [3] has the second highest percentage of reduction and the second lowest false negative rate compared to the other algorithms.
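The two measures are straightforward to compute; a small sketch, with made-up counts for illustration rather than the figures reported in the study:

```python
# The two evaluation measures as plain functions.
def reduction_pct(entries_removed, entries_total):
    """Cleaning capability: a higher percentage of reduction is better."""
    return 100.0 * entries_removed / entries_total

def false_negative_rate(malicious_removed, malicious_total):
    """Intrusion detection readiness: a lower rate is better."""
    return malicious_removed / malicious_total

# Hypothetical counts, not the study's data:
pct = reduction_pct(40_000, 100_000)   # 40.0 (%)
fnr = false_negative_rate(2, 400)      # 0.005
```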
  • 20. 20 Outline: • Introduction • Web Logs Files • Phases of Web Usage Mining • Steps of Data Preprocessing • Data Cleaning • User Identification • Session Identification • Path Completion • Main references
• 21. 21 User Identification:
 Identify each distinct user.
 User identification is one way of introducing state into the stateless web.
 A very complex task because of proxy servers and caches.
1. User identification by IP address: "Each different IP address represents a different user."
Problems:
• Several users can share the same IP address or computer (e.g. college, internet café).
• One user can have different IP addresses, since a user accessing the Web from different machines will have a different IP address on each.
2. User identification using user registration data: if users log in, it is easy to identify them; the username is also stored in the web log files.
Problems:
• This facility is not available on every website, so it is not appropriate for general web browsing.
• Many users do not register their information.
3. User identification using cookies: cookies are HTTP headers in string format. Using cookies, we can extract the details of users and the resources they access.
Problems:
• Users can block the use of cookies.
• Users can delete cookies.
• 22. 22 Two heuristics have been proposed to help identify unique users:
• (P. Pirolli, J. Pitkow, and R. Rao) and (K.R. Suneetha and R. Krishnamoorthi (2009)) proposed:
"Even if the IP address is the same, if the agent log shows a change in browser software or operating system, then each different agent type for an IP address represents a different user."
User 1: A → B → E → K → I → O → E → L
User 2: A → C → G → M → H → N
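The agent-based heuristic amounts to keying users on the (IP address, user agent) pair. A minimal sketch, assuming log records have already been reduced to (ip, agent, page) tuples (an assumed layout for the example):

```python
# (IP, user agent) heuristic: the same IP with a different browser/OS
# string is attributed to a different user.
def identify_users(records):
    users = {}                        # (ip, agent) -> list of visited pages
    for ip, agent, page in records:
        users.setdefault((ip, agent), []).append(page)
    return list(users.values())

log = [
    ('1.2.3.4', 'Mozilla/5.0 (Windows NT 10.0)', 'A'),
    ('1.2.3.4', 'Mozilla/5.0 (X11; Linux)',      'A'),
    ('1.2.3.4', 'Mozilla/5.0 (Windows NT 10.0)', 'B'),
]
print(identify_users(log))   # [['A', 'B'], ['A']]: two users behind one IP
```

Plain dicts preserve insertion order in Python 3.7+, so the users come out in the order they first appear in the log.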
• 23. 23 Another heuristic (L. Chaofeng (2006) and V. Chitraa (2010)):
Use the access log in conjunction with the referrer log and the site topology to construct browsing paths for each user:
"If a page is requested that is not directly reachable by a hyperlink from any of the pages visited by the user, assume that there is another user with the same IP address."
Following the referrer field along user 1's path through the web site, there is unexpectedly no referrer shown for the I.html request, and there is no direct link between K.html and I.html.
It appears highly unlikely that the user who was traversing A → B → E → K then proceeded to I. It is more likely that this request for page I.html came from a third user, who accessed the page directly, probably by entering the URL into the browser, using the same browser version and operating system:
User 1: A → B → E → K → E → L
User 2: A → C → G → M → H → N
User 3: I → O
• 24. 24
• P. Yeng and Y. Zheng (2010) dedicated a method to user identification through inspired rules:
• Four constraints are used to identify users: IP address, agent information, site topology and time information.
• It has low efficiency, but accuracy increases significantly.
• Renáta Iváncsy and Sándor Juhász analyzed different user identification methods in "Analysis of Web User Identification Methods":
• Heuristics are not error-proof.
• Different heuristics must be selected depending on the situation and application.
  • 25. 25 Outline: • Introduction • Web Logs Files • Phases of Web Usage Mining • Steps of Data Preprocessing • Data Cleaning • User Identification • Session Identification • Path Completion • Main references
• 26. 26 Session Identification (Sessionization, Session Reconstruction):
Session definitions:
• A group of activities performed by a user from the moment he enters the website to the moment he leaves it.
• A set of user clicks across web servers, usually referred to as a click stream, is defined as a user session.
• A sequence of web pages a user browses in a single access.
Session identification:
• Grouping the different activities of a single user.
• The process of segmenting the access log of each user into individual access sessions.
Goals of session identification:
• Group the page accesses of each user into individual access sessions.
• Identify which user has spent how much time on the website.
• Each heuristic h scans the user activity logs into which the web server log is partitioned.
Two general approaches:
• Time-oriented heuristic methods
• Navigation-oriented heuristic methods
• 27. 27 Session Identification:
• Time-oriented heuristic methods: a set of pages visited by a specific user is considered a single session if the pages are requested within a time interval no larger than a specified period.
First heuristic: the total session duration may not exceed a threshold θ.
• Let t₀ be the timestamp of the first request in a constructed session S.
• The request with timestamp t is assigned to S iff t − t₀ ≤ θ (Liu, 2007).
• θ = 30 min has been recommended from empirical findings (Spiliopoulou, Mobasher, Berendt, & Nakagawa, 2003).
Second heuristic (the page-stay-time-based method): the time spent on a page may not exceed a threshold δ.
• Let t₁ be the timestamp of the last request assigned to the constructed session S.
• The next request, with timestamp t₂, is assigned to S iff t₂ − t₁ ≤ δ (Liu, 2007).
• A conservative page-stay threshold of δ = 10 min has been proposed to capture the time for loading and studying the contents of a page (Spiliopoulou et al., 2003).
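Both time-oriented heuristics can be combined in one pass over a user's request stream. A sketch, assuming requests arrive as (timestamp-in-minutes, page) pairs already grouped per user:

```python
# Time-oriented sessionization: close the current session when total
# duration exceeds THETA (30 min) or the stay on the previous page
# exceeds DELTA (10 min).
THETA = 30   # session-duration threshold, minutes
DELTA = 10   # page-stay threshold, minutes

def sessionize(requests):
    sessions, current, t0 = [], [], None
    for t, page in requests:
        if current and (t - t0 > THETA or t - current[-1][0] > DELTA):
            sessions.append([p for _, p in current])   # close the session
            current, t0 = [], None
        if t0 is None:
            t0 = t
        current.append((t, page))
    if current:
        sessions.append([p for _, p in current])
    return sessions

# User 1 from the running example: a long gap before the second E splits it.
reqs = [(0, 'A'), (2, 'B'), (5, 'E'), (8, 'K'), (45, 'E'), (47, 'L')]
print(sessionize(reqs))   # [['A', 'B', 'E', 'K'], ['E', 'L']]
```

This reproduces the split shown on the later example slide: the 30-minute threshold separates the request for K.html from the second request for E.html.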
• 28. 28 Session Identification:
• Navigation-oriented heuristic methods: web users reach pages by following hyperlinks rather than by typing URLs.
Topology-based heuristic:
• "If a web page is not connected with a previously visited page in a session, then it is considered a different session."
Referrer-based heuristic (Cooley et al. (1999)), based on the referrer information:
• The referrer of a requested page P should be a page already in the session (a previously visited page); otherwise P is assigned to a different session.
• If the page has an empty referrer, then it is likely to be the first page of a new session.
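The referrer-based rule can be sketched as follows, assuming each record is a (page, referrer) pair (an assumed input format; an empty string stands for an empty referrer):

```python
# Referrer-based sessionization: a request extends the current session
# only if its referrer was already visited in that session; an empty or
# unseen referrer opens a new session.
def referrer_sessions(records):
    sessions, current = [], []
    for page, referrer in records:
        if not current or (referrer and referrer in current):
            current.append(page)
        else:
            sessions.append(current)   # referrer empty or not in session
            current = [page]
    if current:
        sessions.append(current)
    return sessions

records = [('A', ''), ('B', 'A'), ('E', 'B'), ('I', ''), ('O', 'I')]
print(referrer_sessions(records))   # [['A', 'B', 'E'], ['I', 'O']]
```

The second session starts at I because its referrer is empty, mirroring the earlier user-identification example where I.html arrived with no referrer.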
• 29. 29 Session Identification:
• Spiliopoulou et al. evaluate different heuristics in "A Framework for the Evaluation of Session Reconstruction Heuristics in Web Usage Analysis":
• Time-based methods are not reliable, because users may engage in other activities after opening a web page.
• Referrer-based heuristics are more restrictive than topology-based heuristics, because there are cases where a page request has an empty referrer.
• Different methods are used by different applications.
• Experiments showed that there is no best heuristic for all cases.
• Even for a simple application, two variations in the method of assessing reconstruction quality led to significantly different precision scores among the heuristics.
• G. Shivaprasad, N.V. Subba Reddy, U. Dinesh Acharya and Prakash K. Aithal (2016) proposed:
• A combined technique for session identification based on both heuristics, using the web topology and page-stay time.
• 30. 30 Session Identification (Time-oriented heuristic example):
For user 1, there is more than a 30-minute delay between the request for page K.html and the second request for page E.html, so:
Session 1 (user 1): A → B → E → K
Session 2 (user 2): A → C → G → M → H → N
Session 3 (user 3): I → O
Session 4 (user 1): E → L
  • 31. 31 Outline: • Introduction • Web Logs Files • Phases of Web Usage Mining • Steps of Data Preprocessing • Data Cleaning • User Identification • Session Identification • Path Completion • Main references
• 32. 32 Path Completion:
• A critical phase in preprocessing.
• The number of URLs recorded in the log may be less than the real number:
• Some important page requests are not recorded in the server log because of proxy servers, local caching, and presses of the browser's back button.
• Definition:
• "The process of reconstructing the user's navigation path, by appending missed page requests (page requests that are not recorded in the server log), in order to analyze the data properly within the identified sessions."
• Used to obtain the complete user access path.
• 33. 33 Path Completion:
Methods similar to those used for user identification can be used for path completion.
Heuristic methods based on the referrer log and the site topology are employed.
Cooley, R., Mobasher, B., & Srivastava, J. (1999): missing pages are added as follows:
• Check whether the requested page is directly linked to the last page.
• If there is no link with the last page, check the recent request history.
• If the page is available in the recent history, it is assumed that the user pressed the "Back" button, traversing cached pages until reaching a page that links to the requested one.
• 34. 34 Path Completion:
Considering session 2: A → C → G → M → H → N
There is no direct link between page M.html and page H.html. Therefore, the user is presumed to have hit the "Back" button on the browser twice.
The path completion process leads us to insert "→ G → C" into the session path for session 2:
Session 2 (user 2): A → C → G → M → G → C → H → N
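The back-button inference above can be sketched against a site topology. A minimal sketch in the spirit of Cooley et al.; the LINKS map is a hypothetical fragment of the example site, not the paper's data structure:

```python
# Back-button path completion: when a request is not linked from the
# current page, walk back through already-visited pages (newest first)
# until one that links to it, appending each revisited page.
LINKS = {  # hypothetical topology matching the running example
    'A': {'B', 'C'}, 'C': {'G', 'H'}, 'G': {'M'},
    'M': set(), 'H': {'N'}, 'N': set(),
}

def complete_path(path):
    full = [path[0]]
    for page in path[1:]:
        if page not in LINKS.get(full[-1], set()):
            # assume "Back" presses through the pages visited so far
            for prev in reversed(full[:-1]):
                full.append(prev)
                if page in LINKS.get(prev, set()):
                    break
        full.append(page)
    return full

print(complete_path(['A', 'C', 'G', 'M', 'H', 'N']))
# ['A', 'C', 'G', 'M', 'G', 'C', 'H', 'N']
```

Run on session 2, this inserts the two back steps G and C before H, exactly as in the worked example.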
• 35. 35 Outline: • Introduction • Web Logs Files • Phases of Web Usage Mining • Steps of Data Preprocessing • Data Cleaning • User Identification • Session Identification • Path Completion • Main references
• 37. 37 Main References:
1. Ong, Y.C., & Ismail, Z. (2014). Enhanced web log cleaning algorithm for web intrusion detection. In Recent Advances in Information and Communication Technology (pp. 315–324). Springer International Publishing.
2. Salama, S.E., Marie, M.I., El-Fangary, L.M., & Helmy, Y.K. (2011). Web server logs preprocessing for web intrusion detection. Computer and Information Science, 4, 123–133.
3. Patil, P., & Patil, U. (2012). Preprocessing of web server log file for web mining. World Journal of Science and Technology, 2, 14–18.
4. Cooley, R., Mobasher, B., & Srivastava, J. (1999). Data preparation for mining World Wide Web browsing patterns. Journal of Knowledge and Information Systems, 1.
5. Das, R., & Turkoglu, I. (2009). Creating meaningful data from web logs for improving the impressiveness of a website by using path analysis method. Expert Systems with Applications, 36(3), 6635–6644.
6. Yeng, P., & Zheng, Y. (2010). Inspired rule-based user identification. LNCS 6440, pp. 618–624.
7. Suneetha, K.R., & Krishnamoorthi, R. (2009). Identifying user behavior by analyzing web server access log file. IJCSNS.
8. …