1
Preprocessing on Web Log Data for Web Usage Mining
Shahid Rajaee Teacher Training University
Faculty of Computer Engineering
PRESENTED BY:
Amir Masoud Sefidian
2
Outline:
• Introduction
• Web Logs Files
• Phases of Web Usage Mining
• Steps of Data Preprocessing
• Data Cleaning
• User Identification
• Session Identification
• Path Completion
• Main references
3
Outline:
•Introduction
• Web Logs Files
• Phases of Web Usage Mining
• Steps of Data Preprocessing
• Data Cleaning
• User Identification
• Session Identification
• Path Completion
• Main references
4
Introduction
• The Web has been growing into a dominant platform for retrieving information and discovering
knowledge from web data.
• Web usage analysis (also called web usage mining, web log mining, or click stream analysis):
• The process of extracting useful knowledge from web server logs, database logs, user queries,
client-side cookies and user profiles in order to analyze web users’ behavior.
• Applies data mining techniques to log data to extract the behavior of users, which is used in
various applications.
5
Outline:
• Introduction
•Web Logs Files
• Phases of Web Usage Mining
• Steps of Data Preprocessing
• Data Cleaning
• User Identification
• Session Identification
• Path Completion
• Main references
6
Logs
TYPES OF WEB SERVER LOG FILES
• Access logs:
• Store information about which files are requested from the web server.
• Referrer logs:
• Store the URLs of web pages on other sites that link to the server’s pages.
• If a user reaches one of the server’s pages by clicking a link from another site, the URL of that site appears in
this log.
• Agent logs:
• Record information about the web clients that send requests to the web server: the browser type and the
platform, which determine what a user is able to access on a web site.
• Error logs:
• Store information about errors and failed requests of the web server.
Types of Web log file formats
• Common Log Format (CLF)
• W3C extended log file format
• Microsoft IIS (Internet Information Services) log file format
• NCSA Common log file format
7
Sources of Log Data For Web Usage Mining
Server side:
• All click streams are recorded in the web server log.
• Contains basic information, e.g. name and IP of the remote
host, date and time of the request, etc.
• The web server stores data regarding requests performed
by the client.
Client side:
• The client itself sends information about the user's
behavior to a repository.
• Done either with an ad-hoc browsing application or
through client-side applications running in standard browsers.
Proxy side:
• Proxy-level collection is an intermediary between server-level
and client-level collection.
• Proxy servers collect data of groups of users accessing
large groups of web servers.
We consider only the case of web server log data.
8
Outline:
• Introduction
• Web Logs Files
•Phases of Web Usage Mining
• Steps of Data Preprocessing
• Data Cleaning
• User Identification
• Session Identification
• Path Completion
• Main references
9
Phases of Web Usage Mining:
Data Preprocessing:
• Transforms the raw click stream data into a set of user
profiles.
• One of the most complex phases of the Web Usage Mining
process.
Pattern Discovery:
• Extracts information from preprocessed data.
• Data mining, statistics, machine learning and pattern
recognition are applied to web usage data to discover users'
access patterns on the web.
Pattern Analysis:
• Extracts the interesting patterns from the pattern discovery
process by eliminating irrelevant patterns.
• Involves:
• Validation: removing the irrelevant patterns.
• Interpretation: using visualization techniques to
interpret mathematical results for humans.
Our Focus
10
Outline:
• Introduction
• Web Logs Files
• Phases of Web Usage Mining
•Steps of Data Preprocessing
• Data Cleaning
• User Identification
• Session Identification
• Path Completion
• Conclusion
• Main references
11
Steps of Data Preprocessing
12
Outline:
• Introduction
• Web Logs Files
• Phases of Web Usage Mining
• Steps of Data Preprocessing
• Data Cleaning
• User Identification
• Session Identification
• Path Completion
• Conclusion
• Main references
13
Data Cleaning:
• Irrelevant or redundant log records are removed.
• Cleans accessorial resources embedded in HTML files, robot requests and error requests.
• Very little research focuses purely on web log cleaning.
Attributes Involved in Web Log Cleaning and Intrusion Detection:
• Multimedia (image, video and audio) Files:
• Categorized as useless files in web log preprocessing.
• Web log file size can be reduced to less than 50% of its original size by eliminating image requests.
• Web Robots Requests:
• Dramatically distort web site traffic statistics.
• These are not important from the mining perspective and hence must be removed.
• HTTP Status Codes:
• Log entries with an unsuccessful HTTP status code are usually eliminated during the web log cleaning process.
• The widely accepted definition of an unsuccessful HTTP status code is a code under 200 or over 299.
• For Intrusion Detection:
• [3] removes all log entries with status code 200.
• [2] argued that log entries with a 200-series status should be retained, as they may include
web attacks like SQL injection and XSS that were executed successfully.
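The attribute-based filters above can be sketched as a single predicate over parsed log entries. This is a minimal illustration only: the field names, the extension list, and the "bot"-in-agent test are assumptions, not part of any cited algorithm.

```python
# Minimal sketch of attribute-based web log cleaning (hypothetical field
# names; the rules mirror the attributes listed above).
MULTIMEDIA_EXT = {".jpg", ".gif", ".png", ".mp3", ".mp4", ".avi"}

def is_relevant(entry):
    """Keep an entry only if it passes the usual cleaning attributes."""
    # Drop multimedia (image/video/audio) requests.
    if any(entry["uri"].lower().endswith(ext) for ext in MULTIMEDIA_EXT):
        return False
    # Drop robot requests, identified here naively by the agent string.
    if "bot" in entry["agent"].lower():
        return False
    # Drop unsuccessful requests (status outside 200-299).
    if not (200 <= entry["status"] <= 299):
        return False
    # Drop accessorial resources (e.g. CSS) embedded in the HTML page.
    if entry["uri"].lower().endswith(".css"):
        return False
    return True

log = [
    {"uri": "/index.html", "agent": "Mozilla/5.0", "status": 200},
    {"uri": "/logo.gif",   "agent": "Mozilla/5.0", "status": 200},
    {"uri": "/index.html", "agent": "Googlebot",   "status": 200},
    {"uri": "/missing",    "agent": "Mozilla/5.0", "status": 404},
    {"uri": "/style.css",  "agent": "Mozilla/5.0", "status": 200},
]
cleaned = [e for e in log if is_relevant(e)]
```

Only the first entry survives: the others fall to the multimedia, robot, status-code and accessorial-resource rules respectively.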
14
Attributes Involved in Web Log Cleaning
• HTTP Methods:
• Few studies have included the HTTP method as an attribute in web log cleaning.
• In the LODAP data cleaning module, all log entries with an HTTP request method other than GET are
removed, as these are non-significant in web usage mining.
• For Intrusion Detection:
• One study proposed:
• HTTP requests with the POST method should be kept.
• Another proposed:
• Keep the log entries with HTTP GET and HEAD requests to obtain more accurate referrer information.
• Other Files:
• Log entries requesting accessorial resources (e.g. CSS files) embedded in HTML files should be removed.
15
Algorithm Design of Newest Methods:
• This method is used for data cleaning and intrusion detection from log files.
• A total of six cleaning conditions are applied:
First:
Logs with HTTP status code 200 are removed (the probability that web logs matching this criterion contain malicious
web attacks is almost zero).
Second:
• Web logs with multimedia file extensions are removed if
• The HTTP request in the web log is not HTTP POST and
• The HTTP status code is not in the 400 or 500 series.
• Web logs with 400- and 500-series status codes should be kept, as these may indicate malicious attempts.
• Users who trigger many web logs with HTTP error status codes are suspect.
• Commonly, to launch a web defacement attack, an attacker uses the HTTP POST method to replace part or all of the web
interface components.
Third:
• Legitimate web robot requests, like Googlebot, are removed.
• Specific IP addresses are included in a web robot IP whitelist.
• Web logs with web robot requests from whitelisted IP addresses are removed.
16
Algorithm Design of Newest Methods:
Fourth:
Remove web logs with legitimate file extensions (.css, .pdf, .txt and .doc) if:
The web log contains no 400- or 500-series HTTP status code and the HTTP method is not HTTP POST.
Fifth:
Web logs with the HTTP HEAD method (used in a web monitoring system) and a legitimate IP are removed.
A large number of HTTP HEAD requests may indicate malicious web robot activity.
Sixth:
Web logs with the HTTP POST method are removed if the posted file is legitimate.
For instance, a web log with a .svc file extension in the uri-stem and the HTTP POST method is legitimate:
a .svc file is a special content file that represents a Windows Communication Foundation (WCF) service hosted in IIS.
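The six conditions can be sketched as one filtering function that keeps only suspicious entries. This is an illustrative reading of the rules above; the field names, whitelist addresses and extension sets are assumptions.

```python
# Hypothetical constants; real deployments would populate these lists.
MULTIMEDIA_EXT = {".jpg", ".gif", ".png", ".mp3", ".mp4", ".avi"}
LEGIT_EXT = {".css", ".pdf", ".txt", ".doc"}
ROBOT_WHITELIST = {"66.249.66.1"}   # e.g. a known Googlebot address
MONITOR_IPS = {"10.0.0.5"}          # legitimate HTTP HEAD monitoring host

def error_status(code):
    return 400 <= code <= 599       # 400 and 500 series

def keep_for_intrusion_detection(e):
    """Return True if the entry should be kept for intrusion analysis."""
    ext = "." + e["uri"].rsplit(".", 1)[-1].lower() if "." in e["uri"] else ""
    if e["status"] == 200:                                  # condition 1
        return False
    if ext in MULTIMEDIA_EXT and e["method"] != "POST" \
            and not error_status(e["status"]):              # condition 2
        return False
    if e["ip"] in ROBOT_WHITELIST:                          # condition 3
        return False
    if ext in LEGIT_EXT and e["method"] != "POST" \
            and not error_status(e["status"]):              # condition 4
        return False
    if e["method"] == "HEAD" and e["ip"] in MONITOR_IPS:    # condition 5
        return False
    if e["method"] == "POST" and ext == ".svc":             # condition 6
        return False
    return True

log = [
    {"ip": "1.2.3.4",     "method": "GET", "uri": "/img/x.jpg",  "status": 404},
    {"ip": "1.2.3.4",     "method": "GET", "uri": "/index.html", "status": 200},
    {"ip": "66.249.66.1", "method": "GET", "uri": "/robots.txt", "status": 304},
]
suspicious = [e for e in log if keep_for_intrusion_detection(e)]
```

The 404 image request survives (error-status multimedia is kept as a possible attack probe), while the status-200 page and the whitelisted robot request are cleaned away.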
17
Implementation
Web log format:
Internet Information Services (IIS) log format.
Simulated attacks were carried out using three web vulnerability assessment tools:
Acunetix (run on Microsoft Windows),
Nikto and w3af (run on BackTrack GNOME).
An e-commerce site web server was configured to send web logs to the log collector
server via User Datagram Protocol (UDP).
Architectural Diagram for Simulation Attack and Web Log Collection
18
Comparison of existing frameworks:
[1]: Salama, S.E., Marie, M.I., El-Fangary, L.M., Helmy, Y.K. 2011
[2]: Patil, P., Patil, U. 2012
[3]: Yew Chuan Ong and Zuraini Ismail 2014
[1] and [2] considered only three file extensions (.jpg, .gif and .css).
Algorithm [3] defined a total of sixteen multimedia file extensions plus four other file extensions.
Comparison Factor              | [1]                         | [2] | [3]
Multimedia Files               | Yes                         | Yes | Yes
Web Robots Request             | No                          | No  | Yes
HTTP Status Code               | 200, 400 series, 500 series | 200 | 200, 400 series, 500 series
HTTP Method                    | No                          | GET | GET, POST, HEAD
Other Files                    | Yes                         | Yes | Yes
Number of Rules and Conditions | 2                           | 1   | 6
19
Evaluation of existing frameworks:
Evaluating cleaning capability:
• Size of the web log file in bytes.
• Number of web log entries, based on the total number of lines in the web log file.
Percentage of reduction = (total # of web log entries removed / total # of web log entries) × 100%
The higher the percentage of reduction, the better the cleaning capability.
Evaluating intrusion detection readiness:
False negative rate = total # of malicious requests removed / total # of malicious requests
The lower the false negative rate, the better the intrusion detection readiness.
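Worked through with illustrative counts (not the study's figures), the two metrics are straightforward ratios:

```python
def percent_reduction(removed, total):
    """Cleaning capability: share of log entries removed, in percent."""
    return removed / total * 100

def false_negative_rate(malicious_removed, malicious_total):
    """Intrusion-detection readiness: fraction of malicious requests
    wrongly removed during cleaning."""
    return malicious_removed / malicious_total

# Illustrative numbers only: 40,000 of 100,000 entries removed,
# of which 5 out of 1,000 malicious requests were lost.
reduction = round(percent_reduction(40_000, 100_000), 2)   # 40.0
fnr = false_negative_rate(5, 1_000)                        # 0.005
```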
Measuring Factor            | [1]     | [2]      | [3]
File Size Reduced (bytes)   | 6945603 | 32423581 | 18957149
Number of Entries Removed   | 52916   | 215616   | 153372
Percentage of Reduction (%) | 13.94   | 56.81    | 40.41
False Negative Rate         | 0.00144 | 0.15789  | 0.00531
Algorithm [3] has the second-highest percentage of reduction and the second-lowest false negative rate
compared to the other algorithms.
20
Outline:
• Introduction
• Web Logs Files
• Phases of Web Usage Mining
• Steps of Data Preprocessing
• Data Cleaning
• User Identification
• Session Identification
• Path Completion
• Main references
21
User Identification:
 Identify each distinct user.
 User identification is one way of introducing state into the stateless Web.
 A very complex task because of proxy servers and caches.
1. User identification by IP address:
“Each different IP address represents a different user.”
Problems:
• Several users can share the same IP address or computer (e.g. college, internet café).
• One user can have different IP addresses, since a user who accesses the Web from different machines gets a different IP address each time.
2. User identification using user registration data:
If users log in with their information, it is easy to identify them.
Usernames and passwords are also stored in the web log files.
Problems:
• These facilities are not available on every website, so this approach is not suitable for general web browsing.
• Many users do not register their information.
3. User identification using cookies:
Cookies are HTTP headers in string format.
Using cookies, we can extract the details of users and the resources they access.
Problems:
• Users can block the use of cookies.
• Users can delete cookies.
22
Two heuristics have been proposed to help identify unique users:
• (P. Pirolli, J. Pitkow, and R. Rao) and (K.R. Suneetha and R. Krishnamoorthi (2009)) proposed:
“Even if the IP address is the same, if the agent log shows a change in browser software or operating system, then
each different agent type for an IP address represents a different user.”
User 1: A → B → E → K → I → O → E → L
User 2: A → C → G → M → H → N
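The IP-plus-agent heuristic amounts to grouping requests by the (IP, user-agent) pair. A minimal sketch with assumed field names:

```python
from collections import defaultdict

def identify_users(entries):
    """Each distinct (IP, agent) pair is treated as a different user."""
    users = defaultdict(list)
    for e in entries:
        users[(e["ip"], e["agent"])].append(e["uri"])
    return users

log = [
    {"ip": "1.1.1.1", "agent": "Firefox/Linux", "uri": "/A"},
    {"ip": "1.1.1.1", "agent": "Firefox/Linux", "uri": "/B"},
    {"ip": "1.1.1.1", "agent": "IE/Windows",    "uri": "/A"},
]
users = identify_users(log)
# One IP, two agent strings: the heuristic yields two distinct users.
```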
23
Another heuristic (L. Chaofeng (2006) and V. Chitraa (2010)):
Use the access log in conjunction with the referrer log and site topology to construct browsing paths for each user:
“If a page is requested that is not directly reachable by a hyperlink from any of the pages visited by the user, assume that
there is another user with the same IP address.”
Following the referrer field along user 1’s path through the web site:
Unexpectedly, there is no referrer shown for the I.html request, and there is no direct link between K.html and I.html.
It is highly unlikely that the user who was traversing A → B → E → K then proceeded to I.
It is more likely that this request for I.html came from a third user, who accessed the page directly, probably by
entering the URL into the browser, using the same browser version and operating system:
User 1: A → B → E → K → E → L, User 2: A → C → G → M → H → N, User 3: I → O
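The topology check can be sketched as follows: a request not reachable by a hyperlink from any page an existing user has visited starts a new user. The site topology below is a hypothetical link map matching the example paths.

```python
# Hypothetical site topology: page -> set of pages it links to.
LINKS = {
    "A": {"B", "C"},
    "B": {"E"},
    "E": {"K", "L"},
    "I": {"O"},
}

def split_by_topology(path):
    """Split one IP's click stream into users via link reachability."""
    users = []
    for page in path:
        # Assign the page to the most recently active user who could
        # have reached it by a hyperlink; otherwise start a new user.
        for visited in reversed(users):
            if any(page in LINKS.get(p, set()) for p in visited):
                visited.append(page)
                break
        else:
            users.append([page])
    return users

result = split_by_topology(["A", "B", "E", "K", "I", "O", "E", "L"])
# [['A', 'B', 'E', 'K', 'E', 'L'], ['I', 'O']] -- I is unreachable from
# user 1's pages, so it starts a separate user, matching the example.
```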
24
• P. Yeng and Y. Zheng (2010) dedicated a method solely to user identification through inspired rules:
• Four constraints are used to identify users: IP address, agent
information, site topology and time information.
• Has low efficiency, but accuracy is increased significantly.
• Renáta Iváncsy and Sándor Juhász analyzed different user identification methods in
“Analysis of Web User Identification Methods”:
• Heuristics are not error-proof.
• Different heuristics must be selected depending on the situation and application.
25
Outline:
• Introduction
• Web Logs Files
• Phases of Web Usage Mining
• Steps of Data Preprocessing
• Data Cleaning
• User Identification
• Session Identification
• Path Completion
• Main references
26
Session Identification(Sessionization, Session Reconstruction):
Session Definition:
• The group of activities performed by a user from the moment he enters the website to the moment he leaves
it.
• A set of user clicks across web servers, usually referred to as a click stream.
• A sequence of web pages a user browses in a single access.
Session Identification:
• Grouping the different activities of a single user.
• The process of segmenting the access log of each user into individual access sessions.
Goals of Session Identification:
• Group the page accesses of each user into individual access sessions.
• Identify how much time each user has spent on the website.
• Each heuristic scans the user activity logs into which the web server log is partitioned.
Two general approaches:
• Time-oriented heuristic methods
• Navigation-oriented heuristic methods
27
Session Identification:
• Time-oriented heuristic methods:
A set of pages visited by a specific user is considered a single user session if the pages are requested within a
time interval not larger than a specified period.
First heuristic (session-duration-based):
The total session duration may not exceed a threshold θ.
t0: the timestamp of the first request in a constructed session S.
“The request with timestamp t is assigned to S iff t − t0 ≤ θ” (Liu, 2007).
θ = 30 min has been recommended from empirical findings (Spiliopoulou, Mobasher, Berendt, & Nakagawa,
2003).
Second heuristic (page-stay-time-based):
The time spent on a page may not exceed a threshold δ.
t1: the timestamp of the last request assigned to the constructed session S.
The next request, with timestamp t2, is assigned to S iff t2 − t1 ≤ δ (Liu, 2007).
A conservative threshold of δ = 10 min has been proposed to capture the time for loading
and studying the contents of a page (Spiliopoulou et al., 2003).
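Both time-oriented heuristics can be sketched in a few lines. Timestamps are in minutes for readability; the thresholds follow the recommended θ = 30 min and δ = 10 min.

```python
THETA = 30   # max total session duration, minutes
DELTA = 10   # max page-stay time, minutes

def sessionize(requests):
    """Split one user's (timestamp, page) stream into sessions using
    both the session-duration (theta) and page-stay (delta) rules."""
    sessions = []
    for t, page in requests:
        if sessions:
            t0 = sessions[-1][0][0]        # first request of current session
            t_prev = sessions[-1][-1][0]   # previous request
            if t - t0 <= THETA and t - t_prev <= DELTA:
                sessions[-1].append((t, page))
                continue
        sessions.append([(t, page)])       # start a new session
    return sessions

stream = [(0, "A"), (2, "B"), (5, "E"), (9, "K"), (45, "E"), (47, "L")]
sessions = sessionize(stream)
# The 36-minute gap after K violates both thresholds, so the stream
# splits into [A, B, E, K] and [E, L].
```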
28
Session Identification:
• Navigation-oriented heuristic methods:
Web users reach pages by following hyperlinks rather than by typing URLs.
Topology-based heuristic:
“If a web page is not connected with the previously visited pages in a session, it is considered
to belong to a different session.”
Referrer-based heuristic (Cooley et al. (1999)), based on the referrer information:
• The referrer of a requested page P should be a page already in the session (a previously
visited page); otherwise P is assigned to a different session.
• If the page has an empty referrer, it is likely to be the first page of a new session.
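The referrer-based rule can be sketched similarly. Each request carries the page and its referrer; an empty referrer opens a new session. This simplified variant checks only the most recent session, whereas a full implementation would consider all open sessions.

```python
def sessionize_by_referrer(requests):
    """requests: list of (page, referrer) pairs; '' = empty referrer.
    A request joins the current session only if its referrer was
    already visited in that session."""
    sessions = []
    for page, ref in requests:
        current = sessions[-1] if sessions else None
        if current is not None and ref and ref in current:
            current.append(page)
        else:
            sessions.append([page])   # empty or unknown referrer
    return sessions

clicks = [("A", ""), ("B", "A"), ("E", "B"), ("I", ""), ("O", "I")]
result = sessionize_by_referrer(clicks)
# [['A', 'B', 'E'], ['I', 'O']] -- the empty referrer on I opens a
# new session.
```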
29
Session Identification:
• Spiliopoulou et al. evaluate different heuristics in “A Framework for the Evaluation of Session
Reconstruction Heuristics in Web Usage Analysis”:
• Time-based methods are not reliable, because users may engage in other activities after
opening a web page.
• Referrer-based heuristics are more restrictive than topology-based heuristics, because there
are cases where a page request has an empty referrer.
• Different methods are used by different applications.
• Experiments showed that there is no best heuristic for all cases.
• Even for a simple application, two variations in the method of assessing reconstruction quality led
to significantly different precision scores among the heuristics.
• G. Shivaprasad, N.V. Subba Reddy, U. Dinesh Acharya and Prakash K. Aithal (2016) proposed:
• A combined technique based on both heuristics for session identification.
• Uses web topology and page-stay time.
30
Session Identification (time-oriented heuristic example):
For user 1, there is more than a 30-minute delay between the request for page K.html and
the second request for page E.html, so:
Session 1 (user 1): A → B → E → K
Session 2 (user 2): A → C → G → M → H → N
Session 3 (user 3): I → O
Session 4 (user 1): E → L
31
Outline:
• Introduction
• Web Logs Files
• Phases of Web Usage Mining
• Steps of Data Preprocessing
• Data Cleaning
• User Identification
• Session Identification
• Path Completion
• Main references
32
Path Completion :
• A critical phase of preprocessing.
• The number of URLs recorded in the log may be less than the real number:
• Some important page requests are not recorded in the server log due to proxy servers, use of the browser's
back button, and local caching.
• Definition:
• “The process of reconstructing the user’s navigation path by appending missed page requests (page
requests that are not recorded in the server log) in order to analyze the data properly within the identified
sessions.”
• Used to obtain the complete user access path.
33
Path Completion :
Methods similar to those used for user identification can be used for path
completion.
Heuristic methods based on the referrer log and site topology are employed.
Cooley, R., Mobasher, B., & Srivastava, J. (1999):
Missing pages are added as follows:
Check whether the requested page is directly linked to the last page:
If there is no link with the last page, check the recent request history.
If the page's referrer is found in the recent history, it is clear that the “Back” button
was used (the cached pages were not logged) until that page was reached.
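The back-button reconstruction can be sketched as follows: when a request's referrer is not the previous page but appears earlier in the path, the intermediate pages are re-inserted in reverse order until the referrer is reached. The session data below is hypothetical, shaped after the slide's session 2 example.

```python
def complete_path(session):
    """session: list of (page, referrer) pairs.
    Re-insert pages reached via the cached Back button."""
    path = []
    for page, ref in session:
        if path and ref and path[-1] != ref and ref in path:
            # Referrer is not the last page but was visited earlier:
            # the user hit "Back" (cached, so not logged). Rewind,
            # appending each revisited page, until the referrer.
            i = len(path) - 1
            while path[i] != ref:
                i -= 1
                path.append(path[i])
        path.append(page)
    return path

# Session 2 with referrers: H.html was reached from C.html, not M.html.
session2 = [("A", ""), ("C", "A"), ("G", "C"),
            ("M", "G"), ("H", "C"), ("N", "H")]
completed = complete_path(session2)
# ['A', 'C', 'G', 'M', 'G', 'C', 'H', 'N'] -- the two Back steps
# (M -> G -> C) are restored, as in the slide's example.
```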
34
Path Completion :
Considering session 2:
Session 2 (user 2): A → C → G → M → H → N
There is no direct link between page M.html and page H.html. Therefore, the user is presumed to have hit the
“Back” button on the browser twice.
The path completion process leads us to insert “→ G → C” into the session path for session 2:
Session 2 (user 2): A → C → G → M → G → C → H → N
35
Outline:
• Introduction
• Web Logs Files
• Phases of Web Usage Mining
• Steps of Data Preprocessing
• Data Cleaning
• User Identification
• Session Identification
• Path Completion
• Main references
36
Outline:
• Introduction
• Web Logs Files
• Phases of Web Usage Mining
• Steps of Data Preprocessing
• Data Cleaning
• User Identification
• Session Identification
• Path Completion
• Conclusion
• Main references
37
Main References:
1. Ong, Y.C., Ismail, Z.: Enhanced Web Log Cleaning Algorithm for Web Intrusion Detection. In: Recent
Advances in Information and Communication Technology, pp. 315–324. Springer International Publishing (2014)
2. Salama, S.E., Marie, M.I., El-Fangary, L.M., Helmy, Y.K.: Web Server Logs Preprocessing for Web Intrusion
Detection. Computer and Information Science 4, 123–133 (2011)
3. Patil, P., Patil, U.: Preprocessing of web server log file for web mining. World Journal of Science and Technology 2,
14–18 (2012)
4. Cooley, R., Mobasher, B., Srivastava, J.: Data Preparation for Mining World Wide Web Browsing Patterns. Journal
of Knowledge and Information Systems 1 (1999)
5. Das, R., Turkoglu, I.: Creating meaningful data from web logs for improving the impressiveness of a website by
using path analysis method. Expert Systems with Applications 36(3), 6635–6644 (2009)
6. Yeng, P., Zheng, Y.: Inspired Rule-Based User Identification. LNCS 6440, pp. 618–624 (2010)
7. Suneetha, K.R., Krishnamoorthi, R.: Identifying User Behavior by Analyzing Web Server Access Log File.
IJCSNS (2009)
8. …
Preprocessing of Web Log Data for Web Usage Mining
QUESTIONS?
More Related Content

PPTX
Database administrator
DOCX
Open source search engine
PDF
Data automation 101
PPTX
Explainable Machine Learning (Explainable ML)
PDF
Introduction to Neo4j - a hands-on crash course
PPTX
PDF
Unit 5
PDF
Cross-lingual Information Retrieval
Database administrator
Open source search engine
Data automation 101
Explainable Machine Learning (Explainable ML)
Introduction to Neo4j - a hands-on crash course
Unit 5
Cross-lingual Information Retrieval

What's hot (20)

PPTX
Natural Language Processing: Parsing
PPT
Web Usage Pattern
PDF
operating system structure
ODP
Data mining
PPT
File organization 1
PDF
Tutorial on Bias in Rec Sys @ UMAP2020
PPTX
Link analysis : Comparative study of HITS and Page Rank Algorithm
PPTX
Database , 8 Query Optimization
PPTX
Introduction to Text Mining
PPT
Introduction to parallel_computing
PPTX
Introduction to Machine Learning
PPT
File organization and indexing
PPT
Indexing and Hashing
PPTX
Web search vs ir
PPT
File organisation
PDF
Advance database system (part 3)
PDF
Lecture 1 introduction to parallel and distributed computing
PDF
The Evolution of Data Science
PDF
Confusion Matrix Explained
Natural Language Processing: Parsing
Web Usage Pattern
operating system structure
Data mining
File organization 1
Tutorial on Bias in Rec Sys @ UMAP2020
Link analysis : Comparative study of HITS and Page Rank Algorithm
Database , 8 Query Optimization
Introduction to Text Mining
Introduction to parallel_computing
Introduction to Machine Learning
File organization and indexing
Indexing and Hashing
Web search vs ir
File organisation
Advance database system (part 3)
Lecture 1 introduction to parallel and distributed computing
The Evolution of Data Science
Confusion Matrix Explained
Ad

Viewers also liked (20)

ODP
Personal Web Usage Mining
PDF
Web Usage Mining: A Survey on User's Navigation Pattern from Web Logs
PDF
PPT
Applying web mining application for user behavior understanding
PPTX
Web Mining Presentation Final
PPTX
imortance of w3c validation
PPTX
Learning to Classify Users in Online Interaction Networks
PDF
Knowledge discoverylaurahollink
PDF
Advance Clustering Technique Based on Markov Chain for Predicting Next User M...
PDF
Dotnet titles 2016 17
PPTX
Webmining ppt
PPT
A survey on web usage mining techniques
ODP
Data mining
PDF
Web Navigation Presentation
PPT
The FOCUS K3D Project
PPT
Knowledge discovery thru data mining
PPTX
Web mining
PPT
Fp growth algorithm
PDF
Customer Clustering For Retail Marketing
PPTX
Web mining (structure mining)
Personal Web Usage Mining
Web Usage Mining: A Survey on User's Navigation Pattern from Web Logs
Applying web mining application for user behavior understanding
Web Mining Presentation Final
imortance of w3c validation
Learning to Classify Users in Online Interaction Networks
Knowledge discoverylaurahollink
Advance Clustering Technique Based on Markov Chain for Predicting Next User M...
Dotnet titles 2016 17
Webmining ppt
A survey on web usage mining techniques
Data mining
Web Navigation Presentation
The FOCUS K3D Project
Knowledge discovery thru data mining
Web mining
Fp growth algorithm
Customer Clustering For Retail Marketing
Web mining (structure mining)
Ad

Similar to Preprocessing of Web Log Data for Web Usage Mining (20)

PDF
a novel technique to pre-process web log data using sql server management studio
PDF
Ijarcet vol-2-issue-7-2341-2343
PDF
Ijarcet vol-2-issue-7-2341-2343
PDF
Web Data mining-A Research area in Web usage mining
PDF
A Novel Method for Data Cleaning and User- Session Identification for Web Mining
PDF
Comparison of decision and random tree algorithms on
PPTX
Avtar's ppt
PDF
COMPARISON ANALYSIS OF WEB USAGE MINING USING PATTERN RECOGNITION TECHNIQUES
PDF
M0947679
PDF
Website Content Analysis Using Clickstream Data and Apriori Algorithm
PDF
IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS
PDF
IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS
PDF
IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS
PDF
Identifying the Number of Visitors to improve Website Usability from Educatio...
PPT
Web Servers
PPT
Web usage-mining
PPTX
Web usage mining
PPTX
Web usage mining
PDF
A Comparative Study of Recommendation System Using Web Usage Mining
PDF
C017231726
a novel technique to pre-process web log data using sql server management studio
Ijarcet vol-2-issue-7-2341-2343
Ijarcet vol-2-issue-7-2341-2343
Web Data mining-A Research area in Web usage mining
A Novel Method for Data Cleaning and User- Session Identification for Web Mining
Comparison of decision and random tree algorithms on
Avtar's ppt
COMPARISON ANALYSIS OF WEB USAGE MINING USING PATTERN RECOGNITION TECHNIQUES
M0947679
Website Content Analysis Using Clickstream Data and Apriori Algorithm
IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS
IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS
IDENTIFY NAVIGATIONAL PATTERNS OF WEB USERS
Identifying the Number of Visitors to improve Website Usability from Educatio...
Web Servers
Web usage-mining
Web usage mining
Web usage mining
A Comparative Study of Recommendation System Using Web Usage Mining
C017231726

Recently uploaded (20)

PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PDF
Spectral efficient network and resource selection model in 5G networks
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PDF
Empathic Computing: Creating Shared Understanding
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
Electronic commerce courselecture one. Pdf
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PPTX
MYSQL Presentation for SQL database connectivity
DOCX
The AUB Centre for AI in Media Proposal.docx
PDF
Network Security Unit 5.pdf for BCA BBA.
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
cuic standard and advanced reporting.pdf
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
Spectral efficient network and resource selection model in 5G networks
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
Empathic Computing: Creating Shared Understanding
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Advanced methodologies resolving dimensionality complications for autism neur...
Electronic commerce courselecture one. Pdf
Chapter 3 Spatial Domain Image Processing.pdf
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
Dropbox Q2 2025 Financial Results & Investor Presentation
MYSQL Presentation for SQL database connectivity
The AUB Centre for AI in Media Proposal.docx
Network Security Unit 5.pdf for BCA BBA.
The Rise and Fall of 3GPP – Time for a Sabbatical?
MIND Revenue Release Quarter 2 2025 Press Release
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
cuic standard and advanced reporting.pdf

Preprocessing of Web Log Data for Web Usage Mining

  • 1. 1 Preprocessing on Web Log Data for Web Usage Mining Shahid Rajaee Teacher Training University Faculty of Computer Engineering PRESENTED BY: Amir Masoud Sefidian
  • 2. 2 Outline: • Introduction • Web Logs Files • Phases of Web Usage Mining • Steps of Data Preprocessing • Data Cleaning • User Identification • Session Identification • Path Completion • Main references
  • 3. 3 Outline: •Introduction • Web Logs Files • Phases of Web Usage Mining • Steps of Data Preprocessing • Data Cleaning • User Identification • Session Identification • Path Completion • Main references
  • 4. 4 Introduction • Web has been growing as a dominant platform for retrieving information and discovering knowledge from web data. • Web usage analysis or web usage mining or web log mining or click stream analysis: • Process of extracting useful knowledge from web server logs, database logs, user queries, client side cookies and user profiles in order to analyze web users’ behavior. • Applies data mining techniques in log data to extract the behavior of users which is used in various applications.
  • 5. 5 Outline: • Introduction •Web Logs Files • Phases of Web Usage Mining • Steps of Data Preprocessing • Data Cleaning • User Identification • Session Identification • Path Completion • Main references
  • 6. 6 Logs TYPES OF WEB SERVER LOG FILES • Access logs: • It stores information about which files are requested from web server. • Referrer logs: • Stores information of the URLs of web pages on other sites that link to web pages. • If a user gets to one of the server‘s pages by clicking on a link from another site, the URL of that site will appear in this log. • Agent logs: • It records information about the web clients that sends requests to web server. Contain type of browser and the platform determines what a user is able to access on a web site. • Error logs: • It stores information about errors and failed requests of the web server. Types of Web log file formats • Common Log Format (CLF) • W3C extended log file format • Microsoft IIS (Internet Information Services) log file format • NCSA Common log file format
  • 7. 7 Sources of Log Data For Web Usage Mining Server side: • All the click streams are recorded into the web server log. • Contain basic information e.g. name and IP of the remote host, date and time of the request etc. • The web server stores data regarding request performed by the client. Client side: • The client itself which sends information to a repository regarding the users‘ behavior. • Done either with an ad-hoc browsing application or through client side application running standard browsers. Proxy side: • Proxy level collection is an intermediary between server level and client level. • Proxy servers collect data of groups of users accessing huge groups of web servers. We consider only the case of a Web Server Log data.
  • 8. 8 Outline: • Introduction • Web Logs Files •Phases of Web Usage Mining • Steps of Data Preprocessing • Data Cleaning • User Identification • Session Identification • Path Completion • Main references
  • 9. 9 Phases of Web Usage Mining: Data Preprocessing: • Transform the raw click stream data into a set of user profiles. • One of the most complex phase of the Web Usage Mining process. Pattern Discovery: • Extracting information from preprocessed data. • Data mining, statistics, machine learning and pattern recognition are applied to web usage data to discover user access patterns of the web. Pattern Analysis: • Extract the interesting patterns from the pattern discovery process by eliminating the irrelative patterns. • Involves : • Validation: remove the irrelative patterns • Interpretation : using visualization techniques to interpret mathematic results for humans. Our Focus
  • 10. 10 Outline: • Introduction • Web Logs Files • Phases of Web Usage Mining •Steps of Data Preprocessing • Data Cleaning • User Identification • Session Identification • Path Completion • Conclusion • Main references
  • 11. 11 Steps of Data Preprocessing
  • 12. 12 Outline: • Introduction • Web Logs Files • Phases of Web Usage Mining • Steps of Data Preprocessing • Data Cleaning • User Identification • Session Identification • Path Completion • Conclusion • Main references
• 13. 13 Data Cleaning:
• Irrelevant or redundant log records are removed.
• Cleans out accessorial resources embedded in HTML files, robot requests and error requests.
• Very little research focuses purely on web log cleaning.
Attributes Involved in Web Log Cleaning and Intrusion Detection:
• Multimedia (image, video and audio) Files:
• Categorized as useless files in web log preprocessing.
• Web log file size can be reduced to less than 50% of its original size by eliminating image requests.
• Web Robots Requests:
• Dramatically affect web site traffic statistics.
• These are not important from the mining perspective and hence must be removed.
• HTTP Status Codes:
• Log entries with unsuccessful HTTP status codes are usually eliminated during the web log cleaning process.
• The widely accepted definition of an unsuccessful HTTP status code is a code below 200 or above 299.
• For Intrusion Detection:
• [3] removes all log entries with status code 200.
• [2] argued that log entries with 200-series status codes should be retained, as they may include web attacks such as SQL injection and XSS that have been executed successfully.
• 14. 14 Attributes Involved in Web Log Cleaning
• HTTP Methods:
• Few studies have included the HTTP method as an attribute in web log cleaning.
• In the LODAP data cleaning module, all log entries with an HTTP request method other than GET are removed, as these are not significant in web usage mining.
• For Intrusion Detection:
• One study proposed that HTTP requests with the POST method should be kept.
• Another proposed keeping log entries with HTTP GET and HEAD requests to obtain more accurate referrer information.
• Other Files:
• Log entries requesting accessorial resources (e.g. CSS files) embedded in HTML files should be removed.
• 15. 15 Algorithm Design of the Newest Method:
• This method is used for data cleaning and intrusion detection from log files.
• A total of six cleaning conditions are applied:
First: Logs with HTTP status code 200 will be removed (the probability that web logs with this status contain malicious web attacks is almost zero).
Second:
• Web logs with multimedia file extensions will be removed if:
• The HTTP request method is not POST, and
• The HTTP status code is not in the 400 or 500 series.
• Web logs with 400- and 500-series status codes should be kept, as these may indicate malicious attempts.
• Users who trigger many web logs with HTTP error status codes are subject to suspicion.
• Commonly, to launch a web defacement attack, an attacker uses the HTTP POST method to replace part or all of the web interface components.
Third:
• Legitimate web robot requests, such as Googlebot, will be removed.
• Specific IP addresses are included in a web robot IP whitelist.
• Web logs with web robot requests from whitelisted IP addresses are removed.
• 16. 16 Algorithm Design of the Newest Method:
Fourth: Remove web logs with legitimate file extensions (.css, .pdf, .txt and .doc) if:
• The HTTP status code is not in the 400 or 500 series, and the HTTP method is not POST.
Fifth:
• Web logs with the HTTP HEAD method (used by web monitoring systems) from a legitimate IP address will be removed.
• A large number of HTTP HEAD requests may indicate malicious web robot activity.
Sixth:
• Web logs with the HTTP POST method will be removed if the posted file is legitimate.
• For instance, a web log with a .svc file extension in the uri-stem and the HTTP POST method is legitimate: a .svc file is a special content file representing a Windows Communication Foundation (WCF) service hosted in IIS.
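The six conditions above can be sketched as a single filter predicate. This is a minimal, hedged sketch: the log-entry field names, the extension lists, the robot IP whitelist and the monitoring-host set are illustrative assumptions, not values taken from the paper.

```python
# Sketch of the six cleaning conditions. Field names, extension lists
# and the whitelists below are illustrative assumptions.
MULTIMEDIA_EXTS = ('.jpg', '.gif', '.png', '.mp3', '.mp4', '.avi')
LEGIT_EXTS = ('.css', '.pdf', '.txt', '.doc')
ROBOT_IP_WHITELIST = {'66.249.66.1'}   # hypothetical whitelisted Googlebot IP
MONITOR_IPS = {'10.0.0.5'}             # hypothetical web-monitoring host
LEGIT_POST_EXTS = ('.svc',)            # e.g. WCF service endpoints

def is_error(status):
    """400- and 500-series responses are kept as possible attack traces."""
    return 400 <= status <= 599

def should_remove(entry):
    uri = entry['uri'].lower()
    status, method, ip = entry['status'], entry['method'], entry['ip']
    if status == 200:                                                   # 1st
        return True
    if uri.endswith(MULTIMEDIA_EXTS) and method != 'POST' and not is_error(status):
        return True                                                    # 2nd
    if ip in ROBOT_IP_WHITELIST:                                       # 3rd
        return True
    if uri.endswith(LEGIT_EXTS) and method != 'POST' and not is_error(status):
        return True                                                    # 4th
    if method == 'HEAD' and ip in MONITOR_IPS:                         # 5th
        return True
    if method == 'POST' and uri.endswith(LEGIT_POST_EXTS):             # 6th
        return True
    return False

# A failed request is kept for intrusion analysis; a 200 is cleaned away.
kept = not should_remove({'uri': '/login.php', 'status': 404,
                          'method': 'GET', 'ip': '1.2.3.4'})   # True
```

The order of the checks mirrors the order of the conditions on the slides; a real implementation would parse the IIS log lines into such dicts first.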
• 17. 17 Implementation
Web log format: Internet Information Services (IIS) log format.
Simulated attacks were carried out using three web vulnerability assessment tools:
• Acunetix (run on Microsoft Windows)
• Nikto and w3af (run on BackTrack GNOME)
An e-commerce web server is configured to send web logs to the log collector server via the User Datagram Protocol (UDP).
Architectural Diagram for Simulation Attack and Web Log Collection
• 18. 18 Comparison of existing frameworks:
[1]: Salama, S.E., Marie, M.I., El-Fangary, L.M., Helmy, Y.K. 2011
[2]: Patil, P., Patil, U. 2012
[3]: Yew Chuan Ong and Zuraini Ismail 2014
[1] and [2] considered only three file extensions (.jpg, .gif and .css); algorithm [3] defined a total of sixteen multimedia file extensions plus four other file extensions.

Comparison Factor              | [1]                         | [2]  | [3]
Multimedia Files               | Yes                         | Yes  | Yes
Web Robots Request             | No                          | No   | Yes
HTTP Status Code               | 200, 400 series, 500 series | 200  | 200, 400 series, 500 series
HTTP Method                    | No                          | GET  | GET, POST, HEAD
Other Files                    | Yes                         | Yes  | Yes
Number of Rules and Conditions | 2                           | 1    | 6
• 19. 19 Evaluation of existing frameworks:
Evaluating cleaning capability:
• Size of the web log file in bytes.
• Number of web log entries, based on the total number of lines in the web log file.
Percentage of reduction = (total # of web log entries removed / total # of web log entries) × 100%
The higher the percentage of reduction, the better the cleaning capability.
Evaluating intrusion detection readiness:
False negative rate = (total # of malicious requests removed) / (total # of malicious requests)
The lower the false negative rate, the better the intrusion detection readiness.

Measuring Factor            | [1]     | [2]      | [3]
File Size Reduced (bytes)   | 6945603 | 32423581 | 18957149
Number of Entries Removed   | 52916   | 215616   | 153372
Percentage of Reduction (%) | 13.94   | 56.81    | 40.41
False Negative Rate         | 0.00144 | 0.15789  | 0.00531

Algorithm [3] has the second highest percentage of reduction and the second lowest false negative rate compared to the other algorithms.
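The two measures are straightforward to compute; a small sketch, with made-up counts for illustration rather than the figures reported in the study:

```python
# The two evaluation measures as plain functions.
def reduction_pct(entries_removed, entries_total):
    """Cleaning capability: a higher percentage of reduction is better."""
    return 100.0 * entries_removed / entries_total

def false_negative_rate(malicious_removed, malicious_total):
    """Intrusion detection readiness: a lower rate is better."""
    return malicious_removed / malicious_total

# Hypothetical counts, not the study's data:
pct = reduction_pct(40_000, 100_000)   # 40.0 (%)
fnr = false_negative_rate(2, 400)      # 0.005
```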
  • 20. 20 Outline: • Introduction • Web Logs Files • Phases of Web Usage Mining • Steps of Data Preprocessing • Data Cleaning • User Identification • Session Identification • Path Completion • Main references
• 21. 21 User Identification:
 Identify each distinct user.
 User identification is one way of introducing state into the stateless web.
 A very complex task because of proxy servers and caches.
1. User identification by IP address: "Each different IP address represents a different user."
Problems:
• Several users can share the same IP address or computer (e.g. college, internet café).
• One user can have different IP addresses, since a user accessing the Web from different machines will have a different IP address on each.
2. User identification using user registration data: if users log in, it is easy to identify them; the username is also stored in the web log files.
Problems:
• This facility is not available on every website, so it is not appropriate for general web browsing.
• Many users do not register their information.
3. User identification using cookies: cookies are HTTP headers in string format. Using cookies, we can extract the details of users and the resources they access.
Problems:
• Users can block the use of cookies.
• Users can delete cookies.
• 22. 22 Two heuristics have been proposed to help identify unique users:
• (P. Pirolli, J. Pitkow, and R. Rao) and (K.R. Suneetha and R. Krishnamoorthi (2009)) proposed:
"Even if the IP address is the same, if the agent log shows a change in browser software or operating system, then each different agent type for an IP address represents a different user."
User 1: A → B → E → K → I → O → E → L
User 2: A → C → G → M → H → N
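The agent-based heuristic amounts to keying users on the (IP address, user agent) pair. A minimal sketch, assuming log records have already been reduced to (ip, agent, page) tuples (an assumed layout for the example):

```python
# (IP, user agent) heuristic: the same IP with a different browser/OS
# string is attributed to a different user.
def identify_users(records):
    users = {}                        # (ip, agent) -> list of visited pages
    for ip, agent, page in records:
        users.setdefault((ip, agent), []).append(page)
    return list(users.values())

log = [
    ('1.2.3.4', 'Mozilla/5.0 (Windows NT 10.0)', 'A'),
    ('1.2.3.4', 'Mozilla/5.0 (X11; Linux)',      'A'),
    ('1.2.3.4', 'Mozilla/5.0 (Windows NT 10.0)', 'B'),
]
print(identify_users(log))   # [['A', 'B'], ['A']]: two users behind one IP
```

Plain dicts preserve insertion order in Python 3.7+, so the users come out in the order they first appear in the log.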
• 23. 23 Another heuristic (L. Chaofeng (2006) and V. Chitraa (2010)):
Use the access log in conjunction with the referrer log and the site topology to construct browsing paths for each user:
"If a page is requested that is not directly reachable by a hyperlink from any of the pages visited by the user, assume that there is another user with the same IP address."
Following the referrer field along user 1's path through the web site, there is unexpectedly no referrer shown for the I.html request, and there is no direct link between K.html and I.html.
It appears highly unlikely that the user who was traversing A → B → E → K then proceeded to I. It is more likely that this request for page I.html came from a third user, who accessed the page directly, probably by entering the URL into the browser, using the same browser version and operating system:
User 1: A → B → E → K → E → L
User 2: A → C → G → M → H → N
User 3: I → O
• 24. 24
• P. Yeng and Y. Zheng (2010) dedicated a method to user identification through inspired rules:
• Four constraints are used to identify users: IP address, agent information, site topology and time information.
• It has low efficiency, but accuracy increases significantly.
• Renáta Iváncsy and Sándor Juhász analyzed different user identification methods in "Analysis of Web User Identification Methods":
• Heuristics are not error-proof.
• Different heuristics must be selected depending on the situation and application.
  • 25. 25 Outline: • Introduction • Web Logs Files • Phases of Web Usage Mining • Steps of Data Preprocessing • Data Cleaning • User Identification • Session Identification • Path Completion • Main references
• 26. 26 Session Identification (Sessionization, Session Reconstruction):
Session definitions:
• A group of activities performed by a user from the moment he enters the website to the moment he leaves it.
• A set of user clicks across web servers, usually referred to as a click stream, is defined as a user session.
• A sequence of web pages a user browses in a single access.
Session identification:
• Grouping the different activities of a single user.
• The process of segmenting the access log of each user into individual access sessions.
Goals of session identification:
• Group the page accesses of each user into individual access sessions.
• Identify which user has spent how much time on the website.
• Each heuristic h scans the user activity logs into which the web server log is partitioned.
Two general approaches:
• Time-oriented heuristic methods
• Navigation-oriented heuristic methods
• 27. 27 Session Identification:
• Time-oriented heuristic methods: a set of pages visited by a specific user is considered a single session if the pages are requested within a time interval no larger than a specified period.
First heuristic: the total session duration may not exceed a threshold θ.
• Let t₀ be the timestamp of the first request in a constructed session S.
• The request with timestamp t is assigned to S iff t − t₀ ≤ θ (Liu, 2007).
• θ = 30 min has been recommended from empirical findings (Spiliopoulou, Mobasher, Berendt, & Nakagawa, 2003).
Second heuristic (the page-stay-time-based method): the time spent on a page may not exceed a threshold δ.
• Let t₁ be the timestamp of the last request assigned to the constructed session S.
• The next request, with timestamp t₂, is assigned to S iff t₂ − t₁ ≤ δ (Liu, 2007).
• A conservative page-stay threshold of δ = 10 min has been proposed to capture the time for loading and studying the contents of a page (Spiliopoulou et al., 2003).
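Both time-oriented heuristics can be combined in one pass over a user's request stream. A sketch, assuming requests arrive as (timestamp-in-minutes, page) pairs already grouped per user:

```python
# Time-oriented sessionization: close the current session when total
# duration exceeds THETA (30 min) or the stay on the previous page
# exceeds DELTA (10 min).
THETA = 30   # session-duration threshold, minutes
DELTA = 10   # page-stay threshold, minutes

def sessionize(requests):
    sessions, current, t0 = [], [], None
    for t, page in requests:
        if current and (t - t0 > THETA or t - current[-1][0] > DELTA):
            sessions.append([p for _, p in current])   # close the session
            current, t0 = [], None
        if t0 is None:
            t0 = t
        current.append((t, page))
    if current:
        sessions.append([p for _, p in current])
    return sessions

# User 1 from the running example: a long gap before the second E splits it.
reqs = [(0, 'A'), (2, 'B'), (5, 'E'), (8, 'K'), (45, 'E'), (47, 'L')]
print(sessionize(reqs))   # [['A', 'B', 'E', 'K'], ['E', 'L']]
```

This reproduces the split shown on the later example slide: the 30-minute threshold separates the request for K.html from the second request for E.html.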
• 28. 28 Session Identification:
• Navigation-oriented heuristic methods: web users reach pages by following hyperlinks rather than by typing URLs.
Topology-based heuristic:
• "If a web page is not connected with a previously visited page in a session, then it is considered a different session."
Referrer-based heuristic (Cooley et al. (1999)), based on the referrer information:
• The referrer of a requested page P should be a page already in the session (a previously visited page); otherwise P is assigned to a different session.
• If the page has an empty referrer, then it is likely to be the first page of a new session.
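The referrer-based rule can be sketched as follows, assuming each record is a (page, referrer) pair (an assumed input format; an empty string stands for an empty referrer):

```python
# Referrer-based sessionization: a request extends the current session
# only if its referrer was already visited in that session; an empty or
# unseen referrer opens a new session.
def referrer_sessions(records):
    sessions, current = [], []
    for page, referrer in records:
        if not current or (referrer and referrer in current):
            current.append(page)
        else:
            sessions.append(current)   # referrer empty or not in session
            current = [page]
    if current:
        sessions.append(current)
    return sessions

records = [('A', ''), ('B', 'A'), ('E', 'B'), ('I', ''), ('O', 'I')]
print(referrer_sessions(records))   # [['A', 'B', 'E'], ['I', 'O']]
```

The second session starts at I because its referrer is empty, mirroring the earlier user-identification example where I.html arrived with no referrer.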
• 29. 29 Session Identification:
• Spiliopoulou et al. evaluate different heuristics in "A Framework for the Evaluation of Session Reconstruction Heuristics in Web Usage Analysis":
• Time-based methods are not reliable, because users may engage in other activities after opening a web page.
• Referrer-based heuristics are more restrictive than topology-based heuristics, because there are cases where a page request has an empty referrer.
• Different methods are used by different applications.
• Experiments showed that there is no best heuristic for all cases.
• Even for a simple application, two variations in the method of assessing reconstruction quality led to significantly different precision scores among the heuristics.
• G. Shivaprasad, N.V. Subba Reddy, U. Dinesh Acharya and Prakash K. Aithal (2016) proposed:
• A combined technique for session identification based on both heuristics, using the web topology and page-stay time.
• 30. 30 Session Identification (Time-oriented heuristic example):
For user 1, there is more than a 30-minute delay between the request for page K.html and the second request for page E.html, so:
Session 1 (user 1): A → B → E → K
Session 2 (user 2): A → C → G → M → H → N
Session 3 (user 3): I → O
Session 4 (user 1): E → L
  • 31. 31 Outline: • Introduction • Web Logs Files • Phases of Web Usage Mining • Steps of Data Preprocessing • Data Cleaning • User Identification • Session Identification • Path Completion • Main references
• 32. 32 Path Completion:
• A critical phase in preprocessing.
• The number of URLs recorded in the log may be less than the real number:
• Some important page requests are not recorded in the server log because of proxy servers, local caching, and presses of the browser's back button.
• Definition:
• "The process of reconstructing the user's navigation path, by appending missed page requests (page requests that are not recorded in the server log), in order to analyze the data properly within the identified sessions."
• Used to obtain the complete user access path.
• 33. 33 Path Completion:
Methods similar to those used for user identification can be used for path completion.
Heuristic methods based on the referrer log and the site topology are employed.
Cooley, R., Mobasher, B., & Srivastava, J. (1999): missing pages are added as follows:
• Check whether the requested page is directly linked to the last page.
• If there is no link with the last page, check the recent request history.
• If the page is available in the recent history, it is assumed that the user pressed the "Back" button, traversing cached pages until reaching a page that links to the requested one.
• 34. 34 Path Completion:
Considering session 2: A → C → G → M → H → N
There is no direct link between page M.html and page H.html. Therefore, the user is presumed to have hit the "Back" button on the browser twice.
The path completion process leads us to insert "→ G → C" into the session path for session 2:
Session 2 (user 2): A → C → G → M → G → C → H → N
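The back-button inference above can be sketched against a site topology. A minimal sketch in the spirit of Cooley et al.; the LINKS map is a hypothetical fragment of the example site, not the paper's data structure:

```python
# Back-button path completion: when a request is not linked from the
# current page, walk back through already-visited pages (newest first)
# until one that links to it, appending each revisited page.
LINKS = {  # hypothetical topology matching the running example
    'A': {'B', 'C'}, 'C': {'G', 'H'}, 'G': {'M'},
    'M': set(), 'H': {'N'}, 'N': set(),
}

def complete_path(path):
    full = [path[0]]
    for page in path[1:]:
        if page not in LINKS.get(full[-1], set()):
            # assume "Back" presses through the pages visited so far
            for prev in reversed(full[:-1]):
                full.append(prev)
                if page in LINKS.get(prev, set()):
                    break
        full.append(page)
    return full

print(complete_path(['A', 'C', 'G', 'M', 'H', 'N']))
# ['A', 'C', 'G', 'M', 'G', 'C', 'H', 'N']
```

Run on session 2, this inserts the two back steps G and C before H, exactly as in the worked example.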
• 35. 35 Outline: • Introduction • Web Logs Files • Phases of Web Usage Mining • Steps of Data Preprocessing • Data Cleaning • User Identification • Session Identification • Path Completion • Main references
• 37. 37 Main References:
1. Ong, Y.C., & Ismail, Z. (2014). Enhanced web log cleaning algorithm for web intrusion detection. In Recent Advances in Information and Communication Technology (pp. 315–324). Springer International Publishing.
2. Salama, S.E., Marie, M.I., El-Fangary, L.M., & Helmy, Y.K. (2011). Web server logs preprocessing for web intrusion detection. Computer and Information Science, 4, 123–133.
3. Patil, P., & Patil, U. (2012). Preprocessing of web server log file for web mining. World Journal of Science and Technology, 2, 14–18.
4. Cooley, R., Mobasher, B., & Srivastava, J. (1999). Data preparation for mining World Wide Web browsing patterns. Journal of Knowledge and Information Systems, 1.
5. Das, R., & Turkoglu, I. (2009). Creating meaningful data from web logs for improving the impressiveness of a website by using path analysis method. Expert Systems with Applications, 36(3), 6635–6644.
6. Yeng, P., & Zheng, Y. (2010). Inspired rule-based user identification. LNCS 6440, pp. 618–624.
7. Suneetha, K.R., & Krishnamoorthi, R. (2009). Identifying user behavior by analyzing web server access log file. IJCSNS.
8. …