Database Optimization for Large
Scale Web Access Log Management
JAY J. PATEL (UDIT, UNIVERSITY OF MUMBAI)

Catalogue
• Overview
• Introduction
• Related Concepts & Terms: Hash, Bucket
• Web Access Log Data
• Optimization Of Database Design: Pre-Processor, System Analysis & Design, Transaction Processing, Log Analysis Algorithm, Sorting Frame
• Performance Evaluation: Experiment Environment, Experiment Data, Worst Case, Log File
• Experiment Result: Data Input Time, Writing Data to File, Creation Of Statistical Information, Run Time, Memory Usage
• Conclusion
Overview
• This presentation discusses an optimized method of using a commercial DBMS to analyse large-scale web access logs. An in-memory pre-processor is developed to improve performance, along with a management tool for user friendliness.
• The research has three stages. The first stage is data collection: the characteristics of large-scale web access log data are studied and commercial DBMS tuning techniques are investigated. In the second stage, the system model is designed. The proposed system has three components: a pre-processor, the DBMS, and a management tool. To improve the effectiveness of the system, the pre-processor hashes and sorts log data in memory, and the DBMS is tuned to improve performance. In the final stage, the system is implemented, its performance is evaluated, and a system management tool is developed for user friendliness.
Introduction
• A web access log generally collects data for analysis, such as click rates, types of users, how many users access each web page, and time of use.
• Commercial DBMSs with high performance and high scalability, such as those from Oracle, Microsoft, and IBM, are used by many companies in industry.
• Optimizing the performance of such a DBMS requires a wide and deep understanding of the OS, the hardware, the DBMS itself, and the application, because inappropriate optimization causes performance degradation.
• The main point of this study is how to optimize a DBMS to efficiently store and manage web access log data.
Web Access Log Data

[Figure: the system using only a DB vs. the proposed system using the Pre-Processor]
A. Hash
• A hash is a value stored at a specific address computed by a hash function. After a value is processed by the hash function, it is mapped into a hash table. A hash table is composed of buckets that store data and hash addresses that correspond to key values. When hash addresses are formed by the hash function, different keys may map to the same address; this is called a hash collision.
• ADVANTAGE: specific data can be retrieved with only one access using its key.

B. Bucket
• A value is mapped into a hash table composed of n buckets. Each bucket has m slots, and each slot can store only one value.
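The bucket structure above can be sketched as a minimal hash table with n buckets of m slots each; the class and method names here are illustrative, not from the original system.

```python
class BucketHashTable:
    """A hash table of n buckets, each with m slots; one value per slot."""

    def __init__(self, n_buckets=8, slots_per_bucket=4):
        self.n = n_buckets
        self.m = slots_per_bucket
        self.buckets = [[] for _ in range(n_buckets)]

    def _address(self, key):
        # the hash function maps a key to a bucket (its hash address)
        return hash(key) % self.n

    def put(self, key, value):
        bucket = self.buckets[self._address(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:              # key already present: overwrite its slot
                bucket[i] = (key, value)
                return True
        if len(bucket) < self.m:      # a free slot is available
            bucket.append((key, value))
            return True
        return False                  # bucket full: collision overflow

    def get(self, key):
        # one bucket access suffices to locate the value for a key
        for k, v in self.buckets[self._address(key)]:
            if k == key:
                return v
        return None
```

Two distinct keys landing in the same full bucket (the `put` returning `False`) is the hash-collision case described above.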
Optimization Of Database Design

[Figure: the system structure]
B. Pre-Processor:
• The Pre-Processor removes redundant data by reading the collected web log data; its role is to pre-process the data before it is stored in the database. Collected data is transferred through the log collection server, and each record read is stored in a hash table using a key. If redundant data for a URL is already in the hash table, the Pre-Processor increases its count value, building statistical information about the collected URLs.
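The de-duplication step above can be sketched as follows; the three-field line format is an assumption for illustration, not the paper's actual log layout.

```python
from collections import defaultdict

def preprocess(log_lines):
    """Aggregate raw log lines into URL -> count statistics in memory.

    Assumed line format: "<time> <source_ip> <url>".
    """
    stats = defaultdict(int)
    for line in log_lines:
        _, _, url = line.split(maxsplit=2)
        stats[url] += 1  # a redundant URL only bumps its counter
    return dict(stats)
```

Only the aggregated (URL, count) pairs would then be written to the DB, which is what reduces the input volume compared to storing every raw line.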
C. System Analysis and Design:
• With a key/value pair, data is divided into a key and a value. The time and source IP are stored in the key, while the remaining information and a count value are stored in the value. Keys are compared, and the count value is increased if the key already exists in the hash table. Because the hash table in this study uses a 32-bit hash value, the collision probability for a pair of keys is 1/4,294,967,296.
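A 32-bit keying scheme like the one described can be illustrated with CRC-32 as a stand-in hash function (the study does not specify which 32-bit hash is used):

```python
import zlib

def key_hash32(timestamp, source_ip):
    """Hash a (time, source IP) key into the 32-bit address space."""
    key = f"{timestamp}|{source_ip}".encode()
    return zlib.crc32(key)  # unsigned value in [0, 2**32 - 1]

# with 2**32 possible hash values, two fixed distinct keys collide
# with probability 1 / 4,294,967,296
assert 2 ** 32 == 4_294_967_296
```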

D. Transaction Processing:
• The statistical information made by the Pre-Processor mostly contains URLs whose access frequency is 1. URLs with high access frequency are typically those fetched without a direct request, such as images and advertisements embedded in a web page.
• Transaction processing is a way to sort out such irrelevant data: all data related to one requested page is considered one transaction. The log is processed in n-second units, each treated as one transaction. After processing, the log is split into two lists: a result list and a filter list. The result list is the unfiltered list, that is, the aggregate of unfiltered URLs in the whole log; the filter list holds the entries that were filtered out.
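The windowing and splitting described above can be sketched as follows; the suffix-based rule for spotting embedded resources is an illustrative assumption, not the paper's actual transaction filter.

```python
# suffixes treated as embedded resources (images, scripts, ads), assumed
FILTERED_SUFFIXES = (".png", ".jpg", ".gif", ".css", ".js")

def process_transactions(entries, n_seconds=2):
    """Split (timestamp, url) entries, sorted by time, into a result list
    and a filter list, tagging each entry with its n-second transaction."""
    result_list, filter_list = [], []
    window_start = None
    for ts, url in entries:
        if window_start is None or ts - window_start >= n_seconds:
            window_start = ts  # start a new n-second transaction window
        if url.endswith(FILTERED_SUFFIXES):
            filter_list.append((window_start, url))   # embedded resource
        else:
            result_list.append((window_start, url))   # direct page request
    return result_list, filter_list
```

Entries sharing a window start belong to the same transaction, so the resources a page pulls in right after being requested end up grouped with it.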
Access Log

[Figure: access log sample]

Filter List & Result List

[Figure: filter list, result list, and the transaction filter]
Performance Evaluation
A. Experiment Environment
B. Experiment Data
1. Zipf Distribution:
• Zipf's Law is used to create realistic data for the experiment: the frequency of an item is inversely proportional to its rank, so the n-th most frequent item occurs with frequency proportional to 1/n.
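Zipf-distributed experiment data of this kind can be generated as in the sketch below, where the k-th most popular URL is drawn with probability proportional to 1/k**s; the function name and parameters are illustrative (s = 0.2 matches the smallest Zipf parameter used in the results).

```python
import random

def zipf_sample(n_items, n_requests, s=0.2, seed=42):
    """Draw n_requests item ranks from a Zipf(s) distribution over
    ranks 1..n_items. Smaller s means flatter, less redundant data."""
    rng = random.Random(seed)
    weights = [1.0 / (k ** s) for k in range(1, n_items + 1)]
    return rng.choices(range(1, n_items + 1), weights=weights, k=n_requests)
```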
C. Worst Case:
• In the worst case (WC), the collected web access log contains no duplicates and web accesses follow no fixed distribution. We expect the worst performance because all collected data is stored in the DB and no de-duplication is carried out.

D. Log File:
• The sample data used to create the experiment data is collected with URL Snooper and is composed of the collection time, source IP, destination IP, domain, and URL. A server is set up to collect over a million records.
EXPERIMENT RESULT
A. Data Input Time
• With Zipf parameter 0.2, which produces the most redundant data, storing 376,934 logs to the DB takes 437.5520 seconds; the worst case, which contains no redundant data, takes 1044.4392 seconds.

[Figure: input time]
B. Writing Data to a File

[Figure: file write time]

C. Creation of Statistical Information:

[Figure: data input time to the hash table]
CONCLUSION:
• We came to understand the architecture and features of a DBMS through the storage method, index structure, and data storage structure of a commercial DBMS. Using these features, we built an optimized DBMS configuration and a tool for handling web access log data, providing fundamental knowledge for developing such software.
• The input time from the Pre-Processor to the DB depends on how often de-duplication occurs in the hash table. As for memory usage, the Pre-Processor uses about twice as much memory as the DB when creating statistical information. However, the run time of the Pre-Processor, which operates in memory, is 18 to 20 times better than that of the DB.
• In summary, using the in-memory Pre-Processor rather than only the DB is expected to give better performance, run time, and result data than the traditional system. Furthermore, this study is expected to be applicable in other areas such as network security.
References
• Minjae Lee, Jaehun Lee, Dohyung Kim, Taeil Kim, Sooyong Kang, "Database Optimization for Large Scale Web Access Log Management," Dept. of Electronics & Computer Engineering, Hanyang University.
