Major Seminar
                             On
        Knowledge Discovery from Web Logs




Guided By:                                       Presented By:
Saurabh Anand                                    Avtar kishore Gaur
Lecturer                                         (IT/09/53)
Department Of IT                                 VIII Sem, IT

                   Poornima College Of Engineering
                          Sitapura,Jaipur
Introduction
• A vast amount of Web site traversal information is available
  in the form of Web logs.
• By analyzing these logs, it is possible to discover various
  kinds of knowledge, which can be applied to improve the
  performance of Web services.
• It is possible to learn the behavior of the Web users by
  analyzing these logs.
Introduction
• A particular kind of knowledge that can be applied immediately
  to the operation of a Web site is called
  actionable knowledge.
• Mining such knowledge is known as Knowledge
  Discovery from Web Logs.
How big is the Web
• More than 4 billion websites are on the Internet (according to
  alexa.com).

• At least 7.92 billion pages as of Thursday, 23
  February 2012 (according to worldwidewebsize.com).
History
• Earlier approaches aimed only to mine Web-log
  knowledge for human consumption.
• These days, actionable knowledge mined from Web logs is
  used to improve the performance of Web services.
Fields in Web Log File
• Reference website: www.hdwally.com (Web server: Apache)
         1. 66.249.71.6 - - [23/Feb/2012:06:23:46 -0600] "GET
           /robots.txt HTTP/1.1" 500 7370 "-" "Mozilla/5.0
           (compatible; Googlebot/2.1;
           +http://www.google.com/bot.html)"
         2. 180.76.5.92 - - [23/Feb/2012:06:11:04 -0600] "GET /
           HTTP/1.1" 500 7370 "-" "Mozilla/5.0 (compatible;
           Baiduspider/2.0;
           +http://www.baidu.com/search/spider.html)"
• IP address: 66.249.71.6 and 180.76.5.92
• Identity and user name: - - and - - (not recorded)
• Timestamp: [23/Feb/2012:06:23:46 -0600] and
  [23/Feb/2012:06:11:04 -0600] (time at which the Web server
  received the request)
Fields in Web Log File
• Access request: "GET /robots.txt HTTP/1.1" and "GET /
  HTTP/1.1"
• Result status code: 500 and 500 (Internal Server Error)
• Bytes transferred: 7370 and 7370
• Referrer URL: "-" and "-" (no referrer was sent)
• User agent: "Mozilla/5.0 (compatible; Googlebot/2.1;
  +http://www.google.com/bot.html)" and "Mozilla/5.0
  (compatible; Baiduspider/2.0;
  +http://www.baidu.com/search/spider.html)"
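The field breakdown above can be sketched as a small parser. This is a minimal sketch, assuming the log follows the Apache "combined" format seen in the two sample lines; the names `COMBINED_LOG` and `parse_line` are illustrative and not part of the original slides.

```python
import re
from datetime import datetime

# Pattern for one line in the Apache "combined" log format:
# host, identity, user, [time], "request", status, bytes, "referrer", "agent".
COMBINED_LOG = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = COMBINED_LOG.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # %b expects English month abbreviations (the default C locale).
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # Apache writes "-" when no bytes were sent.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

sample = ('66.249.71.6 - - [23/Feb/2012:06:23:46 -0600] '
          '"GET /robots.txt HTTP/1.1" 500 7370 "-" '
          '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')
entry = parse_line(sample)
print(entry["host"], entry["status"], entry["referrer"], entry["agent"])
```

Note that the referrer field is the quoted value right after the byte count, and the user agent is the whole final quoted string, not just its first token.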
Example Of a Web Log File
• fcrawler.looksmart.com - - [26/Apr/2000:00:00:12 -0400]
  "GET /contacts.html HTTP/1.0" 200 4595 "-"
  "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)"
• fcrawler.looksmart.com - - [26/Apr/2000:00:17:19 -0400]
  "GET /news/news.html HTTP/1.0" 200 16716 "-"
  "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)"
• 123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET
  /pics/wpaper.gif HTTP/1.0" 200 6248
  "http://www.jafsoft.com/asctortf/" "Mozilla/4.05
  (Macintosh; I; PPC )"
Mining Web Logs for Path Profiles
•   Data Cleaning on Web Log Data
•   Mining Web Logs for Path Profiles
•   Web Object Prediction
•   Learning to Prefetch Web Documents
Data Cleaning on Web Log Data
• Break each user's long sequence of visits into user
  sessions.
• Identify each user by an individual IP address.
• Thus, data cleaning here means separating the sequence of
  visited pages into visiting sessions.
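One way to picture this step is the sketch below. It assumes visits have already been parsed into (IP, timestamp, URL) tuples and that a gap of more than 30 minutes starts a new session; the 30-minute timeout and the names `SESSION_GAP` and `split_sessions` are assumptions for illustration, not values taken from the slides.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity timeout

def split_sessions(visits, gap=SESSION_GAP):
    """Group (ip, timestamp, url) visits into per-IP lists of URL sessions."""
    by_ip = defaultdict(list)
    for ip, when, url in visits:
        by_ip[ip].append((when, url))

    sessions = defaultdict(list)
    for ip, hits in by_ip.items():
        hits.sort(key=lambda hit: hit[0])  # order each user's hits by time
        last_seen = None
        for when, url in hits:
            # A first hit, or a long pause, opens a new session.
            if last_seen is None or when - last_seen > gap:
                sessions[ip].append([])
            sessions[ip][-1].append(url)
            last_seen = when
    return dict(sessions)

visits = [
    ("10.0.0.1", datetime(2012, 2, 23, 6, 0), "/"),
    ("10.0.0.1", datetime(2012, 2, 23, 6, 5), "/news.html"),
    ("10.0.0.2", datetime(2012, 2, 23, 6, 6), "/"),
    ("10.0.0.1", datetime(2012, 2, 23, 9, 0), "/contacts.html"),
]
sessions = split_sessions(visits)
print(sessions)
```

Identifying users by IP address alone is a simplification: several users behind one proxy share an address, so their visits would be merged into the same sessions.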
Web Log Mining for Prefetching
• We now have separate visiting sessions.
• From these sessions we can develop path profiles, because a
  user visiting a sequence of Web pages leaves a trail of
  the pages' URLs in the Web log.
• A path profile consists of the frequent subsequences of the
  frequently occurring paths.
• A path profile helps us predict the pages that are
  most likely to be requested next.
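A path profile of this kind could be built roughly as follows. This is a minimal sketch that counts contiguous URL subsequences across sessions and keeps those seen at least `min_support` times; the function name, `max_len`, and `min_support` are hypothetical parameters chosen for illustration.

```python
from collections import Counter

def path_profile(sessions, max_len=3, min_support=2):
    """Count contiguous URL paths of length 2..max_len and keep frequent ones."""
    counts = Counter()
    for session in sessions:
        for length in range(2, max_len + 1):
            for start in range(len(session) - length + 1):
                # Every occurrence counts, even repeats within one session.
                counts[tuple(session[start:start + length])] += 1
    return {path: n for path, n in counts.items() if n >= min_support}

sessions = [
    ["/", "/news.html", "/contacts.html"],
    ["/", "/news.html", "/pics.html"],
    ["/news.html", "/contacts.html"],
]
profile = path_profile(sessions)
print(profile)
```

Here the paths "/" to "/news.html" and "/news.html" to "/contacts.html" each occur twice and survive the threshold, while the longer three-page paths occur only once and are dropped.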
Web Object Prediction
• It is possible to train a path-based model that predicts
  future URLs from a sequence of current URL accesses.
• This can be done on a per-user basis or on a per-server
  basis.
• The former requires that user sessions be recognized
  and separated by a filtering system; the
  latter takes the simplistic view that the accesses on a server
  form a single long thread.
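As an illustration of a path-based predictor, the sketch below uses a simple n-gram (Markov-style) model: it records which URL followed each short context of recent URLs and predicts the most frequent follower. This is one possible model under stated assumptions, not necessarily the one used in the cited papers; `train`, `predict`, and `order` are illustrative names.

```python
from collections import Counter, defaultdict

def train(sessions, order=2):
    """Map each context of up to `order` URLs to a Counter of next URLs."""
    model = defaultdict(Counter)
    for session in sessions:
        for i in range(1, len(session)):
            for k in range(1, order + 1):
                if i - k < 0:
                    break  # not enough history for a context this long
                model[tuple(session[i - k:i])][session[i]] += 1
    return model

def predict(model, recent, order=2):
    """Predict the next URL from the longest known suffix of `recent`."""
    for k in range(min(order, len(recent)), 0, -1):
        context = tuple(recent[-k:])
        if context in model:
            return model[context].most_common(1)[0][0]
    return None  # no context seen during training

sessions = [
    ["/", "/news.html", "/contacts.html"],
    ["/", "/news.html", "/contacts.html"],
    ["/pics.html", "/news.html", "/pics.html"],
]
model = train(sessions)
print(predict(model, ["/", "/news.html"]))
print(predict(model, ["/pics.html", "/news.html"]))
```

For the per-user view, `sessions` holds the cleaned user sessions; for the per-server view, the whole access sequence can be passed as a single long session.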
Learning to Prefetch Web Documents
• The original cache memory is partitioned into two parts: a
  cache buffer and a prefetching buffer.
• A prefetching agent (a script) keeps pre-loading the
  prefetching buffer with the documents predicted to be
  accessed next.
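The two-buffer arrangement might look like the sketch below. It assumes least-recently-used eviction in both buffers and a caller-supplied `fetch` function that retrieves a document by URL; the class name, the eviction policy, and the `fetch` callback are assumptions made for illustration.

```python
from collections import OrderedDict

class PrefetchingCache:
    """Cache split into a regular cache buffer and a prefetching buffer."""

    def __init__(self, cache_size, prefetch_size, fetch):
        self.cache = OrderedDict()      # documents the user actually requested
        self.prefetch = OrderedDict()   # documents loaded ahead of time
        self.cache_size = cache_size
        self.prefetch_size = prefetch_size
        self.fetch = fetch              # hypothetical callable: url -> document

    @staticmethod
    def _put(buffer, limit, url, doc):
        buffer[url] = doc
        buffer.move_to_end(url)
        while len(buffer) > limit:
            buffer.popitem(last=False)  # evict the least recently used entry

    def preload(self, predicted_urls):
        """Prefetching agent step: load predicted documents in advance."""
        for url in predicted_urls:
            if url not in self.cache and url not in self.prefetch:
                self._put(self.prefetch, self.prefetch_size, url, self.fetch(url))

    def get(self, url):
        """Return (document, source) where source names where it came from."""
        if url in self.cache:
            self.cache.move_to_end(url)
            return self.cache[url], "cache hit"
        if url in self.prefetch:
            doc, source = self.prefetch.pop(url), "prefetch hit"
        else:
            doc, source = self.fetch(url), "miss"
        self._put(self.cache, self.cache_size, url, doc)
        return doc, source

fetched = []

def fake_fetch(url):
    fetched.append(url)
    return "<doc " + url + ">"

cache = PrefetchingCache(cache_size=2, prefetch_size=2, fetch=fake_fetch)
first = cache.get("/")
cache.preload(["/news.html"])
second = cache.get("/news.html")
third = cache.get("/")
print(first[1], second[1], third[1])
```

A prefetched document that is actually requested moves into the cache buffer, so the prefetching buffer only ever holds predictions that have not yet been confirmed.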
Web Page Clustering for Intelligent
              User Interfaces
• Web logs can be used to build server-side customization
  and transformation that make a Web site more convenient for
  users to visit and help them find what they are looking for.
• These include path-prediction algorithms that guess where the
  user wants to go next in a browsing session, such as WebWatcher
  and the PageGather algorithm.
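To give a feel for clustering pages from logs, the sketch below links pages that are visited together in at least `min_cooccurrence` sessions and treats each connected group as one cluster. It is only loosely in the spirit of co-occurrence clustering and is not the PageGather algorithm itself; the function name and threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cluster_pages(sessions, min_cooccurrence=2):
    """Cluster pages that appear together in enough sessions."""
    pair_counts = Counter()
    for session in sessions:
        # Count each unordered pair of distinct pages once per session.
        for a, b in combinations(sorted(set(session)), 2):
            pair_counts[(a, b)] += 1

    # Build an undirected graph linking pages that co-occur often enough.
    neighbours = {}
    for (a, b), n in pair_counts.items():
        if n >= min_cooccurrence:
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)

    # Each connected component of the graph becomes one cluster.
    clusters, seen = [], set()
    for page in sorted(neighbours):
        if page in seen:
            continue
        stack, component = [page], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbours[current] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters

sessions = [
    ["/cars.html", "/bikes.html"],
    ["/bikes.html", "/cars.html", "/help.html"],
    ["/jobs.html", "/cv.html"],
    ["/cv.html", "/jobs.html"],
]
clusters = cluster_pages(sessions)
print(clusters)
```

Each resulting cluster could, for example, back an automatically generated index page that links related documents together.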
Applications
• Search engines
• Similarity measures
• Ontology
• Information aggregation
• Recognition technology
• Summarization
• E-commerce
• Content management
Advantages
• It is easy to implement.
• Companies can establish better customer relationships
  by giving customers exactly what they need.
• Personalized search engines can be created that
  interpret a person's search queries individually by
  analyzing and profiling the user's search behavior.
• Caching and prefetching of Web objects can be improved.
• The mined knowledge can be used to build better, adaptive user
  interfaces.
• Web query log knowledge can be applied to improve Web
  search in a search engine application.
Reference
• Web logs from www.hdwally.com and
  www.hdwallpaper4u.com.
• www.jafsoft.com/searchengines/log_sample.html
• Research paper on Knowledge Discovery From Weblogs by
  S Chandra and Dr B Kalpana.
• Research paper on Mining Web Logs for Actionable
  Knowledge by Qiang Yang, Charles X. Ling and Jianfeng Gao.
• http://www.galeas.de/webmining.html
Queries?
Thanks
