SRI GURU GRANTH SAHIB WORLD UNIVERSITY
FATEHGARH SAHIB – 140406
JULY, 2014
Guided By: Ms. Shruti Aggarwal, Assistant Professor (CSE Department)
Presented By: Prabhjit Singh Sekhon, Roll No: 12012238, M.Tech (CSE)
• Introduction
• Literature Review
• Problem Statement
• Results
• Conclusion
• Future works
• References
Data mining is the computational process of discovering patterns in large data sets. The
overall goal of the data mining process is to extract information from a data set and
transform it into an understandable structure for further use.
Functions of data mining can be classified as:
• Class Description: It can be useful to describe individual classes and concepts.
• Association Analysis: The discovery of association rules that define attribute-value conditions.
• Classification: The process of finding a set of models that describe and distinguish data classes or concepts by using class labels.
• Cluster Analysis: Objects are clustered or grouped based on the principle of maximizing intra-class similarity and minimizing inter-class similarity.
• Outlier Analysis: Outliers are data objects that do not comply with the general behavior or model of the data.
• Evolution Analysis: Describes and models trends for objects whose behavior changes over time.
• Pre-processing: Data is collected, irrelevant data items are removed, and the data is converted into an appropriate form.
• Processing: Algorithms are chosen to develop and evaluate the models, and the models are applied to the data.
• Post-processing: The results of the models are presented; if irrelevant data still remains, the earlier steps are repeated.
There are various types of mining:
• Text Mining: Deriving high-quality information from large bodies of text.
• Visual Mining: Discovering patterns hidden in visual data that have been converted from analog to digital form.
• Audio Mining: The content of an audio signal is analysed and searched.
• Spatial Mining: Finding patterns in data with respect to geography.
• Web Mining: Discovering patterns from the web.
• Video Mining: The process of automatically extracting relevant video content and structure, such as moving objects, spatial correlations of features, object activity, video events, and video structure patterns.
Advantages
• Predict future trends and customer purchase habits.
• Help with decision making.
• Improve company revenue and lower costs.
• Market basket analysis.
• Fraud detection.
Disadvantages
• User privacy/security concerns.
• High cost at the implementation stage.
• Possible misuse of information.
• Possible inaccuracy of data.
Web mining is the use of data mining techniques to automatically discover and
extract information from Web documents and services.
There are three general classes of information that can be discovered by web mining:
• Web Activity, from server logs and Web browser activity tracking.
• Web Graph, from links between pages, people and other data.
• Web Content, for the data found on Web pages and inside of documents.
Web Content Mining is the process of extracting useful information from the
contents of web documents. There are two approaches to content mining: the
agent-based approach and the database approach.
• Agent-Based Approach: Concentrates on searching for relevant information using the characteristics of a particular domain to interpret and organize the collected information.
• Database Approach: Used for retrieving semi-structured data from the web.
Web structure mining is the process of using graph theory to analyze the node and
connection structure of a web site. According to the type of web structural data, web
structure mining can be divided into two kinds:
• Extracting patterns from hyperlinks in the web: A hyperlink is a structural component that connects a web page to a different location.
• Mining the document structure: Analysis of the tree-like structure of individual pages.
Web usage mining is the process of extracting useful information from server logs,
i.e., finding out what users are looking for on the Internet. Some users might be
looking only at textual data, whereas others might be interested in multimedia data.
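As a rough illustration (not part of the original slides), server logs in the common Apache log format could be reduced to per-URL request counts as follows; the log format and field positions are assumptions:

```python
from collections import Counter

def requested_urls(log_lines):
    """Count requests per URL from Common Log Format lines.

    Example line (assumed format):
    127.0.0.1 - - [10/Jul/2014:13:55:36] "GET /index.html HTTP/1.1" 200 2326
    """
    counts = Counter()
    for line in log_lines:
        try:
            request = line.split('"')[1]        # e.g. 'GET /index.html HTTP/1.1'
            counts[request.split()[1]] += 1     # the requested URL/path
        except IndexError:
            continue                            # skip malformed lines
    return counts
```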
• A Search Engine Spider is a program that most search engines use to find what’s new on the Internet.
• The program starts at a website and follows every hyperlink on each page, so we can say that everything on the web will eventually be found and spidered as the so-called “spider” crawls from one website to another.
Crawler: Description
• GNU Wget: A command-line-operated crawler written in C and released under the GPL. It is typically used to mirror Web and FTP sites.
• DataparkSearch: A crawler and search engine released under the GNU General Public License.
• GRUB: An open-source distributed search crawler that Wikia Search used to crawl the web.
• HTTrack: Uses a Web crawler to create a mirror of a web site for off-line viewing. It is written in C and released under the GPL.
• FAST Crawler is a distributed crawler used by Fast Search & Transfer.
• Googlebot is used by the Google search engine; it is based on C++ and Python.
• Yahoo! Slurp was the name of the Yahoo! Search crawler until Yahoo! contracted with Microsoft to use Bingbot instead.
• Information on the Web changes over time.
• Knowledge of this ‘rate of change’ is crucial, as it allows us to estimate how often a search engine should visit each portion of the Web in order to maintain a fresh index.
• Recognizing the relevance or importance of sites.
• Define a scoring function for relevance:
s_θ(u) = ξ
where θ is the set of parameters, u is the URL, and ξ is the relevance score.
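A toy Python sketch of such a scoring function, assuming simple keyword matching; the function name and the use of a keyword set as the parameters θ are illustrative assumptions, not the scoring actually used in this work:

```python
# Minimal sketch of a relevance scoring function s_theta(u).
# 'theta' (a set of topic keywords) and the fetched page text are
# illustrative assumptions; the slides do not specify the parameters.

def score_url(page_text: str, theta) -> float:
    """Return a relevance score in [0, 1] for the page behind URL u."""
    words = page_text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in theta)
    return hits / len(words)

# Example usage:
# score_url("iron ore mining prices and production", {"iron", "ore", "mining"})
# -> 0.5
```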
• Indexing the web is a challenge due to its growing and dynamic nature.
• A single crawling process, even if multithreading is used, will be insufficient for large-scale engines that need to fetch large amounts of data rapidly.
• Distributing the crawling activity splits the load, decreases hardware requirements, and at the same time increases the overall download speed and reliability.
• Focused web crawling visits unvisited URLs and checks whether they are relevant to the search topic or not.
• Avoid irrelevant documents.
• Reduce network traffic.
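The following is a minimal sketch of a focused-crawl loop that keeps only promising URLs on its frontier; the priority queue and the fetch/extract_links/score helpers are illustrative assumptions, not the crawler implemented in this thesis:

```python
import heapq

def focused_crawl(seed_urls, fetch, extract_links, score, limit=100, threshold=0.1):
    """Visit URLs in order of estimated relevance, skipping irrelevant ones.

    fetch(url) -> page text, extract_links(text) -> list of URLs, and
    score(text) -> relevance in [0, 1] are assumed helper functions.
    """
    frontier = [(-1.0, u) for u in seed_urls]   # max-heap via negated scores
    heapq.heapify(frontier)
    visited, relevant = set(), []
    while frontier and len(visited) < limit:
        neg_score, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        text = fetch(url)
        page_score = score(text)
        if page_score < threshold:
            continue                            # avoid irrelevant documents
        relevant.append(url)
        for link in extract_links(text):
            if link not in visited:
                heapq.heappush(frontier, (-page_score, link))
    return relevant
```

Here the child links simply inherit the parent page's score as their priority, which is one common focused-crawling heuristic.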
Basically, three steps are involved in the web crawling process:
The search crawler starts by crawling the pages of your site.
Then it continues by indexing the words and content of the site.
Finally it visits the links (web page addresses or URLs) that are found in your site.
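The first and third steps might look roughly like the standard-library sketch below, which fetches one page and collects the links found on it; it omits robots.txt handling and error checking:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collect the href values of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl_page(url):
    """Fetch one page and return the absolute URLs it links to."""
    html = urlopen(url).read().decode("utf-8", errors="ignore")
    parser = LinkCollector()
    parser.feed(html)
    return [urljoin(url, link) for link in parser.links]
```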
The Web crawler is the outcome of a combination of policies:
• Selection policy: States which pages to download. Given the current size of the Web, even large search engines cover only a portion of the publicly available part.
• Re-visit policy: States when to check for changes to the pages. The most-used cost functions are freshness and age.
• Politeness policy: States how to avoid overloading sites; a single crawler performing multiple requests per second and/or downloading large files can overwhelm a server (a small sketch follows this list).
• Parallelization policy: A parallel crawler is a crawler that runs multiple processes in parallel.
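As a small illustration of the politeness policy, the sketch below enforces a minimum delay between successive requests to the same host; the one-second default is an arbitrary assumption:

```python
import time
from urllib.parse import urlparse

class PolitenessGate:
    """Enforce a minimum delay between successive requests to the same host."""
    def __init__(self, delay_seconds=1.0):
        self.delay = delay_seconds
        self.last_request = {}          # host -> time of the last request

    def wait(self, url):
        host = urlparse(url).netloc
        elapsed = time.time() - self.last_request.get(host, 0.0)
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self.last_request[host] = time.time()
```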
• A new learning-based approach that uses the Naïve Bayes classifier as the base prediction model is used to improve relevance prediction in focused Web crawlers.
• This learning-based focused crawling approach uses four relevance attributes to predict relevance; if dynamic updating of the training dataset is additionally allowed, the prediction accuracy is boosted further [1].
• The intelligent crawling method involves looking for specific features in a page to rank the candidate links.
• These features include the page content, the URL names of referred Web pages, and the nature of the parent and sibling pages.
• It is a generic framework in that it allows the user to specify the relevance criteria [2].
• Fish-Search is an early crawler that prioritizes unvisited URLs on a queue for a specific search goal.
• The Fish-Search approach assigns priority values (1 or 0) to candidate pages using simple keyword matching.
• One of the disadvantages of Fish-Search is that all relevant pages are assigned the same priority value of 1 based on keyword matching [3].
• Shark-Search is a modified version of Fish-Search in which the Vector Space Model (VSM) is used.
• The priority values (more than just 1 and 0) are computed based on the priority values of parent pages, the page content, and the anchor text [3].
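A toy sketch of the Shark-Search idea of graded priorities: a bag-of-words cosine similarity (a simple Vector Space Model) between the query and a candidate link's anchor text is blended with the parent page's priority. The inheritance factor of 0.5 is an assumed value, not one taken from [3]:

```python
from collections import Counter
from math import sqrt

def cosine_similarity(text_a, text_b):
    """Bag-of-words cosine similarity between two strings."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def child_priority(query, anchor_text, parent_priority, inherit=0.5):
    """Graded (non-binary) priority for a candidate link, in the spirit of Shark-Search."""
    return inherit * parent_priority + (1 - inherit) * cosine_similarity(query, anchor_text)
```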
http://en.wikipedia.org/wiki/DNA
http://dna.ancestry.com
http://www.familytreedns.com
http://learn.genetics.utah.edu
http://facebook.com
http://www.technologystudent.com/joints/iron2
http://en.wikipedia.org/wiki/iron_ore
http://www.metalprices.com/metal/iron-ore
http://miningartifacts.homestead.com/IronOres.html
http://facebook.com
http://www.worldmarket.com
http://www.marketbarchicago.com
http://finance.yahoo.com
http://www.nasdaq.com
http://facebook.com
www.freecomputerbooks.com
www.computer-books.us
www.onlinecomputerbook.com
www.freetechbook.com
http://facebook.com
• Decision tree
• Neural Network
• Naïve Bayes
• A decision tree is an important tool for machine learning. It is used as a predictive model to map observations about an item to conclusions about the item's target value.
• Tree structures comprise leaves and branches.
• Leaves represent class labels and branches represent conjunctions of features that lead to those class labels.
• In decision analysis, a decision tree can be used to explicitly and visually represent decisions and decision making.
• The resulting classification tree can be an input for decision making.
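A brief sketch of training such a classification tree on URL relevance data; scikit-learn and the three toy features (anchor-text score, URL-string score, parent-page score) are assumptions for illustration, not the feature set or tool used in this work (which was implemented in MATLAB):

```python
from sklearn.tree import DecisionTreeClassifier

# Each row: [anchor-text score, URL-string score, parent-page score]
X_train = [[0.9, 0.8, 0.7],
           [0.1, 0.0, 0.2],
           [0.7, 0.6, 0.9],
           [0.2, 0.1, 0.0]]
y_train = [1, 0, 1, 0]          # 1 = relevant, 0 = irrelevant

tree = DecisionTreeClassifier(max_depth=3)
tree.fit(X_train, y_train)
print(tree.predict([[0.8, 0.7, 0.6]]))   # -> [1] for this toy data
```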
• A neural network (NN) is an always-learning model.
• It usually receives many inputs, called the input vector (the analogy of dendrites), and the sum of them forms the output (the analogy of a synapse).
• The inputs are weighted and their sum is passed to the activation function.
• The brain makes up for the relatively slow rate of operation of a neuron by having a truly staggering number of neurons (nerve cells) with massive interconnections between them.
• It is estimated that there are on the order of 10 billion neurons in the human cortex, and 60 trillion connections.
• The net result is that the brain is an enormously efficient structure.
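The weighted sum passed to an activation function corresponds to the minimal single-neuron sketch below; NumPy and the particular weights are illustrative assumptions:

```python
import numpy as np

def neuron(x, w, b):
    """One artificial neuron: weighted sum of the input vector plus sigmoid activation."""
    z = np.dot(w, x) + b                 # weighted sum of inputs
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid activation

x = np.array([0.9, 0.2, 0.7])            # input vector (analogy of dendrites)
w = np.array([0.5, -0.3, 0.8])           # learned weights
print(neuron(x, w, b=-0.2))              # output (analogy of the synapse)
```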
• The Naïve Bayes algorithm is based on probabilistic learning and classification.
• It assumes that one feature is independent of another.
• This algorithm has proved to be efficient compared with many other approaches, although its simplifying assumption rarely holds in real-world cases.
• It exploits the fact that relevant pages possibly link to other relevant pages; therefore, the relevance of a page a to a topic t, pointed to by a page b, is estimated from the relevance of page b to the topic t.
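A compact sketch of Naïve Bayes relevance prediction over word counts; scikit-learn and the tiny labelled corpus are illustrative assumptions rather than the training data used in this work:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["iron ore mining prices", "iron ore production report",
        "celebrity gossip photos", "football match highlights"]
labels = [1, 1, 0, 0]                     # 1 = relevant to the topic, 0 = not

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)        # word-count features (assumed independent)
model = MultinomialNB().fit(X, labels)

new_page = vectorizer.transform(["iron ore market prices today"])
print(model.predict(new_page))            # -> [1] for this toy corpus
```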
• With the advent of the WWW and the increase in the size of the web, searching for relevant information on the web has become a challenging task. The result of any search is millions of documents ranked by a search engine.
• The ranking depends on keyword count, keyword density, link count and other proprietary algorithms that are specific to every search engine.
• A user does not have an easy way to reduce those millions of documents by defining a context or refining the meaning of the keyword.
• The documents appearing in the search result are ranked according to a specific algorithm that is characteristic of every search engine. The results may in all probability not be sorted according to a user's interest.
• Due to the above-mentioned problems and the keyword approach used by most search engines, the amount of time and effort required to find the right information is directly proportional to the amount of information on the web. Data mining is a valid solution to the problem.
• Crawlers are important data mining tools that traverse the internet to retrieve web pages relevant to a topic.
• To understand the working of content block segmentation from previous research.
• To find attributes for seed URLs and their child URLs.
• To classify the URLs according to their weight or score in the weight table.
• To prepare the full training data set and maintain the keyword table.
• To classify the relevancy of new unseen URLs using Decision Tree Induction, Neural Network and Naïve Bayes classifiers.
• To find the harvest ratio or precision rate for overall performance evaluation.
The precision rate is the parameter considered to form the results; the performance of
all the algorithms is measured on the basis of the precision rate.
Precision rate: It estimates the fraction of crawled pages that are relevant. We
depend on multiple classifiers to make this relevance judgement.
Precision Rate = No. of relevant pages / Total downloaded pages
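In code form the metric is simply (a trivial sketch):

```python
def precision_rate(relevant_pages, downloaded_pages):
    """Precision rate = relevant pages / total downloaded pages."""
    return relevant_pages / downloaded_pages if downloaded_pages else 0.0

# Example: 120 of 200 downloaded pages judged relevant -> 0.6
print(precision_rate(120, 200))
```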
Hardware and Software Requirements
• Hardware:
Intel Core i3 processor.
3 GB memory.
64-bit operating system.
• Software:
MATLAB 2010a.
• The name MATLAB stands for MATrix LABoratory.
• MATLAB is a tool for numerical computation and visualization. The basic data element is a matrix, so if you need a program that manipulates array-based data it is generally fast to write.
• MATLAB is a high-performance language for technical computing. It integrates computation, visualization, and programming in an easy-to-use environment.
• MATLAB is an excellent tool because it has sophisticated data structures, contains built-in editing and debugging tools, and supports object-oriented programming.
• It also has easy-to-use graphics commands that make the visualization of results immediately available.
• There are toolboxes (specific applications) for signal processing, symbolic computation, control theory, simulation, optimization, and several other fields of applied science and engineering.
• Our work was to compare the accuracy of the three prime classifiers. We can conclude this discussion by saying that, in terms of classification accuracy, the Neural Network leads Decision Tree Induction and Naïve Bayes by a big margin.
• Moreover, in order to support more complex computing we need to improve the memory of a particular classifier by training it; in this context also the NN's dominance prevails over DTI and NB.
It can be applied on large databases for relevancy prediction.
Support vector machine (SVM) classification can further be used.
[1] S. Chakrabarti, M. van den Berg, and B. Dom, “Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery”, Computer Networks, vol. 31, no. 11-16, pp. 1623-1640, 1999.
[2] J. Li, K. Furuse, and K. Yamaguchi, “Focused Crawling by Exploiting Anchor Text Using Decision Tree”, in Special Interest Tracks and Posters of the 14th International Conference on World Wide Web (WWW 2005), Chiba, Japan, 2005.
[3] J. Rennie and A. McCallum, “Using Reinforcement Learning to Spider the Web Efficiently”, in Proceedings of the 16th International Conference on Machine Learning (ICML-99), pp. 335-343, 1999.