Web Intelligence
INFORMATION RETRIEVAL
• Information retrieval (IR) is the process of accessing and retrieving relevant
information from a collection of data, typically in the form of text documents or
multimedia content. It involves techniques and methods for effectively searching,
organizing, and presenting information to meet the needs of users.
• Information retrieval is often referred to simply as searching. Searching isn’t new functionality;
nearly every application has some implementation of search, but intelligent
searching goes beyond plain old keyword matching.
• Key components of information retrieval include:
• Indexing: Creating an index of the documents in the collection, which involves
analyzing and extracting important keywords, phrases, or metadata to represent the
content of each document.
• Query Processing: Processing user queries to understand their information needs
and matching them to relevant documents in the collection.
• Ranking: Ranking the retrieved documents based on their relevance to the query,
typically using algorithms that consider factors such as keyword frequency,
document popularity, and semantic similarity.
• User Interfaces: Providing user-friendly interfaces for users to interact with the
retrieval system, submit queries, and browse or navigate through the retrieved
results.
• Evaluation: Assessing the performance of the retrieval system using metrics such
as precision, recall, and relevance to measure how effectively it retrieves relevant
information (a small worked example follows this list).
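To make the evaluation metrics concrete, here is a minimal sketch that computes precision and recall for a single query; the document IDs and relevance judgments are hypothetical example values, not taken from these slides.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class EvalExample {
    public static void main(String[] args) {
        // Documents the system returned for one query (assumed example values).
        Set<String> retrieved = new HashSet<>(Arrays.asList("d1", "d2", "d3", "d4"));
        // Documents a human judged relevant for the same query (assumed example values).
        Set<String> relevant  = new HashSet<>(Arrays.asList("d2", "d4", "d7"));

        Set<String> hits = new HashSet<>(retrieved);
        hits.retainAll(relevant);                                    // relevant AND retrieved

        double precision = (double) hits.size() / retrieved.size();  // 2/4 = 0.50
        double recall    = (double) hits.size() / relevant.size();   // 2/3 ≈ 0.67

        System.out.printf("precision=%.2f recall=%.2f%n", precision, recall);
    }
}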
Information retrieval is used in various applications and domains, including:
• web search engines,
• digital libraries,
• enterprise search systems,
• recommendation systems,
• e-commerce platforms.
• It plays a crucial role in enabling users to access and make sense of large volumes
of information efficiently, thereby facilitating decision-making, research, and
knowledge discovery.
IR LIBRARIES
• Experimentation can convince you that the naïve IR solution is full of
problems.
• For example, as soon as you increase the number of documents, or
their size, its performance will become unacceptable for
most purposes.
• There’s an enormous amount of accumulated knowledge about IR, and fairly
sophisticated and robust libraries are available that offer scalability
and high performance.
• The most successful IR library in the Java programming language is
Lucene, a project created by Doug Cutting.
Searching with Lucene
• Lucene can help you solve the IR problem by indexing all your documents and
letting you search through them at lightning speeds! Lucene in Action by Otis
Gospodnetić and Erik Hatcher, published by Manning, is a must-read,
especially if you want to learn how to index data and how to search,
sort, filter, and highlight search results.
1. The data that you want to search could be in your database, on the internet,
or on any other network that’s accessible to your application. You can collect
data from the internet by using a crawler. A number of crawlers are freely
available.
2. We’ll use a number of pages that we collected on November 6, 2006, so we
can modify them in a controlled fashion and observe the effect of these
changes in the results of the algorithms.
3. These pages have been cleaned up and modified to form a tiny, controlled document
collection. You can find these pages under the data/ch02/ directory. It’s important to
know the content of these documents, so that you can appreciate what the algorithms do
and understand how they work.
EXAMPLE
Our 15 documents are (the choice of content was random):
A. Seven documents related to business news.
B. Three documents related to Lance Armstrong’s attempt to run the marathon in
New York.
C. Four documents related to U.S. politics and, in particular, the congressional
elections (circa 2006).
D. Five documents related to world news; four about Ortega winning the elections
in Nicaragua and one about global warming.
4. Lucene can help us analyze, index, and search these and any other
document that can be converted into text, so it’s not limited to web
pages. The class that we’ll use to quickly read the stored web pages is
called FetchAndProcessCrawler.
5. This class can also retrieve data from the internet. Its constructor takes
three arguments:
■ The base directory for storing the retrieved data.
■ The depth of the link structure that should be traversed.
■ The maximum number of total documents that should be retrieved.
Reading, indexing, and searching the default
list of web pages
• The crawling and preprocessing stage should take only a few seconds, and
when it finishes you should have a new directory under the base
directory. In our example, the base directory was C:/iWeb2/data/ch02.
The new directory’s name will start with the string crawl- and be followed
by the numeric value of the crawl’s timestamp in milliseconds—
for example, crawl-1200697910111.
• You can change the content of the documents, or add more documents,
and rerun the preprocessing and indexing of the files in order to observe
the differences in your search results. Figure 2.1 is a snapshot of
executing the code from listing 2.1 in the BeanShell, and it includes the
results of the search for the term “armstrong.”
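Listing 2.1 itself isn’t reproduced on these slides. The following is a minimal, BeanShell-style sketch of the call sequence described above; the class names FetchAndProcessCrawler, LuceneIndexBuilder, and MySearcher come from the slides, but the method names (setDefaultUrls, run, getRootDir, getLuceneDir, search) and the depth/maximum-document values are assumptions and may differ from the book’s actual listing.

// Hedged sketch of the crawl -> index -> search sequence (method names assumed).
FetchAndProcessCrawler crawler =
    new FetchAndProcessCrawler("C:/iWeb2/data/ch02",  // base directory for the data
                               5,                     // link depth to traverse (assumed value)
                               200);                  // maximum number of documents (assumed value)
crawler.setDefaultUrls();   // use the stored sample pages instead of the live web
crawler.run();              // fetch, clean up, and store the pages

LuceneIndexBuilder luceneIndexBuilder =
    new LuceneIndexBuilder(crawler.getRootDir());     // index what the crawler stored
luceneIndexBuilder.run();

MySearcher oracle = new MySearcher(luceneIndexBuilder.getLuceneDir());
oracle.search("armstrong", 5);   // print the top results for the term "armstrong"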
Understanding the Lucene code
• LUCENE CODE
• 1. The LuceneIndexBuilder creates a Lucene index.
• The IndexWriter class is what Lucene uses to
create an index. It comes with a large
number of constructors, which you can
peruse in the Javadocs. The specific
constructor that we use in our code
takes three arguments:
• ■ The directory where we want to store the index.
• ■ The analyzer that we want to use—we’ll
talk about analyzers later in this section.
• ■ A Boolean variable that determines
whether we need to override the existing directory.
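As a concrete illustration, here is a minimal indexing sketch against the Lucene 2.x-era API, which matches the three-argument IndexWriter constructor described above; the field names "url" and "content" and the sample values are assumptions, not the book’s exact code.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class IndexingSketch {
    public static void main(String[] args) throws Exception {
        // Three-argument constructor: index directory, analyzer, override flag.
        IndexWriter writer = new IndexWriter(
                "C:/iWeb2/data/ch02/lucene-index",
                new StandardAnalyzer(),
                true);

        Document doc = new Document();
        // "url" is stored as-is so it can be shown in search results.
        doc.add(new Field("url", "file:///biz-01.html",
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        // "content" is tokenized by the analyzer for full-text search.
        doc.add(new Field("content", "sample business-news text goes here",
                Field.Store.NO, Field.Index.TOKENIZED));

        writer.addDocument(doc);
        writer.optimize();   // merge index segments for faster searching
        writer.close();
    }
}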
• 2. MySearcher: retrieving search results based on Lucene indexing
• REVIEW
• 1. We use an instance of the Lucene IndexSearcher class to open our index for searching.
• 2. We create an instance of the Lucene QueryParser class by providing the name of the field
that we query against and the analyzer that must be used for tokenizing the query text.
• 3. We use the parse method of the QueryParser to transform the human-readable query into a
Query instance that Lucene can understand.
• 4. We search the index and obtain the results in the form of a Lucene Hits object.
• 5. We loop over the first n results and collect them in the form of our own SearchResult
objects. Note that Lucene’s Hits object contains only references to the underlying documents.
We use these references to collect the required fields; for example, the call
hits.doc(i).get("url") will return the URL that we stored in the index.
• 6. The relevance score for each retrieved document is recorded. This score is a number
between 0 and 1.
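The corresponding search side, again sketched against the Lucene 2.x-era API (the Hits class was removed in later Lucene versions) and assuming the same field names as in the indexing sketch above:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SearchingSketch {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("C:/iWeb2/data/ch02/lucene-index");

        // Parse the human-readable query against the "content" field,
        // using the same analyzer that was used at indexing time.
        QueryParser parser = new QueryParser("content", new StandardAnalyzer());
        Query query = parser.parse("armstrong");

        Hits hits = searcher.search(query);
        int n = Math.min(5, hits.length());
        for (int i = 0; i < n; i++) {
            // Hits holds references; fetch the stored "url" field and the relevance score.
            System.out.println(hits.doc(i).get("url") + "  score=" + hits.score(i));
        }
        searcher.close();
    }
}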
Basic stages of search
• ■ Crawling
• ■ Parsing
• ■ Analyzing
• ■ Indexing
• ■ Searching
Improving search results based on link
analysis
• The link analysis algorithm that makes Google special is PageRank. The
PageRank algorithm was introduced in 1998, at the seventh international
World Wide Web conference (WWW98), by Sergey Brin and Larry Page in a
paper titled “The anatomy of a large-scale hypertextual Web search engine.”
• Around the same time, Jon Kleinberg at IBM Almaden developed the Hypertext Induced Topic
Search (HITS) algorithm. Both algorithms are link analysis models, although
HITS didn’t have the degree of commercial success that PageRank did.
• we’ll introduce the basic concepts behind the PageRank algorithm and the
mechanics of calculating ranking values. We’ll also examine the so-called
teleportation mechanism and the inner workings of the power method, which is at
the heart of the PageRank algorithm. Lastly, we’ll demonstrate the combination of
index scores and PageRank scores for improving our search results.
• An introduction to PageRank
The key idea of PageRank is to consider hyperlinks from one page to another as
recommendations, or endorsements. So, the more endorsements a page has, the higher its
importance should be.
If web page A has a link to web page B, the link graph has an arrow pointing from A to B.
Based on this graph, we’ll introduce the hyperlink matrix H and a row vector p (the
PageRank vector). Think of a matrix as nothing more than a table (a 2D array) and a
vector as a single array in Java. Each row i of the matrix H is constructed by counting
the number of outlinks from page Pi, say N(i), and assigning to column j the
value 1/N(i) if there’s an outlink from page Pi to page Pj, or assigning the value 0
otherwise.
The directed graph in the original figure covers all the sample web pages whose file names
start with the prefix biz; the titles of these articles and their file names are given in the
accompanying table.
• A zero entry represents the absence of a hyperlink between two pages; because most
pages link to only a few others, H is a sparse matrix.
• All values in the matrix are less
than or equal to 1. This turns out
to be very important.
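A minimal sketch of how such an H matrix can be built from outlink lists; the pages and links here are hypothetical stand-ins for the biz pages, not the actual link structure shown in the slides.

public class HyperlinkMatrix {
    public static void main(String[] args) {
        // Hypothetical outlink lists; index i holds the pages that page i links to.
        int[][] outlinks = {
            {1, 2},      // page 0 links to pages 1 and 2
            {},          // page 1 has no outlinks (a dangling node)
            {0, 1, 3},   // page 2 links to pages 0, 1, 3
            {0}          // page 3 links to page 0
        };

        int n = outlinks.length;
        double[][] h = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j : outlinks[i]) {
                h[i][j] = 1.0 / outlinks[i].length;  // each outlink gets weight 1/N(i)
            }
            // A dangling node's row stays all zeros here; see the adjustments later.
        }

        for (double[] row : h) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}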
Calculating the PageRank vector
• The PageRank algorithm calculates the vector p using the following iterative formula:
• p(k+1) = p(k) * H
• The values of p are the PageRank values for every page in the graph. You start with a
set of initial values such as p(0) = 1/n, where n is the number of pages in the graph,
and use the formula to obtain p(1), then p(2), and so on, until the difference between
two successive PageRank vectors is small enough; that arbitrary smallness is also
known as the convergence criterion or threshold. This iterative method is the power
method as applied to H. That, in a nutshell, is the PageRank algorithm.
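A minimal power-method sketch for the iteration p(k+1) = p(k) * H; the small matrix and the convergence threshold in main are hypothetical example values.

import java.util.Arrays;

public class PowerMethod {
    // Repeatedly multiplies the row vector p by H until two successive
    // vectors differ (summed over entries) by less than the given threshold.
    static double[] pageRank(double[][] h, double threshold) {
        int n = h.length;
        double[] p = new double[n];
        Arrays.fill(p, 1.0 / n);                    // p(0) = 1/n for every page
        double diff;
        do {
            double[] next = new double[n];
            for (int j = 0; j < n; j++) {
                for (int i = 0; i < n; i++) {
                    next[j] += p[i] * h[i][j];      // j-th entry of p * H
                }
            }
            diff = 0.0;
            for (int i = 0; i < n; i++) {
                diff += Math.abs(next[i] - p[i]);
            }
            p = next;
        } while (diff > threshold);                 // convergence criterion
        return p;
    }

    public static void main(String[] args) {
        // Tiny hypothetical H (already row-stochastic) just to exercise the method.
        double[][] h = {
            {0.0, 0.5, 0.5},
            {1.0, 0.0, 0.0},
            {0.5, 0.5, 0.0}
        };
        System.out.println(Arrays.toString(pageRank(h, 0.0001)));
    }
}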
Problems in PageRank
• The first problem is that on the internet there are some pages that don’t point to
any other pages; in our example, such a web page is biz-02 in figure 2.5. We call these
pages of the graph dangling nodes. These nodes are a problem because they trap our
surfer; without outlinks, there’s nowhere to go! They correspond to rows that have
value equal to zero for all their cells in the H matrix. To fix this problem, we introduce
a random jump, which means that once our surfer reaches a dangling node, he may go
to the address bar of his browser and type the URL of any one of the graph’s pages. In
terms of the H matrix, this corresponds to setting all the zeros (of a dangling node
row) equal to 1/n, where n is the number of pages in the graph. Technically, this
correction of the H matrix is referred to as the stochasticity adjustment.
• The second problem is that sometimes our surfer may get bored, or interrupted, and
may jump to another page without following the linked structure of the web pages;
the equivalent of Star Trek’s teleportation beam. To account for these arbitrary jumps,
we introduce a new parameter that, in our code, we call alpha. This parameter
determines the amount of time that our surfer will surf by following the links versus
jumping arbitrarily from one page to another page; this parameter is sometimes
referred to as the damping factor. Technically, this correction of the H matrix is
referred to as the primitivity adjustment.
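A minimal sketch of both adjustments: dangling-node rows are replaced by 1/n (the stochasticity adjustment), and the result is blended with uniform teleportation using the damping factor alpha (the primitivity adjustment). The value alpha = 0.85 used in main is a commonly cited choice, assumed here rather than taken from the book’s code.

public class GoogleMatrix {
    // Applies the stochasticity and primitivity adjustments to a raw H matrix.
    static double[][] adjust(double[][] h, double alpha) {
        int n = h.length;
        double[][] g = new double[n][n];
        for (int i = 0; i < n; i++) {
            double rowSum = 0.0;
            for (double v : h[i]) rowSum += v;
            for (int j = 0; j < n; j++) {
                // Stochasticity: a dangling node (all-zero row) is treated as
                // linking to every page with probability 1/n.
                double s = (rowSum == 0.0) ? 1.0 / n : h[i][j];
                // Primitivity: follow links with probability alpha,
                // teleport to a random page with probability (1 - alpha).
                g[i][j] = alpha * s + (1.0 - alpha) / n;
            }
        }
        return g;
    }

    public static void main(String[] args) {
        double[][] h = {
            {0.0, 0.5, 0.5},
            {0.0, 0.0, 0.0},   // dangling node
            {1.0, 0.0, 0.0}
        };
        double[][] g = adjust(h, 0.85);   // alpha = 0.85 (assumed example value)
        for (double[] row : g) System.out.println(java.util.Arrays.toString(row));
    }
}

Feeding the adjusted matrix into the earlier power-method sketch, instead of the raw H, gives PageRank values with both corrections applied.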
• For a thorough treatment, Google’s PageRank and Beyond: The Science of Search Engine Rankings by Amy
Langville and Carl Meyer is an excellent reference. So, let’s get into action and get
the H matrix by running some code. Listing 2.5 shows how to load just the web pages
that belong to the business news and calculate the PageRank that corresponds to
them.
• Understanding the power method
• Combining the index scores and the PageRank scores
Improving search results based on user clicks
• Using the NaiveBayes classifier
• Classification relies on reference structures that divide the space of all possible
data points into a set of classes (also known as categories or concepts)
that are (usually) non-overlapping.
• We use a probabilistic classifier that implements what’s known as the naïve Bayes
algorithm; our implementation is provided by the NaiveBayes class. Classifiers
are agnostic to UserClicks; they’re only concerned with Concepts, Instances,
and Attributes.
• A classifier’s job is to assign a Concept to an Instance; that’s all a classifier does.
In order to know what Concept should be assigned to a particular Instance, a
classifier reads a TrainingSet—a set of Instances that already have a Concept
assigned to them. Upon loading those Instances, the classifier trains itself, or
learns, how to map a Concept to an Instance based on the assignments in the
TrainingSet. The way that each classifier trains depends on the classifier.
• The good thing about the NaiveBayes classifier is that it provides something called the
conditional probability of X given Y—a probability that tells us how likely it is to
observe event X provided that we’ve already observed event Y. In particular, this
classifier uses as input the following:
• ■ The probability of observing concept X in general, also known as the prior
probability and denoted by p(X).
• ■ The probability of observing instance Y if we randomly select an instance from
concept X, also known as the likelihood and denoted by p(Y|X).
• ■ The probability of observing instance Y in general, also known as the evidence
and denoted by p(Y).
• The calculation is performed based on the following formula (known as Bayes’ theorem):
• p(X|Y) = p(Y|X) p(X) / p(Y)
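To make the formula concrete, here is a tiny naïve Bayes sketch that scores a query term against two concepts from a handful of hypothetical click observations; the class and the data are illustrative stand-ins, not the book’s NaiveBayes implementation.

import java.util.HashMap;
import java.util.Map;

public class NaiveBayesSketch {
    public static void main(String[] args) {
        // Hypothetical training set: (query term, concept of the clicked result).
        String[][] clicks = {
            {"armstrong", "sports"}, {"marathon", "sports"}, {"armstrong", "sports"},
            {"elections", "politics"}, {"ortega", "politics"}
        };

        Map<String, Integer> conceptCounts = new HashMap<>();
        Map<String, Integer> jointCounts = new HashMap<>();   // "term|concept" -> count
        for (String[] c : clicks) {
            conceptCounts.merge(c[1], 1, Integer::sum);
            jointCounts.merge(c[0] + "|" + c[1], 1, Integer::sum);
        }

        String term = "armstrong";
        int total = clicks.length;
        for (String concept : conceptCounts.keySet()) {
            double prior = (double) conceptCounts.get(concept) / total;        // p(X)
            double likelihood = (double) jointCounts.getOrDefault(
                    term + "|" + concept, 0) / conceptCounts.get(concept);     // p(Y|X)
            // Bayes' theorem up to the common factor p(Y), which doesn't change the ranking.
            System.out.printf("p(%s|%s) is proportional to %.3f%n",
                    concept, term, prior * likelihood);
        }
    }
}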
Ranking Word, PDF, and other documents
without links
To introduce ranking of documents without links, we’ll take the HTML documents
and create Word documents with identical content. This will allow us to compare
our results and identify any similarities or differences in the two approaches. Parsing
Word documents can be done easily using the open source library TextMining; note
that the name has changed to tm-extractor (http://code.google.com/p/textmining/source/checkout).
We’ve written a class called MSWordDocumentParser that
encapsulates the parsing of a Word document.
An introduction to DocRank
We use the same classes to read the Word documents as we did to read
the HTML documents (the FetchAndProcessCrawler class), and we use Lucene to
index the content of these documents.