Emrullah Delibas
▪ The Problem of Ranking
•  Objectives, Challenges
▪ Early Assumptions & Approaches
▪ Link-Based Ranking Algorithms
•  InDegree Algorithm
•  Hubs and Authorities: HITS
•  PageRank
•  SALSA
•  Hilltop
▪ Search Engine Spamming
▪ Problems with Non-textual Content
▪ “Cornell”
•  Did the searcher want information about the university?
•  The university’s hockey team?
•  The Lab of Ornithology run by the university?
•  Cornell College in Iowa?
•  The Nobel-Prize-winning physicist Eric Cornell?
The same ranking of search results can’t be right for everyone.
▪ Objectives:
•  To categorize web pages
•  To find pages related to given pages
•  To find duplicated websites
•  To calculate the ‘quality’ of a web link
•  To get the most ‘relevant’ web links for a given query
•  To model human judgments indirectly
•  …
▪ Challenges:
•  Searching is by itself a hard problem for computers to solve in any setting
•  The scale and complexity of the Web
•  Problems of synonymy and polysemy
•  The dynamic, constantly changing nature of Web content
•  …
▪ Back in the 1990s, web search was based purely on the number of occurrences of a word in a document.
▪ The search considered only the relevance of a document to the query.
Simply retrieving the relevant documents was not sufficient, since the number of relevant documents may run into the millions.
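The occurrence-count approach described above can be sketched in a few lines. This is a minimal illustration, not any particular engine's method; the function name `term_frequency_rank` and the toy documents are assumptions made for the example.

```python
from collections import Counter

def term_frequency_rank(documents, query):
    """Rank document ids by how often the query words occur in each document."""
    query_terms = query.lower().split()
    scores = {}
    for doc_id, text in documents.items():
        counts = Counter(text.lower().split())
        scores[doc_id] = sum(counts[term] for term in query_terms)
    # Highest count first; ties are broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical toy corpus.
docs = {
    "a": "cornell university hockey team",
    "b": "cornell cornell cornell cornell",   # a page that only repeats the keyword
    "c": "lab of ornithology at cornell university",
}
ranking = term_frequency_rank(docs, "cornell")
```

In this toy corpus the page that merely repeats the keyword comes first, which hints at why counting occurrences alone is not enough.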
▪ Links are assumed to be endorsements, although some are not:
•  Disagreement
•  Self-citation
•  Link to a popular document
▪ Hyperlinks contain information about the human judgment of a site
▪ The more incoming links a site has, the more highly it is judged
▪ The Web is not a random network
- Bray, Tim. "Measuring the Web." Computer Networks and ISDN Systems 28.7 (1996): 993-1005.
- Marchiori, Massimo. "The quest for correct information on the web: Hyper search engines." Computer Networks and ISDN Systems 29.8 (1997): 1225-1235.
▪ Hyperlinks are not placed at random; they provide valuable information for:
•  Link-based ranking
•  Structure analysis
•  Detection of communities
•  Spam detection
•  …
Link analysis for web search
▪ This approach can be seen as the basis of every link analysis ranking algorithm.
▪ The link recommendation assumption is that by linking to another page, the author recommends it.
•  So, a page with many incoming links has been highly recommended.
▪ The ranking is based only on the number of incoming links; the links are not weighted by the authority of the pages they come from.
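As a rough sketch of the InDegree idea, the snippet below counts incoming links per page and sorts by that count. The edge-list representation, the choice to ignore self-links, and the name `indegree_rank` are assumptions made for illustration.

```python
from collections import defaultdict

def indegree_rank(edges):
    """edges: iterable of (source, target) links. Returns pages sorted by in-link count."""
    indegree = defaultdict(int)
    pages = set()
    for source, target in set(edges):      # count each distinct link once
        pages.update((source, target))
        if source != target:               # ignoring self-links is an assumption
            indegree[target] += 1
    # Every in-link counts the same, whoever it comes from.
    return sorted(pages, key=lambda p: (-indegree[p], p))

# Hypothetical toy web graph.
toy_edges = [("a", "c"), ("b", "c"), ("d", "c"), ("a", "b"), ("c", "a"), ("d", "b")]
indegree_ranking = indegree_rank(toy_edges)
```

Here page c has three in-links, b has two, a has one and d has none, so the ranking is c, b, a, d.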
Hypertext Induced Topic Selection
▪ The basic idea is that relevant pages (“authorities”) are linked to by many other pages (“hubs”).
▪ An algorithm based on this idea later became part of the Ask search engine.
Jon Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999. A preliminary version appears in the Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms, Jan. 1998.
▪ It was developed by looking at how humans approach a search, rather than at how a machine answers a query by scanning a collection of documents and returning the matches.
▪ For example:
•  “top automobile makers in the world”
▪ Rules:
•  A good hub points to many good authorities.
•  A good authority is pointed to by many good hubs.
•  Authorities and hubs have a mutual reinforcement relationship.
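▪ Objective: build a focused subgraph Sq for the query such that
•  (i) Sq is relatively small
•  (ii) Sq is rich in relevant pages
•  (iii) Sq contains most (or many) of the strongest authorities
▪ Solution
•  Generate a root set Qσ from the results of a text-based search engine
•  Expand the root set with the pages it links to and the pages that link to it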
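The root-set expansion can be sketched as follows. This is an illustration under assumptions: `out_links` and `in_links` are hypothetical lookup tables (page to list of pages), and the cap `max_in_per_page` stands in for the limit Kleinberg places on how many in-linking pages are added per root page.

```python
def build_focused_subgraph(root_set, out_links, in_links, max_in_per_page=50):
    """Expand a root set into a base set: the root pages, the pages they link to,
    and up to max_in_per_page pages that link to each root page."""
    base = set(root_set)
    for page in root_set:
        base.update(out_links.get(page, ()))
        # Popular pages can have huge numbers of in-links, so cap how many are added.
        base.update(sorted(in_links.get(page, ()))[:max_in_per_page])
    return base

# Hypothetical link tables for a two-page root set.
out_links = {"r1": ["x"], "r2": ["x", "y"]}
in_links = {"r1": ["h1", "h2", "h3"], "r2": ["h1"]}
base_set = build_focused_subgraph(["r1", "r2"], out_links, in_links, max_in_per_page=2)
```

With the cap set to 2, only h1 and h2 are added for r1, so the expanded set stays small while still picking up likely hubs and authorities.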
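▪ Let the authority score of page i be x(i), and the hub score of page i be y(i).
▪ Mutually reinforcing relationship:
•  I step: $x(i) = \sum_{j:\, j \to i} y(j)$ (the sum of the hub scores of the pages that link to i)
•  O step: $y(i) = \sum_{j:\, i \to j} x(j)$ (the sum of the authority scores of the pages that i links to)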
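The mutual reinforcement between hub and authority scores can be sketched as an iteration in which each authority score becomes the sum of the hub scores of the pages linking to it (I step), and each hub score becomes the sum of the authority scores of the pages it links to (O step). The edge-list input, the fixed iteration count, and the Euclidean normalization after each round are assumptions of this sketch.

```python
import math

def normalize(scores):
    """Scale scores so that their squares sum to 1."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}

def hits(edges, iterations=20):
    """edges: list of (source, target) links. Returns (authority, hub) score dicts."""
    pages = sorted({p for edge in edges for p in edge})
    authority = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # I step: authority of a page = sum of hub scores of pages linking to it.
        new_authority = {p: 0.0 for p in pages}
        for source, target in edges:
            new_authority[target] += hub[source]
        # O step: hub of a page = sum of authority scores of pages it links to.
        new_hub = {p: 0.0 for p in pages}
        for source, target in edges:
            new_hub[source] += new_authority[target]
        authority = normalize(new_authority)
        hub = normalize(new_hub)
    return authority, hub

# Hypothetical graph: h1 links to a1 and a2, h2 links only to a1.
authority, hub = hits([("h1", "a1"), ("h2", "a1"), ("h1", "a2")])
```

In this toy graph a1 ends up with the higher authority score (two hubs point to it) and h1 with the higher hub score (it points to both authorities).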
▪ 1st iteration
•  I step
•  O step
▪ 2nd iteration
•  I step
•  O step
•  …
Drawbacks of HITS:
1.  The neighborhood graph must be built “on the fly” for each query
2.  Suffers from topic drift
3.  Cannot detect advertisements
4.  Can easily be spammed
5.  Query-time evaluation is slow
PageRank: the heart of Google
▪ Proposed by Sergey Brin and Lawrence Page
▪ Uses a recursive scheme similar to Kleinberg’s HITS algorithm
▪ Unlike HITS, the PageRank algorithm produces a ranking that is independent of the user’s query.
Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. In Proc. 7th International World Wide Web Conference, pages 107–117, 1998.
▪ A page is important if it is pointed to by other important pages.
▪ The PageRank of a page pi is given as follows:
$PR(p_i) = \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}$
•  M(pi) is the set of pages linking to pi.
•  L(pj) is the number of outbound links on page pj.
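A minimal sketch of this definition, assuming the basic form in which each page pj passes its current rank, divided by L(pj), to every page it links to, and the update is repeated from a uniform start. The adjacency-dict input and the name `simple_pagerank` are illustrative; the sketch also assumes every page has at least one outgoing link and that the graph is small enough to iterate directly.

```python
def simple_pagerank(out_links, iterations=100):
    """out_links: dict mapping each page to the list of pages it links to.
    Assumes every page has at least one outgoing link."""
    pages = list(out_links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for pj, targets in out_links.items():
            share = rank[pj] / len(targets)    # PR(pj) / L(pj)
            for pi in targets:                 # pj belongs to M(pi)
                new_rank[pi] += share
        rank = new_rank
    return rank

# Hypothetical graph: a -> b, a -> c, b -> c, c -> a.
ranks = simple_pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

For this graph the scores settle at about 0.4 for a and c and 0.2 for b: c collects rank from both a and b, and passes all of it back to a.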
▪ The algorithm is relatively robust against spam
•  since it is not easy for a webpage owner to obtain in-links to his/her page from other important pages.
▪ PageRank is a global measure and is query independent.
▪ It favors older pages
•  since new pages will not yet have many incoming links
▪ PageRank can be artificially inflated with “link farms”
•  However, the search engine actively tries to detect such manipulation while indexing.
▪ Rank sinks: occur when a group of pages forms a link cycle that accumulates rank without passing it on.
▪ Spider traps: occur when a group of pages has no links from within the group to outside the group.
▪ Dangling links: occur when a page contains a link that points to a page with no outgoing links.
▪ Dead ends: pages with no outgoing links.
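The effect of dead ends and spider traps can be seen on two tiny graphs. This is an illustrative sketch using an undamped update in which every page splits its rank evenly over its out-links; the helper `propagate_once` and both graphs are assumptions made for the example.

```python
def propagate_once(out_links, rank):
    """One undamped update: each page splits its rank evenly over its out-links."""
    new_rank = {page: 0.0 for page in out_links}
    for page, targets in out_links.items():
        for target in targets:
            new_rank[target] += rank[page] / len(targets)
    return new_rank

# Dead end: b has no outgoing links, so the rank it receives is lost.
dead_end_graph = {"a": ["b"], "b": []}
dead_end_rank = {"a": 0.5, "b": 0.5}
for _ in range(2):
    dead_end_rank = propagate_once(dead_end_graph, dead_end_rank)
leaked_total = sum(dead_end_rank.values())

# Spider trap: b and c link only to each other, so rank flows in but never out.
trap_graph = {"a": ["b"], "b": ["c"], "c": ["b"]}
trap_rank = {page: 1.0 / 3 for page in trap_graph}
for _ in range(10):
    trap_rank = propagate_once(trap_graph, trap_rank)
```

After two updates the dead-end graph has lost all of its rank, and in the trap graph all rank ends up inside the group formed by b and c while a is left with none.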
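▪ Damping factor
•  Random jumps (teleportation): with probability 1 − d the surfer jumps to a random page instead of following a link
$PR(p_i) = \frac{1 - d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}$
–  where N is the total number of pages
–  Typically d ≈ 0.85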
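A sketch of PageRank with a damping factor, assuming the common formulation in which each page receives (1 − d)/N from random jumps plus d times the rank passed along links. Spreading the rank of dead ends evenly over all pages is one possible treatment, chosen here as an assumption; the function name and the toy graph are illustrative.

```python
def pagerank(out_links, d=0.85, iterations=100):
    """out_links: dict mapping each page to the list of pages it links to."""
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by dead ends is spread evenly over all pages (an assumption;
        # other treatments of dead ends exist).
        dangling = sum(rank[p] for p in pages if not out_links[p])
        new_rank = {p: (1.0 - d) / n + d * dangling / n for p in pages}
        for pj, targets in out_links.items():
            for pi in targets:
                new_rank[pi] += d * rank[pj] / len(targets)
        rank = new_rank
    return rank

# Hypothetical spider trap: b and c link only to each other.
damped = pagerank({"a": ["b"], "b": ["c"], "c": ["b"]})
```

With teleportation, page a keeps a small score of (1 − d)/N = 0.05 instead of dropping to zero, and the total rank stays at 1.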
PageRank
▪ Computed for all web pages stored prior to the query
▪ Computes authorities only
▪ Fast to compute
▪ No need for additional normalization
HITS
▪ Performed on the subset generated by each query
▪ Computes authorities and hubs
▪ Easy to compute, but real-time execution is hard
▪ Normalization is needed
Criteria: HITS vs. PageRank
▪ Complexity analysis
•  HITS: O(kN^2)
•  PageRank: O(n)
▪ Result quality
•  HITS: lower than the PageRank algorithm
•  PageRank: medium
▪ Relevancy
•  HITS: higher, since this algorithm uses the hyperlinks to give good results and also considers the content of the page
•  PageRank: lower, since this algorithm ranks the pages at indexing time
▪ Neighborhood
•  HITS: applied to the local neighborhood of pages surrounding the results of a query
•  PageRank: applied to the entire web
Grover, Nidhi, and Ritika Wason. "Comparative analysis of PageRank and HITS algorithms." International Journal of Engineering Research and Technology, Vol. 1, No. 8 (October 2012). ESRSA Publications, 2012.
Search Engine Spamming
▪ Keyword stuffing: overloading the website with relevant keywords.
▪ Hidden text: placing relevant content on the website that can only be seen by search engines.
▪ Doorway page: a page that is heavily optimized for some keywords and whose only purpose is to redirect to a real website.
▪ Link farms: websites that are optimized for some keywords and contain only a huge number of links to other websites.
Problems with Non-textual Content
▪ Flash: rarely processed by search engines.
▪ Java applets: normally not processed.
▪ Videos and images: not directly processable by search engines.
▪ Other rich-media formats (e.g. Silverlight): typically not processed by search engines.
