CSE509: Introduction to Web Science and Technology
Lecture 3: The Structure of the Web, Link Analysis and Web Search
Muhammad Atif Qureshi and Arjumand Younus
Web Science Research Group
Institute of Business Administration (IBA)
Last Time…
  • Basic Information Retrieval Approaches
  • Bag of Words Assumption
  • Information Retrieval Models
      • Boolean model
      • Vector-space model
      • Topic/Language models
July 23, 2011
Today
  • Search Engine Architecture
  • Overview of Web Crawling
  • Web Link Structure
  • Ranking Problem
  • SEO and Web Spam
  • Web Spam Research
Introduction
  • The World Wide Web has evolved from a handful of pages to billions of pages
      • In January 2008, Google reported indexing 30 billion pages and Yahoo 37 billion
  • In this huge amount of data, search engines play a significant role in finding the needed information
  • Search engines consist of the following basic operations
      • Web crawling
      • Ranking
      • Keyword extraction
      • Query processing
General Architecture of a Web Search Engine
  [Figure: architecture diagram with the components Web, Crawler, Indexing, Index, Ranking, Query Operations, Visual Interface, User, and Query]
CRAWLING MODULE
Web Crawler
  Definition:
    • A program that collects Web pages by recursively fetching links (i.e., URLs) starting from a set of seed pages [HN99]
  Objective:
    • Acquisition of large collections of Web pages to be indexed by the search engine for efficient execution of user queries
Basic Crawler Operation
  1. Place known seed URLs in the URL queue
  2. Repeat the following steps until a threshold number of pages has been downloaded
      • Fetch a URL from the URL queue and download the corresponding Web page
      • For each downloaded Web page
          • Extract URLs from the Web page
          • For each extracted URL, check the validity and availability of the URL using the checking modules
          • Place the URLs that pass the checks on the URL queue
  [Figure: crawler data flow among queues and modules: Seed URLs, URL queue, URLs to crawl, Web page downloader, Web, Crawled Web pages, Link extractor, Extracted URLs, Checking module (URL duplication check, DNS resolver, Robots check), New URLs]
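The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the crawler described in [HN99]: the toy in-memory `TOY_WEB`, the `fetch_links` callback (standing in for download plus link extraction), and the `is_allowed` callback (standing in for the checking modules) are all hypothetical names introduced here.

```python
from collections import deque

def crawl(seed_urls, fetch_links, is_allowed, max_pages):
    """Breadth-first crawl that stops after max_pages downloads.

    fetch_links(url) -> list of URLs found on that page (hypothetical stand-in
    for the Web page downloader plus link extractor).
    is_allowed(url)  -> bool (hypothetical stand-in for the checking modules).
    """
    queue = deque(dict.fromkeys(seed_urls))  # URL queue, seeds de-duplicated
    seen = set(queue)                        # URL duplication check
    crawled = []
    while queue and len(crawled) < max_pages:
        url = queue.popleft()
        links = fetch_links(url)
        crawled.append(url)
        for link in links:
            if link in seen or not is_allowed(link):
                continue
            seen.add(link)
            queue.append(link)
    return crawled

# Toy in-memory "web" so the sketch runs without network access.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": ["private/x"],
    "private/x": [],
}

def toy_fetch(url):
    return TOY_WEB.get(url, [])

def toy_allowed(url):
    return not url.startswith("private/")

pages = crawl(["a"], toy_fetch, toy_allowed, max_pages=10)
print(pages)  # ['a', 'b', 'c', 'd']
```

A real crawler would add politeness delays and robots.txt handling, which relate to the load and scope issues listed on the next slide.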
Crawling Issues
  • Load at visited Web sites
  • Load at crawler
  • Scope of crawl
  • Incremental crawling
RANKING MODULE
Problems of TFIDF Vector
  • Works well on a small controlled corpus, but not on the Web
      • Top result for the query “American Airlines”: an accident report of American Airlines flights
      • Do users really care how many times American Airlines is mentioned?
  • Easy to spam
      • Ranking is based purely on page content
      • Authors can manipulate page content to get a high ranking
  • Any ideas?
Web Page Ranking
  Motivation:
    • User queries return a huge number of relevant web pages, but users want to browse the most important ones
    • Note: relevance means that a web page matches the user’s query
  Concept:
    • Ordering the relevant web pages according to their importance
    • Note: importance represents the user’s interest in the relevant web pages
  Methods:
    • Link-based method: exploiting the link structure of the web to order the search results
    • Content-based method: exploiting the contents of web pages to order the search results
Link Structure of Web
  Concept:
    • The Web can be modeled as a graph G(V, E), where V is a set of vertices representing web nodes and E is a set of edges representing directed links between the nodes.
    • Note: a web node represents either a web page or a web domain.
    • The link structure is called the web graph.
    • Links are classified into two classes:
        • Inlink: an incoming link to a web node.
        • Outlink: an outgoing link from a web node.
  Example:
    • V = {A, B, C}
    • E = {AB, BC}
    • AB is an outlink of the web node A.
    • BC is an outlink of the web node B.
    • AB is an inlink of the web node B.
    • BC is an inlink of the web node C.
    [Fig. 1: An example of a web graph with nodes A, B, and C.]
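As a small illustration, the example graph can be stored as a list of directed edges and indexed by outlinks and inlinks. The function name `build_link_index` is a hypothetical helper introduced only for this sketch.

```python
from collections import defaultdict

def build_link_index(edges):
    """Return (outlinks, inlinks) dictionaries for a list of directed edges."""
    outlinks = defaultdict(list)
    inlinks = defaultdict(list)
    for src, dst in edges:
        outlinks[src].append(dst)  # the edge is an outlink of src
        inlinks[dst].append(src)   # the same edge is an inlink of dst
    return outlinks, inlinks

# The graph from the example: V = {A, B, C}, E = {AB, BC}
V = {"A", "B", "C"}
E = [("A", "B"), ("B", "C")]

outlinks, inlinks = build_link_index(E)
print(dict(outlinks))  # {'A': ['B'], 'B': ['C']}
print(dict(inlinks))   # {'B': ['A'], 'C': ['B']}
```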
PageRank: Basic Idea
  • Think of…
      • People as pages
      • Recommendations as links
  • Therefore, “Pages are popular if popular pages link to them”
  • “PageRank is a global ranking of all Web pages regardless of their content, based solely on their location in the Web’s graph structure” [Page et al 1998]
PageRank
  Overview:
    • A web page is more important if it is pointed to by many other important web pages
    • The importance of a web page (called its PageRank value) represents the probability that a user visits the web page
  Function (symbols):
    • PR[p]: PageRank value of web page p
    • Nolink(q): number of outlinks of web page q
    • d: damping factor (probability of following a link)
    • v[p]: probability that a user randomly jumps to web page p (random jump value over web page p)
  [Figure: user’s behavior on the web graph. A user follows links among web pages A to F, some of them important web pages, or jumps to a random page (e.g., a random jump from F to B).]
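The slide's own formula is not reproduced in this text. One common formulation that is consistent with the symbols listed above is PR[p] = d * (sum over pages q linking to p of PR[q] / Nolink(q)) + (1 - d) * v[p]. The sketch below iterates that update; it assumes a uniform v, and it assumes that a page with no outlinks spreads its score according to v. The three-page graph is hypothetical.

```python
def pagerank(outlinks, d=0.85, iterations=100):
    """Power iteration for PR[p] = d * sum(PR[q] / Nolink(q)) + (1 - d) * v[p]."""
    nodes = sorted(set(outlinks) | {t for ts in outlinks.values() for t in ts})
    n = len(nodes)
    v = {p: 1.0 / n for p in nodes}  # assumption: uniform random-jump values
    pr = dict(v)                     # start from the random-jump distribution
    for _ in range(iterations):
        nxt = {p: (1 - d) * v[p] for p in nodes}
        for q in nodes:
            targets = outlinks.get(q, [])
            if targets:
                share = d * pr[q] / len(targets)  # PR[q] / Nolink(q), damped
                for p in targets:
                    nxt[p] += share
            else:
                # Assumption: a page without outlinks spreads its score via v.
                for p in nodes:
                    nxt[p] += d * pr[q] * v[p]
        pr = nxt
    return pr

# Hypothetical graph: A links to B and C, B links to C, C links back to A.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

In this graph C ends up ranked highest because it receives links from both A and B, while B receives only half of A's score.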
PageRank Example
  [Figure: example web graph with nodes 1, 2, 3, and 4.]
PageRank: Problems on the Real Web
  • Dangling nodes
      • A page with no outlinks through which to send its importance
      • All importance “leaks out of” the Web
      • Solution: random surfer model
  • Crawler trap
      • A group of one or more pages that have no links out of the group
      • The group accumulates all the importance of the Web
      • Solution: damping factor
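The leak caused by a dangling node can be seen numerically. The sketch below applies a naive update with no random jump and no damping to the three-node graph from Fig. 1 (A links to B, B links to C, C has no outlinks); the function name `naive_step` is hypothetical.

```python
def naive_step(pr, outlinks):
    """One update PR[p] = sum of PR[q] / Nolink(q) over inlinks, with no fixes."""
    nxt = {p: 0.0 for p in pr}
    for q, score in pr.items():
        targets = outlinks.get(q, [])
        for p in targets:
            nxt[p] += score / len(targets)
    return nxt

# A -> B -> C, where C is a dangling node.
outlinks = {"A": ["B"], "B": ["C"]}
pr = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}

totals = []
for _ in range(3):
    pr = naive_step(pr, outlinks)
    totals.append(sum(pr.values()))

print([round(t, 3) for t in totals])  # [0.667, 0.333, 0.0]
```

The total importance shrinks at every step because whatever reaches C is never passed on, which is the behavior the random surfer model is meant to repair.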
Link Analysis in Modern Web Search
  • PageRank-like ideas play a basic role in the ranking functions of Google, Yahoo! and Bing
  • Current ranking functions are far from pure PageRank
      • Far more complex
      • Evolve all the time
      • Kept secret!
Search Engine Optimization
  • Important game-theoretic principle: the world reacts and adapts to the rules
  • Web page authors create their Web pages with the search engine’s ranking formula in mind
A Huge Challenge for Today’s Search Engines
  • SEO gives birth to the nuisance of Web spam
Web Spam
  Concept:
    • Web spam: any deliberate action taken to boost a web node’s rank without improving its real merit.
    • Link spam: web spam against link-based methods
        • An action that changes the link structure of the web in order to boost a web node's ranking.
  Example:
    • Actor: “I want to boost the rank of the web node N3”
    • The actor creates the web nodes N3 to Nx
    • The web nodes N1 and N2 are not involved in link spam, so they are called non-spam nodes
    • The web nodes N3 to Nx are involved in link spam, so they are called spam nodes
    [Fig. 2: An example of link spam. Nodes N1, N2, N3, N4, N5, …, Nx; legend: node, link, actor.]
TrustRank
  Overview [GGP04]:
    • Trusted domains (e.g., well-known non-spam domains such as .gov and .edu) usually point to non-spam domains through their outlinks.
    • Trust scores are propagated through the outlinks of trusted domains.
    • Domains with high trust scores (≥ threshold) at the end of propagation are declared non-spam domains.
  Example:
    • t(i): the trust score of domain i
    • t(1) = 1, t(2) = 1, t(3) = 5/6, t(4) = 1/3
    • Domain 3 gets trust scores from domains 1 and 2.
    [Fig. 3: An example for explaining TrustRank. Domains 1 to 4; edge labels 1/2, 1/2, 1/3, 1/3, 1/3, 5/12, 5/12; legend: a seed non-spam domain, a domain being considered.]
  Observation:
    • Trust scores can propagate to spam domains if a trusted domain has outlinks to them.
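The arithmetic in Fig. 3 can be reproduced with a simple trust-splitting sketch: each seed starts with trust 1 and divides it equally among its outlinks. This mirrors only the numbers on the slide and is not claimed to be the full algorithm of [GGP04]. The graph is an assumption chosen to match the edge labels; the domains "u" and "w" are hypothetical placeholders for the outlinks whose targets are not named.

```python
from fractions import Fraction

def propagate_trust(outlinks, seeds, steps=1):
    """Split each domain's trust equally among its outlinks, for `steps` hops."""
    trust = {d: Fraction(1) for d in seeds}  # seeds start with t = 1
    frontier = dict(trust)
    for _ in range(steps):
        received = {}
        for domain, score in frontier.items():
            targets = outlinks.get(domain, [])
            for target in targets:
                if target not in seeds:      # seeds keep their fixed score
                    share = score / len(targets)
                    received[target] = received.get(target, Fraction(0)) + share
        for domain, score in received.items():
            trust[domain] = trust.get(domain, Fraction(0)) + score
        frontier = received
    return trust

# Assumed graph: domain 1 has two outlinks, domain 2 has three.
web = {"1": ["3", "u"], "2": ["3", "4", "w"]}
trust = propagate_trust(web, seeds={"1", "2"})

threshold = Fraction(1, 2)                   # hypothetical threshold
non_spam = {d for d, t in trust.items() if t >= threshold}

print(trust["3"], trust["4"])  # 5/6 1/3
print(sorted(non_spam))        # ['1', '2', '3', 'u']
```

Domain 3 receives 1/2 from domain 1 and 1/3 from domain 2, which gives the 5/6 shown in the figure.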
Anti-TrustRank
  Overview [KR06]:
    • Anti-trusted domains (e.g., well-known spam domains) usually receive their inlinks from spam domains.
    • Anti-trust scores are propagated backward along the inlinks of anti-trusted domains.
    • Domains with high anti-trust scores (≥ threshold) at the end of propagation are declared spam domains.
  Example:
    • at(i): the anti-trust score of domain i
    • at(1) = 1, at(2) = 1, at(3) = 5/6, at(4) = 1/3
    • Domain 3 gets anti-trust scores from domains 1 and 2.
    [Fig. 4: An example for explaining Anti-TrustRank. Domains 1 to 4; edge labels 1/2, 1/2, 1/3, 1/3, 1/3, 5/12, 5/12; legend: a seed spam domain, a domain being considered.]
  Observation:
    • Anti-trust scores can propagate to non-spam domains if a non-spam domain has outlinks to a spam domain.
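The same splitting arithmetic can be run in the opposite direction. The sketch below reverses the graph and divides each spam seed's score equally among the domains that link to it. As before, this only mirrors the numbers in Fig. 4 rather than the full method of [KR06], and the graph with placeholder domains "u" and "w" is an assumption.

```python
from fractions import Fraction

def reverse_graph(outlinks):
    """Map each domain to the list of domains that link to it."""
    inlinks = {}
    for src, targets in outlinks.items():
        for dst in targets:
            inlinks.setdefault(dst, []).append(src)
    return inlinks

def propagate_anti_trust(outlinks, spam_seeds, steps=1):
    """Push anti-trust backward along inlinks, split equally per seed."""
    inlinks = reverse_graph(outlinks)
    score = {d: Fraction(1) for d in spam_seeds}  # seeds start with at = 1
    frontier = dict(score)
    for _ in range(steps):
        received = {}
        for domain, value in frontier.items():
            sources = inlinks.get(domain, [])
            for src in sources:
                if src not in spam_seeds:
                    share = value / len(sources)
                    received[src] = received.get(src, Fraction(0)) + share
        for domain, value in received.items():
            score[domain] = score.get(domain, Fraction(0)) + value
        frontier = received
    return score

# Assumed graph: spam seed 1 has two inlinks, spam seed 2 has three.
web = {"3": ["1", "2"], "u": ["1"], "4": ["2"], "w": ["2"]}
anti_trust = propagate_anti_trust(web, spam_seeds={"1", "2"})

print(anti_trust["3"], anti_trust["4"])  # 5/6 1/3
```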
Spam Mass
  Overview [GBG06]:
    • A domain is spam if it has an excessively high spam score.
    • The spam score is estimated by subtracting a non-spam score from the PageRank score.
    • The non-spam score is estimated as the trust score computed by TrustRank.
  Example:
    • Domain 5 receives many inlinks but only one indirect inlink from a non-spam domain.
    [Fig. 5: An example for explaining Spam Mass. Domains 1 to 7; legend: a seed non-spam domain, a domain being considered.]
  Observation:
    • Since Spam Mass uses TrustRank, it inherits the same problem as TrustRank.
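The subtraction described above is easy to sketch once both scores are available. The scores and the threshold below are made-up illustrative values, not results from [GBG06], and the function names are hypothetical.

```python
def spam_scores(pagerank, trust):
    """Spam score = PageRank score minus non-spam (trust) score, per domain."""
    return {d: pagerank[d] - trust.get(d, 0.0) for d in pagerank}

def detect_spam(pagerank, trust, threshold):
    """Flag domains whose spam score is at or above the threshold."""
    return {d for d, s in spam_scores(pagerank, trust).items() if s >= threshold}

# Made-up scores: domain 5 has a high PageRank that trust barely explains.
pagerank = {"5": 0.30, "6": 0.10, "7": 0.10}
trust = {"5": 0.05, "6": 0.09, "7": 0.10}

flagged = detect_spam(pagerank, trust, threshold=0.2)
print(flagged)  # {'5'}
```

Because the trust input comes from TrustRank, any trust that wrongly reaches a spam domain lowers its spam score here as well.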
Link Farm Spam
  Overview [WD05]:
    • A domain is spam if it has many bidirectional links with other domains.
    • A domain is spam if it has many outlinks pointing to spam domains.
  Example:
    • Domains 1, 3, and 4 have bidirectional links.
    [Fig. 6: An example for explaining Link Farm Spam. Domains 1 to 5; legend: a domain being considered.]
  Observation:
    • Link Farm Spam does not take any input seed set.
    • A domain can have many bidirectional links with trusted domains as well.
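The two rules can be sketched as a seed step followed by an expansion step. The thresholds `min_bidirectional` and `min_spam_outlinks` and the five-domain graph are hypothetical choices for illustration; [WD05] should be consulted for the actual parameters.

```python
def bidirectional_counts(outlinks):
    """Count, per domain, the partners it links to that also link back."""
    edges = {(s, t) for s, targets in outlinks.items() for t in targets}
    counts = {}
    for s, t in edges:
        if s != t and (t, s) in edges:
            counts[s] = counts.get(s, 0) + 1
    return counts

def detect_link_farm(outlinks, min_bidirectional, min_spam_outlinks):
    # Rule 1: many bidirectional links.
    counts = bidirectional_counts(outlinks)
    spam = {d for d, c in counts.items() if c >= min_bidirectional}
    # Rule 2: many outlinks pointing to spam domains, repeated until stable.
    changed = True
    while changed:
        changed = False
        for domain, targets in outlinks.items():
            if domain in spam:
                continue
            if sum(t in spam for t in set(targets)) >= min_spam_outlinks:
                spam.add(domain)
                changed = True
    return spam

# Assumed graph: 1, 3, 4 link to each other; 2 links to 1 and 3; 5 links to 4.
web = {
    "1": ["3", "4"],
    "3": ["1", "4"],
    "4": ["1", "3"],
    "2": ["1", "3"],
    "5": ["4"],
}
farm = detect_link_farm(web, min_bidirectional=2, min_spam_outlinks=2)
print(sorted(farm))  # ['1', '2', '3', '4']
```

Note that no seed set is used, so a group of trusted domains that link to each other would be flagged in exactly the same way.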
RESEARCH SECTION
Web Spam Filtering Algorithm
  Overview:
    • Web spam filtering algorithms output the spam nodes to be filtered out [GBG06].
    • To identify spam nodes, a web spam filtering algorithm needs spam or non-spam nodes (called input seed sets) as input [GGP04, KR06, GBG06, WD05].
        • Spam input seed set: the input seed set containing spam nodes.
        • Non-spam input seed set: the input seed set containing non-spam nodes.
    • The input seed set can be used as the basis for grading the degree to which web nodes are spam or non-spam [GGP04, KR06, GBG06].
  Observation:
    • The output quality of a web spam filtering algorithm depends on the quality of its input seed sets.
    • The output of one web spam filtering algorithm can be used as the input of another.
    • The algorithms may support one another if placed in an appropriate succession.
Motivation and Goal
  Motivation:
    • There is no well-known study that addresses the refinement of the input seed sets for web spam filtering algorithms.
    • There is no well-known study on successions among web spam filtering algorithms.
  Goal:
    • Improving the quality of web spam filtering by using seed refinement.
    • Improving the quality of web spam filtering by finding the appropriate succession among web spam filtering algorithms.
Contributions
  • We propose modified algorithms that apply seed refinement techniques, using both spam and non-spam input seed sets, to well-known web spam filtering algorithms.
  • We propose a strategy that makes the best succession of the modified algorithms.
  • We conduct extensive experiments to show the quality improvement of our work.
      • We compare the original (i.e., well-known) algorithms with the respective modified algorithms.
      • We evaluate the best succession among our modified algorithms.
Web Spam Filtering Using Seed Refinement
  Objectives:
    • Decrease the number of domains incorrectly detected as belonging to the class of non-spam domains (called False Positives).
    • Increase the number of domains correctly detected as belonging to the class of spam domains (called True Positives).
  Our approaches:
    • We modify the spam filtering algorithms to use both spam and non-spam domains in order to decrease False Positives.
        • We use non-spam domains so that their goodness does not propagate to spam domains.
        • We use spam domains so that their badness does not propagate to non-spam domains.
    • We run these algorithms in succession in order to increase True Positives.
        • The seed refinement algorithm is followed by the spam detection algorithm, so that the spam detection algorithm uses the refined input seed sets produced by the seed refinement algorithm.
Modified TrustRank
  Modification:
    • Trust score should not propagate to spam domains.
  Example:
    • t(i): the trust score of domain i
    • t(1) = 1, t(2) = 1, t(3) = 5/6, t(4) = 1/3, t(5) = 5/12 + …, t(6) = 5/12 + …
    • Domains 5 and 6 are involved in Web spam.
    [Fig. 7: An example explaining Modified TrustRank. Domains 1 to 6; edge labels 1/2, 1/2, 1/3, 1/3, 1/3, 5/12 (several); legend: a seed spam domain, a seed non-spam domain, a domain being considered.]
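One way to read the modification is that trust aimed at a known spam seed is simply withheld. The sketch below makes that assumption explicit: a domain still divides its trust by its total number of outlinks, and the shares pointing at spam seeds are dropped. The slide does not say how the withheld share is handled, so this is only one plausible choice, and the graph with placeholder domains "u" and "w" is hypothetical.

```python
from fractions import Fraction

def propagate_trust_blocked(outlinks, good_seeds, spam_seeds, steps=2):
    """Trust splitting in which shares aimed at spam seeds are dropped."""
    trust = {d: Fraction(1) for d in good_seeds}
    frontier = dict(trust)
    for _ in range(steps):
        received = {}
        for domain, score in frontier.items():
            targets = outlinks.get(domain, [])
            for target in targets:
                if target in spam_seeds or target in good_seeds:
                    continue  # assumption: the share is withheld, not re-split
                share = score / len(targets)
                received[target] = received.get(target, Fraction(0)) + share
        for domain, score in received.items():
            trust[domain] = trust.get(domain, Fraction(0)) + score
        frontier = received
    return trust

# Assumed graph: as in the TrustRank example, plus domain 3 linking to 5 and 6.
web = {"1": ["3", "u"], "2": ["3", "4", "w"], "3": ["5", "6"]}

original = propagate_trust_blocked(web, {"1", "2"}, spam_seeds=set())
modified = propagate_trust_blocked(web, {"1", "2"}, spam_seeds={"5", "6"})

print(original["5"])    # 5/12
print("5" in modified)  # False
```

Without the spam seed set, domain 5 inherits 5/12 from domain 3; with it, domains 5 and 6 receive no trust at all.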
Modified Anti-TrustRank
  Modification:
    • Anti-trust score should not propagate to non-spam domains.
  Example:
    • at(i): the anti-trust score of domain i
    • at(1) = 1, at(2) = 1, at(3) = 5/6, at(4) = 1/3, at(5) = 5/12, at(6) = 5/12 + …, at(7) = 5/12 + …
    • Domains 5, 6, and 7 are non-spam domains.
    [Fig. 8: An example explaining Modified Anti-TrustRank. Domains 1 to 7; edge labels 1/2, 1/2, 1/3, 1/3, 1/3, 5/12 (several); legend: a seed spam domain, a seed non-spam domain, a domain being considered.]
Modified Spam Mass
  Modification:
    • Use Modified TrustRank in place of TrustRank.
  Example:
    • Domain 5 receives many inlinks but only one indirect inlink from a non-spam domain.
    [Fig. 9: An example explaining Modified Spam Mass. Domains 1 to 7; legend: a seed spam domain, a seed non-spam domain, a domain being considered.]
Modified Link Farm Spam
  Modification:
    • Use two types of input seed sets (i.e., spam and non-spam domains).
    • A domain that has many bidirectional links with only trusted domains is not detected as a spam domain.
  Example:
    • Domains 1, 3, and 4 have bidirectional links.
    [Fig. 10: An example explaining Modified Link Farm Spam. Domains 1 to 8; legend: a seed non-spam domain, a domain being considered.]
Modified Link Farm Spam
  Overview:
    • We make the succession of the seed refinement algorithms (simply, Seed Refiner) followed by the spam detection algorithms (simply, Spam Detector).
    • We also consider the execution order of the algorithms belonging to Seed Refiner and to Spam Detector, respectively.
  Strategy:
    1. Consideration of the execution order in Seed Refiner.
        • Modified TrustRank followed by Modified Anti-TrustRank.
        • Modified Anti-TrustRank followed by Modified TrustRank.
    2. Consideration of the execution order in Spam Detector.
        • Modified Spam Mass followed by Modified Link Farm Spam.
        • Modified Link Farm Spam followed by Modified Spam Mass.
    [Fig. 11: The strategy of succession. Data flow: manually labeled spam and non-spam domains, then Seed Refiner, then refined spam and non-spam domains, then Spam Detector, then detected spam domains.]
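Read together, the two ordering choices give four candidate successions, each running the Seed Refiner stage before the Spam Detector stage. The sketch below only enumerates those orderings by name; treating the two choices as independent is an interpretation of the slide, and `candidate_successions` is a hypothetical helper.

```python
from itertools import permutations, product

SEED_REFINERS = ["Modified TrustRank", "Modified Anti-TrustRank"]
SPAM_DETECTORS = ["Modified Spam Mass", "Modified Link Farm Spam"]

def candidate_successions():
    """Yield every ordering: Seed Refiner algorithms first, then Spam Detector."""
    orders = product(permutations(SEED_REFINERS), permutations(SPAM_DETECTORS))
    for refiner_order, detector_order in orders:
        yield list(refiner_order) + list(detector_order)

successions = list(candidate_successions())
for pipeline in successions:
    print(" -> ".join(pipeline))
```

Each printed line is one pipeline that the experiments could compare.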
Performance Evaluation
  Purpose:
    • Show the effect of seed refinement on the quality of web spam filtering.
    • Show the effect of succession on the quality of web spam filtering.
  Experiments:
    • We conduct two sets of experiments, one for each of the two purposes above.
    Table 1: Summary of the experiments.
Experimental Parameters
    Table 2: Parameters used in experiments.
Experimental Data [BCD08] [CDB06] [CDG07]
    Table 3: Characteristics of the data set in terms of domains and web pages.
    Table 4: Classification of the data set as Seed Set and Test Set.
Experimental Measure
    Table 5: Description of the measures.
    Note 1: False negatives are the number of domains incorrectly labeled as not belonging to the class (i.e., spam or non-spam).
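Since Table 5 is not reproduced here, the sketch below shows only how true positives, false positives, and false negatives can be counted for one class from a detected set and a labeled set. The domain names are made up, and `confusion_counts` is a hypothetical helper rather than the authors' evaluation code.

```python
def confusion_counts(detected, actual):
    """Return (true positives, false positives, false negatives) for one class.

    detected: domains the algorithm assigned to the class.
    actual:   domains that truly belong to the class.
    """
    true_positives = len(detected & actual)
    false_positives = len(detected - actual)
    false_negatives = len(actual - detected)
    return true_positives, false_positives, false_negatives

# Made-up example for the spam class.
detected_spam = {"s1.example", "s2.example", "g1.example"}
labeled_spam = {"s1.example", "s2.example", "s3.example"}

tp, fp, fn = confusion_counts(detected_spam, labeled_spam)
print(tp, fp, fn)  # 2 1 1
```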
Comparison between Original and Modified Algorithms (1/3)
Experiment 1: Comparison Between TR and MTR

  • 1. CSE509: Introduction to Web Science and TechnologyLecture 3: The Structure of the Web, Link Analysis and Web SearchMuhammad AtifQureshi and ArjumandYounusWeb Science Research GroupInstitute of Business Administration (IBA)
  • 2. Last Time…Basic Information RetrievalApproachesBag of Words AssumptionInformation Retrieval ModelsBoolean modelVector-space modelTopic/Language modelsJuly 23, 2011
  • 3. TodaySearch Engine ArchitectureOverview of Web CrawlingWeb Link StructureRanking ProblemSEO and Web SpamWeb Spam ResearchJuly 23, 2011
  • 4. IntroductionWorld Wide Web has evolved from a handful of pages to billions of pagesIn January 2008, Google reported indexing 30 billion pages and Yahoo 37 billion.In this huge amount of data, search engines play a significant role in finding the needed informationSearch engines consist of the following basic operationsWeb crawlingRankingKeyword extractionQuery processingJuly 23, 2011
  • 5. General Architecture of a Web Search EngineWebUserQueryCrawlerIndexingVisual InterfaceIndexRankingQueryOperationsJuly 23, 2011
  • 7. Web CrawlerDefinitionProgram that collects Web pages by recursively fetching links (i.e., URLs) starting from a set of seed pages [HN99]ObjectiveAcquisition of large collections of Web pages to be indexed by the search engine for efficient execution of user queriesJuly 23, 2011Introduction
  • 8. Basic Crawler OperationPlace known seed URLs in the URL queueRepeat following steps until a threshold number of pages downloadedFetch a URL on the URL queue and download the corresponding Web pageFor each downloaded Web pageExtract URLs from the Web page For each extracted URL, check validity and availability of URL using checking modules Place the URLs that pass the checks on the URL queueJuly 23, 2011New URLsSeed URLsNOTATIONS USED: queue: module: data flow URLs tocrawlWeb pagesChecking moduleExtracted URLsURLduplication checkWeb page downloaderURL queueLinkextractorDNSresolverCrawled Web pagesURLs tocrawlRobotscheckWeb
  • 9. Crawling IssuesLoad at visited Web sitesLoad at crawlerScope of crawlIncremental crawlingJuly 23, 2011
  • 11. Problems of TFIDF VectorWorks well on small controlled corpus, but not on the WebTop result for “American Airlines” query: accident report of American Airline flightsDo users really care how many times American Airlines mentioned?Easy to spamRanking purely based on page contentAuthors can manipulate page content to get high rankingAny idea?July 23, 2011
  • 12. Web Page RankingMotivation User queries return huge amount of relevant web pages, but the users want to browse the most important onesNote: Relevancerepresents that a web page matches the user’s queryConcept Ordering the relevant web pages according to their importanceNote: Importance represents the interest of a user on the relevant web pagesMethodsLink-based method: exploiting the link structure of web for ordering the search resultsContent-based method: exploiting the contents of web pages for ordering the search resultsJuly 23, 2011
  • 13. Link Structure of WebConceptWeb can be modeled as a graph G(V, E) where V is a set of vertices representing web nodes, and E is a set of edges representing directed links between the nodes.Note: Web node represents either a web page or a web domain. Links are classifed into two classes as follows:The link structure is called web graph.ExampleInlink: the incoming link to a web node.
  • 14. Outlink: the outgoing link from a web node.V = {A, B, C}E = {AB, BC}AB is an outlink of the web node A.BC is an outlink of the web node B.AB is an inlink of the web node B.BC is an inlink of the web node C.BCAFig. 1: An example of a web graph.July 23, 2011
  • 15. PageRank: Basic IdeaThink of ….People as pagesRecommendations as linksTherefore, “Pages are popular, if popular pages link them” “PageRank is a global ranking of all Web pages regardless of their content, based solely on their location in the Web’s graph structure” [Page et al 1998] July 23, 2011
  • 16. PageRankOverviewA web page is more important if it is pointed by many other important web pagesThe importance of a web page (called PageRank value) represents the probability that a user visits the web pageFunctionJuly 23, 2011web pageimportant web pageDlinkCErandom jump from F to BAuserFBjump to a random page< User’s behavior on the web graph >PR[p]: PageRank value of web page p Nolink(q): number of outlinks of web page qd: damping factor (probability of following a link)v[p]: probability that a user randomly jumps to web page p (random jump value over web page p)
  • 18. PageRank: Problems on the Real WebDangling nodesA page with no links to send importanceAll importance “leak out of” the WebSolution: Random surfer modelCrawler trapA group of one or more pages that have no links out of the groupAccumulate all the importance of the WebSolution: Damping factorJuly 23, 2011
  • 19. Link Analysis in Modern Web SearchPageRank like ideas play basic role in the ranking functions of Google, Yahoo! And BingCurrent ranking functions far from pure PageRankFar more complexEvolve all the timeKept in secret!July 23, 2011
  • 20. Search Engine OptimizationImportant game-theoretic principle: the world reacts and adapts to the rulesWeb page authors create their Web pages with the search engine’s ranking formula in mindJuly 23, 2011
  • 21. A Huge Challenge for Today’s Search EnginesSEO gives birth to nuisance of Web spamJuly 23, 2011
  • 22. Web SpamConcept Any deliberate action in order to boost a web node’s rank, without improving its real merit.Link spam: web spam against link-based methodsAn action that changes the link structure of web in order to boost web node's ranking.ExampleN1N2I want to boost the rank of the web node N3The web nodes N1and N2 are not involved in link spam, so they care called non-spam nodesN4Actor creates the web node N3 to NxN3N5Web nodes N3-Nx are involved in link spam, so they are called spam nodes…NxNode Link ActorFig. 2: An example of link spam.July 23, 2011
  • 23. TrustRankOverview [GGP04]Trusted domains(e.g., well-known non-spam domains such as .gov and .edu) usually point to non-spam domains by using outlinks.Trust scores are propagated through the outlinks of trusted domains.Domains having high trust scores(≥threshold) at the end of propagation are declared as non-spam domains.ExampleObservation Trust scores can propagate to spam domains if trusted domain outlinks to the spam domains.1/2A domain being considered5/1211/235/12t(1)=1A seed non-spam domain1/3t(3)=5/61/3t(i): The trust score of domain i24t(2)=1The domain 3 gets trust scores from the domains 1 and 2.1/3t(4)=1/3Fig. 3: An example for explaining TrustRank.July 23, 2011
  • 24. Anti-TrustRankOverview [KR06]Anti-trusted domains (e.g., well-known spam domains) are usually pointed by spam domains by using inlinks.Anti-trust scores are propagated by the inlinks of anti-trusted domains.Domains having high anti-trust scores(≥threshold) at the end of propagation are declared as spam domains.ExampleObservation Anti-trust score can propagate to non-spam domains if a non-spam domain outlinks to spam domain.1/2A domain being considered5/1211/2A seed spam domain35/12at(1)=11/3at(3)=5/6at(i): The anti-trust score of domain i21/34The domain 3 gets anti-trust scores from the domains 1 and 2.at(2)=1at(4)=1/31/3Fig. 4: An example for explaining Anti-TrustRank.July 23, 2011
  • 25. Spam MassOverview [GBG06]A domain is spam if it has excessively high spam score.Spam score is estimated as subtraction from a PageRank score to a non-spam score.Non-spam score is estimated as a trust score computed by TrustRank.ExampleObservationSince the Spam Mass has use TrustRank, it has inherently the same problem as TrustRank does.1A domain being considered2756A seed non-spam domain34Fig. 5: An example for explaining Spam Mass.The domain 5 receives many inlinks but only one indirect inlink from a non-spam domain.July 23, 2011
  • 26. Link Farm SpamOverview[WD05]A domain is spam if it has many bidirectional links with domains.A domain is spam if it has many outlinks pointing to spam domains.ExampleObservationLink Farm Spam does not take any input seed set.A domain can have many bidirectional links with trusted domains as well.213A domain being considered45The domains 1, 3, and 4 have two directional links.Fig. 6: An example for explaining Link Farm Spam.July 23, 2011
  • 28. Web Spam Filtering AlgorithmOverviewThe web spam filtering algorithms output spam nodes to be filtered out [GBG06].In order to identify spam nodes, a web spam filtering algorithm needs spam or non-spam nodes (called input seed sets) as an input [GGP04, KR06, GBG06, WD05].Spam input seed set: the input seed set containing spam nodes.Non-spam input seed set: the input seed set containing non-spam nodes.The input seed set can be used as the basis for grading the degree of whether web nodes are spam or non-spam nodes [GGP04, KR06, GBG06].ObservationThe output quality of web spam filtering algorithms is dependent on that of the input seed sets.The output of the one web spam filtering algorithm can be used as the input of the other web spam filtering algorithm.  The algorithms may support one another if placed in appropriate succession.July 23, 2011
  • 29. Motivation and GoalMotivationThere is no well-known study which addresses the refinement of the input seed sets for web spam filtering algorithms.There is no well-known study on successions among web spam filtering algorithms.Goal Improving the quality of web spam filtering by using seed refinement.Improving the quality of web spam filtering by finding the appropriate succession among web spam filtering algorithms.July 23, 2011
  • 30. ContributionsWe propose modified algorithms that apply seed refinement techniques using both spam and non-spam input seed sets to well-known web spam filtering algorithms.We propose a strategy that makes the best succession of the modified algorithms.We conduct extensive experiments in order to show quality improvement for our work.We compare the original(i.e., well-known) algorithms with the respective modified algorithms.We evaluate the best succession among our modified algorithms.July 23, 2011
  • 31. Web Spam Filtering Using Seed RefinementObjectivesDecrease the number of domains incorrectly detected as belonging to the class of non-spam domains (called False Positives).Increase the number of domains correctly detected as belonging to the class of spam domains (called True Positives).Our approachesWe modify the spam filtering algorithms by using both spam and non-spam domains in order to decrease False Positives.We use non-spam domains so that their goodness should not propagate to spam domains.We use spam domains so that their badness should not propagate to non-spam domains.We make the succession of these algorithms in order to increase True Positives.We make the succession of the seed refinement algorithm followed by the spam detection algorithm so that the spam detection algorithm uses the refined input seed sets, which is produced by the seed refinement algorithm.July 23, 2011
  • 32. Modified TrustRankModification Trust score should not propagate to spam domains.Example5/121/2A seed spam domain5/125611/2A domain being considered3t(6)=5/12 + …t(5)=5/12 + …t(1)=15/12t(3)=5/61/3A seed non-spam domain1/325/12t(i): The trust score of domain i4t(2)=1The domains 5 and 6 are involved in Web spam.1/3t(4)=1/3Fig. 7: An example explaining Modified TrustRank.July 23, 2011
  • 33. Modified Anti-TrustRankModification Anti-Trust score should not propagate to non-spam domains.Example5/12A seed spam domainat(5)=5/121/2315A domain being considered75/121/25/125/12at(1)=1at(3)=5/65/126at(7)=5/12 + …A seed non-spam domain1/34at(6)=5/12 + …21/3 at(i): The anti-trust score of domain iat(2)=1at(4)=1/31/3The domains 5 ,6 and 7 are non- spam domains.Fig. 8: An example explaining Modified Anti-TrustRank.July 23, 2011
  • 34. Modified Spam MassModification Use modified TrustRank in place of TrustRank.ExampleA seed spam domain1A domain being considered2576A seed non-spam domain34The domain 5 receives many inlinksbut only one indirect inlink from a non-spam domain.Fig. 9: An example explaining Modified Spam Mass.July 23, 2011
  • 35. Modified Link Farm SpamModificationUse two types (i.e., spam and non-spam domain) of input seed sets.A domain having many bidirectional links with only trusted domains is not detected as a spam domain.Example6827A seed non-spam domain13A domain being considered45The domains 1, 3, and 4 have two directional links.Fig. 10: An example explaining Modified Link Farm Spam.July 23, 2011
  • 36. Modified Link Farm SpamOverviewWe make the succession of the seed refinement algorithms (simply, Seed Refiner) followed by the spam detection algorithms (simply, Spam Detector).We also consider the execution order of algorithms belonging to Seed Refiner and Spam Detector, respectively.Strategy
  • 37. Consideration of the execution order in Seed Refiner.
  • 38. Modified TrustRank followed by Modified Anti-TrustRank.
  • 39. Modified Anti-TrustRank followed by Modified TrustRank.
  • 40. Consideration of the execution order in Spam Detector.
  • 41. Modified Spam Mass followed by Modified Link Farm Spam.
  • 42. Modified Link Farm Spam followed by Modified Spam Mass. Fig. 11: The strategy of succession. Data flow: manually labeled spam and non-spam domains → Seed Refiner → refined spam and non-spam domains → Spam Detector → detected spam domains.
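The data flow of Fig. 11 can be sketched as a simple pipeline. The slides do not specify how results are passed between stages, so this sketch makes two labeled assumptions: each refiner returns a refined pair of seed sets, and each detector sees the spam domains found so far and adds to them. `run_succession` and the stub functions are hypothetical placeholders, not the actual MATR, MTR, MLFS, or MSM algorithms.

```python
def run_succession(spam_seeds, nonspam_seeds, refiners, detectors):
    # Seed Refiner stage: apply the refinement algorithms in the given order.
    for refine in refiners:
        spam_seeds, nonspam_seeds = refine(spam_seeds, nonspam_seeds)
    # Spam Detector stage: each detector builds on what was found so far
    # (assumption; the slides only fix the order of execution).
    detected = set(spam_seeds)
    for detect in detectors:
        detected |= detect(detected, nonspam_seeds)
    return detected

# Stub algorithms, used only to show how the execution order matters.
def refine_spam(spam, nonspam):
    return spam | {"s2"}, nonspam

def refine_nonspam(spam, nonspam):
    return spam, nonspam | {"n2"}

def detect_first(spam, nonspam):
    return {"d1"} if "s2" in spam else set()

def detect_second(spam, nonspam):
    return {"d2"} if "d1" in spam and "n2" in nonspam else set()

result = run_succession({"s1"}, {"n1"},
                        refiners=[refine_spam, refine_nonspam],
                        detectors=[detect_first, detect_second])
```

Swapping the order of the two lists' elements changes what each later stage receives, which is why the slides compare both orders for each stage.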
  • 43. Performance Evaluation. Purpose: Show the effect of seed refinement on the quality of web spam filtering. Show the effect of succession on the quality of web spam filtering. Experiments: We conduct two sets of experiments, one for each of the two purposes above. Table 1: Summary of the experiments.
  • 44. Experimental Parameters. Table 2: Parameters used in experiments.
  • 45. Experimental Data [BCD08] [CDB06] [CDG07]. Table 3: Characteristics of the data set in terms of domains and web pages. Table 4: Classification of the data set as Seed Set and Test Set.
  • 46. Experimental Measure. Table 5: Description of the measures. Footnote 1: False negatives are the number of domains incorrectly labeled as not belonging to the class (i.e., spam or non-spam).
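Table 5 is not reproduced in this transcript. As a hedged reminder of the standard definitions of the precision and recall values reported on the later slides, the sketch below computes both from counts of true positives, false positives, and false negatives. The function names and the counts are illustrative and are not taken from the experiments.

```python
def precision(true_positives, false_positives):
    # Fraction of domains labeled as the class that really belong to it.
    labeled = true_positives + false_positives
    return true_positives / labeled if labeled else 0.0

def recall(true_positives, false_negatives):
    # Fraction of domains in the class that were labeled as such.
    actual = true_positives + false_negatives
    return true_positives / actual if actual else 0.0

# Hypothetical counts for the spam class.
p = precision(true_positives=80, false_positives=20)
r = recall(true_positives=80, false_negatives=120)
```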
  • 47. Comparison between Original and Modified Algorithms (1/3). Experiment 1: Comparison Between TR and MTR
  • 48. MTR performs comparably to or slightly better than TR in terms of both true positives and false positives.
  • 49. We find cutoffTr effective up to the 100% mark, indicating that beyond 100% detection becomes unstable in terms of false positives. For later experiments, we fix the cutoffTr range up to 100%. Experiment 2: Comparison Between ATR and MATR
  • 50. MATR generally performs better than ATR in terms of true positives.
  • 51. We find cutoffATr effective up to the 100% mark, indicating that beyond 100% detection becomes unstable in terms of false positives. For later experiments, we fix cutoffATr at 100% to ensure high precision.
  • 52. Comparison between Original and Modified Algorithms (2/3). Experiment 3: Comparison Between SM and MSM. MSM performs slightly better than SM in terms of true positives and comparably in terms of false positives. We find relativeMass effective in the range 0.95 to 0.99 in terms of maximizing true positives and minimizing false positives. For later experiments, we keep the range from 0.95 to 0.99 of relativeMass as the effective range. Experiment 4: Comparison Between LFS and MLFS. MLFS performs better than LFS in terms of false positives, at some expense of true positives. We find limitBL and limitOL highly effective at 7 and 7 respectively in terms of minimizing false positives. For later experiments, we keep limitBL = 7 and limitOL = 7.
  • 53. Comparison between Original and Modified Algorithms (3/3). Summary: We have found that all modified algorithms provide better quality than the respective original algorithms. We found SM to be the best original web spam detection algorithm among ATR, SM, and LFS, due to high true positives and relatively few false positives. We also found MSM to be the best modified web spam detection algorithm among MATR, MSM, and MLFS, due to high true positives and relatively few false positives.
  • 54. The Best Succession for the Seed Refiner. Table 6: Comparison for the seed refiner. Three of the four comparisons show identical performance for both successions; the fourth shows better performance for MATR-MTR compared to MTR-MATR. Therefore, MATR-MTR is found to be the winner, and hence we select it as the seed refiner.
  • 55. The Best Succession for the Spam Detector. Comparison: We pick 0.99 for relativeMass since false positives are at their minimum at this value, while true positives are almost comparable for all values of relativeMass. We observe that MLFS fails to detect a considerable number of spam domains. We obtain the precisions 0.86, 0.86, 0.93, and 0.87 for MLFS-MSM, MSM-MLFS, MLFS, and MSM respectively. We obtain the recalls 0.80, 0.80, 0.33, and 0.76 for MLFS-MSM, MSM-MLFS, MLFS, and MSM respectively. MLFS-MSM and MSM-MLFS are the best and identical in performance; we choose MLFS-MSM as the best spam detector without loss of generality. Fig. 12: Comparison for the spam detector.
  • 56. Comparison among the Best Succession, the Best Known Algorithm and the Best Modified Algorithm. Comparison: We pick 0.99 for relativeMass since false positives are at their minimum at this value, while true positives are almost comparable for all values of relativeMass. We observe that MATR-MTR-MLFS-MSM finds more true positives and some more false positives. We obtain the precisions 0.85, 0.86, and 0.86 for SM, MSM, and MATR-MTR-MLFS-MSM respectively. We obtain the recalls 0.64, 0.70, and 0.80 for SM, MSM, and MATR-MTR-MLFS-MSM respectively. Fig. 13: Comparison among MATR-MTR-MLFS-MSM, SM, and MSM. Therefore, MATR-MTR-MLFS-MSM is more effective.
  • 57. Conclusions. We have improved the quality of web spam filtering by using seed refinement. We have proposed modifications to four well-known web spam filtering algorithms. We have proposed a strategy of succession of the modified algorithms: Seed Refiner contains the order of execution for the seed refinement algorithms, and Spam Detector contains the order of execution for the spam detection algorithms. We have conducted extensive experiments in order to show the effect of seed refinement on the quality of web spam filtering. We find that every modified algorithm performs better than the respective original algorithm. We find the best performance among the successions with MATR followed by MTR, MLFS, and MSM (i.e., MATR-MTR-MLFS-MSM). This succession outperforms the best original algorithm, i.e., SM, by up to 1.25 times in recall and is comparable in terms of precision.

Editor's Notes

  • #7: Existing work classifies ranking algorithms into two classes as follows. Content-based method: exploiting the vertex information (that is, the contents of web pages). Link-based method: exploiting the edge information (that is, the link structure of the web).
  • #8: Now, I explain the original PageRank. The main idea of PageRank is that a web page is more important if it is pointed to by many other important web pages. The importance of a web page (called its PageRank value) represents the probability that a user visits the web page. I will show how a user visits a web page by using this figure. (In this figure, circles represent web pages and arrows represent directed links.) Assume a user is on web page F. Then, the user can visit web page C by following the outlink F→C of F, visit other web pages by following the outlinks of C, and so on. (If the user gets bored with clicking on outlinks to visit another web page,) the user can also type the address of a random web page and jump to it. Here, the user on F randomly jumps to web page B. Obviously, as there are many links to web page C, the user may frequently visit C. Thus, C may be an important web page. Since the user has two ways to visit a web page, the probability that the user visits a web page, i.e., the PageRank value of that web page, consists of two parts. The first part is the probability that the user visits web page p by following the outlinks of web pages that link to p. The second part is the probability that the user visits web page p by randomly jumping from any web page.
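The two parts described in this note can be sketched as a short power iteration. This is an illustrative sketch, not code from the lecture: the damping value 0.85, the fixed iteration count, the treatment of pages without outlinks, and the example graph are all assumptions.

```python
def pagerank(outlinks, damping=0.85, iterations=50):
    pages = sorted(set(outlinks) | {q for ts in outlinks.values() for q in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Second part: a random jump reaches every page with equal probability.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = outlinks.get(p, [])
            if targets:
                # First part: the user follows one of p's outlinks uniformly.
                for q in targets:
                    new_rank[q] += damping * rank[p] / len(targets)
            else:
                # Assumption: a page without outlinks jumps to a random page.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank

# Hypothetical graph in which many pages link to C.
graph = {"A": ["C"], "B": ["C"], "F": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

In this graph C collects links from A, B, and F, so it ends up with the largest PageRank value, matching the intuition in the note.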
  • #9: Highly successful, all major businesses use an RDB system
  • #10: Spam domain: a domain contributing to web spam. Non-spam domain: the universe of domains excluding spam domains.
  • #11: Trusted domains are a subset of the non-spam domains and are already known to humans as well.
  • #12: Anti-trusted domains are a subset of the spam domains (e.g., some sex websites).
  • #13: The rank coming from non-spam domains is estimated by TrustRank, and the rank coming from spam domains is estimated by subtracting the TrustRank value from the PageRank value. Spam Mass is a spam detection algorithm.
  • #14: Detect spam domains by expanding the seed set of spam domains by counting outlinks to the spam domains
  • #17: Here are the contents of my presentation. First, I introduce the background and motivation of my research. Then, I present related work that contains two approaches for improving the ranking quality. After that, I present the algorithms that combine these two approaches. In the main part, I present the performance evaluation. Finally, I present the conclusions.
  • #21: Detect spam domains on the basis of many bidirectional links. Detect more spam domains by counting outlinks to the spam domains.
  • #22: Detect spam domains on the basis of many bidirectional links. Detect more spam domains by counting outlinks to the spam domains.