CSE509: Introduction to Web Science and Technology
Lecture 2: Basic Information Retrieval Models
Muhammad Atif Qureshi and Arjumand Younus
Web Science Research Group, Institute of Business Administration (IBA)
Last Time…
- What is Web Science?
- Why do we need Web Science?
- Implications of Web Science
- Web Science case study: Diff-IE
July 16, 2011
Today
- Basic information retrieval approaches
- Bag-of-words assumption
- Information retrieval models: Boolean model, vector-space model, topic/language models
Computer-Based Information Management
- Basic problem: how do we use computers to help humans store, organize, and retrieve information?
- What approaches have been taken, and which have been successful?
Three Major Approaches
- Database approach
- Expert-system approach
- Information retrieval approach
Database Approach
- Information is stored in a highly structured way: data kept in relational tables as tuples
- Simple data model and query language: the relational model and SQL
- Clear interpretation of data and queries; no ambition to be "intelligent" like humans
- Main focus on system efficiency
- Pros and cons?
Expert-System Approach
- Information is stored as a set of logical predicates: Bird(X), Cat(X), Fly(X)
- Given a query, the system infers the answer through logical inference: Bird(Ostrich) → Fly(Ostrich)?
- Pros and cons?
Information Retrieval Approach
- Uses existing text documents as the information source; no special structuring or database construction required
- Text-based query language: keyword-based or natural-language queries
- Pros and cons?
Database vs. Information Retrieval
- What we're retrieving: IR deals with mostly unstructured free text plus some metadata; databases hold structured data with clear semantics based on a formal model.
- Queries we're posing: IR queries are vague, imprecise information needs (often expressed in natural language); database queries are formally (mathematically) defined and unambiguous.
- Results we get: IR results are sometimes relevant, often not; database results are exact and always correct in a formal sense.
- Interaction with the system: in IR, interaction is important; databases handle one-shot queries.
- Other issues: downplayed in IR; in databases, concurrency, recovery, and atomicity are all critical.
Main Challenge of the Information Retrieval Approach
- Interpretation of the query and the data is not straightforward
- Both queries and data are fuzzy: unstructured text and natural-language queries
Need for an Information Retrieval Model
- Computers do not understand the document or the query
- To implement the information retrieval approach, a model that a computer can work with is essential
What is a Model?
- A model is a construct designed to help us understand a complex system; a particular way of "looking at things"
- Models work by adopting simplifying assumptions
- Different types of models: conceptual models, physical analog models, mathematical models, …
Major Simplification: Bag of Words
- Consider each document as a "bag of words"
- Consider queries as a "bag of words" as well
- A great oversimplification, but it works adequately in many cases
- Still, how do we match documents and queries?
Sample Document

"Bag of words" view:
16 × said; 14 × McDonalds; 12 × fat; 11 × fries; 8 × new; 6 × company, french, nutrition; 5 × food, oil, percent, reduce, taste, Tuesday; …

Original article:

McDonald's slims down spuds
Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.

NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier.

But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA.

But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste.

Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment.
…
Vector Representation
- "Bags of words" can be represented as vectors
- Why vectors? Computational efficiency and ease of manipulation
- A vector is a set of values recorded in any consistent order

Example: "The quick brown fox jumped over the lazy dog's back" → [ 1 1 1 1 1 1 1 1 2 ]
- 1st position corresponds to "back"
- 2nd position corresponds to "brown"
- 3rd position corresponds to "dog"
- 4th position corresponds to "fox"
- 5th position corresponds to "jump"
- 6th position corresponds to "lazy"
- 7th position corresponds to "over"
- 8th position corresponds to "quick"
- 9th position corresponds to "the"
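The steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming the tokens have already been normalized as on the slide ("jumped" → "jump", "dog's" → "dog"); alphabetical order of the vocabulary fixes each term's position.

```python
from collections import Counter

# Tokens from the example sentence, already lowercased and normalized.
tokens = ["the", "quick", "brown", "fox", "jump",
          "over", "the", "lazy", "dog", "back"]

counts = Counter(tokens)
vocab = sorted(counts)                 # consistent (alphabetical) term order
vector = [counts[t] for t in vocab]    # one count per vocabulary position

print(vocab)   # ['back', 'brown', 'dog', 'fox', 'jump', 'lazy', 'over', 'quick', 'the']
print(vector)  # [1, 1, 1, 1, 1, 1, 1, 1, 2]
```

The count 2 in the last position is "the", exactly as in the slide's vector.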
Representing Documents

Document 1: "The quick brown fox jumped over the lazy dog's back."
Document 2: "Now is the time for all good men to come to the aid of their party."
Stopword list: for, is, of, the, to

Term     Doc 1   Doc 2
aid        0       1
all        0       1
back       1       0
brown      1       0
come       0       1
dog        1       0
fox        1       0
good       0       1
jump       1       0
lazy       1       0
men        0       1
now        0       1
over       1       0
party      0       1
quick      1       0
their      0       1
time       0       1
Boolean Model
- Return all documents that contain the words in the query
- Weights assigned to terms are either "0" or "1": "0" represents absence (the term isn't in the document), "1" represents presence (the term is in the document)
- No notion of "ranking": a document is either a match or a non-match
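A toy sketch of Boolean matching, reusing the two example documents from the previous slide: a document matches if and only if it contains every query term (an implicit AND), with no ranking among matches.

```python
docs = {
    1: "the quick brown fox jumped over the lazy dog back",
    2: "now is the time for all good men to come to the aid of their party",
}

def boolean_match(query, docs):
    """Return ids of documents containing ALL query terms (0/1 presence)."""
    q = set(query.lower().split())
    return [doc_id for doc_id, text in docs.items()
            if q <= set(text.lower().split())]   # subset test: every term present

print(boolean_match("quick fox", docs))   # [1]
print(boolean_match("quick time", docs))  # [] -- partial matches are rejected
```

The empty result for "quick time" shows the model's rigidity: a document missing even one query term is a non-match, which motivates the ranked-retrieval models that follow.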
Evaluation: Precision and Recall
- Q: Are all matching documents what users want?
- Basic idea: a model is good if it returns a document if and only if it is relevant
- R: set of relevant documents; D: set of documents returned by a model
- Precision = |R ∩ D| / |D|; Recall = |R ∩ D| / |R|
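The two set-based measures are easy to compute directly. The document sets below are hypothetical, chosen only to illustrate the definitions.

```python
def precision_recall(relevant, returned):
    """Precision and recall from sets R (relevant) and D (returned)."""
    hits = len(relevant & returned)                      # |R intersect D|
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

R = {1, 2, 3, 4}   # hypothetical relevant documents
D = {2, 3, 5}      # hypothetical documents a model returned

p, r = precision_recall(R, D)
print(p, r)  # 2/3 of what was returned is relevant; 1/2 of the relevant set was found
```

Note the trade-off: returning everything maximizes recall but ruins precision, and vice versa.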
Ranked Retrieval
- Order documents by how likely they are to be relevant to the information need
- Arranging documents by relevance is closer to how humans think: some documents are "better" than others
- It is also closer to user behavior: users can decide when to stop reading
- Best (partial) match: documents need not have all query terms, although documents with more query terms should be "better"
Similarity-Based Queries
- Let's replace relevance with "similarity": rank documents by their similarity to the query
- Treat the query as if it were a document: create a query bag of words, find its similarity to each document, and rank-order the documents by similarity
- Surprisingly, this works pretty well!
Vector Space Model
- Documents "close together" in vector space "talk about" the same things
- Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ "closeness")

(Figure: document vectors d1–d5 plotted in a space with term axes t1, t2, t3; θ and φ are angles between vectors.)
Similarity Metric
- Use the angle between the query vector and the document vector:
  sim(d, q) = cos θ = (d · q) / (‖d‖ ‖q‖)
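The cosine of the angle can be computed straight from the formula; a minimal sketch with made-up vectors:

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(di * qi for di, qi in zip(d, q))
    norm = (math.sqrt(sum(di * di for di in d)) *
            math.sqrt(sum(qi * qi for qi in q)))
    return dot / norm if norm else 0.0

print(cosine_similarity([1, 1, 0], [1, 0, 0]))  # ~0.707: vectors 45 degrees apart
print(cosine_similarity([2, 2, 0], [1, 1, 0]))  # ~1.0: same direction, length ignored
```

The second call shows why cosine is a good fit for documents of different lengths: only direction matters, not magnitude.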
How Do We Weight Document Terms?
Here's the intuition:
- Terms that appear often in a document should get high weights
- Terms that appear in many documents should get low weights
How do we capture this mathematically?
- Term frequency
- Inverse document frequency

TF.IDF Term Weighting
Simple, yet effective!
  w_ij = tf_ij × log10(N / n_i)
where:
- w_ij: weight assigned to term i in document j
- tf_ij: number of occurrences of term i in document j
- N: number of documents in the entire collection
- n_i: number of documents containing term i
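The formula maps directly to code. The document frequencies below are taken from the worked example that follows (4 documents in the collection); base-10 log matches the idf values in that example.

```python
import math

N = 4  # number of documents in the example collection
df = {"complicated": 2, "contaminated": 3, "fallout": 3, "information": 4,
      "interesting": 1, "nuclear": 2, "retrieval": 3, "siberia": 1}

idf = {term: math.log10(N / n) for term, n in df.items()}

print(round(idf["complicated"], 3))  # 0.301 -- term in half the collection
print(round(idf["information"], 3))  # 0.0   -- term in every document carries no signal

# TF.IDF weight of "complicated" in a document where it occurs 5 times:
print(round(5 * idf["complicated"], 2))  # 1.51
```

A term that occurs in every document (here "information") gets idf 0, so no amount of term frequency can make it matter, which is exactly the intended behavior.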
TF.IDF Example

Term           idf     tf:  D1  D2  D3  D4    w_ij:  D1    D2    D3    D4
complicated    0.301        -   -   5   2            -     -     1.51  0.60
contaminated   0.125        4   1   3   -            0.50  0.13  0.38  -
fallout        0.125        5   -   4   3            0.63  -     0.50  0.38
information    0.000        6   3   3   2            0.00  0.00  0.00  0.00
interesting    0.602        -   1   -   -            -     0.60  -     -
nuclear        0.301        3   -   7   -            0.90  -     2.11  -
retrieval      0.125        -   6   1   4            -     0.75  0.13  0.50
siberia        0.602        2   -   -   -            1.20  -     -     -
Normalizing Document Vectors
- Recall our similarity function: sim(d, q) = (d · q) / (‖d‖ ‖q‖)
- Normalize document vectors in advance
- Use the "cosine normalization" method: divide each term weight by the length (Euclidean norm) of the document vector
Normalization Example
(term frequencies and idf values as in the TF.IDF example)

Term           w_ij:  D1    D2    D3    D4    w'_ij:  D1    D2    D3    D4
complicated           -     -     1.51  0.60          -     -     0.57  0.69
contaminated          0.50  0.13  0.38  -             0.29  0.13  0.14  -
fallout               0.63  -     0.50  0.38          0.37  -     0.19  0.44
information           0.00  0.00  0.00  0.00          0.00  0.00  0.00  0.00
interesting           -     0.60  -     -             -     0.62  -     -
nuclear               0.90  -     2.11  -             0.53  -     0.79  -
retrieval             -     0.75  0.13  0.50          -     0.77  0.05  0.57
siberia               1.20  -     -     -             0.71  -     -     -
Length                1.70  0.97  2.67  0.87
Retrieval Example

Query: contaminated retrieval (query weight 1 for each query term, 0 elsewhere)

Similarity scores (dot product of the query vector with each normalized document vector):
Doc 1: 0.29   Doc 2: 0.90   Doc 3: 0.19   Doc 4: 0.57

Ranked list: Doc 2, Doc 4, Doc 1, Doc 3
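The whole pipeline from the last few slides (TF.IDF weighting, cosine normalization, query scoring) fits in a short script. The term frequencies below are from the worked example; running it reproduces the ranked list.

```python
import math

# Term frequencies per document for the worked example (4 documents).
tf = {
    "complicated":  {3: 5, 4: 2},
    "contaminated": {1: 4, 2: 1, 3: 3},
    "fallout":      {1: 5, 3: 4, 4: 3},
    "information":  {1: 6, 2: 3, 3: 3, 4: 2},
    "interesting":  {2: 1},
    "nuclear":      {1: 3, 3: 7},
    "retrieval":    {2: 6, 3: 1, 4: 4},
    "siberia":      {1: 2},
}
N = 4
idf = {t: math.log10(N / len(docs)) for t, docs in tf.items()}

# TF.IDF weights, then cosine-normalize each document vector.
w = {t: {d: f * idf[t] for d, f in docs.items()} for t, docs in tf.items()}
length = {d: math.sqrt(sum(w[t].get(d, 0.0) ** 2 for t in tf))
          for d in range(1, N + 1)}
w_norm = {t: {d: wt / length[d] for d, wt in docs.items()}
          for t, docs in w.items()}

# Query "contaminated retrieval", weight 1 per term: score is a dot product.
query = ["contaminated", "retrieval"]
scores = {d: sum(w_norm[t].get(d, 0.0) for t in query) for d in range(1, N + 1)}
ranked = sorted(scores, key=scores.get, reverse=True)

print(ranked)               # [2, 4, 1, 3]
print(round(scores[2], 2))  # 0.9
```

Doc 2 wins despite containing "contaminated" only once, because its six occurrences of "retrieval" dominate after normalization; Doc 3 mentions both query terms yet ranks last, since its vector is dominated by "nuclear".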



Language Models
- Based on the notion of probabilities and of processes for generating text
- Documents are ranked based on the probability that they generated the query
- Best/partial match

What is a Language Model?
- A probability distribution over strings of text: how likely is a given string in a given "language"?
- Probabilities depend on what language we're modeling
p1 = P("a quick brown dog")
p2 = P("dog quick a brown")
p3 = P("быстрая brown dog")
p4 = P("быстрая собака")
In a language model for English: p1 > p2 > p3 > p4
In a language model for Russian: p1 < p2 < p3 < p4

How Do We Model a Language?
- Brute-force counts? Think of all the things that have ever been said or will ever be said, of any length, and count how often each one occurs
- Is understanding the path to enlightenment? Figure out how meaning and thoughts are expressed, and build a model based on that
- Throw up our hands and admit defeat?

Unigram Language Model
- Assume each word is generated independently
- Obviously this is not true… but it seems to work well in practice!
- The probability of a string given a model decomposes into a product of the probabilities of the individual words:
  P(w1 w2 … wn | M) = P(w1 | M) × P(w2 | M) × … × P(wn | M)

Physical Metaphor
- Colored balls are randomly drawn from an urn (with replacement); the urn is the model M, the balls are words
- The probability of a sequence of draws is the product of the individual draw probabilities, e.g. (4/9) × (2/9) × (4/9) × (3/9)

An Example
Model M:
P(w)   w
0.2    the
0.1    a
0.01   man
0.01   woman
0.03   said
0.02   likes
…
P("the man likes the woman" | M)
= P(the | M) × P(man | M) × P(likes | M) × P(the | M) × P(woman | M)
= 0.2 × 0.01 × 0.02 × 0.2 × 0.01
= 0.00000008
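The unigram calculation above is a one-line product in code. This sketch uses exactly the model M from the example; it deliberately omits smoothing, so any word outside M would raise a KeyError rather than get probability zero.

```python
# Unigram model M from the example slide.
M = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01,
     "said": 0.03, "likes": 0.02}

def string_probability(text, model):
    """P(string | M) under the unigram independence assumption."""
    p = 1.0
    for word in text.split():
        p *= model[word]   # independence: just multiply the word probabilities
    return p

p = string_probability("the man likes the woman", M)
print(p)  # ~8e-08, matching the slide's 0.00000008
```

Notice how quickly the product shrinks even for a five-word string; real systems work with log-probabilities to avoid underflow.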

Editor's Notes

  • #5: Highly successful, all major businesses use an RDB system
  • #6: Popular approach in 80s but has not been successful for gen. IR
  • #7: System returns best-matching documents given the query. Had limited appeal until the Web became popular.
  • #10: What documents are good matches for a query?
  • #12: What documents are good matches for a query? How do we represent the complexities of language? Keep in mind that computers don't "understand" documents or queries. Simple, yet effective approach: "bag of words". Treat all the words in a document as index terms for that document, assign a "weight" to each term based on its "importance", and disregard order, structure, meaning, etc. of the words.
  • #13: Bag-of-words approach: information retrieval is all (and only) about matching words in documents with words in queries. Obviously not true… but it works pretty well!
  • #16: Processing the enormous quantities of data necessary for these advances requires large clusters, making distributed computing paradigms more crucial than ever. MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, ranging from scheduling to synchronization to fault tolerance. MapReduce + cloud computing debate.