CSE509: Introduction to Web Science and Technology
Lecture 2: Basic Information Retrieval Models
Muhammad Atif Qureshi and Arjumand Younus
Web Science Research Group, Institute of Business Administration (IBA)
Last Time…
- What is Web Science?
- Why do we need Web Science?
- Implications of Web Science
- Web Science case study: Diff-IE
July 16, 2011
Today
- Basic information retrieval approaches
- Bag-of-words assumption
- Information retrieval models: Boolean model, vector-space model, topic/language models
Computer-Based Information Management
- Basic problem: how do we use computers to help humans store, organize, and retrieve information?
- What approaches have been taken, and which have been successful?
Three Major Approaches
- Database approach
- Expert-system approach
- Information retrieval approach
Database Approach
- Information is stored in a highly structured way: data kept in relational tables as tuples
- Simple data model and query language: the relational model and SQL
- Clear interpretation of data and queries; no ambition to be "intelligent" like humans
- Main focus on system efficiency
- Pros and cons?
Expert-System Approach
- Information is stored as a set of logical predicates: Bird(X), Cat(X), Fly(X)
- Given a query, the system infers the answer through logical inference: Bird(Ostrich) → Fly(Ostrich)?
- Pros and cons?
Information Retrieval Approach
- Uses existing text documents as the information source; no special structuring or database construction required
- Text-based query language: keyword-based or natural-language queries
- Pros and cons?
Database vs. Information Retrieval
- What we're retrieving: IR deals with mostly unstructured free text plus some metadata; databases hold structured data with clear semantics based on a formal model.
- Queries we're posing: IR queries are vague, imprecise information needs (often expressed in natural language); database queries are formally (mathematically) defined and unambiguous.
- Results we get: IR results are sometimes relevant, often not; database results are exact and always correct in a formal sense.
- Interaction with the system: in IR, interaction is important; databases handle one-shot queries.
- Other issues: downplayed in IR; in databases, concurrency, recovery, and atomicity are all critical.
Main Challenge of the Information Retrieval Approach
- Interpretation of the query and the data is not straightforward
- Both queries and data are fuzzy: unstructured text and natural-language queries
Need for an Information Retrieval Model
- Computers do not understand the document or the query
- To implement the information retrieval approach, a model that a computer can work with is essential
What is a Model?
- A model is a construct designed to help us understand a complex system; a particular way of "looking at things"
- Models work by adopting simplifying assumptions
- Different types of models: conceptual models, physical analog models, mathematical models, …
Major Simplification: Bag of Words
- Consider each document as a "bag of words"
- Consider queries as a "bag of words" as well
- A great oversimplification, but it works adequately in many cases
- Still, how do we match documents and queries?
Sample Document

"Bag of words" view:
16 × said; 14 × McDonalds; 12 × fat; 11 × fries; 8 × new; 6 × company, french, nutrition; 5 × food, oil, percent, reduce, taste, Tuesday; …

Original article:

McDonald's slims down spuds
Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.

NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier.

But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA.

But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste.

Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment.
…
Vector Representation
- "Bags of words" can be represented as vectors
- Why vectors? Computational efficiency and ease of manipulation
- A vector is a set of values recorded in any consistent order

Example: "The quick brown fox jumped over the lazy dog's back" → [ 1 1 1 1 1 1 1 1 2 ]
- 1st position corresponds to "back"
- 2nd position corresponds to "brown"
- 3rd position corresponds to "dog"
- 4th position corresponds to "fox"
- 5th position corresponds to "jump"
- 6th position corresponds to "lazy"
- 7th position corresponds to "over"
- 8th position corresponds to "quick"
- 9th position corresponds to "the"
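The steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming the tokens have already been normalized as on the slide ("jumped" → "jump", "dog's" → "dog"); alphabetical order of the vocabulary fixes each term's position.

```python
from collections import Counter

# Tokens from the example sentence, already lowercased and normalized.
tokens = ["the", "quick", "brown", "fox", "jump",
          "over", "the", "lazy", "dog", "back"]

counts = Counter(tokens)
vocab = sorted(counts)                 # consistent (alphabetical) term order
vector = [counts[t] for t in vocab]    # one count per vocabulary position

print(vocab)   # ['back', 'brown', 'dog', 'fox', 'jump', 'lazy', 'over', 'quick', 'the']
print(vector)  # [1, 1, 1, 1, 1, 1, 1, 1, 2]
```

The count 2 in the last position is "the", exactly as in the slide's vector.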
Representing Documents

Document 1: "The quick brown fox jumped over the lazy dog's back."
Document 2: "Now is the time for all good men to come to the aid of their party."
Stopword list: for, is, of, the, to

Term     Doc 1   Doc 2
aid        0       1
all        0       1
back       1       0
brown      1       0
come       0       1
dog        1       0
fox        1       0
good       0       1
jump       1       0
lazy       1       0
men        0       1
now        0       1
over       1       0
party      0       1
quick      1       0
their      0       1
time       0       1
Boolean Model
- Return all documents that contain the words in the query
- Weights assigned to terms are either "0" or "1": "0" represents absence (the term isn't in the document), "1" represents presence (the term is in the document)
- No notion of "ranking": a document is either a match or a non-match
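A toy sketch of Boolean matching, reusing the two example documents from the previous slide: a document matches if and only if it contains every query term (an implicit AND), with no ranking among matches.

```python
docs = {
    1: "the quick brown fox jumped over the lazy dog back",
    2: "now is the time for all good men to come to the aid of their party",
}

def boolean_match(query, docs):
    """Return ids of documents containing ALL query terms (0/1 presence)."""
    q = set(query.lower().split())
    return [doc_id for doc_id, text in docs.items()
            if q <= set(text.lower().split())]   # subset test: every term present

print(boolean_match("quick fox", docs))   # [1]
print(boolean_match("quick time", docs))  # [] -- partial matches are rejected
```

The empty result for "quick time" shows the model's rigidity: a document missing even one query term is a non-match, which motivates the ranked-retrieval models that follow.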
Evaluation: Precision and Recall
- Q: Are all matching documents what users want?
- Basic idea: a model is good if it returns a document if and only if it is relevant
- R: set of relevant documents; D: set of documents returned by a model
- Precision = |R ∩ D| / |D|; Recall = |R ∩ D| / |R|
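The two set-based measures are easy to compute directly. The document sets below are hypothetical, chosen only to illustrate the definitions.

```python
def precision_recall(relevant, returned):
    """Precision and recall from sets R (relevant) and D (returned)."""
    hits = len(relevant & returned)                      # |R intersect D|
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

R = {1, 2, 3, 4}   # hypothetical relevant documents
D = {2, 3, 5}      # hypothetical documents a model returned

p, r = precision_recall(R, D)
print(p, r)  # 2/3 of what was returned is relevant; 1/2 of the relevant set was found
```

Note the trade-off: returning everything maximizes recall but ruins precision, and vice versa.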
Ranked Retrieval
- Order documents by how likely they are to be relevant to the information need
- Arranging documents by relevance is closer to how humans think: some documents are "better" than others
- It is also closer to user behavior: users can decide when to stop reading
- Best (partial) match: documents need not have all query terms, although documents with more query terms should be "better"
Similarity-Based Queries
- Let's replace relevance with "similarity": rank documents by their similarity to the query
- Treat the query as if it were a document: create a query bag of words, find its similarity to each document, and rank-order the documents by similarity
- Surprisingly, this works pretty well!
Vector Space Model
- Documents "close together" in vector space "talk about" the same things
- Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ "closeness")

(Figure: document vectors d1–d5 plotted in a space with term axes t1, t2, t3; θ and φ are angles between vectors.)
Similarity Metric
- Use the angle between the query vector and the document vector:
  sim(d, q) = cos θ = (d · q) / (‖d‖ ‖q‖)
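The cosine of the angle can be computed straight from the formula; a minimal sketch with made-up vectors:

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(di * qi for di, qi in zip(d, q))
    norm = (math.sqrt(sum(di * di for di in d)) *
            math.sqrt(sum(qi * qi for qi in q)))
    return dot / norm if norm else 0.0

print(cosine_similarity([1, 1, 0], [1, 0, 0]))  # ~0.707: vectors 45 degrees apart
print(cosine_similarity([2, 2, 0], [1, 1, 0]))  # ~1.0: same direction, length ignored
```

The second call shows why cosine is a good fit for documents of different lengths: only direction matters, not magnitude.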
How Do We Weight Document Terms?
Here's the intuition:
- Terms that appear often in a document should get high weights
- Terms that appear in many documents should get low weights
How do we capture this mathematically?
- Term frequency
- Inverse document frequency

TF.IDF Term Weighting
Simple, yet effective!
  w_ij = tf_ij × log10(N / n_i)
where:
- w_ij: weight assigned to term i in document j
- tf_ij: number of occurrences of term i in document j
- N: number of documents in the entire collection
- n_i: number of documents containing term i
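The formula maps directly to code. The document frequencies below are taken from the worked example that follows (4 documents in the collection); base-10 log matches the idf values in that example.

```python
import math

N = 4  # number of documents in the example collection
df = {"complicated": 2, "contaminated": 3, "fallout": 3, "information": 4,
      "interesting": 1, "nuclear": 2, "retrieval": 3, "siberia": 1}

idf = {term: math.log10(N / n) for term, n in df.items()}

print(round(idf["complicated"], 3))  # 0.301 -- term in half the collection
print(round(idf["information"], 3))  # 0.0   -- term in every document carries no signal

# TF.IDF weight of "complicated" in a document where it occurs 5 times:
print(round(5 * idf["complicated"], 2))  # 1.51
```

A term that occurs in every document (here "information") gets idf 0, so no amount of term frequency can make it matter, which is exactly the intended behavior.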
TF.IDF Example

Term           idf     tf:  D1  D2  D3  D4    w_ij:  D1    D2    D3    D4
complicated    0.301        -   -   5   2            -     -     1.51  0.60
contaminated   0.125        4   1   3   -            0.50  0.13  0.38  -
fallout        0.125        5   -   4   3            0.63  -     0.50  0.38
information    0.000        6   3   3   2            0.00  0.00  0.00  0.00
interesting    0.602        -   1   -   -            -     0.60  -     -
nuclear        0.301        3   -   7   -            0.90  -     2.11  -
retrieval      0.125        -   6   1   4            -     0.75  0.13  0.50
siberia        0.602        2   -   -   -            1.20  -     -     -
Normalizing Document Vectors
- Recall our similarity function: sim(d, q) = (d · q) / (‖d‖ ‖q‖)
- Normalize document vectors in advance
- Use the "cosine normalization" method: divide each term weight by the length (Euclidean norm) of the document vector
Normalization Example
(term frequencies and idf values as in the TF.IDF example)

Term           w_ij:  D1    D2    D3    D4    w'_ij:  D1    D2    D3    D4
complicated           -     -     1.51  0.60          -     -     0.57  0.69
contaminated          0.50  0.13  0.38  -             0.29  0.13  0.14  -
fallout               0.63  -     0.50  0.38          0.37  -     0.19  0.44
information           0.00  0.00  0.00  0.00          0.00  0.00  0.00  0.00
interesting           -     0.60  -     -             -     0.62  -     -
nuclear               0.90  -     2.11  -             0.53  -     0.79  -
retrieval             -     0.75  0.13  0.50          -     0.77  0.05  0.57
siberia               1.20  -     -     -             0.71  -     -     -
Length                1.70  0.97  2.67  0.87
Retrieval Example

Query: contaminated retrieval (query weight 1 for each query term, 0 elsewhere)

Similarity scores (dot product of the query vector with each normalized document vector):
Doc 1: 0.29   Doc 2: 0.90   Doc 3: 0.19   Doc 4: 0.57

Ranked list: Doc 2, Doc 4, Doc 1, Doc 3
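The whole pipeline from the last few slides (TF.IDF weighting, cosine normalization, query scoring) fits in a short script. The term frequencies below are from the worked example; running it reproduces the ranked list.

```python
import math

# Term frequencies per document for the worked example (4 documents).
tf = {
    "complicated":  {3: 5, 4: 2},
    "contaminated": {1: 4, 2: 1, 3: 3},
    "fallout":      {1: 5, 3: 4, 4: 3},
    "information":  {1: 6, 2: 3, 3: 3, 4: 2},
    "interesting":  {2: 1},
    "nuclear":      {1: 3, 3: 7},
    "retrieval":    {2: 6, 3: 1, 4: 4},
    "siberia":      {1: 2},
}
N = 4
idf = {t: math.log10(N / len(docs)) for t, docs in tf.items()}

# TF.IDF weights, then cosine-normalize each document vector.
w = {t: {d: f * idf[t] for d, f in docs.items()} for t, docs in tf.items()}
length = {d: math.sqrt(sum(w[t].get(d, 0.0) ** 2 for t in tf))
          for d in range(1, N + 1)}
w_norm = {t: {d: wt / length[d] for d, wt in docs.items()}
          for t, docs in w.items()}

# Query "contaminated retrieval", weight 1 per term: score is a dot product.
query = ["contaminated", "retrieval"]
scores = {d: sum(w_norm[t].get(d, 0.0) for t in query) for d in range(1, N + 1)}
ranked = sorted(scores, key=scores.get, reverse=True)

print(ranked)               # [2, 4, 1, 3]
print(round(scores[2], 2))  # 0.9
```

Doc 2 wins despite containing "contaminated" only once, because its six occurrences of "retrieval" dominate after normalization; Doc 3 mentions both query terms yet ranks last, since its vector is dominated by "nuclear".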



Language Models
- Based on the notion of probabilities and of processes for generating text
- Documents are ranked based on the probability that they generated the query
- Best/partial match

What is a Language Model?
- A probability distribution over strings of text: how likely is a given string in a given "language"?
- Probabilities depend on what language we're modeling
p1 = P("a quick brown dog")
p2 = P("dog quick a brown")
p3 = P("быстрая brown dog")
p4 = P("быстрая собака")
In a language model for English: p1 > p2 > p3 > p4
In a language model for Russian: p1 < p2 < p3 < p4

How Do We Model a Language?
- Brute-force counts? Think of all the things that have ever been said or will ever be said, of any length, and count how often each one occurs
- Is understanding the path to enlightenment? Figure out how meaning and thoughts are expressed, and build a model based on that
- Throw up our hands and admit defeat?

Unigram Language Model
- Assume each word is generated independently
- Obviously this is not true… but it seems to work well in practice!
- The probability of a string given a model decomposes into a product of the probabilities of the individual words:
  P(w1 w2 … wn | M) = P(w1 | M) × P(w2 | M) × … × P(wn | M)

Physical Metaphor
- Colored balls are randomly drawn from an urn (with replacement); the urn is the model M, the balls are words
- The probability of a sequence of draws is the product of the individual draw probabilities, e.g. (4/9) × (2/9) × (4/9) × (3/9)

An Example
Model M:
P(w)   w
0.2    the
0.1    a
0.01   man
0.01   woman
0.03   said
0.02   likes
…
P("the man likes the woman" | M)
= P(the | M) × P(man | M) × P(likes | M) × P(the | M) × P(woman | M)
= 0.2 × 0.01 × 0.02 × 0.2 × 0.01
= 0.00000008
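The unigram calculation above is a one-line product in code. This sketch uses exactly the model M from the example; it deliberately omits smoothing, so any word outside M would raise a KeyError rather than get probability zero.

```python
# Unigram model M from the example slide.
M = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01,
     "said": 0.03, "likes": 0.02}

def string_probability(text, model):
    """P(string | M) under the unigram independence assumption."""
    p = 1.0
    for word in text.split():
        p *= model[word]   # independence: just multiply the word probabilities
    return p

p = string_probability("the man likes the woman", M)
print(p)  # ~8e-08, matching the slide's 0.00000008
```

Notice how quickly the product shrinks even for a five-word string; real systems work with log-probabilities to avoid underflow.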

Editor's Notes

  • #5: Highly successful, all major businesses use an RDB system
  • #6: Popular approach in 80s but has not been successful for gen. IR
  • #7: System returns best-matching documents given the query. Had limited appeal until the Web became popular.
  • #10: What documents are good matches for a query?
  • #12: What documents are good matches for a query? How do we represent the complexities of language? Keep in mind that computers don't "understand" documents or queries. Simple, yet effective approach: "bag of words". Treat all the words in a document as index terms for that document, assign a "weight" to each term based on its "importance", and disregard order, structure, meaning, etc. of the words.
  • #13: Bag-of-words approach: information retrieval is all (and only) about matching words in documents with words in queries. Obviously not true… but it works pretty well!
  • #16: Processing the enormous quantities of data necessary for these advances requires large clusters, making distributed computing paradigms more crucial than ever. MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, ranging from scheduling to synchronization to fault tolerance. MapReduce + cloud computing debate.