Chapter 19: Information Retrieval




Database System Concepts, 1st Ed.
©VNS InfoSolutions Private Limited, Varanasi(UP), India 221002
See www.vnsispl.com for conditions on re-use
Chapter 19: Information Retrieval


• Relevance Ranking Using Terms
• Relevance Using Hyperlinks
• Synonyms, Homonyms, and Ontologies
• Indexing of Documents
• Measuring Retrieval Effectiveness
• Web Search Engines
• Information Retrieval and Structured Data
• Directories




Information Retrieval Systems
• Information retrieval (IR) systems use a simpler data model than database systems
    - Information is organized as a collection of documents
    - Documents are unstructured and have no schema
• Information retrieval locates relevant documents on the basis of user input such as keywords or example documents
    - E.g., find documents containing the words “database systems”
• Can be used even on textual descriptions provided with non-textual data such as images
• Web search engines are the most familiar example of IR systems




Information Retrieval Systems (Cont.)
• Differences from database systems
    - IR systems don’t deal with transactional updates (including concurrency control and recovery)
    - Database systems deal with structured data, with schemas that define the data organization
    - IR systems deal with some querying issues not generally addressed by database systems
        - Approximate searching by keywords
        - Ranking of retrieved answers by estimated degree of relevance




Keyword Search
• In full text retrieval, all the words in each document are considered to be keywords
    - We use the word term to refer to the words in a document
• Information-retrieval systems typically allow query expressions formed using keywords and the logical connectives and, or, and not
    - The and connective is implicit between keywords when no connective is specified
• Ranking of documents on the basis of estimated relevance to a query is critical
    - Relevance ranking is based on factors such as
        - Term frequency
            - Frequency of occurrence of the query keyword in the document
        - Inverse document frequency
            - How many documents the query keyword occurs in
            - The fewer documents a keyword occurs in, the more importance it is given
        - Hyperlinks to documents
            - The more links point to a document, the more important the document is




Relevance Ranking Using Terms
• TF-IDF (Term Frequency/Inverse Document Frequency) ranking:
    - Let n(d) = number of terms in the document d
    - Let n(d, t) = number of occurrences of term t in the document d
    - Relevance of a document d to a term t:

      $$TF(d, t) = \log\left(1 + \frac{n(d, t)}{n(d)}\right)$$

        - The log factor avoids giving excessive weight to frequent terms
    - Relevance of a document d to a query Q, where n(t) is the number of documents that contain the term t:

      $$r(d, Q) = \sum_{t \in Q} \frac{TF(d, t)}{n(t)}$$
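
The two formulas above can be tried out on a toy corpus. The following is a minimal sketch, not a reference implementation: it assumes documents are already split into terms, uses the natural logarithm (the slide does not fix a base), and the names `tf`, `relevance`, and `corpus` are made up for illustration.

```python
import math
from collections import Counter

def tf(doc_terms, term):
    """TF(d, t) = log(1 + n(d, t) / n(d)) for a tokenized document."""
    if not doc_terms:
        return 0.0
    return math.log(1 + Counter(doc_terms)[term] / len(doc_terms))

def relevance(doc_terms, query_terms, corpus):
    """r(d, Q) = sum over t in Q of TF(d, t) / n(t)."""
    score = 0.0
    for t in set(query_terms):
        n_t = sum(1 for d in corpus if t in d)  # number of documents containing t
        if n_t:  # skip terms that occur nowhere, which avoids dividing by zero
            score += tf(doc_terms, t) / n_t
    return score

# Hypothetical three-document corpus, already tokenized.
corpus = [
    "database systems store structured data".split(),
    "information retrieval systems rank documents".split(),
    "web search engines are information retrieval systems".split(),
]
query = ["database", "systems"]

# Document positions ordered by decreasing relevance to the query.
ranked = sorted(range(len(corpus)),
                key=lambda i: relevance(corpus[i], query, corpus),
                reverse=True)
```

Because "systems" occurs in every document, dividing by n(t) shrinks its contribution, so the rarer keyword "database" decides the ranking.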




Relevance Ranking Using Terms (Cont.)
• Most systems add to the above model
    - Words that occur in the title, author list, section headings, etc. are given greater importance
    - Words whose first occurrence is late in the document are given lower importance
    - Very common words such as “a”, “an”, “the”, “it”, etc. are eliminated
        - These are called stop words
    - Proximity: if the keywords in a query occur close together in the document, the document has higher importance than if they occur far apart
• Documents are returned in decreasing order of relevance score
    - Usually only the top few documents are returned, not all
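
Two of the refinements above, stop-word elimination and returning only the top few documents, are easy to sketch. The stop-word set below holds only the slide's four examples, and the scores are invented for illustration; a real system would use a longer list and scores computed by its ranking function.

```python
import heapq

# Only the slide's examples; practical stop-word lists are much longer.
STOP_WORDS = {"a", "an", "the", "it"}

def remove_stop_words(terms):
    """Drop stop words before indexing or querying."""
    return [t for t in terms if t.lower() not in STOP_WORDS]

def top_k(scores, k):
    """Return the k (doc_id, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

terms = remove_stop_words("The index maps a keyword to it".split())
best = top_k({"d1": 0.12, "d2": 0.40, "d3": 0.05, "d4": 0.31}, k=2)
```

`heapq.nlargest` avoids sorting the full result set when only a few answers are shown.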




Similarity Based Retrieval
• Similarity-based retrieval: retrieve documents similar to a given document
    - Similarity may be defined on the basis of common words
        - E.g., find the k terms in the given document A with the highest TF(A, t) / n(t) and use these terms to find the relevance of other documents
• Relevance feedback: similarity can be used to refine the answer set to a keyword query
    - The user selects a few relevant documents from those retrieved by the keyword query, and the system finds other documents similar to these
• Vector space model: define an n-dimensional space, where n is the number of distinct words in the document set
    - The vector for document d goes from the origin to a point whose i-th coordinate is TF(d, t) / n(t), where t is the i-th word
    - The cosine of the angle between the vectors of two documents is used as a measure of their similarity
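
The cosine measure from the vector space model can be sketched as follows. Vectors are stored sparsely as dictionaries from term to weight, which is a choice made for this example; the weights are made-up stand-ins for TF(d, t) / n(t).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical weights, one entry per term that occurs in the document.
doc_a = {"motorcycle": 0.5, "repair": 0.5}
doc_b = {"motorcycle": 0.5, "maintenance": 0.5}
doc_c = {"database": 1.0}

sim_ab = cosine_similarity(doc_a, doc_b)  # one shared term out of two
sim_ac = cosine_similarity(doc_a, doc_c)  # no shared terms
```

Documents with no terms in common get similarity 0, and a document compared with itself gets 1.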




Relevance Using Hyperlinks
• The number of documents relevant to a query can be enormous if only term frequencies are taken into account
• Using term frequencies makes “spamming” easy
    - E.g., a travel agency can add many occurrences of the word “travel” to its page to make its rank very high
• Most of the time people are looking for pages from popular sites
• Idea: use the popularity of a Web site (e.g., how many people visit it) to rank site pages that match given keywords
• Problem: it is hard to find the actual popularity of a site
    - Solution: next slide




Relevance Using Hyperlinks (Cont.)
• Solution: use the number of hyperlinks to a site as a measure of the popularity or prestige of the site
    - Count only one hyperlink from each site (why? see previous slide)
    - The popularity measure is for the site, not for an individual page
        - But most hyperlinks are to the root of a site
        - Also, the concept of a “site” is difficult to define, since a URL prefix like cs.yale.edu contains many unrelated pages of varying popularity
• Refinements
    - When computing prestige based on links to a site, give more weight to links from sites that themselves have higher prestige
        - The definition is circular
        - Set up and solve a system of simultaneous linear equations
    - The above idea is the basis of the Google PageRank ranking mechanism
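
The circular definition of prestige can be approximated by repeated substitution instead of solving the linear system directly. The sketch below is in the spirit of the refinement described above and is not a description of any search engine's actual algorithm: the damping factor of 0.85, the iteration count, the handling of sites without outgoing links, and the toy link graph are all assumptions of this example.

```python
def prestige(links, damping=0.85, iterations=100):
    """Iteratively approximate prestige scores for a link graph.

    links maps each site to the list of sites it links to.
    damping and iterations are assumptions of this sketch.
    """
    sites = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(sites)
    rank = {s: 1.0 / n for s in sites}
    for _ in range(iterations):
        new_rank = {s: (1.0 - damping) / n for s in sites}
        for s in sites:
            # Count only one link per pair of sites, and ignore self-links.
            targets = set(links.get(s, ())) - {s}
            if targets:
                share = damping * rank[s] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A site with no outgoing links spreads its prestige evenly.
                for t in sites:
                    new_rank[t] += damping * rank[s] / n
        rank = new_rank
    return rank

# Hypothetical link graph: three sites link to "c", and "c" links to "a".
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = prestige(links)
```

Here "c" ends up with the highest score because it has the most incoming links, and "a" outranks "b" and "d" because its single incoming link comes from the prestigious "c".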




Relevance Using Hyperlinks (Cont.)
• Connections to social networking theories that ranked the prestige of people
    - E.g., the president of the U.S.A. has high prestige since many people know him
    - Someone known by multiple prestigious people has high prestige
• Hub and authority based ranking
    - A hub is a page that stores links to many pages (on a topic)
    - An authority is a page that contains actual information on a topic
    - Each page gets a hub prestige based on the prestige of the authorities that it points to
    - Each page gets an authority prestige based on the prestige of the hubs that point to it
    - Again, the prestige definitions are cyclic, and the values can be obtained by solving linear equations
    - Use authority prestige when ranking answers to a query
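
The mutually dependent hub and authority scores can be approximated by alternating the two update rules until they settle. This is a sketch under assumptions: the normalization step (which only keeps the numbers bounded), the iteration count, and the four-page graph are choices made for this example and are not taken from the slide.

```python
import math

def hubs_and_authorities(links, iterations=50):
    """Alternate the hub and authority updates on a link graph."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority prestige: sum of the hub prestige of pages pointing to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in set(targets):
                auth[t] += hub[src]
        # Hub prestige: sum of the authority prestige of pages it points to.
        hub = {p: sum(auth[t] for t in set(links.get(p, ()))) for p in pages}
        # Normalize both score sets; only relative values matter.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical graph: h1 and h2 are link pages, a1 and a2 hold content.
links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hubs_and_authorities(links)
```

In this graph "a1" gets the highest authority prestige because both hubs point to it, and "h1" gets the highest hub prestige because it points to both authorities.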



Synonyms and Homonyms
• Synonyms
    - E.g., document: “motorcycle repair”, query: “motorcycle maintenance”
        - The system needs to realize that “maintenance” and “repair” are synonyms
    - The system can extend the query as “motorcycle and (repair or maintenance)”
• Homonyms
    - E.g., “object” has different meanings as noun/verb
    - Can disambiguate meanings (to some extent) from the context
• Extending queries automatically using synonyms can be problematic
    - Need to understand the intended meaning in order to infer synonyms
        - Or verify synonyms with the user
    - Synonyms may have other meanings as well
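
The query extension “motorcycle and (repair or maintenance)” can be represented as a list of alternative sets: and across the list, or within each set. The synonym table below is a hypothetical two-entry example, and the sketch does not attempt any of the meaning disambiguation that the slide warns is needed in practice.

```python
# Hypothetical synonym table; a real system would use a curated thesaurus.
SYNONYMS = {"repair": {"maintenance"}, "maintenance": {"repair"}}

def expand_query(keywords, synonyms=SYNONYMS):
    """Return a list of sets: 'and' across the list, 'or' within each set."""
    return [{k} | synonyms.get(k, set()) for k in keywords]

def matches(doc_terms, expanded):
    """True if the document contains at least one alternative from every set."""
    terms = set(doc_terms)
    return all(terms & alternatives for alternatives in expanded)

expanded = expand_query(["motorcycle", "maintenance"])
found = matches("motorcycle repair".split(), expanded)
missed = matches("motorcycle sales".split(), expanded)
```

The document “motorcycle repair” now matches the query “motorcycle maintenance”, while “motorcycle sales” still does not.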




Concept-Based Querying
• Approach
    - For each word, determine the concept it represents from context
    - Use one or more ontologies:
        - Hierarchical structures showing the relationships between concepts
        - E.g., the ISA relationship that we saw in the E-R model
• This approach can be used to standardize terminology in a specific field
• Ontologies can link multiple languages
• Foundation of the Semantic Web (not covered here)




Indexing of Documents
• An inverted index maps each keyword K_i to the set S_i of documents that contain the keyword
    - Documents are identified by identifiers
• The inverted index may record
    - Keyword locations within each document, to allow proximity-based ranking
    - Counts of the number of occurrences of the keyword, to compute TF
• and operation: finds documents that contain all of K_1, K_2, ..., K_n
    - Intersection S_1 ∩ S_2 ∩ ... ∩ S_n
• or operation: finds documents that contain at least one of K_1, K_2, ..., K_n
    - Union S_1 ∪ S_2 ∪ ... ∪ S_n
• Each S_i is kept sorted to allow efficient intersection/union by merging
    - “not” can also be efficiently implemented by merging of sorted lists
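
The merge-based and, or, and not operations can be sketched over sorted lists of document identifiers. This is a minimal illustration: the whitespace tokenizer, the function names, and the three sample documents are assumptions of this example, and the index records neither positions nor occurrence counts.

```python
from collections import defaultdict

def build_index(docs):
    """Map each keyword to a sorted list of identifiers of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def intersect(a, b):
    """and: merge two sorted lists, keeping identifiers present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """or: merge two sorted lists, keeping each identifier once."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    return out + a[i:] + b[j:]

def difference(a, b):
    """not: identifiers in sorted list a that are absent from sorted list b."""
    i = j = 0
    out = []
    while i < len(a):
        if j == len(b) or a[i] < b[j]:
            out.append(a[i])
            i += 1
        elif a[i] == b[j]:
            i += 1
            j += 1
        else:
            j += 1
    return out

# Hypothetical documents keyed by identifier.
docs = {1: "database systems", 2: "information retrieval systems", 3: "database retrieval"}
index = build_index(docs)
both = intersect(index["database"], index["systems"])
either = union(index["database"], index["systems"])
systems_only = difference(index["systems"], index["database"])
```

Each operation makes a single pass over its inputs, which is why keeping every list sorted pays off.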




Measuring Retrieval Effectiveness
• Information-retrieval systems save space by using index structures that support only approximate retrieval. This may result in:
    - False negatives (false drops): some relevant documents may not be retrieved
    - False positives: some irrelevant documents may be retrieved
    - For many applications a good index should not permit any false drops, but may permit a few false positives
• Relevant performance metrics:
    - Precision: what percentage of the retrieved documents are relevant to the query
    - Recall: what percentage of the documents relevant to the query were retrieved
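
Both metrics follow directly from the set of retrieved documents and the set of relevant documents. A small sketch, returning fractions rather than percentages, with made-up document identifiers:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) as fractions between 0 and 1."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Four documents retrieved, six relevant overall, two in common.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7, 8])
```

Here half of the retrieved documents are relevant (precision 0.5), but only a third of the relevant documents were found (recall about 0.33).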




Measuring Retrieval Effectiveness (Cont.)
• Recall vs. precision tradeoff:
    - Can increase recall by retrieving many documents (down to a low level of relevance ranking), but many irrelevant documents would be fetched, reducing precision
• Measures of retrieval effectiveness:
    - Recall as a function of the number of documents fetched, or
    - Precision as a function of recall
        - Equivalently, as a function of the number of documents fetched
    - E.g., “precision of 75% at recall of 50%, and 60% at a recall of 75%”
• Problem: deciding which documents are actually relevant, and which are not
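
Precision as a function of recall can be tabulated by walking down the ranked answer list and recomputing both values after each fetched document. The ranked list and the relevance judgments below are invented for illustration; in practice the judgments are the hard part, as the last bullet notes.

```python
def precision_recall_curve(ranked, relevant):
    """Return (recall, precision) after each fetched document, in rank order."""
    relevant = set(relevant)
    if not relevant:
        return []  # recall is undefined without any relevant documents
    hits = 0
    points = []
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))
    return points

# Hypothetical ranking; "d9" is relevant but never retrieved.
points = precision_recall_curve(
    ranked=["d1", "d2", "d3", "d4", "d5"],
    relevant={"d1", "d3", "d4", "d9"},
)
```

After four documents this ranking reaches a precision of 75% at a recall of 75%; fetching a fifth, irrelevant document leaves recall unchanged and lowers precision to 60%.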




Web Search Engines
• Web crawlers are programs that locate and gather information on the Web
    - Recursively follow hyperlinks present in known documents, to find other documents
        - Starting from a seed set of documents
    - Fetched documents
        - Are handed over to an indexing system
        - Can be discarded after indexing, or stored as a cached copy
• Crawling the entire Web would take a very large amount of time
    - Search engines typically cover only a part of the Web, not all of it
    - A single crawl can take months
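
The link-following loop of a crawler can be sketched without any network access by passing in a function that returns the links of a page. Everything named here is hypothetical: `fetch_links` stands in for downloading and parsing a page, `WEB` is a toy link graph, and the breadth-first order and `max_pages` cap are choices made for this example.

```python
from collections import deque

def crawl(seeds, fetch_links, max_pages=1000):
    """Breadth-first crawl from a seed set; returns pages in fetch order."""
    frontier = deque(seeds)  # links waiting to be crawled
    seen = set(seeds)        # links already queued, to avoid refetching
    fetched = []
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        fetched.append(url)  # a real crawler would hand the page to the indexer here
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched

# Toy link graph standing in for the Web; "e" is not reachable from the seed.
WEB = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": [], "e": ["a"]}
order = crawl(["a"], lambda url: WEB.get(url, []))
```

Page "e" is never fetched because no known page links to it, which illustrates why a crawl covers only the part of the Web reachable from its seed set.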




Web Crawling (Cont.)
• Crawling is done by multiple processes on multiple machines, running in parallel
    - The set of links to be crawled is stored in a database
    - New links found in crawled pages are added to this set, to be crawled later
• The indexing process also runs on multiple machines
    - Creates a new copy of the index instead of modifying the old index
    - The old index is used to answer queries
    - After a crawl is “completed”, the new index becomes the “old” index
• Multiple machines are used to answer queries
    - Indices may be kept in memory
    - Queries may be routed to different machines for load balancing




Information Retrieval and Structured Data
• Information retrieval systems originally treated documents as a collection of words
• Information extraction systems infer structure from documents, e.g.:
    - Extraction of house attributes (size, address, number of bedrooms, etc.) from a text advertisement
    - Extraction of the topic and the people named from a news article
• Relations or XML structures are used to store extracted data
    - The system seeks connections among data to answer queries
    - Question answering systems
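
The house-advertisement example can be illustrated with pattern matching, one of the simplest ways to infer structure from text. The patterns, the field names, and the sample advertisement are all assumptions of this sketch; practical extraction systems handle far more variation.

```python
import re

def extract_house_attributes(ad):
    """Pull a few attributes out of free text; the patterns are illustrative only."""
    bedrooms = re.search(r"(\d+)\s*(?:bedrooms?|BR)\b", ad, re.IGNORECASE)
    size = re.search(r"(\d[\d,]*)\s*(?:sq\.?\s*ft\.?|square feet)", ad, re.IGNORECASE)
    return {
        "bedrooms": int(bedrooms.group(1)) if bedrooms else None,
        "size_sqft": int(size.group(1).replace(",", "")) if size else None,
    }

# Hypothetical advertisement text.
record = extract_house_attributes("Sunny 3 bedroom house, 1,450 sq ft, near the park")
```

The resulting dictionary has the shape of a tuple in a relation, so the extracted values could be stored and queried like structured data.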




Directories

• Storing related documents together in a library facilitates browsing
    - Users can see not only the requested document but also related ones
• Browsing is facilitated by a classification system that organizes logically related documents together
• Organization is hierarchical: classification hierarchy




[Figure: A Classification Hierarchy For A Library System]




Classification DAG

• Documents can reside in multiple places in a hierarchy in an information retrieval system, since physical location is not important
• The classification hierarchy is thus a directed acyclic graph (DAG)
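
A classification DAG can be stored as a map from each category to its parent categories, so that one category may sit under several parents. The category names below are hypothetical, chosen only to show a node with two parents.

```python
# child -> parents; a node with two parents makes this a DAG, not a tree.
PARENTS = {
    "books": [],
    "science": ["books"],
    "engineering": ["books"],
    "math": ["science"],
    "computer science": ["science", "engineering"],
    "graph algorithms": ["math", "computer science"],
}

def ancestors(node, parents=PARENTS):
    """All categories reachable by following parent links upward."""
    found = set()
    stack = list(parents.get(node, []))
    while stack:
        current = stack.pop()
        if current not in found:  # a category can be reached along several paths
            found.add(current)
            stack.extend(parents.get(current, []))
    return found

categories = ancestors("graph algorithms")
```

A document filed under "graph algorithms" is then found when browsing down from either "math" or "computer science".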




[Figure: A Classification DAG For A Library Information Retrieval System]




Web Directories
• A Web directory is just a classification directory on Web pages
    - E.g., Yahoo! Directory, Open Directory project
    - Issues:
        - What should the directory hierarchy be?
        - Given a document, which nodes of the directory are categories relevant to the document?
    - Often done manually
        - Classification of documents into a hierarchy may be done based on term similarity




End of Chapter





More Related Content

PPT
VNSISPL_DBMS_Concepts_ch18
PPT
VNSISPL_DBMS_Concepts_ch13
PPT
VNSISPL_DBMS_Concepts_AppB
PPT
VNSISPL_DBMS_Concepts_ch22
PPT
Vnsispl dbms concepts_ch1
PPT
VNSISPL_DBMS_Concepts_ch10
PPT
VNSISPL_DBMS_Concepts_ch6
PPT
VNSISPL_DBMS_Concepts_appA
VNSISPL_DBMS_Concepts_ch18
VNSISPL_DBMS_Concepts_ch13
VNSISPL_DBMS_Concepts_AppB
VNSISPL_DBMS_Concepts_ch22
Vnsispl dbms concepts_ch1
VNSISPL_DBMS_Concepts_ch10
VNSISPL_DBMS_Concepts_ch6
VNSISPL_DBMS_Concepts_appA

Similar to VNSISPL_DBMS_Concepts_ch19 (20)

PPT
VNSISPL_DBMS_Concepts_ch9
PPT
VNSISPL_DBMS_Concepts_ch12
PPT
VNSISPL_DBMS_Concepts_ch4
PPTX
Information-Retrieval-Database-Advvnced-Systems.pptx
PPT
PPTX
Tdm information retrieval
PPT
VNSISPL_DBMS_Concepts_ch8
PPT
VNSISPL_DBMS_Concepts_ch2
PDF
Usage of word sense disambiguation in concept identification in ontology cons...
PPT
VNSISPL_DBMS_Concepts_ch21
PPT
VNSISPL_DBMS_Concepts_ch24
PPT
Semantic Text Processing Powered by Wikipedia
PPT
Semantics in Financial Services -David Newman
PDF
SEMANTIC NETWORK BASED MECHANISMS FOR KNOWLEDGE ACQUISITION
PDF
Poster RDAP13: Research Data in eCommons @ Cornell: Present and Future
DOCX
JPJ1423 Keyword Query Routing
PPT
PPTX
Text Mining.pptx
VNSISPL_DBMS_Concepts_ch9
VNSISPL_DBMS_Concepts_ch12
VNSISPL_DBMS_Concepts_ch4
Information-Retrieval-Database-Advvnced-Systems.pptx
Tdm information retrieval
VNSISPL_DBMS_Concepts_ch8
VNSISPL_DBMS_Concepts_ch2
Usage of word sense disambiguation in concept identification in ontology cons...
VNSISPL_DBMS_Concepts_ch21
VNSISPL_DBMS_Concepts_ch24
Semantic Text Processing Powered by Wikipedia
Semantics in Financial Services -David Newman
SEMANTIC NETWORK BASED MECHANISMS FOR KNOWLEDGE ACQUISITION
Poster RDAP13: Research Data in eCommons @ Cornell: Present and Future
JPJ1423 Keyword Query Routing
Text Mining.pptx
Ad

More from sriprasoon (13)

PPT
VNSISPL_DBMS_Concepts_AppC
PPT
VNSISPL_DBMS_Concepts_ch25
PPT
VNSISPL_DBMS_Concepts_ch23
PPT
VNSISPL_DBMS_Concepts_ch20
PPT
VNSISPL_DBMS_Concepts_ch17
PPT
VNSISPL_DBMS_Concepts_ch16
PPT
VNSISPL_DBMS_Concepts_ch15
PPT
VNSISPL_DBMS_Concepts_ch14
PPT
VNSISPL_DBMS_Concepts_ch11
PPT
VNSISPL_DBMS_Concepts_ch7
PPT
VNSISPL_DBMS_Concepts_ch5
PPT
Vnsispl dbms concepts_ch3
PPTX
Inventory Management
VNSISPL_DBMS_Concepts_AppC
VNSISPL_DBMS_Concepts_ch25
VNSISPL_DBMS_Concepts_ch23
VNSISPL_DBMS_Concepts_ch20
VNSISPL_DBMS_Concepts_ch17
VNSISPL_DBMS_Concepts_ch16
VNSISPL_DBMS_Concepts_ch15
VNSISPL_DBMS_Concepts_ch14
VNSISPL_DBMS_Concepts_ch11
VNSISPL_DBMS_Concepts_ch7
VNSISPL_DBMS_Concepts_ch5
Vnsispl dbms concepts_ch3
Inventory Management
Ad

Recently uploaded (20)

PPTX
Unit 4 Computer Architecture Multicore Processor.pptx
PDF
Complications of Minimal Access-Surgery.pdf
PPTX
Virtual and Augmented Reality in Current Scenario
PDF
MICROENCAPSULATION_NDDS_BPHARMACY__SEM VII_PCI .pdf
DOCX
Cambridge-Practice-Tests-for-IELTS-12.docx
PDF
Environmental Education MCQ BD2EE - Share Source.pdf
PDF
Paper A Mock Exam 9_ Attempt review.pdf.
PDF
medical_surgical_nursing_10th_edition_ignatavicius_TEST_BANK_pdf.pdf
PDF
BP 505 T. PHARMACEUTICAL JURISPRUDENCE (UNIT 1).pdf
PDF
CISA (Certified Information Systems Auditor) Domain-Wise Summary.pdf
PDF
International_Financial_Reporting_Standa.pdf
PDF
Mucosal Drug Delivery system_NDDS_BPHARMACY__SEM VII_PCI.pdf
PPTX
Introduction to pro and eukaryotes and differences.pptx
PDF
Empowerment Technology for Senior High School Guide
PDF
FORM 1 BIOLOGY MIND MAPS and their schemes
PDF
AI-driven educational solutions for real-life interventions in the Philippine...
PDF
LIFE & LIVING TRILOGY- PART (1) WHO ARE WE.pdf
PDF
MBA _Common_ 2nd year Syllabus _2021-22_.pdf
PDF
Skin Care and Cosmetic Ingredients Dictionary ( PDFDrive ).pdf
PDF
Hazard Identification & Risk Assessment .pdf
Unit 4 Computer Architecture Multicore Processor.pptx
Complications of Minimal Access-Surgery.pdf
Virtual and Augmented Reality in Current Scenario
MICROENCAPSULATION_NDDS_BPHARMACY__SEM VII_PCI .pdf
Cambridge-Practice-Tests-for-IELTS-12.docx
Environmental Education MCQ BD2EE - Share Source.pdf
Paper A Mock Exam 9_ Attempt review.pdf.
medical_surgical_nursing_10th_edition_ignatavicius_TEST_BANK_pdf.pdf
BP 505 T. PHARMACEUTICAL JURISPRUDENCE (UNIT 1).pdf
CISA (Certified Information Systems Auditor) Domain-Wise Summary.pdf
International_Financial_Reporting_Standa.pdf
Mucosal Drug Delivery system_NDDS_BPHARMACY__SEM VII_PCI.pdf
Introduction to pro and eukaryotes and differences.pptx
Empowerment Technology for Senior High School Guide
FORM 1 BIOLOGY MIND MAPS and their schemes
AI-driven educational solutions for real-life interventions in the Philippine...
LIFE & LIVING TRILOGY- PART (1) WHO ARE WE.pdf
MBA _Common_ 2nd year Syllabus _2021-22_.pdf
Skin Care and Cosmetic Ingredients Dictionary ( PDFDrive ).pdf
Hazard Identification & Risk Assessment .pdf

VNSISPL_DBMS_Concepts_ch19

  • 1. Chapter 19: Information Retrieval Database System Concepts, 1 st Ed. ©VNS InfoSolutions Private Limited, Varanasi(UP), India 221002 See www.vnsispl.com for conditions on re-use 1
  • 2. Chapter 19: Information Retrieval s Relevance Ranking Using Terms s Relevance Using Hyperlinks s Synonyms., Homonyms, and Ontologies s Indexing of Documents s Measuring Retrieval Effectiveness s Web Search Engines s Information Retrieval and Structured Data s Directories Database System Concepts – 1 st Ed. 19.2 ©VNS InfoSolutions Private Limited, Varanasi(UP), India 22100
  • 3. Information Retrieval Systems s Information retrieval (IR) systems use a simpler data model than database systems q Information organized as a collection of documents q Documents are unstructured, no schema s Information retrieval locates relevant documents, on the basis of user input such as keywords or example documents q e.g., find documents containing the words “database systems” s Can be used even on textual descriptions provided with non-textual data such as images s Web search engines are the most familiar example of IR systems Database System Concepts – 1 st Ed. 19.3 ©VNS InfoSolutions Private Limited, Varanasi(UP), India 22100
  • 4. Information Retrieval Systems (Cont.) s Differences from database systems q IR systems don’t deal with transactional updates (including concurrency control and recovery) q Database systems deal with structured data, with schemas that define the data organization q IR systems deal with some querying issues not generally addressed by database systems  Approximate searching by keywords  Ranking of retrieved answers by estimated degree of relevance Database System Concepts – 1 st Ed. 19.4 ©VNS InfoSolutions Private Limited, Varanasi(UP), India 22100
  • 5. Keyword Search s In full text retrieval, all the words in each document are considered to be keywords. q We use the word term to refer to the words in a document s Information-retrieval systems typically allow query expressions formed using keywords and the logical connectives and, or, and not q Ands are implicit, even if not explicitly specified s Ranking of documents on the basis of estimated relevance to a query is critical q Relevance ranking is based on factors such as  Term frequency – Frequency of occurrence of query keyword in document  Inverse document frequency – How many documents the query keyword occurs in » Fewer  give more importance to keyword  Hyperlinks to documents – More links to a document  document is more important Database System Concepts – 1 st Ed. 19.5 ©VNS InfoSolutions Private Limited, Varanasi(UP), India 22100
  • 6. Relevance Ranking Using Terms s TF-IDF (Term frequency/Inverse Document frequency) ranking: q Let n(d) = number of terms in the document d q n(d, t) = number of occurrences of term t in the document d. q Relevance of a document d to a term t n(d, t) TF (d, t) = log 1+ n(d)  The log factor is to avoid excessive weight to frequent terms q Relevance of document to query Q r (d, Q) = ∑ TF (d, t) t∈Q n(t) Database System Concepts – 1 st Ed. 19.6 ©VNS InfoSolutions Private Limited, Varanasi(UP), India 22100
  • 7. Relevance Ranking Using Terms (Cont.) s Most systems add to the above model q Words that occur in title, author list, section headings, etc. are given greater importance q Words whose first occurrence is late in the document are given lower importance q Very common words such as “a”, “an”, “the”, “it” etc are eliminated  Called stop words q Proximity: if keywords in query occur close together in the document, the document has higher importance than if they occur far apart s Documents are returned in decreasing order of relevance score q Usually only top few documents are returned, not all Database System Concepts – 1 st Ed. 19.7 ©VNS InfoSolutions Private Limited, Varanasi(UP), India 22100
  • 8. Similarity Based Retrieval s Similarity based retrieval - retrieve documents similar to a given document q Similarity may be defined on the basis of common words  E.g. find k terms in A with highest TF (d, t ) / n (t ) and use these terms to find relevance of other documents. s Relevance feedback: Similarity can be used to refine answer set to keyword query q User selects a few relevant documents from those retrieved by keyword query, and system finds other documents similar to these s Vector space model: define an n-dimensional space, where n is the number of words in the document set. q Vector for document d goes from origin to a point whose i th coordinate is TF (d,t ) / n (t ) q The cosine of the angle between the vectors of two documents is used as a measure of their similarity. Database System Concepts – 1 st Ed. 19.8 ©VNS InfoSolutions Private Limited, Varanasi(UP), India 22100
  • 9. Relevance Using Hyperlinks s Number of documents relevant to a query can be enormous if only term frequencies are taken into account s Using term frequencies makes “spamming” easy  E.g. a travel agency can add many occurrences of the words “travel” to its page to make its rank very high s Most of the time people are looking for pages from popular sites s Idea: use popularity of Web site (e.g. how many people visit it) to rank site pages that match given keywords s Problem: hard to find actual popularity of site q Solution: next slide Database System Concepts – 1 st Ed. 19.9 ©VNS InfoSolutions Private Limited, Varanasi(UP), India 22100
  • 10. Relevance Using Hyperlinks (Cont.) s Solution: use number of hyperlinks to a site as a measure of the popularity or prestige of the site q Count only one hyperlink from each site (why? - see previous slide) q Popularity measure is for site, not for individual page  But, most hyperlinks are to root of site  Also, concept of “site” difficult to define since a URL prefix like cs.yale.edu contains many unrelated pages of varying popularity s Refinements q When computing prestige based on links to a site, give more weight to links from sites that themselves have higher prestige  Definition is circular  Set up and solve system of simultaneous linear equations q Above idea is basis of the Google PageRank ranking mechanism Database System Concepts – 1 st Ed. 19.10 ©VNS InfoSolutions Private Limited, Varanasi(UP), India 22100
  • 11. Relevance Using Hyperlinks (Cont.) s Connections to social networking theories that ranked prestige of people q E.g. the president of the U.S.A has a high prestige since many people know him q Someone known by multiple prestigious people has high prestige s Hub and authority based ranking q A hub is a page that stores links to many pages (on a topic) q An authority is a page that contains actual information on a topic q Each page gets a hub prestige based on prestige of authorities that it points to q Each page gets an authority prestige based on prestige of hubs that point to it q Again, prestige definitions are cyclic, and can be got by solving linear equations q Use authority prestige when ranking answers to a query Database System Concepts – 1 st Ed. 19.11 ©VNS InfoSolutions Private Limited, Varanasi(UP), India 22100
  • 12. Synonyms and Homonyms s Synonyms q E.g. document: “motorcycle repair”, query: “motorcycle maintenance”  need to realize that “maintenance” and “repair” are synonyms q System can extend query as “motorcycle and (repair or maintenance)” s Homonyms q E.g. “object” has different meanings as noun/verb q Can disambiguate meanings (to some extent) from the context s Extending queries automatically using synonyms can be problematic q Need to understand intended meaning in order to infer synonyms  Or verify synonyms with user q Synonyms may have other meanings as well Database System Concepts – 1 st Ed. 19.12 ©VNS InfoSolutions Private Limited, Varanasi(UP), India 22100
  • 13. Concept-Based Querying s Approach q For each word, determine the concept it represents from context q Use one or more ontologies:  Hierarchical structure showing relationship between concepts  E.g.: the ISA relationship that we saw in the E-R model s This approach can be used to standardize terminology in a specific field s Ontologies can link multiple languages s Foundation of the Semantic Web (not covered here) Database System Concepts – 1 st Ed. 19.13 ©VNS InfoSolutions Private Limited, Varanasi(UP), India 22100
  • 14. Indexing of Documents
      - An inverted index maps each keyword Ki to the set Si of documents that contain the keyword
          - Documents identified by identifiers
      - Inverted index may record
          - Keyword locations within document, to allow proximity-based ranking
          - Counts of number of occurrences of keyword, to compute TF
      - and operation: finds documents that contain all of K1, K2, ..., Kn
          - Intersection S1 ∩ S2 ∩ ... ∩ Sn
      - or operation: finds documents that contain at least one of K1, K2, ..., Kn
          - Union S1 ∪ S2 ∪ ... ∪ Sn
      - Each Si is kept sorted to allow efficient intersection/union by merging
          - “not” can also be efficiently implemented by merging of sorted lists
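The merge-based and, or, and not operations on sorted document-identifier lists can be sketched as below. The function names, the whitespace tokenization, and the three sample documents are assumptions made for illustration; keyword locations and occurrence counts are left out to keep the sketch short.

```python
from collections import defaultdict


def build_index(docs):
    """docs: dict doc_id -> text. Returns keyword -> sorted list of doc ids."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}


def intersect(a, b):
    """'and': ids present in both sorted lists, found by a single merge pass."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """'or': ids present in either sorted list, without duplicates."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def difference(a, b):
    """'a and not b': ids in sorted list a that do not appear in sorted list b."""
    i = j = 0
    out = []
    while i < len(a):
        if j >= len(b) or a[i] < b[j]:
            out.append(a[i])
            i += 1
        elif a[i] == b[j]:
            i += 1
            j += 1
        else:
            j += 1
    return out


docs = {1: "database systems", 2: "database design", 3: "operating systems"}
index = build_index(docs)
both = intersect(index["database"], index["systems"])      # [1]
either = union(index["database"], index["systems"])        # [1, 2, 3]
only_db = difference(index["database"], index["systems"])  # [2]
```

Each operation visits every list element at most once, which is why keeping the lists sorted pays off.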
  • 15. Measuring Retrieval Effectiveness
      - Information-retrieval systems save space by using index structures that support only approximate retrieval. This may result in:
          - false negative (false drop): some relevant documents may not be retrieved
          - false positive: some irrelevant documents may be retrieved
          - For many applications a good index should not permit any false drops, but may permit a few false positives
      - Relevant performance metrics:
          - precision: what percentage of the retrieved documents are relevant to the query
          - recall: what percentage of the documents relevant to the query were retrieved
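As a small worked example of the two metrics, the sketch below computes precision and recall from a set of retrieved document identifiers and a set of relevant ones. The function name and the sample identifiers are hypothetical, and the values are returned as fractions rather than percentages.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) as fractions between 0 and 1."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


retrieved = [1, 2, 3, 4]
relevant = [2, 4, 5]
precision, recall = precision_recall(retrieved, relevant)
false_positives = set(retrieved) - set(relevant)  # irrelevant but retrieved
false_negatives = set(relevant) - set(retrieved)  # relevant but missed (false drops)
```

Here two of the four retrieved documents are relevant (precision 0.5) and two of the three relevant documents were retrieved (recall 2/3).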
  • 16. Measuring Retrieval Effectiveness (Cont.)
      - Recall vs. precision tradeoff:
          - Can increase recall by retrieving many documents (down to a low level of relevance ranking), but many irrelevant documents would be fetched, reducing precision
      - Measures of retrieval effectiveness:
          - Recall as a function of number of documents fetched, or
          - Precision as a function of recall
              - Equivalently, as a function of number of documents fetched
          - E.g., “precision of 75% at recall of 50%, and 60% at a recall of 75%”
      - Problem: which documents are actually relevant, and which are not
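Precision and recall as functions of the number of documents fetched can be tabulated by walking down a ranked answer list. The sketch below assumes the set of relevant documents is already known, which is exactly the open problem the slide ends on; the function name and the sample ranking are hypothetical.

```python
def precision_recall_curve(ranked, relevant):
    """Return a list of (k, precision, recall) after fetching the top k documents."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("need at least one relevant document")
    curve = []
    hits = 0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        curve.append((k, hits / k, hits / len(relevant)))
    return curve


ranked = ["a", "b", "c", "d", "e"]   # answers in decreasing relevance rank
relevant = {"a", "c", "e", "f"}      # "f" is relevant but never retrieved
curve = precision_recall_curve(ranked, relevant)
```

With this sample data, fetching more documents raises recall from 0.25 to 0.75 while precision falls from 1.0 after the first document to 0.6 after the fifth.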
  • 17. Web Search Engines
      - Web crawlers are programs that locate and gather information on the Web
          - Recursively follow hyperlinks present in known documents, to find other documents
              - Starting from a seed set of documents
          - Fetched documents
              - Handed over to an indexing system
              - Can be discarded after indexing, or stored as a cached copy
      - Crawling the entire Web would take a very large amount of time
          - Search engines typically cover only a part of the Web, not all of it
          - A single crawl can take months
  • 18. Web Crawling (Cont.)
      - Crawling is done by multiple processes on multiple machines, running in parallel
          - Set of links to be crawled stored in a database
          - New links found in crawled pages added to this set, to be crawled later
      - Indexing process also runs on multiple machines
          - Creates a new copy of index instead of modifying old index
          - Old index is used to answer queries
          - After a crawl is “completed” new index becomes “old” index
      - Multiple machines used to answer queries
          - Indices may be kept in memory
          - Queries may be routed to different machines for load balancing
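The crawl loop described on the last two slides (start from a seed set, fetch a page, add newly found links to the set of links to be crawled) can be sketched in a single process. To stay self-contained, the sketch uses no network access: `TOY_WEB` is a hypothetical in-memory link graph, and `fetch_links` stands in for fetching and parsing a page. A production crawler would run many such workers in parallel and keep the link set in a database, as the slides note.

```python
from collections import deque


def crawl(seeds, fetch_links, max_pages=100):
    """Visit pages breadth-first from the seed set; return pages in crawl order."""
    frontier = deque(seeds)   # links waiting to be crawled
    seen = set(seeds)         # links already queued, to avoid repeat visits
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)     # a real crawler would hand the page to the indexer here
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


# Hypothetical toy "Web": page name -> pages it links to.
TOY_WEB = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
visited = crawl(["a"], lambda url: TOY_WEB.get(url, []))
```

The `max_pages` limit reflects the point that a crawl typically covers only part of the Web.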
  • 19. Information Retrieval and Structured Data
      - Information retrieval systems originally treated documents as a collection of words
      - Information extraction systems infer structure from documents, e.g.:
          - Extraction of house attributes (size, address, number of bedrooms, etc.) from a text advertisement
          - Extraction of topic and people named from a news article
      - Relations or XML structures used to store extracted data
          - System seeks connections among data to answer queries
          - Question answering systems
  • 20. Directories
      - Storing related documents together in a library facilitates browsing
          - Users can see not only the requested document but also related ones
      - Browsing is facilitated by a classification system that organizes logically related documents together
      - Organization is hierarchical: classification hierarchy
  • 21. A Classification Hierarchy For A Library System
  • 22. Classification DAG
      - Documents can reside in multiple places in a hierarchy in an information retrieval system, since physical location is not important
      - Classification hierarchy is thus a directed acyclic graph (DAG)
  • 23. A Classification DAG For A Library Information Retrieval System
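A classification DAG can be represented by recording, for each node, the list of its parents. The sketch below is only an illustration: the category names in `PARENTS` are hypothetical and are not taken from the figure on the slide. It shows how a node with several parents is reachable from more than one branch of the hierarchy.

```python
# Hypothetical categories: node -> list of parent categories.
PARENTS = {
    "books": [],
    "science": ["books"],
    "engineering": ["books"],
    "math": ["science"],
    "computer science": ["science", "engineering"],
    "graph algorithms": ["math", "computer science"],
}


def ancestors(node, parents=PARENTS):
    """Return every category above the node, following all parent links."""
    found = set()
    stack = list(parents.get(node, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(parents.get(parent, []))
    return found


above = ancestors("graph algorithms")
```

Because "graph algorithms" has two parents, a user browsing down from either "math" or "computer science" would reach it.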
  • 24. Web Directories
      - A Web directory is just a classification directory on Web pages
          - E.g., Yahoo! Directory, Open Directory project
          - Issues:
              - What should the directory hierarchy be?
              - Given a document, which nodes of the directory are categories relevant to the document?
          - Often done manually
              - Classification of documents into a hierarchy may be done based on term similarity
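Classification based on term similarity can be sketched by comparing a document's word counts with a short sample text for each category. The slide does not say which similarity measure is meant; cosine similarity over raw term counts is used here as one plausible choice. The names `cosine`, `classify`, and `CATEGORIES`, and the sample texts, are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counter objects)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, categories):
    """Return the category whose sample text is most similar to the document."""
    doc = Counter(text.lower().split())
    scores = {
        name: cosine(doc, Counter(sample.lower().split()))
        for name, sample in categories.items()
    }
    return max(scores, key=scores.get)


# Hypothetical directory nodes, each described by a few typical terms.
CATEGORIES = {
    "databases": "database query index transaction sql",
    "networks": "network router packet protocol tcp",
}
label = classify("query optimization in a database index", CATEGORIES)
```

A fuller treatment would weight terms (for example by TF-IDF) and could assign a document to several nodes, in keeping with the DAG structure described earlier.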
  • 25. End of Chapter