Fundamentals of Database Systems
Seventh Edition
Chapter 27
Introduction to
Information Retrieval and
Web Search
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
27.1 Information Retrieval (IR) Concepts (1 of 4)
• Information retrieval
– Process of retrieving documents from a collection in
response to a query (search request)
– Deals mainly with unstructured data
▪Example: homebuying contract documents
• Unstructured information
– Does not have a well-defined formal model
– Based on an understanding of natural language
– Stored in a wide variety of standard formats
Information Retrieval (IR) Concepts (2 of 4)
• Information retrieval field predates database field
– Academic programs in Library and Information
Science
• RDBMS vendors providing new capabilities to support
various data types
– Extended RDBMSs or object-relational database
management systems
• User’s information need expressed as free-form search
request
– Keyword search query
Information Retrieval (IR) Concepts (3 of 4)
• Characterizing an IR system
– Types of users
▪ Expert
▪ Layperson
– Types of data
▪Domain-specific
– Types of information needs
▪Navigational search
▪Informational search
▪Transactional search
Information Retrieval (IR) Concepts (4 of 4)
• Enterprise search systems
– Limited to an intranet
• Desktop search engines
– Search an individual computer system
• Databases have fixed schemas
– IR system has no fixed data model
Comparing Databases and IR Systems
Table 27.1 A comparison of databases and IR systems
Databases
• Structured data
• Schema driven
• Relational (or object, hierarchical, and network) model is predominant
• Structured query model
• Rich metadata operations
• Query returns data
• Results are based on exact matching (always correct)
IR Systems
• Unstructured data
• No fixed schema; various data models (e.g., vector space model)
• Free-form query models
• Rich data operations
• Search request returns list or pointers to documents
• Results are based on approximate matching and measures of effectiveness (may be imprecise and ranked)
A Brief History of IR
• Stone tablets and papyrus scrolls
• Printing press
• Public libraries
• Computers and automated storage systems
– Inverted file organization based on keywords and their
weights as indexing method
• Search engine
• Crawler
• Challenge: provide high quality, pertinent, timely information
Modes of Interactions in IR Systems
• Primary modes of interaction
– Retrieval
▪Extract relevant information from document
repository
– Browsing
▪Exploratory activity based on user’s assessment of
relevance
• Web search combines both interaction modes
– Rank of a web page measures its relevance to query
that generated the result set
Generic IR Pipeline (1 of 2)
• Statistical approach
– Documents analyzed and broken down into chunks of
text
– Each word or phrase is counted, weighted, and
measured for relevance or importance
• Types of statistical approaches
– Boolean
– Vector space
– Probabilistic
Generic IR Pipeline (2 of 2)
• Semantic approaches
– Use knowledge-based retrieval techniques
– Rely on syntactic, lexical, sentential, discourse-based,
and pragmatic levels of knowledge understanding
– Also apply some form of statistical analysis
Figure 27.1 Generic IR Framework
Figure 27.2 Simplified IR Process Pipeline
27.2 Retrieval Models (1 of 5)
• Boolean model
– One of the earliest and simplest IR models
– Documents represented as a set of terms
– Queries formulated using AND, OR, and NOT
– Retrieved documents are an exact match
▪No notion of ranking of documents
– Easy to associate metadata information and write
queries that match contents of documents
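As a minimal sketch of the Boolean model, assuming a toy three-document collection invented for illustration, each document can be held as a set of terms, with AND, OR, and NOT mapped onto set intersection, union, and difference. The helper name `containing` is hypothetical.

```python
# Toy collection (hypothetical data): each document is a set of terms.
docs = {
    1: {"database", "index", "query"},
    2: {"web", "search", "query"},
    3: {"database", "web", "crawler"},
}

all_ids = set(docs)


def containing(term):
    """Return the IDs of documents whose term set contains `term`."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}


# AND -> intersection, OR -> union, NOT -> complement against the collection.
and_result = containing("database") & containing("web")
or_result = containing("index") | containing("crawler")
not_result = all_ids - containing("query")

print(sorted(and_result))  # documents containing both terms
print(sorted(or_result))   # documents containing either term
print(sorted(not_result))  # documents that do not contain "query"
```

Every result is an unordered set of exact matches, which is why the model has no notion of ranking.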
Retrieval Models (2 of 5)
• Vector space model
– Weighting, ranking, and determining relevance are
possible
– Uses individual terms as dimensions
– Each document represented by an n-dimensional
vector of values
– Features
▪Subset of terms in a document set that are deemed
most relevant to an IR search for the document set
Retrieval Models (3 of 5)
• Vector space model
– Different similarity assessment functions can be used
• Term frequency-inverse document frequency (TF-IDF)
– Statistical weight measure used to evaluate the
importance of a document word in a collection of
documents
– A discriminating term must occur in only a few
documents in the general population
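The vector space model and TF-IDF weighting can be illustrated with a small sketch. It assumes one common formulation (relative term frequency multiplied by log(N/n)) and cosine similarity as the similarity function; other weightings and similarity functions are equally valid. The documents and the names `idf`, `tfidf_vector`, and `cosine` are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy collection, already tokenized.
docs = {
    "d1": "database systems store structured data".split(),
    "d2": "web search engines rank web pages".split(),
    "d3": "search engines index unstructured data".split(),
}


def idf(term):
    """log(N / n): N documents in total, n of them containing the term."""
    n = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / n) if n else 0.0


def tfidf_vector(tokens):
    """Map each term to (relative term frequency) * idf."""
    counts = Counter(tokens)
    return {t: (c / len(tokens)) * idf(t) for t, c in counts.items()}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


query = tfidf_vector("web search".split())
ranking = sorted(
    docs, key=lambda d: cosine(query, tfidf_vector(docs[d])), reverse=True
)
print(ranking)  # document IDs, most similar to the query first
```

In this collection "web" occurs in one document and "search" in two, so "web" receives the larger IDF and is the more discriminating term.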
Retrieval Models (4 of 5)
• Probabilistic model
– Involves ranking documents by their estimated
probability of relevance with respect to the query and
the document
– IR system must decide whether a document belongs
to the relevant set or nonrelevant set for a query
▪Calculate probability that document belongs to the
relevant set
– BM25: a popular ranking algorithm
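A rough sketch of BM25 scoring follows. The slides give no formula, so this assumes the widely used Okapi form with a smoothed IDF, and the parameter values `K1 = 1.5` and `B = 0.75` are common defaults chosen here as assumptions. The collection is hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy collection, already tokenized.
docs = {
    "d1": "database systems store structured data".split(),
    "d2": "web search engines rank web pages".split(),
    "d3": "search engines index unstructured data".split(),
}
K1, B = 1.5, 0.75  # assumed defaults, not values taken from the slides
avg_len = sum(len(tokens) for tokens in docs.values()) / len(docs)


def bm25(query_terms, tokens):
    """Sum, over query terms, of idf * saturated and length-normalized tf."""
    counts = Counter(tokens)
    score = 0.0
    for term in query_terms:
        n = sum(1 for other in docs.values() if term in other)
        idf = math.log(1 + (len(docs) - n + 0.5) / (n + 0.5))
        tf = counts[term]
        # Longer-than-average documents are penalized through B.
        norm = tf + K1 * (1 - B + B * len(tokens) / avg_len)
        score += idf * tf * (K1 + 1) / norm
    return score


scores = {doc_id: bm25(["web", "search"], tokens) for doc_id, tokens in docs.items()}
for doc_id in sorted(scores, key=scores.get, reverse=True):
    print(doc_id, round(scores[doc_id], 3))
```

Unlike the Boolean model, the output is a score per document, so documents can be ordered by estimated relevance.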
Retrieval Models (5 of 5)
• Semantic model
– Morphological analysis
▪Analyze roots and affixes to determine parts of speech
of search words
– Syntactic analysis
▪Parse and analyze complete phrases in documents
– Semantic analysis
▪Resolve word ambiguities and generate relevant
synonyms based on semantic relationships
– Uses techniques from artificial intelligence and expert
systems
27.3 Types of Queries in IR Systems (1 of 4)
• Keyword queries
– Simplest and most commonly used
– Keyword terms implicitly connected by logical AND
• Boolean queries
– Allow use of AND, OR, NOT, and other operators
– Exact matches returned
▪No ranking possible
Types of Queries in IR Systems (2 of 4)
• Phrase queries
– Sequence of words that make up a phrase
– Phrase enclosed in double quotes
– Each retrieved document must contain at least one
instance of the exact phrase
• Proximity queries
– Account for how close multiple search terms are to
each other within a record
– Phrase search is most commonly used proximity
query
Types of Queries in IR Systems (3 of 4)
• Proximity queries
– Specify order of search terms
– NEAR, ADJ (adjacent), or AFTER operators
– Sequence of words with maximum allowed distance
between them
– Computationally expensive
▪Suitable for smaller document collections rather
than the Web
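Phrase and proximity matching can be sketched over a tokenized document as below. The functions `within` and `phrase_match` are hypothetical helpers, not operators of any particular search engine; `within` loosely mimics a NEAR/AFTER-style check with a maximum allowed distance.

```python
def positions(tokens, term):
    """All positions at which `term` occurs in the token list."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def within(tokens, first, second, max_gap, ordered=True):
    """True if `second` occurs within `max_gap` positions of `first`.

    With ordered=True, `second` must appear after `first`.
    """
    for i in positions(tokens, first):
        for j in positions(tokens, second):
            gap = j - i if ordered else abs(j - i)
            if 0 < gap <= max_gap:
                return True
    return False


def phrase_match(tokens, phrase):
    """True if the exact word sequence `phrase` occurs in `tokens`."""
    k = len(phrase)
    return any(tokens[i:i + k] == phrase for i in range(len(tokens) - k + 1))


doc = "information retrieval systems rank documents for web search".split()

print(phrase_match(doc, ["web", "search"]))               # exact phrase
print(within(doc, "information", "systems", 2))           # ordered, gap <= 2
print(within(doc, "search", "web", 1))                    # wrong order
print(within(doc, "search", "web", 1, ordered=False))     # order ignored
```

The nested loop over positions hints at why proximity queries are computationally expensive: the work grows with the number of occurrences of each term in each candidate document.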
Types of Queries in IR Systems (4 of 4)
• Wildcard queries
– Supports regular expressions and pattern-based
matching
▪Example ‘data*’ would retrieve data, database,
dataset, etc.
– Not generally implemented by Web search engines
• Natural language queries
– Definitions of textual terms or common facts
– Semantic models can support these queries
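The wildcard example ‘data*’ can be tried against a small vocabulary using shell-style pattern matching from Python's standard `fnmatch` module. The vocabulary and the helper `expand` are assumptions made for illustration; a real system would match against the index vocabulary.

```python
from fnmatch import fnmatchcase

# Hypothetical index vocabulary.
vocabulary = ["data", "database", "dataset", "date", "metadata"]


def expand(pattern):
    """Return the vocabulary terms matched by a shell-style wildcard pattern."""
    return [term for term in vocabulary if fnmatchcase(term, pattern)]


print(expand("data*"))   # terms that start with "data"
print(expand("*data"))   # terms that end with "data"
```

The expanded term list would then be searched as an OR of the matching terms.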
27.4 Text Preprocessing (1 of 3)
• Stopword removal must be performed before indexing
• Stopwords
– Words that are expected to occur in 80% or more of
the documents of a collection
▪Examples: the, of, to, a, and, said, for, that
– Do not contribute much to relevance
• Queries preprocessed for stopword removal before
retrieval process
– Many search engines do not remove stopwords
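The 80% criterion for stopwords can be turned into a small sketch: count the fraction of documents in which each word occurs and treat words at or above the threshold as stopwords. The five-document collection and the function names are hypothetical; production systems often use a fixed stopword list instead.

```python
from collections import Counter

# Hypothetical toy collection, already tokenized.
docs = [
    "the index of the database".split(),
    "the crawler of the web".split(),
    "the ranking of the results".split(),
    "the terms of the query".split(),
    "a stemmer for the query".split(),
]
THRESHOLD = 0.8  # the 80% document-frequency cutoff


def find_stopwords(documents, threshold=THRESHOLD):
    """Words occurring in at least `threshold` of the documents."""
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word once per document
    return {t for t, n in doc_freq.items() if n / len(documents) >= threshold}


stopwords = find_stopwords(docs)


def remove_stopwords(tokens):
    """Drop stopwords; the same step would be applied to queries."""
    return [t for t in tokens if t not in stopwords]


print(sorted(stopwords))
print(remove_stopwords(docs[0]))
```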
Text Preprocessing (2 of 3)
• Stemming
– Trims suffix and prefix
– Reduces the different forms of the word to a common
stem
– Martin Porter’s stemming algorithm
• Utilizing a thesaurus
– Important concepts and main words that describe
each concept for a particular knowledge domain
– Collection of synonyms
– UMLS
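To make the idea of stemming concrete, here is a deliberately crude suffix stripper. It is not Martin Porter's algorithm, which applies several ordered rule phases with conditions on the remaining stem; the suffix list and the minimum stem length of three characters are assumptions chosen only for illustration.

```python
# Illustrative suffix list; far smaller than Porter's rule set.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ["search", "searches", "searched", "searching", "indexes"]:
    print(word, "->", crude_stem(word))
```

All four forms of "search" reduce to the same stem, so a query using one form can match documents using another.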
Figure 27.3 A Portion of the UMLS Semantic
Network: “Biologic Function” Hierarchy
Source: UMLS Reference Manual, National Library of Medicine
Text Preprocessing (3 of 3)
• Other preprocessing steps
– Digits
▪May or may not be removed during preprocessing
– Hyphens and punctuation marks
▪Handled in different ways
– Cases
▪Most search engines use case-insensitive search
• Information extraction tasks
– Identifying noun phrases, facts, events, people,
places, and relationships
27.5 Inverted Indexing (1 of 3)
• Inverted index structure
– Vocabulary information
▪Set of distinct query terms in the document set
– Document information
– Data structure that associates each distinct term with
a list of all documents that contain the term
Inverted Indexing (2 of 3)
• Construction of an inverted index
– Break documents into vocabulary terms
▪Tokenizing, cleansing, removing stopwords,
stemming, and/or using a thesaurus
– Collect document statistics
▪Store statistics in document lookup table
– Invert the document-term stream into a
term-document stream
▪Add additional information such as term
frequencies, term positions, and term weights
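The construction steps above can be sketched as follows, assuming whitespace tokenization, lowercasing, and a tiny stopword list as stand-ins for a full preprocessing chain (no stemming or thesaurus). The names `build_index` and `doc_table` are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy collection.
docs = {
    1: "The web crawler fetches web pages",
    2: "The index maps terms to pages",
}
STOPWORDS = {"the", "to"}  # tiny illustrative list


def build_index(documents):
    """Return (index, doc_table).

    index:     term -> {doc_id: [positions]}, the term-document stream
    doc_table: doc_id -> number of indexed terms (a simple document statistic)
    """
    index = defaultdict(dict)
    doc_table = {}
    for doc_id, text in documents.items():
        # Step 1: break the document into vocabulary terms.
        tokens = [t for t in text.lower().split() if t not in STOPWORDS]
        # Step 2: collect document statistics in a lookup table.
        doc_table[doc_id] = len(tokens)
        # Step 3: invert (document, term) pairs into term -> postings.
        # Positions are counted after stopword removal.
        for pos, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(pos)
    return index, doc_table


index, doc_table = build_index(docs)
for term in sorted(index):
    print(term, index[term])
```

The term frequency of a term in a document is the length of its position list, so the positions carry both pieces of additional information.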
Figure 27.4 Example of an Inverted Index
Inverted Indexing (3 of 3)
• Searching for relevant documents from an inverted index
– Vocabulary search
– Document information retrieval
– Manipulation of retrieved information
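The three search steps can be sketched against a small hand-built index. The postings, the AND semantics, and the ranking by summed term frequency are all assumptions for illustration; real engines use richer scoring.

```python
# Hypothetical inverted index: term -> {doc_id: term frequency}.
index = {
    "web": {1: 2, 3: 1},
    "search": {1: 1, 2: 1, 3: 3},
    "database": {2: 2},
}


def search(query_terms):
    """Return IDs of documents containing all query terms, best first."""
    # 1. Vocabulary search: look up each query term in the vocabulary.
    postings = [index[t] for t in query_terms if t in index]
    if not query_terms or len(postings) < len(query_terms):
        return []  # an unknown term cannot be satisfied under AND
    # 2. Document information retrieval: intersect the posting lists.
    common = set(postings[0]).intersection(*postings[1:])
    # 3. Manipulation of retrieved information: rank by total term frequency,
    #    breaking ties by document ID.
    return sorted(common, key=lambda d: (-sum(p[d] for p in postings), d))


print(search(["web", "search"]))
print(search(["database", "web"]))
```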
Introduction to Lucene
• Lucene: open source indexing/search engine
– Indexing is primary focus
• Document composed of set of fields
– Chunks of untokenized text
– Series of processed lexical units called token streams
▪Created by tokenization and filtering algorithms
• Highly-configurable search API
• Ease of indexing large, unstructured document
collections
27.6 Evaluation Measures of Search Relevance (1 of 4)
• Topical relevance
– Measures how well the topic of a result matches the topic of the query
• User relevance
– Describes ‘goodness’ of retrieved result with regard to
user’s information need
• Web information retrieval
– No binary classification made for relevance or
nonrelevance
– Ranking of documents
Evaluation Measures of Search Relevance (2 of 4)
• Recall
– Number of relevant documents retrieved by a search
divided by the total number of actually relevant
documents existing in the database
• Precision
– Number of relevant documents retrieved by a search
divided by total number of documents retrieved by
that search
Retrieved Versus Relevant Search Results
• TP: true positive
• FP: false positive
• TN: true negative
• FN: false negative
Figure 27.5 Retrieved versus relevant search results
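Recall and precision can be restated in terms of the four counts above: precision = TP / (TP + FP) and recall = TP / (TP + FN). The sketch below computes them from sets of document IDs; the collection, the retrieved set, and the relevant set are invented for illustration.

```python
# Hypothetical collection of ten documents, identified by number.
collection = set(range(1, 11))
retrieved = {1, 2, 3, 4, 5}   # what the search returned
relevant = {2, 4, 6, 8}       # what is actually relevant

tp = len(retrieved & relevant)               # relevant and retrieved
fp = len(retrieved - relevant)               # retrieved but not relevant
fn = len(relevant - retrieved)               # relevant but missed
tn = len(collection - retrieved - relevant)  # neither retrieved nor relevant

precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tp, fp, fn, tn)
print(precision, recall)
```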
Evaluation Measures of Search Relevance (3 of 4)
• Recall can be increased by presenting more results to the user
– May decrease the precision
Doc. No.   Rank Position i   Relevant   Precision(i)    Recall(i)
10         1                 Yes        1/1 = 100%      1/10 = 10%
2          2                 Yes        2/2 = 100%      2/10 = 20%
3          3                 Yes        3/3 = 100%      3/10 = 30%
5          4                 No         3/4 = 75%       3/10 = 30%
17         5                 No         3/5 = 60%       3/10 = 30%
34         6                 No         3/6 = 50%       3/10 = 30%
215        7                 Yes        4/7 = 57.1%     4/10 = 40%
33         8                 Yes        5/8 = 62.5%     5/10 = 50%
45         9                 No         5/9 = 55.6%     5/10 = 50%
16         10                Yes        6/10 = 60%      6/10 = 60%
Table 27.2 Precision and recall for ranked retrieval
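The values in Table 27.2 can be reproduced from the Relevant column alone, given that the collection contains 10 relevant documents in total (the denominator used in the Recall(i) column). The function name below is hypothetical.

```python
# Relevance judgments for rank positions 1..10, taken from Table 27.2.
judgments = [True, True, True, False, False, False, True, True, False, True]
TOTAL_RELEVANT = 10  # denominator of the Recall(i) column


def precision_recall_at_ranks(judgments, total_relevant):
    """Return a list of (rank, precision, recall) tuples."""
    rows = []
    hits = 0
    for rank, is_relevant in enumerate(judgments, start=1):
        hits += is_relevant  # True counts as 1
        rows.append((rank, hits / rank, hits / total_relevant))
    return rows


rows = precision_recall_at_ranks(judgments, TOTAL_RELEVANT)
for rank, precision, recall in rows:
    print(f"{rank:2d}  precision={precision:.1%}  recall={recall:.0%}")
```

Recall never decreases as more results are shown, while precision drops each time a nonrelevant document is added.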
Evaluation Measures of Search Relevance (4 of 4)
• Average precision
– Computed based on the precision at each relevant
document in the ranking
• Recall/precision curve
– Based on the recall and precision values at each rank
position
▪x-axis is recall and y-axis is precision
• F-score
– Harmonic mean of the precision (p) and recall (r)
values
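Average precision and the F-score can be sketched as below, using the relevance judgments of Table 27.2. The average here is taken over the relevant documents that appear in the ranking, which is one common convention and an assumption of this sketch; some definitions divide by the total number of relevant documents instead. The F-score is F = 2pr / (p + r).

```python
# Relevance judgments for rank positions 1..10, taken from Table 27.2.
judgments = [True, True, True, False, False, False, True, True, False, True]


def average_precision(judgments):
    """Mean of the precision values at each relevant rank position."""
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(judgments, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def f_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if p + r else 0.0


ap = average_precision(judgments)
print(round(ap, 4))
print(round(f_score(0.75, 0.30), 4))  # precision and recall at rank 4
```

Because it is a harmonic mean, the F-score is pulled toward the smaller of the two values, so a system cannot score well by maximizing only one of them.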
27.7 Web Search and Analysis (1 of 8)
• Search engines must crawl and index Web sites and
document collections
– Regularly update indexes
– Link analysis used to identify page importance
• Vertical search engines
– Customized topic-specific search engines that crawl
and index a specific collection of documents on the
Web
Web Search and Analysis (2 of 8)
• Metasearch engines
– Query different search engines simultaneously and
aggregate information
• Digital libraries
– Collections of electronic resources and services for
the delivery of materials in a variety of formats
• Web analysis
– Applies data analysis techniques to discover and
analyze useful information from the Web
Web Search and Analysis (3 of 8)
• Goals of Web analysis
– Finding relevant information
– Personalization of the information
– Finding information of social value
• Categories of Web analysis
– Web structure analysis
– Web content analysis
– Web usage analysis
Web Search and Analysis (4 of 8)
• Web structure analysis
– Hyperlink
– Destination page
– Anchor text
– Hub
– Authority
• PageRank ranking algorithm
– Used by Google
– Analyzes forward links and backlinks
▪Highly linked pages are more important
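The intuition that highly linked pages are more important can be illustrated with a simplified PageRank computed by power iteration. This is a textbook-style sketch, not Google's production algorithm; the four-page link graph is invented, and the damping factor of 0.85 and the iteration count are assumed values.

```python
# Hypothetical link graph: page -> pages it links to (forward links).
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}


def pagerank(links, damping=0.85, iterations=50):
    """Iteratively redistribute rank along forward links."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {page: (1 - damping) / n for page in pages}
        for page, outgoing in links.items():
            # A page without forward links spreads its rank over all pages.
            targets = outgoing or pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


rank = pagerank(links)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

Page C has three backlinks and ends up with the highest rank, while D, which no page links to, keeps only the baseline share.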
Web Search and Analysis (5 of 8)
• Web content analysis tasks
– Structured data extraction
▪Wrapper
– Web information integration
▪Web query interface integration
▪Schema matching
▪Ontology-based information integration
– Building concept hierarchies
– Segmenting web pages and detecting noise
Web Search and Analysis (6 of 8)
• Approaches to Web content analysis
– Agent-based
▪Intelligent Web agents
▪Personalized Web agents
▪Information filtering/categorization
– Database-based
▪Attempts to organize a Web site as a database
▪Object Exchange Model
▪Multilevel database
▪Web query system
Web Search and Analysis (7 of 8)
• Web usage analysis attempts to discover usage patterns
from Web data
– Preprocessing
▪Usage, content, structure
– Pattern discovery
▪Statistical analysis, association rules, clustering,
classification, sequential patterns, dependency
modeling
– Pattern analysis
▪Filter out patterns not of interest
Web Search and Analysis (8 of 8)
• Practical applications of Web analysis
– Web analytics
▪Understand and optimize the performance of Web
usage
– Web spamming
▪Deliberate activity to promote a page by
manipulating search engine results
– Web security
▪Allow design of more robust Web sites
– Web crawlers
27.8 Trends in Information Retrieval (1 of 3)
• Faceted search
– Classifying content
• Social search
– Collaborative social search
• Conversational information access
– Intelligent agents perform intent extraction to provide
information relevant to a conversation
• Probabilistic topic modeling
– Automatically organize large collections of documents into
relevant themes
Trends in Information Retrieval (2 of 3)
Figure 27.6 A document D and its topic proportions
Trends in Information Retrieval (3 of 3)
• Question-answering systems
– Factoid questions
– List questions
– Definition questions
– Opinion questions
– Composed of question analysis, query generation,
search, candidate answer generation, and answer
scoring
27.9 Summary
• Information retrieval mainly targeted at unstructured data
• Query and browsing modes of interaction
• Retrieval models
– Boolean, vector space, probabilistic, and semantic
• Text preprocessing
• Web search
• Web ranking
• Trends
Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
Copyright

More Related Content

PPTX
Chapter 1 Intro Information Rerieval.pptx
PPTX
PPTX
2017 biological databases_part1_vupload
PPT
Information Retrieval QueryLanguageOperation.ppt
PDF
Text databases and information retrieval
PPTX
Best practices data collection
PPTX
Systematic_Literature_Search_Bramer.pptx
Chapter 1 Intro Information Rerieval.pptx
2017 biological databases_part1_vupload
Information Retrieval QueryLanguageOperation.ppt
Text databases and information retrieval
Best practices data collection
Systematic_Literature_Search_Bramer.pptx

Similar to Chapter27 distributed database syst.pptx (20)

PDF
Jonathan Breeze, Symplectic
PDF
BLC & Digital Science: Jonathan Breeze, Symplectic
PPTX
Chapter 1 - Introduction to IR Information retrieval ch1 Information retrieva...
PPTX
Eureka, I found it! - Special Libraries Association 2021 Presentation
PPTX
Query formulation process
PPT
A Framework for Ontology Usage Analysis
PPTX
2020 02 11_biological_databases_part1
PDF
Language Models for Information Retrieval
PPT
INTRODUCTION TO INFORMATION RETRIEVALChapter 1-IR.ppt
PPTX
Introduction to Information Retrieval (concepts and principles)
PPTX
Entrez databases
PDF
Semantics-enhanced Geoscience Interoperability, Analytics, and Applications
PDF
Qualitative Studies in Software Engineering - Interviews, Observation, Ground...
PPTX
empirical-SLR.pptx
PPTX
Database tia11
PPTX
Introduction to Information retrieval system-.pptx
PDF
Profile-based Dataset Recommendation for RDF Data Linking
PDF
PPT
Dr. N K Swain’s research prescription for LIS novices
PDF
Architecture of an ontology based domain-specific natural language question a...
Jonathan Breeze, Symplectic
BLC & Digital Science: Jonathan Breeze, Symplectic
Chapter 1 - Introduction to IR Information retrieval ch1 Information retrieva...
Eureka, I found it! - Special Libraries Association 2021 Presentation
Query formulation process
A Framework for Ontology Usage Analysis
2020 02 11_biological_databases_part1
Language Models for Information Retrieval
INTRODUCTION TO INFORMATION RETRIEVALChapter 1-IR.ppt
Introduction to Information Retrieval (concepts and principles)
Entrez databases
Semantics-enhanced Geoscience Interoperability, Analytics, and Applications
Qualitative Studies in Software Engineering - Interviews, Observation, Ground...
empirical-SLR.pptx
Database tia11
Introduction to Information retrieval system-.pptx
Profile-based Dataset Recommendation for RDF Data Linking
Dr. N K Swain’s research prescription for LIS novices
Architecture of an ontology based domain-specific natural language question a...
Ad

More from ubaidullah75790 (20)

PPTX
Chapter20 transaction processing system .pptx
PPTX
Chapter22 database security in dbms.pptx
PPTX
File Organization in database management.pptx
PPTX
transaction processing databse management.pptx
PPT
physical database design distributed .ppt
PPT
module03-ipaddr ipv6 addressing in net.ppt
PPT
PDBD- Part2 physical database design.ppt
PPT
Physical_Design system development life.PPT
PPT
S3 application and network attacks in.ppt
PPT
Chapter 5 cyber security in computer.ppt
PPTX
1606802425-dba-w7 database management.pptx
PPT
ENCh18 database management system ss.ppt
PPT
Chapter07 database system in computer.ppt
PPT
Chapter05 database sytem in computer . ppt
PPT
Chapter04 database system in computer.ppt
PPT
Chapter03 database system in computer.ppt
PPT
Chapter02 database system in computer.ppt
PPT
Chapter01 database system in computer.ppt
PPT
MYCH8 database management system in .ppt
PPT
ch1 database management system in data.ppt
Chapter20 transaction processing system .pptx
Chapter22 database security in dbms.pptx
File Organization in database management.pptx
transaction processing databse management.pptx
physical database design distributed .ppt
module03-ipaddr ipv6 addressing in net.ppt
PDBD- Part2 physical database design.ppt
Physical_Design system development life.PPT
S3 application and network attacks in.ppt
Chapter 5 cyber security in computer.ppt
1606802425-dba-w7 database management.pptx
ENCh18 database management system ss.ppt
Chapter07 database system in computer.ppt
Chapter05 database sytem in computer . ppt
Chapter04 database system in computer.ppt
Chapter03 database system in computer.ppt
Chapter02 database system in computer.ppt
Chapter01 database system in computer.ppt
MYCH8 database management system in .ppt
ch1 database management system in data.ppt
Ad

Recently uploaded (20)

PDF
2.FourierTransform-ShortQuestionswithAnswers.pdf
PPTX
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
PDF
01-Introduction-to-Information-Management.pdf
PDF
RMMM.pdf make it easy to upload and study
PPTX
PPH.pptx obstetrics and gynecology in nursing
PDF
Insiders guide to clinical Medicine.pdf
PDF
Basic Mud Logging Guide for educational purpose
PPTX
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
PDF
TR - Agricultural Crops Production NC III.pdf
PDF
Classroom Observation Tools for Teachers
PDF
Supply Chain Operations Speaking Notes -ICLT Program
PPTX
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
PDF
BÀI TẬP BỔ TRỢ 4 KỸ NĂNG TIẾNG ANH 9 GLOBAL SUCCESS - CẢ NĂM - BÁM SÁT FORM Đ...
PDF
Black Hat USA 2025 - Micro ICS Summit - ICS/OT Threat Landscape
PDF
Pre independence Education in Inndia.pdf
PPTX
Pharma ospi slides which help in ospi learning
PPTX
Lesson notes of climatology university.
PPTX
Institutional Correction lecture only . . .
PPTX
human mycosis Human fungal infections are called human mycosis..pptx
PPTX
Final Presentation General Medicine 03-08-2024.pptx
2.FourierTransform-ShortQuestionswithAnswers.pdf
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
01-Introduction-to-Information-Management.pdf
RMMM.pdf make it easy to upload and study
PPH.pptx obstetrics and gynecology in nursing
Insiders guide to clinical Medicine.pdf
Basic Mud Logging Guide for educational purpose
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
TR - Agricultural Crops Production NC III.pdf
Classroom Observation Tools for Teachers
Supply Chain Operations Speaking Notes -ICLT Program
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
BÀI TẬP BỔ TRỢ 4 KỸ NĂNG TIẾNG ANH 9 GLOBAL SUCCESS - CẢ NĂM - BÁM SÁT FORM Đ...
Black Hat USA 2025 - Micro ICS Summit - ICS/OT Threat Landscape
Pre independence Education in Inndia.pdf
Pharma ospi slides which help in ospi learning
Lesson notes of climatology university.
Institutional Correction lecture only . . .
human mycosis Human fungal infections are called human mycosis..pptx
Final Presentation General Medicine 03-08-2024.pptx

Chapter27 distributed database syst.pptx

  • 1. Fundamentals of Database Systems Seventh Edition Chapter 27 Introduction to Information Retrieval and Web Search Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved
  • 2. Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved 27.1 Information Retrieval (IR) Concepts (1 of 4) • Information retrieval – Process of retrieving documents from a collection in response to a query (search request) – Deals mainly with unstructured data ▪Example: homebuying contract documents • Unstructured information – Does not have a well-defined formal model – Based on an understanding of natural language – Stored in a wide variety of standard formats
  • 3. Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved Information Retrieval (IR) Concepts (2 of 4) • Information retrieval field predates database field – Academic programs in Library and Information Science • RDBMS vendors providing new capabilities to support various data types – Extended RDBMSs or object-relational database management systems • User’s information need expressed as free-form search request – Keyword search query
  • 4. Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved Information Retrieval (IR) Concepts (3 of 4) • Characterizing an IR system – Types of users ▪ Expert ▪ Layperson – Types of data ▪Domain-specific – Types of information needs ▪Navigational search ▪Informational search ▪Transactional search
  • 5. Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved Information Retrieval (IR) Concepts (4 of 4) • Enterprise search systems – Limited to an intranet • Desktop search engines – Searches an individual computer system • Databases have fixed schemas – IR system has no fixed data model
  • 6. Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved Comparing Databases and IR Systems Table 27.1 A comparison of databases and IR systems Databases IR Systems • Structured data • Unstructured data • Schema driven • No fixed schema; various data models (e.g., vector space model) • Relational (or object, hierarchical, and network) model is predominant • Free-form query models • Structured query model • Rich data operations • Rich metadata operations • Search request returns list or pointers to documents • Query returns data Blank • Results are based on exact matching (always correct) • Results are based on approximate matching and measures of effectiveness (may be imprecise and ranked)
  • 7. Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved A Brief History of IR • Stone tablets and papyrus scrolls • Printing press • Public libraries • Computers and automated storage systems – Inverted file organization based on keywords and their weights as indexing method • Search engine • Crawler • Challenge: provide high quality, pertinent, timely information
  • 8. Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved Modes of Interactions in IR Systems • Primary modes of interaction – Retrieval ▪Extract relevant information from document repository – Browsing ▪Exploratory activity based on user’s assessment of relevance • Web search combines both interaction modes – Rank of a web page measures its relevance to query that generated the result set
  • 9. Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved Generic IR Pipeline (1 of 2) • Statistical approach – Documents analyzed and broken down into chunks of text – Each word or phrase is counted, weighted, and measured for relevance or importance • Types of statistical approaches – Boolean – Vector space – Probabilistic
  • 10. Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved Generic IR Pipeline (2 of 2) • Semantic approaches – Use knowledge-based retrieval techniques – Rely on syntactic, lexical, sentential, discourse-based, and pragmatic levels of knowledge understanding – Also apply some form of statistical analysis
  • 11. Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved Figure 27.1 Generic IR Framework
  • 12. Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved Figure 27.2 Simplified IR Process Pipeline
  • 13. Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved 27.2 Retrieval Models (1 of 5) • Boolean model – One of earliest and simplest IR models – Documents represented as a set of terms – Queries formulated using AND, OR, and NOT – Retrieved documents are an exact match ▪No notion of ranking of documents – Easy to associate metadata information and write queries that match contents of documents
  • 14. Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved Retrieval Models (2 of 5) • Vector space model – Weighting, ranking, and determining relevance are possible – Uses individual terms as dimensions – Each document represented by an n-dimensional vector of values – Features ▪Subset of terms in a document set that are deemed most relevant to an IR search for the document set
  • 15. Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved Retrieval Models (3 of 5) • Vector space model – Different similarity assessment functions can be used • Term frequency-inverse document frequency (TF-IDF) – Statistical weight measure used to evaluate the importance of a document word in a collection of documents – A discriminating term must occur in only a few documents in the general population
  • 16. Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved Retrieval Models (4 of 5) • Probabilistic model – Involves ranking documents by their estimated probability of relevance with respect to the query and the document – IR system must decide whether a document belongs to the relevant set or nonrelevant set for a query ▪Calculate probability that document belongs to the relevant set – BM25: a popular ranking algorithm
  • 17. Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved Retrieval Models (5 of 5) • Semantic model – Morphological analysis ▪Analyze roots and affixes to determine parts of speech of search words – Syntactic analysis ▪Parse and analyze complete phrases in documents – Semantic analysis ▪Resolve word ambiguities and generate relevant synonyms based on semantic relationships – Uses techniques from artificial intelligence and expert systems
  • 18. Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved 27.3 Types of Queries in IR Systems (1 of 4) • Keyword queries – Simplest and most commonly used – Keyword terms implicitly connected by logical AND • Boolean queries – Allow use of AND, OR, NOT, and other operators – Exact matches returned ▪No ranking possible
  • 19. Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved Types of Queries in IR Systems (2 of 4) • Phrase queries – Sequence of words that make up a phrase – Phrase enclosed in double quotes – Each retrieved document must contain at least one instance of the exact phrase • Proximity queries – How close within a record multiple search terms are to each other – Phrase search is most commonly used proximity query
  • 20. Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved Types of Queries in IR Systems (3 of 4) • Proximity queries – Specify order of search terms – NEAR, ADJ (adjacent), or AFTER operators – Sequence of words with maximum allowed distance between them – Computationally expensive ▪Suitable for smaller document collections rather than the Web
  • 21. Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved Types of Queries in IR Systems (4 of 4) • Wildcard queries – Supports regular expressions and pattern-based matching ▪Example ‘data*’ would retrieve data, database, dataset, etc. – Not generally implemented by Web search engines • Natural language queries – Definitions of textual terms or common facts – Semantic models can support
  • 22. Copyright © 2016, 2011, 2007 Pearson Education, Inc. All Rights Reserved 27.4 Text Preprocessing (1 of 3) • Stopword removal must be performed before indexing • Stopwords – Words that are expected to occur in 80% or more of the documents of a collection ▪Examples: the, of, to, a, and, said, for, that – Do not contribute much to relevance • Queries preprocessed for stopword removal before retrieval process – Many search engines do not remove stopwords
  • 23. Text Preprocessing (2 of 3) • Stemming – Trims suffix and prefix – Reduces the different forms of a word to a common stem – Example: Martin Porter’s stemming algorithm • Utilizing a thesaurus – Important concepts and main words that describe each concept for a particular knowledge domain – Collection of synonyms – Example: UMLS (Unified Medical Language System)
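To show what "reducing word forms to a common stem" means, here is a deliberately crude suffix stripper. It is not Porter's algorithm, which applies ordered rewrite rules with conditions on the remaining stem; the suffix list and `min_stem` parameter are assumptions chosen for the example.

```python
# Illustrative suffix list; a suffix of another entry is listed after it.
SUFFIXES = ("ations", "ation", "ing", "ers", "er", "ed", "es", "s")

def crude_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

stems = {w: crude_stem(w) for w in ["indexing", "indexed", "indexes", "indexers", "index"]}
```

All five forms map to "index", so a query for one form can match documents that use another. A rule set this simple will over- and under-stem many real words.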
  • 24. Figure 27.3 A Portion of the UMLS Semantic Network: “Biologic Function” Hierarchy Source: UMLS Reference Manual, National Library of Medicine
  • 25. Text Preprocessing (3 of 3) • Other preprocessing steps – Digits ▪May or may not be removed during preprocessing – Hyphens and punctuation marks ▪Handled in different ways – Case ▪Most search engines use case-insensitive search • Information extraction tasks – Identifying noun phrases, facts, events, people, places, and relationships
  • 26. 27.5 Inverted Indexing (1 of 3) • Inverted index structure – Vocabulary information ▪Set of distinct query terms in the document set – Document information – Data structure that associates each distinct term with a list of all documents that contain the term
  • 27. Inverted Indexing (2 of 3) • Construction of an inverted index – Break documents into vocabulary terms ▪Tokenizing, cleansing, removing stopwords, stemming, and/or using a thesaurus – Collect document statistics ▪Store statistics in document lookup table – Invert the document-term stream into a term-document stream ▪Add information such as term frequencies, term positions, and term weights
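The construction steps can be sketched in a few lines. This is a simplified illustration: the stopword list, whitespace tokenization, and sample documents are assumptions, and stemming and term weights are left out.

```python
from collections import defaultdict

STOPWORDS = {"the", "of", "a", "an", "and", "to"}  # assumed stopword list

def build_inverted_index(documents):
    """Map each term to {doc_id: [positions of the term in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        # Document-term stream: tokens of one document in order.
        for position, token in enumerate(text.lower().split()):
            if token in STOPWORDS:
                continue
            # Inversion: file the occurrence under the term, not the document.
            index[token].setdefault(doc_id, []).append(position)
    return index

documents = {
    1: "the database stores the index",
    2: "a search engine builds an index of the web",
    3: "the web crawler feeds the search engine",
}

index = build_inverted_index(documents)
print(index["index"])
```

The term frequency in a document is the length of its position list, so both statistics mentioned on the slide fall out of one structure.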
  • 28. Figure 27.4 Example of an Inverted Index
  • 29. Inverted Indexing (3 of 3) • Searching for relevant documents from an inverted index – Vocabulary search – Document information retrieval – Manipulation of retrieved information
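The three search steps can be sketched against a small hand-built index. The index contents, the `search` function, and the choice to rank by summed term frequency are all assumptions for illustration; the slides do not prescribe a scoring rule.

```python
# Hypothetical index: term -> {doc_id: term frequency}.
index = {
    "search": {2: 1, 3: 1},
    "engine": {2: 1, 3: 2},
    "web": {2: 1, 3: 1},
    "index": {1: 1, 2: 1},
}

def search(index, query_terms):
    """Return ids of documents containing every query term, best match first."""
    # 1. Vocabulary search: every query term must be in the vocabulary.
    if not query_terms or any(term not in index for term in query_terms):
        return []
    # 2. Document information retrieval: fetch the postings of each term.
    postings = [index[term] for term in query_terms]
    # 3. Manipulation: intersect postings (implicit AND), then rank by
    #    summed term frequency, breaking ties by document id.
    common = set(postings[0]).intersection(*postings[1:])
    scores = {doc_id: sum(p[doc_id] for p in postings) for doc_id in common}
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

print(search(index, ["search", "engine"]))
```

Only the postings of the query terms are touched, which is the main reason an inverted index is faster than scanning every document.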
  • 30. Introduction to Lucene • Lucene: open source indexing/search engine – Indexing is primary focus • Document composed of a set of fields – Chunks of untokenized text – Series of processed lexical units called token streams ▪Created by tokenization and filtering algorithms • Highly configurable search API • Ease of indexing large, unstructured document collections
  • 31. 27.6 Evaluation Measures of Search Relevance (1 of 4) • Topical relevance – Measures how well the result topic matches the query topic • User relevance – Describes ‘goodness’ of retrieved result with regard to user’s information need • Web information retrieval – No binary classification made for relevance or nonrelevance – Ranking of documents
  • 32. Evaluation Measures of Search Relevance (2 of 4) • Recall – Number of relevant documents retrieved by a search divided by the total number of actually relevant documents existing in the database • Precision – Number of relevant documents retrieved by a search divided by total number of documents retrieved by that search
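Both definitions reduce to counting the overlap between the retrieved set and the relevant set. The function name and the sample document ids below are assumptions for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of retrieved and relevant doc ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical search: 4 documents returned, 5 relevant documents exist.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 4, 5, 6, 7})
print(precision, recall)
```

Two of the four retrieved documents are relevant (precision 0.5), and two of the five relevant documents were found (recall 0.4).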
  • 33. Retrieved Versus Relevant Search Results • TP: true positive • FP: false positive • TN: true negative • FN: false negative Figure 27.5 Retrieved versus relevant search results
  • 34. Evaluation Measures of Search Relevance (3 of 4) • Recall can be increased by presenting more results to the user – May decrease the precision
    Table 27.2 Precision and recall for ranked retrieval (recall values assume 10 relevant documents in total)
    Doc. No.   Rank Position i   Relevant   Precision(i)    Recall(i)
    10         1                 Yes        1/1 = 100%      1/10 = 10%
    2          2                 Yes        2/2 = 100%      2/10 = 20%
    3          3                 Yes        3/3 = 100%      3/10 = 30%
    5          4                 No         3/4 = 75%       3/10 = 30%
    17         5                 No         3/5 = 60%       3/10 = 30%
    34         6                 No         3/6 = 50%       3/10 = 30%
    215        7                 Yes        4/7 = 57.1%     4/10 = 40%
    33         8                 Yes        5/8 = 62.5%     5/10 = 50%
    45         9                 No         5/9 = 55.6%     5/10 = 50%
    16         10                Yes        6/10 = 60%      6/10 = 60%
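The precision and recall columns of Table 27.2 can be recomputed from the Yes/No relevance column alone. In this sketch the list `relevance` transcribes that column in rank order, and `total_relevant = 10` is inferred from the recall denominators; the function name is an assumption.

```python
# Relevance judgments at ranks 1..10, taken from Table 27.2.
relevance = [True, True, True, False, False, False, True, True, False, True]
total_relevant = 10  # inferred from the "/10" recall denominators

def precision_recall_at_ranks(relevance, total_relevant):
    """Return a list of (rank, precision at rank, recall at rank)."""
    rows = []
    hits = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        hits += is_relevant  # True counts as 1
        rows.append((rank, hits / rank, hits / total_relevant))
    return rows

rows = precision_recall_at_ranks(relevance, total_relevant)
for rank, p, r in rows:
    print(f"{rank:2d}  precision={p:.1%}  recall={r:.0%}")
```

Recall never decreases as more results are shown, while precision drops whenever a nonrelevant document enters the list.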
  • 35. Evaluation Measures of Search Relevance (4 of 4) • Average precision – Computed based on the precision at each relevant document in the ranking • Recall/precision curve – Based on the recall and precision values at each rank position ▪x-axis is recall and y-axis is precision • F-score – Harmonic mean of the precision (p) and recall (r) values: F = 2pr / (p + r)
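Average precision and the F-score can be sketched as follows, reusing the relevance pattern of Table 27.2. One assumption to note: this version averages over the relevant documents that appear in the ranking; some definitions divide by the total number of relevant documents in the collection instead.

```python
def average_precision(relevance):
    """Mean of the precision values at each rank holding a relevant document."""
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def f_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Relevance judgments at ranks 1..10, taken from Table 27.2.
relevance = [True, True, True, False, False, False, True, True, False, True]

ap = average_precision(relevance)  # (1 + 1 + 1 + 4/7 + 5/8 + 6/10) / 6
f = f_score(0.6, 0.6)              # precision and recall at rank 10
print(round(ap, 4), round(f, 4))
```

Because the harmonic mean is dominated by the smaller value, a system cannot score well on F by maximizing only precision or only recall.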
  • 36. 27.7 Web Search and Analysis (1 of 8) • Search engines must crawl and index Web sites and document collections – Regularly update indexes – Link analysis used to identify page importance • Vertical search engines – Customized topic-specific search engines that crawl and index a specific collection of documents on the Web
  • 37. Web Search and Analysis (2 of 8) • Metasearch engines – Query different search engines simultaneously and aggregate information • Digital libraries – Collections of electronic resources and services for the delivery of materials in a variety of formats • Web analysis – Applies data analysis techniques to discover and analyze useful information from the Web
  • 38. Web Search and Analysis (3 of 8) • Goals of Web analysis – Finding relevant information – Personalization of the information – Finding information of social value • Categories of Web analysis – Web structure analysis – Web content analysis – Web usage analysis
  • 39. Web Search and Analysis (4 of 8) • Web structure analysis – Hyperlink – Destination page – Anchor text – Hub – Authority • PageRank ranking algorithm – Used by Google – Analyzes forward links and backlinks ▪Highly linked pages are more important
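The idea that highly linked pages are more important can be illustrated with a simplified PageRank power iteration. This is a sketch under stated assumptions, not the production algorithm: the damping factor 0.85, the fixed iteration count, the uniform handling of pages without outlinks, and the four-page graph are all choices made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. `links` maps each page to the pages it links to;
    every link target must also appear as a key."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its rank evenly among its forward links.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page without outlinks spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: C has backlinks from A, B, and D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
best = max(ranks, key=ranks.get)
```

Page C ends up with the highest score because it has the most backlinks, and page D, which nothing links to, keeps only the baseline share.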
  • 40. Web Search and Analysis (5 of 8) • Web content analysis tasks – Structured data extraction ▪Wrapper – Web information integration ▪Web query interface integration ▪Schema matching ▪Ontology-based information integration – Building concept hierarchies – Segmenting web pages and detecting noise
  • 41. Web Search and Analysis (6 of 8) • Approaches to Web content analysis – Agent-based ▪Intelligent Web agents ▪Personalized Web agents ▪Information filtering/categorization – Database-based ▪Attempts to organize a Web site as a database ▪Object Exchange Model ▪Multilevel database ▪Web query system
  • 42. Web Search and Analysis (7 of 8) • Web usage analysis attempts to discover usage patterns from Web data – Preprocessing ▪Usage, content, structure – Pattern discovery ▪Statistical analysis, association rules, clustering, classification, sequential patterns, dependency modeling – Pattern analysis ▪Filter out patterns not of interest
  • 43. Web Search and Analysis (8 of 8) • Practical applications of Web analysis – Web analytics ▪Understand and optimize the performance of Web usage – Web spamming ▪Deliberate activity to promote a page by manipulating search engine results – Web security ▪Allow design of more robust Web sites – Web crawlers
  • 44. 27.8 Trends in Information Retrieval (1 of 3) • Faceted search – Classifying content • Social search – Collaborative social search • Conversational information access – Intelligent agents perform intent extraction to provide information relevant to a conversation • Probabilistic topic modeling – Automatically organize large collections of documents into relevant themes
  • 45. Trends in Information Retrieval (2 of 3) Figure 27.6 A document D and its topic proportions
  • 46. Trends in Information Retrieval (3 of 3) • Question-answering systems – Factoid questions – List questions – Definition questions – Opinion questions – Composed of question analysis, query generation, search, candidate answer generation, and answer scoring
  • 47. 27.9 Summary • Information retrieval mainly targeted at unstructured data • Query and browsing modes of interaction • Retrieval models – Boolean, vector space, probabilistic, and semantic • Text preprocessing • Web search • Web ranking • Trends