3/10/2021
Wollo University
Kombolcha Institute of Technology
College of Informatics
Department of Information Technology
Course Outline
• Course Title: Information Storage and Retrieval
• Course Code: ITec3081
• ECTS Credits (CP): 5
• Target Group: B.Sc. 3rd year Information Technology Students (Regular Program)
• Year/Semester: Year III, Semester II
• Status of the Course: Core
Contact Hours (per week)   Section A          Section B
Lecture                    Mon 7:40-9:30      Wed 2:20-4:00
Lab/Practical              Thurs. 2:20-5:00   Fri 2:20-5:00
Instructor: Habtamu Abate (M.Sc.)
Email: habate999@gmail.com
Course Description:
This course covers introductory concepts of information storage and retrieval; automatic text operations, including automatic indexing; data and file structures for information retrieval; retrieval models; evaluation of information retrieval systems and techniques for enhancing retrieval effectiveness; query languages, query operations, string manipulation and search algorithms; and current issues in IR.
Course Objective
At the end of the course students will be able to:
• Understand the various information retrieval systems and processes
• Know the retrieval models and the evaluation of information retrieval systems
• Understand the processes of information storage and retrieval
• Design, develop and evaluate information retrieval models
• Understand evaluation issues in IR
• Understand current issues in IR
Course Syllabus
Chapter One
Introduction to ISR
• IR and IR systems
• Data versus information retrieval
• IR and the retrieval process
• Basic structure of an IR system
Chapter Two
Text/Document Operations and Automatic Indexing
• Index term selection (Luhn’s selection and Zipf’s law in IR)
• Document pre-processing (lexical analysis, stop word elimination, stemming)
• Term extraction (term weighting and similarity measures)
Chapter Three
Indexing Structures
• Inverted files
• Tries, Suffix Trees and Suffix Arrays
• Signature files
Chapter Four
IR Models
• Introduction of IR Models
• Boolean model
• Vector space model
• Probabilistic model
Chapter Five
Retrieval Evaluation
• Evaluation of IR systems
• Relevance judgment
• Performance measures (Recall, Precision, etc.)
Chapter Six
Query Languages
• Keyword-based queries
• Pattern matching
• Structural queries
Chapter Seven
Query Operations
• Relevance feedback
• Query expansion
Chapter Eight
Current Issues in IR
• Research in IR (Multimedia Retrieval, Web Retrieval, Question Answering, etc.)
Assessment
• Assignments = 10%, Test = 10% / Lab Exam = 10%, Project work = 20%, Mid Exam = 20%, Final examination = 40%
Text Book
• Ricardo A. Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. ACM Press, 2008.
Other Reference Books:
• Salton, G. and McGill, M. J. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
• Korfhage, R. R. Information Storage and Retrieval. John Wiley and Sons, 1997.
• Frakes, W. B. and Baeza-Yates, R. (eds.). Information Retrieval: Data Structures and Algorithms. Prentice-Hall, 1992. ISBN 0-13-463837-9.
• Spärck Jones, K. and Willett, P. (eds.). Readings in Information Retrieval. San Francisco: Morgan Kaufmann, 1997.
Chapter 1: Introduction to Information Storage and Retrieval
Outline
• IR and IR systems
• Data versus information retrieval
• IR and the retrieval process
• Basic structure of an IR system
Information Retrieval
• Information retrieval (IR) is the process of finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
• Information is organized into (a large number of) documents.
• Large collections of documents come from various sources: news articles, research papers, books, digital libraries, Web pages, etc.
• Example: Web search engines like Google claim to index over 1 trillion pages.
Information Retrieval
• Document: any object that can be stored and retrieved
• Information need: the kind of information you need to find
– what you have in mind
• Query: a statement or representation of an information need
• The user has an information need (IN) in his/her mind.
• The retrieval system can’t understand the IN directly.
• The IN has to be abstracted into a form that matches the information system.
• This abstraction is a query.
Example query: “Good textbook for information retrieval”
General Goal of Information Retrieval
• To help users find useful information based on their information needs (with minimum effort) despite
– Increasing complexity of information
– Changing needs of users
• To provide immediate random access to the document collection.
Information Retrieval Systems?
• Document (Web page) retrieval in response to a query
– Quite effective (at some things)
– Commercially successful (some of them)
• But what goes on behind the scenes?
– How do they work?
– What happens beyond the Web?
Web search systems
• Lycos, Excite, Yahoo, Google, Live, Northern Light, Teoma, HotBot, Baidu, …
Examples of IR systems
• Conventional (library catalog): search by keyword, title, author, etc.
• Text-based (Lexis-Nexis, Google, FAST): search by keywords; limited search using queries in natural language.
• Multimedia (QBIC, WebSeek, SaFe): search by visual appearance (shapes, colors, …).
• Question answering systems (AskJeeves, Answerbus): search in (restricted) natural language.
• Other: cross-language information retrieval, music retrieval
Information Retrieval vs. Data Retrieval
• The emphasis of IR is on the retrieval of information rather than on the retrieval of data.
• Data retrieval
– Consists mainly of determining which documents contain the set of keywords in the user query (which is not enough to satisfy the user's information need)
– Aims at retrieving all objects that satisfy well-defined semantics
– A single erroneous object among a thousand retrieved objects implies failure
• Information retrieval
– Is concerned with retrieving information about a subject or topic rather than retrieving data that satisfies a given query
– Semantics are frequently loose: the retrieved objects might be inaccurate
– Small errors are tolerated
Information Retrieval vs. Data Retrieval
• An example of a data retrieval system is a relational database
                      Data Retrieval                          Info Retrieval
Data organization     Structured                              Unstructured
Fields                Clear semantics (ID, Name, age, …)      No fields (other than text)
Query language        Artificial (defined, SQL)               Free text (“natural language”), Boolean
Matching              Exact (results are always “correct”)    Partial match, best match
Query specification   Complete                                Incomplete
Items wanted          Matching                                Relevant
Accuracy              100%                                    < 50%
Error response        Sensitive                               Insensitive
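The "Matching" row of the comparison can be illustrated with a small sketch. It is not taken from the slides: the records, documents and function names below are hypothetical, and counting shared words is only one simple way to approximate a best match.

```python
# Hypothetical toy data; not from the slides.
records = [
    {"id": 1, "name": "Abebe", "age": 21},
    {"id": 2, "name": "Sara", "age": 23},
    {"id": 3, "name": "Kebede", "age": 21},
]

docs = {
    "d1": "information retrieval systems rank documents",
    "d2": "relational databases store structured data",
    "d3": "retrieval of structured data from databases",
}


def data_retrieval(rows, field, value):
    """Exact match: a row either satisfies the condition or it does not."""
    return [row["id"] for row in rows if row[field] == value]


def best_match(collection, query):
    """Partial match: score each document by how many query words it shares."""
    query_words = set(query.lower().split())
    scored = []
    for doc_id, text in collection.items():
        overlap = len(query_words & set(text.lower().split()))
        if overlap > 0:
            scored.append((doc_id, overlap))
    # Highest overlap first; ties broken by document id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


exact_ids = data_retrieval(records, "age", 21)
ranked = best_match(docs, "structured data retrieval")
```

The data retrieval call returns only rows that satisfy the condition, while the IR call returns every partially matching document in order of estimated relevance.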
Basic Concepts in Information Retrieval:
(i) User Task and (ii) Logical View of Documents
The User Task: two user tasks, retrieval and browsing
[Figure: the USER interacts with the DB through retrieval and browsing]
The User Task
Retrieval
• It is the process of retrieving information whereby the main objective is clearly defined from the onset of the search process.
• The user of a retrieval system has to translate his/her information need into a query in the language provided by the system.
• In this context (i.e. by specifying a set of words), the user searches for useful information by executing a retrieval task.
• English language statement:
– I want a book by Hadis Alemayehu titled Fikir Esk Mekaber
Browsing
It is the process of retrieving information whereby the main objective is not clearly defined from the beginning and may change during the interaction with the system.
E.g. a user might search for documents about ‘car racing’. Meanwhile, s/he might find interesting documents about ‘car manufacturers’. While reading about car manufacturers in Addis, s/he might turn his/her attention to a document providing ‘directions to Addis’, and from this to documents which cover ‘Tourism in Ethiopia’.
In this context, the user is said to be browsing the collection rather than searching, since s/he may simply be interested in glancing around.
Logical View of Documents
• Documents in a collection are frequently represented by a set of index terms or keywords
• Such keywords are mostly extracted directly from the text of the document
• These representative keywords provide a logical view of the document
• Document representation can be viewed as a continuum, in which the logical view of documents might shift from full text to index terms
[Figure: Docs → tokenization → stop words → stemming → indexing; the logical view shifts from full text to index terms]
Logical view of documents
• If full text:
– Each word in the text is a keyword
– Most complex form
– Expensive
• If the full text is too large, the set of representative keywords can be reduced through a transformation process called text operations
• Text operations reduce the complexity of the document representation and allow the logical view to move from full text to a set of index terms
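The move from full text to index terms can be sketched as a chain of text operations. This is a minimal illustration under stated assumptions: the stop list is a tiny hypothetical one, and the suffix stripping is a crude stand-in for a real stemmer, not an algorithm named in the slides.

```python
import re

# Hypothetical, deliberately tiny stop list; real systems use longer lists.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Crude suffix stripping as a stand-in for a real stemmer (assumption).
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lexical analysis: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Full text -> sorted list of distinct index terms."""
    tokens = remove_stop_words(tokenize(text))
    return sorted({stem(t) for t in tokens})


terms = index_terms("The indexing of documents is expensive in large collections")
```

Nine words of full text shrink to five index terms here, which is the reduction in representation complexity that the slide describes.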
Structure of an IR System
An information retrieval system serves as a bridge between the world of authors and the world of readers/users.
That is, writers present a set of ideas in a document using a set of concepts. Users then query the IR system for relevant documents that satisfy their information need.
[Figure: User ↔ black box ↔ Documents]
The black box is the information retrieval system.
To be effective in its attempt to satisfy the information need of users, the IR system must ‘interpret’ the contents of the documents in a collection and rank them according to their degree of relevance to the user query.
Thus the notion of relevance is at the centre of IR.
The primary goal of an IR system is to retrieve all the documents that are relevant to a user query while retrieving as few non-relevant documents as possible.
Typical IR Task
Given:
• A corpus of textual natural-language documents.
• A user query in the form of a textual string.
Find:
• A ranked set of documents that are relevant to the query.
Typical IR System Architecture
[Figure: a query string and a document corpus are input to the IR system, which outputs ranked documents (1. Doc1, 2. Doc2, 3. Doc3, …)]
Web Search System
[Figure: a web spider collects pages into the document corpus; a query string and the corpus are input to the IR system, which outputs ranked documents (1. Page1, 2. Page2, 3. Page3, …)]
What is Information Retrieval ?
• A good formal definition of information retrieval is given in Baeza-Yates & Ribeiro-Neto (1999, p. 1):
“Information retrieval deals with representation, storage, organization of, and access to information items. The organization and access of information items should provide the user with easy access to the information in which s/he is interested”
• The definition incorporates all the important features of a good information retrieval system
– Representation
– Storage
– Organization
– Access
• The focus is on the user's information need
Overview of the Retrieval process
• It is necessary to define the text database before
any of the retrieval processes are initiated
• This is usually done by the manager of the database
and includes specifying the following
– The documents to be used
– The operations to be performed on the text
– The text model to be used (the text structure
and what elements can be retrieved)
• The text operations transform the original
documents and the information needs and generate
a logical view of them
The Retrieval Process
• Once the logical view of the documents is defined, the
database module builds an index of the text
– An index is a critical data structure
– It allows fast searching over large volumes of data
• Different index structures might be used , but the most
popular one is the inverted file.
• Given that the document database is indexed, the
retrieval process can be initiated
• The user first specifies a user need, which is then parsed and transformed by the same text operations applied to the text
• Next, query operations are applied before the actual query, which provides a system representation of the user need, is generated
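The index-building step and the query transformation described above can be sketched together. The collection, the whitespace-based text operation and the Boolean AND lookup are assumptions chosen for brevity. The sketch shows that an inverted file maps each term to the documents containing it, and that the query passes through the same text operation as the documents.

```python
from collections import defaultdict

# Hypothetical toy collection; document IDs and texts are made up.
DOCS = {
    1: "Information retrieval finds relevant documents",
    2: "An inverted file maps terms to documents",
    3: "Retrieval models rank documents",
}


def text_operation(text):
    """Shared text operation: lowercase and split on whitespace."""
    return text.lower().split()


def build_inverted_file(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text_operation(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_all_terms(index, query):
    """Return IDs of documents containing every query term (Boolean AND)."""
    terms = text_operation(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


index = build_inverted_file(DOCS)
hits = search_all_terms(index, "Retrieval documents")
```

Because the query is lowercased by the same function as the documents, "Retrieval" in the query matches "retrieval" in the index. Searching touches only the postings of the query terms, not the full text of every document.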
The Retrieval Process …
• The query is then processed to obtain the retrieved documents
• Before the retrieved documents are sent to the user, they are ranked according to their likelihood of relevance
• The user then examines the set of ranked documents in search of useful information. Two choices for the user:
(i) reformulate the query and run it on the entire collection, or
(ii) reformulate the query and run it on the result set
• At this point, s/he might pinpoint a subset of the documents seen as definitely of interest and initiate a user feedback cycle
• In such a cycle, the system uses the documents selected by the user to change the query formulation
• Hopefully, this modified query is a better representation of the real user need
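One simple way to picture the feedback cycle is to expand the query with terms that occur often in the documents the user selected. This is a hedged sketch: the function name, the `extra_terms` parameter and the frequency-based choice of terms are assumptions for illustration, not the method prescribed by the slides.

```python
from collections import Counter


def expand_query(query_terms, selected_docs, extra_terms=2):
    """Add the most frequent new terms from user-selected documents.

    query_terms: list of terms in the current query.
    selected_docs: list of token lists the user marked as relevant.
    """
    counts = Counter()
    for tokens in selected_docs:
        counts.update(t for t in tokens if t not in query_terms)
    # Sort by descending frequency, then alphabetically, for a stable result.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [term for term, _ in ranked[:extra_terms]]


# Hypothetical feedback: the user marked two retrieved documents as relevant.
selected = [
    ["car", "racing", "formula", "circuit"],
    ["formula", "racing", "driver", "circuit"],
]
new_query = expand_query(["car", "racing"], selected)
```

The reformulated query can then be run again on the whole collection or on the previous result set, as in the two choices listed above.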
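Detail View of the Retrieval Process
[Figure: the user need enters through the user interface as text; text operations produce a logical view of the query and of the documents; query language & operations form the query; searching uses the index (an inverted file), which indexing builds from the text database under the DB manager module; retrieved docs are passed to ranking, ranked docs are returned to the user, and user feedback flows back into the query operations]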
Subsystems of an IR system
• The two subsystems of an IR system:
– Searching: an online process of finding relevant documents in the index according to the user's query
– Indexing: an offline process of organizing documents using keywords extracted from the collection
• Indexing and searching are unavoidably connected
– You cannot search what was not first indexed in some manner or other
– Documents or objects are indexed in order to be searchable
– There are many ways to do indexing
– To index, one needs an indexing language
– There are many indexing languages
– Even taking every word in a document is an indexing language
• Knowing searching is knowing indexing
Indexing Subsystem
[Figure: documents → assign document identifier (document IDs); text → tokenize (tokens) → stop list (non-stoplist tokens) → stemming & normalize (stemmed terms) → term weighting (terms with weights) → index]
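The stages of the indexing subsystem can be sketched end to end. The assumptions are stated plainly: the stop list and documents are hypothetical, normalization is reduced to lowercasing (no stemmer), and tf × idf is used as the weighting scheme only because it is a common choice. The slides do not name a specific scheme at this point.

```python
import math
from collections import Counter

# Hypothetical stop list and documents, for illustration only.
STOP_LIST = {"the", "of", "a", "is", "and"}
RAW_DOCS = ["the retrieval of text", "text indexing and text weighting"]


def index_collection(raw_docs):
    """Return {term: {doc_id: weight}} using tf * idf as the weight."""
    # Assign document identifiers (1, 2, ...) in input order.
    doc_terms = {}
    for doc_id, text in enumerate(raw_docs, start=1):
        tokens = [t for t in text.lower().split() if t not in STOP_LIST]
        doc_terms[doc_id] = Counter(tokens)

    # Document frequency: in how many documents does each term occur?
    n_docs = len(doc_terms)
    doc_freq = Counter()
    for counts in doc_terms.values():
        doc_freq.update(counts.keys())

    # Term weighting: term frequency times inverse document frequency.
    index = {}
    for doc_id, counts in doc_terms.items():
        for term, tf in counts.items():
            idf = math.log(n_docs / doc_freq[term])
            index.setdefault(term, {})[doc_id] = tf * idf
    return index


weighted_index = index_collection(RAW_DOCS)
```

A term such as "text" that occurs in every document gets weight 0 under this scheme, since it does not help to tell documents apart.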
Searching Subsystem
[Figure: query → parse query (query tokens) → stop list (non-stoplist tokens) → stemming & normalize (stemmed terms) → term weighting (query terms) → similarity measure against the index terms from the index → ranking → ranked document set → relevant document set]
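The similarity and ranking stages can be sketched with cosine similarity over raw term-frequency vectors. The choice of cosine, the absence of stop-word removal and stemming, and the sample documents are all assumptions made to keep the example short.

```python
import math
from collections import Counter


def to_vector(text):
    """Turn text into a term-frequency vector (lowercased, whitespace split)."""
    return Counter(text.lower().split())


def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * vec_b.get(term, 0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(docs, query):
    """Return (doc_id, score) pairs with score > 0, best match first."""
    query_vec = to_vector(query)
    scored = [(doc_id, cosine(query_vec, to_vector(text)))
              for doc_id, text in docs.items()]
    scored = [pair for pair in scored if pair[1] > 0]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical documents keyed by document ID.
DOCS = {
    1: "web retrieval",
    2: "web search engines and web spiders",
    3: "music archives",
}
results = search(DOCS, "web retrieval")
```

Document 3 shares no term with the query, so it is left out of the ranked set. Document 1 ranks above document 2 because it matches the query on both terms.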
Issues that arise in IR
• Text representation
– What makes a “good” representation?
– How is a representation generated from text?
– What are retrievable objects and how are they organized?
• Information need representation
– What is an appropriate query language?
– How can interactive query formulation and refinement be supported?
• Comparing representations (to identify relevant documents)
– What weighting scheme and similarity measure should be used?
– What is a “good” model of retrieval?
• Evaluating effectiveness of retrieval
– What are good metrics?
– What constitutes a good experimental test bed?
Focus in IR System Design
Our focus during IR system design is on:
• Improving the effectiveness of the system
– Effectiveness is measured in terms of precision, recall, …
– Techniques: stemming, stop words, weighting schemes, matching algorithms
• Improving the efficiency of the system
– The concern here is storage space usage, access time, searching time, data transfer time, …
– Space-time tradeoffs must be considered
– Techniques: compression, data/file structures, etc.
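The two effectiveness measures named above can be computed as follows, using their standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The document IDs in the example run are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) from collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical run: 4 documents retrieved, 5 relevant in the collection.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10])
```

Here two of the four retrieved documents are relevant (precision 0.5) and two of the five relevant documents were found (recall 0.4).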
Thank you

More Related Content

PDF
Information storage and retrieval
PPTX
Lectures 1,2,3
PPTX
Information retrieval introduction
PDF
Information storage and retrieval PPT.pdf
PDF
Introduction to Information Retrieval & Models
PPTX
Functions of information retrival system(1)
PPTX
Information retrieval (introduction)
PDF
CS6007 information retrieval - 5 units notes
Information storage and retrieval
Lectures 1,2,3
Information retrieval introduction
Information storage and retrieval PPT.pdf
Introduction to Information Retrieval & Models
Functions of information retrival system(1)
Information retrieval (introduction)
CS6007 information retrieval - 5 units notes

What's hot (20)

PPTX
Ppt evaluation of information retrieval system
PPSX
INFORMATION RETRIEVAL ‎AND DISSEMINATION
PPTX
Ontology and Ontology Libraries: a Critical Study
PPTX
Metadata
PPTX
Information retrieval s
PPT
Information Retrieval Models
PPTX
Probabilistic information retrieval models & systems
PPTX
Post coordinate indexing .. Library and information science
PDF
Library Automation in Circulation
PPTX
WEB BASED INFORMATION RETRIEVAL SYSTEM
PPTX
Information retrieval 7 boolean model
PPSX
Web scale discovery service
PPTX
POPSI
PPT
Bibliographic coupling
ODP
Dublin core Presentation
PPT
6&7-Query Languages & Operations.ppt
PPT
User education and information literacy - Innovative strategies and practices
PPTX
Precis
PPT
Reference Sources: Origin, Evaluation and Use
PPTX
BIBLIOMETRICS LAWS
Ppt evaluation of information retrieval system
INFORMATION RETRIEVAL ‎AND DISSEMINATION
Ontology and Ontology Libraries: a Critical Study
Metadata
Information retrieval s
Information Retrieval Models
Probabilistic information retrieval models & systems
Post coordinate indexing .. Library and information science
Library Automation in Circulation
WEB BASED INFORMATION RETRIEVAL SYSTEM
Information retrieval 7 boolean model
Web scale discovery service
POPSI
Bibliographic coupling
Dublin core Presentation
6&7-Query Languages & Operations.ppt
User education and information literacy - Innovative strategies and practices
Precis
Reference Sources: Origin, Evaluation and Use
BIBLIOMETRICS LAWS
Ad

Similar to Chapter 1 Introduction to Information Storage and Retrieval.pdf (20)

PPTX
Chapter 1.pptx
PDF
Chapter 1: Introduction to Information Storage and Retrieval
PDF
Chapter 1 Introduction to ISR (1).pdf
PPT
Information retrival system it is part and parcel
PPT
information retirval system,search info insights in unsturtcured data
PPTX
Chapter 1 - Introduction to IR Information retrieval ch1 Information retrieva...
DOCX
unit 1 INTRODUCTION
PDF
CS8080_IRT__UNIT_I_NOTES.pdf
PDF
PPT
IR introduction
PPTX
PPTX
Introduction to Information Retrieval (concepts and principles)
PPTX
Information storage and retrieval system and
PPTX
IRT Unit_I.pptx
PPTX
INFORMATION RETRIEVAL Anandraj.L
PPT
IRintroduction.ppt
PPTX
Chap 1 general introduction of information retrieval
PPTX
information retrieval in artificial intelligence
PPT
chapter 1-Overview of Information Retrieval.ppt
PPTX
Introduction.pptx
Chapter 1.pptx
Chapter 1: Introduction to Information Storage and Retrieval
Chapter 1 Introduction to ISR (1).pdf
Information retrival system it is part and parcel
information retirval system,search info insights in unsturtcured data
Chapter 1 - Introduction to IR Information retrieval ch1 Information retrieva...
unit 1 INTRODUCTION
CS8080_IRT__UNIT_I_NOTES.pdf
IR introduction
Introduction to Information Retrieval (concepts and principles)
Information storage and retrieval system and
IRT Unit_I.pptx
INFORMATION RETRIEVAL Anandraj.L
IRintroduction.ppt
Chap 1 general introduction of information retrieval
information retrieval in artificial intelligence
chapter 1-Overview of Information Retrieval.ppt
Introduction.pptx
Ad

More from Habtamu100 (7)

PDF
qury.pdf
PDF
Chapter 4 IR Models.pdf
PDF
Chapter 2 Text Operation.pdf
PDF
Chapter 5 Query Evaluation.pdf
PDF
Chapter 3 Indexing.pdf
PDF
Chapter 7.pdf
PDF
Chapter 6 Query Language .pdf
qury.pdf
Chapter 4 IR Models.pdf
Chapter 2 Text Operation.pdf
Chapter 5 Query Evaluation.pdf
Chapter 3 Indexing.pdf
Chapter 7.pdf
Chapter 6 Query Language .pdf

Recently uploaded (20)

PPTX
Renaissance Architecture: A Journey from Faith to Humanism
PDF
FourierSeries-QuestionsWithAnswers(Part-A).pdf
PDF
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
PPTX
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
PDF
Anesthesia in Laparoscopic Surgery in India
PPTX
master seminar digital applications in india
PDF
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
PDF
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
PDF
Computing-Curriculum for Schools in Ghana
PDF
VCE English Exam - Section C Student Revision Booklet
PDF
01-Introduction-to-Information-Management.pdf
PDF
Abdominal Access Techniques with Prof. Dr. R K Mishra
PDF
STATICS OF THE RIGID BODIES Hibbelers.pdf
PDF
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
PDF
Microbial disease of the cardiovascular and lymphatic systems
PDF
Pre independence Education in Inndia.pdf
PPTX
GDM (1) (1).pptx small presentation for students
PDF
O5-L3 Freight Transport Ops (International) V1.pdf
PPTX
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
PDF
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
Renaissance Architecture: A Journey from Faith to Humanism
FourierSeries-QuestionsWithAnswers(Part-A).pdf
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
Anesthesia in Laparoscopic Surgery in India
master seminar digital applications in india
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
Computing-Curriculum for Schools in Ghana
VCE English Exam - Section C Student Revision Booklet
01-Introduction-to-Information-Management.pdf
Abdominal Access Techniques with Prof. Dr. R K Mishra
STATICS OF THE RIGID BODIES Hibbelers.pdf
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
Microbial disease of the cardiovascular and lymphatic systems
Pre independence Education in Inndia.pdf
GDM (1) (1).pptx small presentation for students
O5-L3 Freight Transport Ops (International) V1.pdf
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
3rd Neelam Sanjeevareddy Memorial Lecture.pdf

Chapter 1 Introduction to Information Storage and Retrieval.pdf

  • 1. 3/10/2021 1 Wollo University Kombolcha Institute of Technology College of Informatics Department of Information Technology 1 Course Outline  Course Title: Information Storage and Retrieval  Course Code: ITec3081  ECTS Credits (CP): 5  Target Group: B.Sc. 3rd year Information Technology Students (Regular Program)  Year /Semester: Year: III, Semester: II  Status of the Course: Core Contact Hours (per week) Section A Section B Lecture Mon 7:40-9:30 Wen 2:20-4:00 Lab/Practical Thurs. 2:20-5:00 Fri 2:20-5:00 Instructor: Habtamu Abate (M.Sc.) Email: habate999@gmail.com 2 Course Outline Course Description : This course will cover introductory concepts of Information Storage and Retrieval; automatic text operation including automatic indexing; data and file structure for information retrieval; retrieval models; evaluation of information retrieval systems and techniques for enhancing retrieval effectiveness; query languages, query operations, string manipulation and search algorithms; and Current Issues in IR etc. Course Objective At the end of the course students will be able to:  Understand the various Information Retrieval Systems and processes  Know the retrieval model and evaluation of Information Retrieval Systems  Understand the processes of information storage and retrieval  Design ,develop and evaluate information retrieval models  Understand evaluation issues in IR  Understand current issues in IR 3 Course Syllabus Chapter One Introduction to ISR  IR and IR systems  Data versus information retrieval  IR and the retrieval process  Basic structure of an IR system Chapter Two Text/Document Operations and Automatic Indexing  Index term selection (Luhn’s selection and Zipf’s law in IR)  Document pre-processing (Lexical analysis, Stop word Elimination, stemming)  Term extraction (Term weighting and similarity measures) Chapter Three Indexing Structures  Inverted files  Tries, Suffix Trees and Suffix Arrays  Signature files 4
  • 2. 3/10/2021 2 Course Syllabus Chapter Four IR Models  Introduction of IR Models  Boolean model  Vector space model  Probabilistic model  Evaluation of IR systems  Relevance judgment  Performance measures (Recall, Precision, etc.) Chapter Five Retrieval Evaluation Chapter Six Query Languages  Keyword-based queries  Pattern matching  Structural queries Chapter Seven Query Operations  Relevance feedback  Query expansion  Research in IR (Multimedia Retrieval, Web Retrieval, Question answering. etc.) Chapter Eight Current Issues in IR 5 Assessment  Assignments=10% ,Test=10% / Lab Exam=10%, Project work= 20 % ; Mid Exam=20% ; Final examination= 40% Text Book  Ricardo A. Baeza-Yates, Berthier Ribeiro-Neto, Modern Information Retrieval, ACM Press.,2008. Other Reference Books:  Salton, G. and McGill, M. J. Introduction to Modern Information Retrieval, McGraw-Hill Co., 1983.  Robert R. Korfhage, Information Storage and Retrieval, John Wiley and Sons, 1997.  Information Retrieval: Data Structures and Algorithms by W. B. Frakes and R. Baeza-Yates (Eds.) (Prentice-Hall) 1992, ISBN 0-13-463837-9.  Spärck Jones, K. and Willett, P. (eds.). Readings in information retrieval. San Francisco: Morgan Kaufmann, 1997. 6 Chapter 1: Introduction to Information Storage and retrieval Outline • IR and IR systems • Data versus information retrieval • IR and the retrieval process • Basic structure of an IR 7 Information Retrieval  Information retrieval (IR) is the process of finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).  Information is organized into (a large number of) documents  Large collections of documents from various sources: news articles, research papers, books, digital libraries, Web pages, etc.  Example: Web Search Engines like Google claim to index over 1 Trillion pages 8
  • 3. 3/10/2021 3 Information Retrieval  Document: Any object that can be stored and can be retrieved • Information Need: What kind of information you need to find – what you have in mind • Query: Statement of an information need or representation of an information need. • The user has an information need (IN) in his/her mind. • The Retrieval System can’t understand the IN directly. • IN has to be abstracted into a form that matches the information system • This abstraction is a query Examples Good textbook for Information retrieval 9 General Goal of Information Retrieval  To help users find useful information based on their information needs (with a minimum effort) despite  Increasing complexity of Information  Changing needs of user  Provide immediate random access to the document collection. 10 Information Retrieval Systems? • Document (Web page) retrieval in response to a query – Quite effective (at some things) – Commercially successful (some of them) • But what goes on behind the scenes? – How do they work? – What happens beyond the Web? 11 Web search systems • Lycos, Excite, Yahoo, Google, Live, Northern Light, Teoma, HotBot, Baidu, … Examples of IR systems  Conventional (library catalog): Search by keyword, title, author, etc.  Text-based (Lexis-Nexis, Google, FAST): Search by keywords. Limited search using queries in natural language.  Multimedia (QBIC, WebSeek, SaFe): Search by visual appearance (shapes, colors,… ).  Question answering systems (AskJeeves, Answerbus): Search in (restricted) natural language Other: Cross language information retrieval, Music retrieval 12
  • 4. 3/10/2021 4 Information Retrieval vs. Data Retrieval  Emphasis of IR is on the retrieval of information, rather than on the retrieval of data  Data retrieval  Consists mainly of determining which documents contain a set of keywords in the user query (which is not enough to satisfy the user information need)  Aims at retrieving all objects that satisfy well defined semantics  a single erroneous object among a thousand retrieved objects implies failure  Information retrieval  Is concerned with retrieving information about a subject or topic than retrieving data which satisfies a given query  semantics is frequently loose: the retrieved objects might be inaccurate  small errors are tolerated 13 Information Retrieval vs. Data Retrieval • Example of data retrieval system is a relational database Data Retrieval Info Retrieval Data organization Structured Unstructured Fields Clear Semantics (ID, Name, age,…) No fields (other than text) Query Language Artificial (defined, SQL) Free text (“natural language”), Boolean Matching Exact (results are always “correct”) Partial match, best match Query specification Complete Incomplete Items wanted Matching Relevant Accuracy 100% < 50% Error response Sensitive Insensitive 14 Basic Concepts in Information Retrieval: The User Task: two user task – retrieval and browsing (i) User Task and (ii) Logical View of documents Retrieval Browsing DB USER 15 The User Task Retrieval  It is the process of retrieving information whereby the main objective is clearly defined from the onset of searching process.  The user of a retrieval system has to translate his information need into a query in the language provided by the system.  In this context (i.e. by specifying a set of words), the user searches for useful information executing a retrieval task  English Language Statement :  I want a book by Hadis Alemayehu titled Fikir Esk Mekaber 16
  • 5. 3/10/2021 5 Browsing It is the process of retrieving information, whereby the main objective is not clearly defined from the beginning and whose purpose might change during the interaction with the system. E.g. User might search for documents about ‘car racing’. Meanwhile S/he might find interesting documents about ‘car manufacturers’. While reading about car manufacturers in Addis, S/he might turn his/her attention to a document providing ‘direction to Addis’, and from this to documents which cover ‘Tourism in Ethiopia’. In this context, user is said to be browsing in the collection and not searching, since a user may has an interest glancing around 17 Logical View of Documents  Documents in a collection are frequently represented by a set of index terms or keywords  Such keywords are mostly extracted directly from the text of the document  These representative keywords provide a logical view of the document  Document representation viewed as a continuum, in which logical view of documents might shift from full text to index terms Tokenization stop words stemming Indexing Docs Full text Index terms 18 Logical view of documents • If full text : – Each word in the text is a keyword – Most complex form – Expensive  If full text is too large, the set of representative keywords can be reduced through transformation process called text operation  It reduce the complexity of the document representation and allow moving the logical view from that of a full text to a set of index terms 19 Structure of an IR System An Information Retrieval System serves as a bridge between the world of authors and the world of readers/users, That is, writers present a set of ideas in a document using a set of concepts. Then Users seek the IR system for relevant documents that satisfy their information need. The black box is the information retrieval system. 
To be effective in its attempt to satisfy information need of users, the IR system must ‘interpret’ the contents of documents in a collection and rank them according to their degree of relevance to the user query. Thus the notion of relevance is at the centre of IR The primary goal of an IR system is to retrieve all the documents which are relevant to a user query while retrieving as few non- relevant documents as possible Black box User Documents 20
  • 6. 3/10/2021 6 Typical IR Task Given:  A corpus of textual natural-language documents.  A user query in the form of a textual string. Find:  A ranked set of documents that are relevant to the query. 21 Typical IR System Architecture IR System Query String Document corpus Ranked Documents 1. Doc1 2. Doc2 3. Doc3 . . 22 Web Search System Query String IR System Ranked Documents 1. Page1 2. Page2 3. Page3 . . Document corpus Web Spider 23 What is Information Retrieval ? • A good formal definition of information retrieval is given in Baeze-Yates & Riberio-Neto (1990, p1) “Information retrieval deals with representation, storage, organization of, and access to information items. The organization and access of information items should provide the user with easy access to the information in which S/he is interested” • The definition incorporates all important features of a good information retrieval system – Representation – Storage – Organization – Access • The focus is on the user information need 24
  • 7. Overview of the Retrieval Process
• It is necessary to define the text database before any of the retrieval processes are initiated
• This is usually done by the manager of the database and includes specifying the following
– The documents to be used
– The operations to be performed on the text
– The text model to be used (the text structure and what elements can be retrieved)
• The text operations transform the original documents and the information needs and generate a logical view of them
The Retrieval Process
• Once the logical view of the documents is defined, the database module builds an index of the text
– An index is a critical data structure
– It allows fast searching over large volumes of data
• Different index structures might be used, but the most popular one is the inverted file
• Given that the document database is indexed, the retrieval process can be initiated
 The user first specifies a user need, which is then parsed and transformed by the same text operations applied to the text
 Next, query operations are applied before the actual query, which provides a system representation of the user need, is generated
The Retrieval Process …
• The query is then processed to obtain the retrieved documents
 Before the retrieved documents are sent to the user, they are ranked according to their likelihood of relevance
 The user then examines the set of ranked documents in search of useful information. There are two choices for the user: (i) reformulate the query and run it on the entire collection, or (ii) reformulate the query and run it on the result set
 At this point, s/he might pinpoint a subset of the documents seen as definitely of interest and initiate a user feedback cycle
 In such a cycle, the system uses the documents selected by the user to change the query formulation. Hopefully, this modified query is a better representation of the real user need
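The inverted file mentioned above maps each term to the documents that contain it, so a query can be answered without scanning every document. The following is a minimal sketch under stated assumptions: whitespace tokenization stands in for the text operations, matching is a Boolean AND over the query terms, and the names `build_inverted_file` and `search` are hypothetical.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search(index, query):
    """Return the ids of documents that contain every query term (Boolean AND).

    The query goes through the same text operation (lowercase + split)
    that was applied to the documents when the index was built.
    """
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)

docs = {
    1: "information retrieval systems",
    2: "database systems",
    3: "information storage and retrieval",
}
index = build_inverted_file(docs)
print(index["systems"])                        # posting list for one term
print(search(index, "information retrieval"))  # documents matching both terms
```

Building the index is done once, before any query arrives; each search then touches only the posting lists of the query terms.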
  • 8. Detail view of the Retrieval Process
[Figure: components: User Interface, Text Operations, Query Language & Operations, Indexing, Searching, Ranking, Index, DB Manager Module, Text Database; flows between them: user need, text, logical view, query, inverted file, retrieved docs, ranked docs, user feedback]
Subsystems of an IR system
• The two subsystems of an IR system:
– Searching: an online process of finding relevant documents in the index according to the user's query
– Indexing: an offline process of organizing documents using keywords extracted from the collection
• Indexing and searching are unavoidably connected
– you cannot search what was not first indexed in some manner or other
– documents or objects are indexed in order to be searchable
– there are many ways to do indexing
– to index, one needs an indexing language
– there are many indexing languages
– even taking every word in a document is an indexing language
• Knowing searching is knowing indexing
Indexing Subsystem
[Figure: Documents -> Assign document identifier (document IDs) -> text -> Tokenize -> tokens -> Stop list -> non-stoplist tokens -> Stemming & Normalize -> stemmed terms -> Term weighting -> terms with weights -> Index]
Searching Subsystem
[Figure: query -> parse query -> query tokens -> Stop list -> non-stoplist tokens -> Stemming & Normalize -> stemmed terms -> Term weighting -> query terms; query terms and index terms (from the Index) -> Similarity Measure -> relevant document set -> ranking -> ranked document set]
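The two subsystems can be sketched together: an offline step that assigns a weight to every term of every document, and an online step that weights the query in the same way and ranks documents with a similarity measure. The slides do not say which weighting scheme or similarity measure is meant, so tf-idf weights and cosine similarity are assumed here as common choices; stop-word removal and stemming are omitted to keep the sketch short, and all names are hypothetical.

```python
import math
from collections import Counter

def tokens(text):
    """Stand-in for the text operations: lowercase and split on whitespace."""
    return text.lower().split()

def build_index(docs):
    """Offline indexing: compute a tf-idf weight for every term of every document."""
    n = len(docs)
    tf = {doc_id: Counter(tokens(text)) for doc_id, text in docs.items()}
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for counts in tf.values() for term in counts)
    idf = {term: math.log(n / df[term]) for term in df}
    weights = {
        doc_id: {term: count * idf[term] for term, count in counts.items()}
        for doc_id, counts in tf.items()
    }
    return weights, idf

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def search(weights, idf, query):
    """Online searching: weight the query like a document, then rank by cosine."""
    q = {t: c * idf[t] for t, c in Counter(tokens(query)).items() if t in idf}
    scored = [(cosine(q, vec), doc_id) for doc_id, vec in weights.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > 0]

docs = {
    "D1": "car racing news",
    "D2": "car manufacturers",
    "D3": "tourism in ethiopia",
}
weights, idf = build_index(docs)
print(search(weights, idf, "car racing"))
```

The index is built once and reused for every query, which mirrors the offline/online split between the indexing and searching subsystems.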
  • 9. Issues that arise in IR
 Text representation
– what makes a “good” representation?
– how is a representation generated from text?
– what are retrievable objects and how are they organized?
 Information need representation
– what is an appropriate query language?
– how can interactive query formulation and refinement be supported?
 Comparing representations (to identify relevant documents)
– what weighting scheme and similarity measure should be used?
– what is a “good” model of retrieval?
 Evaluating effectiveness of retrieval
– what are good metrics?
– what constitutes a good experimental test bed?
Focus in IR System Design
Our focus during IR system design is on:
• Improving the effectiveness of the system
– Effectiveness is measured in terms of precision, recall, …
– Techniques: stemming, stop words, weighting schemes, matching algorithms
• Improving the efficiency of the system
– The concern here is storage space usage, access time, searching time, data transfer time, …
– Space-time tradeoffs are a concern
– Techniques: compression, data/file structures, etc.
Thank you
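The two effectiveness measures named under Focus in IR System Design can be computed as follows, using their standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The function name and the sample document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids the system returned; relevant: ids judged relevant.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# The system returned four documents; three documents are actually relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(f"precision={p:.2f} recall={r:.2f}")
```

In this example two of the four retrieved documents are relevant (precision 0.5) and two of the three relevant documents were found (recall about 0.67).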