Chapter 2 Modeling
Contents
  • Introduction
  • Taxonomy of IR Models
  • Retrieval: Ad hoc, Filtering
  • Formal Characterization of IR Models
  • Classic IR Models
  • Alternative Set Theoretic Models
  • Alternative Algebraic Models
  • Alternative Probabilistic Models
  • Structured Text Retrieval Models
  • Models for Browsing
  • Trends and Research Issues
2.1 Introduction
  • Traditional IR systems adopt index terms to index and retrieve documents
  • Index term
      - Restricted sense: a keyword which has some meaning of its own (usually a noun)
      - General form: any word which appears in the text of a document
  • Ranking algorithm
      - Attempts to establish a simple ordering of the documents retrieved
      - Operates according to basic premises regarding the notion of document relevance
2.2 A Taxonomy of IR Models
  • User task: Retrieval (ad hoc, filtering)
      - Classic models: Boolean, vector, probabilistic
          · Set theoretic: Fuzzy, Extended Boolean
          · Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
          · Probabilistic: Inference Network, Belief Network
      - Structured models: Non-Overlapping Lists, Proximal Nodes
  • User task: Browsing
      - Flat
      - Structure Guided
      - Hypertext
A Taxonomy of IR Models (Cont.)
  • Retrieval models are most frequently associated with distinct combinations of a document logical view and a user task
  • Logical view of documents by user task:
      - Index Terms
          · Retrieval: Classic, Set theoretic, Algebraic, Probabilistic
          · Browsing: Flat
      - Full Text
          · Retrieval: Classic, Set theoretic, Algebraic, Probabilistic
          · Browsing: Flat, Hypertext
      - Full Text + Structure
          · Retrieval: Structured
          · Browsing: Structure Guided, Hypertext
2.3 Retrieval
  • Ad hoc
      - The documents in the collection remain relatively static while new queries are submitted to the system
      - The most common form of user task
  • Filtering
      - The queries remain relatively static while new documents come into the system (and leave)
      - User profile: describes the user's preferences
      - Routing: a variation of filtering that ranks the filtered documents
2.4 A Formal Characterization of IR Models
  • IR Model
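The slide names the IR model without spelling it out. A common textbook formalization, given here as an assumption rather than as the slide's own wording, treats an IR model as a quadruple. The symbols D, Q, F, and R are not in the transcript:

```latex
% Assumed standard formulation; the symbols below do not appear in the transcript.
\[
  \text{IR model} = \bigl[\, D,\; Q,\; \mathcal{F},\; R(q_i, d_j) \,\bigr]
\]
% D            : set of logical views (representations) of the documents
% Q            : set of logical views of the user information needs (queries)
% \mathcal{F}  : framework for modeling documents, queries, and their relationships
% R(q_i, d_j)  : ranking function assigning a real number to a query q_i and a document d_j
```

Under this reading, the Boolean, vector, and probabilistic models differ mainly in the framework and in how the ranking function is defined.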
2.5 Classic Information Retrieval
  • Boolean model
      - Based on set theory and Boolean algebra
      - Queries are specified as Boolean expressions
      - Index terms are considered either present or absent in a document
  • Vector model
      - Partial matching is possible
      - Non-binary weights are assigned to index terms
      - Term weights are used to compute the degree of similarity
  • Probabilistic model
      - Given a query q, the model assigns to each document d_j, as a measure of its similarity to the query, the ratio P(d_j relevant to q) / P(d_j non-relevant to q)
      - This ratio is the odds of the document d_j being relevant to the query q
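2.5.1 Basic Concepts
  • Index term
      - A word whose semantics helps in remembering the document's main themes
      - Mainly nouns, because nouns have meaning by themselves
  • Weights
      - Not all terms are equally useful for describing the document
  • Definition
      - A weight w_ij is associated with each pair (k_i, d_j) of index term k_i and document d_j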
Basic Concepts (Cont.)
  • Mutual independence
      - Index term weights are usually assumed to be mutually independent
      - Knowing the weight w_ij associated with the pair (k_i, d_j) tells us nothing about the weight w_(i+1)j associated with the pair (k_(i+1), d_j)
      - This assumption simplifies the task of computing index term weights and allows for fast ranking computation
2.5.2 Boolean Model
  • Base
      - Simple retrieval model based on set theory and Boolean algebra
      - Operations: and, or, not
  • Advantages
      - Clean formalism
      - Boolean query expressions have precise semantics
  • Disadvantages
      - Binary decision (no notion of a partial match)
      - Retrieval of too few or too many documents
      - Users find it difficult to express their query requests as Boolean expressions
Boolean Model (Cont.)
  • Definition
  • Example: diagram over the index terms k_a, k_b, k_c
Boolean Model (Cont.)
  • Index terms: parallel (병렬), program (프로그램), system (시스템)

    Document  parallel  program  system  …  Similarity
    001       1         0        1       …  1
    002       0         0        1       …  0
    003       0         1        1       …  0
    004       1         1        0       …  1
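The binary decision made by the Boolean model can be sketched in a few lines. This is a minimal illustration, not the slide's own example: the toy documents, the `boolean_sim` helper, and the query `parallel AND (program OR NOT system)` are all assumptions chosen to show the mechanics.

```python
# Toy binary term incidence (assumed data): each document is the set of
# index terms it contains.
DOCS = {
    "001": {"parallel", "system"},
    "002": {"system"},
    "003": {"program", "system"},
    "004": {"parallel", "program"},
}


def query(terms):
    """Hypothetical Boolean query: parallel AND (program OR NOT system)."""
    return "parallel" in terms and ("program" in terms or "system" not in terms)


def boolean_sim(doc_terms, predicate):
    """Return 1 if the document satisfies the query, else 0 (no partial match)."""
    return 1 if predicate(doc_terms) else 0


results = {doc_id: boolean_sim(terms, query) for doc_id, terms in DOCS.items()}

for doc_id, sim in sorted(results.items()):
    print(doc_id, sim)
```

Document 001 contains "parallel" but fails the second clause, so it scores 0 exactly like a document with no query term at all. That is the lack of partial matching listed among the disadvantages.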
2.5.3 Vector Model
  • Motivation
      - Binary weights are too limiting
      - Assign non-binary weights to index terms
      - Provide a framework in which partial matching is possible
      - Instead of attempting to predict whether a document is relevant or not, rank the documents according to their degree of similarity to the query
Vector model (Cont.)
  • Definition
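The definition can be written out as follows. This is a sketch assuming the usual cosine formulation that the later slides refer to as the "cosine ranking formula". The vector notation and the number of index terms t are assumptions not shown in the transcript:

```latex
% Assumed notation: t index terms; w_{ij} is the weight of term k_i in document d_j,
% and w_{iq} is the weight of term k_i in the query q.
\[
  \vec{d}_j = (w_{1j}, w_{2j}, \ldots, w_{tj}), \qquad
  \vec{q} = (w_{1q}, w_{2q}, \ldots, w_{tq})
\]
\[
  \mathrm{sim}(d_j, q)
  = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j| \, |\vec{q}|}
  = \frac{\sum_{i=1}^{t} w_{ij}\, w_{iq}}
         {\sqrt{\sum_{i=1}^{t} w_{ij}^{2}} \; \sqrt{\sum_{i=1}^{t} w_{iq}^{2}}}
\]
```

Because all weights are non-negative, the similarity lies between 0 and 1, and a document that matches only some query terms still receives a non-zero score.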
Vector model (Cont.)
  • Clustering problem
      - Intra-cluster similarity: which features better describe the objects?
      - Inter-cluster dissimilarity: which features better distinguish the objects?
  • IR problem
      - Intra-cluster similarity (tf factor): raw frequency of a term k_i inside a document d_j
      - Inter-cluster dissimilarity (idf factor): inverse of the frequency of a term k_i among the documents
Vector model (Cont.)
  • Weighting scheme
      - Term frequency (tf): a measure of how well the term describes the document contents
      - Inverse document frequency (idf): terms which appear in many documents are not very useful for distinguishing a relevant document from a non-relevant one
Vector model (Cont.)
  • Best known index term weighting scheme: balance tf and idf (tf-idf scheme)
  • Query term weighting scheme
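One common way to write the tf-idf scheme is sketched below, following the usual textbook formulation. The symbols N, n_i, and freq are assumptions not shown in the transcript, and the query weighting in particular is one suggested variant rather than the only option:

```latex
% Assumed symbols: N = number of documents in the collection,
% n_i = number of documents containing term k_i,
% freq_{ij} = raw frequency of k_i in document d_j.
\[
  f_{ij} = \frac{\mathrm{freq}_{ij}}{\max_{l} \mathrm{freq}_{lj}}, \qquad
  \mathrm{idf}_i = \log \frac{N}{n_i}, \qquad
  w_{ij} = f_{ij} \times \log \frac{N}{n_i}
\]
% One suggested query term weighting, with freq_{iq} the raw frequency of k_i in the query:
\[
  w_{iq} = \left( 0.5 + \frac{0.5\, \mathrm{freq}_{iq}}{\max_{l} \mathrm{freq}_{lq}} \right)
           \times \log \frac{N}{n_i}
\]
```

The worked example in these slides appears to use the raw frequency instead of the normalized f_ij and a base-10 logarithm: the weight .954 for "silver" in D2 equals 2 × .477.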
Vector model (Cont.)
  • idf of each term

    Term      idf
    a         0
    arrived   .176
    damaged   .477
    delivery  .477
    fire      .477
    gold      .176
    in        0
    of        0
    silver    .477
    shipment  .176
    truck     .176
Vector model (Cont.)
  • Document vectors (not normalized)

          t1    t2    t3    t4    t5    t6    t7    t8    t9    t10   t11
    D1    0     0     .477  0     .477  .176  0     0     0     .176  0
    D2    0     .176  0     .477  0     0     0     0     .954  0     .176
    D3    0     .176  0     0     0     .176  0     0     0     .176  .176
    Q     0     0     0     0     0     .176  0     0     .477  0     .176

  • Hence, the ranking would be D2, D3, D1
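The numbers in this example can be reproduced with a short script. This is a sketch under stated assumptions: the transcript gives only the vectors, so the three document sentences and the query text below are assumptions chosen to match them, weights are taken as raw term frequency times a base-10 idf, and documents are ranked by cosine similarity.

```python
import math
from collections import Counter

# Assumed texts: picked so that the resulting weights match the slide's vectors.
docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

# Raw term frequencies per document and the collection vocabulary.
tf = {name: Counter(text.split()) for name, text in docs.items()}
vocab = sorted({term for counts in tf.values() for term in counts})

# idf_i = log10(N / n_i), where n_i is the number of documents containing the term.
N = len(docs)
idf = {t: math.log10(N / sum(1 for c in tf.values() if t in c)) for t in vocab}


def weights(counts):
    """tf-idf vector over the vocabulary (raw tf, not normalized)."""
    return [counts.get(t, 0) * idf[t] for t in vocab]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


doc_vecs = {name: weights(counts) for name, counts in tf.items()}
query_vec = weights(Counter(query.split()))

scores = {name: cosine(vec, query_vec) for name, vec in doc_vecs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(name, round(scores[name], 3))
```

Terms that occur in every document ("a", "in", "of") get an idf of 0 and drop out of the comparison, while the repeated and rarer "silver" dominates D2's vector and puts it first in the ranking.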
Vector model (Cont.)
  • Advantages
      - The term-weighting scheme improves retrieval performance
      - The partial matching strategy allows retrieval of documents that approximate the query conditions
      - The cosine ranking formula sorts the documents according to their degree of similarity to the query
  • Disadvantages
      - Index terms are assumed to be mutually independent
      - The tf-idf scheme does not account for index term dependencies
      - However, in practice, consideration of term dependencies might be a disadvantage

VSM: Vector Space Model
