Vector Space Model
By: Tharuka Vishwajith
Boolean Model
• Based on set theory and Boolean logic
• Exact matching of documents to a user query
• Uses the Boolean AND, OR and NOT operators
Term      D1  D2  D3  D4  D5  D6
Cat        1   1   0   1   0   1
Dog        1   1   1   1   1   0
Rat        0   1   0   1   0   1
Apple      0   0   0   0   1   0
Orange     0   0   1   1   0   1
Computer   0   0   0   1   1   1
• query: Dog AND Cat AND NOT Computer
• computation: 111110 AND 110101 AND 111000 = 110000 (111000 is NOT Computer, the complement of 000111)
• result: document set {D1, D2}
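The bitwise computation above can be sketched in a few lines of Python. This is only a minimal illustration, not a full retrieval engine: the names `incidence`, `MASK` and `to_doc_set` are assumptions made for this example, and each term row is stored as a 6-bit integer with D1 as the leftmost bit.

```python
# Term-document incidence rows from the table; D1 is the leftmost bit.
DOCS = ["D1", "D2", "D3", "D4", "D5", "D6"]
incidence = {
    "Cat":      0b110101,
    "Dog":      0b111110,
    "Rat":      0b010101,
    "Apple":    0b000010,
    "Orange":   0b001101,
    "Computer": 0b000111,
}
MASK = (1 << len(DOCS)) - 1  # restricts NOT to the 6 document bits


def to_doc_set(bits):
    """Translate a bit vector back into the set of matching document names."""
    n = len(DOCS)
    return {DOCS[i] for i in range(n) if bits & (1 << (n - 1 - i))}


# Dog AND Cat AND NOT Computer
result = incidence["Dog"] & incidence["Cat"] & (~incidence["Computer"] & MASK)
print(format(result, "06b"), sorted(to_doc_set(result)))
```

The mask is needed because Python integers are unbounded, so a bare `~` would produce a negative number rather than a 6-bit complement.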
Boolean Model ...
Advantages
• Relatively easy to implement and scalable
• Fast query processing based on parallel scanning of indexes
Disadvantages
• Does not pay attention to synonymy (different words with the same meaning)
• Does not pay attention to polysemy (one word with several meanings)
• No ranking of output
• Often the user has to learn a special syntax, such as the use of double quotes to search for phrases
Vector Space Model
• Algebraic model representing text documents and queries as vectors based on the index terms
• One dimension for each term
• Compute the similarity (the cosine of the angle) between the query vector and the document vectors
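As a rough sketch of the idea in the last bullet, the snippet below ranks two documents by the cosine of the angle between each document vector and a query vector. The two-dimensional (Dog, Computer) weights are hypothetical values invented for this illustration, and `cosine_similarity` is a helper name assumed here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_u * norm_v)


# Hypothetical (Dog, Computer) weights, invented for illustration only.
query = [2.0, 1.0]
docs = {"D1": [3.0, 1.0], "D2": [1.0, 4.0]}

# A smaller angle means a larger cosine, so sort in descending order.
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
for name in ranked:
    print(name, round(cosine_similarity(query, docs[name]), 3))
```

Unlike the Boolean model, every document receives a score, so the output is a ranking rather than an unordered set.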
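[Figure: documents D1 and D2 and the query drawn as vectors in a two-dimensional space with axes Dog and Computer; the angles θ1 and θ2 are marked, and the values 5, 1, 2 and 8 are labelled.]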
Vector space model in information retrieval
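Cosine similarity among 3 documents
(SaS = Sense and Sensibility, PaP = Pride and Prejudice, WH = Wuthering Heights)
Term frequency (tf) count:
Term        SaS   PaP   WH
affection   115    58   20
jealous      10     7   11
gossip        2     0    6
wuthering     0     0   38
Log normalization: weight = 1 + log10(tf) if tf > 0, otherwise 0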
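A small sketch of the log normalization step applied to the counts above. It assumes a base-10 logarithm (which agrees with 115 mapping to about 3.06) and a weight of 0 for a term that does not occur; `log_tf_weight` is a name chosen for this example.

```python
import math


def log_tf_weight(tf):
    """Log-frequency weight: 1 + log10(tf) for tf > 0, otherwise 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0


# Raw term counts from the table above.
counts = {
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
    "PaP": {"affection": 58, "jealous": 7, "gossip": 0, "wuthering": 0},
    "WH":  {"affection": 20, "jealous": 11, "gossip": 6, "wuthering": 38},
}

weights = {
    doc: {term: log_tf_weight(tf) for term, tf in terms.items()}
    for doc, terms in counts.items()
}
for doc, w in weights.items():
    print(doc, {term: round(x, 2) for term, x in w.items()})
```

The logarithm dampens large counts: 115 occurrences of a term weigh only about 1.5 times as much as 10 occurrences, not 11.5 times.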
Log Frequency Weightage
Term        SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.00   1.84   2.04
gossip      1.30   0      1.78
wuthering   0      0      2.58
Length of SaS = √(3.06² + 2.00² + 1.30² + 0²) ≈ 3.87
Length of PaP = √(2.76² + 1.84² + 0² + 0²) ≈ 3.31
Length of WH = √(2.30² + 2.04² + 1.78² + 2.58²) ≈ 4.39
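The length of each document vector is its Euclidean (L2) norm, and length normalization divides every weight by that norm. The sketch below takes the two-decimal log-frequency weights as input; `l2_length` and `normalize` are helper names assumed for this example, and the printed values can differ from hand-computed ones in the last digit because of rounding.

```python
import math


def l2_length(vec):
    """Euclidean length: square root of the sum of squared weights."""
    return math.sqrt(sum(x * x for x in vec))


def normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned as is)."""
    length = l2_length(vec)
    return [x / length for x in vec] if length else list(vec)


# Log-frequency weights in the order (affection, jealous, gossip, wuthering).
log_weights = {
    "SaS": [3.06, 2.00, 1.30, 0.0],
    "PaP": [2.76, 1.84, 0.0, 0.0],
    "WH":  [2.30, 2.04, 1.78, 2.58],
}

for doc, vec in log_weights.items():
    unit = normalize(vec)
    print(doc, round(l2_length(vec), 2), [round(x, 2) for x in unit])
```

After normalization every vector has length 1, so long and short documents become directly comparable.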
After Length Normalization
Each log-frequency weight is divided by the length of its document vector (SaS ≈ 3.87, PaP ≈ 3.31, WH ≈ 4.39):
Term        SaS           PaP           WH
affection   3.06 / 3.87   2.76 / 3.31   2.30 / 4.39
jealous     2.00 / 3.87   1.84 / 3.31   2.04 / 4.39
gossip      1.30 / 3.87   0 / 3.31      1.78 / 4.39
wuthering   0 / 3.87      0 / 3.31      2.58 / 4.39
Normalized weights:
Term        SaS    PaP    WH
affection   0.79   0.83   0.52
jealous     0.51   0.56   0.46
gossip      0.33   0      0.40
wuthering   0      0      0.58
Because every vector now has unit length, the cosine is simply the dot product:
cos(SaS, PaP) ≈ (0.79 × 0.83) + (0.51 × 0.56) ≈ 0.94
cos(PaP, WH) ≈ (0.83 × 0.52) + (0.56 × 0.46) ≈ 0.69
cos(SaS, WH) ≈ (0.79 × 0.52) + (0.51 × 0.46) + (0.33 × 0.40) ≈ 0.78
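The whole worked example, from raw counts to pairwise cosines, can be reproduced with a short script. This is a sketch under the same assumptions as the calculation above (base-10 log-frequency weights, no idf factor, L2 length normalization); the helper names are chosen for this example.

```python
import math
from itertools import combinations

# Raw counts in the order (affection, jealous, gossip, wuthering).
counts = {
    "SaS": [115, 10, 2, 0],
    "PaP": [58, 7, 0, 0],
    "WH":  [20, 11, 6, 38],
}


def log_tf_weight(tf):
    """1 + log10(tf) for tf > 0, otherwise 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0


def unit_vector(vec):
    """Divide each component by the vector's Euclidean length."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]


unit = {doc: unit_vector([log_tf_weight(tf) for tf in tfs])
        for doc, tfs in counts.items()}

# For unit vectors the cosine is just the dot product.
similarity = {}
for a, b in combinations(unit, 2):
    similarity[(a, b)] = sum(x * y for x, y in zip(unit[a], unit[b]))
    print(f"cos({a}, {b}) = {similarity[(a, b)]:.2f}")
```

Run at full precision, this gives about 0.94 for SaS and PaP, 0.79 for SaS and WH, and 0.69 for PaP and WH. A hand calculation that rounds every intermediate value to two decimals can differ by about 0.01.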
