Special Topics in Computer Science
Advanced Topics in Information Retrieval
Lecture 7 (book chapter 9):
Parallel and Distributed IR
Alexander Gelbukh
www.Gelbukh.com
Previous Chapter: Conclusions
 How to accelerate search? Same results as sequential
 Ideas:
 Quick-and-dirty rejection of bad objects, 100% recall
 Fast data structure for search (based on clustering)
 Careful check of all found candidates
 Solution: mapping into fewer-D feature space
 Condition: lower-bounding of the distance
 Assumption: skewed spectrum distribution
 Few coefficients concentrate energy, rest are less important
Previous Chapter: Research topics
 Object detection (pattern and image recognition)
 Automatic feature selection
 Spatial indexing data structures (more than 1D)
 New types of data.
 What features to select? How to determine them?
 Mixed-type data (e.g., webpages, or images with
sound and description)
 What clustering/IR methods are better suited for
what features? (What features for what methods?)
 Similar methods in data mining, ...
The problem
 Very large document collections
 Google: 4,000,000,000 pages
 Slow response?
 Solution: parallel computing
 Google: 10,000 computers
Parallel architectures
                      Data stream
Instruction stream    Single              Multiple
Single                SISD (classical)    SIMD (simple)
Multiple              MISD (rare)         MIMD (many SISD)
MIMD architecture
 The most common
 Can be
 tightly coupled
 loosely coupled
 Distributed
 Many computers interacting via network
 PC Clusters
 Similar to MIMD computers, but greater cost of
communication
 very loosely coupled
 More coarse-grained programs
Performance improvement
Time: speedup S
 Ideally, N times (number of processors)
 In practice impossible
 The problem does not decompose into N equal parts
 Communication and control overhead
 S ≤ 1 / f, where f is the fraction of the problem that must be
computed sequentially (Amdahl's law)
Cost
 Efficiency (speedup per processor): S / N
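
The speedup bound above can be made concrete with Amdahl's law. The sketch below assumes that f denotes the fraction of the work that must run sequentially; the function names `amdahl_speedup` and `efficiency` are illustrative and do not come from the lecture.

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Upper bound on speedup with n processors when a fraction f
    of the work must run sequentially (Amdahl's law)."""
    if not 0.0 <= f <= 1.0 or n < 1:
        raise ValueError("need 0 <= f <= 1 and n >= 1")
    return 1.0 / (f + (1.0 - f) / n)


def efficiency(speedup: float, n: int) -> float:
    """Speedup obtained per processor, S / N."""
    return speedup / n


if __name__ == "__main__":
    # With 5% sequential work the bound approaches 1 / f = 20,
    # no matter how many processors are added.
    for n in (1, 10, 100, 10_000):
        s = amdahl_speedup(0.05, n)
        print(f"N={n:>6}  S={s:6.2f}  S/N={efficiency(s, n):.4f}")
```

The loop illustrates why the ideal N-fold speedup is out of reach: S grows toward 1 / f while S / N keeps falling.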
Two approaches to parallelism
 Build new algorithms
 E.g., neural nets
 Naturally parallel
 Problem: to define the retrieval task
 Adapt the existing techniques to parallelism
 Allows relying on well-studied approaches
 We will consider this option
Ways to use parallelism
 Multitasking
 N search engines
 Good for processing many queries
Problems:
 A single query is not sped up
 Bottleneck: disk access (index)
 Possible solution: replicating (part of) data. RAIDs
 Parallel algorithms
 IR = data. Main question: how to partition the data
 Document / index term matrix
(terms can be LSI dimensions, signature bits, etc)
Possible partitionings
 Horizontal: document partitioning. Union of results
 Vertical: term partitioning. Basically, intersect results
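
The two partitionings can be sketched on a toy collection. Everything below (the `DOCS` data, the round-robin assignment, and all function names) is a hypothetical illustration, not the scheme of any particular system.

```python
from collections import defaultdict

# Toy collection: doc id -> set of index terms (hypothetical data).
DOCS = {
    0: {"parallel", "retrieval"},
    1: {"distributed", "retrieval"},
    2: {"parallel", "index"},
    3: {"distributed", "index", "retrieval"},
}


def build_index(docs):
    """Inverted index: term -> sorted list of doc ids."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in docs[doc_id]:
            index[term].append(doc_id)
    return dict(index)


def document_partition(docs, n):
    """Horizontal split: each processor indexes its own subset of docs."""
    parts = [{} for _ in range(n)]
    for doc_id, terms in docs.items():
        parts[doc_id % n][doc_id] = terms
    return [build_index(p) for p in parts]


def term_partition(docs, n):
    """Vertical split: each processor holds the full lists of some terms."""
    full = build_index(docs)
    parts = [{} for _ in range(n)]
    for i, term in enumerate(sorted(full)):
        parts[i % n][term] = full[term]
    return parts


def search_doc_partitioned(parts, term):
    """Every processor answers; the result is the union of local hits."""
    hits = set()
    for local_index in parts:
        hits |= set(local_index.get(term, []))
    return sorted(hits)


def search_term_partitioned(parts, terms):
    """Each term is served by the processor that owns it; AND intersects."""
    result = None
    for term in terms:
        postings = set()
        for local_index in parts:
            postings |= set(local_index.get(term, []))
        result = postings if result is None else result & postings
    return sorted(result or [])
```

For example, `search_doc_partitioned(document_partition(DOCS, 2), "retrieval")` collects hits from both processors, while `search_term_partitioned(term_partition(DOCS, 2), ["parallel", "retrieval"])` intersects two complete lists.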
Inverted files: Logical partitioning
 Logical vs. physical document partitioning
 Logical: for each term, use pointers into inverted file data for
each processor, to indicate its portion
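
A minimal sketch of the pointer idea, assuming that documents are numbered so that each processor owns a contiguous interval of doc ids. The names `partition_pointers` and `local_slice` and the sample numbers are hypothetical.

```python
from bisect import bisect_left


def partition_pointers(postings, boundaries):
    """Offsets into one sorted postings list that split it by processor.

    Processor p owns doc ids in [boundaries[p], boundaries[p + 1]).
    """
    return [bisect_left(postings, b) for b in boundaries]


def local_slice(postings, pointers, p):
    """The portion of the shared postings list that processor p scans."""
    return postings[pointers[p]:pointers[p + 1]]


# One term's postings and three processors owning ids 0-4, 5-9, 10-15.
postings = [1, 4, 7, 9, 12, 15]
pointers = partition_pointers(postings, [0, 5, 10, 16])
```

The inverted file itself stays in one piece; only the per-term pointer table differs from a sequential index.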
Inverted files: Logical partitioning
Construction and updating
 Also parallel
Construction
 Assign docs to processors
 Order docs such that each processor has an interval
 Process in parallel
 Merge. Each piece is ordered already
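
The construction steps above can be sketched as follows. This is an illustration under assumptions: documents are plain strings split on whitespace, and threads stand in for processors (in CPython, threads do not give real CPU parallelism for this kind of work, so the code shows the structure only). The names `index_interval` and `build_parallel` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def index_interval(docs, start, stop):
    """Build a partial inverted index for doc ids in [start, stop)."""
    partial = {}
    for doc_id in range(start, stop):
        for term in sorted(set(docs[doc_id].split())):
            partial.setdefault(term, []).append(doc_id)
    return partial


def build_parallel(docs, n_workers):
    """Index contiguous doc-id intervals in parallel, then merge."""
    size = max(1, -(-len(docs) // n_workers))  # ceiling division
    bounds = [(i, min(i + size, len(docs))) for i in range(0, len(docs), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(lambda b: index_interval(docs, *b), bounds))
    merged = {}
    # Intervals are visited in doc-id order, so concatenating the
    # partial lists keeps every merged postings list sorted.
    for partial in partials:
        for term, postings in partial.items():
            merged.setdefault(term, []).extend(postings)
    return merged
```

Because each worker owns an interval, the merge is a concatenation rather than a general sort.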
Inverted files: Physical document partitioning
 Several separate collections, one per processor
 Separate indices
 Then the lists are merged (they are already ordered)
 Priority queue is used
 The result is not sorted; Insertion is quick
 The maximal element can be found quickly
 First k elements can be found rather quickly
 Details in the book
 Consistent scores are needed
 Global statistics are needed; they can be computed at index
time
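
A minimal sketch of the merge step, assuming every processor returns `(score, doc_id)` pairs already sorted by descending score and that the scores are comparable across processors. It uses the standard-library `heapq.merge`, which keeps a small heap over the heads of the lists; the function name `top_k` and the sample data are hypothetical.

```python
import heapq
from itertools import islice


def top_k(ranked_lists, k):
    """Merge per-processor hit lists (each sorted by descending score)
    and return the k best (score, doc_id) pairs overall."""
    merged = heapq.merge(*ranked_lists, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, k))


lists = [
    [(0.9, "d1"), (0.4, "d7")],
    [(0.8, "d3"), (0.5, "d2")],
    [(0.6, "d9")],
]
best = top_k(lists, 3)
```

Only the first k elements are ever pulled from the merge, so the remaining hits are never fully sorted.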
Logical or physical partitioning?
 Logical requires less communication
 Faster
 Physical is more flexible. Simpler implementation
 Simpler conversion of existing systems
Inverted files: Term partitioning
 Each processor processes a part of the inverted file
 The results are intersected (for AND)
 (or as appropriate for Boolean operations, OR and NOT)
 When the term distribution in user queries is skewed,
document partitioning is better
 When it is uniform, term partitioning is better
 Roughly twice as fast for long queries, 5 to 10 times for short (Web-like) queries
Suffix arrays
 Array construction can be parallelized
 merges are parallel
 Document partitioning is applied straightforwardly
 Each processor maintains its own suffix array
 Term partitioning can be applied
 Each processor owns a branch of the tree (lexicographic
interval)
 Bottleneck: all processors need access to the entire text
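
A small sketch of document partitioning with suffix arrays, assuming each processor holds its own text. The construction is the naive sort of all suffixes, chosen for brevity rather than speed; all function names are hypothetical.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find(text, sa, pattern):
    """Positions where pattern occurs, found by two binary searches."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-char prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-char prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def search_partitioned(texts, pattern):
    """Each processor searches its own suffix array; hits are concatenated."""
    hits = []
    for proc, text in enumerate(texts):
        sa = build_suffix_array(text)
        hits.extend((proc, pos) for pos in find(text, sa, pattern))
    return hits
```

Each processor needs only its own text here, which is what makes document partitioning straightforward compared with splitting one array by lexicographic interval.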
Parallel and Distributed Information Retrieval System
Signature files
 Document partitioning: straightforward
 Create query signature, distribute to each processor
 Merge results (using Boolean operations if needed)
 Term partitioning: shorter signatures
 Merging and eliminating false drops is slow
 This method is not recommended
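
A minimal sketch of document partitioning for signature files, assuming superimposed coding: a document signature is the bitwise OR of its word signatures. The signature width, the number of bits per word, the use of SHA-256 as the hash, and all function names are illustrative assumptions.

```python
import hashlib

SIG_BITS = 64      # signature width (illustrative choice)
BITS_PER_WORD = 3  # bits set per word (illustrative choice)


def word_signature(word):
    """Set a few bits chosen deterministically from a hash of the word."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    sig = 0
    for i in range(BITS_PER_WORD):
        sig |= 1 << (digest[i] % SIG_BITS)
    return sig


def signature(words):
    """Superimpose (OR together) the signatures of all words."""
    sig = 0
    for word in words:
        sig |= word_signature(word)
    return sig


def make_block(docs):
    """One processor's data: doc id -> (signature, set of words)."""
    return {doc_id: (signature(words), set(words)) for doc_id, words in docs.items()}


def search_block(block, query_sig, query_words):
    """Bitwise filter first, then check the words to drop false drops."""
    hits = []
    for doc_id, (sig, words) in block.items():
        if (sig & query_sig) == query_sig and all(w in words for w in query_words):
            hits.append(doc_id)
    return hits


def search(blocks, query_words):
    """Create the query signature once and send it to every block."""
    query_sig = signature(query_words)
    results = []
    for block in blocks:
        results.extend(search_block(block, query_sig, query_words))
    return sorted(results)
```

The bit test can only produce false positives, never false negatives, so the word check after it is what removes false drops.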
SIMD computers
 Single Instruction, Multiple data
 Uncommon
 Good for simple operations
 Bit operations in signature files
 Details in the book
 Ranking is supported in hardware in some computers
 If the signature file does not fit into memory, it can be
processed in batches
 I/O overhead
 Use multiple queries with the same batch
 This improves throughput, but not response time
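
The batching idea can be sketched in plain Python. The inner loops stand in for what a SIMD machine would do across its processing elements; signatures are modelled as integers, and the name `batched_search` and the sample data are hypothetical.

```python
def batched_search(signatures, queries, batch_size):
    """Scan the signature file once in fixed-size batches and test every
    pending query against each batch while it is in memory."""
    hits = {name: [] for name in queries}
    for start in range(0, len(signatures), batch_size):
        batch = signatures[start:start + batch_size]  # one I/O step
        for name, query_sig in queries.items():
            for offset, sig in enumerate(batch):
                if (sig & query_sig) == query_sig:
                    hits[name].append(start + offset)  # candidate only
    return hits


signatures = [0b1010, 0b0110, 0b1110, 0b0001]
queries = {"a": 0b0010, "b": 0b1100}
candidates = batched_search(signatures, queries, batch_size=2)
```

Each batch is loaded once and shared by all queries, which is why throughput improves while a single query still waits for the full scan.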
… SIMD computers
 Inverted files are difficult to adapt to SIMD
 The inverted file is restructured
 Details in the book
Distributed IR
 MIMD with
 Slow communication
 Not all nodes are used for a given query
 Encryption issues
 Document partitioning is usually used
 Term partitioning imposes greater communication
overhead
 Document clustering can be useful (to distribute docs
among processors)
 Index clusters and then search only the best ones
 Another approach: use training queries, then use the similarity of
the user query to these
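
Searching only the best clusters can be sketched as a node-selection step. The sketch assumes each node is summarized by a centroid of term weights and that cosine similarity is the ranking measure; both are illustrative choices, as are the names `cosine` and `select_nodes` and the sample data.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_nodes(centroids, query, m):
    """Rank nodes by similarity of their cluster centroid to the query
    and keep the m best; only these nodes receive the query."""
    ranked = sorted(
        centroids,
        key=lambda node: cosine(centroids[node], query),
        reverse=True,
    )
    return ranked[:m]


centroids = {
    "A": {"parallel": 1.0, "index": 0.5},
    "B": {"cooking": 1.0},
    "C": {"index": 1.0},
}
chosen = select_nodes(centroids, {"index": 1.0}, 2)
```

Skipping low-ranked nodes reduces communication, at the price of possibly missing relevant documents stored on them.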
Research topics
 How to evaluate the speedup
 New algorithms
 Adaptation of existing algorithms
 Merging the results is a bottleneck
 Meta search engines
 Creating large collections with judgements
 Is recall important?
Conclusions
 Parallel computing can improve
 response time for each query and/or
 throughput: more queries processed, each at the same speed
 Document partitioning is simple
 good for distributed computing
 Term partitioning is good for some data structures
 Distributed computing is MIMD computing with slow
communication
 SIMD machines are good for signature files
 Both (SIMD machines and signature files) are out of favor now
Thank you!
Till May 17? 18?, 6 pm
