"BEYOND PAGES: SUPPORTING EFFICIENT, SCALABLE ENTITY SEARCH WITH DUAL-INVERSION INDEX" Tao Cheng, Kevin Chen-Chuan Chang, University of Illinois at Urbana-Champaign. Presented by: Mahesh Gupta. CSE 6339 Web Search Mining & Integration – Paper Presentation
WHAT IS THIS PAPER ALL ABOUT? Entity-based search is the next big step forward and a significant departure from traditional keyword-based search. From a computational point of view, entity search at the scale of the World Wide Web also raises unique challenges. This paper identifies these computational challenges and introduces a solution using the "Dual-Inversion Index" technique.
ENTITY SEARCH Suppose we are interested in finding the location of Cowboy Stadium. Our system expects the query as a combination of keywords and entity types. Query: Cowboy Stadium #Location. Task to do here: Context Matching.
IS CONTEXT MATCHING ALONE ENOUGH?
... Cowboy Stadium located in Arlington Texas ...
... Cowboy Stadium located in United States cost $1.3 billion in complete construction ...
... Cowboy Stadium located in North Texas is the fourth largest national football stadium in the country by seating capacity ...
... Cowboy Stadium is 20 miles drive from Hilton hotel located in Stemmons Freeway, Dallas Texas ...
Context matching returns several candidate locations, and in the last snippet the location belongs to the hotel rather than the stadium. So clearly we also need some scoring mechanism.
HENCE... Suppose we are interested in finding the location of Cowboy Stadium. Query: Cowboy Stadium #Location. Tasks to do here: Context Matching and Global Aggregation.
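The two tasks can be illustrated with a minimal Python sketch. It is not the authors' scoring method: the `LOCATIONS` list stands in for a real #Location tagger, the last snippet in `SNIPPETS` is a hypothetical extra page added so that one answer is supported twice, and the local score (inverse token distance between the keywords and the entity) is an assumption chosen only to make the idea concrete.

```python
from collections import defaultdict

# Snippets loosely based on the slide; the last one is a hypothetical extra page.
SNIPPETS = [
    "Cowboy Stadium located in Arlington Texas",
    "Cowboy Stadium located in United States cost $1.3 billion",
    "Cowboy Stadium located in North Texas is the fourth largest stadium",
    "Cowboy Stadium is 20 Miles drive from Hilton hotel located in Dallas Texas",
    "Cowboy Stadium located in Arlington Texas hosts football games",
]

# Assumed hand-made list standing in for a real #Location entity tagger.
LOCATIONS = ["Arlington Texas", "United States", "North Texas", "Dallas Texas"]


def find_phrase(tokens, phrase):
    """Return the index where `phrase` (a list of tokens) starts, or -1."""
    n = len(phrase)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            return i
    return -1


def local_matches(snippet):
    """Context matching: yield (location, local score) pairs for one snippet."""
    tokens = snippet.split()
    kpos = find_phrase(tokens, ["Cowboy", "Stadium"])
    if kpos < 0:
        return
    for loc in LOCATIONS:
        epos = find_phrase(tokens, loc.split())
        if epos >= 0:
            # Assumed scoring: the closer the entity is to the keywords, the higher the score.
            yield loc, 1.0 / abs(epos - kpos)


def aggregate(snippets):
    """Global aggregation: sum local scores per location over all snippets."""
    totals = defaultdict(float)
    for snippet in snippets:
        for loc, score in local_matches(snippet):
            totals[loc] += score
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


ranking = aggregate(SNIPPETS)
```

Under these assumptions, the location that is mentioned close to the keywords on more than one page ranks first, while the hotel's address, which sits far from the keywords, ranks last.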
COMPUTATIONAL CHALLENGES One approach is to first run a keyword search using the keywords in the query and then run entity search on the resulting documents. This is much like a sequential scan and is too slow to scale to the World Wide Web. If someone suggests a top-k pages approach instead, what would be an effective value of k for the web? Restricting the search to k pages would also affect global aggregation. So we need an effective mechanism for this, and that is: an index.
WHAT IS AN INDEX HERE Indexing is a pre-processing task for the application, and the indexes are used to answer users' queries fast. For example, an index list can look something like this:
Keyword    (Document, position)
Cowboy     (D10,12); (D12,34); (D46,257); ...
Stadium    (D10,13); (D34,134); (D146,357); ...
...        ...
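A positional inverted index of this kind can be built in a few lines. The following is a minimal sketch, assuming whitespace tokenization and lower-casing; the two documents in `DOCS` are hypothetical and the function name `build_inverted_index` is illustrative only.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lower-cased token to a list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for pos, token in enumerate(docs[doc_id].lower().split()):
            index[token].append((doc_id, pos))
    return index


# Hypothetical two-document corpus.
DOCS = {
    "D1": "Cowboy Stadium located in Arlington Texas",
    "D2": "the stadium hosts a cowboy themed event",
}

index = build_inverted_index(DOCS)
print(index["cowboy"])
print(index["stadium"])
```

Because the index is built once, ahead of query time, answering a keyword lookup becomes a dictionary access rather than a scan over all documents.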
INTRODUCING DUAL-INVERSION INDEX The authors propose two types of index for entity search: the Document-Inverted Index and the Entity-Inverted Index. Let's discuss them one by one.
DOCUMENT-INVERTED INDEX Process each document, identify the keywords and entities in it, and then create a list for each keyword holding the document ID and the position in the document. Basically, this index is a keyword/entity-to-document mapping. Mathematically, for keyword 'k' and document 'doc': D(k): k -> {(doc, pos) | doc.content(pos) = k}. So the lists will look something like:
D(cowboy):   D2,12   D6,17   D9,34    D9,357   D97,45
D(stadium):  D6,18   D9,35   D56,55   D64,5    D97,46
DOCUMENT-INVERTED INDEX, CONTINUED The entity-to-document mapping is slightly different from the keyword-to-document mapping, because an entity type can have different instance values in different documents. So mathematically: D(E): E -> {(doc, pos, e) | doc.content(pos) = e; e ∈ E}. In list view:
D(Location):  D6,23,'Arlington TX'   D9,45,'United States'   D97,50,'North Texas'   ...
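The two kinds of lists, D(k) for keywords and D(E) for entity types, can be sketched together in Python. This is an illustration only: the `GAZETTEER` dictionary is a hypothetical stand-in for a real entity extractor, tokenization is plain whitespace splitting, and the documents are made up.

```python
from collections import defaultdict

# Hypothetical gazetteer standing in for a real entity extractor.
GAZETTEER = {"Location": ["Arlington TX", "North Texas"]}


def build_d_inverted(docs):
    """Build D(k): k -> [(doc, pos)] and D(E): E -> [(doc, pos, instance)]."""
    keyword_lists = defaultdict(list)
    entity_lists = defaultdict(list)
    for doc_id in sorted(docs):
        tokens = docs[doc_id].split()
        # Keyword postings: one (doc, pos) entry per token occurrence.
        for pos, tok in enumerate(tokens):
            keyword_lists[tok.lower()].append((doc_id, pos))
        # Entity postings: also record which instance value was found.
        for etype, instances in GAZETTEER.items():
            for inst in instances:
                words = inst.split()
                for pos in range(len(tokens) - len(words) + 1):
                    if tokens[pos:pos + len(words)] == words:
                        entity_lists[etype].append((doc_id, pos, inst))
    return keyword_lists, entity_lists


# Hypothetical documents.
DOCS = {
    "D6": "Cowboy Stadium located in Arlington TX",
    "D9": "Cowboy Stadium located in North Texas",
}

keyword_lists, entity_lists = build_d_inverted(DOCS)
print(entity_lists["Location"])
```

The extra third field in each entity posting is what distinguishes D(E) from D(k): the list is keyed by the entity type, so the concrete instance has to travel with the posting.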
DI-INDEX -> IS IT EFFICIENT NOW? If we treat the list of each keyword and entity in the index as a relation, then we can write an equivalent SQL query as:
SELECT l.entity, SUM(lscore) AS score
FROM Cowboy c, Stadium s, Location l
WHERE c.dId = s.dId AND s.dId = l.dId -------
GROUP BY l.entity
HAVING score > threshold;
Issue here: the cost of the complex join. As the number of keywords and entities in the query increases, the join becomes more complex. So we still need improvement.
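The same join-then-aggregate logic can be sketched over in-memory posting lists. This is a simplified, document-level version under stated assumptions: the posting data is hypothetical, every matching entity occurrence contributes an assumed local score of 1, and any further join conditions elided on the slide are not modelled.

```python
from collections import defaultdict

# Hypothetical posting lists in the style of the D-inverted index.
cowboy = [("D6", 17), ("D9", 34), ("D97", 45)]
stadium = [("D6", 18), ("D9", 35), ("D97", 46)]
location = [
    ("D6", 23, "Arlington TX"),
    ("D9", 45, "United States"),
    ("D97", 50, "Arlington TX"),
]


def join_and_aggregate(keyword_lists, entity_list, threshold):
    """Join lists on document ID, then sum scores per entity instance."""
    # Documents containing every keyword (the c.dId = s.dId part of the join).
    common = {doc for doc, _ in keyword_lists[0]}
    for postings in keyword_lists[1:]:
        common &= {doc for doc, _ in postings}
    scores = defaultdict(float)
    for doc, _pos, instance in entity_list:
        if doc in common:
            scores[instance] += 1.0  # assumed local score of 1 per occurrence
    # Equivalent of HAVING score > threshold.
    return {e: s for e, s in scores.items() if s > threshold}


result = join_and_aggregate([cowboy, stadium], location, threshold=1.0)
print(result)
```

Each additional keyword adds another intersection over full posting lists, which is the join cost the slide points to.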
DI-INDEX -> DATA PARTITIONING Partition the document space into equal-sized parts. For example: 100 documents -> 10 partitions -> 10 documents in each partition. Each partition has a list of the keywords and entities it supports; find the partitions that support your query, and then you only have to perform the join among the documents in those partitions. So the entry for an entity in the index should now carry a partition number instead of the instance value (why?). D(Location): D6,23,P8 D9,45,P86 D97,50,P8 ...
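A rough sketch of the partition-pruning step described on this slide, in Python. The numbering scheme (document number divided by partition size) and the per-partition summary sets are assumptions made for illustration; they do not reproduce the P8/P86 labels in the example list, and the posting data is hypothetical.

```python
def partition_of(doc_num, docs_per_partition=10):
    """Assumed scheme: documents 0-9 go to partition 0, 10-19 to partition 1, ..."""
    return doc_num // docs_per_partition


def build_partition_summaries(postings_by_term, docs_per_partition=10):
    """For each partition, record the set of keywords/entity types it supports."""
    summaries = {}
    for term, postings in postings_by_term.items():
        for doc_num, _pos in postings:
            p = partition_of(doc_num, docs_per_partition)
            summaries.setdefault(p, set()).add(term)
    return summaries


def candidate_partitions(summaries, query_terms):
    """Only partitions that support every query term need to run the join."""
    needed = set(query_terms)
    return sorted(p for p, terms in summaries.items() if needed <= terms)


# Hypothetical postings keyed by term: (document number, position).
POSTINGS = {
    "cowboy": [(2, 12), (6, 17), (9, 34), (97, 45)],
    "stadium": [(6, 18), (9, 35), (56, 55), (64, 5), (97, 46)],
    "#location": [(6, 23), (9, 45), (97, 50)],
}

summaries = build_partition_summaries(POSTINGS)
parts = candidate_partitions(summaries, ["cowboy", "stadium", "#location"])
print(parts)
```

With this data, the partitions that hold only "stadium" postings are skipped entirely, so the join runs over two partitions instead of four.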
ENTITY-INVERTED INDEX In contrast to the DI-Index, here we map each keyword to entities while building the index. We store not only each keyword's position in the document but also the position of the nearby entity in whose context the keyword occurred. Mathematically: E(k): k -> {(o(doc, epos, entity), pos) | o.context[pos] = k; entity ∈ E}, which translates to: k appears at position 'pos' in the context of the entity occurrence o(doc, epos, entity).
EI-INDEX, CONTINUED Hence the layout of the index lists in this case will look something like:
E(cowboy):   ((D6,23,P8),17)    ((D9,45,P86),34)   ((D97,50,P8),45)   ...
E(Stadium):  ((D23,23,P8),18)   ((D9,45,P86),35)   ((D97,50,P8),46)   ...
(Notice that there is no list for Location, because it is an entity.) Here P is the partition number, explained in the next slide; the concept is analogous to DI-Index partitioning.
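An entity-inverted list can be sketched as follows. This is an illustrative Python sketch, not the paper's construction: the context is assumed to be a fixed window of tokens around the entity occurrence, the partition number P is omitted, and the documents and entity occurrences are hypothetical.

```python
from collections import defaultdict


def build_e_inverted(docs, occurrences, window=5):
    """Build E(k): k -> [((doc, epos, instance), pos)] from entity occurrences."""
    index = defaultdict(list)
    for doc_id, epos, instance in occurrences:
        tokens = docs[doc_id].split()
        end = epos + len(instance.split())  # first position after the entity
        lo, hi = max(0, epos - window), min(len(tokens), end + window)
        for pos in range(lo, hi):
            if epos <= pos < end:
                continue  # the entity's own tokens are not keywords of its context
            index[tokens[pos].lower()].append(((doc_id, epos, instance), pos))
    return index


def query(index, keywords):
    """Entity occurrences whose context contains every query keyword."""
    occ_sets = [{occ for occ, _ in index.get(k, [])} for k in keywords]
    return set.intersection(*occ_sets) if occ_sets else set()


# Hypothetical documents and pre-extracted #Location occurrences (doc, epos, instance).
DOCS = {
    "D6": "Cowboy Stadium located in Arlington TX",
    "D9": "Hilton hotel located in Dallas TX near a stadium",
}
OCCURRENCES = [("D6", 4, "Arlington TX"), ("D9", 4, "Dallas TX")]

e_index = build_e_inverted(DOCS, OCCURRENCES)
hits = query(e_index, ["cowboy", "stadium"])
print(hits)
```

Since every posting already names the entity occurrence it belongs to, the query intersects entity occurrences directly and no separate entity list has to be joined in.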
EI-INDEX PARTITIONING Here we partition on the basis of entities. Divide the entity space into equal-sized parts. So with 10 entities -> 10 partition nodes -> 1 entity per partition. Each entity node holds the list of keywords found in the context of that entity. It is faster than the DI-Index because the context matching has already been performed during index construction.
COMPARISON
              D-Inverted                 E-Inverted
Join          Fast (why?)                Faster (why?)
Aggregation   Central (why?)             Distributed (why?)
Space         Minimal overhead (why?)    Large (why?)
CAN BOTH INDEXES CO-EXIST? (DUAL-INVERSION INDEX) The answer is yes. It is advisable to create the E-Inverted index for entities that are queried more often and take less space, because it is faster, whereas the D-Inverted index should be created for entities that are queried less often but take a lot of space. This balances the space and time performance of the application.
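One way to picture this trade-off is a greedy selection under a space budget. The sketch below is an assumption made for illustration, not the selection procedure from the paper: the function `choose_indexes`, the frequency-per-space ranking, and the statistics in `STATS` are all hypothetical.

```python
def choose_indexes(entity_stats, space_budget):
    """Pick an index type per entity type.

    entity_stats: {entity_type: (query_frequency, e_inverted_size)}
    Entity types with the best frequency-per-space ratio get an E-Inverted
    index until the budget is used up; the rest fall back to D-Inverted.
    """
    ranked = sorted(
        entity_stats,
        key=lambda t: entity_stats[t][0] / entity_stats[t][1],
        reverse=True,
    )
    plan, used = {}, 0
    for etype in ranked:
        _freq, size = entity_stats[etype]
        if used + size <= space_budget:
            plan[etype] = "E-Inverted"
            used += size
        else:
            plan[etype] = "D-Inverted"
    return plan


# Hypothetical statistics: (queries per day, size of the E-Inverted lists).
STATS = {"#Location": (900, 40), "#Person": (500, 200), "#Phone": (50, 80)}

plan = choose_indexes(STATS, space_budget=100)
print(plan)
```

With these numbers only the frequently queried, compact entity type gets the faster index, and the large or rarely queried types are served from the D-Inverted index.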
SUMMARY D-Inverted maps keywords and entities to documents. It gives good performance using partitioning and takes minimal space. E-Inverted maps each keyword to the document and to the context of the entity in which it was found. It is faster than D-Inverted but requires a large amount of space. Both can co-exist in a system to balance performance.
Thank You

