Semantic Search Component
Enterprise Database Search Component (EDSC)
DATAFIELD CONTENT WEB LEXICAL SEMANTICS MEANING RELEVANCE CONTEXT POLYSEMY AMBIGUITY SYNONYM HOMONYM DATABASE FEDERATION HETEROGENEOUS LEGACY E-COMM WORD TERM TEXT SQL QUERY INFERENCE ENGINE COOPERATION SEARCH
Mario Flecha - 24 November 2005
The Problem
  • Search is one of the most used functions in information systems, but “searchability” is not improving at the same pace as data proliferates.
  • The number of databases, data volume, complexity, and time constraints keep growing, and users need cooperative software to find relevant information among a plethora of databases and “zillions” of records.
  • Public-facing, user-centric e-government solutions will make database search even more challenging, because the users will no longer be internal staff alone, and some users will be software.
  • Multilingual search is becoming more and more necessary for database search in a global economy and society.
  • The diversity of database schemas for structured data sources is an unsolved problem.
  • A general-purpose database search component is in high demand (Google, for instance, doesn’t have a solution for this problem).
What The Search Component Has To Provide...
  • Understand users’ vocabulary in different search domains (including, as far as possible, the ability to accommodate an individual’s language or speech preferences at a particular point in time, i.e. their idiolect).
  • Be independent of database semantics, database schema, data source format and technology, languages (both natural and computer languages), and user interface (web, non-web).
  • Ease database searching for all sorts of users (from proficient to unskilled, young to senior).
  • Overcome language barriers during search transactions.
  • Cope with lexical ambiguity such as polysemy, homonymy, and synonymy (disambiguation provided mostly by context identification).
What The Search Component Has To Provide...
  • Automatic and manual knowledge acquisition mechanisms.
  • High performance for multiple simultaneous user consultations (including multiple simultaneous queries per user) over multiple databases, in online or batch processing modes.
  • Low processing cost and economical disk storage.
  • Elimination of multiple indices and pre-defined queries on databases.
  • Finding answers by proximity (inclusion/exclusion of arguments).
  • Preemptive detection of high-cost queries (i.e. by selectivity factor).
  • Reuse and componentization.
  • A one-stop shopping point for all kinds of queries over databases.
Handle context to cope with... AMBIGUITY -> Polysemy, Homonymy, Synonymy
  • Polysemy: one term, various meanings (T -> M1, M2, ..., Mn).
  • Synonymy: one meaning, various terms (M -> T1, T2, ..., Tn).
  • Homonymy by phonetic convergence: terms T1 and T2, with meanings M1 and M2, come to have the same sound.
  • Homonymy by semantic divergence: a term T gives rise to T1 and T2, with meanings M1 and M2 and the same sound.
  • Passage to polysemy: T1 (meaning M1) and T2 (meaning M2) become a single term T with meanings M1 and M2.
Legend: T = Term, M = Meaning
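The term/meaning relations on this slide can be illustrated with a small many-to-many mapping. This is only a sketch: the `PAIRS` lexicon and the function names are hypothetical and do not come from the deck. A term with several meanings is ambiguous (polysemy or homonymy), and a meaning with several terms has synonyms.

```python
from collections import defaultdict

# Hypothetical toy lexicon of (term, meaning) pairs, for illustration only.
PAIRS = [
    ("bank", "financial institution"),
    ("bank", "river edge"),
    ("car", "motor vehicle"),
    ("automobile", "motor vehicle"),
]


def index_pairs(pairs):
    """Build both directions of the term <-> meaning relation."""
    meanings_of = defaultdict(set)
    terms_of = defaultdict(set)
    for term, meaning in pairs:
        meanings_of[term].add(meaning)
        terms_of[meaning].add(term)
    return meanings_of, terms_of


def ambiguous_terms(pairs):
    """Terms with more than one meaning (one T, various M)."""
    meanings_of, _ = index_pairs(pairs)
    return {term for term, meanings in meanings_of.items() if len(meanings) > 1}


def synonym_sets(pairs):
    """Meanings expressed by more than one term (one M, various T)."""
    _, terms_of = index_pairs(pairs)
    return {meaning: terms for meaning, terms in terms_of.items() if len(terms) > 1}
```

Telling polysemy from homonymy needs etymological or phonetic information that this mapping does not carry, so the sketch only detects that a term is ambiguous.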
Handle context to cope with... SEMANTIC CONSTELLATION (SEMANTIC FIELD)
Example constellation: TEACHING, ANALPHABETISM, STUDY, STUDENT, ANALPHABET, TEACH, KNOWLEDGE, EDUCATE, EDUCATION, ALPHABETIZE, ALPHABETIZING, LEARN, APPRENTICESHIP, APPRENTICE
CONTEXT TREATMENT
  • Context adds meaning because it brings restriction and closure to an ambiguous, polysemous environment.
  • Contextualization is the underlying weapon languages use to fight ambiguity and to bring precision and semantic relevance to linguistic events.
  • We have found two situations in which context must be properly processed:
      • Structured data.
      • Non-structured or semi-structured data.
CONTEXT TREATMENT
  • Both non-structured and semi-structured data require the pursuit of context. An effective way to do so is the semantic constellation approach.
  • To work properly with contextual information we have developed a technique and a tool: Content Semantics analysis (CS) and a Relational Inference Machine (RIM).
  • For structured data, context treatment relies on prior analysis to identify contextual information, which is then concatenated to the term to obtain a contextualized term.
  • CS is a method to identify and represent the lexical semantics that comes from the content of datafields or from terms obtained from unstructured data.
CONTEXT TREATMENT
  • Unstructured data requires a more complex approach, because context treatment has to be carried out automatically at run time.
  • It is based on the affinity between terms and on selecting the context with the highest probability over the others.
  • It requires a specialized lexicon that contains contextual information to support the disambiguation process.
  • Note: stoplists need to be provided in all cases.
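A minimal sketch of run-time context selection, under stated assumptions: the `LEXICON` (context name mapped to the terms of its semantic constellation), the `STOPLIST`, and the scoring rule are all hypothetical. The score is a plain overlap fraction standing in for the "highest probability of a context"; the deck does not say how the affinity is actually computed.

```python
# Hypothetical stoplist and specialized lexicon, for illustration only.
STOPLIST = {"the", "a", "of", "to", "in", "is"}

LEXICON = {
    "education": {"teach", "teaching", "student", "study", "learn", "knowledge"},
    "finance": {"bank", "loan", "interest", "account", "credit"},
}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stoplist words."""
    return [word for word in text.lower().split() if word not in STOPLIST]


def score_contexts(text, lexicon=LEXICON):
    """Fraction of the remaining tokens that belong to each context's constellation."""
    tokens = tokenize(text)
    if not tokens:
        return {}
    return {
        context: sum(token in terms for token in tokens) / len(tokens)
        for context, terms in lexicon.items()
    }


def best_context(text, lexicon=LEXICON):
    """Context with the highest score, or None when no token matches any context."""
    scores = score_contexts(text, lexicon)
    if not scores or max(scores.values()) == 0:
        return None
    return max(scores, key=scores.get)
```

For example, `best_context("interest on the loan")` picks `"finance"` because two of the three remaining tokens belong to that constellation, which is how an ambiguous word such as "interest" gets pinned to one context by its neighbours.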
CONTEXT TREATMENT
  • Alternate terms are delimited within the same context and differentiated in other contexts.
  • Users may use their preferred terms more freely, provided that the Search Component knows them.
  • High independence from natural language structure, database schema, user interface, and environment restrictions.
  • A general-purpose search component becomes possible once it is decoupled from the user interface and database layers.
Contextualizing Terms
Context Prefix * Term = Contextualized Term (a)
(a) Example: city*Seattle; state*WA; year*1998.
Beyond the prefix, the context database also keeps processing information for term treatment, such as phonetization and word breaking.
(Diagram label: Lexical Domain)
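The rule "Context Prefix * Term = Contextualized Term" can be sketched as follows. The `*` separator comes from the slide's example (city*Seattle); the class and function names, and the choice to lowercase the prefix, are assumptions made for illustration.

```python
from typing import NamedTuple

SEPARATOR = "*"  # separator used in the slide's example, e.g. city*Seattle


class ContextualizedTerm(NamedTuple):
    prefix: str  # the context, e.g. "city"
    term: str    # the raw term, e.g. "Seattle"

    def __str__(self) -> str:
        return f"{self.prefix}{SEPARATOR}{self.term}"


def contextualize(prefix: str, term: str) -> ContextualizedTerm:
    """Concatenate a context prefix and a term into a contextualized term."""
    if SEPARATOR in prefix:
        raise ValueError("context prefix must not contain the separator")
    # Assumption: prefixes are case-insensitive, terms are kept as typed.
    return ContextualizedTerm(prefix.strip().lower(), term.strip())


def parse(text: str) -> ContextualizedTerm:
    """Split 'prefix*term' back into its two parts (split at the first separator)."""
    prefix, sep, term = text.partition(SEPARATOR)
    if not sep:
        raise ValueError(f"not a contextualized term: {text!r}")
    return ContextualizedTerm(prefix, term)
```

With this, `str(contextualize("city", "Seattle"))` yields `city*Seattle`, and the same raw term under a different prefix becomes a different key, which is what separates the meanings of an ambiguous term.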
Overall Search Component Architecture
[Diagram] Elements recoverable from the figure:
  • User’s Application: sends a Question (set of questions) and receives an Answer (set of tuple Ids). The user could be a human or software.
  • RIM, with its auxiliary objects, relations, and composite views.
  • Contextualized Term (Semantic Knowledge Base / Ontology): term 1, term 2, term 3, ..., term K, linked to entries of the form Database / Instance / Term (e.g. Database X, Instance 1, Term 2; Database Y, Instance 5, Term 100; Database Z, Instance Z, Term K).
  • Mediators (three shown) and the Facts Databases X, Y, Z, each with Instances 1, 2, ..., N.
  • Flows: Knowledge Acquisition and Consultation, with Downward and Upward directions; Knowledge Acquisition Methods.
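One way to read the figure is as an inverted index from contextualized terms to tuple Ids, filled during knowledge acquisition and queried during consultation. The sketch below follows that reading only; the deck does not describe the RIM's internals, and `TupleId`, `TermIndex`, `acquire`, and `consult` are hypothetical names. The `exclude` argument illustrates the "inclusion/exclusion of arguments" requirement.

```python
from collections import defaultdict
from typing import NamedTuple


class TupleId(NamedTuple):
    database: str  # e.g. "X"
    instance: int  # instance of that database
    row: int       # identifier of the tuple inside the instance


class TermIndex:
    """Maps contextualized terms (e.g. 'city*Seattle') to sets of tuple Ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    def acquire(self, contextualized_term, tuple_id):
        """Knowledge acquisition: record that a tuple contains the term."""
        self._postings[contextualized_term].add(tuple_id)

    def consult(self, include, exclude=()):
        """Consultation: tuples matching every included term and no excluded term."""
        include = list(include)
        if not include:
            return set()
        result = set(self._postings.get(include[0], set()))
        for term in include[1:]:
            result &= self._postings.get(term, set())
        for term in exclude:
            result -= self._postings.get(term, set())
        return result


# Usage: the answer is a set of tuple Ids, possibly spanning several databases.
index = TermIndex()
index.acquire("city*Seattle", TupleId("X", 1, 10))
index.acquire("year*1998", TupleId("X", 1, 10))
index.acquire("city*Seattle", TupleId("Y", 5, 7))
index.acquire("year*2001", TupleId("Y", 5, 7))
answer = index.consult(["city*Seattle"], exclude=["year*2001"])
```

Because the answer holds only tuple Ids, fetching the actual rows would be left to whatever layer talks to each database (the Mediators in the figure).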
EXAMPLES: CITY NAME SEARCH
EXAMPLES: LOCATION SEARCH

| State | City | Kind of Street | Street’s Name | Quarter’s Name | Did CepDigital find? | Did Mediator find? |
|---|---|---|---|---|---|---|
| | | | Aníbal Matos * | São Pedro ** | N | SL |
| | | St | Aníbal Matos | São Pedro | N | SL |
| | | Street | Aníbal Matos | São Pedro | N | SL |
| | | Street | Professor Aníbal Matos | São Pedro | N | Y |
| | | Street | Professor Aníbal de Matos | São Pedro | N | Y |
| | | Avenue | Prof.Aníbal Matos | São Pedro | N | Y |
| MG | Belo Horizonte | Street | Professor Aníbal de Matos | Santo Antônio | N | Y |
| MG | Belo Horizonte | Street | Prof Anïbal de Matos | Santo Antônio | N | Y |
| MG | Belo Horizonte | St | Professor Aníbal de Matos | Santo Antônio | Y | Y |

MG Belo Horizonte Street Professor Aníbal de Matos  or S Antônio N N BL BL
MG Belo Horizonte Street Professor  or S Antônio BL BL BL SL
MG Belo Horizonte Street Anïbal  or S Antônio N SL
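The table's variants ("St" vs "Street", "Prof." vs "Professor", "Aníbal" vs "Anïbal", with or without "de") suggest why term treatment matters before matching. The following is a sketch of one such normalization step and not the Mediator's actual algorithm; the `ABBREVIATIONS` map and the `STOPLIST` of connectives are assumptions.

```python
import unicodedata

# Hypothetical abbreviation table and stoplist, for illustration only.
ABBREVIATIONS = {"st": "street", "prof": "professor", "av": "avenue"}
STOPLIST = {"de", "da", "do"}


def strip_accents(text):
    """Remove diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(text):
    """Lowercase, strip accents, break words, expand abbreviations, drop stopwords."""
    cleaned = strip_accents(text).lower().replace(".", " ")
    tokens = [ABBREVIATIONS.get(token, token) for token in cleaned.split()]
    return tuple(token for token in tokens if token not in STOPLIST)
```

Under these assumptions, "Prof.Aníbal Matos", "Professor Aníbal de Matos", and "Prof Anïbal de Matos" all normalize to the same token tuple, so an exact comparison of normalized forms would treat them as the same street name.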
CONCLUSION
