Semantic Search Component

Enterprise Database Search Component - EDSC
Mario Flecha - 24 November 2005
DATAFIELD CONTENT WEB LEXICAL SEMANTICS MEANING RELEVANCE CONTEXT POLYSEMY AMBIGUITY SYNONYM HOMONYM DATABASE FEDERATION HETEROGENEOUS LEGACY E-COMM WORD TERM TEXT SQL QUERY INFERENCE ENGINE COOPERATION SEARCH

The Problem
Search is one of the most used functions in information systems, but “searchability” is not improving at the same pace as data proliferates.
The number of databases, data volume, complexity, and time constraints are all growing, and users need cooperative software to find relevant information across a plethora of databases and “zillions” of records.
Public-facing, user-centric e-government solutions will make database search even more challenging, because the users will no longer be only internal staff, and some users will be software.
Multilingual search is becoming increasingly necessary for database search in a global economy and society.
The diversity of database schemas for structured data sources is an unsolved problem.
A general-purpose database search component is in high demand (Google, for instance, does not have a solution for this problem).

What The Search Component Has To Provide...
Understanding of users’ vocabulary in different search domains, including, as far as possible, the ability to accommodate an individual’s language or speech preferences at a particular point in time (idiolect).
Independence from database semantics, database schema, data source format and technology, languages (natural and computer), and user interface (web, non-web).
Easy database searching for all sorts of users (from proficient to unskilled, young to senior).
Overcoming language barriers during search transactions.
Coping with lexical ambiguity such as polysemy, homonymy, and synonymy (disambiguation is provided mostly by context identification).

What The Search Component Has To Provide...
Automatic and manual knowledge acquisition mechanisms.
High performance for simultaneous consultations by multiple users (including multiple simultaneous queries per user) over multiple databases, in online or batch processing modes.
Low processing cost and economical disk storage.
Elimination of multiple indices and pre-defined queries on databases.
Finding answers by proximity (inclusion/exclusion of arguments).
Preemptive detection of high-cost queries (i.e., by selectivity factor).
Reuse and componentization.
A one-stop-shop concept for all kinds of queries over databases.
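The requirement to detect high-cost queries before running them can be illustrated with a small selectivity estimate. This is only a sketch under stated assumptions: the function names, the per-term row counts, the independence assumption between terms, and the `max_rows` threshold are all hypothetical and are not taken from the slides.

```python
def selectivity(term_counts, total_rows, terms):
    """Estimate the fraction of rows matching ALL terms.

    Assumption (not from the slides): terms are statistically independent,
    so per-term fractions can simply be multiplied.
    """
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    estimate = 1.0
    for term in terms:
        # A term that was never seen matches no rows at all.
        estimate *= term_counts.get(term, 0) / total_rows
    return estimate


def is_high_cost(term_counts, total_rows, terms, max_rows=1000):
    """Flag a query whose estimated result size exceeds max_rows (hypothetical limit)."""
    estimated_rows = selectivity(term_counts, total_rows, terms) * total_rows
    return estimated_rows > max_rows


if __name__ == "__main__":
    # Hypothetical statistics: how many rows contain each term.
    counts = {"seattle": 5_000, "1998": 100_000}
    print(is_high_cost(counts, 1_000_000, ["1998"]))             # broad query
    print(is_high_cost(counts, 1_000_000, ["seattle", "1998"]))  # narrower query
```

A query with no arguments gets an estimate of 1.0 (every row), so it is flagged as expensive under this sketch.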

Handle context to cope with...
AMBIGUITY -> Polysemy, Homonymy, Synonymy
[Diagram. Legend: T = Term, M = Meaning]
POLYSEMY: one term, various meanings (T; M1, M2, ... Mn)
One meaning, various terms (synonymy) (M; T1, T2, ... Tn)
HOMONYMY BY PHONETIC CONVERGENCE: T1, T2; M1, M2 (same sound)
HOMONYMY BY SEMANTIC DIVERGENCE: T; T1, T2; M1, M2 (same sound)
PASSAGE TO POLYSEMY: T1 with M1, T2 with M2; T with M1, M2

Handle context to cope with...
SEMANTIC CONSTELLATION (SEMANTIC FIELD)
[Diagram of related terms: TEACHING, ANALPHABETISM, STUDY, STUDENT, ANALPHABET, TEACH, KNOWLEDGE, EDUCATE, EDUCATION, ALPHABETIZE, ALPHABETIZING, LEARN, APPRENTICESHIP, APPRENTICE]

CONTEXT TREATMENT
Context adds meaning because it brings restriction and closure to an ambiguous, polysemous environment.
Contextualization is the underlying weapon languages use to fight ambiguity and to bring precision and semantic relevance to linguistic events.
We have found two situations in which context must be properly processed:
Structured data.
Non-structured or semi-structured data.

CONTEXT TREATMENT
Both non-structured and semi-structured data require the pursuit of context. An effective way to do so is the semantic constellation approach.
To work properly with contextual information, we have developed a technique and a tool: content semantics analysis (CS) and a Relational Inference Machine (RIM).
For structured data, context treatment relies on prior analysis to identify contextual information, which is then concatenated to the term to obtain a contextualized term.
CS is a method for identifying and representing the lexical semantics that comes from the content of datafields or from terms obtained from unstructured data.

CONTEXT TREATMENT
Unstructured data requires a more complex approach so that context treatment can be carried out automatically at run time.
It is based on the affinity between terms and on choosing the context with the highest probability over the others.
It requires a specialized lexicon containing contextual information to support the disambiguation process.
Note: stoplists need to be provided in all cases.
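The run-time disambiguation idea (term affinity, a specialized lexicon, a stoplist, and picking the most probable context) can be sketched as follows. This is an illustration only, not the component's actual algorithm: the lexicon entries, the affinity weights, the stoplist, and the function names are all hypothetical.

```python
# Hypothetical stoplist: words ignored during context scoring.
STOPLIST = {"the", "a", "an", "of", "in", "at", "to", "on"}

# Hypothetical specialized lexicon:
# ambiguous term -> {context: {co-occurring term: affinity weight}}
LEXICON = {
    "bank": {
        "finance": {"account": 3, "loan": 3, "money": 2, "deposit": 2},
        "geography": {"river": 3, "water": 2, "shore": 2, "fishing": 1},
    },
}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stoplist words."""
    return [word for word in text.lower().split() if word not in STOPLIST]


def disambiguate(term, query):
    """Return (context, probability) for the most likely context of `term`.

    Each context is scored by summing the affinity weights of the other
    query terms; scores are normalized so they can be read as probabilities.
    Returns None when the term is unknown or no query term gives evidence.
    """
    contexts = LEXICON.get(term)
    if not contexts:
        return None
    tokens = tokenize(query)
    scores = {
        context: sum(weights.get(token, 0) for token in tokens)
        for context, weights in contexts.items()
    }
    total = sum(scores.values())
    if total == 0:
        return None
    best = max(scores, key=scores.get)
    return best, scores[best] / total


if __name__ == "__main__":
    print(disambiguate("bank", "fishing on the river bank"))
    print(disambiguate("bank", "loan at the bank"))
```

When no surrounding term carries any affinity, the sketch declines to guess rather than defaulting to an arbitrary context.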

CONTEXT TREATMENT
Alternate terms take shape within the same context and are differentiated in other contexts.
Users may use their preferred terms more freely, provided that the Search Component knows them.
High independence from natural language structure, database schema, user interface, and environment restrictions.
A general-purpose search component becomes possible once it is decoupled from the user interface and database layers.

Contextualizing Terms
Context Prefix * Term = Contextualized Term
(a) Example: city*Seattle; state*WA; year*1998.
The context database, beyond the prefix, keeps processing information for term treatment, such as phonetization, word breaking, etc.
Lexical Domain
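The rule "Context Prefix * Term = Contextualized Term" can be sketched in a few lines. Only the `*` separator and the `city*Seattle` / `year*1998` examples come from the slide; the type, the function names, and the idea of using field names as prefixes are assumptions made for illustration.

```python
from typing import NamedTuple

SEPARATOR = "*"  # taken from the slide's examples, e.g. city*Seattle


class ContextualizedTerm(NamedTuple):
    """A term qualified by its context prefix (hypothetical type)."""
    prefix: str
    term: str

    def __str__(self):
        return f"{self.prefix}{SEPARATOR}{self.term}"


def contextualize(prefix, term):
    """Concatenate a context prefix and a term into a contextualized term."""
    prefix, term = prefix.strip(), term.strip()
    if not prefix or not term:
        raise ValueError("prefix and term must be non-empty")
    if SEPARATOR in prefix:
        raise ValueError("prefix must not contain the separator")
    return ContextualizedTerm(prefix, term)


def parse(text):
    """Split 'prefix*term' at the first separator."""
    prefix, sep, term = text.partition(SEPARATOR)
    if not sep:
        raise ValueError(f"missing {SEPARATOR!r} in {text!r}")
    return contextualize(prefix, term)


def contextualize_record(record):
    """Assumption: for structured data, the field name serves as the prefix."""
    return [str(contextualize(field, str(value))) for field, value in record.items()]


if __name__ == "__main__":
    print(contextualize_record({"city": "Seattle", "year": 1998}))
```

Splitting at the first separator keeps terms that themselves contain `*` intact, at the cost of forbidding the separator inside prefixes.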

Overall Search Component Architecture
[Architecture diagram. Recoverable labels:]
User * (* the user could be a human or software); User’s Application
Downward: Question (set of questions); Upward: Answer (set of tuple IDs)
Consultation; Knowledge Acquisition; Knowledge Acquisition Methods
RIM; RIM’s Auxiliary Objects; Relations and composite Views of RIM
Contextualized Term (Semantic Knowledge Base); Contextualized Term (Ontology): term 1, term 2, term 3, term 10, term 30, term 100, term 1000, ... term K
Mediator (three instances); Facts Databases X, Y, Z... and Instances (Instance 1, Instance 2, ... Instance N)
Relations shown: Database X Instance 1 Term 2; Database X Instance 1 Term 3; Database X Instance 2 Term 90; Database X Instance 1000 Term 10; Database Y Instance 5 Term 100; Database Y Instance 3 Term 100; Database L Instance 2 Term 2000; Database Z Instance Z Term K
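The relations in the diagram pair a contextualized term with a database and an instance, and the answer to a question is a set of tuple IDs. A minimal sketch of that lookup is an inverted index from contextualized term to tuple references. This is an assumption about how such a mapping could be held in memory, not a description of the RIM itself; `TupleRef`, `TermIndex`, `acquire`, and `consult` are hypothetical names.

```python
from collections import defaultdict
from typing import NamedTuple


class TupleRef(NamedTuple):
    """Where a fact lives: database, instance, and tuple ID (hypothetical type)."""
    database: str
    instance: int
    tuple_id: int


class TermIndex:
    """Inverted index: contextualized term -> set of tuple references."""

    def __init__(self):
        self._postings = defaultdict(set)

    def acquire(self, contextualized_term, ref):
        """Knowledge acquisition: record that a term occurs in a tuple."""
        self._postings[contextualized_term].add(ref)

    def consult(self, question):
        """Consultation: return the refs that match every term in the question."""
        posting_sets = [self._postings.get(term, set()) for term in question]
        if not posting_sets:
            return set()
        return set.intersection(*posting_sets)


if __name__ == "__main__":
    index = TermIndex()
    index.acquire("city*Seattle", TupleRef("X", 1, 10))
    index.acquire("year*1998", TupleRef("X", 1, 10))
    index.acquire("city*Seattle", TupleRef("Y", 2, 7))
    print(index.consult(["city*Seattle", "year*1998"]))
```

Because the index stores only references, the facts databases themselves are touched only when the caller fetches the returned tuples.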

EXAMPLES: LOCATION SEARCH
State | City | Kind of Street | Street’s Name | Quarter’s Name | Did CepDigital find? | Did Mediator find?
 |  |  | Aníbal Matos * | São Pedro ** | N | SL
 |  | St | Aníbal Matos | São Pedro | N | SL
 |  | Street | Aníbal Matos | São Pedro | N | SL
 |  | Street | Professor Aníbal Matos | São Pedro | N | Y
 |  | Street | Professor Aníbal de Matos | São Pedro | N | Y
 |  | Avenue | Prof.Aníbal Matos | São Pedro | N | Y
MG | Belo Horizonte | Street | Professor Aníbal de Matos | Santo Antônio | N | Y
MG | Belo Horizonte | Street | Prof Anïbal de Matos | Santo Antônio | N | Y
MG | Belo Horizonte | St | Professor Aníbal de Matos | Santo Antônio | Y | Y
MG Belo Horizonte Street Professor Aníbal de Matos  or S Antônio N N BL BL
MG Belo Horizonte Street Professor  or S Antônio BL BL BL SL
MG Belo Horizonte Street Anïbal  or S Antônio N SL
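The street-name variants in these examples (St / Street, Prof / Professor, Anïbal / Aníbal, with or without "de") suggest that terms are normalized before matching. The sketch below shows one way to do that with accent folding, abbreviation expansion, and a stoplist. It is an assumption for illustration, not the Mediator's actual term treatment, and the abbreviation table and stoplist are hypothetical.

```python
import re
import unicodedata

# Hypothetical abbreviation table and stoplist for address terms.
ABBREVIATIONS = {"st": "street", "av": "avenue", "prof": "professor"}
STOPLIST = {"de", "da", "do"}


def fold_accents(text):
    """Remove diacritics: decompose characters, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(text):
    """Return a comparison key: folded, lowercased, expanded, stopwords removed."""
    words = re.findall(r"[a-z0-9]+", fold_accents(text).lower())
    expanded = [ABBREVIATIONS.get(word, word) for word in words]
    return " ".join(word for word in expanded if word not in STOPLIST)


if __name__ == "__main__":
    variants = [
        "Professor Aníbal de Matos",
        "Prof Anïbal de Matos",
        "Prof.Aníbal Matos",
    ]
    # All variants should collapse to the same key.
    print({normalize(variant) for variant in variants})
```

Matching on the normalized key lets a single stored entry answer several spellings of the same street name.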
