Text-similarity algorithm for the semantic annotation of Web Services

SWAP research group - 27 July 2010
Michele Filannino, @bronko85

Outline
The problem
Reference scenario
Similarity

SAWA
Word-to-word similarity
Text-to-text similarity

Experimental results
Quality of results
Execution time

Future developments
Demo session

The problem
How can the similarity between two texts be measured?

Reference scenario
[Diagram: WSDL file with natural language descriptions -> CODEArchitects Annotation Tool -> SAWSDL file with suggested annotations to approve/reject]

Semantic similarity
Assigning a meaning-based similarity metric to a set of terms and/or documents;

Similarity ≠ relatedness:
"Bank" and "money" are related although they are not similar at all;

Similarity ⇒ relatedness:
"Girl" and "maiden" are similar and therefore also related.

Semantic similarity in SWOP

WS concepts:
- RequestOrder
- Order
- BillingInformation
- ...

Ontology concepts:
- Order
- OrderNumber
- OrderID
- BillID
- BillReference
- BusinessFirm
- Product
- Catalog
- ...

Computational cost

Example:
Ontology with 1,200 concepts
WSDL with 15 annotations
1,200 x 15 = 18,000 runs of SAWA :(
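
The figure above is simply the number of (annotation, concept) pairs, since each pair needs one text-to-text comparison. A minimal Python sketch of that arithmetic; the function name `sawa_runs` is hypothetical and not part of SAWA:

```python
def sawa_runs(ontology_concepts: int, wsdl_annotations: int) -> int:
    """Number of text-to-text comparisons when every WSDL annotation
    is compared against every ontology concept (one run per pair)."""
    return ontology_concepts * wsdl_annotations


print(sawa_runs(1200, 15))  # 18000
```

The cost grows linearly in both the ontology size and the number of annotations.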

SAWA
Similarity Algorithm Wikipedia-bAsed

Word-to-word similarity

Given two words, determine how similar they are;
Types of algorithms for computing the similarity between words:
Corpus-based: pointwise mutual information, latent semantic analysis;

Hierarchy-based: Leacock & Chodorow, Lesk, Wu & Palmer, Resnik, Lin, Jiang & Conrath;

Input: two words;
Output: a score between 0 and 1.
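
As an illustration of the corpus-based family, pointwise mutual information compares how often two words co-occur with how often they would co-occur by chance. The sketch below uses made-up counts; the function name `pmi` and the numbers are assumptions for illustration only. Raw PMI is not bounded to the 0..1 range, so a real tool would still need a normalization step.

```python
import math


def pmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Pointwise mutual information: log2( p(x, y) / (p(x) * p(y)) )."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical counts over 1,000 text windows.
print(pmi(count_xy=20, count_x=50, count_y=40, total=1000))
```

A positive value means the two words co-occur more often than independence would predict; a value near zero means they look independent.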

Word-to-word similarity tool

Library used: LinguaTools DISCO;
Uses Wikipedia as its concept hierarchy
202,578 concepts;

Updated as of 1 January 2008

Uses Lin's algorithm to compute similarity.

Examples

Tiger, lion = 90%
Doctor, nurse = 70%
Stock, market = 47%
Love, sex = 46%
FBI, investigation = 35%
Professor, cucumber = 0.006%

Quality of the algorithm
Dataset used to measure quality: WordSim353;
Correlation coefficients (Pearson):
Wikipedia: 0.574;

BNC: 0.415;

PubMed: 0.105;
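
Evaluating against WordSim353 amounts to correlating the algorithm's scores with human similarity ratings for the same word pairs. A minimal sketch of the Pearson coefficient in plain Python follows; the ratings and scores are made-up placeholders, not WordSim353 data, and `pearson` is a hypothetical helper name.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


human = [7.5, 6.0, 3.1, 0.5]        # hypothetical human ratings
system = [0.90, 0.70, 0.35, 0.01]   # hypothetical algorithm scores
print(round(pearson(human, system), 3))
```

A coefficient close to 1 means the algorithm ranks and spaces the word pairs much as the human annotators did.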


Text-to-text similarity
Given two texts, determine how similar they are;
A suitable extension of word-to-word similarity algorithms;
Removal of words (stopwords) with:
low discriminative power;

high frequency of occurrence;

Input: two texts;
Output: a score between 0 and 1.

Stopwords

"Returns the first and last name of each customer who is categorized as an individual consumer"

After stopword removal:

"name customer categorized individual consumer"
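
The transformation above can be sketched in a few lines of Python. The stopword list here is hypothetical, chosen only so that the slide's example is reproduced; the list actually used by SAWA is not given in the source.

```python
# Hypothetical stopword list, chosen only to reproduce the slide's example;
# the list actually used by SAWA is not given in the source.
STOPWORDS = {"returns", "the", "first", "and", "last", "of", "each",
             "who", "is", "as", "an"}


def remove_stopwords(text: str) -> str:
    """Lowercase, split on whitespace, and drop tokens found in STOPWORDS."""
    tokens = text.lower().split()
    return " ".join(t for t in tokens if t not in STOPWORDS)


description = ("Returns the first and last name of each customer "
               "who is categorized as an individual consumer")
print(remove_stopwords(description))
# name customer categorized individual consumer
```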

Corley & Mihalcea algorithm (2005)
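
The slide does not reproduce the formula. One common formulation of this kind of measure scores each word by its best match in the other text, weights it by its specificity (idf), and averages the two directions. The sketch below follows that idea under stated assumptions: `toy_sim`, the similarity table and the idf values are placeholders, not DISCO output and not SAWA's actual code.

```python
from typing import Callable, Dict, List


def directional(src: List[str], dst: List[str],
                word_sim: Callable[[str, str], float],
                idf: Dict[str, float]) -> float:
    """idf-weighted average of each source word's best match in dst."""
    weight = sum(idf.get(w, 1.0) for w in src)
    if weight == 0 or not dst:
        return 0.0
    best = sum(max(word_sim(w, v) for v in dst) * idf.get(w, 1.0) for w in src)
    return best / weight


def text_similarity(t1: List[str], t2: List[str],
                    word_sim: Callable[[str, str], float],
                    idf: Dict[str, float]) -> float:
    """Symmetric score; stays in [0, 1] when word_sim is in [0, 1]."""
    return 0.5 * (directional(t1, t2, word_sim, idf)
                  + directional(t2, t1, word_sim, idf))


# Toy word-to-word similarity: 1.0 for identical words, table lookup otherwise.
TOY = {frozenset(("customer", "consumer")): 0.8}


def toy_sim(a: str, b: str) -> float:
    return 1.0 if a == b else TOY.get(frozenset((a, b)), 0.0)


idf = {"name": 1.0, "customer": 2.0, "consumer": 2.0}  # hypothetical weights
print(text_similarity(["name", "customer"], ["name", "consumer"], toy_sim, idf))
```

Both inputs are assumed to be token lists with stopwords already removed. Averaging the two directions keeps the score symmetric in its arguments.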

Optimizations (v1.2)

Caching of each term's frequency;
Caching of the similarities between terms;
Incremental learning;
Fewer accesses to DISCO;

Execution time reduced by up to about 10 times;

Experimental results
Quality and execution time

SELECTED WSDL DOCUMENT DESCRIPTION:
"returns the first and last name of each customer who is categorized as an individual consumer"

RANKING OF SIMILAR ONTOLOGY CONCEPTS (with their scores):
*---------------------------------------------------------------------------------------------------------------*
| Description | Score |
*---------------------------------------------------------------------------------------------------------------*
| name: name of customer | 62,85% |
| customer: Current customer individual information | 56,91% |
| customeraddress: Customer address | 42,36% |
| customercredicard: Customer credit card information | 35,08% |
| salesreason: Reasons why a customer may purchase a particular product. | 30,35% |
| customerstore:Stores of our Company (customer and resellers). | 17,31% |
| salesorderdetail: Product details associated with a specific sales order. | 2,99% |
| productinventory: Product inventory information. | 2,59% |
| salesrepresentativeperson: Contains current sales information for the sales representative persons. | 2,39% |
| productlocation: Product manufacturing locations | 2,36% |
| salestaxrate: Sales Tax rate. | 2,36% |
| salesterritory: Sales territory. | 2,22% |
| employeeaddress: Employee information such as salary, department, and title. | 2,18% |
| product: Products sold or used in the manfacturing of sold products. | 2,12% |
| enterpricedepartment: Departments of Enterprise | 2,00% |
| salesspecialoffer: Sales Special Offer (discounts). | 1,99% |
| productlistpricehistory: Changes in the list price of a product over time. | 1,80% |
| shipmethod: Shipping methods. | 1,79% |
| salesorder: General sales order information (header). | 1,76% |
| productdocument: Product Document | 1,73% |
| productcosthistory: Changes in the cost of a product over time. | 1,68% |
| productbillofmaterials: Bill Of Materials are items required to make products and product subassembl | 1,61% |
| productmodel: Product model classification. | 1,48% |
| currencyrate: Currency exchange rates. | 1,40% |
| salesshoppingcartitem: Contains shopping cart items until the order is submitted or cancelled. | 1,29% |
| productcategory: High-level product categorization. | 1,27% |
| addresstype: Types of addresses | 0,95% |
| unitmeasure: Unit of measure. | 0,80% |
| currency: Standard ISO currencies. | 0,51% |
| countryregion: ISO standard codes for countries and regions. | 0,51% |
| stateprovince: States and provinces | 0,12% |
*---------------------------------------------------------------------------------------------------------------*
Time elapsed: 9.4 seconds.

"lists the names and addresses of all individual customers"


"returns the name of each customer that is categorized as a store"


Execution time

#   Optimized   Not optimized
3   1.0 s       9.4 s
6   1.7 s       9.8 s
5   2.7 s       18.1 s
7   3.6 s       21.8 s
2   3.9 s       15.5 s
8   5.6 s       23.1 s
1   6.2 s       14.3 s
4   9.4 s       39.4 s
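
The speedup for each row is the non-optimized time divided by the optimized time. A small sketch of that arithmetic over the figures above; the dictionary layout is just for illustration.

```python
# (optimized seconds, non-optimized seconds), keyed by the row number on the slide
timings = {3: (1.0, 9.4), 6: (1.7, 9.8), 5: (2.7, 18.1), 7: (3.6, 21.8),
           2: (3.9, 15.5), 8: (5.6, 23.1), 1: (6.2, 14.3), 4: (9.4, 39.4)}

# Speedup = non-optimized time / optimized time
speedups = {row: slow / fast for row, (fast, slow) in timings.items()}

for row, factor in sorted(speedups.items()):
    print(f"{row}: {factor:.1f}x")
```

On these figures the speedup ranges from about 2.3x (row 1) to 9.4x (row 3).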

Future developments
Near-term and longer-term

Near-term:
Web Service interface

Web interface (free of charge)

Network interface

Scientific dissemination

Others:
Introduction of thresholds to improve performance

Release of the source code under an open-source license
