Diachronic Analysis of the Italian Language exploiting Google Ngram

Hello!
Pierpaolo Basile
Annalina Caputo
Roberta Luisi
Giovanni Semeraro
Department of Computer Science
University of Bari Aldo Moro - Italy

Background
TRI
P. Basile, A. Caputo, G. Semeraro. Temporal random indexing: A system for analysing word meaning over time. IJCoL vol. 1: Emerging Topics at the First Italian Conference on Computational Linguistics.
[Figure: a corpus with temporal information and a dictionary of random vectors are the inputs of Temporal Random Indexing, which produces several WordSpaces]
▪ Several WordSpaces for several time periods
▪ Word vectors are comparable across WordSpaces
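The two points above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the vector dimension, the number of non-zero entries and the toy co-occurrence counts are all assumptions made for illustration. Each term receives one sparse random vector that is shared by every time period, and the vector of a word in a given period is the weighted sum of the random vectors of the terms it co-occurs with in that period. Sharing the random vectors is what keeps vectors from different WordSpaces comparable.

```python
import numpy as np

def make_random_vectors(vocabulary, dim=200, nonzero=10, seed=42):
    """Assign one sparse ternary random vector to each term (shared by all periods)."""
    rng = np.random.default_rng(seed)
    vectors = {}
    for term in sorted(vocabulary):
        vec = np.zeros(dim)
        positions = rng.choice(dim, size=nonzero, replace=False)
        vec[positions] = rng.choice([-1.0, 1.0], size=nonzero)
        vectors[term] = vec
    return vectors

def build_word_space(cooccurrences, random_vectors):
    """Build one WordSpace from the co-occurrence counts of a single time period."""
    dim = len(next(iter(random_vectors.values())))
    space = {}
    for (word, context), count in cooccurrences.items():
        current = space.get(word, np.zeros(dim))
        # Add the random vector of the co-occurring term, weighted by its count.
        space[word] = current + count * random_vectors[context]
    return space

# Toy co-occurrence counts per time period (hypothetical data).
periods = {
    1980: {("surf", "wave"): 5, ("surf", "sea"): 3},
    2000: {("surf", "web"): 6, ("surf", "wave"): 1},
}
vocabulary = {term for counts in periods.values() for pair in counts for term in pair}
random_vectors = make_random_vectors(vocabulary)
word_spaces = {year: build_word_space(counts, random_vectors)
               for year, counts in periods.items()}
```

Because both WordSpaces are built from the same `random_vectors`, the vector of "surf" in 1980 can be compared directly with its vector in 2000.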

Motivation 1
Detect meaning shift
Marty, in 2015 people will surf on the web!!!

Motivation 1
Surf!?!?! On the web!?!?!?
surf the Net/Internet: to use the Internet
When was this meaning introduced?

Motivation 2
Large corpus
▪ Build a method for computing TRI on a very large corpus
▪ Google Ngram for the Italian language
▫n-grams (up to 5-grams) extracted from Google Books
▫over five million books spanning the years from 1500 to 2012
▪ covers several languages including Italian
Example record: N-gram "analysis is often described as", year 1991, 104 occurrences, 5 books
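A record like the one above can be read with a few lines of Python. The sketch assumes the four-field, tab-separated layout of the version 2 Google Books Ngram files (n-gram, year, match count, volume count); the function names are hypothetical, and the second sample line is invented to show how counts from the same 10-year period add up.

```python
from collections import defaultdict

def parse_record(line):
    """Split one tab-separated record into (tokens, year, occurrences, books)."""
    ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
    return ngram.split(" "), int(year), int(match_count), int(volume_count)

def decade_of(year):
    """Map a year to the first year of its 10-year period, e.g. 1991 -> 1990."""
    return year - year % 10

def counts_by_decade(lines):
    """Sum the occurrences of each n-gram within each 10-year period."""
    counts = defaultdict(int)
    for line in lines:
        tokens, year, occurrences, _books = parse_record(line)
        counts[(tuple(tokens), decade_of(year))] += occurrences
    return dict(counts)

sample = [
    "analysis is often described as\t1991\t104\t5\n",
    "analysis is often described as\t1995\t20\t3\n",  # hypothetical record
]
totals = counts_by_decade(sample)
```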

Methodology
1. Run TRI on the Italian Google Ngram
▫build a WordSpace for each time period (10 years)
2. Build a time series for each word
3. Search for significant changes in the time series

Time Series
Several time series Γ at the time interval k:
▪ log frequency: word frequency in each time period k
▪ point-wise cossim: cosine similarity between word vectors across two time periods
▪ cumulative cossim: cosine similarity that considers a cumulative vector of the previous k-1 time periods
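The three time series can be sketched in Python as follows. The slide does not preserve the exact formulas, so several details here are assumptions: the point-wise series compares period k-1 with period k, the cumulative vector is the plain sum of the vectors of the earlier periods, and the log frequency uses raw counts with add-one smoothing. All function names are hypothetical.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / norm) if norm > 0 else 0.0

def log_frequency_series(counts):
    """Log of the word's count in each period (add-one smoothing is an assumption)."""
    return [float(np.log(c + 1)) for c in counts]

def pointwise_series(vectors):
    """Cosine similarity between the word's vectors in periods k-1 and k."""
    return [cosine(vectors[k - 1], vectors[k]) for k in range(1, len(vectors))]

def cumulative_series(vectors):
    """Cosine similarity between the sum of the vectors of periods 0..k-1 and period k."""
    series = []
    cumulative = np.zeros_like(vectors[0], dtype=float)
    for k in range(1, len(vectors)):
        cumulative = cumulative + vectors[k - 1]
        series.append(cosine(cumulative, vectors[k]))
    return series

# Toy vectors of one word over four periods (hypothetical data).
vectors = [np.array([1.0, 0.0]), np.array([1.0, 0.0]),
           np.array([0.0, 1.0]), np.array([1.0, 0.0])]
pointwise = pointwise_series(vectors)
cumulative = cumulative_series(vectors)
```

In this toy case the last point-wise value is 0, while the last cumulative value stays high because the accumulated history still points in the same direction as the final vector.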

Change point detection
▪ Mean shift of Γ pivoted at time period j
▪ Search for statistically significant mean shifts
▫bootstrapping approach under the null hypothesis that there is no change in the meaning
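A minimal sketch of this test is given below; it is one possible reading of the slide, not the authors' code. The mean shift at pivot j is taken as the mean of the series from j onward minus the mean before j. The null distribution is approximated by shuffling the series, and the p-value is the fraction of shuffled series whose absolute mean shift at the same pivot is at least as large as the observed one. The two-sided comparison, the number of samples and the function names are assumptions.

```python
import numpy as np

def mean_shift(series, j):
    """Mean of the series from period j onward minus the mean before period j."""
    series = np.asarray(series, dtype=float)
    return float(series[j:].mean() - series[:j].mean())

def detect_change_point(series, samples=1000, seed=0):
    """Return (pivot, p_value) for the pivot with the lowest estimated p-value."""
    rng = np.random.default_rng(seed)
    series = np.asarray(series, dtype=float)
    pivots = range(1, len(series))
    observed = np.array([mean_shift(series, j) for j in pivots])
    exceed = np.zeros(len(observed))
    for _ in range(samples):
        # Shuffling destroys any temporal structure: this is the "no change" hypothesis.
        shuffled = rng.permutation(series)
        shifts = np.array([mean_shift(shuffled, j) for j in pivots])
        exceed += np.abs(shifts) >= np.abs(observed)
    p_values = exceed / samples
    best = int(np.argmin(p_values))
    return best + 1, float(p_values[best])

# A similarity series that drops after the fourth period (hypothetical data).
series = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
pivot, p_value = detect_change_point(series)
```

A pivot would then be reported as a change point only when its p-value falls below a chosen significance threshold.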

Evaluation
Dataset
Build a benchmark for meaning shift detection for the Italian language
▪ extract a set of words by pooling the output of several system settings
▪ find the correct change points in a dictionary (Sabatini Coletti/Etimologico Zanichelli)

Evaluation
Results
Method      Accuracy
TRI_point   0.3086
TRI_cum     0.2963
TRR1_point  0.2716
log freq    0.2346
TRR2_point  0.1728
TRR1_cum    0.1605
TRR2_cum    0.1235
Accuracy: a prediction counts as correct when the year predicted by the system is equal to or later than the year reported in the gold standard.
TRR1 and TRR2 are variants of TRI based on Reflective Random Indexing.
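The accuracy criterion can be written down as a short function. This is a sketch under stated assumptions: the words and years are placeholders, the function name is hypothetical, and a word for which the system predicts no change point is counted as incorrect.

```python
def accuracy(predicted, gold):
    """Fraction of gold words whose predicted year is equal to or later than the gold year."""
    correct = 0
    for word, gold_year in gold.items():
        predicted_year = predicted.get(word)
        # A missing prediction is treated as an error (assumption).
        if predicted_year is not None and predicted_year >= gold_year:
            correct += 1
    return correct / len(gold) if gold else 0.0

# Placeholder words and years (hypothetical data).
gold = {"parola_a": 1950, "parola_b": 1900, "parola_c": 1980}
predicted = {"parola_a": 1960, "parola_b": 1890}
score = accuracy(predicted, gold)
```

Here only `parola_a` is counted as correct: `parola_b` is predicted before the gold year and `parola_c` has no prediction.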

Conclusions and Future Work
▪ TRI with point-wise detection gives the best accuracy (0.3086)
▫it outperforms the baseline based on log-frequency (0.2346)
▪ We provide a benchmark for the evaluation of meaning shifts for the Italian language
▪ Future work: extend the dataset and provide an evaluation for the English language

Thanks!!
Any questions?
pierpaolo.basile@gmail.com
https://github.com/pippokill/tri
