Diachronic Analysis of the Italian Language exploiting Google Ngram
Hello!
Pierpaolo Basile
Annalina Caputo
Roberta Luisi
Giovanni Semeraro
Department of Computer Science
University of Bari Aldo Moro - Italy
Background
TRI
P. Basile, A. Caputo, G. Semeraro. Temporal random indexing: A system for analysing word meaning over time.
IJCoL vol. 1: Emerging Topics at the First Italian Conference on Computational Linguistics.
[Figure: a corpus with temporal information and a dictionary of random vectors feed Temporal Random Indexing, which produces one WordSpace per time period]
▪ Several WordSpaces for several time periods
▪ Word vectors are comparable across WordSpaces
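A minimal sketch of why the vectors stay comparable, assuming the usual Random Indexing recipe: every term gets one fixed sparse random vector, and a word's vector in a period is the sum of the random vectors of the terms it co-occurs with in that period. Because the random vectors are shared, all WordSpaces live in the same space. The function names, DIM, NONZERO, the window size and the toy corpus are illustrative assumptions, not taken from the TRI code.

```python
import random
from collections import defaultdict

DIM = 200      # size of the reduced space (illustrative value)
NONZERO = 10   # number of +1/-1 entries per random vector (illustrative value)


def random_vector(term, dim=DIM, nonzero=NONZERO):
    """Sparse ternary random vector, deterministic for a given term."""
    rng = random.Random(term)  # seeding by term: same vector in every period
    vec = [0.0] * dim
    for i, pos in enumerate(rng.sample(range(dim), nonzero)):
        vec[pos] = 1.0 if i % 2 == 0 else -1.0
    return vec


def build_word_space(sentences, window=2):
    """Sum the random vectors of the terms co-occurring with each word."""
    space = defaultdict(lambda: [0.0] * DIM)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    rv = random_vector(tokens[j])
                    space[word] = [a + b for a, b in zip(space[word], rv)]
    return dict(space)


# Invented toy corpus, one list of tokenised sentences per time period.
corpus_by_period = {
    1980: [["surf", "the", "waves"], ["surf", "the", "sea"]],
    2000: [["surf", "the", "web"], ["surf", "the", "internet"]],
}
spaces = {p: build_word_space(s) for p, s in corpus_by_period.items()}
```

Since "surf" in 1980 and "surf" in 2000 are both sums of the same shared random vectors, they can be compared directly, for instance with cosine similarity.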
Motivation 1
Detect meaning shift
"Marty, in 2015 people will surf on the web!!!"
"Surf!?!?! On the web!?!?!?"
surf the Net/Internet: to use the Internet
When was this meaning introduced?
Motivation 2
Large corpus
▪ Build a method for computing TRI relying on a very large corpus
▪ Google Ngram for the Italian language
▫n-grams (up to 5-grams) extracted from Google Books
▫over five million books spanning the years from 1500 to 2012
▪ covers several languages including Italian
Example record: n-gram "analysis is often described as", year 1991, 104 occurrences, 5 books
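A short sketch of how such records could be read and grouped into time periods. It assumes the tab-separated layout of the public Google Books Ngram files (n-gram, year, occurrences, books) and 10-year periods aligned on decade boundaries; the names NgramRecord, parse_record and period_start are hypothetical.

```python
from collections import Counter, namedtuple

NgramRecord = namedtuple("NgramRecord", "ngram year occurrences books")


def parse_record(line):
    """Parse one tab-separated line: ngram, year, occurrences, books."""
    ngram, year, occurrences, books = line.rstrip("\n").split("\t")
    return NgramRecord(ngram, int(year), int(occurrences), int(books))


def period_start(year, span=10):
    """First year of the period that contains `year` (assumed alignment)."""
    return year - year % span


# The single record shown on the slide, written as a raw input line.
lines = ["analysis is often described as\t1991\t104\t5\n"]

# Occurrences of each n-gram, summed per time period.
counts = Counter()
for line in lines:
    rec = parse_record(line)
    counts[(rec.ngram, period_start(rec.year))] += rec.occurrences
```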
Methodology
1. Run TRI on the Italian Google Ngram
▫build a WordSpace for each time period (10 years)
2. Provide a time series for each word
3. Search for significant changes in the time series
Time Series
Several time series Γ at the time interval k
▪ log frequency: word frequency in each time period k
▪ point-wise: cosine similarity (cossim) between word vectors across two time periods
▪ cumulative: cosine similarity (cossim) between the word vector and a cumulative vector of the previous k-1 time periods
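The two cosine-based series can be sketched as follows. This assumes that the point-wise series compares consecutive periods (k-1 and k) and that the cumulative vector is the plain sum of the word's vectors in all periods before k; the slide does not spell out either detail, and the function names are illustrative.

```python
import math


def cossim(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pointwise_series(vectors):
    """cossim between the word vector at period k-1 and at period k."""
    return [cossim(vectors[k - 1], vectors[k]) for k in range(1, len(vectors))]


def cumulative_series(vectors):
    """cossim between the sum of all vectors before period k and the vector at k."""
    series = []
    cumulative = [0.0] * len(vectors[0])
    for k in range(1, len(vectors)):
        cumulative = [c + x for c, x in zip(cumulative, vectors[k - 1])]
        series.append(cossim(cumulative, vectors[k]))
    return series


# Toy vectors of one word in three consecutive periods (invented values).
word_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
point = pointwise_series(word_vectors)
cum = cumulative_series(word_vectors)
```

On this toy input the cumulative series ends close to 1 while the point-wise one does not, because the last vector resembles the accumulated history more than it resembles its immediate predecessor.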
Change point detection
▪ Mean shift of Γ pivoted at time period j
▪ Search for statistically significant mean shifts
▫bootstrapping approach under the null hypothesis that there is no change in the meaning
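One possible reading of this step, as a hedged sketch rather than the authors' implementation: the mean shift at pivot j is the mean of Γ after j minus the mean up to j, and the null distribution is obtained by shuffling the series (a permutation-style resampling, swapped in here as one way to realise the bootstrap). The two-sided test, the default n_boot, alpha and seed, and the rule for picking among significant pivots are all assumptions.

```python
import random


def mean_shift(series, j):
    """Mean of the values after pivot j minus the mean of the values up to j."""
    before, after = series[: j + 1], series[j + 1:]
    return sum(after) / len(after) - sum(before) / len(before)


def detect_change_point(series, n_boot=1000, alpha=0.05, seed=0):
    """Return (pivot, p_value) for the largest significant shift, else None."""
    rng = random.Random(seed)
    pivots = list(range(len(series) - 1))
    observed = [mean_shift(series, j) for j in pivots]
    exceed = [0] * len(pivots)
    for _ in range(n_boot):
        shuffled = list(series)
        rng.shuffle(shuffled)  # null hypothesis: the time order carries no signal
        for j in pivots:
            if abs(mean_shift(shuffled, j)) >= abs(observed[j]):
                exceed[j] += 1
    # Empirical p-value: share of shuffles with a shift at least as large.
    significant = [j for j in pivots if exceed[j] / n_boot < alpha]
    if not significant:
        return None
    best = max(significant, key=lambda j: abs(observed[j]))
    return best, exceed[best] / n_boot


# Invented similarity series with a clear drop after the sixth period.
gamma = [0.91, 0.90, 0.92, 0.90, 0.91, 0.89, 0.41, 0.40, 0.42, 0.39, 0.40, 0.41]
result = detect_change_point(gamma)
```

The returned pivot is the index of the last period before the shift; a flat series yields None because every shuffle is at least as extreme as the observed value.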
Evaluation
Dataset
Build a benchmark for meaning shift detection for the Italian language
▪ extract a set of words by pooling the output of several system settings
▪ find the correct change points in a dictionary (Sabatini Coletti/Etimologico Zanichelli)
Evaluation
Results
Method        Accuracy
TRI_point     0.3086
TRI_cum       0.2963
TRR1_point    0.2716
log freq      0.2346
TRR2_point    0.1728
TRR1_cum      0.1605
TRR2_cum      0.1235
Accuracy: a prediction counts as correct when the year predicted by the system is equal to or greater than the year reported in the gold standard
TRR1 and TRR2 are variants of TRI based on Reflective Random Indexing
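The accuracy criterion can be written down in a few lines. This sketch assumes that a word with no predicted change point counts as an error and that accuracy is taken over all gold-standard words; the function name and the toy data (placeholder words and years) are invented for illustration.

```python
def accuracy(predicted, gold):
    """Share of gold words whose predicted change year is >= the gold year."""
    correct = sum(
        1
        for word, gold_year in gold.items()
        # Assumption: a missing prediction (None or absent) is counted as wrong.
        if predicted.get(word) is not None and predicted[word] >= gold_year
    )
    return correct / len(gold)


# Placeholder words and years, not taken from the benchmark.
gold = {"word_a": 1990, "word_b": 1980, "word_c": 1990}
predicted = {"word_a": 2000, "word_b": 1970, "word_c": None}
score = accuracy(predicted, gold)
```

Here only word_a satisfies the criterion, so one prediction out of three is correct.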
Conclusions and Future Work
▪ TRI with point-wise detection gives the best accuracy (0.3086)
▫it outperforms the baseline based on log-frequency (0.2346)
▪ We provide a benchmark for the evaluation of meaning shifts for the Italian language
▪ Future work: extend the dataset and provide an evaluation for the English language
Thanks!!
Any questions?
pierpaolo.basile@gmail.com
https://github.com/pippokill/tri
