Language Variety Identification using
Distributed Representations of Words and Documents
Marc Franco-Salvador, Francisco Rangel, Paolo Rosso,
Mariona Taulé, and M. Antònia Martí
mfranco@prhlt.upv.es, francisco.rangel@autoritas.es, prosso@dsic.upv.es,
{mtaule,amarti}@ub.edu
Introduction
“Author profiling aims to identify the linguistic
profile of an author on the basis of his writing
style.”
“Language variety identification is an author
profiling subtask which aims to detect lexical
and semantic variations in order to classify
different varieties of the same language.”
Example
The same sentence in varieties of Spanish:
“Estaba haciendo el tonto con mi perro y perdí el
móvil” (ES-SP)
“Estaba haciendo boludeces con mi perro y extravié el
celular” (ES-AR)
“Estaba haciendo el pendejo con mi perro y extravié el
celular” (ES-MX)
Translation:
“I was goofing around with my dog and I lost my
mobile” (EN)
Related work
● Zampieri and Gebre (2012) investigated varieties of Portuguese applying different
features such as word and character n-grams.
● Sadat et al. (2014) differentiated between six different varieties of Arabic in blogs
and forums using character n-grams.
● Maier and Gómez-Rodríguez (2014) employed meta-learning to classify tweets from
Argentina, Chile, Colombia, Mexico and Spain.
● Kríž et al. (2015) employed cross-entropy to detect English texts written for non-
native English speakers.
------------------------------------------------------------------------------------------
● Fabra-Boluda et al. (2015) describe the NLEL_UPV_Autoritas participation in the
Discrimination between Similar Languages (DSL) 2015 shared task.
● Franco-Salvador et al. (2015) applied distributed representations of words and
documents to classify different varieties of European languages.
Related work
Tasks on language variety identification:
– Workshop on Language Technology for Closely Related
Languages and Language Variants at EMNLP 2014.
– VarDial Workshop at COLING 2014 - Applying NLP Tools to
Similar Languages, Varieties and Dialects.
– LT4VarDial - Joint Workshop on Language Technology for
Closely Related Languages, Varieties and Dialects, with the
DSL shared task (Zampieri et al., 2014, 2015), at RANLP.
Proposed approach - motivation
The distributed representations of words capture
many linguistic regularities (Mikolov et al., 2013b):
vector('Paris') - vector('France') + vector('Italy')
is very close to
vector('Rome')
vector('king') - vector('man') + vector('woman')
is very close to
vector('queen')
Le and Mikolov (2014) employed distributed
representations of sentences to classify the polarity of
subjective text.
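The vector arithmetic above can be tried on a toy example. This is a minimal sketch: the 3-dimensional vectors are hand-picked, hypothetical values chosen so that the analogy holds, whereas real distributed representations are learned from text. The function names are illustrative, not from the paper.

```python
import math

# Hypothetical 3-dimensional toy vectors, hand-picked so the analogy holds.
# Real word vectors are learned from a corpus and have many more dimensions.
vectors = {
    "Paris":  [1.0, 0.0, 1.0],
    "France": [1.0, 0.0, 0.0],
    "Italy":  [0.0, 1.0, 0.0],
    "Rome":   [0.0, 1.0, 1.0],
    "dog":    [0.5, 0.5, -1.0],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

result = analogy("Paris", "France", "Italy", vectors)
print(result)
```

The three query words are excluded from the candidates, which is the usual convention when evaluating analogies.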
Distributed representation models
● Continuous bag-of-words (CBOW) model (Mikolov
et al., 2013b, 2013c).
– It learns to predict a word in a text from its
surrounding context (bag-of-words representation).
– It is fast and favours syntactic accuracy.
● Continuous skip-gram model (Mikolov et al.,
2013b, 2013c).
– It learns to predict a word in a text from a nearby
word. Distant words have less impact on the
prediction.
– It considerably improves semantic accuracy.
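The difference between the two models shows up in how training examples are built from a text. The sketch below is illustrative only: the function names and the fixed window size are assumptions, and it does not model the down-weighting of distant words mentioned above.

```python
def cbow_pairs(tokens, window=2):
    """(context words, target word): CBOW predicts the target from its context."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs

def skipgram_pairs(tokens, window=2):
    """(input word, nearby word): skip-gram uses one word at a time as input."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "estaba haciendo el tonto con mi perro".split()
cbow = cbow_pairs(tokens, window=1)
skip = skipgram_pairs(tokens, window=1)

print(cbow[1])   # context of "haciendo" with a window of 1
print(skip[:3])  # first skip-gram pairs
```

CBOW yields one example per word, with the whole context as input. Skip-gram yields one example per (word, neighbour) pair.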
Skip-gram model
The objective of the model is to maximize the
average of the log probability:
The conditional probability is estimated
using the softmax function (Sutton and Barto, 1998):
Reminder:
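For reference, the skip-gram objective and its softmax formulation as given in Mikolov et al. (2013c) can be written as below. The symbols are those of that paper, not of the slide text: T is the number of words in the training text, c the context window size, W the vocabulary size, and v_w and v'_w the input and output vectors of word w.

```latex
% Skip-gram objective: average log probability over the text w_1, ..., w_T
\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t)

% Softmax estimate of the conditional probability
p(w_O \mid w_I) =
  \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}
       {\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}
```

The sum over all W words in the denominator is what makes the full softmax expensive, which motivates the alternatives that follow.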
Alternatives to softmax function
Negative sampling (Mikolov et al. 2013c)
It simplifies the Noise Contrastive Estimation (NCE)
(Gutmann and Hyvärinen, 2012) while keeping the vector
quality.
“the task is to distinguish the target word [w_O] from
a noise distribution [P_n(w)] using logistic
regression, where there are k negative samples
for each word.” (Mikolov et al. 2013c)
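A minimal sketch of the quantity that negative sampling maximizes for one (input word, target word) pair, following the formulation in Mikolov et al. (2013c): the log-sigmoid of the target score plus the log-sigmoid of the negated score of each of the k negative samples. The toy vectors are hypothetical, and drawing the negatives from the noise distribution is assumed to have happened already.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def negative_sampling_objective(v_input, v_target, v_negatives):
    """log sigma(v_target . v_input) + sum_k log sigma(-v_neg_k . v_input)."""
    positive = math.log(sigmoid(dot(v_target, v_input)))
    negative = sum(math.log(sigmoid(-dot(v_neg, v_input))) for v_neg in v_negatives)
    return positive + negative

# Hypothetical toy vectors with k = 2 negative samples; values are illustrative.
v_input = [0.5, -0.2, 0.1]
v_target = [0.4, -0.1, 0.3]
v_negatives = [[-0.3, 0.2, 0.0], [0.1, 0.5, -0.4]]

objective = negative_sampling_objective(v_input, v_target, v_negatives)
print(objective)
```

Only k + 1 dot products are needed per training pair, instead of one per vocabulary word as in the full softmax.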
Generating distributed vectors of
sentences and documents
Two alternatives:
– Average the vectors of the words of a text (“Skip-
gram” in the evaluation)
e.g.: (vector('I') + vector('love') + vector('the') +
vector('capital') + vector('of') + vector('Bulgaria')) / 6
– Use the Sentence Vectors variant directly
(“SenVec” in the evaluation)
* We classified all the vectors using logistic
regression
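The averaging alternative can be sketched as follows. The 2-dimensional word vectors are hypothetical, and skipping words without a vector is an assumption of this sketch rather than something stated on the slide.

```python
def average_vector(tokens, vectors):
    """Mean of the vectors of the known tokens; None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]

# Hypothetical 2-dimensional word vectors, for illustration only.
vectors = {
    "I": [1.0, 0.0], "love": [0.0, 2.0], "the": [1.0, 1.0],
    "capital": [3.0, 0.0], "of": [0.0, 0.0], "Bulgaria": [1.0, 3.0],
}

doc = average_vector("I love the capital of Bulgaria".split(), vectors)
print(doc)
```

The resulting fixed-length vector is what would be passed to the logistic regression classifier, one vector per document.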
Proposed alternatives
Author profiling models:
– Emotion-labeled Graphs (Rangel and Rosso, 2015)
(EmoGraphs)
– Information Gain Word-Patterns (Martí et al., 2015)
(IG-WP)
EmoGraph of “He estado tomando cursos en línea sobre
temas valiosos que disfruto estudiando y que podrían
ayudarme a hablar en público” (“I have been taking online
courses about valuable subjects that I enjoy studying and that
might help me to speak in public”)
Information Gain Word-Patterns
Information Gain Word-Patterns (IG-WP) (Martí
et al., 2015) obtains lexico-syntactic patterns
aiming to represent the content of documents.
The method is based on the pattern-
construction hypothesis:
– “those contexts that are relevant to the
definition of a cluster of semantically related
words tend to be (part of) lexico-syntactic
constructions”.
Information Gain Word-Patterns
Pattern structure:
Examples:
In the experiments we selected as features the set
of 1,000 words that obtained the patterns with the
highest information gain.
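Information gain itself can be illustrated with a small sketch. Note the simplification: this example scores the presence of a single word against the variety labels, whereas IG-WP works with lexico-syntactic patterns. The toy documents and labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature_present, labels):
    """IG = H(labels) - sum over feature values of P(value) * H(labels | value)."""
    total = len(labels)
    gain = entropy(labels)
    for value in (True, False):
        subset = [y for f, y in zip(feature_present, labels) if f == value]
        if subset:
            gain -= (len(subset) / total) * entropy(subset)
    return gain

# Hypothetical toy data: which documents contain a given word?
labels = ["ES-SP", "ES-SP", "ES-AR", "ES-AR"]
has_celular = [False, False, True, True]   # separates the two varieties
has_perro = [True, False, True, False]     # says nothing about the variety

ig_celular = information_gain(has_celular, labels)
ig_perro = information_gain(has_perro, labels)
print(ig_celular, ig_perro)
```

Ranking candidate features by this score and keeping the top 1,000 is one plausible reading of the selection step. The exact procedure is described in the cited paper, not here.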
Dataset
We introduce the HispaBlogs [1] dataset, a new
collection of Spanish blogs from five different
countries: Argentina, Chile, Mexico, Peru and
Spain.
Each language variety has 450 training blogs and
200 test blogs.
Each blog is represented by a set of 10 posts
from its user.
[1] https://github.com/autoritas/RD-Lab/tree/master/data/HispaBlogs
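The per-variety figures above imply the following totals. This is plain arithmetic on the numbers given for the dataset. The variable names are illustrative.

```python
# Figures from the dataset description: five varieties, 450 training and
# 200 test blogs per variety, 10 posts per blog.
varieties = ["Argentina", "Chile", "Mexico", "Peru", "Spain"]
train_per_variety, test_per_variety, posts_per_blog = 450, 200, 10

train_blogs = len(varieties) * train_per_variety
test_blogs = len(varieties) * test_per_variety
total_posts = (train_blogs + test_blogs) * posts_per_blog

print(train_blogs, test_blogs, total_posts)
```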
Evaluation
We measured classification accuracy, comparing
our approaches with several models and
baselines.
Author profiling models:
– EmoGraphs
– IG-WP
Baselines:
– Bag-of-words
– Character 4-grams
– TF-IDF 2-grams
– TF-IDF graphs
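To make two of the baseline feature types concrete, here is a minimal sketch of character 4-gram extraction combined with a simple TF-IDF weighting. The weighting variant (raw term frequency times log(N / df)) is an assumption of this sketch. The slides do not specify which variant was used, nor how the baselines were implemented.

```python
import math
from collections import Counter

def char_ngrams(text, n=4):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tfidf(docs_ngrams):
    """TF-IDF with raw term frequency and idf = log(N / document frequency)."""
    n_docs = len(docs_ngrams)
    df = Counter(g for grams in docs_ngrams for g in set(grams))
    weighted = []
    for grams in docs_ngrams:
        tf = Counter(grams)
        weighted.append({g: tf[g] * math.log(n_docs / df[g]) for g in tf})
    return weighted

docs = ["perdí el móvil", "extravié el celular"]
grams = [char_ngrams(d) for d in docs]
weights = tfidf(grams)

print(char_ngrams("celular"))
print(weights[0][" el "], weights[0]["móvi"])
```

N-grams shared by every document, such as " el ", get weight zero, while variety-specific ones such as "móvi" keep a positive weight.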
Experimental results
Test set confusion matrix (in %) of
Skip-gram model
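A confusion matrix in percent and the overall accuracy can be computed as sketched below. The predictions are hypothetical and are not results from the paper. The ES-CL and ES-PE codes are assumed by analogy with the ES-SP, ES-AR and ES-MX codes used earlier, and normalizing each row by the number of gold documents of that variety is also an assumption.

```python
from collections import Counter

# ES-CL and ES-PE are assumed codes, by analogy with ES-SP, ES-AR and ES-MX.
VARIETIES = ["ES-AR", "ES-CL", "ES-MX", "ES-PE", "ES-SP"]

def confusion_matrix_percent(gold, predicted, classes):
    """Rows are gold varieties, columns are predictions, values are row percentages."""
    counts = Counter(zip(gold, predicted))
    matrix = {}
    for g in classes:
        row_total = sum(counts[(g, p)] for p in classes)
        matrix[g] = {
            p: (100.0 * counts[(g, p)] / row_total if row_total else 0.0)
            for p in classes
        }
    return matrix

def accuracy(gold, predicted):
    """Fraction of documents whose predicted variety matches the gold one."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

# Hypothetical predictions, for illustration only.
gold = ["ES-AR", "ES-AR", "ES-MX", "ES-MX", "ES-SP", "ES-SP", "ES-SP", "ES-SP"]
pred = ["ES-AR", "ES-MX", "ES-MX", "ES-MX", "ES-SP", "ES-SP", "ES-AR", "ES-SP"]

matrix = confusion_matrix_percent(gold, pred, VARIETIES)
acc = accuracy(gold, pred)
print(matrix["ES-SP"], acc)
```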
Conclusions
● Distributed representations make it possible to
obtain competitive results in the task of
language variety identification in social media.
● Averaging word vectors (Skip-gram) and using
document vectors (SenVec) gave similar results,
with no significant differences.
Future work
● We will investigate how to apply distributed
representations to other author profiling tasks
such as age and gender identification.
● We will continue working to improve the current
model in order to generate better distributed
representations for discriminating between
similar languages.
Thank you for your time :)
Questions / feedback?
francisco.rangel@autoritas.es
This work has been published as:
Franco-Salvador, M., Rangel, F., Rosso, P., Taulé, M., & Martí, M. A. (2015).
Language variety identification using distributed representations of words and
documents. In Proceedings of the 6th International Conference of CLEF on
Experimental IR meets Multilinguality, Multimodality, and Interaction (CLEF 2015),
LNCS, vol. 9283. Springer-Verlag.
References
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. MIT Press.
Fabra-Boluda, R., Rangel, F., Rosso, P. (2015). NLEL_UPV_Autoritas participation at Discrimination
between Similar Languages (DSL) 2015 shared task. In: Proc. of the Joint Workshop on Language
Technology for Closely Related Languages, Varieties and Dialects (LT4VarDial), Hissar, Bulgaria.
Franco-Salvador, M., Rosso, P., & Rangel, F. (2015). Distributed Representations of Words and
Documents for Discriminating Similar Languages. In: Proc. of the Joint Workshop on Language
Technology for Closely Related Languages, Varieties and Dialects (LT4VarDial), Hissar, Bulgaria.
Gutmann, M. U., & Hyvärinen, A. (2012). Noise-contrastive estimation of unnormalized statistical
models, with applications to natural image statistics. The Journal of Machine Learning Research, 13(1),
307-361.
Le, Q. V., & Mikolov, T. (2014). Distributed representations of sentences and documents. arXiv preprint
arXiv:1405.4053.
Maier, W., & Gómez-Rodríguez, C. (2014). Language variety identification in Spanish tweets.
LT4CloseLang 2014, 25.
Martí, M.A., Bertran, M., Taulé, M., Salamó, M. (2015). Distributional approach based on syntactic
dependencies for discovering constructions. In Computational Linguistics (under review)
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013b). Efficient estimation of word representations in
vector space. In Proceedings of Workshop at ICLR.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013c). Distributed representations of
words and phrases and their compositionality. In Advances in Neural Information Processing Systems
(pp. 3111-3119).
Morin, F., & Bengio, Y. (2005, January). Hierarchical probabilistic neural network language model. In
Proceedings of the international workshop on artificial intelligence and statistics (pp. 246-252).
Rangel, F., & Rosso, P. (2015). On the impact of emotions on author profiling. Information Processing &
Management.
Sadat, F., Kazemi, F., & Farzindar, A. (2014). Automatic Identification of Arabic Language Varieties and
Dialects in Social Media. SocialNLP 2014, 22.
Zampieri, M., & Gebre, B. G. (2012). Automatic identification of language varieties: The case of
Portuguese. In KONVENS2012-The 11th Conference on Natural Language Processing (pp. 233-237).
Österreichischen Gesellschaft für Artificial Intelligence (ÖGAI).
Zampieri, M., Tan, L., Ljubešić, N., & Tiedemann, J. (2014). A report on the DSL shared task 2014.
COLING 2014, 58.
Zampieri, M., Tan, L., Ljubešić, N., Tiedemann, J., & Nakov, P. (2015). Overview of the DSL shared task
2015. In Proceedings of the Joint Workshop on Language Technology for Closely Related Languages,
Varieties and Dialects (LT4VarDial), Hissar, Bulgaria.
