Native Language 
Identification 
State of the Art
What’s native language identification? 
NLI is the task of identifying the native 
language (L1) of a writer based solely on a 
sample of that author’s writing (1) 
For example, a Spanish journalist writing news in Arabic, 
a French student writing essays in English or a Brazilian 
user tweeting in Spanish... 
(1) https://sites.google.com/site/nlisharedtask2013/home
Why is NLI important? 
Education: More targeted feedback to 
language learners about their errors 
Marketing: Better market segmentation 
based on native idiosyncrasies 
Forensics: Helping in author profiling 
Security: Profiling possible threats 
...
Some variations / relations 
Native Language Identification: For example, people of 
different nationalities writing in English 
Language Varieties or Dialects Identification: 
For example, Portuguese of Portugal vs. 
Brazil, Spanish of Spain, Argentina, Mexico... 
[Diagram labels: NLI, LVI, LI]
Outline 
Representative Works 
Common Resources / Corpora 
Some Issues 
Research niches 
References
Representative works 
Determining an Author’s Native Language by Mining a Text for Errors. Koppel, M., 
Schler, J., Zigdon, K. In Proceedings of the 11th ACM SIGKDD International Conference 
on Knowledge Discovery in Data Mining, KDD’05 
A Report on the First Native Language Identification Shared Task. Tetreault, J., 
Blanchard, D., Cahill, A. In the 8th Workshop on Innovative Use of NLP for Building 
Educational Applications BEA-8. NAACL-HLT 2013 
Using Other Learner Corpora in the 2013 NLI Shared Task. Brooke, J., Hirst, G. In the 
8th Workshop on Innovative Use of NLP for Building Educational Applications BEA-8. 
NAACL-HLT 2013 
Exploring Syntactic Features for Native Language Identification: A Variationist 
Perspective on Feature Encoding and Ensemble Optimization. Bykh, S., Meurers, D. The 
25th International Conference on Computational Linguistics COLING 2014 
Author’s Native Language Identification from Web-Based Texts. Tofighi, P., Köse, C., 
Rouka, L. International Journal of Computer and Communication Engineering 2012 
Automatic Identification of Language Varieties: The Case of Portuguese. Zampieri, M., 
Gebrekidan, B. In Proceedings of the Conference on Natural Language Processing 2012 
Automatic Identification of Arabic Language Varieties and Dialects in Social Media. 
Sadat, F., Kazemi, F., Farzindar, A. In Proceedings of the 1st International Workshop on 
Social Media Retrieval and Analysis SoMeRa 2014 
...
Determining an Author’s Native Language by Mining a Text for Errors. 
Koppel, M., Schler, J., Zigdon, K. 
In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD’05. 
The first published work in this area 
Corpus: International Corpus of Learner English, ICLE 
L1: 258 authors from Russia, Czech Republic, Bulgaria, France, Spain 
L2: English 
Features: 
Function words (400) 
Letter n-grams (200) 
Errors and idiosyncrasies (185 error types + 250 rare POS bigrams) 
Orthography: repeated letter (remmit/remit), double letter appears only once 
(comit/commit)... 
Syntax: run-on sentence, mismatched singular/plural, mismatched tense, that/which 
confusion... 
Neologism: e.g. fantabulous 
Rare part-of-speech bigrams in the Brown corpus [http://www.wikiwand.com/en/Brown_Corpus] 
ML Algorithm: Multi-class linear Support Vector Machines 
Evaluation method: 10-fold cross-validation 
Accuracy ~ 80%
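The three feature families on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function-word list is a tiny hypothetical stand-in for the 400 words used in the paper, and spelling_error_type assumes the intended spelling is already known and covers just the two orthographic error types named above.

```python
import re
from collections import Counter

# Hypothetical mini list; the paper used 400 function words.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "which"]


def function_word_freqs(text):
    """Relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1
    return {w: counts[w] / total for w in FUNCTION_WORDS}


def letter_ngrams(text, n=2):
    """Counts of letter n-grams taken inside each word."""
    grams = Counter()
    for token in re.findall(r"[a-z]+", text.lower()):
        for i in range(len(token) - n + 1):
            grams[token[i:i + n]] += 1
    return grams


def spelling_error_type(written, intended):
    """Label a misspelling with one of two orthographic error types."""
    # Repeated letter: deleting one half of a doubled letter in the
    # written form gives the intended word (remmit -> remit).
    if len(written) == len(intended) + 1:
        for i in range(1, len(written)):
            if (written[i] == written[i - 1]
                    and written[:i] + written[i + 1:] == intended):
                return "repeated_letter"
    # Double letter written only once: deleting one half of a doubled
    # letter in the intended word gives the written form (comit <- commit).
    if len(written) == len(intended) - 1:
        for i in range(1, len(intended)):
            if (intended[i] == intended[i - 1]
                    and intended[:i] + intended[i + 1:] == written):
                return "missing_double_letter"
    return "other"


print(spelling_error_type("remmit", "remit"))   # repeated_letter
print(spelling_error_type("comit", "commit"))   # missing_double_letter
```

Per-document counts like these would then be concatenated into one feature vector and passed to a multi-class classifier such as a linear SVM.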
A Report on the First Native Language Identification Shared Task. 
Tetreault, J., Blanchard, D., Cahill, A. 
In the 8th Workshop on Innovative Use of NLP for Building Educational Applications BEA-8. NAACL-HLT 2013 
The first shared task in this area 
Corpus: Test of English as a Foreign Language, TOEFL11 
8 prompts (i.e. topics) 
TOEFL11-train (900), TOEFL11-dev (100), TOEFL11-test (100) per L1 
L1: 1100 essays per language, Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, 
Telugu, Turkish 
L2: English 
Task: 
Closed-training: 11-way classification task using only TOEFL11-train and optionally TOEFL11-dev 
Open-training-1: Participants could use any amount of training data excluding TOEFL11 
Open-training-2: Any kind of training data, even TOEFL11-train and -dev 
29 teams, 24 papers, up to 5 systems per task 
Features: the most common were word, character and POS n-gram features 
ML Algorithm: The majority used SVM (13). Also Maximum Entropy (3), Ensemble (3), Discriminant 
Function Analysis (1) and K-Nearest Neighbors (1)... 
Evaluation method: TOEFL11-test in the three subtasks 
Accuracy: 
Closed-training: 83.6% 
Open-training-1: 56.5% 
Open-training-2: 83.5%
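To put these accuracies in context, the split sizes and the chance level follow directly from the per-L1 figures on this slide. A minimal sketch; the variable and function names are illustrative, and the chance level assumes balanced classes, which holds here because every L1 has the same number of essays per split.

```python
L1S = ["Arabic", "Chinese", "French", "German", "Hindi", "Italian",
       "Japanese", "Korean", "Spanish", "Telugu", "Turkish"]
PER_L1 = {"train": 900, "dev": 100, "test": 100}  # essays per L1, from the slide

# Total essays in each split and in the whole corpus.
split_sizes = {name: n * len(L1S) for name, n in PER_L1.items()}
total_essays = sum(split_sizes.values())

# With balanced classes, random guessing in an 11-way task gets 1/11.
chance_accuracy = 1 / len(L1S)


def accuracy(gold, predicted):
    """Fraction of essays whose predicted L1 matches the gold L1."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


print(split_sizes)                 # {'train': 9900, 'dev': 1100, 'test': 1100}
print(total_essays)                # 12100
print(round(chance_accuracy, 4))   # 0.0909
```

Against a chance level of about 9.1%, even the 56.5% of the open-training-1 subtask is far above random.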
Using Other Learner Corpora in the 2013 NLI Shared Task. 
Brooke, J., Hirst, G. 
In the 8th Workshop on Innovative Use of NLP for Building Educational Applications BEA-8. NAACL-HLT 2013 
Corpus: 
TOEFL11: 11 L1 
Lang-8: Website where language learners write journal entries in their L2 to be corrected by native 
speakers. 11 L1 overlapping with TOEFL11 
ICLE: 15 L1, 8 overlap with TOEFL11 
FCE: Small sample of the First Certificate in English. 16 L1, 9 overlap with TOEFL11 
ICCI: International Corpus of Crosslinguistic Interlanguage. 4 L1 overlap with TOEFL11 
ICNALE: International Corpus Network of Asian Learners of English. 3 L1 overlap with TOEFL11 
L1: Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu, Turkish 
L2: English 
Features: 
Function words, word n-grams (up to bigrams), POS n-grams (up to trigrams), character n-grams (up to 
trigrams), dependencies, context-free productions, ‘mixed’ POS/function n-grams (up to trigrams), i.e. n-grams 
with all lexical words replaced with their part of speech. 
The best model: word n-grams + mixed POS/function n-grams 
ML Algorithm: Support Vector Machines 
Evaluation method: TOEFL11-test 
Accuracy: 
Closed-task: 80.2% (12 / 29) 
Open-1-task: 56.5% (1 / 3) 
Open-2-task: 81.6% (2 / 4)
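The ‘mixed’ POS/function n-grams are easy to misread, so here is a small sketch of the idea: function words are kept as they are, and every other (lexical) word is replaced by its POS tag before n-grams are extracted. The function-word set is a hypothetical mini list and the tagged sentence is hand-annotated for illustration; a real system would take both from a tagger and a full function-word lexicon.

```python
from collections import Counter

# Hypothetical mini list of function words.
FUNCTION_WORDS = {"the", "of", "to", "in", "a", "is", "that"}


def mixed_sequence(tagged):
    """Keep function words, replace lexical words with their POS tag."""
    return [tok.lower() if tok.lower() in FUNCTION_WORDS else pos
            for tok, pos in tagged]


def ngrams(seq, n):
    """All contiguous n-grams of a sequence, as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def mixed_ngrams(tagged, max_n=3):
    """Counts of mixed POS/function n-grams for n = 1..max_n."""
    seq = mixed_sequence(tagged)
    feats = Counter()
    for n in range(1, max_n + 1):
        feats.update(ngrams(seq, n))
    return feats


# Hand-tagged example sentence (Penn Treebank style tags).
sentence = [("The", "DT"), ("student", "NN"), ("writes", "VBZ"),
            ("in", "IN"), ("English", "NNP")]
print(mixed_sequence(sentence))   # ['the', 'NN', 'VBZ', 'in', 'NNP']
```

Because topic words are abstracted away, features of this kind should depend less on the essay prompt than plain word n-grams do.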
Exploring Syntactic Features for Native Language Identification: A Variationist Perspective on 
Feature Encoding and Ensemble Optimization 
Bykh, S., Meurers, D. 
The 25th International Conference on Computational Linguistics COLING 2014 
Corpus: 
TOEFL11: 11 L1 
NT11: 5843 texts from ICLE + FCE + BALC + ICNALE + TÜTEL-NLI. 
L1: Arabic (846), Chinese (1048), French (456), German (500), Hindi (400), Italian 
(467), Japanese (447), Korean (684), Spanish (446), Telugu (200), Turkish (349) 
L2: English 
Features: Context-Free Grammar Rules 
Only phrasal CFG production rules excluding all terminals (S -> NP VP, NP -> D 
NN, ...) 
Only lexicalized CFG production rules of the type preterminal -> terminal 
(JJ -> nice, JJ -> quick, NN -> vacation, ...) 
The union (combination) of the above two 
ML Algorithm: Logistic Regression 
Evaluation method: Cross-corpus, NT11 for training, TOEFL11-test for testing 
Accuracy: 84.82%
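The three CFG feature sets above can be read off a parse tree as follows. This is a sketch under stated assumptions: the tree is a hand-built nested tuple (label, child, ...) rather than parser output, and the function names and the kind parameter are illustrative, not taken from the paper.

```python
from collections import Counter


def productions(tree):
    """Yield (lhs, rhs, is_lexical) for every node of a nested-tuple tree."""
    label, *children = tree
    # A preterminal node has exactly one child, and that child is a word.
    if len(children) == 1 and isinstance(children[0], str):
        yield (label, (children[0],), True)
        return
    yield (label, tuple(child[0] for child in children), False)
    for child in children:
        yield from productions(child)


def cfg_features(tree, kind="union"):
    """Count production rules: 'phrasal', 'lexicalized' or 'union'."""
    if kind not in ("phrasal", "lexicalized", "union"):
        raise ValueError("unknown kind: " + kind)
    feats = Counter()
    for lhs, rhs, is_lexical in productions(tree):
        if kind == "phrasal" and is_lexical:
            continue
        if kind == "lexicalized" and not is_lexical:
            continue
        feats[lhs + " -> " + " ".join(rhs)] += 1
    return feats


# Hand-built parse of "a nice vacation helps".
tree = ("S",
        ("NP", ("D", "a"), ("JJ", "nice"), ("NN", "vacation")),
        ("VP", ("VBZ", "helps")))

print(sorted(cfg_features(tree, "phrasal")))
# ['NP -> D JJ NN', 'S -> NP VP', 'VP -> VBZ']
print(sorted(cfg_features(tree, "lexicalized")))
# ['D -> a', 'JJ -> nice', 'NN -> vacation', 'VBZ -> helps']
```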
Author’s Native Language Identification from Web-Based Texts. 
Tofighi, P., Köse, C., Rouka, L. 
International Journal of Computer and Communication Engineering 2012 
Corpus: 600 publicly available news agency texts 
L1: 150 texts from each of English, Persian, Turkish, German 
L2: English 
Features: 
Lexical (64): character n-grams, word-length frequency, 
vocabulary richness... 
Syntactic (308): common punctuation signs, function words 
(e.g. the)... 
Structural (13): paragraph length, use of greetings... 
Content-specific (): n-grams with TF > 10 
ML Algorithm: Support Vector Machines 
Evaluation method: 10-fold cross-validation 
Accuracy: 70% ~ 80%
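A few of the lexical and syntactic features named on this slide, sketched with the standard library only. The exact definitions used in the paper are not given here, so these are assumptions: word lengths are capped at a hypothetical max_len, vocabulary richness is approximated by the type-token ratio, and the punctuation set is illustrative.

```python
import re
from collections import Counter


def word_length_freqs(text, max_len=10):
    """Share of words of each length; lengths above max_len are pooled."""
    words = re.findall(r"[A-Za-z]+", text)
    counts = Counter(min(len(w), max_len) for w in words)
    total = len(words) or 1
    return {length: counts[length] / total for length in range(1, max_len + 1)}


def type_token_ratio(text):
    """Distinct words divided by total words (one richness measure)."""
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", text)]
    return len(set(words)) / len(words) if words else 0.0


def punctuation_counts(text, signs=".,;:!?"):
    """Raw count of each punctuation sign."""
    return {s: text.count(s) for s in signs}


sample = "The cat saw the dog."
print(type_token_ratio(sample))          # 0.8
print(word_length_freqs(sample)[3])      # 1.0
print(punctuation_counts(sample)["."])   # 1
```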
Automatic Identification of Language Varieties: The Case of Portuguese. 
Zampieri, M., Gebrekidan-Gebre, B. 
In Proceedings of the Conference on Natural Language Processing 2012 
Corpus (1000 documents from newspapers): 
Brazilian corpus: Folha de São Paulo, newspaper 2004 
Portuguese corpus: Diário de Notícias, newspaper 2007 
Features: word and character n-grams 
Orthography: 
Graphical signs: econômico (BP); económico (EP); economic (EN) 
Mute consonants: ator (BP); actor (EP); actor (EN) 
Syntax: 
Pronouns: eu te amo (BP); eu amo-te (EP); I love you (EN) 
Lexical variation: 
multa (BP); coima (EP); fine, penalty (EN) 
Proper nouns 
ML Algorithm: Language probability distributions with log-likelihood function for probability 
estimation 
Evaluation method: 50/50 split 
Accuracy: 
Word uni-grams: 99.6% 
Word bi-grams: 91.2% 
Character 4-grams: 99.8%
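The approach on this slide (one n-gram distribution per variety, scored by log-likelihood) can be sketched as below. This is not the authors' implementation: the add-one smoothing, the class and function names and the default n=3 are assumptions, and the toy training data is just the BP/EP example words from the slide. The paper's best character model used 4-grams on full newspaper texts.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Character n-grams of the lowercased text, padded with spaces."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramModel:
    """Character n-gram distribution with add-one smoothing (an assumption)."""

    def __init__(self, texts, n=3):
        self.n = n
        self.counts = Counter(g for t in texts for g in char_ngrams(t, n))
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # one extra slot for unseen n-grams

    def log_likelihood(self, text):
        """Sum of smoothed log-probabilities of the text's n-grams."""
        return sum(
            math.log((self.counts[g] + 1) / (self.total + self.vocab))
            for g in char_ngrams(text, self.n)
        )


def identify(text, models):
    """Return the variety whose model gives the highest log-likelihood."""
    return max(models, key=lambda name: models[name].log_likelihood(text))


# Toy training data: the example words and phrases from the slide.
models = {
    "BP": NgramModel(["econômico", "ator", "eu te amo", "multa"]),
    "EP": NgramModel(["económico", "actor", "eu amo-te", "coima"]),
}
print(identify("ator econômico", models))    # BP
print(identify("actor económico", models))   # EP
```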
Automatic Identification of Arabic Language Varieties and Dialects in Social Media. 
Sadat, F., Kazemi, F., Farzindar, A. 
In Proceedings of the 1st International Workshop on Social Media Retrieval and Analysis SoMeRa 2014 
Corpus (blogs and forum documents): 6 regional variations 
Egyptian: Egypt 
Iraqi: Iraq 
Gulf: Bahrain, Emirates, Kuwait, Qatar, Oman, Saudi Arabia 
Maghrebi: Algeria, Tunisia, Morocco, Libya, Mauritania 
Levantine: Jordan, Lebanon, Palestine, Syria 
Others: Sudan 
Features: character n-grams 
ML Algorithm: Markov language model vs. Naïve Bayes 
Evaluation method: 50/50 split 
Accuracy: 98% (78% F-measure)
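Accuracy and F-measure can diverge as much as they do on this slide because they measure different things. The sketch below shows the standard per-class F-measure (harmonic mean of precision and recall) next to accuracy; how the paper averaged its F-measure is not stated here, so no averaging scheme is implied, and the dialect labels in the example are made up for illustration.

```python
def accuracy(gold, predicted):
    """Fraction of documents labelled correctly."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def f_measure(gold, predicted, label):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up labels for four documents.
gold = ["EGY", "EGY", "GULF", "GULF"]
pred = ["EGY", "GULF", "GULF", "GULF"]
print(accuracy(gold, pred))                      # 0.75
print(round(f_measure(gold, pred, "EGY"), 4))    # 0.6667
print(round(f_measure(gold, pred, "GULF"), 4))   # 0.8
```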
Common resources / corpora 
ICLE: International Corpus of Learner English (Granger et al., 2009) 
Essays written by college-level English language learners 
Issues: Quite small, topic bias 
L1 (11) except Arabic, Hindi and Telugu; L2 English 
Lang-8: http://www.lang8.com 
Social networking service where users write in the language they are learning and get corrections from 
native speakers 
FCE: First Certificate in English (Yannakoudakis et al., 2011) 
Essays written for an English assessment exam 
L1 (16) except Arabic, Hindi and Telugu; L2 English 
ICCI: International Corpus of Crosslinguistic Interlanguage (Tono et al., 2012) 
Descriptive and argumentative essays written by young learners, i.e. those in grade school 
L1 (4); L2 English 
TOEFL11: Test of English as a Foreign Language (Blanchard et al., 2013) 
Essays written during a high-stakes college-entrance test 
L1 (11) Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu, Turkish; L2 
English 
ICNALE: International Corpus Network of Asian Learners of English (Ishikawa, 2011) 
Essays from college students 
L1 (10): Asian background (Chinese, Japanese, Korean); L2 English 
BALC (Randall and Groom, 2009); TÜTEL-NLI (Bykh et al., 2013)
Some Issues 
Most of the corpora were built from formal 
media, based on essays by proficient students 
Different corpora contain essays at different 
proficiency levels. There are even large differences within 
the same corpus 
Very little research uses social media, 
where people express themselves in other 
languages without paying attention to their 
errors 
All of these works focus on English as a 
second language
Research niches 
Spanish as a second language (http://epp.eurostat.ec.europa.eu/cache/ITY_PUBLIC/3-25092014-AP/EN/3-25092014-AP-EN.PDF) 
Spanish variations (Hispablogs) 
Lang  Blogs    Words  Max_W  Min_W    Avg_W    Std_W
AR      450  1408103  11117    502  3126.90  2183.75
CL      450  1081068  10336    384  2402.37  2378.07
ES      450  1376478  11141    336  3058.84  2234.14
MX      450  1697091  11946    725  3771.31  2514.51
PA      450   950076  13090    120  2111.28  2264.03
PE      450  1602195  13205    620  3560.43  2515.71
References 
Blanchard, D., Tetreault, J., Higgins, D., Cahill, A., Chodorow, M. TOEFL11: A Corpus 
of Non-Native English. Technical Report, Educational Testing Service. 2013 
Bykh, S., Vajjala, S., Krivanek, J., Meurers, D. Combining Shallow and 
Linguistically Motivated Features in Native Language Identification. In 
Proceedings of the 8th Workshop on Innovative Use of NLP for Building 
Educational Applications BEA-8 at NAACL-HLT 2013 
Granger, S., Dagneaux, E., Meunier, F. The International Corpus of Learner English: 
Handbook and CD-ROM, version 2. Presses universitaires de Louvain. 2009 
Ishikawa, S. A New Horizon in Learner Corpus Studies: The Aim of the ICNALE 
Projects. In Corpora and Language Technologies in Teaching, Learning and 
Research. University of Strathclyde Publishing. 2011 
Randall, M., Groom, N. The BUiD Arab Learner Corpus: A Resource for Studying 
the Acquisition of L2 English Spelling. In Proceedings of the Corpus Linguistics 
Conference CL 2009 
Yannakoudakis, H., Briscoe, T., Medlock, B. A New Dataset and Method for 
Automatically Grading ESOL Texts. In Proceedings of the 49th Annual Meeting of 
the Association for Computational Linguistics: Human Language Technologies. 
2011