Automatic English text correction

@tati_alchueyr
Tatiana Al-Chueyr Martins
Bratislava, 12 March 2016
PyCon SK 2016

tati.__doc__
● Brazilian
● Lives in London (United Kingdom)
● Pythonista and Open Source activist
● Computer Engineer from Unicamp (Brazil)
● Has been developing software since 2002
● Works at EF (Education First)
○ Backend & DevOps leader of CTX Team

help(EF)
● EF: Education First
● International education company
○ Language training
○ Educational travel
○ Academic degree level
● Founded in 1965 in Sweden by Bertil Hult
● ~ 40,000 staff
● ~ 500 offices and schools in more than 50 countries (including Slovakia ;))
● Privately held by the Hult family

help(EF.CTX)
● Classroom Technology Experience
● Teaching and learning applications (Web & Mobile)
● Authoring platform

CTX.__team__
● CTX Team
● Team travel
● Malta, November 2015

CTX.backend
● Rafael Cunha de Almeida
● and I
● trying to master Italian cooking
● London, February 2016
● Although I’m presenting this project alone, Rafa has contributed to it as much as I have :)

objective
● Present a challenge
● Introduce a useful dataset
● Introduce a bunch of Python scripts
● Collect ideas
● Collaboratively build good-quality open source tools that can help deal with this challenge

the challenge

The challenge
Assessing (evaluating) students’ activities & exercises can be:
● Laborious
● Repetitive
● Slow
● In other words... painful!
https://classteaching.files.wordpress.com/2013/10/marking-pile.gif

English Text Correction
hi,
my nameiscrystal.im nineyearsold.im formchina,imliveinjiang xi xing yu.
there aretow peopleinmyfamily:mymother,myfather.
my motheristhirty-sixyearsold,myfatheristhirty-sevenyearsold
EFCamDat - C219811

hi,
my nameis crystal.im nineyearsold. im form china,imliveinjiang xi xing yu .
EFCamDat - C219811
capitalization

hi,
my nameis crystal.im nineyearsold. im formchina,imliveinjiang xi xing yu .
EFCamDat - C219811
capitalization
spelling

hi,
EFCamDat - C219811
capitalization
spelling
verb tense

hi,
EFCamDat - C219811
capitalization
spelling
verb tense
There are “only” 89 writings left to assess this week...

The challenge
Implement algorithms and tools that can help (teachers) assess essays written in English
Example of such a feature, available in several applications (including LibreOffice, Google Apps, MS Word):
● Highlight (potential) mistakes while user types in a text area

The challenge
● Input:
○ English text
● Output:
○ List of items containing:
■ Position in text
■ Kind of potential mistake (e.g. preposition, punctuation, article, spelling, etc.)
■ Proposed correction
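
The input/output contract above could be modelled roughly as follows. This is only a sketch: the `Mistake` class, its field names and the toy detector `find_lowercase_i` are hypothetical and are not taken from the project's code.

```python
import re
from dataclasses import dataclass


@dataclass
class Mistake:
    start: int        # offset of the first character of the flagged span
    end: int          # offset just past the last character of the span
    kind: str         # e.g. "capitalization", "spelling", "article"
    suggestion: str   # proposed correction for the span


def find_lowercase_i(text):
    """Toy detector (hypothetical): flag the pronoun "i" written in lower case."""
    return [
        Mistake(m.start(), m.end(), "capitalization", "I")
        for m in re.finditer(r"\bi\b", text)
    ]


# English text in, list of positioned mistakes out.
mistakes = find_lowercase_i("hi, i am nine and i live in China")
for mistake in mistakes:
    print(mistake)
```

Keeping character offsets rather than word indexes would let a user interface highlight the exact span, as spell checkers in text editors do.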

the dataset

The dataset
EFCamDAT
● 551,036 written essays
○ 2,897,788 sentences
○ 32,980,407 word tokens
● by 85,864 learners
● 16 levels of proficiency
● 172 nationalities
● annotated with corrections by English teachers

The dataset
Examples of essay topics
● Introducing yourself by email
● Writing an online profile
● Describing your favourite day
● Telling someone what you’re doing
● Replying to a new penpal
● Writing about what you do
● Writing a resume
● Giving instructions to play a game
● Reviewing a song for a website
● Writing an apology email
● Writing a movie review
● Turning down an invitation
● Giving advice about budgeting
● Covering a news story
● Researching a legendary creature

The dataset
Examples of learners’ nationalities
● 36.9% Brazilians
● 18.7% Chinese
● 8.5% Russians
● 7.9% Mexicans
● 5.6% Germans
● 4.3% French
● ...

The dataset
EFCamDAT
● EF-Cambridge Open Language Database
● Partnership between:
○ University of Cambridge (Department of Theoretical and Applied Linguistics)
■ EF-Research Unit
○ EF Education First
● Data collected from Englishtown
○ EF learning environment (online English school)

The dataset
Types of mistakes annotated
● x >> y: change from x to y
● AG: agreement
● AR: article
● CO: combine sentence
● C: capitalization
● D: delete
● EX: expression or idiom
● HL: highlight
● I(x): insert x
● MW: missing word
● NS: new sentence
● NSW: no such word
● PH: phraseology
● PL: plural
● PO: possessive
● PR: preposition
● PS: part of speech
● PU: punctuation
● SI: singular
● SP: spelling
● VT: verb tense
● WC: word choice
● WO: word order

The dataset
● 10 most common mistakes

The dataset
How to get it?
● https://corpus.mml.cam.ac.uk/efcamdat1/access.php
Licence:
● Non-commercial research use
● Commercial use subject to prior agreement
● https://corpus.mml.cam.ac.uk/efcamdat1/EFCamDAT-USERAGREEMENT.pdf

The dataset
Once you’ve registered, you can:
● Filter the dataset
● Export the dataset to an XML file

a bunch of Python scripts

A bunch of (Python) scripts
Disclaimer
Code developed using the Extreme Go Horse Methodology during Hackday moments
They are a POC and lack:
- Proper automated tests
- Proper code design & API
- Documentation
https://gist.github.com/banaslee/4147370

What do they do?
1. Fix the XML files
2. Convert the XML files into good-looking JSON files
3. Implement heuristics to identify some common English mistakes
○ For now: spelling, capitalization and articles
4. Analyse how efficient the algorithm was

How to download them?
● https://github.com/ef-ctx/righter
Licence
● Apache version 2.0

Hands on

Mistake identification
We wrote functions that apply heuristics and rules to detect mistakes related to:
1. Spelling
2. Capitalization
3. Article

Efficiency
In order to check their efficiency, we:
● Wrote a few unit tests
● Evaluated, before committing any change, how close we got to the teacher’s annotations, using:
○ Precision
○ Recall
○ F-score
● Printed a side-by-side comparison of what the teacher annotated and what the algorithm identified
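
As a rough illustration of the three measures for a single essay, the sketch below compares a set of teacher annotations with a set of detected mistakes. The function name, the `(position, code)` tuples and the choice to return 0.0 when a set is empty are assumptions made for this example, not the project's actual evaluation code.

```python
def precision_recall_f1(expected, detected):
    """Compare two collections of flagged items, e.g. (position, code) pairs."""
    expected, detected = set(expected), set(detected)
    true_positives = len(expected & detected)
    # Precision: share of the algorithm's flags that the teacher also marked.
    precision = true_positives / len(detected) if detected else 0.0
    # Recall: share of the teacher's marks that the algorithm found.
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-score (F1): harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations for one essay.
teacher = {(4, "C"), (18, "C"), (30, "SP")}
algorithm = {(4, "C"), (30, "SP"), (42, "SP")}
precision, recall, f1 = precision_recall_f1(teacher, algorithm)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Averaging these per-essay values over the corpus would give mean and standard deviation figures of the kind reported on the results slides.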

Efficiency
https://en.wikipedia.org/wiki/Precision_and_recall

Efficiency
F-Score

Spelling

Spelling: heuristics
1. Remove Unicode symbols (e.g. —)
2. Transform diacritics (e.g. é -> e)
○ This is particularly important for names
3. Remove punctuation (e.g. !, ?, .)
4. Check if the word:
○ Is in the dictionary (case insensitive)
○ Has digits
○ Is in the names file (created with domain-specific names, e.g. Englishtown)
5. If none of these is true, the word is probably misspelled
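
The steps above could be sketched as follows. `DICTIONARY` and `NAMES` are tiny stand-ins for the real dictionary and names file, all function names are hypothetical, and two simplifications are assumed: diacritics are folded before the remaining non-ASCII symbols are dropped, and punctuation is stripped only from the ends of a word.

```python
import string
import unicodedata

# Tiny stand-ins for the real dictionary and names file (assumptions).
DICTIONARY = {"my", "name", "is", "i", "am", "live", "in", "cafe", "years", "old"}
NAMES = {"englishtown", "crystal"}


def normalize(word):
    """Steps 1-3: fold diacritics, drop other non-ASCII symbols, strip punctuation."""
    decomposed = unicodedata.normalize("NFKD", word)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.strip(string.punctuation)


def is_probably_misspelled(word):
    """Steps 4-5: a word is suspicious only if every check fails."""
    cleaned = normalize(word)
    if not cleaned:
        return False  # nothing left after cleaning, e.g. a lone dash
    lowered = cleaned.lower()
    return not (
        lowered in DICTIONARY
        or any(ch.isdigit() for ch in cleaned)
        or lowered in NAMES
    )


def misspelled_words(text):
    """Return the whitespace-separated words that fail all checks."""
    return [word for word in text.split() if is_probably_misspelled(word)]


print(misspelled_words("my nameis Crystal. I live in Englishtown"))
```

Splitting on whitespace means glued words such as "nameis" are reported as one misspelling, which matches how they appear in the learner essay shown earlier.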

Spelling: results
Summary:
● total essays: 85,629
● mean precision: 0.7128 (std: 0.3580)
● mean recall: 0.6535 (std: 0.4212)

Spelling: precision and recall per learner level

Spelling: F-score per nationality

Capitalization

Capitalization: heuristics
1. Check if the word starts a sentence
○ Split on punctuation (!, ., ?, etc.)
2. Check if the word is a known capitalized word
○ First person (I)
○ Day of the week
○ Month
○ Language (e.g. English, Spanish, French, etc.)
○ Country
○ Names (selected from the corpus to match context-specific names)
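
A minimal sketch of these two checks is shown below. The `ALWAYS_CAPITALIZED` set is a small illustrative sample (the real lists of days, months, languages, countries and names would be much larger), and the function name and return format are hypothetical.

```python
import re

# Small illustrative sample of always-capitalized words (assumption).
ALWAYS_CAPITALIZED = {
    "i", "monday", "sunday", "march", "english", "spanish", "french",
    "china", "brazil", "slovakia", "crystal",
}
SENTENCE_ENDINGS = {".", "!", "?"}


def capitalization_mistakes(text):
    """Return (offset, word) pairs for words that should start with a capital."""
    mistakes = []
    sentence_start = True
    # Tokens are either words or sentence-ending punctuation marks.
    for match in re.finditer(r"[A-Za-z][A-Za-z']*|[.!?]", text):
        token = match.group()
        if token in SENTENCE_ENDINGS:
            sentence_start = True
            continue
        needs_capital = sentence_start or token.lower() in ALWAYS_CAPITALIZED
        if needs_capital and not token[0].isupper():
            mistakes.append((match.start(), token))
        sentence_start = False
    return mistakes


print(capitalization_mistakes("hi. my name is crystal and i am from china."))
```

Treating every full stop as a sentence boundary is a simplification: abbreviations such as "e.g." would cause the following word to be flagged.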

Capitalization: results
Summary:
● mean recall: 0.5550 (std: 0.4472)

Capitalization: precision and recall per learner level

Capitalization: F-score per nationality

Articles

Articles: heuristics
1. Flag "a" used before a word that starts with a vowel
2. Flag "an" used before a word that starts with a consonant
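
The two rules can be sketched with a single regular expression. The function name and the tuple layout are hypothetical, and the check looks only at the first letter of the following word, so sound-based exceptions such as "an hour" or "a university" would be flagged incorrectly.

```python
import re

VOWELS = "aeiou"


def article_mistakes(text):
    """Return (offset, article, next_word, expected_article) for a/an mismatches."""
    mistakes = []
    pattern = r"\b(an?)\s+([A-Za-z]+)"
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        article, word = match.group(1), match.group(2)
        # Spelling-based rule: "an" before a vowel letter, "a" otherwise.
        expected = "an" if word[0].lower() in VOWELS else "a"
        if article.lower() != expected:
            mistakes.append((match.start(1), article, word, expected))
    return mistakes


print(article_mistakes("I have a apple and an banana, a dog and an orange"))
```

Because it only inspects "a" and "an", a sketch like this says nothing about missing articles or about "the".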

Articles: results
Summary:
● total items: 47,054
● average precision: 0.9724 (std: 0.1602)
● average recall: 0.0718 (std: 0.2463)

Articles: precision and recall per learner level

Articles: F-score per nationality

Overview

Mistake identification per learner level
● Efficiency of current heuristics

ideas

Next steps
● Clean up code
● Spelling
○ Use probabilistic models
■ http://norvig.com/spell-correct.html
● Capitalization
○ POS-tagging to identify names of people, organizations, places
● Articles
○ POS-tagging
○ Deal with plurals
○ Define heuristics for dealing with definite articles (the)
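
For the probabilistic spelling idea listed above, a much simplified sketch in the spirit of the linked Norvig essay might look like this. It is not the project's code: `CORPUS` is a toy stand-in for a large body of correct English text, only candidates one edit away are considered, and word frequency is the only ranking signal.

```python
import re
from collections import Counter

# Toy corpus standing in for a large body of correct English text (assumption).
CORPUS = "my name is crystal i am nine years old there are two people in my family"
WORD_COUNTS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correction(word):
    """Most frequent known word within one edit, else the word unchanged."""
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    return max(candidates, key=WORD_COUNTS.get) if candidates else word


print(correction("tow"), correction("nme"))
```

Unlike the dictionary lookup used so far, this kind of model also yields a proposed correction, which is one of the outputs the challenge asks for.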

Next steps
● Add to the user interface of EF Class
● Collect feedback from end-users (teachers)
● Design an algorithm for proposing the correct forms
● Deal with the other kinds of mistakes
● Implement a classifier using NLP (natural language processing) so that end-users can tell us whether the suggestions are good or not, and we can learn from them

Ideas
●

PyCon SK is not over...

Join the Coding Dojo tomorrow! (13/03)
● Sunday (13/03)
● 9:00 - 12:00
● Organizer:
○ Rodolfo Carvalho
http://codingdojo.org/cgi-bin/index.pl?WhatIsCodingDojo
https://www.youtube.com/watch?v=vqnwQ3oVM1M

Questions?
Thanks :)
@tati_alchueyr
ctx.tech.backend@ef.com
