Is that Dothraki or Valyrian?
and other NLP tasks with Python and NLTK
Charlie Redmon | SupStat, Inc.
August 18, 2014
Dothraki
Astapori Valyrian
High Valyrian
Importing raw text
import codecs

# Open the corpus as UTF-8 so non-ASCII characters decode to unicode
dothraki_f = codecs.open(
    "/home/cr/Python/westeros/dothraki.txt",
    encoding='utf-8')
dothraki_raw = dothraki_f.read()
print dothraki_raw
Athchomar chomakaan, [zhey] khal vezhven. Azha
anhaan asshilat... Itte oakah! Jadi, zhey Jora
Andahli. Khal vezhven. Ajjalan anha zalat vitiherat
yer hatif. Kash qoy qoyi thira disse. Hash shafka
zali addrivat mae, zhey Khaleesi? Ishish chare
...
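For readers on Python 3, the same import step can be sketched with the built-in `open`, which takes an `encoding` argument directly. This is only an illustration under that assumption: the helper name `read_text` and the temporary file are hypothetical, and the sample line is copied from the output above.

```python
import os
import tempfile


def read_text(path):
    """Return the whole file as a str, decoding it as UTF-8."""
    with open(path, encoding="utf-8") as f:
        return f.read()


# Round-trip one line from the slide through a temporary file
# (hypothetical file name, used only so the sketch is runnable).
tmp_dir = tempfile.mkdtemp()
tmp_path = os.path.join(tmp_dir, "dothraki_sample.txt")
with open(tmp_path, "w", encoding="utf-8") as f:
    f.write("Athchomar chomakaan, [zhey] khal vezhven.")

sample_raw = read_text(tmp_path)
print(sample_raw)
```

The `with` block closes the file handle automatically, which the slide version leaves open.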
Text processing: Cleaning
import re

# Strip punctuation, including the em dash, right quote, ellipsis, and brackets
punct_re = re.compile(
    ur'[.,;:?!\u2014\u2019\u2026\[\]]',
    re.UNICODE)
dothraki_proc = punct_re.sub('', dothraki_raw)
dothraki_proc = dothraki_proc.lower()
print dothraki_proc
athchomar chomakaan zhey khal vezhven azha anhaan
asshilat itte oakah jadi zhey jora andahli khal
vezhven ajjalan anha zalat vitiherat yer hatif kash
qoy qoyi thira disse
...
Text processing: Tokenizing
# Split on runs of whitespace; the set of tokens gives the word types
dothraki_tokens = re.split(ur'\s+', dothraki_proc)
dothraki_types = set(dothraki_tokens)
print dothraki_types
set([u'izzi', u'ale', u'morea', u'vesazhao',
u'yeri', u'ishish', u'dalen', u'vesazhae', u'yera',
u'afisi', u'rhae', u'mawizzi', u'vee', u'arrisse',
u'ti', u'ven', u'rizh', u'afichak', u'gache',
u'zigerek', u'zigereo', u'drivoe', u'maaz',
u'zigeree', u'ayyeyoon', u'maan', u'mahrazhi',
u'ma', u'vos', u'movekkhi', u'mahrazhis',
u'meshafka', u'qisi', u'sani', u'ville', u'vikeesi',
u'ifak', u'javrathi', u'zisa', u'chek', u'nem',
...
])
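The cleaning and tokenizing steps can be combined into one function. The sketch below assumes Python 3 (where every `str` is Unicode, so the `u` prefix and `re.UNICODE` flag are not needed); the name `tokenize` is hypothetical, and it additionally drops empty strings that `re.split` can produce at the edges of the text, which the slide code does not do.

```python
import re

# Same punctuation class as the cleaning slide: ASCII punctuation,
# em dash, right single quote, ellipsis character, and square brackets.
PUNCT_RE = re.compile(r"[.,;:?!\u2014\u2019\u2026\[\]]")


def tokenize(raw):
    """Remove punctuation, lowercase, and split on whitespace runs."""
    cleaned = PUNCT_RE.sub("", raw).lower()
    # Filter out empty strings left by leading or trailing whitespace.
    return [tok for tok in re.split(r"\s+", cleaned) if tok]


# Sample text taken from the opening of the Dothraki output above.
sample = "Athchomar chomakaan, [zhey] khal vezhven. Khal vezhven."
tokens = tokenize(sample)
types = set(tokens)
print(len(tokens), len(types))
```

Comparing `len(tokens)` with `len(types)` gives the token and type counts that the later frequency slides build on.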
Inspecting the lexical distribution in a text
from nltk import FreqDist

dothraki_freqdist = FreqDist(dothraki_tokens)
print dothraki_freqdist
<FreqDist: u'anha': 50, u'vos': 40, u'me': 39,
u'ma': 38, u'zhey': 29, u'mae': 27, u'anni': 26,
u'hash': 23, u'yer': 23, u'khal': 16,
u'khaleesi': 16, u'mori': 15, u'jin': 13,
u'kisha': 12, u'nem': 11, u'vo': 11, u'che': 10,
u'jini': 10, u'she': 10, ...>
# Plot cumulative counts for the 20 most frequent tokens
dothraki_freqdist.plot(20, cumulative=True)
CFD of Dothraki words
Top 10: anha, vos, me, ma, zhey, mae, anni, hash, yer, khal
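The numbers behind a cumulative plot like this one can be computed without plotting. The sketch below uses `collections.Counter` from the standard library as a stand-in for `FreqDist`, so it is not the code from the talk; the function name `cumulative_counts` and the six-token list are hypothetical, chosen only to keep the arithmetic easy to check by hand.

```python
from collections import Counter
from itertools import accumulate


def cumulative_counts(tokens, n):
    """Return (word, running total) pairs for the n most frequent tokens."""
    top = Counter(tokens).most_common(n)
    words = [word for word, _ in top]
    running = list(accumulate(count for _, count in top))
    return list(zip(words, running))


# Toy token list (not the real corpus counts).
toy_tokens = ["anha", "vos", "anha", "me", "vos", "anha"]
toy_cumulative = cumulative_counts(toy_tokens, 3)
print(toy_cumulative)
```

Each running total is the count of that word plus all more frequent words, which is the quantity a cumulative frequency curve shows on its vertical axis.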
Valyrian vocabulary distribution
Astapori Valyrian (Top 10):
ji, me, do, espo, si, mysa, eji, ez, ivetrá, sa
High Valyrian (Top 10):
daor, se, issa, syt, ziry, hen, jemēle, lue, yne, avy
Feature 1: Consonant proportion
def c_prop(word):
    c_num = 0
    for letter in u'bcdfgjklmnpqrstvxz\u00f1':
        c_num += word.count(letter)
    # float() avoids Python 2 integer division, which would return 0
    return float(c_num) / len(word)
c_prop(u'z\u016bgusy')
0.5
Word-internal consonant proportions across languages
Feature 2: Obstruent proportion
def obstruent_prop(word):
    obstruent_num = 0
    for letter in u'bcdfgjkpqstvxz':
        obstruent_num += word.count(letter)
    # float() avoids Python 2 integer division, which would return 0
    return float(obstruent_num) / len(word)
obstruent_prop(u'\u012blvi')
0.25
Word-internal obstruent proportions across languages
Feature 3: Coda presence
def c_coda(word):
    # 1 if the word ends in any consonant, else 0
    if word[-1] in u'bcdfgjklmnpqrstvxz\u00f1':
        return 1
    else:
        return 0
def obstruent_coda(word):
    # 1 if the word ends in an obstruent, else 0
    if word[-1] in u'bcdfgjkpqstvxz':
        return 1
    else:
        return 0
c_coda(u'lysoon')
1
obstruent_coda(u'lysoon')
0
Mean coda consonant presence across languages
Mean coda obstruent presence across languages
Feature 4: Consonant clusters
# Two or more consonants in a row count as one cluster
regex = ur'[bcdfghjklmnpqrstvxz\u00f1][bcdfghjklmnpqrstvxz\u00f1]+'
def c_cluster(word):
    cc_set = re.findall(regex, word, re.UNICODE)
    return len(cc_set)
c_cluster(u'avvirsosh')
3
Mean consonant cluster frequency across languages
Feature 5: Obstruent clusters
# Two or more obstruents in a row count as one cluster
regex1 = ur'[bcdfghjkpqstvxz][bcdfghjkpqstvxz]+'
def obs_cluster(word):
    oo_set = re.findall(regex1, word, re.UNICODE)
    return len(oo_set)
obs_cluster(u'avvirsosh')
2
Mean obstruent cluster frequency across languages
Feature 6: Vowel clusters
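# Splitting on consonant runs leaves the vowel runs
regex2 = ur'[bcdfghjklmnpqrstvxz\u00f1]+'
def v_cluster(word):
    # flags must be passed by keyword: the third positional
    # argument of re.split is maxsplit, not flags
    v_set = re.split(regex2, word, flags=re.UNICODE)
    vv_set = [v for v in v_set if len(v) > 1]
    return len(vv_set)
v_cluster(u'haeshi')
1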
Mean vowel cluster frequency across languages
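The six features can be gathered into a single dictionary per word, which is the shape of feature set that NLTK's classifiers accept. The sketch below assumes Python 3 and non-empty words; the function name `word_features` and the dictionary keys are hypothetical. It keeps the letter sets exactly as they appear on the slides, including the fact that the cluster patterns contain "h" while the proportion and coda sets do not, and it writes the cluster pattern as `{2,}`, which matches the same strings as the slides' `[...][...]+`.

```python
import re

CONSONANTS = "bcdfgjklmnpqrstvxz\u00f1"      # proportion and coda slides
OBSTRUENTS = "bcdfgjkpqstvxz"
CLUSTER_CONS = "bcdfghjklmnpqrstvxz\u00f1"   # cluster slides also include "h"
CLUSTER_OBS = "bcdfghjkpqstvxz"


def word_features(word):
    """Return the six word-internal features for one non-empty word."""
    n = len(word)
    # Splitting on consonant runs leaves the vowel runs.
    vowel_runs = re.split("[%s]+" % CLUSTER_CONS, word)
    return {
        "c_prop": sum(word.count(c) for c in CONSONANTS) / n,
        "obstruent_prop": sum(word.count(c) for c in OBSTRUENTS) / n,
        "c_coda": int(word[-1] in CONSONANTS),
        "obstruent_coda": int(word[-1] in OBSTRUENTS),
        "c_cluster": len(re.findall("[%s]{2,}" % CLUSTER_CONS, word)),
        "obs_cluster": len(re.findall("[%s]{2,}" % CLUSTER_OBS, word)),
        "v_cluster": sum(1 for run in vowel_runs if len(run) > 1),
    }


features = word_features("avvirsosh")
print(features["c_cluster"], features["obs_cluster"])
```

Called on the example words from the slides, the individual entries reproduce the values shown there, for instance a consonant cluster count of 3 and an obstruent cluster count of 2 for "avvirsosh".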
Data from real languages
TDIL Assamese Corpus
Assamese corpus files
import os

# List the corpus files in the TDIL Assamese data directory
directory = "/home/cr/Documents/NLPwP_pres/TDIL_assamese_corpus_data"
os.listdir(directory)
['subj_art2.txt', 'subj_politics1.txt', 'lit3.txt',
'drama.txt', 'religion2.txt', 'criticism2.txt',
'criticism1.txt', 'subj_science3.txt',
'ref_encyclopaedia-entry.txt', 'subj_science2.txt',
'subj_social-studies.txt', 'music.txt', 'subj_art1.txt
'subj_science1.txt', 'subj_art3.txt', 'news.txt',
'subj_sociology.txt', 'criticism3.txt', 'lit8.txt',
'subj_history1.txt', 'lit4.txt', 'lit6.txt', 'religion
'subj_law.txt', 'lit7.txt', 'religion1.txt', 'criticis
'lit5.txt', 'subj_math.txt', 'lit1.txt', 'subj_science
'subj_science_5.txt', 'subj_history2.txt', 'lit2.txt',
'subj_science4.txt', 'letter.txt']
Assamese sample: ‘lit5.txt’
Frequency of the sound /x/ in ’lit5.txt’
len(re.findall(ur'[\u09b6\u09b7\u09b8]',
               assamese_sample_raw, re.UNICODE))
1313
len(re.findall(ur'\u09b6', assamese_sample_raw,
               re.UNICODE))
298
len(re.findall(ur'\u09b7', assamese_sample_raw,
               re.UNICODE))
195
len(re.findall(ur'\u09b8', assamese_sample_raw,
               re.UNICODE))
820
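The three per-letter counts above (298, 195, 820) add up to the combined count of 1313, so the whole breakdown can be produced in one pass over the text. The sketch below assumes Python 3 and uses `collections.Counter` in place of repeated `re.findall` calls; the function name `sibilant_counts` and the short test string are hypothetical and are not drawn from the corpus.

```python
from collections import Counter

# The three letters counted on the slide: U+09B6, U+09B7, U+09B8.
SIBILANTS = "\u09b6\u09b7\u09b8"


def sibilant_counts(text):
    """Return (per-letter counts, combined count) for the three letters."""
    counts = Counter(ch for ch in text if ch in SIBILANTS)
    per_letter = {ch: counts[ch] for ch in SIBILANTS}
    return per_letter, sum(per_letter.values())


# Toy string: two U+09B8, one U+09B6, one U+09B7, plus an unrelated letter.
toy_text = "\u09b8\u09b6\u09b8x\u09b7"
per_letter, total = sibilant_counts(toy_text)
print(per_letter, total)
```

Returning the total alongside the breakdown makes it easy to confirm that the parts sum to the whole, as they do on the slide.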
Positional restrictions
Beginning a word:
len(re.findall(ur'\b[\u09b6\u09b7\u09b8]',
               assamese_sample_raw, re.UNICODE))
1129
Ending a word:
len(re.findall(ur'[\u09b6\u09b7\u09b8]\b',
               assamese_sample_raw, re.UNICODE))
895
Positional restrictions
Following /a/:
len(re.findall(ur'\u09be[\u09b6\u09b7\u09b8]',
               assamese_sample_raw, re.UNICODE))
57
Following /i/:
len(re.findall(ur'[\u09bf\u09c0][\u09b6\u09b7\u09b8]',
               assamese_sample_raw, re.UNICODE))
70
Following /u/:
len(re.findall(ur'[\u09c1\u09c2][\u09b6\u09b7\u09b8]',
               assamese_sample_raw, re.UNICODE))
10
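Word-initial and word-final counts can also be taken from whitespace-separated tokens instead of `\b`. This is an alternative sketch, not the method used in the talk. One reason to consider it: Python's `re` defines `\b` in terms of `\w`, and Bengali-script vowel signs are combining marks that `\w` may not treat as word characters, so `\b` can match inside a word. The code assumes Python 3; the name `positional_counts` is hypothetical, and the five sample words are toy data rather than corpus text.

```python
import re

SIBILANTS = "\u09b6\u09b7\u09b8"

# Hypothetical toy sample (not taken from the TDIL corpus).
toy_sample = " ".join([
    "\u09b8\u09be\u0997\u09f0",        # sibilant at the start of the word
    "\u09ad\u09be\u09b7\u09be",        # sibilant right after vowel sign U+09BE
    "\u0985\u09b8\u09ae",              # sibilant in the middle of the word
    "\u09a6\u09c7\u09b6",              # sibilant at the end of the word
    "\u09ae\u09be\u09a8\u09c1\u09b9",  # no sibilant
])


def positional_counts(text):
    """Count the three letters overall and in selected positions."""
    words = text.split()  # split() never yields empty tokens
    return {
        "total": len(re.findall("[%s]" % SIBILANTS, text)),
        "initial": sum(1 for w in words if w[0] in SIBILANTS),
        "final": sum(1 for w in words if w[-1] in SIBILANTS),
        "after_aa": len(re.findall("\u09be[%s]" % SIBILANTS, text)),
    }


toy_positions = positional_counts(toy_sample)
print(toy_positions)
```

Note that "final" here means the last code point of the token, so a word ending in a sibilant plus a vowel sign is not counted as sibilant-final.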
Further work
Incorporate segmental parameters into classifier (fix Unicode issues with NLTK’s classify module)
Use classifier to predict assignment of random words from Westeros to Dothraki, Astapori Valyrian, and High Valyrian languages
Isolate most important word-internal parameters in classification model (log-likelihood ranking in Naive Bayes model)
Use full distributional account of select Assamese consonants as priors in acoustic classification model
Thank you

Natural Language Processing(SupStat Inc)
