LIWC Dictionary Expansion
Luiz Gustavo Ferraz Aoqui
Social Computing Lab – GSCT – KAIST
Motivation
• Dictionary-based classifiers have high precision
  • But usually low recall

• Natural language is very dynamic
  • New words appear
  • Words change their meaning and sentiment
  • Heaps' law: the vocabulary keeps growing as the corpus grows

• Hard to update the dictionary at the same speed
LIWC Dictionary
• Fairly large dictionary
  • Almost 4,500 words and stems
     • 406 positive
     • 499 negative
• Developing and updating it is a long process
  • Almost exclusively done manually
  • Requires a lot of human resources
• Last update was in 2007
  • Twitter was launched in July, 2006
System overview
 19027743 1985381275 NULL NULL <d>2009-06-01 00:00:00</d> <s>web</s> <t>I think i'm gonna go with the magic in 6.... just cause now that bron bron's out i wanna see kobe lose too.</t> SeanBennettt 98 434 159 -18000 0 0 <n>Sean Bennett</n> <ud>2009-01-15 16:36:04</ud> <t>Eastern Time (US &amp; Canada)</t> <l>Long Island, NY</l>
 ...

 Positive:
 .. :) :- ...... live tweet ;) .& -- =) everytime rain tweets (:
 mj xd michael !!!!!! lil ." dog sun jus fan wit =] :] aww
 album via luv photo ;- john pic different kno wearing
 la ).

 Negative:
 !! :( ?? getting twitter omg ?! ppl :/ dude idk da
 weather bout wtf iphone smh wat internet =( heat dnt
 =/ facebook :| gosh kate :[ fml ima jon swear punch
 text =[ cringe ): nd ** imma
System overview
System overview/Parser
 19027743 1985381275 NULL NULL <d>2009-06-01 00:00:00</d> <s>web</s> <t>I think i'm gonna go with the magic in 6.... just cause now that bron bron's out i wanna see kobe lose too.</t> SeanBennettt 98 434 159 -18000 0 0 <n>Sean Bennett</n> <ud>2009-01-15 16:36:04</ud> <t>Eastern Time (US &amp; Canada)</t> <l>Long Island, NY</l>
 ...




  haha nooo! i just wanna kill mee!!!! i didn`t do my
 homework...and i feel sick =(
 I can see the bus again. that makes me happy.
 $$ Black Swan Fund Makes a Big Bet on Inflation
 wonder how Roubini feels about this...?
 blahh, i feel boredd and tiredd as hell haha
 jay to conan... upgrade. lc to kristin... downgrade.
 rushing home for lauren's final episode. my life
 makes me sad.
Parser

 Structured text → Extract tweet (RegEx) → Tweets → Filter → Clean → Clean tweets

 Clean:
   • Remove user name (RegEx)
   • Remove URL (RegEx)
   • Remove hash tag (RegEx)
Parser
    • Regular Expressions
         • Very powerful tool for text processing…
         • ..but very complex
         • Ex.:


 Input:
   <d>2009-06-01 00:00:00</d> <s>web</s> <t>I just reached level 2. #spymaster http://bit.ly/playspy</t> asmith393 1522 1498 207 -18000 0 0 <n>Adam Smith</n> <ud>2007-03-07 18:17:20</ud> <t>Eastern Time (US &amp; Canada)</t>
 RegEx:
   <t>(.*?)</t>
 Output:
   I just reached level 2. #spymaster http://bit.ly/playspy
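The extraction step can be sketched in Python with the pattern shown on the slide. The function name `extract_tweet` and the shortened sample record are illustrative assumptions, not the author's code. Because each record holds two `<t>` fields (tweet and time zone), the sketch relies on the tweet being the first one.

```python
import re

# Non-greedy pattern from the slide: capture whatever sits inside <t>...</t>.
TWEET_RE = re.compile(r"<t>(.*?)</t>")

def extract_tweet(record):
    """Return the text of the first <t> field of a raw record, or None."""
    match = TWEET_RE.search(record)  # search() stops at the first match
    return match.group(1) if match else None

# Shortened sample record (assumption: the tweet is the first <t> field).
record = (
    "<d>2009-06-01 00:00:00</d> <s>web</s> "
    "<t>I just reached level 2. #spymaster</t> asmith393 1522 1498 207 -18000 0 0 "
    "<n>Adam Smith</n> <ud>2007-03-07 18:17:20</ud> "
    "<t>Eastern Time (US &amp; Canada)</t>"
)
print(extract_tweet(record))
```

The non-greedy `.*?` matters here: a greedy `.*` would run from the first `<t>` to the last `</t>` and swallow the user fields in between.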
Parser
   • Regular Expressions
        • Very powerful tool for text processing…
        • ..but very complex
        • Ex.:


 Input:
   I just reached level 2. #spymaster http://bit.ly/playspy
 RegEx:
   #[0-9a-zA-Z+_]*
 Output:
   I just reached level 2. http://bit.ly/playspy
Parser
   • Regular Expressions
        • Very powerful tool for text processing…
        • ..but very complex
        • Ex.:


 Input:
   I just reached level 2. #spymaster http://bit.ly/playspy
 RegEx:
   ((http://|www.)([a-zA-Z0-9/.~])*)
 Output:
   I just reached level 2. #spymaster
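A minimal sketch of the cleaning stage, chaining the hash tag and URL patterns from the slides. The user-name pattern `@\w+` and the function name `clean_tweet` are assumptions, since the slides do not show them.

```python
import re

URL_RE = re.compile(r"((http://|www.)([a-zA-Z0-9/.~])*)")  # pattern from the slides
HASHTAG_RE = re.compile(r"#[0-9a-zA-Z+_]*")                # pattern from the slides
USER_RE = re.compile(r"@\w+")                              # assumed, not in the slides

def clean_tweet(text):
    """Strip URLs, hash tags and user names, then normalise the whitespace."""
    for pattern in (URL_RE, HASHTAG_RE, USER_RE):
        text = pattern.sub("", text)
    return " ".join(text.split())

print(clean_tweet("I just reached level 2. #spymaster http://bit.ly/playspy"))
```

Note that the URL pattern is kept as shown: it does not cover `https://`, and the unescaped dot in `www.` matches any character.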
System overview/Master
  haha nooo! i just wanna kill mee!!!! i didn`t do my
 homework...and i feel sick =(
 I can see the bus again. that makes me happy.
 $$ Black Swan Fund Makes a Big Bet on Inflation
 wonder how Roubini feels about this...?
 blahh, i feel boredd and tiredd as hell haha
 jay to conan... upgrade. lc to kristin... downgrade.
 rushing home for lauren's final episode. my life
 makes me sad.




 Files produced: Index, Frequency, Chunks, Co-frequency
Master
 Tweets → Splitter → Chunks
 Tweets → Indexer → Index
 Chunks → Mapper (M, M, M) → Reducer (R, R, R) → Unsorted frequency, Co-frequency
 Unsorted frequency → Sort → Frequency
Master/Splitter
• Count the lines in the input file
• Select only tweets that contain words from the LIWC
  dictionary
• Split the input file into smaller chunks
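The splitter steps can be sketched as follows. This is a simplified illustration, not the author's code: the name `split_tweets`, the whitespace tokenisation and the exact-match dictionary test (no stem wildcards) are all assumptions.

```python
def split_tweets(lines, dictionary, chunk_size):
    """Keep tweets with at least one dictionary word, then cut them into chunks."""
    kept = [
        line for line in lines
        if any(word in dictionary for word in line.lower().split())
    ]
    # Slice the surviving tweets into lists of at most chunk_size items.
    return [kept[i:i + chunk_size] for i in range(0, len(kept), chunk_size)]

tweets = ["that makes me happy", "rushing home", "my life makes me sad", "i feel sick"]
chunks = split_tweets(tweets, {"happy", "sad", "sick"}, chunk_size=2)
print(len(tweets), "lines in,", len(chunks), "chunks out")
```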
Master/Indexer
• Simply save the vocabulary to a file, sorted
  alphabetically
• Needed later to find the hashed per-word files
Master/Mapper
• Spawn processes in parallel and divide the
  chunks among them
• Each worker does two jobs:
  • First: create (word, frequency) pairs


  Chunk → Worker → Frequency.tmp
    someone   6
    down      8
    ever      10
    kinda     2
    crazy     14
    …
Master/Mapper
• Spawn processes in parallel and divide the
  chunks among them
• Each worker does two jobs:
  • First: create (word, frequency) pairs
  • Second: save the co-words for each word
Master/Mapper
 Worker steps:
   • Split words
   • Remove duplicates
   • Generate files
   • Save co-words

 Example:
   haha nooo! i just wanna kill mee!!!! i didn`t do my homework...and i feel sick =(
 is split into:
   haha, nooo, !, i, just, wanna, kill, mee, !!!!, i, didn`t, do, my, homework, ..., and, i, feel, sick, =(
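A minimal sketch of one mapper worker, assuming that both the frequency and the co-word counts are taken per tweet after duplicates are removed (the slides do not state this). It keeps everything in memory instead of writing per-word files, and `map_chunk` is a hypothetical name.

```python
from collections import Counter, defaultdict

def map_chunk(tweets):
    """Return (word frequency, co-word counts) for one chunk of clean tweets."""
    frequency = Counter()
    co_words = defaultdict(Counter)
    for tweet in tweets:
        words = set(tweet.split())       # split words, then remove duplicates
        frequency.update(words)          # one count per tweet containing the word
        for word in words:
            co_words[word].update(words - {word})  # every other word is a co-word
    return frequency, co_words

frequency, co_words = map_chunk(["i feel sick", "i feel happy"])
print(frequency["feel"], dict(co_words["sick"]))
```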
Master/Mapper/Issues
• Splitting is not trivial
  • Splitting on whitespace
     • homework… ≠ homework
  • Remove punctuation
     • :) ☐
  • Solution: RegEx again
     • ([\w-'`]*)(\W*)

• File names:
  • Unique, easy to find and respect OS rules
     • Hash
       • This is why the index file is important
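The splitting idea can be tried out with a short sketch. It assumes the slide's pattern pairs a run of word characters with a run of symbols; the hyphen is moved to the end of the character class because Python 3's `re` module rejects `\w-'` as a character range. The function name `tokenize` is hypothetical.

```python
import re

# Group 1: word characters plus ' ` and -   Group 2: the symbols that follow.
TOKEN_RE = re.compile(r"([\w'`-]*)(\W*)")

def tokenize(text):
    """Split on whitespace, then separate words from attached punctuation."""
    tokens = []
    for chunk in text.split():
        for word, symbols in TOKEN_RE.findall(chunk):
            # findall also yields an empty match at the end; skip empty parts.
            tokens.extend(part for part in (word, symbols) if part)
    return tokens

print(tokenize("i didn`t do my homework...and i feel sick =("))
```

Punctuation is kept as its own token rather than thrown away, so emoticons such as `=(` survive.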
Master/Mapper/Issues
• Parallel programming in Python
 • The original interpreter (CPython) cannot run threads in parallel: its
   global interpreter lock lets only one thread execute Python code at a time…
    • Alternatives, such as Jython and IronPython, can
 • …but it is still possible to work in parallel
 • Multi-thread vs. multi-process
 • Multi-process in Python
    • multiprocessing module
    • http://docs.python.org/library/multiprocessing.html#module-multiprocessing.pool
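A minimal sketch of the multi-process approach with `multiprocessing.Pool`. The worker function, the chunk layout and the process count are illustrative assumptions, not the author's code.

```python
from collections import Counter
from multiprocessing import Pool

def count_words(chunk):
    """Worker: count the words of one chunk (a list of tweets)."""
    counts = Counter()
    for tweet in chunk:
        counts.update(tweet.split())
    return counts

def count_in_parallel(chunks, processes=2):
    """Distribute the chunks over worker processes and merge their counters."""
    with Pool(processes=processes) as pool:
        partial_counts = pool.map(count_words, chunks)
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

if __name__ == "__main__":
    chunks = [["i feel sick", "i feel happy"], ["my life makes me sad"]]
    print(count_in_parallel(chunks).most_common(2))
```

The worker is a module-level function so that it can be pickled and sent to the child processes, and the `__main__` guard keeps those children from re-running the demo on platforms that start them by importing the script.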
Master/Reducer
• Spawn processes in parallel and split the words
  among them
• Basically counts the mapper results
• Each worker also does two jobs:
  • First: sum all the (word, frequency) pairs and save the totals

  frequency.tmp → Reducer → frequency.txt
    car     4                car      5
    house   2                house    3
    ball    5                ball     5
    car     1
    house   1
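The first reducer job amounts to summing the counts per word. A minimal in-memory sketch using the numbers from the slide (the name `reduce_frequencies` is hypothetical, and file handling is left out):

```python
from collections import Counter

def reduce_frequencies(pairs):
    """Sum the (word, frequency) pairs emitted by the mapper workers."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals

pairs = [("car", 4), ("house", 2), ("ball", 5), ("car", 1), ("house", 1)]
totals = reduce_frequencies(pairs)
print(totals["car"], totals["house"], totals["ball"])  # 5 3 5
```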
Master/Reducer
• Spawn processes in parallel and split the words
  among them
• Basically counts the mapper results
• Each worker also does two jobs:
  • First: sum all the (word, frequency) pairs and save the totals
  • Second: sum the co-occurrence frequencies

  Co-words of "trip" → Worker → summed co-words of "trip"
    car     1                car     3
    ball    3                ball    3
    car     2                house   1
    house   1
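The second reducer job can be sketched the same way, with one counter per word. The record layout `(word, co-word, count)` and the name `reduce_co_words` are assumptions made for illustration; the original stores one file per word.

```python
from collections import Counter, defaultdict

def reduce_co_words(records):
    """Sum (word, co-word, count) records into one counter per word."""
    totals = defaultdict(Counter)
    for word, co_word, count in records:
        totals[word][co_word] += count
    return totals

records = [("trip", "car", 1), ("trip", "ball", 3), ("trip", "car", 2), ("trip", "house", 1)]
co_totals = reduce_co_words(records)
print(dict(co_totals["trip"]))
```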
Master/Reducer/Issues
• Index file
  • Useful to access the files
     • Each word has a file with a list of co-words
     • But file name is hashed
       • Non-invertible function
     • Look-up on index, hash the word and get the file
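A sketch of the look-up, assuming SHA-1 as the hash and a `.txt` suffix (the slides name neither). Since the hash cannot be inverted, the file names are recomputed from the words stored in the index.

```python
import hashlib

def file_name_for(word):
    """Map any token, including emoticons, to a file name valid on common OSes."""
    return hashlib.sha1(word.encode("utf-8")).hexdigest() + ".txt"

def build_index(vocabulary):
    """Alphabetically sorted vocabulary, as saved by the indexer."""
    return sorted(set(vocabulary))

def co_word_files(index):
    """Walk the index and recompute the hashed file name of every word."""
    return {word: file_name_for(word) for word in index}

files = co_word_files(build_index([":)", "car", "car"]))
print(files["car"])
```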
Master/Sort
• Simply sort the frequencies file
  • Most frequent first
Classifier

 Inputs: Frequency, Co-frequency
 Parameters: α, β, γ, δ, Max results
 Outputs: Scores, New words
Classifier/Sentiment words


 Frequency list, from which the top α% are taken:
   Car        232
   Ball       143
   Street     125
   House      121
   Boat       114
   Pencil     105
   Pen        98
   Computer   81
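Selecting the top α% of a sorted list can be sketched as below. The rounding rule (round up, keep at least one item) and the name `top_percent` are assumptions; the same helper would apply to the top β% of a co-word list.

```python
import math

def top_percent(ranked, percent):
    """Return the first `percent` % of an already sorted list."""
    keep = max(1, math.ceil(len(ranked) * percent / 100))
    return ranked[:keep]

frequency = [("Car", 232), ("Ball", 143), ("Street", 125), ("House", 121),
             ("Boat", 114), ("Pencil", 105), ("Pen", 98), ("Computer", 81)]
print(top_percent(frequency, 25))  # the first two of the eight entries
```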
Classifier/Co-words


 Top β% of the co-words of each word:
   Car:     engine, tire, door
   Ball:    court, game, play
   Street:  name, size
Classifier/Score

 Co-word lists:
   engine, tire, door
   court, game, play
   door, size

   size, room, type, home
   price, size, door

 Scores:
   engine   1 0
   tire     1 0
   door     2 1
   size     1 2
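One reading of the score table, offered as an assumption: the first three co-word lists belong to positive dictionary words, the last two to negative ones, and each word scores one point per list it appears in. Under that reading the sketch below reproduces the numbers on the slide.

```python
from collections import Counter

def score_co_words(positive_lists, negative_lists):
    """Count, per word, how many positive and negative co-word lists contain it."""
    positive, negative = Counter(), Counter()
    for words in positive_lists:
        positive.update(set(words))
    for words in negative_lists:
        negative.update(set(words))
    return {word: (positive[word], negative[word])
            for word in set(positive) | set(negative)}

positive_lists = [["engine", "tire", "door"], ["court", "game", "play"], ["door", "size"]]
negative_lists = [["size", "room", "type", "home"], ["price", "size", "door"]]
scores = score_co_words(positive_lists, negative_lists)
print(scores["door"], scores["size"])  # (2, 1) (1, 2)
```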
Classifier/Collapse
• Created to deal with problems like:
  • :) :)) :), :).
  • They should all be treated as the same token
  • Harder for words
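A hypothetical way to collapse such variants, not necessarily the one used here: squeeze repeated characters, then drop trailing commas and periods from tokens longer than two characters.

```python
import re

def collapse(token):
    """Reduce variants such as ':))', ':),' and ':).' to ':)'."""
    token = re.sub(r"(.)\1+", r"\1", token)  # squeeze runs of one character
    if len(token) > 2:
        token = token.rstrip(",.")           # drop trailing sentence punctuation
    return token

print([collapse(t) for t in [":)", ":))", ":),", ":)."]])
```

The same rule turns `feel` into `fel`, which shows why collapsing is harder for words than for emoticons.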
Classifier/New words
• Rules to compare the scores
  • So far the rules are
    • If the positive score is bigger than the negative
      score plus delta, tag the word as positive
    • Same idea for negative
• Returns the new words up to a maximum value
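The comparison rule can be sketched as follows. Ranking the candidates by score margin before applying the maximum is an assumption, as are the function and parameter names.

```python
def tag_new_words(scores, delta, max_results):
    """Tag words whose positive and negative scores differ by more than delta."""
    tagged = []
    for word, (positive, negative) in scores.items():
        if positive > negative + delta:
            tagged.append((word, "positive", positive - negative))
        elif negative > positive + delta:
            tagged.append((word, "negative", negative - positive))
    # Assumed ordering: largest margin first (ties keep their input order).
    tagged.sort(key=lambda item: item[2], reverse=True)
    return [(word, label) for word, label, _ in tagged[:max_results]]

scores = {"engine": (1, 0), "tire": (1, 0), "door": (2, 1), "size": (1, 2)}
print(tag_new_words(scores, delta=0, max_results=10))
```

With `delta=1` none of these four words qualifies, since every margin is exactly 1.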
Other ideas
• WordNet based
• PMI similarity score
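For reference, pointwise mutual information compares how often two words co-occur with how often they would co-occur if they were independent. A minimal sketch from raw counts, with hypothetical parameter names and tweets assumed as the counting unit:

```python
import math

def pmi(pair_count, count_x, count_y, total):
    """PMI (base 2) of words x and y from counts over `total` tweets."""
    p_xy = pair_count / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))  # undefined when pair_count is 0

print(pmi(1, 1, 1, 4))  # 2.0: the words co-occur four times more than by chance
```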
Evaluation
• Two evaluation methods:
 • First method
    • Find tweets that could not be categorized before
      but can be now
    • Manually check the precision of the result
 • Second method
    • Manually select positive and negative tweets
    • Compare the precision of the old dictionary with
      the new dictionary
Sub-product
• LIWC Dictionary Library for Python
  • Provides easy access to the dictionary information
     • Easy search
     • Reverse index
     • Match wildcard
  • Ex.:
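The library's own interface is not shown here, so the following is only a hypothetical sketch of wildcard matching. It relies on the LIWC convention that an entry ending in `*` matches every word starting with that stem; the function names and the two sample entries are made up for illustration.

```python
def matches(entry, word):
    """True if a dictionary entry (optionally ending in '*') matches the word."""
    if entry.endswith("*"):
        return word.startswith(entry[:-1])
    return word == entry

def categories_for(word, dictionary):
    """Collect the categories of every entry that matches the word."""
    found = set()
    for entry, categories in dictionary.items():
        if matches(entry, word):
            found.update(categories)
    return found

# Made-up sample entries, not taken from the real LIWC dictionary file.
sample = {"happ*": {"positive"}, "sad": {"negative"}}
print(categories_for("happy", sample), categories_for("sadly", sample))
```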