A Study on the Spatio-Temporal Trend of Brand Index using Twitter Messages Sentiment Analysis
Abstract
[Figure: Twitter Data linked to Social Science, Human, Art, Medical, and Economy; Sentiment Analysis]
Introduction
 Twitter Crawling
 Data Pre-processing
 Korean Morphology Analysis
 Twitter Opinion Mining
 Sentiment Dictionary
 Evaluating performance of candidate classifiers
 Sentiment Classification
 Visualize Associative Relationship of Terms
 Relationship with Brand Index
Twitter Crawling
Twitter API
 Streaming API: get 1% of all Twitter data in real time
 REST API (Search API): get Twitter data matching a keyword
Collection period: 2013.9.9 (Mon) 9:35 pm ~ now
About 10,000 ~ 15,000 tweets per day
Total 1,220,000 tweets (as of 2013.11.2, Sat)
Data Pre-Processing
 Keep only tweets that contain more than 3 Korean characters and were posted within a 500 km radius of Seoul, Korea.
 Remove foreign-language text and special characters.
 Remove tweets that contain only location information.
 Remove retweets.
‫ويتكلم‬ ‫نهائيا‬ ‫السمع‬ ‫فقد‬ ‫متعب‬ ‫ابو‬ ‫الملك‬ ‫ان‬ ‫خبر‬ ‫اكد‬ ‫المستوى‬ ‫رفيع‬ ‫وامير‬ ‫موثوق‬ ‫صدر‬
‫مفهوم‬ ‫وغير‬ ‫مترابط‬ ‫غير‬ ‫كالم‬((‫تخريف‬::)) Sat Oct 12 00:06:37 KST 2013
I'm at Club ELLUI - @ellui_seoul (서울특별시) w/ 2 others http://t.co/zhcrncosKH::Sat Oct 12 00:02:06 KST 2013
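The filtering rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' crawler: the Seoul coordinates, the `keep_tweet` and `clean_text` names, the `"RT @"` and `"I'm at "` prefix heuristics, and the choice to count only complete Hangul syllables (U+AC00 to U+D7A3) are all assumptions made for the example.

```python
import math
import re

SEOUL_LAT, SEOUL_LON = 37.5665, 126.9780  # assumed coordinates for central Seoul
MAX_DISTANCE_KM = 500.0
MIN_KOREAN_CHARS = 4  # "more than 3" Korean characters

HANGUL_SYLLABLE = re.compile(r"[\uac00-\ud7a3]")
NON_KOREAN = re.compile(r"[^\uac00-\ud7a3\s]")


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    radius = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))


def keep_tweet(text, lat, lon, is_retweet=False):
    """Return True if the tweet passes all four filtering rules."""
    # Rule: remove retweets (the "RT @" prefix check is a heuristic assumption).
    if is_retweet or text.startswith("RT @"):
        return False
    # Rule: remove location-only check-ins (hypothetical prefix heuristic).
    if text.startswith("I'm at "):
        return False
    # Rule: require more than 3 Korean characters.
    if len(HANGUL_SYLLABLE.findall(text)) < MIN_KOREAN_CHARS:
        return False
    # Rule: require a geotag within 500 km of Seoul (untagged tweets are rejected here).
    if lat is None or lon is None:
        return False
    return haversine_km(lat, lon, SEOUL_LAT, SEOUL_LON) <= MAX_DISTANCE_KM


def clean_text(text):
    """Drop everything except Hangul syllables and whitespace."""
    return " ".join(NON_KOREAN.sub(" ", text).split())
```

Note that the check-in example above contains five Hangul characters (서울특별시), so the character-count rule alone would not remove it; that is why a separate location-only rule is needed.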
Korean Morpheme Analyzer
 꼬꼬마 (Kkma) Korean Morpheme Analyzer
 한나눔 (Hannanum) Korean Morpheme Analyzer
 Komoran Korean Morpheme Analyzer
 Lucene Korean Analyzer
 은전한닢 (Eunjeon Hannip) Korean Morpheme Analyzer
 Performance of the analyzer
 Foreign language and slang tagging
 Sentiment-related word tagging (slang, verb, emoticon)
 It has a good dictionary.
 There is no need to worry about word spacing.
 However, it cannot recognize many emoticons, nor metaphor, sarcasm, or irony.
Korean Morpheme Analyzer
> 배가 아파서 병원에 갔다. ("My stomach hurt, so I went to the hospital.")
배 NN,F,배,*,*,*,*,*
가 JKS,F,가,*,*,*,*,*
아파서 VA+EC,F,아파서,Inflect,VA,EC,아프/VA+ㅏ서/EC,*
병원 NN,T,병원,*,*,*,*,*
에 JKB,F,에,*,*,*,*,*
갔 VV+EP,T,갔,Inflect,VV,EP,가/VV+ㅏㅆ/EP,*
다 EF,F,다,*,*,*,*,*
. SF,*,*,*,*,*,*,*
EOS
Parts of speech: Noun, Verb, Adjective, Adverb, Root
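As a rough illustration of how analyzer output like the sample above could be turned into content words, the following Python sketch parses each line into a surface form, a tag, and a lemma. It assumes the layout shown in the sample (surface form, whitespace, then comma-separated features with the decomposition in the seventh field); the `parse_analysis` and `content_words` names and the tag-to-category mapping are hypothetical, and the `MAG` and `XR` entries do not appear in the sample.

```python
SAMPLE_OUTPUT = """\
배 NN,F,배,*,*,*,*,*
가 JKS,F,가,*,*,*,*,*
아파서 VA+EC,F,아파서,Inflect,VA,EC,아프/VA+ㅏ서/EC,*
병원 NN,T,병원,*,*,*,*,*
에 JKB,F,에,*,*,*,*,*
갔 VV+EP,T,갔,Inflect,VV,EP,가/VV+ㅏㅆ/EP,*
다 EF,F,다,*,*,*,*,*
. SF,*,*,*,*,*,*,*
EOS
"""

# Assumed mapping from analyzer tags to the categories named on the slide.
CONTENT_TAGS = {
    "NN": "Noun",
    "VV": "Verb",
    "VA": "Adjective",
    "MAG": "Adverb",  # assumption: not present in the sample
    "XR": "Root",     # assumption: not present in the sample
}


def parse_analysis(output):
    """Parse analyzer output into (surface, tag, lemma, lemma_tag) tuples."""
    tokens = []
    for line in output.splitlines():
        line = line.strip()
        if not line or line == "EOS":
            continue
        surface, features = line.split(None, 1)
        fields = features.split(",")
        tag = fields[0]
        # The seventh field decomposes inflected forms, e.g. "아프/VA+ㅏ서/EC".
        expression = fields[6] if len(fields) > 6 else "*"
        if expression != "*":
            lemma, lemma_tag = expression.split("+")[0].split("/")[:2]
        else:
            lemma, lemma_tag = surface, tag.split("+")[0]
        tokens.append((surface, tag, lemma, lemma_tag))
    return tokens


def content_words(tokens):
    """Keep only lemmas whose tag maps to a sentiment-bearing category."""
    return [(lemma, CONTENT_TAGS[lemma_tag])
            for _, _, lemma, lemma_tag in tokens
            if lemma_tag in CONTENT_TAGS]


if __name__ == "__main__":
    for lemma, category in content_words(parse_analysis(SAMPLE_OUTPUT)):
        print(lemma, category)
```

Using the lemma rather than the surface form means that 아파서 and other inflections of 아프 map to the same dictionary entry. The parser does not handle a surface form that is itself a comma.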
Building Sentiment Dictionary
1. Manually labeled Twitter data
• 6 days of Twitter data (2013.9.9, 9.16, 9.23, 9.30, 10.7, 10.14)
• Labeled positive and negative sets of Noun, Adjective, Verb, Root (total 8 sets)
• Labeled by 4 people
2. Naver Movie review data with rating
• 20,000 reviews from 2 movies
• 545 positive set, 545 negative set, 545 neutral set
[Figure: two charts, Movie 1 and Movie 2, over ratings 1 to 10, with Positive and Negative regions labeled]
Sentiment Classification
 SVM Classifier
1. Training set: 150 positive, 150 negative (Twitter data)
2. Test set: 545 positive, 545 negative (Movie review data)
Accuracy = 70.64220183486239% (770/1090) (classification)
Mean squared error = 1.1743119266055047 (regression)
Squared correlation coefficient = 0.18400994471523438 (regression)
 Naïve Bayes Classifier
 SO-PMI Classifier
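The slides give no details for the Naïve Bayes candidate, so the following is only a generic sketch of a multinomial Naïve Bayes classifier with add-one smoothing, together with the accuracy calculation used for the reported SVM figure (770 correct out of 1090). The class name, the toy training tokens, and the decision to ignore unseen words are assumptions for illustration, not the authors' implementation.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Start from the log prior of the class.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                if token not in self.vocabulary:
                    continue  # assumption: ignore words never seen in training
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(predicted, actual):
    """Percentage of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


if __name__ == "__main__":
    # Toy, hand-made training data (hypothetical lemmas, not from the study).
    train_docs = [["좋", "재미있"], ["좋", "예쁘"], ["싫", "지루하"], ["싫", "나쁘"]]
    train_labels = ["positive", "positive", "negative", "negative"]
    model = NaiveBayes().fit(train_docs, train_labels)
    print(model.predict(["좋"]))
    print(100.0 * 770 / 1090)  # the accuracy reported for the SVM run
```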
Building Sentiment Dictionary
Unlabeled & labeled data set
Ternary classifier: SO-PMI, SVM, Naïve Bayes
Each classifier outputs a Positive set, a Negative set, and a Neutral set
Sentiment of Brand Index
Brand (keyword): Samsung Galaxy S2
Related nouns (attribute): Battery, LCD, Price, …
Correlated terms: Adjective, Verb, Noun, Adverb, … (e.g. good, nice)
Positive words (pword): Nice, pretty, lovely, …
Negative words (nword): Bad, terrible, …
Determining Objectivity: PMI(word, pword) + PMI(word, nword)
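A small worked example may help make the PMI expression concrete. The sketch below estimates PMI from co-occurrence counts using the standard definition, PMI(x, y) = log2(p(x & y) / (p(x) p(y))), and then forms the sum shown on the slide. The function names, the counts, and the handling of zero counts are assumptions; reading a sum near zero as "objective" is one possible interpretation of the slide, and the difference of the two scores (the usual SO-PMI polarity) is included only for comparison.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """PMI(x, y) = log2(p(x & y) / (p(x) * p(y))), estimated from counts."""
    if count_xy == 0 or count_x == 0 or count_y == 0:
        return 0.0  # assumption: treat unseen pairs as uninformative
    # Equivalent to (count_xy / total) / ((count_x / total) * (count_y / total)).
    ratio = (count_xy * total) / (count_x * count_y)
    return math.log2(ratio)


def subjectivity(pmi_pos, pmi_neg):
    """The sum from the slide: PMI(word, pword) + PMI(word, nword)."""
    return pmi_pos + pmi_neg


def polarity(pmi_pos, pmi_neg):
    """The usual SO-PMI difference, shown for comparison."""
    return pmi_pos - pmi_neg


if __name__ == "__main__":
    total = 1000  # hypothetical number of tweets
    # A word seen in 100 tweets; pword and nword each seen in 200 tweets.
    with_pos = pmi(count_xy=80, count_x=100, count_y=200, total=total)
    with_neg = pmi(count_xy=20, count_x=100, count_y=200, total=total)
    print(with_pos, with_neg)
    print(subjectivity(with_pos, with_neg), polarity(with_pos, with_neg))
```

With these made-up counts, the word co-occurs with the positive seed four times more often than chance would predict, and with the negative seed exactly as often as chance, so its PMI with the negative seed is zero.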
Scenario

Uploaded by SOYEON KIM
Editor's Notes

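  • #3: The launch and expansion of SNS (Social Network Service) -> the emergence of personal big data. The rise of data mining using big data. The emergence of a new discipline, twitterology: a coined term meaning "the study of Twitter", formed from the social network service (SNS) Twitter and the suffix -logy, which denotes a field of study. Twitter's real-time information is used in research in sociology, economics, medicine, linguistics, and other fields.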
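  • #5: Implemented the Streaming API (real time) and the REST API (420 requests per 15 minutes; requesting every 15 minutes returns 420 items) using the Twitter4J library. Only 1% of all data can be received (Seungwoo's presentation). Running from September 9, 9:35 pm, and still ongoing. On average 10,000 to 15,000 items per day. As of 2013.11.2, 1.22 million items have been accumulated.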
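  • #6: Tweets with 3 or fewer Korean characters are not accepted (all special characters are dropped, as are English and Japanese). Location information, imap, and similar information are removed. Only data within a 500 km radius of Seoul is accepted (tweets from all over the world come in; this is to receive only those from Korea).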
  • #7: 은전한닢 (Eunjeon Hannip) morpheme analyzer, integrated with Java on Linux
  • #10: 1. Training set - positive: results of a DB search for '좋' ("good"), 150 of them; negative: results of a DB search for '싫' ("dislike"), 150 of them. 2. Test set - positive: 545 movie reviews; negative: 545 movie reviews. Result when movie reviews that matched nothing in the dictionary were also included: optimization finished, #iter = 73; nu = 0.16326140616206591; obj = -32.23746306073249, rho = 0.11723225832508417; nSV = 61, nBSV = 38; Total nSV = 61; Accuracy = 70.64220183486239% (770/1090) (classification); Mean squared error = 1.1743119266055047 (regression); Squared correlation coefficient = 0.18400994471523438 (regression)
  • #12: p(word1 & word2) is the probability that word1 and word2 co-occur. The ratio of p(word1 & word2) to p(word1) p(word2) is a measure of the degree of statistical dependence between the words. The log of the ratio corresponds to a form of correlation.
  • #13: Scenario: a company that published a clarification article after negative press coverage