Case Studies
Yi-Shin Chen
Institute of Information Systems and Applications
Department of Computer Science
National Tsing Hua University
yishin@gmail.com
Many slides provided by Tan, Steinbach, Kumar for book “Introduction to Data Mining” are adapted in this presentation
Case: Mining Reddit Data
Cases without a pre-specified target
2
Reddit Data
https://drive.google.com/open?id=0BwpI8947eCyuRFVDLU4tT25JbFE
3
Reddit: The Front Page of the Internet
50k+ on this set
Subreddit Categories
▷Reddit’s structure may already provide a baseline similarity
Provided Data
Recover Structure
Data Exploration
8
9
Data Mining Final Project
Role-Playing Games Sales Prediction
Group 6
Dataset
• Use the Reddit comment dataset
• Data Selection
– Choose 30 role-playing games from the subreddits.
– Choose useful attributes from their comments:
• Body
• Score
• Subreddit
11
12
Preprocessing
• Common Game Features
– Drop trivial words
• We manually built a dictionary of trivial words.
– Ex: “is”, “the”, “I”, “you”, “we”, “a”.
– TF-IDF
• We use TF-IDF to calculate the importance of each word (see the sketch below).
13
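A minimal sketch of this TF-IDF step with scikit-learn; the comment strings, and the use of scikit-learn's built-in English stop-word list instead of the manual trivial-word dictionary, are illustrative assumptions.

```python
# Minimal TF-IDF sketch (scikit-learn); "comments" stands in for the real comment bodies.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

comments = [
    "The combat system is great but the story is weak",
    "I love the soundtrack and the graphics in this game",
]

# stop_words="english" removes trivial words ("is", "the", "I", ...) in place of a manual list
vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
tfidf = vectorizer.fit_transform(comments)          # sparse matrix: comments x vocabulary

# Average TF-IDF weight of each word across comments, as a rough importance score
importance = np.asarray(tfidf.mean(axis=0)).ravel()
for word, score in sorted(zip(vectorizer.get_feature_names_out(), importance),
                          key=lambda x: -x[1])[:5]:
    print(f"{word}: {score:.3f}")
```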
Preprocessing
– TF-IDF result
• The TF-IDF values were too small to distinguish important words clearly.
– Define Comment Features
• We manually defined common gaming words in five categories by referring to several game websites:
Game-Play | Criticize | Graphic | Music | Story
14
Preprocessing
• Filtering useful comments
– Common keywords of the 5 categories
– Use the frequent keywords to find the useful comments in each category.
15
Preprocessing
– Using FP-Growth to find other features (see the sketch below)
• Find other frequent keyword sets, e.g.:
{time, play, people, gt, limited}
{time, up, good, now, damage, gear, way, need, build, better, d3, something, right, being, gt, limited}
• Then manually choose some of the frequent keywords.
16
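A minimal sketch of this FP-Growth step, assuming the mlxtend implementation (the slides do not say which one was used) and a few made-up keyword transactions.

```python
# Hedged sketch: mining frequent keyword sets with FP-Growth via mlxtend.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# Each transaction is the set of non-trivial words in one comment -- illustrative data.
transactions = [
    ["time", "play", "people", "gt", "limited"],
    ["time", "good", "damage", "gear", "build", "d3", "gt", "limited"],
    ["time", "play", "gear", "build"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Frequent itemsets appearing in at least 50% of the comments
frequent = fpgrowth(onehot, min_support=0.5, use_colnames=True)
print(frequent.sort_values("support", ascending=False))
```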
Preprocessing
• Comment Emotion
– Filtering outliers
• First, filter out comments whose “score” falls in the top or bottom 2.5%, since these are likely outliers.
17
Preprocessing
– Emotion detection
• We use the LIWC dictionary to calculate each comment’s positive and negative emotion percentage.
Ex: “I like this game.” -> positive
Ex: “I hate this character.” -> negative
• Then compute each category’s positive and negative emotion scores:
– $Score_{each\_category\_P} = \sum_{j=0}^{n} score\_each\_comment_j \times (\text{positive words})$
– $Score_{each\_category\_N} = \sum_{j=0}^{n} score\_each\_comment_j \times (\text{negative words})$
– $TotalScore_{each\_category} = \sum_{i=0}^{m} score\_each\_comment_i$
– $FinalScore_{each\_category\_P} = Score_{each\_category\_P} \,/\, TotalScore_{each\_category}$
– $FinalScore_{each\_category\_N} = Score_{each\_category\_N} \,/\, TotalScore_{each\_category}$
18
Preprocessing
– Emotion detection (cont.)
• We use the LIWC dictionary to calculate each comment’s positive and negative emotion percentage.
Ex: “I like this game.” -> positive
Ex: “I hate this character.” -> negative
• Then compute each category’s positive and negative emotion scores.
19
In other words:
Find each category’s emotion score for each game,
calculate the TotalScore over each category’s comments,
and obtain the average emotion score (FinalScore) of each category (see the sketch below).
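A small sketch of these score formulas. LIWC is a licensed dictionary, so a tiny hand-made positive/negative word list stands in for it here, and the comments and their scores are made up.

```python
# Sketch of the per-category emotion scores described above (LIWC replaced by toy word lists).
POSITIVE = {"like", "love", "great", "fun"}
NEGATIVE = {"hate", "boring", "bad", "broken"}

comments = [  # (reddit score, text) for one category of one game -- illustrative
    (12, "I like this game, the combat is great"),
    (3,  "I hate this character, the quests are boring"),
]

score_p = score_n = total = 0.0
for score, text in comments:
    words = text.lower().split()
    score_p += score * sum(w in POSITIVE for w in words)   # Score_each_category_P
    score_n += score * sum(w in NEGATIVE for w in words)   # Score_each_category_N
    total   += score                                       # TotalScore_each_category

final_p = score_p / total   # FinalScore_each_category_P
final_n = score_n / total   # FinalScore_each_category_N
print(final_p, final_n)
```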
Preprocessing
• Sales Data Extraction
– Crawl sales data from websites
– Find each game’s sales on each platform
20
Preprocessing
– Annual sales for each game
– Median of each platform
– Mean of all platforms
Median: 0.17 -> 30 games (26 H, 4 L)
Mean: 0.533 -> 30 games (18 H, 12 L)
– Try both the median and the mean as class boundaries for the prediction.
21
Sales Prediction
• Model Construction
– We use the 4 outputs from the preprocessing step as the inputs of a Naïve Bayes classifier.
22
Evaluation
We use Naïve Bayes to evaluate our results in two settings (see the sketch below):
1. a 70% training / 30% test split
2. leave-one-out cross-validation
23
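A hedged sketch of both evaluation settings with scikit-learn's GaussianNB; the feature matrix and labels are random placeholders for the real per-game scores and H/L sales classes.

```python
# Sketch: 70/30 split and leave-one-out evaluation of a Gaussian Naive Bayes classifier.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split, LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.random((30, 4))            # 30 games x 4 preprocessing outputs (dummy values)
y = rng.integers(0, 2, size=30)    # H/L sales labels (dummy values)

# 1. 70% training / 30% test
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = GaussianNB().fit(X_tr, y_tr)
print("70/30 accuracy:", clf.score(X_te, y_te))

# 2. Leave-one-out: train on 29 games, test on the remaining one, 30 times
scores = cross_val_score(GaussianNB(), X, y, cv=LeaveOneOut())
print("LOO accuracy:", scores.mean())
```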
Evaluation (train 70% / test 30%)
[Bar charts of accuracy]
Median 0.17: Test1 Total_score 0.67, Test1 No Total_score 0.78, Test2 Total_score 0.67, Test2 No Total_score 0.56
Mean 0.533: Test1 Total_score 0.78, Test1 No Total_score 0.44, Test2 Total_score 0.56, Test2 No Total_score 0.22
24
Evaluation (Leave-one-out)
Use 29 games as training data and the remaining game as test data; repeat 30 times in total to compute the accuracy.
Setting (30 runs total) | Errors | Accuracy
Median, Total_score | 7 wrong | 77%
Median, No Total_score | 5 wrong | 83%
Mean, Total_score | 6 wrong | 80%
Mean, No Total_score | 29 wrong | 3%
25
Attributes: Original Feature Scores
[Plot: attribute distribution projected onto the target (sales) domain, with the sales boundary between the H and L classes]
Class boundary: Median 0.17M / Mean 0.53M (sales classes: H, L)
Evaluation: Leave-one-out (total sample size: 30)
Boundary | Error times | Accuracy
Median | 5 | 83%
Mean | 29 | 3%
26
Attributes: Transformed Feature Scores
[Plot: attribute distribution projected onto the target (sales) domain, with the sales boundary between the H and L classes]
Class boundary: Median 0.17M / Mean 0.53M (sales classes: H, L)
Evaluation: Leave-one-out (total sample size: 30)
Boundary | Error times | Accuracy
Median | 7 | 77%
Mean | 6 | 80%
27
Finding related subreddits based on the gamers’ participation
Data Mining Final Project
Group#4
Goal & Hypothesis
- To suggest new subreddits to Gaming subreddit users based on the participation of other users
29
Data Exploration
We extracted the top 20 gaming-related subreddits to focus on.
30
leagueoflegends
DestinyTheGame
DotA2
GlobalOffensive
Gaming
GlobalOffensiveTrade
witcher
hearthstone
Games
2007scape
smashbros
wow
Smite
heroesofthestorm
EliteDangerous
FIFA
Guildwars2
tf2
summonerswar
runescape
31
Data Exploration (Cont.)
We sorted the subreddits by the users’ comment activity.
The most active subreddit is leagueoflegends.
32
Data Exploration (Cont.)
Data Pre-processing
- Removed comments from moderators and bots (>2,000 comments)
- Removed comments from default subreddits
- Extracted all users who commented in at least 3 subreddits
- Transformed the data into “market basket” transactions: 159,475 users (see the sketch below)
33
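A hedged sketch of the market-basket transformation with pandas; the column names and the tiny comment table are illustrative assumptions.

```python
# Sketch: one "basket" per user containing the set of subreddits they commented in.
import pandas as pd

comments = pd.DataFrame({
    "author":    ["u1", "u1", "u1", "u2", "u2", "u2", "u2"],
    "subreddit": ["DotA2", "atheism", "Neverwinter", "FIFA", "soccer", "nfl", "FIFA"],
})

# Keep users who commented in at least 3 distinct subreddits, then build their basket
counts = comments.groupby("author")["subreddit"].nunique()
active = counts[counts >= 3].index
baskets = (comments[comments["author"].isin(active)]
           .groupby("author")["subreddit"]
           .apply(lambda s: sorted(set(s))))
print(baskets)
```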
34
Processing
Sorted the items in each transaction by frequency (sort.js)
E.g., DotA2, Neverwinter, atheism → DotA2, atheism, Neverwinter
35
Processing
- Choosing the minimum support from 159,475 transactions (see the check below)
- Max possible support: 25.71% (at least 41,004 transactions)
- Min possible support: ~0.00063% (at least 1 transaction)
A min_support of 0.05% = 79.73 transactions
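For reference, a quick check of how these relative supports map to absolute transaction counts; the values match the numbers quoted on the surrounding slides.

```python
# Converting a relative minimum support into an absolute number of transactions.
total_transactions = 159_475
for min_support in (0.05, 1.0, 5.0, 0.8):    # percentages used on the slides
    count = total_transactions * min_support / 100
    print(f"{min_support}% -> at least {count:.0f} transactions")
```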
36
Processing
- What an FP-Growth tree looks like
37
Processing
- Our FP-Growth result (minimum support = 1%, i.e., at least 1,595 transactions)
38
Processing
- Our FP-Growth result (minimum support = 5%, i.e., at least 7,974 transactions)
39
Processing
[Chart: Number of Generated Transactions vs. minimum support]
40
Processing
[Chart: Number of Generated Transactions vs. minimum support (cont.)]
41
Processing
Our minimum support = 0.8% (at least 1,276 transactions)
Example rule: Games -> politics, Support: 1.30% (2,073 users), Confidence: 8.30%
We classified the rules into 4 groups based on their confidence:
Post-processing
42
Group #1: 1% - 20% | Group #2: 21% - 40% | Group #3: 41% - 60% | Group #4: 61% - 100%
- {Antecedent} → {Consequent}
e.g. {FIFA, Soccer} → {NFL}
- Lift ratio = confidence / benchmark confidence (see the worked example below)
- Benchmark confidence = (# transactions containing the consequent) / (total # transactions in the data set)
Post-processing
43
A lift ratio of 1.00 means the rule performs no better than chance.
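To make the lift computation concrete, here is a worked example for the Games -> politics rule above; the confidence and support come from the slide, while the number of transactions containing politics is a made-up placeholder (it is not given in the slides).

```python
# Worked lift example for Games -> politics (consequent support is an ASSUMED placeholder).
total_transactions = 159_475
rule_support_users = 2_073        # users with both Games and politics (1.30%, from the slide)
confidence = 0.083                # P(politics | Games), from the slide

politics_users = 6_000            # ASSUMED count of transactions containing politics
benchmark_confidence = politics_users / total_transactions

lift = confidence / benchmark_confidence
print(f"benchmark confidence = {benchmark_confidence:.4f}, lift = {lift:.2f}")
# lift > 1.00 would mean Games users mention politics more often than the average user.
```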
Post-processing
44
We created 8 surveys covering 64 rules (2 rules per group per survey) and 2 questions per rule.
Post-processing
Q1: Do you think subreddit A and subreddit B are related? [ yes | no ]  (measures dependency)
Q2: If you are a subscriber of subreddit A, would you also be interested in subreddit B?
Definitely No [ 1 , 2 , 3 , 4 , 5 ] Definitely Yes  (measures expected confidence)
45
●We received responses from 52 people
●Data confidence vs. expected confidence → lift over expected
Post-processing
46
●% Non-related is the proportion of “No” answers to the first question over all responses.
●Interestingness = (average expected confidence) * (% non-related)  (see the sketch below)
Post-processing
47
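A small sketch of this interestingness calculation for a single rule; the survey answers are made up, and normalizing the 1-5 ratings into a 0-1 expected confidence is an assumption not stated on the slides.

```python
# Sketch: interestingness of one rule from its survey responses.
q1_answers = ["no", "yes", "no", "no", "yes"]   # Q1: relatedness judgments
q2_answers = [2, 4, 1, 3, 5]                    # Q2: expected-interest ratings (1-5)

pct_non_related = q1_answers.count("no") / len(q1_answers)
avg_expected_confidence = sum(q2_answers) / len(q2_answers) / 5.0   # normalized to 0-1

interestingness = avg_expected_confidence * pct_non_related
print(f"interestingness = {interestingness:.2f}")
```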
Post-processing
48
Interestingness value per Group
Results
49
●How can we suggest new subreddits, then?
Group #1 (1% - 20%): less confidence, high interestingness
Group #4 (61% - 100%): high confidence, less interestingness
Results
50
●Suggest them as:
Group #1 (1% - 20%): “Maybe you could be interested in…”
Group #4 (61% - 100%): “Other people also talk about…”
Results
51
●Results from Group #4 (9 rules)
Results
52
●Results from Group #1 (173 rules)
Results
53
Group #4
Group #1
Challenges
- Reducing the scope of the data to decrease computational time
- Defining and calculating the interestingness value
- Deciding how to present the discovered rules to Reddit users
54
Conclusion
- We cannot be sure that our recommendation system will be 100% useful to users, since interestingness can vary depending on the purpose of the experiment
- To get more accurate results, we would need to survey all generated association rules, i.e., more than 300 rules
55
References
- Liu, B. (2000). Analyzing the subjective interestingness of association rules. IEEE Intelligent Systems, 15(5), 47-55. doi:10.1109/5254.889106
- Calculating Lift, How We Make Smart Online Product Recommendations
(https://www.youtube.com/watch?v=DeZFe1LerAQ)
- Reddit website (https://www.reddit.com/)
56
Application of Data Mining Techniques on Active Users in Reddit
57
Data Mining Process: Raw data → Preprocessing → Clustering & Evaluation → Knowledge
58
Facts - Why Active Users?
** http://www.pewinternet.org/2013/07/03/6-of-online-adults-are-reddit-users/
59
Facts - Why Active Users?
** http://www.snoosecret.com/statistics-about-reddit.html
60
Active Users Definitions
We define active users as people who:
1. posted or commented in at least 5 subreddits
2. have at least 5 posts or comments in each of those subreddits
3. have a number of comments above Q3 (the third quartile)
4. have an average score above Q3, among users satisfying the three criteria above
61
Preprocessing
Total posts in May2015: 54,504,410
Total distinct authors in May2015: 2,611,449
After deleting bots, [deleted] authors, non-English posts, URL-only posts, and posts with length < 3 (see the sketch below),
we got 46,936,791 rows and 2,514,642 distinct authors.
Finally, applying our active-user definitions, we extracted 25,298 “active users” (0.97%)
and 5,007,845 posts (9.18%) made by those users.
62
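A hedged sketch of this comment-level cleaning step with pandas; the file name, column names, and bot list are assumptions, and language detection is omitted for brevity.

```python
# Sketch: filtering the May 2015 comment table (column names and bot list are assumed).
import pandas as pd

df = pd.read_csv("reddit_comments_2015_05.csv")       # assumed file name

BOTS = {"AutoModerator", "autotldr"}                  # illustrative bot accounts
mask = (
    ~df["author"].isin(BOTS)
    & (df["author"] != "[deleted]")
    & ~df["body"].str.fullmatch(r"https?://\S+", na=False)   # drop URL-only posts
    & (df["body"].str.len() >= 3)                             # drop very short posts
)
clean = df[mask]
print(len(clean), clean["author"].nunique())
```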
Clustering: K-means, K-prototype (see the sketch below)
Number of clusters K = 10, based on the rule of thumb k = √(n/2) applied twice: k = √(√(n/2)) ≈ 10 for n = 25,298
Implemented with the Python sklearn module (KMeans()) and an open-source KPrototypes() implementation
63
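A minimal sketch of the clustering setup; the feature matrix is random placeholder data, and the kmodes package is assumed for KPrototypes() (the slides only say "open source").

```python
# Sketch: K-means on numeric attributes, K-prototypes on mixed numeric/nominal attributes.
import numpy as np
from sklearn.cluster import KMeans
from kmodes.kprototypes import KPrototypes

rng = np.random.default_rng(0)
n = 1_000                                    # placeholder (25,298 active users in the project)
X_num = rng.random((n, 2))                   # numeric attributes, e.g. activity, assurance
X_cat = rng.integers(0, 20, size=(n, 1))     # nominal attribute, e.g. dominant subreddit id

k = 10                                       # from k = sqrt(sqrt(n/2)) with n = 25,298
kmeans_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_num)

X_mixed = np.hstack([X_num, X_cat.astype(float)])
kproto_labels = KPrototypes(n_clusters=k, n_init=2).fit_predict(X_mixed, categorical=[2])
print(np.bincount(kmeans_labels), np.bincount(kproto_labels))
```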
Attributes
Example for one author (counts taken from the per-subreddit table on the slide):
author = 1
subreddit = C (the subreddit with the highest comment frequency: 27 > others)
activity = 27 / (11 + 6 + 27 + 17 + 3) = 0.42
assurance = 3 / 1 = 3
64
Clustering: K-means, K-prototype
[Attribute table: ratio attributes]
65
Clustering: K-means, K-prototype
[Attribute table: nominal attributes]
66
Evaluation
Separate the data into 3 parts: A, B, C.
Run K-means and K-prototype on A∪B and on A∪C.
Compare the labels of A obtained from the A∪B and A∪C clustering results (see the sketch below).
Measurements:
Adjusted Rand index: measures the similarity between two label lists (e.g., AB-A & AC-A)
Homogeneity: each cluster contains only members of a single class
Completeness: all members of a given class are assigned to the same cluster
V-measure: harmonic mean of homogeneity and completeness
[Diagram: data split into A, B, C]
67
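A short sketch of these stability measurements with scikit-learn metrics; the two label lists below are placeholders for the labels that partition A received in the A∪B and A∪C runs.

```python
# Sketch: comparing the cluster labels assigned to part A by two different runs.
from sklearn.metrics import (adjusted_rand_score, homogeneity_score,
                             completeness_score, v_measure_score)

labels_A_from_AB = [0, 0, 1, 1, 2, 2, 0, 1]     # cluster ids of A's rows in the A∪B run
labels_A_from_AC = [1, 1, 0, 0, 2, 2, 1, 0]     # cluster ids of A's rows in the A∪C run

print("Adjusted Rand index:", adjusted_rand_score(labels_A_from_AB, labels_A_from_AC))
print("Homogeneity:        ", homogeneity_score(labels_A_from_AB, labels_A_from_AC))
print("Completeness:       ", completeness_score(labels_A_from_AB, labels_A_from_AC))
print("V-measure:          ", v_measure_score(labels_A_from_AB, labels_A_from_AC))
```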
Evaluation: K-means vs. K-prototype
K-means K-prototype
Adjusted Rand index 0.510904 0.193399
Homogeneity 0.803911 0.326116
Completeness 0.681669 0.298049
V-measure 0.737761 0.311451
68
VISUALIZATION
Visualization Part
[Screenshots of visualization examples built with the following tools]
RapidMiner
Weka
Orange
Kibana