User Behaviour Modeling on Financial Message Boards
Pritha DN and Sahaj Biyani
Abstract
Online social communities such as discussion boards and message boards are rapidly evolving in
how they are used, bringing people with similar interests together. From a social and anthropological
standpoint, they are more interesting to study than Online Social Networks because
they connect people (most often with no offline links) from different backgrounds and histories.
Various theories exist in sociology about the expected behavior of users in online forums. In this
paper, we study the applicability of one such theory, “Participation Inequality”, to financial
message boards. We consider each user’s activity, network interaction structure, and the
content of their postings, and employ Machine Learning techniques to identify, cluster and infer roles
of users exhibiting similar behavior.
1 Introduction
1.1 Participation inequality
In Internet culture, the 1% rule is a rule of thumb pertaining to participation of users in an Internet
community, stating that only 1% of the users of a website actively create new content, while the
other 99% of the participants only lurk. Variants include the 1-9-90 rule (sometimes 90–9–1 principle
or the 89:10:1 ratio), which states that in a collaborative website such as Wikipedia, 90% of the
participants of a community only view content, 9% of the participants edit content, and 1% of the
participants actively create new content. A related observation is that 1% of users generate the
majority of revenue in free-to-play games. Similar rules are known in information science, such as
the 80/20 rule known as the Pareto principle, that 20 percent of a group will produce 80 percent of
the activity, however the activity may be defined.
1.2 User behaviour in online communities
Online communities form a fundamental part of the web today where a large portion of the Internet’s
traffic is driven by and through them. These communities are where the majority of web users share
content, seek support, and socialize. Of particular relevance is identifying the behavioral patterns or
roles that emerge from the community (e.g. experts, leaders, ignored users, etc.) as well as assessing
the distribution of users assuming different roles. There is currently no standard or agreed list of
behavior types for describing activities of users in online communities. In this paper, we consider each
user’s activity, network interaction structure, and the content of their postings, and employ Machine
Learning techniques to identify, cluster and infer roles of users exhibiting similar behavior.
2 Motivation and Related Work
Earlier work on user behavior modeling on financial message boards has primarily dealt with predicting
the market status of a stock. In this paper, we aim to study user behavior on financial message
boards from a sociological point of view. Previous research [1] demonstrated Participation Inequality
by studying users of four Digital Health Social Networks (DHSNs): the AlcoholHelpCenter,
DepressionCenter, PanicCenter, and StopSmokingCenter sites. It would be interesting to analyze whether
a similar pattern exists in communities dealing with a different domain. Hence we analyze user
behavior on financial message boards, identify, cluster, and label roles, and study the interactions among
the groups. From a Machine Learning perspective, this is an unsupervised clustering problem.
3 Dataset
3.1 Investors Hub and data retrieval
Investors Hub (“iHub”) is an online forum for investors to gather and share market insights in a
dynamic environment using a discussion platform. Investors Hub has been online for over 15 years
and currently has 549,380 members who have posted 118,971,723 messages on 25,091 boards. It hosts
message boards broadly on the topics of the stock market, commodities, and foreign exchange. It offers
both premium (paid) and free forums. We focus on the Stock Market Message Boards, which cover
US-listed and Canadian-listed stocks, and analyze the free message boards for US-listed stocks.
We crawled this data from the website, gathering information
from a total of 6,278 boards and 53,491 users with 5.6 million posts.
Figure 1: Volume of posts across boards
Figure 2: User activity across message boards
Figure 3: Activity of users on exact same boards
Figure 4: Number of posts initiated by each user
Figure 5: Number of replies made by each user
Figure 6: Number of replies each user receives
Figure 7: Number of replies across Message Boards
Figure 8: Average User response time
3.2 Initial data analysis
In this section, we present the results of a statistical analysis of the gathered data. We find
that many message boards are inactive and hardly record any conversation: of all the boards,
only about 1,600 have more than 100 messages, and these boards account for
97% of active users (those who have commented at least once on any board). Figure 1 depicts the volume
of posts on each message board. We then analyze the activity of users across all message boards.
We find that 57% of the users are active on only one message board, and 53.1% of users are active on
exactly the same message boards. Figures 2 and 3 show the activity of users across all message
boards. We further analyze the activity of users: the number of posts they initiate and the number
of replies they make. We find that 19.3% of the users have never initiated a post and 21.7% of the
users have never replied to another user’s post. Figures 4 and 5 show the distributions of posts
initiated and replies made by each user. We also observe that 33.6% of the users do not get any
replies to the posts they initiate. Figure 6 shows the distribution of the number of replies each user
receives. We also find that 18.4% of the message boards do not receive any reply posts. Figure 7
shows the distribution of replies across all boards. We also analyze the average time a user
takes to reply to another user’s post. Figure 8 shows the distribution of average response times of
all users.
 #   Feature                    Description
 1   Initiation Rate            Number of threads a user initiated over time
 2   Reply Rate                 Number of replies a user makes over time
 3   Out-Degree                 Number of users a user replies to
 4   In-Degree                  Number of users who reply to a user
 5   Activity Across Boards     Number of boards the user is active on / total number of boards
 6   Followers                  Number of followers
 7   Replier Share              Average proportion of replies a user receives on a board
 8   Reply Share                Average proportion of replies a user makes on a board
 9   Response Time              Average time to respond to a reply
10   Volume of Posted Content   Average length of post
11   Links in Posts             Number of links the user has posted
Table 1: Features
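The summary percentages reported in Section 3.2 could be computed directly from per-post records. The following is a minimal sketch, assuming each crawled post has been reduced to a hypothetical `(user, board, is_reply)` tuple; the record layout, the function name, and the sample data are illustrative and are not taken from the crawler used for this study.

```python
from collections import defaultdict

def activity_summary(posts):
    """Summarize user activity from (user, board, is_reply) records.

    The record layout is an assumption made for illustration.
    """
    initiated = defaultdict(int)   # threads started per user
    replied = defaultdict(int)     # replies written per user
    boards = defaultdict(set)      # boards each user has posted on

    for user, board, is_reply in posts:
        boards[user].add(board)
        if is_reply:
            replied[user] += 1
        else:
            initiated[user] += 1

    users = list(boards)
    n = len(users)
    return {
        "never_initiated": sum(initiated[u] == 0 for u in users) / n,
        "never_replied": sum(replied[u] == 0 for u in users) / n,
        "single_board": sum(len(boards[u]) == 1 for u in users) / n,
    }

# Toy data, not drawn from the iHub crawl.
sample = [
    ("alice", "AAPL", False),
    ("bob", "AAPL", True),
    ("bob", "TSLA", True),
    ("carol", "TSLA", False),
    ("carol", "TSLA", True),
    ("dave", "AAPL", True),
]
summary = activity_summary(sample)
print(summary)
```

On real data the same counters would be filled in a single pass over the crawl, which matters with several million posts.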
4 Behavior Modeling and Approach
4.1 Feature selection
Drawing on the existing literature on user roles in social
networks [3], we consider the activity of the user, the egocentric network structure of the user and
the content of the user’s postings as features for our clustering task. Features 3 and 4 represent the
“replies to” (out-degree) and “replied to by” (in-degree) network structure. Features 10 and 11 represent
the content or quality of a user’s posts. The remaining features represent the user’s activity on the
message boards. Not all of the features listed in Table 1 are of equal importance in the clustering task.
Features can be redundant and inter-correlated, which could lead to erroneous clustering results. Due to the
absence of ground-truth data, there is no reference against which to identify the most significant features.
We therefore employ Principal Component Analysis (PCA) for this task.
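As an illustration of how the network features (Features 3 and 4 in Table 1) might be derived, the sketch below computes out-degree and in-degree from reply edges. It assumes a hypothetical list of `(replier, author)` pairs; that edge format and the decision to ignore self-replies are assumptions of this sketch, not details stated in the paper.

```python
from collections import defaultdict

def degree_features(reply_edges):
    """Compute out-degree and in-degree per user from (replier, author) pairs.

    Out-degree: number of distinct users a user replies to.
    In-degree: number of distinct users who reply to a user.
    """
    replies_to = defaultdict(set)
    replied_by = defaultdict(set)
    for replier, author in reply_edges:
        if replier == author:
            continue  # ignore self-replies (an assumption of this sketch)
        replies_to[replier].add(author)
        replied_by[author].add(replier)

    users = set(replies_to) | set(replied_by)
    return {
        u: {"out_degree": len(replies_to[u]), "in_degree": len(replied_by[u])}
        for u in users
    }

# Toy edges: bob replies to alice twice, but alice is counted once in his out-degree.
edges = [("bob", "alice"), ("carol", "alice"), ("bob", "alice"), ("alice", "bob")]
features = degree_features(edges)
print(features)
```

Using sets rather than counters keeps the degrees as counts of distinct users, which matches the wording of Table 1.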
4.2 Min-Max Normalization
To operate on data originating from different features that fall in different ranges, it is
important to normalize all feature values into a fixed range. Since we use Euclidean distance as the
distance metric, we normalize the data using Min-Max Normalization to fit the range [0, 1].
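Min-Max Normalization maps each value x of a feature to (x − min) / (max − min), where min and max are taken over that feature. A minimal sketch follows; the function name and the handling of constant columns are illustrative choices rather than details given in the paper.

```python
def min_max_normalize(rows):
    """Rescale each column of a list of equal-length rows to [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    normalized = []
    for row in rows:
        normalized.append([
            (x - lo) / (hi - lo) if hi > lo else 0.0  # a constant column maps to 0
            for x, lo, hi in zip(row, lows, highs)
        ])
    return normalized

# Two toy features on very different scales.
raw = [[2, 100.0], [4, 300.0], [10, 200.0]]
scaled = min_max_normalize(raw)
print(scaled)
```

Without this step the second feature would dominate any Euclidean distance simply because its raw values are larger.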
4.3 Principal Component Analysis
PCA is a data reduction technique in which possibly correlated features are transformed into a
smaller number of factors called principal components. We use the scree plot to determine the
number of principal components to consider. The scree plot graphs the eigenvalue against the
component number. To determine the appropriate number of components, we look for an “elbow”
in the scree plot. Figure 9 shows the Scree Plot we obtained. The component number is taken to
be the point at which the remaining eigenvalues are relatively small and all about the same size.
For our dataset, we find that the first 5 components account for 95% of variance in the data. Thus
we build a new set of features from the original feature set using the first 5 principal components.
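The cumulative-variance criterion can be sketched as follows: given the eigenvalues of the covariance matrix, find the smallest number of leading components whose share of the total variance reaches a target. The eigenvalues below are hypothetical and are not the values behind Figure 9.

```python
def components_for_variance(eigenvalues, target=0.95):
    """Return the smallest number of leading components whose cumulative
    explained-variance ratio reaches `target`."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= target:
            return k
    return len(ordered)

# Hypothetical eigenvalues, NOT the ones from this dataset.
example_eigenvalues = [5.0, 2.5, 1.5, 0.6, 0.2, 0.1, 0.1]
k = components_for_variance(example_eigenvalues)
print(k)
```

With these made-up eigenvalues the first three components explain 90% of the variance and the first four explain 96%, so four components are needed to pass the 95% target.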
Figure 9: Scree Plot
Figure 10: Silhouette Coefficient Plot
Figure 11: Elbow Plot
Figure 12: Feature Importance
4.4 Unsupervised Clustering
4.4.1 Determining K in the K-means
One way to select K for the K-means algorithm is to try different values of K, plot the K-means
objective versus K, and look for the “elbow point” in the plot. By plotting the Within Group Sum
of Squares against the number of clusters, we can visually examine the best point to choose for
the number of clusters. The first few clusters add much information, but at some point the
marginal gain drops, giving an angle in the graph that indicates the number of clusters
we should aim for. We narrow the choice down to K = 4 or K = 5, as both look promising from the plot.
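The elbow procedure can be sketched with a plain implementation of K-means (Lloyd's algorithm) that reports the within-cluster sum of squares for each K. This is an illustrative pure-Python version run on toy 2-D points; the function names, the seeds and the data are assumptions, and it is not the implementation used for the results in this paper.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_wcss(points, k, seed=0, max_iter=100):
    """Run one K-means (Lloyd's algorithm) and return the final
    within-cluster sum of squares (WCSS)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct points as initial centroids
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
        if updated == centroids:
            break  # assignments can no longer change
        centroids = updated
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)

# Toy data with two obvious groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

# For each K, keep the best of a few seeded runs, then inspect how WCSS drops with K.
wcss_by_k = {
    k: min(kmeans_wcss(points, k, seed=s) for s in range(5))
    for k in range(1, 5)
}
for k, wcss in wcss_by_k.items():
    print(k, round(wcss, 3))
```

On this toy data the WCSS falls sharply from K = 1 to K = 2 and only slightly afterwards, which is the kind of angle the elbow plot is read for.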
The Silhouette Coefficient measures how close each point in one cluster is to points in the neighboring
clusters, and thus provides a way to assess parameters such as the number of clusters visually. This measure
has a range of [-1, 1]. Silhouette coefficients near +1 indicate that the sample is far away from
the neighboring clusters. A value of 0 indicates that the sample is on or very close to the decision
boundary between two neighboring clusters, and negative values indicate that those samples might
have been assigned to the wrong cluster. For each datum i, let a(i) be the average dissimilarity of
i with all other data within the same cluster. We can interpret a(i) as how well i is assigned to its
cluster (the smaller the value, the better the assignment). Let b(i) be the lowest average dissimilarity
of i to any other cluster of which i is not a member. The Silhouette Coefficient is given by:
\[ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \]
Based on our observations from the Elbow Plot and the Silhouette Coefficients, we choose K=4.
Figure 10 shows the Silhouette Coefficient Plot and Figure 11 shows the Elbow Plot for our
dataset.
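To make the silhouette definition concrete, the sketch below computes the coefficient for every point from its a(i) and b(i) values, with Euclidean distance as the dissimilarity. Here b(i) is taken to be the smallest mean distance from i to the members of any other cluster, which is the standard definition; the toy points, the labels and the convention of scoring singleton clusters as 0 are assumptions of this sketch.

```python
import math

def silhouette_scores(points, labels):
    """Return the silhouette coefficient s(i) of every point.

    a(i): mean distance from i to the other members of its own cluster.
    b(i): smallest mean distance from i to the members of any other cluster.
    Assumes at least two clusters are present.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return scores

toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
scores_good = silhouette_scores(toy_points, [0, 0, 1, 1])  # sensible grouping
scores_bad = silhouette_scores(toy_points, [0, 1, 0, 1])   # deliberately wrong grouping
print(scores_good)
print(scores_bad)
```

The sensible grouping yields scores close to +1, while the deliberately wrong grouping yields negative scores, matching the interpretation given above.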
4.4.2 Cluster validation
In this step, we use the four clusters obtained as the labeled ground-truth dataset to train a
Random Forest ensemble, and obtain the relative importance of each of the 11 features in the classification
task. We try Random Forests with 5, 10, 20, and 50 trees and obtain the maximum cross-validation
score with 10 trees. Figure 12 shows the obtained feature importance
values. We do not consider features with less than 1% importance, and are therefore left with only 9
features; the content volume and the average number of hyperlinks posted
are dropped. Next, we cluster using the selected important features with the K-means
algorithm and K=4. K-means depends on the initial centroid seeds that are chosen, and the
algorithm converges when the cluster assignments no longer change. We allow at most 300 iterations for a
single run. The algorithm aims to minimize the Within Cluster Sum of Squares and might not converge to the
global optimum, so we run the algorithm with 10 different centroid seeds and keep the best result.
Comparing the clusters now obtained with the clusters from PCA, the clusters formed are of similar size
and have 99.99% overlap among their members. The clustering we chose had a silhouette coefficient of
0.789, an average inter-cluster distance of 2.51, and an average intra-cluster distance of 1.01.
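The paper does not state how the overlap between the two clusterings was measured. One plausible way, sketched here as an assumption, is to relabel one clustering so that it agrees with the other as much as possible and then report the fraction of users placed in matching clusters. Trying every relabeling is cheap for K=4 (24 permutations). The label lists below are toy data.

```python
from itertools import permutations

def clustering_overlap(labels_a, labels_b):
    """Fraction of items placed in matching clusters by two clusterings,
    maximized over all relabelings of the second clustering."""
    ids_a = sorted(set(labels_a))
    ids_b = sorted(set(labels_b))
    if len(ids_a) != len(ids_b):
        raise ValueError("this sketch assumes both clusterings use the same K")
    best = 0
    for perm in permutations(ids_b):
        mapping = dict(zip(perm, ids_a))  # relabel clustering B into A's ids
        agree = sum(a == mapping[b] for a, b in zip(labels_a, labels_b))
        best = max(best, agree)
    return best / len(labels_a)

# Toy labels: the same grouping up to renaming, except for one user.
pca_labels     = [0, 0, 0, 1, 1, 2, 2, 3]
feature_labels = [2, 2, 2, 0, 0, 1, 3, 3]
overlap = clustering_overlap(pca_labels, feature_labels)
print(overlap)
```

The relabeling step is needed because K-means cluster ids are arbitrary, so the same group can receive a different number in each run.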
Cluster      Share of users
Cluster 1    91.73%
Cluster 2    6.44%
Cluster 3    1.13%
Cluster 4    0.73%
Table 2: User distribution in each cluster
Figure 13: Comparison of cluster composition and the content contribution
Figure 14: Interaction among users across clusters
Figure 15: Out-degree feature distribution across clusters
Figure 16: In-degree feature distribution across clusters
5 Methodology
In this section, we summarize the methodology of our implementation. First, considering three aspects
(the activity of a user, the user’s network interaction structure, and the content of postings), we extract 11
features. To extract the most significant of these features, we perform PCA and use the
first 5 Principal Components, as determined by the scree plot. We use the projected data from
this step as the set to be clustered. We construct 4 clusters using the K-means clustering technique;
K = 4 is selected considering the Silhouette Coefficients and the elbow plot. The aim is then to obtain
the important features from our set of 11 features. We use the Random Forest ensemble to get the
feature importance, using the labeled data from the previous step as input. Nine of the input features
are found to have relative importance greater than 1%, and they are subjected to K-means clustering,
using Euclidean distance as the distance metric, to obtain the final groups of users.
5.1 Role Inference
The final step is role inference. Table 2 shows the size of each cluster obtained. Three
levels (Low, Medium, and High) are determined for each of the features. Sociology attributes
different characteristics to different user roles. Figures 15 and 16 show the out-degree and
in-degree feature value distributions, respectively, in the four clusters. Among the four clusters obtained,
we find one whose behavior is marked by low initiation, low engagement and low activity, and assign the
label of “Lurkers” to this cluster. The second cluster exhibits medium initiation, medium reply share and
medium engagement; we label it as “Contributors”. There is an interesting third group characterized
by a high replier share and reply rate but a low post initiation value; we label this cluster as “Debaters”.
The last cluster is characterized by high initiation, high user engagement and high interaction; we
assign the label of “Super Users” to this cluster.
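The role assignment can be sketched as a two-step rule: bin each normalized feature of a cluster centroid into Low, Medium or High, then map the resulting pattern to a role. The cut points, the two features used, the feature names and the centroid values below are all hypothetical. The paper does not state its thresholds, and its labels draw on more features than this simplified sketch does.

```python
def level(value, low_cut=1/3, high_cut=2/3):
    """Map a normalized feature value in [0, 1] to Low / Medium / High.

    The cut points are hypothetical; the thresholds used in the paper are not given.
    """
    if value < low_cut:
        return "Low"
    if value < high_cut:
        return "Medium"
    return "High"

def infer_role(centroid):
    """Assign a role label from a centroid's initiation and reply levels."""
    initiation = level(centroid["initiation_rate"])
    reply = level(centroid["reply_rate"])
    if initiation == "Low" and reply == "Low":
        return "Lurkers"
    if initiation == "Low" and reply == "High":
        return "Debaters"
    if initiation == "High" and reply == "High":
        return "Super Users"
    return "Contributors"

# Hypothetical centroids in normalized feature space, not values from this study.
centroids = {
    1: {"initiation_rate": 0.05, "reply_rate": 0.04},
    2: {"initiation_rate": 0.50, "reply_rate": 0.45},
    3: {"initiation_rate": 0.10, "reply_rate": 0.90},
    4: {"initiation_rate": 0.90, "reply_rate": 0.95},
}
roles = {cluster: infer_role(c) for cluster, c in centroids.items()}
print(roles)
```

Keeping the binning separate from the role rules makes it easy to try different thresholds without touching the rule table.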
6 Results
We obtain four significant user roles from our dataset: Lurkers, Contributors, Debaters, and Super Users.
We find that 91.73% of the users in the dataset are Lurkers. This observation is in
accordance with the theory of participation inequality, and our observations thus support
the 1% rule and the 90–9–1 principle. We also make the interesting observation
that 72% of the posts in our dataset are made by the Contributors and Super Users, though they
constitute only 7.14% of our user base. Figure 13 illustrates the disparity between the number of users
in each cluster and the volume of content the users in that cluster generate. Figure 14 depicts the
interaction (via replies) of users across clusters. We also observe that the users
in Cluster 3, which we label as Debaters, interact mostly with users of the same cluster.
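The comparison between each role's share of users and its share of posts can be sketched as below. The mapping of users to roles and the post counts are toy data, chosen only to show the computation. They are not the figures reported in this section.

```python
from collections import Counter

def share_by_role(role_of_user, posts_by_user):
    """Return {role: (share of users, share of posts)}."""
    user_counts = Counter(role_of_user.values())
    post_counts = Counter()
    for user, n_posts in posts_by_user.items():
        post_counts[role_of_user[user]] += n_posts
    total_users = sum(user_counts.values())
    total_posts = sum(post_counts.values())
    return {
        role: (user_counts[role] / total_users, post_counts[role] / total_posts)
        for role in user_counts
    }

# Toy data: eight low-activity users, one contributor, one super user.
role_of_user = {f"u{i}": "Lurkers" for i in range(8)}
role_of_user["u8"] = "Contributors"
role_of_user["u9"] = "Super Users"
posts_by_user = {f"u{i}": 1 for i in range(8)}
posts_by_user["u8"] = 12
posts_by_user["u9"] = 20

shares = share_by_role(role_of_user, posts_by_user)
for role, (user_share, post_share) in shares.items():
    print(f"{role}: {user_share:.0%} of users, {post_share:.0%} of posts")
```

In this toy example the two active roles make up 20% of users but produce 80% of posts, the same kind of imbalance that Figure 13 illustrates for the real dataset.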
7 Conclusion
In this paper we have presented one approach to label behavior of users of online communities.
We presented a method to capture the behavioral characteristics of users as numeric attributes and
explained how Machine Learning can be employed to infer the role that a given user plays. A key
contribution of this paper is that we successfully substantiate and support the theory of Participation
Inequality on financial message boards.
References
[1] Van Mierlo, T. The 1% Rule in Four Digital Health Social Networks. J Med Internet Res,
16(2):e33, 2014.
[2] Matthew Rowe, Miriam Fernandez, and Harith Alani. ‘Modelling and analysis of user behaviour
in online communities’.
[3] Mathilde Forestier, Anna Stavrianou, Julien Velcin, and Djamel A. Zighed. ‘Roles in social
networks: methodologies and research issues’.
6

More Related Content

PDF
Online Social Netowrks- report
PDF
Scalable recommendation with social contextual information
PDF
Predicting_new_friendships_in_social_networks
PDF
20142014_20142015_20142115
PDF
2014 USA
PPTX
Data Mining In Social Networks Using K-Means Clustering Algorithm
PPTX
Master thesis presentation
PDF
An Improved PageRank Algorithm for Multilayer Networks
Online Social Netowrks- report
Scalable recommendation with social contextual information
Predicting_new_friendships_in_social_networks
20142014_20142015_20142115
2014 USA
Data Mining In Social Networks Using K-Means Clustering Algorithm
Master thesis presentation
An Improved PageRank Algorithm for Multilayer Networks

What's hot (20)

PDF
LCF: A Temporal Approach to Link Prediction in Dynamic Social Networks
PDF
Ijetcas14 347
PPTX
06 Regression with Networks – EGO Networks and Randomization (2017)
PDF
Immune-Inspired Method for Selecting the Optimal Solution in Semantic Web Ser...
PDF
Social Friend Overlying Communities Based on Social Network Context
PDF
Alluding Communities in Social Networking Websites using Enhanced Quasi-cliqu...
PPTX
PDF
IJSRED-V2I2P09
PPTX
Hybrid sentiment and network analysis of social opinion polarization icoict
DOCX
Final Report
PDF
Recommendation system based on association rules applied to consistent behavi...
PDF
IRJET- Privacy Preserving Friend Matching
PDF
Taxonomy and survey of community
PDF
A Proposal on Social Tagging Systems Using Tensor Reduction and Controlling R...
PDF
EffectiveCrowdSourcingForProductFeatureIdeation v18
PDF
Link prediction
PPTX
11 Keynote (2017)
PDF
Current trends of opinion mining and sentiment analysis in social networks
PDF
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
PPT
14 Dynamic Networks
LCF: A Temporal Approach to Link Prediction in Dynamic Social Networks
Ijetcas14 347
06 Regression with Networks – EGO Networks and Randomization (2017)
Immune-Inspired Method for Selecting the Optimal Solution in Semantic Web Ser...
Social Friend Overlying Communities Based on Social Network Context
Alluding Communities in Social Networking Websites using Enhanced Quasi-cliqu...
IJSRED-V2I2P09
Hybrid sentiment and network analysis of social opinion polarization icoict
Final Report
Recommendation system based on association rules applied to consistent behavi...
IRJET- Privacy Preserving Friend Matching
Taxonomy and survey of community
A Proposal on Social Tagging Systems Using Tensor Reduction and Controlling R...
EffectiveCrowdSourcingForProductFeatureIdeation v18
Link prediction
11 Keynote (2017)
Current trends of opinion mining and sentiment analysis in social networks
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
14 Dynamic Networks
Ad

Similar to User Behaviour Modeling on Financial Message Boards (20)

PPTX
User Behavior Modeling on Financial Message Boards
PDF
Social Media Analytics with a pinch of semantics
PDF
What makes communities tick? Community health analysis using role compositions
PPTX
Socialcom2011 discussionactivityprediction
PPTX
Decomposing discussion forums using user roles
PPTX
Ifip wg-galway-
PDF
a modified weight balanced algorithm for influential users community detectio...
PDF
From User Needs to Community Health: Mining User Behaviour to Analyse Online ...
PDF
Profile Analysis of Users in Data Analytics Domain
PDF
COMMUNITY DETECTION IN THE COLLABORATIVE WEB
PDF
FRIEND SUGGESTION SYSTEM FOR THE SOCIAL NETWORK BASED ON USER BEHAVIOR
PDF
Prediction Model Using Web Usage Mining Techniques
PDF
Fuzzy AndANN Based Mining Approach Testing For Social Network Analysis
PPTX
Iswc2011 role-composition-analysis
PPTX
Modelling and Analysis of User Behaviour in Online Communities
PDF
Making More Sense Out of Social Data
PDF
Community profiling for social networks
PPTX
Jeffrey xu yu large graph processing
PDF
From smart meters to smart behaviour
PDF
Multi-Mode Conceptual Clustering Algorithm Based Social Group Identification ...
User Behavior Modeling on Financial Message Boards
Social Media Analytics with a pinch of semantics
What makes communities tick? Community health analysis using role compositions
Socialcom2011 discussionactivityprediction
Decomposing discussion forums using user roles
Ifip wg-galway-
a modified weight balanced algorithm for influential users community detectio...
From User Needs to Community Health: Mining User Behaviour to Analyse Online ...
Profile Analysis of Users in Data Analytics Domain
COMMUNITY DETECTION IN THE COLLABORATIVE WEB
FRIEND SUGGESTION SYSTEM FOR THE SOCIAL NETWORK BASED ON USER BEHAVIOR
Prediction Model Using Web Usage Mining Techniques
Fuzzy AndANN Based Mining Approach Testing For Social Network Analysis
Iswc2011 role-composition-analysis
Modelling and Analysis of User Behaviour in Online Communities
Making More Sense Out of Social Data
Community profiling for social networks
Jeffrey xu yu large graph processing
From smart meters to smart behaviour
Multi-Mode Conceptual Clustering Algorithm Based Social Group Identification ...
Ad

Recently uploaded (20)

PPTX
IB Computer Science - Internal Assessment.pptx
PDF
.pdf is not working space design for the following data for the following dat...
PPTX
Introduction-to-Cloud-ComputingFinal.pptx
PDF
Mega Projects Data Mega Projects Data
PPTX
Business Ppt On Nestle.pptx huunnnhhgfvu
PPT
Reliability_Chapter_ presentation 1221.5784
PPTX
Qualitative Qantitative and Mixed Methods.pptx
PPTX
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
PPTX
Introduction to machine learning and Linear Models
PPTX
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
PDF
Clinical guidelines as a resource for EBP(1).pdf
PDF
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
PDF
Galatica Smart Energy Infrastructure Startup Pitch Deck
PPTX
climate analysis of Dhaka ,Banglades.pptx
PDF
Fluorescence-microscope_Botany_detailed content
PPTX
1_Introduction to advance data techniques.pptx
PPTX
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
PPT
ISS -ESG Data flows What is ESG and HowHow
PPTX
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
PPTX
oil_refinery_comprehensive_20250804084928 (1).pptx
IB Computer Science - Internal Assessment.pptx
.pdf is not working space design for the following data for the following dat...
Introduction-to-Cloud-ComputingFinal.pptx
Mega Projects Data Mega Projects Data
Business Ppt On Nestle.pptx huunnnhhgfvu
Reliability_Chapter_ presentation 1221.5784
Qualitative Qantitative and Mixed Methods.pptx
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
Introduction to machine learning and Linear Models
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
Clinical guidelines as a resource for EBP(1).pdf
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
Galatica Smart Energy Infrastructure Startup Pitch Deck
climate analysis of Dhaka ,Banglades.pptx
Fluorescence-microscope_Botany_detailed content
1_Introduction to advance data techniques.pptx
AI Strategy room jwfjksfksfjsjsjsjsjfsjfsj
ISS -ESG Data flows What is ESG and HowHow
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
oil_refinery_comprehensive_20250804084928 (1).pptx

User Behaviour Modeling on Financial Message Boards

  • 1. User Behaviour Modeling on Financial Message Boards Pritha DN and Sahaj Biyani Abstract Online social communities like discussion boards and message boards are fast evolving in their usage bringing people with similar interests together. From a social and anthropological standpoint, these are the most interesting to study compared to Online Social Networks because they connect people (most often with no offline links) from different backgrounds and histories. Various theories exist in sociology about the intended behavior of users in online forums. In this paper, we study the applicability of one such theory - “Participation Inequality” on financial message boards. We consider the activity of user, his network interaction structure and the content of postings and employ Machine Learning techniques to identify, cluster and infer roles of users exhibiting similar behavior. 1 Introduction 1.1 Participation inequality In Internet culture, the 1% rule is a rule of thumb pertaining to participation of users in an Internet community, stating that only 1% of the users of a website actively create new content, while the other 99% of the participants only lurk. Variants include the 1-9-90 rule (sometimes 90–9–1 principle or the 89:10:1 ratio), which states that in a collaborative website such as a Wikipedia, 90% of the participants of a community only view content, 9% of the participants edit content, and 1% of the participants actively create new content. A related observation is that 1% of users generate the majority of revenue in free-to-play games. Similar rules are known in information science, such as the 80/20 rule known as the Pareto principle, that 20 percent of a group will produce 80 percent of the activity, however the activity may be defined. 1.2 User behaviour in online communities Online communities form a fundamental part of the web today where a large portion of the Internet’s traffic is driven by and through them. 
These communities are where the majority of web users share content, seek support, and socialize. Of particular relevance is identifying the behavioral patterns or roles that emerge from the community (e.g. experts, leaders, ignored users, etc.) as well as assessing the distribution of users assuming different roles. There is currently no standard or agreed list of behavior types for describing activities of users in online communities. In this paper, we consider the activity of user, his network interaction structure and the content of postings, and employ Machine Learning techniques to identify, cluster and infer roles of users exhibiting similar behavior. 1
  • 2. 2 Motivation and Related Work Earlier work on user behavior modeling on financial message boards has primarily dealt with predict- ing the market status of a stock. In this paper, we aim to study the user behavior on financial message boards from a sociological point of view. Previous research in [1], demonstrated Participation In- equality by studying users on Digital Health Social Networks (DHSNs)- the AlcoholHelpCenter, DepressionCenter, PanicCenter, and StopSmokingCenter sites. It would be interesting to analyze if a similar pattern exists in communities dealing with a tangential domain. Hence we analyze user behaviors in financial message boards, identify, cluster, label roles and study the interactions among the groups. In the Machine Learning perspective, the problem is of unsupervised clustering. 3 Dataset 3.1 Investors Hub and data retrieval Investors Hub(”iHub”) is an online forum for investors to gather and share market insights in a dynamic environment using a discussion platform. Investors Hub has been online for over 15 years and currently has 549,380 members who have posted 118,971,723 messages on 25,091 Boards. It hosts message boards broadly on the topics of stock market, commodities and Foreign Exchange. It offers forums, which are premium, and paid as well as free. We particularly focus on the Stock Market Message Boards, which has message boards for US listed, Canadian listed stocks. We analyze the free US listed stocks message boards. We crawled this data from the website, gathering information from a total of 6278 boards and 53,491 users with 5.6 million posts. 
Figure 1: Volume of posts across boards Figure 2: User activity across message boards Figure 3: Activity of users on exact same boards Figure 4: Number of posts initiated by each user Figure 5: Number of replies made by each user Figure 6: Number of replies each user re- ceives Figure 7: Number of replies across Message Boards Figure 8: Average User response time 3.2 Initial data analysis In this section, we present the result of performing statistical analysis on the gathered data. We find that many message boards are inactive and hardly record any conversation. Out of all the boards we found that only about 1600 boards have more than 100 messages. And these boards account for 97% of active users (who have commented at least once on any board). Figure 1. Depicts volume of posts over each Message Board. We then analyze the activity of users across all Message Boards. We find that 57% of the users are active only on one message board. 53.1% of users are active on the same exact message boards. Figure 2. and Figure 3. show the activity of users across all message 2
  • 3. Initiation Rate Number of threads a user initiated over time Reply Rate Number of replies a user makes over time Out-Degree Number of users a user replies to In-Degree Number of users who reply to a user Activity Across Boards Number of boards he is active on / total number of boards Followers Number of followers Replier Share AVG[proportion of replies a user gets on a board] Reply Share AVG[proportion of reply a user makes on a board] Response Time Avg time to respond to a reply Volume of Posted Content Average length of post Links in Posts No of links he has posted Table 1: Features boards. We further analyze the activity of users – the number of posts they initiate and the number of replies they make. We find that 19.3% of the users have never initiated a post and 21.7% of the users have never replied to another’ post. Figure 4. and Figure 5. Indicate the distribution of posts initiated and replies made by each user. We also observed that 33.6% of the users do not get any replies on the posts the initiate. Figure 6. Shows the distribution of the number of replies each user receives. We also infer that 18.4% of the Message Boards do not receive and reply posts. Figure 7. Indicates the distribution of replies across all boards. We also analyzed the average time a user takes to reply to another user’s post. Figure 8. shows the distribution of average response times of all users. 4 Behavior Modeling and Approach 4.1 Feature selection Taking inferences from the existing literature and research on the study of user roles in Social Networks[3], we consider the activity of the user, the egocentric network structure of the user and the content of his postings as features for our clustering task. Features 3,4 are representative of the “Reply to” and “Replies to” network structure. Features 10, 11, represent the content or quality of a users post. The rest of the features represent the activity of user on the message boards. 
Not all of the features listed in Table 1. are of equal importance in the task clustering. Features can be redundant and may be inter-correlated. This could lead to erroneous clustering results. Due to the absence of ground-truth data, there is no metric to compare with for the most significant features. We employ Principal Component Analysis (PCA) for this task. 4.2 Min-Max Normalization In order to operate on data originating from different features which fall in different ranges, it is important to normalize all feature values into a fixed range. Since we use the distance metric - euclidean distance, we normalize data using Min-Max Normalization, to fit the range [0-1]. 4.3 Principal Component Analysis PCA is a data reduction technique in which possibly correlated features are transformed into a smaller number of factors called principal components. We use the scree plot to determine the number of principal components to consider. The scree plot graphs the eigenvalue against the component number. To determine the appropriate number of components, we look for an ”elbow” in the scree plot. Figure 9. shows the Scree Plot we obtained. The component number is taken to be the point at which the remaining eigenvalues are relatively small and all about the same size. 3
  • 4. For our dataset, we find that the first 5 components account for 95% of variance in the data. Thus we build a new set of features from the original feature set using the first 5 principal components. Figure 9: Scree Plot Figure 10: Silhou- ette Coefficient Plot Figure 11: Elbow Plot Figure 12: Feature Importance 4.4 Unsupervised Clustering 4.4.1 Determining K in the K-means One way to select K for the K-means algorithm is to try different values of K, plot the K-means objective versus K, and look at the “elbow-point” in the plot. By plotting the Within Group Sum of Squares against the number of clusters, we can visually examine the best point to choose for the number of clusters. Initially the first cluster will add much information but at some point the marginal gain will drop giving an angle in the graph. that would indicate the number of clusters we should aim for. We narrow down on cluster size of 4, 5 as they look promising from the plot. Silhouette Coefficient is a measure of how close each point in one cluster is to points in the neighboring clusters and thus provides a way to assess parameters like number of clusters visually. This measure has a range of [-1, 1]. Silhouette coefficients near +1 indicate that the sample is far away from the neighboring clusters. A value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters and negative values indicate that those samples might have been assigned to the wrong cluster. For each datum i, let a(i) be the average dissimilarity of i with all other data within the same cluster. We can interpret a(i) as how well i is assigned to its cluster (the smaller the value, the better the assignment). The Silhouette Coefficient is given by: si = b(i) − a(i) max(a(i), b(i)) Based on our observation from the Elbow plot and using Silhouette Coefficients, we choose K=4. Figure 10. shows the Elbow Plot and Figure 11. shows the Silhouette Coefficient Plot for out dataset. 
4.4.2 Cluster validation
In this step, we use the four clusters obtained as the labeled ground-truth dataset to train a Random Forest ensemble and obtain the relative importance of each of the 12 features in the classification task. We try Random Forests with 5, 10, 20 and 50 trees and obtain the maximum cross-validation score with 10 trees. Figure 12 shows the obtained feature importance values. We do not consider features with less than 1% importance and are therefore left with only 9 features. The dropped features include content volume and the average number of hyperlinks posted.

Next, we perform clustering on the selected important features using the K-means algorithm with K=4. K-means depends on the initial centroid seeds that are chosen, and it converges when the cluster assignment no longer changes. We run at most 300 iterations in a single run. The algorithm aims to minimize the Within Cluster Sum of Squares and might not converge to the global optimum, so we run it with 10 different centroid seeds and keep the best result. Comparing the clusters now obtained with the clusters from PCA, the clusters formed are of similar size and have 99.99% overlap among their members. The clustering we chose had a silhouette coefficient of 0.789, an average inter-cluster distance of 2.51 and an average intra-cluster distance of 1.01.
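A minimal sketch of this validation step follows, again assuming scikit-learn and synthetic stand-in data. The three appended noise columns, the 5-fold cross-validation and the use of the adjusted Rand index to measure cluster overlap are assumptions made for illustration; the text does not state how the overlap was computed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the 12 features (assumption): 9 informative columns
# plus 3 near-constant noise columns that should carry little importance.
rng = np.random.default_rng(0)
X_info, _ = make_blobs(n_samples=600, n_features=9, centers=4, random_state=0)
X = np.hstack([X_info, rng.normal(scale=0.01, size=(600, 3))])

# Cluster labels from K-means serve as the "ground truth" for the forest.
kmeans_args = dict(n_clusters=4, n_init=10, max_iter=300, random_state=0)
labels = KMeans(**kmeans_args).fit_predict(X)

# Choose the number of trees by mean cross-validated accuracy.
cv_scores = {}
for n_trees in (5, 10, 20, 50):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    cv_scores[n_trees] = cross_val_score(forest, X, labels, cv=5).mean()
best_n = max(cv_scores, key=cv_scores.get)

# Fit the chosen forest and keep features with at least 1% importance.
forest = RandomForestClassifier(n_estimators=best_n, random_state=0).fit(X, labels)
keep = forest.feature_importances_ >= 0.01
X_selected = X[:, keep]

# Re-cluster on the selected features and compare with the first clustering.
# The adjusted Rand index ignores how cluster ids are numbered.
new_labels = KMeans(**kmeans_args).fit_predict(X_selected)
agreement = adjusted_rand_score(labels, new_labels)
print(f"trees={best_n}  kept={int(keep.sum())} features  ARI={agreement:.3f}")
```

Here `n_init=10` corresponds to running K-means from 10 different centroid seeds and keeping the run with the lowest within-cluster sum of squares, and `max_iter=300` caps the iterations of a single run.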
Cluster      Share of users
Cluster 1    91.73%
Cluster 2    6.44%
Cluster 3    1.13%
Cluster 4    0.73%
Table 2: User distribution in each cluster

Figure 13: Comparison of cluster composition and the content contribution
Figure 14: Interaction among users across clusters
Figure 15: Out-degree feature distribution across clusters
Figure 16: In-degree feature distribution across clusters

5 Methodology
In this section, we summarize the methodology of our implementation. First, considering three kinds of metrics (the activity of the user, the user's network interaction structure and the content of postings), we extract 12 features. In order to extract the most significant of these features, we perform PCA and use the first 5 principal components, as determined by the scree plot. We use the projected data from this step as the set to be clustered. We construct 4 clusters using the K-means clustering technique; K = 4 is selected considering the Silhouette Coefficients and the elbow plot. The next aim is to obtain the important features from our set of 12 features. We use the Random Forest ensemble to get the feature importance, using the labeled data from the previous step as input. Nine of the input features are found to have relative importance greater than 1%, and they are subjected to K-means clustering, using the Euclidean distance as the distance metric, to obtain the final groups of users.

5.0.1 Role Inference
The final step is that of role inference. Table 2 depicts the size of each cluster obtained. Three threshold levels (Low, Medium and High) are determined for each of the features. Sociology attributes different characteristics to different user roles. Figures 15 and 16 show the out-degree and in-degree feature value distributions, respectively, in the four clusters. Among the four clusters obtained, we find one whose behavior is marked by low initiation, low engagement and low activity, and assign the label of “Lurkers” to this cluster. The second cluster exhibits medium initiation, medium reply share and medium engagement; we label it as “Contributors”. There is an interesting third group characterized by a high replier and reply rate but a low post initiation value; we label this cluster as “Debaters”. The last cluster is characterized by high initiation, high user engagement and high interaction; we assign the label of “Super Users” to this cluster.

6 Results
We obtain four significant user roles from our dataset: Lurkers, Contributors, Super Users and Debaters. We find that 91.73% of the users in the dataset are Lurkers. This observation is in accordance with the theory of participation inequality, and thus supports the 1% rule and the 90–9–1 principle. We also observe that 72% of the posts in our dataset are made by the Contributors and Super Users, though they constitute only 7.14% of our user base. Figure 13 illustrates the disparity between the number of users in each cluster and the volume of content the users in that cluster generate. Figure 14 depicts the interaction (via replies) of users across clusters. We observe that the users in Cluster 3, which we label as Debaters, interact mostly with users of the same cluster.

7 Conclusion
In this paper we have presented one approach to labeling the behavior of users of online communities. We presented a method to capture the behavioral characteristics of users as numeric attributes and explained how Machine Learning can be employed to infer the role that a given user plays. A key contribution of this paper is evidence supporting the theory of Participation Inequality on financial message boards.

References
[1] Van Mierlo, T. ‘The 1% Rule in Four Digital Health Social Networks’. J Med Internet Res, 16(2):e33, 2014.
[2] Matthew Rowe, Miriam Fernandez and Harith Alani. ‘Modelling and analysis of user behaviour in online communities’.
[3] Mathilde Forestier, Anna Stavrianou, Julien Velcin and Djamel A. Zighed. ‘Roles in social networks: methodologies and research issues’.