User Behaviour Modeling on Financial Message Boards

Pritha DN and Sahaj Biyani
Abstract
Online social communities such as discussion boards and message boards are evolving fast in
their usage, bringing people with similar interests together. From a social and anthropological
standpoint, they are more interesting to study than Online Social Networks because
they connect people (most often with no offline links) from different backgrounds and histories.
Various theories exist in sociology about the intended behavior of users in online forums. In this
paper, we study the applicability of one such theory, “Participation Inequality”, to financial
message boards. We consider each user's activity, network interaction structure and the
content of their postings, and employ Machine Learning techniques to identify, cluster and infer
the roles of users exhibiting similar behavior.
1 Introduction
1.1 Participation inequality
In Internet culture, the 1% rule is a rule of thumb pertaining to participation of users in an Internet
community, stating that only 1% of the users of a website actively create new content, while the
other 99% of the participants only lurk. Variants include the 1-9-90 rule (sometimes 90–9–1 principle
or the 89:10:1 ratio), which states that in a collaborative website such as a wiki, 90% of the
participants of a community only view content, 9% of the participants edit content, and 1% of the
participants actively create new content. A related observation is that 1% of users generate the
majority of revenue in free-to-play games. Similar rules are known in information science, such as
the 80/20 rule known as the Pareto principle, that 20 percent of a group will produce 80 percent of
the activity, however the activity may be defined.
1.2 User behaviour in online communities
Online communities form a fundamental part of the web today where a large portion of the Internet’s
traffic is driven by and through them. These communities are where the majority of web users share
content, seek support, and socialize. Of particular relevance is identifying the behavioral patterns or
roles that emerge from the community (e.g. experts, leaders, ignored users, etc.) as well as assessing
the distribution of users assuming different roles. There is currently no standard or agreed list of
behavior types for describing activities of users in online communities. In this paper, we consider
each user's activity, network interaction structure and the content of their postings, and employ
Machine Learning techniques to identify, cluster and infer the roles of users exhibiting similar behavior.

2 Motivation and Related Work
Earlier work on user behavior modeling on financial message boards has primarily dealt with
predicting the market status of a stock. In this paper, we aim to study user behavior on financial
message boards from a sociological point of view. Previous research [1] demonstrated Participation
Inequality by studying users on four Digital Health Social Networks (DHSNs): the AlcoholHelpCenter,
DepressionCenter, PanicCenter, and StopSmokingCenter sites. It would be interesting to analyze whether
a similar pattern exists in communities dealing with a different domain. Hence we analyze user
behavior on financial message boards: we identify, cluster and label user roles, and study the
interactions among the resulting groups. From a Machine Learning perspective, this is an
unsupervised clustering problem.
3 Dataset
3.1 Investors Hub and data retrieval
Investors Hub (“iHub”) is an online forum where investors gather and share market insights in a
dynamic environment using a discussion platform. Investors Hub has been online for over 15 years
and currently has 549,380 members who have posted 118,971,723 messages on 25,091 boards. It hosts
message boards broadly on the topics of the stock market, commodities and foreign exchange. It offers
both free and premium (paid) forums. We focus on the Stock Market Message Boards, which cover
US-listed and Canadian-listed stocks, and analyze the free message boards for US-listed stocks.
We crawled this data from the website, gathering information from a total of 6,278 boards and
53,491 users with 5.6 million posts.
Figure 1: Volume of posts across boards
Figure 2: User activity across message boards
Figure 3: Activity of users on exact same boards
Figure 4: Number of posts initiated by each user
Figure 5: Number of replies made by each user
Figure 6: Number of replies each user receives
Figure 7: Number of replies across Message Boards
Figure 8: Average User response time
3.2 Initial data analysis
In this section, we present the results of a statistical analysis of the gathered data. We find
that many message boards are inactive and hardly record any conversation. Out of all the boards,
only about 1600 have more than 100 messages, and these boards account for 97% of active users
(users who have commented at least once on any board). Figure 1 depicts the volume of posts on
each Message Board. We then analyze the activity of users across all Message Boards. We find that
57% of the users are active on only one message board, and 53.1% of users are active on the
same exact message boards. Figure 2 and Figure 3 show the activity of users across all message
boards. We further analyze the activity of users: the number of posts they initiate and the number
of replies they make. We find that 19.3% of the users have never initiated a post and 21.7% of the
users have never replied to another user's post. Figure 4 and Figure 5 indicate the distribution of
posts initiated and replies made by each user. We also observe that 33.6% of the users do not get any
replies on the posts they initiate. Figure 6 shows the distribution of the number of replies each user
receives. We also find that 18.4% of the Message Boards do not receive any reply posts. Figure 7
indicates the distribution of replies across all boards. Finally, we analyze the average time a user
takes to reply to another user's post. Figure 8 shows the distribution of average response times of
all users.

 #   Feature                    Description
 1   Initiation Rate            Number of threads a user initiated over time
 2   Reply Rate                 Number of replies a user makes over time
 3   Out-Degree                 Number of users a user replies to
 4   In-Degree                  Number of users who reply to a user
 5   Activity Across Boards     Number of boards the user is active on / total number of boards
 6   Followers                  Number of followers
 7   Replier Share              AVG[proportion of replies a user gets on a board]
 8   Reply Share                AVG[proportion of replies a user makes on a board]
 9   Response Time              Average time to respond to a reply
10   Volume of Posted Content   Average length of post
11   Links in Posts             Number of links the user has posted
Table 1: Features
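As an illustration of how the network features in Table 1 could be derived, the following minimal sketch computes Out-Degree, In-Degree and Activity Across Boards from a post log. The record layout `(author, board, replied_to_author)` and the sample users are assumptions made for this example; the crawled iHub data is not described at this level of detail.

```python
from collections import defaultdict

# Hypothetical post log: (author, board, author being replied to, or None for a new thread).
posts = [
    ("alice", "board_a", None),
    ("bob", "board_a", "alice"),
    ("carol", "board_a", "alice"),
    ("alice", "board_b", "bob"),
    ("bob", "board_b", None),
]

def network_features(posts, total_boards):
    """Return Out-Degree, In-Degree and Activity Across Boards for every posting user."""
    replies_to = defaultdict(set)   # user -> distinct users they replied to
    replied_by = defaultdict(set)   # user -> distinct users who replied to them
    boards = defaultdict(set)       # user -> boards they posted on
    for author, board, target in posts:
        boards[author].add(board)
        if target is not None and target != author:
            replies_to[author].add(target)
            replied_by[target].add(author)
    return {
        user: {
            "out_degree": len(replies_to[user]),
            "in_degree": len(replied_by[user]),
            "activity_across_boards": len(boards[user]) / total_boards,
        }
        for user in boards
    }

features = network_features(posts, total_boards=2)
```

Here `features["alice"]` has an out-degree of 1, an in-degree of 2 and an activity value of 1.0, since the sample user posts on both boards.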
4 Behavior Modeling and Approach
4.1 Feature selection
Drawing on the existing literature on user roles in Social Networks [3], we consider the
activity of the user, the egocentric network structure of the user and the content of the user's
postings as features for our clustering task. Features 3 and 4 (Out-Degree and In-Degree) represent
the “Reply to” and “Replies to” network structure. Features 10 and 11 (Volume of Posted Content and
Links in Posts) represent the content or quality of a user's posts. The rest of the features
represent the activity of the user on the message boards. Not all of the features listed in Table 1
are of equal importance to the clustering task. Features can be redundant and inter-correlated,
which could lead to erroneous clustering results. Because no ground-truth data is available, there
is no reference against which to measure which features are most significant. We therefore employ
Principal Component Analysis (PCA) for this task.
4.2 Min-Max Normalization
To operate on data from features that fall in different ranges, it is important to normalize
all feature values into a fixed range. Since we use Euclidean distance as the distance metric,
we normalize the data using Min-Max Normalization to fit the range [0, 1].
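Min-Max Normalization maps each value x of a feature to (x - min) / (max - min), where min and max are taken over that feature. A minimal sketch follows; the sample values are invented, and mapping a constant column to 0.0 is an assumption made here to avoid division by zero.

```python
def min_max_normalize(column):
    """Rescale one feature column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant feature carries no information, so map it to 0.0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

reply_rate = [2.0, 4.0, 10.0]            # illustrative values only
scaled = min_max_normalize(reply_rate)   # [0.0, 0.25, 1.0]
```

Each feature is scaled independently, so that no single feature dominates the Euclidean distance merely because of its units.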
4.3 Principal Component Analysis
PCA is a data reduction technique in which possibly correlated features are transformed into a
smaller number of factors called principal components. We use the scree plot, which graphs the
eigenvalue against the component number, to determine the number of principal components to
consider. To determine the appropriate number of components, we look for an “elbow”
in the scree plot. Figure 9 shows the scree plot we obtained. The component number is taken to
be the point at which the remaining eigenvalues are relatively small and all about the same size.
For our dataset, we find that the first 5 components account for 95% of the variance in the data. Thus
we build a new set of features from the original feature set using the first 5 principal components.
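The share of variance explained by the leading components follows directly from the eigenvalues plotted in the scree plot. The sketch below picks the smallest number of components whose cumulative share reaches a target. The eigenvalues are invented for illustration and are not those of the iHub dataset; the function name is likewise hypothetical.

```python
def components_for_variance(eigenvalues, target=0.95):
    """Smallest number of leading components whose eigenvalues reach `target` of the total."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= target:
            return count
    return len(ordered)

eigenvalues = [6.0, 2.0, 1.0, 0.6, 0.4]   # illustrative values only
n_components = components_for_variance(eigenvalues, target=0.95)
```

With these sample eigenvalues the cumulative shares are 0.60, 0.80, 0.90 and 0.96, so four components are retained.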
Figure 9: Scree Plot
Figure 10: Silhouette Coefficient Plot
Figure 11: Elbow Plot
Figure 12: Feature Importance
4.4 Unsupervised Clustering
4.4.1 Determining K in the K-means
One way to select K for the K-means algorithm is to try different values of K, plot the K-means
objective versus K, and look for the “elbow point” in the plot. By plotting the Within Group Sum
of Squares against the number of clusters, we can visually examine the best point to choose for
the number of clusters. Initially, each additional cluster adds much information, but at some point
the marginal gain drops, giving an angle in the graph that indicates the number of clusters
we should aim for. We narrow the choice down to K = 4 and K = 5, as they look promising from the plot.
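The quantity plotted on the vertical axis of the elbow plot can be computed as follows. This is a minimal sketch of the Within Group Sum of Squares for a given cluster assignment; the four sample points are invented for illustration.

```python
def within_cluster_sum_of_squares(points, labels):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    total = 0.0
    for members in groups.values():
        dims = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dims)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dims))
    return total

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]   # illustrative data
one_cluster = within_cluster_sum_of_squares(points, [0, 0, 0, 0])    # 104.0
two_clusters = within_cluster_sum_of_squares(points, [0, 0, 1, 1])   # 4.0
```

The sharp drop from 104.0 to 4.0 when moving from one cluster to two is the kind of marginal gain that produces an elbow; further clusters on this sample data would reduce the value only slightly.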
The Silhouette Coefficient is a measure of how close each point in one cluster is to points in the
neighboring clusters, and thus provides a way to assess parameters such as the number of clusters
visually. This measure has a range of [-1, 1]. Silhouette coefficients near +1 indicate that the
sample is far away from the neighboring clusters. A value of 0 indicates that the sample is on or
very close to the decision boundary between two neighboring clusters, and negative values indicate
that the sample might have been assigned to the wrong cluster. For each datum i, let a(i) be the
average dissimilarity of i with all other data within the same cluster. We can interpret a(i) as how
well i is assigned to its cluster (the smaller the value, the better the assignment). Let b(i) be
the lowest average dissimilarity of i to the data of any other cluster, that is, to its neighboring
cluster. The Silhouette Coefficient is given by:

$$ s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} $$

Based on our observation from the Elbow plot and using Silhouette Coefficients, we choose K=4.
Figure 10 shows the Silhouette Coefficient plot and Figure 11 shows the Elbow plot for our
dataset.
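As a worked illustration of the Silhouette Coefficient, the sketch below computes s(i) for a single point, taking a(i) as the mean distance to the other members of its own cluster and b(i) as the lowest mean distance to the members of any other cluster (the standard definition). Euclidean distance is used as the dissimilarity; the sample points and the zero returned for degenerate cases are assumptions of this example.

```python
import math

def silhouette(points, labels, i):
    """Silhouette coefficient s(i) of the point at index i, using Euclidean distance."""
    by_cluster = {}
    for j, label in enumerate(labels):
        if j != i:
            by_cluster.setdefault(label, []).append(math.dist(points[i], points[j]))
    own = by_cluster.pop(labels[i], [])
    if not own or not by_cluster:
        # Assumption: return 0.0 for a singleton cluster or when only one cluster exists.
        return 0.0
    a = sum(own) / len(own)                                  # mean distance within own cluster
    b = min(sum(d) / len(d) for d in by_cluster.values())    # nearest other cluster
    return (b - a) / max(a, b)

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]   # illustrative data
labels = [0, 0, 1, 1]
s0 = silhouette(points, labels, 0)
```

For the first sample point, a = 1.0 and b is roughly 5.05, giving a coefficient of about 0.80, which indicates a well-separated assignment.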
4.4.2 Cluster validation
In this step, we use the four clusters obtained as the labeled ground-truth dataset to train a
Random Forest ensemble, and obtain the relative importance of each of the 12 features in the
classification task. We try 5, 10, 20 and 50 trees in the Random Forest and obtain the maximum
cross-validation score with 10 trees. Figure 12 shows the obtained feature importance
values. We do not consider features with less than 1% importance, and are therefore left with only 9
features. The content volume, average number of hyperlinks posted and the content volume features
are dropped. Next, we cluster the users on the selected important features using the K-means
algorithm with K=4. K-means depends on the initial centroid seeds that are chosen, and it
converges when the cluster assignment no longer changes. We allow 300 iterations for a single run.
The algorithm aims to minimize the Within Cluster Sum of Squares and might not converge to the
global optimum, so we run it with 10 different centroid seeds and keep the best result.
Comparing the clusters now obtained with the clusters from PCA, the clusters formed are of similar size
and have 99.99% overlap among their members. The clustering we chose had a silhouette coefficient of
0.789, an average inter-cluster distance of 2.51 and an average intra-cluster distance of 1.01.
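The restart strategy described above can be sketched as follows: each run starts from different random centroid seeds, stops when the assignment no longer changes or after 300 iterations, and the run with the lowest Within Cluster Sum of Squares is kept. This is a plain standard-library illustration of K-means (Lloyd's algorithm), not the implementation used for the results reported here; the function names and sample points are hypothetical.

```python
import math
import random

def kmeans_once(points, k, rng, max_iter=300):
    """One K-means run from random seeds; returns (labels, within-cluster sum of squares)."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        if new_labels == labels:
            break  # assignments no longer change: converged
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    wcss = sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))
    return labels, wcss

def kmeans(points, k, n_init=10, seed=0):
    """Run K-means n_init times with different seeds and keep the lowest-WCSS result."""
    rng = random.Random(seed)
    return min((kmeans_once(points, k, rng) for _ in range(n_init)), key=lambda r: r[1])

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, wcss = kmeans(points, k=2)
```

On this sample data the two groups of three points are recovered, with a total within-cluster sum of squares of 8/3.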

Cluster 1 91.73%
Cluster 2 6.44%
Cluster 3 1.13%
Cluster 4 0.73%
Table 2: User distribution in each cluster
Figure 13: Comparison of cluster composition and the content contribution
Figure 14: Interaction among users across clusters
Figure 15: Out-degree feature distribution across clusters
Figure 16: In-degree feature distribution across clusters
5 Methodology
In this section, we summarize the methodology of our implementation. First, considering three
aspects (the activity of the user, the user's network interaction structure and the content of
postings), we extract 12 features. To extract the most significant of these features, we perform
PCA and use the first 5 principal components as determined by the scree plot. We use the projected
data from this step as the set to be clustered. We construct 4 clusters using K-means clustering;
K = 4 is selected considering the Silhouette Coefficients and the elbow plot. The aim is then to
obtain the important features from our feature set. We use a Random Forest ensemble to obtain the
feature importance, using the labeled data from the previous step as input. Nine of the input features
are found to have relative importance greater than 1%, and they are subjected to K-means clustering,
using Euclidean distance as the distance metric, to obtain the final groups of users.
5.1 Role Inference
The final step is role inference. Table 2 shows the size of each cluster obtained. Three
threshold levels (Low, Medium and High) are determined for each of the features. Sociology attributes
different characteristics to different user roles. Figures 15 and 16 show the out-degree and
in-degree feature value distributions, respectively, in the four clusters. Among the four clusters
obtained, we find one marked by low initiation, low engagement and low activity, and assign the
label “Lurkers” to this cluster. The second cluster exhibits medium initiation, medium reply share and
medium engagement; we label it “Contributors”. There is an interesting third group characterized
by a high replier and reply rate, but a low post initiation value; we label this cluster “Debaters”.
The last cluster is characterized by high initiation, high user engagement and high interaction; we
assign the label “Super Users” to this cluster.
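The labeling rules above can be illustrated with a small sketch that discretizes two normalized centroid values, initiation and reply activity, into Low, Medium and High and maps the combination to a role. The cut points of 1/3 and 2/3, the centroid values and the use of only two features are all assumptions made for this example; the actual thresholds are not stated here, and treating every remaining combination as “Contributors” is a simplification.

```python
def level(value, low_cut=1 / 3, high_cut=2 / 3):
    """Discretize a normalized feature value; the cut points are hypothetical."""
    if value < low_cut:
        return "Low"
    if value < high_cut:
        return "Medium"
    return "High"

def infer_role(initiation, reply):
    """Map the Low/Medium/High levels of a cluster centroid to a role label."""
    if initiation == "Low" and reply == "Low":
        return "Lurkers"
    if initiation == "Low" and reply == "High":
        return "Debaters"
    if initiation == "High" and reply == "High":
        return "Super Users"
    return "Contributors"  # simplification: every other combination

# Hypothetical normalized centroids: (initiation rate, reply rate).
centroids = {"a": (0.05, 0.04), "b": (0.45, 0.50), "c": (0.10, 0.90), "d": (0.90, 0.85)}
roles = {name: infer_role(level(i), level(r)) for name, (i, r) in centroids.items()}
```

With these invented centroids, the four clusters are labeled Lurkers, Contributors, Debaters and Super Users respectively.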
6 Results
We obtain four significant user roles from our dataset: Lurkers, Contributors, Debaters and Super
Users. We find that 91.73% of the users in the dataset are Lurkers. This observation is in
accordance with the theory of participation inequality; our observations thus support
the 1% Internet Rule and the 90–9–1 principle. We also observe
that 72% of the posts in our dataset are made by the Contributors and Super Users, although they
constitute only 7.14% of our user base. Figure 13 illustrates the disparity between the number of users
in each cluster and the volume of content the users in that cluster generate. Figure 14 depicts the
interaction (via replies) of users across clusters. We also observe that the users
in Cluster 3, which we label Debaters, interact mostly with users of the same cluster.
7 Conclusion
In this paper we have presented one approach to label behavior of users of online communities.
We presented a method to capture the behavioral characteristics of users as numeric attributes and
explained how Machine Learning can be employed to infer the role that a given user plays. A key
contribution of this paper is that we successfully substantiate and support the theory of Participation
Inequality on financial message boards.
References
[1] Van Mierlo, T. ‘The 1% Rule in Four Digital Health Social Networks’, J Med Internet Res,
16(2):e33, 2014.
[2] Matthew Rowe, Miriam Fernandez and Harith Alani. ‘Modelling and analysis of user behaviour
in online communities’.
[3] Mathilde Forestier, Anna Stavrianou, Julien Velcin, and Djamel A. Zighed. ‘Roles in social
networks: methodologies and research issues’.
