SlideShare a Scribd company logo
ISSN: 2277 – 9043
                                                International Journal of Advanced Research in Computer Science and Electronics Engineering
                                                                                                              Volume 1, Issue 2, April 2012




            Sentiment Analysis and Influence Tracking
                         using Twitter
             Rushabh Mehta, Dhaval Mehta, Disha Chheda, Charmi Shah and Pramila M. Chawan


                                                                    share opinions on variety of topics and discuss current issues.
   Abstract— An overwhelming number of consumers are                 Because of a free format of messages and an easy accessibility
active in social media platforms. Within these platforms             of microblogging platforms, Internet users tend to shift from
consumers are sharing their true feelings about a                    traditional communication tools (such as traditional blogs or
particular brand/product, its features, customer service             mailing lists) to microblogging services.
and how it stands the competition. With the booming of
microblogs on the Web, people have begun to express their            As more and more users post about products and services they
opinions on a wide variety of topics on Twitter and other            use, or express their political and religious views,
similar services. In a world where information can bias              microblogging[2] web- sites become valuable sources of
public opinion it is essential to analyse the propagation and        people‟s opinions and sentiments. Such data can be efficiently
influence of information in large-scale networks. Recent             used for marketing or social studies. We use a dataset formed
research studying social media data to rank users by                 of collected messages from Twitter. Twitter contains a very
topical relevance have largely focused on the “retweet",             large number of very short messages created by the users of
“following" and “mention" relations. We also perform                 this microblogging platform. The contents of the messages
linguistic analysis of the collected corpus and explain              vary from personal thoughts to public statements.
discovered phenomena. Using the corpus, we build a
sentiment classifier, that is able to determine positive,            As a microblogging and social networking website, Twitter
negative and neutral sentiments for a document. This                 has become very popular and has grown rapidly. An
paper discusses how Twitter data is used as a corpus for             increasing number of people are willing to post their opinions
analysis by the application of sentiment analysis and a              on Twitter, which is now considered a valuable online source
study of different algorithms and methods that help to               for opinions. As a result, sentiment analysis on Twitter is a
track influence and impact of a particular user/brand                rapid and effective way of gauging public opinion for business
active on the social network.                                        marketing or social studies. For example, a business can
                                                                     retrieve timely feedback on a new product in the market by
  Index Terms—Twitter, sentiment analysis, influence,                evaluating people's opinions on Twitter. As people often talk
People Rank, TwitterRank.                                            about various entities (e.g., products, organizations, people,
                                                                     etc.) in a tweet, we perform sentiment analysis at the entity
                                                                     level; that is, we mine people's opinions on specific entities in
                    I. INTRODUCTION                                  each tweet rather than the opinion about each whole sentence
                                                                     or whole tweet. We assume that the entities are provided by
Microblogging today has become a very popular
                                                                     the user, e.g., he/she is interested in opinions on iPhone (an
communication tool among Internet users. Millions of
                                                                     entity).
messages are appearing daily in popular web-sites that provide
services for microblogging such as Twitter, Tumblr,
                                                                     In our paper, we study how microblogging can be used for
Facebook. Authors of those messages write about their life,
                                                                     sentiment analysis purposes. We show how to use Twitter as a
                                                                     corpus for sentiment analysis and opinion mining. We use
     RUSHABH MEHTA Btech Computer Engineer from                     microblogging and more particularly Twitter for the following
      VJTI,MUMBAI,INDIA
                                                                     reasons:
     DHAVAL MEHTA Btech Computer Engineer from
      VJTI,MUMBAI,INDIA
     DISHA CHHEDA         Btech Computer Engineer from
                                                                     • Microblogging platforms are used by different people to
      VJTI,MUMBAI,INDIA                                              express their opinion about different topics, thus it is a
     CHARMI SHAH          Btech Computer Engineer from              valuable source of people‟s opinions.
      VJTI,MUMBAI, INDIA
     PRAMILA M.CHAWAN                                               • Twitter contains an enormous number of text posts and it
      Associate Professor Computer Department                        grows every day. The collected corpus can be arbitrarily large.
      VJTI,MUMBAI,INDIA


                                                                                                                                              72
                                            All Rights Reserved © 2012 IJARCSEE
ISSN: 2277 – 9043
                                                    International Journal of Advanced Research in Computer Science and Electronics Engineering
                                                                                                                  Volume 1, Issue 2, April 2012

• Twitter‟s audience varies from regular users to celebrities,           2.”#" called the hashtag is used to mark, organize
company representatives, politicians4, and even country                  or alter tweets according to topics or categories.
presidents. Therefore, it is possible to collect text posts of
users from different social and interests groups.                        3. “@username1" represents that a message is a reply to a user
                                                                         whose user name is username1".
Sentiment is an attitude, thought, or judgment prompted by
feeling. Sentiment analysis is the process of determining and            4. Emoticons and colloquial expressions are frequently used in
measuring the tone, attitude, opinion, and emotional state of            tweets, e.g. :-)", lovvve",
responses. More precisely, it is the concept of deciding                 lmao".
whether a specific conversation is positive, negative, or
neutral. Sentiment analysis has broad applications and                   5. External Web links (e.g. http://amze.ly/8K4n0t)
encompasses work in classifying subjectivity, polarity,                  are also commonly found in tweets to refer to some
tonality, emotion mining, opinion mining, persuasion analysis,           external sources.
and affective computing. It is a tool that allows companies to
analyze what their customers are saying regarding their                  6. Length: Tweets are limited to 140 characters.
products and services, and also monitor trends in the opinions           This is different from usual opinionated corpora such as
and attitudes of their customers toward the products and                 reviews and blogs, which are usually long.
services with respect to their competitors.
                                                                         Another unique characteristic of Twitter data compared to the
There are various types of marketing strategies such as mass             other opinionated corpora is its volume. It is estimated that
marketing, segmentation and one to one marketing. One to one             people post about 60 million tweets every day and the number
marketing is an effort to find individual customer's needs and           is still increasing rapidly.
to provide a good response for them. Recommender systems
have appeared in e-commerce problems to support product
recommendation, which provide one to one marketing. Indeed,                                     II. METHODOLOGY
recommender systems individualize the way of recommending
products. These systems try to recommend different products
to each customer with collecting data of customer preferences
                                                                         A. Data Collection
and data mining techniques. Recommender systems have
                                                                         Twitter has an open API that allows anyone to get a list of a
recently become popular among many well-known e-
                                                                         user's friends (provided the account is not private) It is
businesses such as Amazon.com, MovieFinder.com.
                                                                         therefore easy to create a graph of the network. Since there are
                                                                         more than 100M nodes in this graph with many times that
As people often talk about various entities (e.g., products,
                                                                         many edges, it requires a lot of computational power to
organizations, people, etc.) in a tweet, we perform sentiment
                                                                         process this entire graph. I therefore propose to focus on a
analysis at the entity level; that is, we mine people's opinions
                                                                         smaller subset. However, recently Twitter has been more
on specific entities in each tweet rather than the opinion about
                                                                         circumspect in allowing unfettered access to the entire social
each whole sentence or whole tweet. We assume that the
                                                                         graph and tweet stream. It allows this access termed the "fire
entities are provided by the user, e.g., he/she is interested in
                                                                         hose" to a small chosen set of companies only. Through the
opinions on iPhone (an entity). One approach to perform
                                                                         public API, one can only access a single user's tweet stream
sentiment analysis is based on a function of opinion words in
                                                                         and his profile information and also the public timeline of
context. Opinion words are words that are commonly used to
                                                                         tweets.
express positive or negative sentiments, e.g., good" and bad".
The approach generally uses a dictionary of opinion words to
                                                                         The Streaming API is the real-time sample of the Twitter
identify and determine sentiment orientation (positive,
                                                                         Firehose. This API is for those developers with data intensive
negative or neutral). The dictionary is called the opinion
                                                                         needs. If you're looking to build a data mining product or are
lexicon.
                                                                         interested in analytics research, the Streaming API is most
                                                                         suited for such things. Streaming API allows for large
                                                                         quantities of keywords to be specified and tracked, retrieving
                                                                         geo-tagged tweets from a certain region, or have the public
Twitter Data
                                                                         statuses of a user set returned. This requires you to establish a
                                                                         long-lived HTTP connection and maintain that connection.
Twitter has developed its own language conventions.The
following are examples of Twitter conventions.
                                                                         The Twitter Search API is a dedicated API for running
1. “RT" is an acronym for retweet, which is put in
                                                                         searches against the real-time index of recent Tweets. If you're
front of a tweet to indicate that the user is repeating
                                                                         currently developing on the Search API, and find that your
or reposting.
                                                                         application is being rate-limited or you just have aggressive

                                                                                                                                                  73
                                                All Rights Reserved © 2012 IJARCSEE
ISSN: 2277 – 9043
                                                 International Journal of Advanced Research in Computer Science and Electronics Engineering
                                                                                                               Volume 1, Issue 2, April 2012

querying needs, then you should be moving over to the
Streaming API.                                                        Classifications

                                                                      Four classifications are used in this corpus:
B. Analysis
Using Twitter API we collected a corpus of text posts and             Positive             ● Positive indicator on topic
formed a dataset of three classes: positive sentiments, negative      Neutral              ● Neither positive nor negative
sentiments, and a set of objective texts (no sentiments). We                               indicators
queried Twitter for two types of emoticons:                                                ● Mixed positive and negative indicators
• Happy emoticons: “:-)”, “:)”, “=)”, “:D” etc.                                            ● On topic, but indicator undeterminable
• Sad emoticons: “:-(”, “:(”, “=(”, “;(” etc.                                              ● Simple factual statements
                                                                                           ● Questions with no strong emotions
The two types of collected corpora will be used to train a                                 indicated
classifier to recognize positive and negative sentiments.
Because each message cannot exceed 140 characters by the              Negative             ● Negative indicator on topic
rules of the microblogging platform, it is usually composed of        Irrelevant           ● Not English language
a single sentence. Therefore, we assume that an emoticon                                   ● Not on-topic (e.g. spam)
within a message represents an emotion for the whole message
and all the words of the message are related to this emotion.

                                                                      Sentiment assignment is an extremely subjective exercise.
i. Preparing the Data
The data set contains information on X million profiles.              For this corpus, “Positive” and “Negative” labels were
Profile information is limited to the user accounts followed by       reserved for tweets which clearly express an emotion or where
the user. Since the data set is as of a certain date, the             the implications were unambiguous. As a rule of thumb,
information is not complete as of today. However, the data set        “neutral” was the preferred label for border line cases.
contained enough information to create training and test data
sets.                                                                 Examples:

All the tweets would be initially considered as a bag of words.       There are huge lines at the @apple store.
                                                                      Labeled neutral. From a shoppers perspective this could be
for eg. "This is excellent"                                           bad, or it could be a sign of excitement about the product
                                                                      launch. From an investor‟s perspective this could be good,
would not be considered as a string but as a bag of three words       since it indicates a strong new product launch.
"This", "is" and "excellent".
                                                                      I had to wait for six friggin’ hours in line at the @apple
Then the stop words such as "the", "a", "with" etc will be            store.
removed from the bag as these words do not have any                   Labeled negative. The tweeter is clearly unhappy with the
sentiment expressing nature. Once these non-sentimental stop          situation and is referring to Apple in the negative sense.
words are are removed and hence the corpus refined, the
process of sentiment analysis can begin.
                                                                      iii. Preprocessing
                                                                      Data preprocessing consists of three steps: 1) tokenization,
ii. Sentiment gradation                                               2) normalization, and 3) part-of-speech (POS) tagging.[12]
The bag of sentiment expressive words i.e. every tweet is now
analyzed in parts. A knowledge base is created which has the          Emoticons and abbreviations (e.g., OMG, WTF, BRB) are
relative sentiments of words denoted by a floating point              identified as part of the tokenization process and treated as
number ranging from -1 to 1.                                          individual tokens.

All the words in the bag are cross checked across this                 For the normalization process, the presence
knowledge base. This gives the sentiment of ever word in the          of abbreviations within a tweet is noted and then abbreviations
range. After this, taking into consideration the type of words        are replaced by their actual meaning (e.g., BRB
and their sentiment score, the sentiment of the overall tweet is      - > be right back). We also identify informal intensifiers
calculated. This would determine what the sentiment of the            such as all-caps (e.g., I LOVE this show!!! and character
tweet is and how the user has expressed his satisfaction over         repetitions (e.g., I‟ve got a mortgage!! happyyyyyy”), note
the product or service.                                               their presence in the tweet. All-caps words are made into
                                                                      lowercase, and instances of repeated charaters are replaced by
                                                                      a single character.
                                                                                                                                               74
                                              All Rights Reserved © 2012 IJARCSEE
ISSN: 2277 – 9043
                                                  International Journal of Advanced Research in Computer Science and Electronics Engineering
                                                                                                                Volume 1, Issue 2, April 2012



 Finally, the presence of any special Twitter                          b. Usage of links. Users very often include links in their
tokens is noted (e.g., #hashtags, usertags, and URLs) and              tweets. An equivalence class was created for all URLs. That
placeholders indicating the token type are substituted. Our            is, a URL like "http://tinyurl.com/cvvg9a" was converted to
hope is that this normalization improves the performance of            the symbol "URL."
the POS tagger, which is the last preprocessing step.
                                                                       c. Usernames. Users often include usernames in their tweets,
                                                                       in order to address messages to particular users. A de facto
iv. Feature-based extraction                                           standard is to include the @ symbol before the username (e.g.
                                                                       @alecmgo). An equivalence class was made for all words that
The collected dataset is used to extract features that will be         started with the @ symbol.
used to train our sentiment classifier. We used the presence of
an n-gram as a binary feature, while for general information           d. Removing the query term. Query terms were stripped out
retrieval purposes, the frequency of a keyword‟s occurrence is         from Tweets, to avoid having
a more suitable feature, since the overall sentiment may not           the query term affect the classification.
necessarily be indicated through the repeated use of keywords.
 A. Process of constructing n-grams
                                                                       2.Bigrams
                                                                       The reason we experimented with bigrams was we wanted to
     1. Filtering – we remove URL links (e.g.                          smooth out instances like 'not good' or 'not bad'. When
        http://example.com), Twitter user names (e.g.                  negation as an explicit feature didn't help, we thought of
        @alex – with symbol @ indicating a user name),                 experimenting with bigrams.
        Twitter special words (such as “RT”), and emoticons.

     2. Tokenization – we segment text by splitting it by              B. Negate as a features
        spaces and punctuation marks, and form a bag of                NEGATE is added as a specific feature which is added when
        words. However, we make sure that short forms such             “not” or „n‟t” are observed in the dataset. [7]
        as “don‟t”, “I‟ll”, “she‟d” will remain as one word.
                                                                       C. Part of Speech (POS) features
     3. Removing stopwords – we remove articles (“a”, “an”,            We felt like POS tags would be a useful feature since how you
     “the”) from the bag of words.                                     made use of a particular word. For example, „over‟ as a verb
                                                                       has a negative connotation whereas „over‟ as the noun, would
     4. Constructing n-grams – we make a set of n-grams out            refer to the cricket over which by itself doesn‟t carry any
     of consecutive words. A negation (such as “no” and                negative or positive connotation.
     “not”) is attached to a word which precedes it or follows
     it. For example, a sentence “I do not like fish” will form        D. Lexicon features
     two bigrams: “I do+not”, “do+not like”, “not+like fish”.          Words listed the MPQA subjectivity lexicon (Wilson,
     Such a procedure allows to improve the accuracy of the            Wiebe, and Hoffmann 2009) are tagged with their prior
     classification since the negation plays a special role in an      polarity:positive, negative, or neutral.We create three features
     opinion and sentiment expression.                                 based on the presence of any words from the lexicon.


1.Unigram                                                              iv. Literature Review on taggers
Building the unigram model took special care because the               The models included for sentiment analysis in our paper can
Twitter language model is very different from other domains            be downloaded for the POS tagger website at
from past research. The unigram feature extractor addressed            http://nlp.stanford.edu/software/tagger.shtml . All taggers are
the following issues:                                                  accompanied by the props files used to create them,given
                                                                       below is a more detailed information about the creation of the
a. Tweets contain very casual language. For example, you               taggers.
can search "hungry" with a random number of u's in the
middle of the word on http://search.twitter.com to understand          For English, the bidirectional taggers are slightly more
this. Here is an example sampling:                                     accurate, but tag much more slowly; choose the appropriate
huuuungry: 17 results in the last day                                  tagger based on your speed/performance needs.
huuuuuuungry: 4 results in the last day
huuuuuuuuuungry: 1 result in the last day                              English taggers
Besides showing that people are hungry, this emphasizes the            ---------------------------
casual nature of Twitter and                                           wsj-0-18-bidirectional-distsim.tagger
the disregard for correct spelling.

                                                                                                                                                75
                                              All Rights Reserved © 2012 IJARCSEE
ISSN: 2277 – 9043
                                                 International Journal of Advanced Research in Computer Science and Electronics Engineering
                                                                                                               Volume 1, Issue 2, April 2012

Trained on WSJ sections 0-18 using a bidirectional                    construct some implicit signals from the user's content stream
architecture and including word shape and distributional              that are analogous to recommendation. Specifically, I look at
similarity features.                                                  three signals that are counted as up votes. First, if a user
Penn Treebank tagset.                                                 follows another account, that is considered a positive rating
Performance:                                                          for the account that is followed. Second, if a user retweets (i.e.
97.28% correct on WSJ 19-21                                           echoes a tweet to his own tweet stream), that can also
(90.46% correct on unknown words)                                     considered a positive rating. Thirdly, if a user shares a
                                                                      "hashtag" with another user, that is considered a positive
wsj-0-18-left3words.tagger                                            rating for the user who is being followed. Sharing a hashtag
Trained on WSJ sections 0-18 using the left3words                     implies that the two tweets are related to the same topic,
architecture and includes word shape features. Penn tagset.           although they may express two entirely different opinions (for
Performance:                                                          e.g. the recent controversy around wikileaks elicited a storm of
96.97% correct on WSJ 19-21                                           either vehement approval or disapproval from twitter users,
(88.85% correct on unknown words)                                     but they used the same #wikileaks hashtag).

wsj-0-18-left3words-distsim.tagger
Trained on WSJ sections 0-18 using the left3words
architecture and includes word shape and distributional
similarity features. Penn tagset.                                                               III.ALGORITHMS
Performance:
97.01% correct on WSJ 19-21
(89.81% correct on unknown words)                                       B. PeopleRank Algorithm
                                                                        In general, global knowledge of network topology can make
english-left3words-distsim.tagger                                       for very efficient routing and forwarding decisions.
Trained on WSJ sections 0-18 and extra parser training data             Collecting and exchanging topology information in
using the left3words architecture and includes word shape and           opportunistic networks is cumbersome because of their
distributional similarity features. Penn tagset.                        intermittent connectivity and unpredictable mobility.

english-bidirectional-distsim.tagger                                     PeopleRank is inspired by the PageRank [5] algorithm
Trained on WSJ sections 0-18 using a bidirectional                       employed by Google to rank web pages. By crawling the
architecture and including word shape and distributional                 entire web, this algorithm measures the relative importance
similarity features.                                                     of a page within a graph (web). Motivated by the success of
Penn Treebank tagset.                                                    this algorithm, we propose to apply a similar technique,
                                                                         which we call PeopleRank to rank the nodes in a social
wsj-0-18-caseless-left3words-distsim.tagger                              graph. The main idea is that nodes with a higher
Trained on WSJ sections 0-18 left3words architecture and                 PeopleRank value will generally be more “central” in the
includes word shape and distributional similarity features.              social graph.
Penn tagset. Ignores case.
                                                                          a. Centralized Peoplerank
english-caseless-left3words-distsim.tagger                               In PeopleRank we tag people as “important” when they are
Trained on WSJ sections 0-18 and extra parser training data              linked (in a social context) to many other “important”
using the left3words architecture and includes word shape and            people. We assume that only neighbors in the social graph
distributional similarity features. Penn tagset. Ignores case.           have an impact of the popularity.

                                                                         a social graph Gs = (Vs,Es) as a finite undirected graph with
v. Inferring Edge Strength                                               a vertex set V and an edge set Es. An edge (u, v) ∈ Es if,
In the simplest setting, a user being connected to another user          and only if, there is a social relation between nodes u and v.
can be used as a preference signal. In recent times, given the           In this paper, we define a social relationship between two
explosive growth of twitter, there have emerged a large                  nodes u and v either (i) if they are declared friends, or (ii) if
number of "bot" accounts that seek to follow as many users as            they are sharing k common interests.
possible in the hope that unwitting users will follow them
back. Therefore, looking at "followed" accounts yields more
information about the account holder's preferences rather than            b. Distributed PeopleRank
"follower" accounts.                                                     The distributed version of PeopleRank is shown in
                                                                         Algorithm. In this version, whenever two neighbor nodes in
In a traditional item recommendation setting, users rate items           the social graph meet, they exchange two pieces of
on a scale of 1-5 or by an up or down vote. In twitter, there is         information:
no explicit rating of accounts by other users. However, we can

                                                                                                                                               76
                                              All Rights Reserved © 2012 IJARCSEE
ISSN: 2277 – 9043
                                                   International Journal of Advanced Research in Computer Science and Electronics Engineering
                                                                                                                 Volume 1, Issue 2, April 2012

                                                                        others repeat your message either through literal retweets or
        (i) their current PeopleRank values; and                        more subtle gestures, such as replies and repeating the URLs
        (ii) the number of social graph neighbors they have.            that you tweet. If someone on Twitter receives your message
        Then, the two neighbors update their PeopleRank                 through a trusted intermediary, it is assigned a much greater
        values                                                          level of trust. So the goal is to get influential people to follow
                                                                        you and then act as a conduit for your marketing.[10]

                                                                        Influence is a sophisticated measure of a user‟s relative
                                                                        importance among the entire Twitter network.
C. TwitterRank
                                                                        Uses various statistics about a handle as parameters like
TwitterRank measures the influence taking both the topical              number of followers, retweets, mentions, URL‟s shared.
similarity between users and the link structure into account.           There are three major components that add up to the score:
 In a dataset prepared for this study, it is observed that 1)72.4%      your followers, your mentions and retweets, and your lists, all
of the users follow more than 80% of their followers, and (2)           accounted as ratios between you and others.
80.5% of the user have 80% of their friends follow them
back.[4]Our study reveals that the presence of “reciprocity” can        Followers is the strongest component of the calculation is the
be explained by phenomenon of homophily.Based on this                   number of followers you have. In my opinion, your presence
finding, TwitterRank, an extension of PageRank algorithm, is            on Twitter and getting followers can be influenced by at least
proposed to measure the influence of users in Twitter.                  the following three major factors concerning you and your
TwitterRank measures the influence taking both the topical              Twitter account:
similarity between users and the link structure into account.
Experimental results show that TwiterRank outperforms the                     i. Persona – how known you are. Measured by the
one Twitter currently uses and other related algorithms,                               number of followers you have, our time on
including the original PageRank and Topic-sensitive PageRank.                          Twitter.
                                                                              ii. Engagement – how engaged you are. Measured by the
                                                                                       number of followers you have, compared the
First, it potentially brings order to the real-time web in that it                     number of people you follow; Measured by the
allows the search results to be sorted by the authority/influence                      number of followers you have, compared to the
of the contributing twitterers giving a timely update of the                           number of mentions and retweets you‟ve made.
thoughts of influential twitterers. Second, Twitter is also a                 iii. Wits – how smart and creative your tweets are.
marketing platform. Targeting those influential users will                             Measured by the number of followers you have
increase the efficiency of                                    the                      compared to the total number of tweets you've
marketing campaign. For example, a handphone manufacturer                              made.
can engage those twitterers influential in topics about IT
gadgets to potentially influence more people . There are also           For this part, the followers/following ratio the weight of 3, the
applications that utilize Twitter to gather opinions and                followers/tweets a weight of 2 and the followers/time a weight
information on particular topics. Identifying influential               of 1. The followers/(mentions + retweets) has a weight of 0.5
twitterers for interesting topics can improve the quality of            and works in the negative way, so people who bother other
opinions gathered.                                                      people get a bit of a minus to their followers result. Besides,
                                                                        those who are able to get the same number of followers
 PageRank improves over in-degree by considering the link               without mentioning people, must have a small advantage.
 structure of the whole network. Nevertheless, Pagerank
 ignores the interests of twitterers, which affects the way
 twitterers influence one another. Our proposed approach                The second most important part of the calculation is the ratio
 addresses the shortcomings of in-degree and PageRank by                between mentions and being mentioned, together with the
 taking into account both the link structure and topical                number of retweets you get with the absolute "reach" of those
 similarity among twitterers.                                           retweets (measured in the number of people who follow
                                                                        people that retweeted you). A similar reach is also accounted
 In       the      context   of      Twitter,     homophily             in the mentions and replies.
 implies that a twitterer follows a friend because she is
 interested in some topics the friend is publishing, and the            Twitter lists are getting used more and more, so they are also
 friend follows back because she finds they share similar               considered in the calculation. The number of lists you appear
 topical interest.                                                      on, the number of people who follow those lists and the
                                                                        number of people, who follow lists you've created are the
 C. Influence Tracking                                                  basic parameters for the calculation. This component adds
                                                                        only a small bit to the final score.
 In many ways Twitter-based marketing is like a pyramid
 scheme. While sending tweets to your own followers is one
 way of broadcasting a message, it is more effective to have
                                                                                                                                                 77
                                                All Rights Reserved © 2012 IJARCSEE
ISSN: 2277 – 9043
                                                   International Journal of Advanced Research in Computer Science and Electronics Engineering
                                                                                                                 Volume 1, Issue 2, April 2012

The three major components currently have the following                       products, services, and spark new waves of tweets gushing
weight in the final score:                                                    with positive sentiment. Doing so over time helps to build
                                                                              the social, and more relevant, business of the future while
     Followers: around 60%                                                   improving relationships to convert followers into
     Mentions and retweets: around 30%                                       stakeholders.
     Lists: around 10%




D. Model for calculating influence                                                                   REFERENCES
                                                                        [1]    Bo Pang and Lillian Lee, “Opinion Mining and Sentiment Analysis”,
The assumptions about the model:                                              Foundations and Trends in Information Retrieval
                                                                              Vol. 2, No 1-2 (2008)
1. Influence(X) = Expected number of people who will read a             [2] Alexander Pak, Patrick Paroubek, “Twitter as a Corpus for Sentiment
                                                                                 Analysis and Opinion Mining”
tweet that X tweets, including all retweets of that tweet. For          [3] Aditya Pal & Scott Counts, “Identifying Topical Authorities in
simplicity, we assume that, if a person reads the same message                Microblogs”, WSDM‟11, February 9–12, 2011, Hong Kong, China,
twice (because of retweets), both readings count.                             Copyright 2011 ACM
                                                                        [4] Jianshu Weng, Ee-Peng Lim, Jing Jiang, Qi He, “TwitterRank: Finding
                                                                              Topic-sensitive Influential Twitterers”, WSDM‟10, February 4–6, 2010,
 2. If X is a member of Followers(Y), then there is a                         New York City, New York, USA Copyright 2010 ACM
1/||Following(X)|| probability that X will read a tweet posted          [5] Abderrahmen ,Mtibaa Martin May Christophe Diot Mostafa          Ammar,
by Y, where Following(X) is the set of people that X follows.           “PeopleRank: Social Opportunistic Forwarding”
                                                                         [6] Albert Bifet and Eibe Frank,” Sentiment Knowledge Discovery in
                                                                        Twitter Streaming Data”
 3. If X reads a tweet from Y, there‟s a constant probability p         [7] Alec Go , Lei Huang and Richa Bhayani,” Twitter Sentiment Analysis”,
that X will retweet it.                                                 CS224N - Final Project Report June 6, 2009.
                                                                        [8] B. Jansen, M. Zhang, K. Sobel, A. Chowdury. The Commerical Impact of
This model is obviously simplistic in all three assumptions.            Social Mediating Technologies: Micro-blogging as Online Word-of-Mouth
                                                                        Branding, 2009.
But it‟s a reasonable first cut. In particular, it accounts for the     [9] C. Manning and H. Schuetze. Foundations of Statistical Natural Language
inflation that occurs from people who follow in the hopes of            Processing,1999.
reciprocity. There‟s less value in being followed by someone            [10] D. Kempe, J. Kleinberg, and E. Tardos., “Maximizing
who follows a lot of people, because that person is less likely         the spread of influence through a social network”, In KDD ‟03: Proceedings
                                                                        of the ninth ACM SIGKDD international conference on Knowledge discovery
to read your messages or retweet them.                                  and data mining, pages 137–146, New York, NY, USA, 2003. ACM.
                                                                        [11] W. Zhang and S. Skiena., “Improving movie gross prediction through
Of course, there‟s room for adding more realism to this model,          news analysis”, In Web Intelligence, pages 301304, 2009.
but it is at least close enough to the truth to be interesting.         [12] Efthymios Kouloumpis, TheresaWilson, Johanna Moore “Twitter
                                                                        Sentiment Analysis:The Good, the Bad and the OMG”, Proceedings of the
                                                                        Fifth International AAAI Conference on Weblogs and Social Media
From this model, it‟s easy to measure someone‟s influence
recursively, assuming that we know the constant retweet
probability p:

Influence(X) = ∑    (1+p * Influence(Y))/||Following(Y)
                   Followers(X)



                      IV. CONCLUSION
  Microblogging nowadays became one of the major types of
  the communication. A recent research has identified it as             • RUSHABH MEHTA
  online word-of-mouth branding.The era of analysis
  paralysis is officially over. Instead of just listening,
  companies can now study people and their interests based
  on what they say and do and also how they color their
  profiles.
                                                                        Rushabh Mehta is a Final Year B.Tech student of Computer Technology at
  This goldmine of insight gives brands the potential to                VJTI. He gave his HSC from Ramnivas Ruia College securing 93.83% &
  improve marketing, promotional and advertising campaigns              stood 45th out of 2,20,000 students in Engineering Entrance Exam. Currently
                                                                        his CGPA is 9.1 at VJTI. He has pursued internships at Cisco & IIT-Bombay.
  to start. As this practice develops, brands can also gather the       Being the technology evangelist, he has co-founded CSI chapter of VJTI as
  intelligence necessary, and widely available, to improve              well.
                                                                                                                                                 78
                                                All Rights Reserved © 2012 IJARCSEE
ISSN: 2277 – 9043
                                                         International Journal of Advanced Research in Computer Science and Electronics Engineering
                                                                                                                       Volume 1, Issue 2, April 2012

                                                                              publications to her credit. She has guides 35 M. Tech. projects & 85 B. Tech.
                                                                              projects.

                                                                              Some publications:
• DHAVAL MEHTA
                                                                              1. Paper on „Grid FTP Protocol combined with Data Grid use for sharing
                                                                              files‟ – Published in „ETCC-08 - National Conference on emerging trends in
                                                                              Computing & Communication‟ at NIT, Hamirpur (30-31 Dec 2008)

                                                                              2. Better Approach to Requirement Engineering with Agile Process Nirmala
                                                                              Shinde, Mansi U Kulkarni, Pramila Chawan
                                                                              Published in RTICSIT- National Conference On Recent Trends In Computer
                                                                              Science & Information Technology at Guru Nanak Dev
                                                                              Engineering College, Mailoor Road Bidar(9-10 May 2009)
Dhaval Mehta is a Final Year B.Tech student of Computer Technology at
VJTI. He gave his HSC from KC College securing 95% & stood 22nd out of        3. Archana S. Sumant & Pramila M. Chawan, Smart Cards & Biometrics
2,20,000 students in Engineering Entrance Exam. With 197/200.Currently his    :Integration Of Two Growing Technologies, International Conference &
CGPA is 8.7 at VJTI. He has co-founded CSI chapter of VJTI as well.           Workshop on Emerging Trends in Technology 2010 (ICWET 2010), ISBN
                                                                              978-1-60558-812-4

• DISHA CHHEDA                                                                5. Mrs. Pramila M. Chawan Mr. Sandip Shingade Mr. Pravin Bansode,
                                                                              Retrieving images on World Wide Web, The 2nd National Conference On
                                                                              Recent Trends in Computer Engineering (RTCE 2009)

                                                                              6. Ajinkya Patil, Apurva Mayekar, Shruti Gurye, Varun Karandikar and
                                                                              Pramila M. Chawan, Audio Streaming on mobile phones, International
                                                                              Journal of Science and Engineering Research 2011,IJSER-11, June 2011.

                                                                              7. Deepali kadam, Nandan Bhalwankar, Rahul Neware, Rajesh Sapkale,
                                                                              Raunika lamage and Pramila M. Chawan, Oracle Real Application Clusters,
Disha Chheda is a Final Year BTech Student of Computer Engineering at
                                                                              International Journal of Science and Engineering Research 2011,IJSER-11,
VJTI. She has completed her Diploma in Computer Technology from
                                                                              June 2011
Vivekanand Education Society’s Polytechnic, Chembur in 2009 and was a
topper in Mumbai division of MSBTE with aggregate of 92% marks. She has
been studying in VJTI since 2009 and will be graduating in 2012. Her
Cumulative Performance Index is 8.7/10. She has participated in many
college-level academic and extra-curricular competitions and has even had
an experience in managing events in different college level festivals.


• CHARMI SHAH




CHARMI SHAH has done diploma in computer engg from
K.J.SOMAIYA POLYTECHNIC having scored 91.38% and right
now pursuing degree from V.J.T.I . She is very hard working, easy to
grasp things and can easily adopt any new environment. She has previously
worked with vb.net and java language for project purposes.


• PRAMILA M.CHAWAN


                   Pramila M. Chawan is currently working as an Associate
                   Professor in the Computer Technology Department of
                   “Veermata Jijabai Technological Institute (V.J.T.I.),
                   Matunga, Mumbai        (INDIA)”. She received her
                   Bachelors’ Degree in Computer Engineering from
                   V.J.T.I., Mumbai University (INDIA) in 1991 & Masters’
                   Degree in Computer Engineering from V.J.T.I., Mumbai
                   University (INDIA) in 1997.She has an academic
experience of 20 years. She has taught Computer related subjects at both
Undergraduate & Post Graduate levels. Her areas of interest are Software
Engineering, Software Project Management, Management Information
Systems, Advanced Computer Architecture & Operating Systems. She has
published 12 papers in National Conferences and 7 papers in International
Conferences & Symposiums. She also has 16 International Journal


                                                                                                                                                        79
                                                     All Rights Reserved © 2012 IJARCSEE

More Related Content

PDF
Fake Product Review Monitoring System
PDF
IRJET-Fake Product Review Monitoring
PDF
A Survey Of Collaborative Filtering Techniques
PDF
IRJET- Fake Profile Identification using Machine Learning
PDF
IRJET- Improved Real-Time Twitter Sentiment Analysis using ML & Word2Vec
PDF
[IJET V2I4P9] Authors: Praveen Jayasankar , Prashanth Jayaraman ,Rachel Hannah
DOCX
Tweet sentiment analysis
PDF
Vol 7 No 1 - November 2013
Fake Product Review Monitoring System
IRJET-Fake Product Review Monitoring
A Survey Of Collaborative Filtering Techniques
IRJET- Fake Profile Identification using Machine Learning
IRJET- Improved Real-Time Twitter Sentiment Analysis using ML & Word2Vec
[IJET V2I4P9] Authors: Praveen Jayasankar , Prashanth Jayaraman ,Rachel Hannah
Tweet sentiment analysis
Vol 7 No 1 - November 2013

What's hot (20)

PDF
F017433947
PDF
Sentiment Analysis of Twitter Data
PDF
E017433538
PDF
IRJET - Twitter Sentimental Analysis
PDF
Detection and Analysis of Twitter Trending Topics via Link-Anomaly Detection
PDF
Netspam: An Efficient Approach to Prevent Spam Messages using Support Vector ...
PDF
IRJET- Effective Countering of Communal Hatred During Disaster Events in Soci...
PPTX
Seminar on detecting fake accounts in social media using machine learning
PPT
Twitter Analytics
PDF
IRJET- Interpreting Public Sentiments Variation by using FB-LDA Technique
PDF
IRJET - Detection of Drug Abuse using Social Media Mining
PDF
IRJET- Analytic System Based on Prediction Analysis of Social Emotions from U...
PDF
DETECTION OF FAKE ACCOUNTS IN INSTAGRAM USING MACHINE LEARNING
PDF
Sentimental Emotion Analysis using Python and Machine Learning
DOC
Seminar Report Mine
PDF
Social networks protection against fake profiles and social bots attacks
PDF
IRJET- Fake News Detection using Logistic Regression
DOC
Monitoring opinion on esop through social media and clustering its polarity
PDF
Sentiment analysis by using fuzzy logic
F017433947
Sentiment Analysis of Twitter Data
E017433538
IRJET - Twitter Sentimental Analysis
Detection and Analysis of Twitter Trending Topics via Link-Anomaly Detection
Netspam: An Efficient Approach to Prevent Spam Messages using Support Vector ...
IRJET- Effective Countering of Communal Hatred During Disaster Events in Soci...
Seminar on detecting fake accounts in social media using machine learning
Twitter Analytics
IRJET- Interpreting Public Sentiments Variation by using FB-LDA Technique
IRJET - Detection of Drug Abuse using Social Media Mining
IRJET- Analytic System Based on Prediction Analysis of Social Emotions from U...
DETECTION OF FAKE ACCOUNTS IN INSTAGRAM USING MACHINE LEARNING
Sentimental Emotion Analysis using Python and Machine Learning
Seminar Report Mine
Social networks protection against fake profiles and social bots attacks
IRJET- Fake News Detection using Logistic Regression
Monitoring opinion on esop through social media and clustering its polarity
Sentiment analysis by using fuzzy logic
Ad

Similar to 32 99-1-pb (20)

PDF
A Baseline Based Deep Learning Approach of Live Tweets
PDF
TWITTER SENTIMENT ANALYSIS
PDF
TWITTER SENTIMENT ANALYSIS
PDF
Sentiment Analysis using Fuzzy logic
PDF
SENTIMENT ANALYSIS BY USING FUZZY LOGIC
PDF
International Journal of Computer Science, Engineering and Information Techno...
PDF
Classification of Sentiment Analysis on Tweets Based on Techniques from Machi...
PDF
[IJET-V2I1P14] Authors:Aditi Verma, Rachana Agarwal, Sameer Bardia, Simran Sh...
PDF
DOCX
NLP journal paper
PDF
Sentiment Analysis and Social Media: How and Why
PDF
Sentiment of Sentence in Tweets: A Review
PDF
W01761157162
PDF
Twitter sentimentanalysis report
PDF
IRJET- Review Analyser with Bot
PPTX
Presentation10-OF-project.pptx
PDF
IRJET - Social Media Intelligence Tools
PDF
UTILIZING TWITTER TO PERFORM AUTONOMOUS SENTIMENT ANALYSIS
PDF
IRJET - Implementation of Twitter Sentimental Analysis According to Hash Tag
PDF
Who gives a tweet
A Baseline Based Deep Learning Approach of Live Tweets
TWITTER SENTIMENT ANALYSIS
TWITTER SENTIMENT ANALYSIS
Sentiment Analysis using Fuzzy logic
SENTIMENT ANALYSIS BY USING FUZZY LOGIC
International Journal of Computer Science, Engineering and Information Techno...
Classification of Sentiment Analysis on Tweets Based on Techniques from Machi...
[IJET-V2I1P14] Authors:Aditi Verma, Rachana Agarwal, Sameer Bardia, Simran Sh...
NLP journal paper
Sentiment Analysis and Social Media: How and Why
Sentiment of Sentence in Tweets: A Review
W01761157162
Twitter sentimentanalysis report
IRJET- Review Analyser with Bot
Presentation10-OF-project.pptx
IRJET - Social Media Intelligence Tools
UTILIZING TWITTER TO PERFORM AUTONOMOUS SENTIMENT ANALYSIS
IRJET - Implementation of Twitter Sentimental Analysis According to Hash Tag
Who gives a tweet
Ad

More from Mahendra Sisodia (15)

PDF
48 144-1-pb
PDF
47 141-1-pb
PDF
45 135-1-pb
PDF
43 131-1-pb
PDF
42 128-1-pb
PDF
41 125-1-pb
PDF
38 116-1-pb
PDF
37 112-1-pb
PDF
34 107-1-pb
PDF
33 102-1-pb
PDF
27 122-1-pb
PDF
24 83-1-pb
PDF
23 79-1-pb
PDF
20 74-1-pb
PDF
46 138-1-pb
48 144-1-pb
47 141-1-pb
45 135-1-pb
43 131-1-pb
42 128-1-pb
41 125-1-pb
38 116-1-pb
37 112-1-pb
34 107-1-pb
33 102-1-pb
27 122-1-pb
24 83-1-pb
23 79-1-pb
20 74-1-pb
46 138-1-pb

Recently uploaded (20)

PDF
Review of recent advances in non-invasive hemoglobin estimation
PDF
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
Encapsulation theory and applications.pdf
PPTX
Big Data Technologies - Introduction.pptx
PDF
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
Approach and Philosophy of On baking technology
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PDF
Unlocking AI with Model Context Protocol (MCP)
PDF
Network Security Unit 5.pdf for BCA BBA.
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PDF
Modernizing your data center with Dell and AMD
Review of recent advances in non-invasive hemoglobin estimation
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
Reach Out and Touch Someone: Haptics and Empathic Computing
Mobile App Security Testing_ A Comprehensive Guide.pdf
Diabetes mellitus diagnosis method based random forest with bat algorithm
NewMind AI Weekly Chronicles - August'25 Week I
Advanced methodologies resolving dimensionality complications for autism neur...
Encapsulation theory and applications.pdf
Big Data Technologies - Introduction.pptx
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
Dropbox Q2 2025 Financial Results & Investor Presentation
20250228 LYD VKU AI Blended-Learning.pptx
Chapter 3 Spatial Domain Image Processing.pdf
Approach and Philosophy of On baking technology
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
Unlocking AI with Model Context Protocol (MCP)
Network Security Unit 5.pdf for BCA BBA.
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
Modernizing your data center with Dell and AMD

32 99-1-pb

  • 1. ISSN: 2277 – 9043 International Journal of Advanced Research in Computer Science and Electronics Engineering Volume 1, Issue 2, April 2012 Sentiment Analysis and Influence Tracking using Twitter Rushabh Mehta, Dhaval Mehta, Disha Chheda, Charmi Shah and Pramila M. Chawan  share opinions on variety of topics and discuss current issues. Abstract— An overwhelming number of consumers are Because of a free format of messages and an easy accessibility active in social media platforms. Within these platforms of microblogging platforms, Internet users tend to shift from consumers are sharing their true feelings about a traditional communication tools (such as traditional blogs or particular brand/product, its features, customer service mailing lists) to microblogging services. and how it stands the competition. With the booming of microblogs on the Web, people have begun to express their As more and more users post about products and services they opinions on a wide variety of topics on Twitter and other use, or express their political and religious views, similar services. In a world where information can bias microblogging[2] web- sites become valuable sources of public opinion it is essential to analyse the propagation and people‟s opinions and sentiments. Such data can be efficiently influence of information in large-scale networks. Recent used for marketing or social studies. We use a dataset formed research studying social media data to rank users by of collected messages from Twitter. Twitter contains a very topical relevance have largely focused on the “retweet", large number of very short messages created by the users of “following" and “mention" relations. We also perform this microblogging platform. The contents of the messages linguistic analysis of the collected corpus and explain vary from personal thoughts to public statements. discovered phenomena. Using the corpus, we build a sentiment classifier, that is able to determine positive, As a microblogging and social networking website, Twitter negative and neutral sentiments for a document. This has become very popular and has grown rapidly. An paper discusses how Twitter data is used as a corpus for increasing number of people are willing to post their opinions analysis by the application of sentiment analysis and a on Twitter, which is now considered a valuable online source study of different algorithms and methods that help to for opinions. As a result, sentiment analysis on Twitter is a track influence and impact of a particular user/brand rapid and effective way of gauging public opinion for business active on the social network. marketing or social studies. For example, a business can retrieve timely feedback on a new product in the market by Index Terms—Twitter, sentiment analysis, influence, evaluating people's opinions on Twitter. As people often talk People Rank, TwitterRank. about various entities (e.g., products, organizations, people, etc.) in a tweet, we perform sentiment analysis at the entity level; that is, we mine people's opinions on specific entities in I. INTRODUCTION each tweet rather than the opinion about each whole sentence or whole tweet. We assume that the entities are provided by Microblogging today has become a very popular the user, e.g., he/she is interested in opinions on iPhone (an communication tool among Internet users. Millions of entity). messages are appearing daily in popular web-sites that provide services for microblogging such as Twitter, Tumblr, In our paper, we study how microblogging can be used for Facebook. Authors of those messages write about their life, sentiment analysis purposes. We show how to use Twitter as a corpus for sentiment analysis and opinion mining. We use  RUSHABH MEHTA Btech Computer Engineer from microblogging and more particularly Twitter for the following VJTI,MUMBAI,INDIA reasons:  DHAVAL MEHTA Btech Computer Engineer from VJTI,MUMBAI,INDIA  DISHA CHHEDA Btech Computer Engineer from • Microblogging platforms are used by different people to VJTI,MUMBAI,INDIA express their opinion about different topics, thus it is a  CHARMI SHAH Btech Computer Engineer from valuable source of people‟s opinions. VJTI,MUMBAI, INDIA  PRAMILA M.CHAWAN • Twitter contains an enormous number of text posts and it Associate Professor Computer Department grows every day. The collected corpus can be arbitrarily large. VJTI,MUMBAI,INDIA 72 All Rights Reserved © 2012 IJARCSEE
  • 2. ISSN: 2277 – 9043 International Journal of Advanced Research in Computer Science and Electronics Engineering Volume 1, Issue 2, April 2012 • Twitter‟s audience varies from regular users to celebrities, 2.”#" called the hashtag is used to mark, organize company representatives, politicians4, and even country or alter tweets according to topics or categories. presidents. Therefore, it is possible to collect text posts of users from different social and interests groups. 3. “@username1" represents that a message is a reply to a user whose user name is username1". Sentiment is an attitude, thought, or judgment prompted by feeling. Sentiment analysis is the process of determining and 4. Emoticons and colloquial expressions are frequently used in measuring the tone, attitude, opinion, and emotional state of tweets, e.g. :-)", lovvve", responses. More precisely, it is the concept of deciding lmao". whether a specific conversation is positive, negative, or neutral. Sentiment analysis has broad applications and 5. External Web links (e.g. http://amze.ly/8K4n0t) encompasses work in classifying subjectivity, polarity, are also commonly found in tweets to refer to some tonality, emotion mining, opinion mining, persuasion analysis, external sources. and affective computing. It is a tool that allows companies to analyze what their customers are saying regarding their 6. Length: Tweets are limited to 140 characters. products and services, and also monitor trends in the opinions This is different from usual opinionated corpora such as and attitudes of their customers toward the products and reviews and blogs, which are usually long. services with respect to their competitors. Another unique characteristic of Twitter data compared to the There are various types of marketing strategies such as mass other opinionated corpora is its volume. It is estimated that marketing, segmentation and one to one marketing. One to one people post about 60 million tweets every day and the number marketing is an effort to find individual customer's needs and is still increasing rapidly. to provide a good response for them. Recommender systems have appeared in e-commerce problems to support product recommendation, which provide one to one marketing. Indeed, II. METHODOLOGY recommender systems individualize the way of recommending products. These systems try to recommend different products to each customer with collecting data of customer preferences A. Data Collection and data mining techniques. Recommender systems have Twitter has an open API that allows anyone to get a list of a recently become popular among many well-known e- user's friends (provided the account is not private) It is businesses such as Amazon.com, MovieFinder.com. therefore easy to create a graph of the network. Since there are more than 100M nodes in this graph with many times that As people often talk about various entities (e.g., products, many edges, it requires a lot of computational power to organizations, people, etc.) in a tweet, we perform sentiment process this entire graph. I therefore propose to focus on a analysis at the entity level; that is, we mine people's opinions smaller subset. However, recently Twitter has been more on specific entities in each tweet rather than the opinion about circumspect in allowing unfettered access to the entire social each whole sentence or whole tweet. We assume that the graph and tweet stream. It allows this access termed the "fire entities are provided by the user, e.g., he/she is interested in hose" to a small chosen set of companies only. Through the opinions on iPhone (an entity). One approach to perform public API, one can only access a single user's tweet stream sentiment analysis is based on a function of opinion words in and his profile information and also the public timeline of context. Opinion words are words that are commonly used to tweets. express positive or negative sentiments, e.g., good" and bad". The approach generally uses a dictionary of opinion words to The Streaming API is the real-time sample of the Twitter identify and determine sentiment orientation (positive, Firehose. This API is for those developers with data intensive negative or neutral). The dictionary is called the opinion needs. If you're looking to build a data mining product or are lexicon. interested in analytics research, the Streaming API is most suited for such things. Streaming API allows for large quantities of keywords to be specified and tracked, retrieving geo-tagged tweets from a certain region, or have the public Twitter Data statuses of a user set returned. This requires you to establish a long-lived HTTP connection and maintain that connection. Twitter has developed its own language conventions.The following are examples of Twitter conventions. The Twitter Search API is a dedicated API for running 1. “RT" is an acronym for retweet, which is put in searches against the real-time index of recent Tweets. If you're front of a tweet to indicate that the user is repeating currently developing on the Search API, and find that your or reposting. application is being rate-limited or you just have aggressive 73 All Rights Reserved © 2012 IJARCSEE
  • 3. ISSN: 2277 – 9043 International Journal of Advanced Research in Computer Science and Electronics Engineering Volume 1, Issue 2, April 2012 querying needs, then you should be moving over to the Streaming API. Classifications Four classifications are used in this corpus: B. Analysis Using Twitter API we collected a corpus of text posts and Positive ● Positive indicator on topic formed a dataset of three classes: positive sentiments, negative Neutral ● Neither positive nor negative sentiments, and a set of objective texts (no sentiments). We indicators queried Twitter for two types of emoticons: ● Mixed positive and negative indicators • Happy emoticons: “:-)”, “:)”, “=)”, “:D” etc. ● On topic, but indicator undeterminable • Sad emoticons: “:-(”, “:(”, “=(”, “;(” etc. ● Simple factual statements ● Questions with no strong emotions The two types of collected corpora will be used to train a indicated classifier to recognize positive and negative sentiments. Because each message cannot exceed 140 characters by the Negative ● Negative indicator on topic rules of the microblogging platform, it is usually composed of Irrelevant ● Not English language a single sentence. Therefore, we assume that an emoticon ● Not on-topic (e.g. spam) within a message represents an emotion for the whole message and all the words of the message are related to this emotion. Sentiment assignment is an extremely subjective exercise. i. Preparing the Data The data set contains information on X million profiles. For this corpus, “Positive” and “Negative” labels were Profile information is limited to the user accounts followed by reserved for tweets which clearly express an emotion or where the user. Since the data set is as of a certain date, the the implications were unambiguous. As a rule of thumb, information is not complete as of today. However, the data set “neutral” was the preferred label for border line cases. contained enough information to create training and test data sets. Examples: All the tweets would be initially considered as a bag of words. There are huge lines at the @apple store. Labeled neutral. From a shoppers perspective this could be for eg. "This is excellent" bad, or it could be a sign of excitement about the product launch. From an investor‟s perspective this could be good, would not be considered as a string but as a bag of three words since it indicates a strong new product launch. "This", "is" and "excellent". I had to wait for six friggin’ hours in line at the @apple Then the stop words such as "the", "a", "with" etc will be store. removed from the bag as these words do not have any Labeled negative. The tweeter is clearly unhappy with the sentiment expressing nature. Once these non-sentimental stop situation and is referring to Apple in the negative sense. words are are removed and hence the corpus refined, the process of sentiment analysis can begin. iii. Preprocessing Data preprocessing consists of three steps: 1) tokenization, ii. Sentiment gradation 2) normalization, and 3) part-of-speech (POS) tagging.[12] The bag of sentiment expressive words i.e. every tweet is now analyzed in parts. A knowledge base is created which has the Emoticons and abbreviations (e.g., OMG, WTF, BRB) are relative sentiments of words denoted by a floating point identified as part of the tokenization process and treated as number ranging from -1 to 1. individual tokens. All the words in the bag are cross checked across this For the normalization process, the presence knowledge base. This gives the sentiment of ever word in the of abbreviations within a tweet is noted and then abbreviations range. After this, taking into consideration the type of words are replaced by their actual meaning (e.g., BRB and their sentiment score, the sentiment of the overall tweet is - > be right back). We also identify informal intensifiers calculated. This would determine what the sentiment of the such as all-caps (e.g., I LOVE this show!!! and character tweet is and how the user has expressed his satisfaction over repetitions (e.g., I‟ve got a mortgage!! happyyyyyy”), note the product or service. their presence in the tweet. All-caps words are made into lowercase, and instances of repeated charaters are replaced by a single character. 74 All Rights Reserved © 2012 IJARCSEE
  • 4. ISSN: 2277 – 9043 International Journal of Advanced Research in Computer Science and Electronics Engineering Volume 1, Issue 2, April 2012 Finally, the presence of any special Twitter b. Usage of links. Users very often include links in their tokens is noted (e.g., #hashtags, usertags, and URLs) and tweets. An equivalence class was created for all URLs. That placeholders indicating the token type are substituted. Our is, a URL like "http://tinyurl.com/cvvg9a" was converted to hope is that this normalization improves the performance of the symbol "URL." the POS tagger, which is the last preprocessing step. c. Usernames. Users often include usernames in their tweets, in order to address messages to particular users. A de facto iv. Feature-based extraction standard is to include the @ symbol before the username (e.g. @alecmgo). An equivalence class was made for all words that The collected dataset is used to extract features that will be started with the @ symbol. used to train our sentiment classifier. We used the presence of an n-gram as a binary feature, while for general information d. Removing the query term. Query terms were stripped out retrieval purposes, the frequency of a keyword‟s occurrence is from Tweets, to avoid having a more suitable feature, since the overall sentiment may not the query term affect the classification. necessarily be indicated through the repeated use of keywords. A. Process of constructing n-grams 2.Bigrams The reason we experimented with bigrams was we wanted to 1. Filtering – we remove URL links (e.g. smooth out instances like 'not good' or 'not bad'. When http://example.com), Twitter user names (e.g. negation as an explicit feature didn't help, we thought of @alex – with symbol @ indicating a user name), experimenting with bigrams. Twitter special words (such as “RT”), and emoticons. 2. Tokenization – we segment text by splitting it by B. Negate as a features spaces and punctuation marks, and form a bag of NEGATE is added as a specific feature which is added when words. However, we make sure that short forms such “not” or „n‟t” are observed in the dataset. [7] as “don‟t”, “I‟ll”, “she‟d” will remain as one word. C. Part of Speech (POS) features 3. Removing stopwords – we remove articles (“a”, “an”, We felt like POS tags would be a useful feature since how you “the”) from the bag of words. made use of a particular word. For example, „over‟ as a verb has a negative connotation whereas „over‟ as the noun, would 4. Constructing n-grams – we make a set of n-grams out refer to the cricket over which by itself doesn‟t carry any of consecutive words. A negation (such as “no” and negative or positive connotation. “not”) is attached to a word which precedes it or follows it. For example, a sentence “I do not like fish” will form D. Lexicon features two bigrams: “I do+not”, “do+not like”, “not+like fish”. Words listed the MPQA subjectivity lexicon (Wilson, Such a procedure allows to improve the accuracy of the Wiebe, and Hoffmann 2009) are tagged with their prior classification since the negation plays a special role in an polarity:positive, negative, or neutral.We create three features opinion and sentiment expression. based on the presence of any words from the lexicon. 1.Unigram iv. Literature Review on taggers Building the unigram model took special care because the The models included for sentiment analysis in our paper can Twitter language model is very different from other domains be downloaded for the POS tagger website at from past research. The unigram feature extractor addressed http://nlp.stanford.edu/software/tagger.shtml . All taggers are the following issues: accompanied by the props files used to create them,given below is a more detailed information about the creation of the a. Tweets contain very casual language. For example, you taggers. can search "hungry" with a random number of u's in the middle of the word on http://search.twitter.com to understand For English, the bidirectional taggers are slightly more this. Here is an example sampling: accurate, but tag much more slowly; choose the appropriate huuuungry: 17 results in the last day tagger based on your speed/performance needs. huuuuuuungry: 4 results in the last day huuuuuuuuuungry: 1 result in the last day English taggers Besides showing that people are hungry, this emphasizes the --------------------------- casual nature of Twitter and wsj-0-18-bidirectional-distsim.tagger the disregard for correct spelling. 75 All Rights Reserved © 2012 IJARCSEE
  • 5. ISSN: 2277 – 9043 International Journal of Advanced Research in Computer Science and Electronics Engineering Volume 1, Issue 2, April 2012 Trained on WSJ sections 0-18 using a bidirectional construct some implicit signals from the user's content stream architecture and including word shape and distributional that are analogous to recommendation. Specifically, I look at similarity features. three signals that are counted as up votes. First, if a user Penn Treebank tagset. follows another account, that is considered a positive rating Performance: for the account that is followed. Second, if a user retweets (i.e. 97.28% correct on WSJ 19-21 echoes a tweet to his own tweet stream), that can also (90.46% correct on unknown words) considered a positive rating. Thirdly, if a user shares a "hashtag" with another user, that is considered a positive wsj-0-18-left3words.tagger rating for the user who is being followed. Sharing a hashtag Trained on WSJ sections 0-18 using the left3words implies that the two tweets are related to the same topic, architecture and includes word shape features. Penn tagset. although they may express two entirely different opinions (for Performance: e.g. the recent controversy around wikileaks elicited a storm of 96.97% correct on WSJ 19-21 either vehement approval or disapproval from twitter users, (88.85% correct on unknown words) but they used the same #wikileaks hashtag). wsj-0-18-left3words-distsim.tagger Trained on WSJ sections 0-18 using the left3words architecture and includes word shape and distributional similarity features. Penn tagset. III.ALGORITHMS Performance: 97.01% correct on WSJ 19-21 (89.81% correct on unknown words) B. PeopleRank Algorithm In general, global knowledge of network topology can make english-left3words-distsim.tagger for very efficient routing and forwarding decisions. Trained on WSJ sections 0-18 and extra parser training data Collecting and exchanging topology information in using the left3words architecture and includes word shape and opportunistic networks is cumbersome because of their distributional similarity features. Penn tagset. intermittent connectivity and unpredictable mobility. english-bidirectional-distsim.tagger PeopleRank is inspired by the PageRank [5] algorithm Trained on WSJ sections 0-18 using a bidirectional employed by Google to rank web pages. By crawling the architecture and including word shape and distributional entire web, this algorithm measures the relative importance similarity features. of a page within a graph (web). Motivated by the success of Penn Treebank tagset. this algorithm, we propose to apply a similar technique, which we call PeopleRank to rank the nodes in a social wsj-0-18-caseless-left3words-distsim.tagger graph. The main idea is that nodes with a higher Trained on WSJ sections 0-18 left3words architecture and PeopleRank value will generally be more “central” in the includes word shape and distributional similarity features. social graph. Penn tagset. Ignores case. a. Centralized Peoplerank english-caseless-left3words-distsim.tagger In PeopleRank we tag people as “important” when they are Trained on WSJ sections 0-18 and extra parser training data linked (in a social context) to many other “important” using the left3words architecture and includes word shape and people. We assume that only neighbors in the social graph distributional similarity features. Penn tagset. Ignores case. have an impact of the popularity. a social graph Gs = (Vs,Es) as a finite undirected graph with v. Inferring Edge Strength a vertex set V and an edge set Es. An edge (u, v) ∈ Es if, In the simplest setting, a user being connected to another user and only if, there is a social relation between nodes u and v. can be used as a preference signal. In recent times, given the In this paper, we define a social relationship between two explosive growth of twitter, there have emerged a large nodes u and v either (i) if they are declared friends, or (ii) if number of "bot" accounts that seek to follow as many users as they are sharing k common interests. possible in the hope that unwitting users will follow them back. Therefore, looking at "followed" accounts yields more information about the account holder's preferences rather than b. Distributed PeopleRank "follower" accounts. The distributed version of PeopleRank is shown in Algorithm. In this version, whenever two neighbor nodes in In a traditional item recommendation setting, users rate items the social graph meet, they exchange two pieces of on a scale of 1-5 or by an up or down vote. In twitter, there is information: no explicit rating of accounts by other users. However, we can 76 All Rights Reserved © 2012 IJARCSEE
  • 6. ISSN: 2277 – 9043 International Journal of Advanced Research in Computer Science and Electronics Engineering Volume 1, Issue 2, April 2012 others repeat your message either through literal retweets or (i) their current PeopleRank values; and more subtle gestures, such as replies and repeating the URLs (ii) the number of social graph neighbors they have. that you tweet. If someone on Twitter receives your message Then, the two neighbors update their PeopleRank through a trusted intermediary, it is assigned a much greater values level of trust. So the goal is to get influential people to follow you and then act as a conduit for your marketing.[10] Influence is a sophisticated measure of a user‟s relative importance among the entire Twitter network. C. TwitterRank Uses various statistics about a handle as parameters like TwitterRank measures the influence taking both the topical number of followers, retweets, mentions, URL‟s shared. similarity between users and the link structure into account. There are three major components that add up to the score: In a dataset prepared for this study, it is observed that 1)72.4% your followers, your mentions and retweets, and your lists, all of the users follow more than 80% of their followers, and (2) accounted as ratios between you and others. 80.5% of the user have 80% of their friends follow them back.[4]Our study reveals that the presence of “reciprocity” can Followers is the strongest component of the calculation is the be explained by phenomenon of homophily.Based on this number of followers you have. In my opinion, your presence finding, TwitterRank, an extension of PageRank algorithm, is on Twitter and getting followers can be influenced by at least proposed to measure the influence of users in Twitter. the following three major factors concerning you and your TwitterRank measures the influence taking both the topical Twitter account: similarity between users and the link structure into account. Experimental results show that TwiterRank outperforms the i. Persona – how known you are. Measured by the one Twitter currently uses and other related algorithms, number of followers you have, our time on including the original PageRank and Topic-sensitive PageRank. Twitter. ii. Engagement – how engaged you are. Measured by the number of followers you have, compared the First, it potentially brings order to the real-time web in that it number of people you follow; Measured by the allows the search results to be sorted by the authority/influence number of followers you have, compared to the of the contributing twitterers giving a timely update of the number of mentions and retweets you‟ve made. thoughts of influential twitterers. Second, Twitter is also a iii. Wits – how smart and creative your tweets are. marketing platform. Targeting those influential users will Measured by the number of followers you have increase the efficiency of the compared to the total number of tweets you've marketing campaign. For example, a handphone manufacturer made. can engage those twitterers influential in topics about IT gadgets to potentially influence more people . There are also For this part, the followers/following ratio the weight of 3, the applications that utilize Twitter to gather opinions and followers/tweets a weight of 2 and the followers/time a weight information on particular topics. Identifying influential of 1. The followers/(mentions + retweets) has a weight of 0.5 twitterers for interesting topics can improve the quality of and works in the negative way, so people who bother other opinions gathered. people get a bit of a minus to their followers result. Besides, those who are able to get the same number of followers PageRank improves over in-degree by considering the link without mentioning people, must have a small advantage. structure of the whole network. Nevertheless, Pagerank ignores the interests of twitterers, which affects the way twitterers influence one another. Our proposed approach The second most important part of the calculation is the ratio addresses the shortcomings of in-degree and PageRank by between mentions and being mentioned, together with the taking into account both the link structure and topical number of retweets you get with the absolute "reach" of those similarity among twitterers. retweets (measured in the number of people who follow people that retweeted you). A similar reach is also accounted In the context of Twitter, homophily in the mentions and replies. implies that a twitterer follows a friend because she is interested in some topics the friend is publishing, and the Twitter lists are getting used more and more, so they are also friend follows back because she finds they share similar considered in the calculation. The number of lists you appear topical interest. on, the number of people who follow those lists and the number of people, who follow lists you've created are the C. Influence Tracking basic parameters for the calculation. This component adds only a small bit to the final score. In many ways Twitter-based marketing is like a pyramid scheme. While sending tweets to your own followers is one way of broadcasting a message, it is more effective to have 77 All Rights Reserved © 2012 IJARCSEE
  • 7. ISSN: 2277 – 9043 International Journal of Advanced Research in Computer Science and Electronics Engineering Volume 1, Issue 2, April 2012 The three major components currently have the following products, services, and spark new waves of tweets gushing weight in the final score: with positive sentiment. Doing so over time helps to build the social, and more relevant, business of the future while  Followers: around 60% improving relationships to convert followers into  Mentions and retweets: around 30% stakeholders.  Lists: around 10% D. Model for calculating influence REFERENCES [1] Bo Pang and Lillian Lee, “Opinion Mining and Sentiment Analysis”, The assumptions about the model: Foundations and Trends in Information Retrieval Vol. 2, No 1-2 (2008) 1. Influence(X) = Expected number of people who will read a [2] Alexander Pak, Patrick Paroubek, “Twitter as a Corpus for Sentiment Analysis and Opinion Mining” tweet that X tweets, including all retweets of that tweet. For [3] Aditya Pal & Scott Counts, “Identifying Topical Authorities in simplicity, we assume that, if a person reads the same message Microblogs”, WSDM‟11, February 9–12, 2011, Hong Kong, China, twice (because of retweets), both readings count. Copyright 2011 ACM [4] Jianshu Weng, Ee-Peng Lim, Jing Jiang, Qi He, “TwitterRank: Finding Topic-sensitive Influential Twitterers”, WSDM‟10, February 4–6, 2010, 2. If X is a member of Followers(Y), then there is a New York City, New York, USA Copyright 2010 ACM 1/||Following(X)|| probability that X will read a tweet posted [5] Abderrahmen ,Mtibaa Martin May Christophe Diot Mostafa Ammar, by Y, where Following(X) is the set of people that X follows. “PeopleRank: Social Opportunistic Forwarding” [6] Albert Bifet and Eibe Frank,” Sentiment Knowledge Discovery in Twitter Streaming Data” 3. If X reads a tweet from Y, there‟s a constant probability p [7] Alec Go , Lei Huang and Richa Bhayani,” Twitter Sentiment Analysis”, that X will retweet it. CS224N - Final Project Report June 6, 2009. [8] B. Jansen, M. Zhang, K. Sobel, A. Chowdury. The Commerical Impact of This model is obviously simplistic in all three assumptions. Social Mediating Technologies: Micro-blogging as Online Word-of-Mouth Branding, 2009. But it‟s a reasonable first cut. In particular, it accounts for the [9] C. Manning and H. Schuetze. Foundations of Statistical Natural Language inflation that occurs from people who follow in the hopes of Processing,1999. reciprocity. There‟s less value in being followed by someone [10] D. Kempe, J. Kleinberg, and E. Tardos., “Maximizing who follows a lot of people, because that person is less likely the spread of influence through a social network”, In KDD ‟03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery to read your messages or retweet them. and data mining, pages 137–146, New York, NY, USA, 2003. ACM. [11] W. Zhang and S. Skiena., “Improving movie gross prediction through Of course, there‟s room for adding more realism to this model, news analysis”, In Web Intelligence, pages 301304, 2009. but it is at least close enough to the truth to be interesting. [12] Efthymios Kouloumpis, TheresaWilson, Johanna Moore “Twitter Sentiment Analysis:The Good, the Bad and the OMG”, Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media From this model, it‟s easy to measure someone‟s influence recursively, assuming that we know the constant retweet probability p: Influence(X) = ∑ (1+p * Influence(Y))/||Following(Y) Followers(X) IV. CONCLUSION Microblogging nowadays became one of the major types of the communication. A recent research has identified it as • RUSHABH MEHTA online word-of-mouth branding.The era of analysis paralysis is officially over. Instead of just listening, companies can now study people and their interests based on what they say and do and also how they color their profiles. Rushabh Mehta is a Final Year B.Tech student of Computer Technology at This goldmine of insight gives brands the potential to VJTI. He gave his HSC from Ramnivas Ruia College securing 93.83% & improve marketing, promotional and advertising campaigns stood 45th out of 2,20,000 students in Engineering Entrance Exam. Currently his CGPA is 9.1 at VJTI. He has pursued internships at Cisco & IIT-Bombay. to start. As this practice develops, brands can also gather the Being the technology evangelist, he has co-founded CSI chapter of VJTI as intelligence necessary, and widely available, to improve well. 78 All Rights Reserved © 2012 IJARCSEE
  • 8. ISSN: 2277 – 9043 International Journal of Advanced Research in Computer Science and Electronics Engineering Volume 1, Issue 2, April 2012 publications to her credit. She has guides 35 M. Tech. projects & 85 B. Tech. projects. Some publications: • DHAVAL MEHTA 1. Paper on „Grid FTP Protocol combined with Data Grid use for sharing files‟ – Published in „ETCC-08 - National Conference on emerging trends in Computing & Communication‟ at NIT, Hamirpur (30-31 Dec 2008) 2. Better Approach to Requirement Engineering with Agile Process Nirmala Shinde, Mansi U Kulkarni, Pramila Chawan Published in RTICSIT- National Conference On Recent Trends In Computer Science & Information Technology at Guru Nanak Dev Engineering College, Mailoor Road Bidar(9-10 May 2009) Dhaval Mehta is a Final Year B.Tech student of Computer Technology at VJTI. He gave his HSC from KC College securing 95% & stood 22nd out of 3. Archana S. Sumant & Pramila M. Chawan, Smart Cards & Biometrics 2,20,000 students in Engineering Entrance Exam. With 197/200.Currently his :Integration Of Two Growing Technologies, International Conference & CGPA is 8.7 at VJTI. He has co-founded CSI chapter of VJTI as well. Workshop on Emerging Trends in Technology 2010 (ICWET 2010), ISBN 978-1-60558-812-4 • DISHA CHHEDA 5. Mrs. Pramila M. Chawan Mr. Sandip Shingade Mr. Pravin Bansode, Retrieving images on World Wide Web, The 2nd National Conference On Recent Trends in Computer Engineering (RTCE 2009) 6. Ajinkya Patil, Apurva Mayekar, Shruti Gurye, Varun Karandikar and Pramila M. Chawan, Audio Streaming on mobile phones, International Journal of Science and Engineering Research 2011,IJSER-11, June 2011. 7. Deepali kadam, Nandan Bhalwankar, Rahul Neware, Rajesh Sapkale, Raunika lamage and Pramila M. Chawan, Oracle Real Application Clusters, Disha Chheda is a Final Year BTech Student of Computer Engineering at International Journal of Science and Engineering Research 2011,IJSER-11, VJTI. She has completed her Diploma in Computer Technology from June 2011 Vivekanand Education Society’s Polytechnic, Chembur in 2009 and was a topper in Mumbai division of MSBTE with aggregate of 92% marks. She has been studying in VJTI since 2009 and will be graduating in 2012. Her Cumulative Performance Index is 8.7/10. She has participated in many college-level academic and extra-curricular competitions and has even had an experience in managing events in different college level festivals. • CHARMI SHAH CHARMI SHAH has done diploma in computer engg from K.J.SOMAIYA POLYTECHNIC having scored 91.38% and right now pursuing degree from V.J.T.I . She is very hard working, easy to grasp things and can easily adopt any new environment. She has previously worked with vb.net and java language for project purposes. • PRAMILA M.CHAWAN Pramila M. Chawan is currently working as an Associate Professor in the Computer Technology Department of “Veermata Jijabai Technological Institute (V.J.T.I.), Matunga, Mumbai (INDIA)”. She received her Bachelors’ Degree in Computer Engineering from V.J.T.I., Mumbai University (INDIA) in 1991 & Masters’ Degree in Computer Engineering from V.J.T.I., Mumbai University (INDIA) in 1997.She has an academic experience of 20 years. She has taught Computer related subjects at both Undergraduate & Post Graduate levels. Her areas of interest are Software Engineering, Software Project Management, Management Information Systems, Advanced Computer Architecture & Operating Systems. She has published 12 papers in National Conferences and 7 papers in International Conferences & Symposiums. She also has 16 International Journal 79 All Rights Reserved © 2012 IJARCSEE