CA652A




    Semantic Web
    Based Sentiment
    Engine
    A system to determine online sentiment
    on current affairs for the purpose of
    analysis and prediction




                            11210889
                            52595354
ABSTRACT
Sentiment analysis involves classifying opinions expressed in text as "positive", "negative"
or "neutral". Its purpose is to help extract valuable information and insight from large
volumes of unstructured data. The proposed system will determine online sentiment on
current affairs for the purpose of analysis and prediction. For the sentiment analysis itself,
a clustering-based approach, a recent advancement in this area, is recommended. Various
APIs will assist in extracting other data such as location and time. For evaluation, the
Pang et al. movie review data sets are recommended to validate basic functionality, and
real-life data from the 2008 US presidential race to evaluate the full functionality of the
system. Multiple industries, from marketing companies to hotels, are identified as potential
users, which adds to the commercialisation potential of the system.




A report submitted to Dublin City University, School of Computing for module
CA652: Information Access, 2011/2012.

We hereby certify that the work presented and the material contained herein is
my/our own except where explicitly stated references to other material are made.


Student Numbers

52595354

11210889




TABLE OF CONTENTS
Abstract
Introduction
Concept Overview
   Constraints and Limitations
Functional Description
   Sentiment Search Functions
   Techniques
       Time Parameter Based Search
       Geographical Extraction Based
       Social Sentiment Extraction Based Data
       Graphical Data Generation Tools
   Pros & Cons of Proposed System
Evaluation Plan
   Stage One Testing - Validation
   Stage Two Testing – Functionality Testing
   Stage Three Testing – Real Life Data
Commercialisation Potential
Conclusion and Further Research Opportunities
References




Table of Figures

Figure 1 - Sentiment Analysis framework
Figure 2 - Cluster Method Accuracy/Efficiency
Figure 3 - Graphical Representation of content
Figure 4 - Basic Validation Testing Results
Figure 5 - Two Topic Validation Testing
Figure 6 - Sample Test Output (Obama)
Figure 7 - Sample Test Data (McCain)




INTRODUCTION
The ‘media’ as we now conceptualise it has changed dramatically. With the internet,
people have an opportunity to ‘weigh in’ on events by providing their opinions and
feedback in real time through blogs, forums, social networks and commenting
systems on news websites. There is a growing interest in measuring sentiment, which
can be attributed to the dramatic increase in the volume of digitised information.

“An increasing number of studies in political communication focus on the “sentiment” or
“tone” of news content, political speeches, or advertisements” (Young, L, & Soroka, S 2012)

This report discusses the concept of developing a Semantic Web based sentiment
engine that will be able to analyse public sentiment on current issues, from politics
to reality TV shows. By tracking popular opinion through social media channels and
leveraging research in the area of sentiment analysis, accurate predictions could
become possible on events ranging from presidential elections to the X-Factor
competition.


CONCEPT OVERVIEW
This proposed system is not a standard sentiment engine that returns static data; it
offers increased functionality to assist with data interpretation by allowing end
users to customise their search, filter the returned data under multiple parameters
and view graphical representations of the results.


CONSTRAINTS AND LIMITATIONS
The limitations of this concept are not due to technological constraints but are
simply down to the volatility of public opinion, which is something that cannot be
remedied or corrected by technology.

Another limitation is the scope of the opinion being captured. Users of social
media and participants in online forums are statistically of a younger age group. The
lack of inclusion of the opinion of older age groups could greatly affect the accuracy
of the data, as it would not be entirely representative. This imbalance would
particularly affect politics, with older groups statistically more likely to vote.


FUNCTIONAL DESCRIPTION
SENTIMENT SEARCH FUNCTIONS

   •   Users can enter multiple search terms for the purpose of data comparison.
       Other features would be utilised to improve the analysis returns.
   •   Multiple Search Parameters
          o Time Frame Defined Search - Data retrieved can be limited to a specific
              time frame.
          o Geographical Location Based Search – Search data retrieved can be
              filtered by location of users.
          o Narrow Search Scope – Select websites to exclude, or restrict the search
              to a small number of websites.
   •   Graphical representations of the data are generated.
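
The search parameters listed above could be modelled roughly as follows. This is a minimal sketch only: the `Post` record layout, the `SearchQuery` field names and the exact-match treatment of location are assumptions made for illustration and are not specified in this report.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Post:
    """One piece of retrieved content (hypothetical record layout)."""
    text: str
    site: str
    location: str
    posted_at: datetime


@dataclass
class SearchQuery:
    """Search parameters from the list above; all field names are illustrative."""
    terms: List[str]                       # one or more search terms to compare
    start: Optional[datetime] = None       # time frame defined search
    end: Optional[datetime] = None
    location: Optional[str] = None         # geographical location based search
    include_sites: List[str] = field(default_factory=list)  # restrict to these sites
    exclude_sites: List[str] = field(default_factory=list)  # never use these sites

    def matches(self, post: Post) -> bool:
        """Return True if the post satisfies every parameter that was set."""
        text = post.text.lower()
        if not any(term.lower() in text for term in self.terms):
            return False
        if self.start and post.posted_at < self.start:
            return False
        if self.end and post.posted_at > self.end:
            return False
        if self.location and post.location.lower() != self.location.lower():
            return False
        if self.include_sites and post.site not in self.include_sites:
            return False
        return post.site not in self.exclude_sites


def run_search(query: SearchQuery, posts: List[Post]) -> List[Post]:
    """Keep only the posts that match the query."""
    return [post for post in posts if query.matches(post)]
```

A query such as `SearchQuery(terms=["obama", "mccain"], location="Ohio")` would then return the posts to be passed on to the sentiment analysis step, one result set per comparison.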

TECHNIQUES

Sentiment Analysis Techniques

There is much research in the area of sentiment analysis, the primary objective of which
is to find a technique with no trade-off between speed and accuracy. Several new and
emerging techniques were reviewed in order to identify the best fit for this system.

   •   Proximity-Based Approach (Hasan, S, & Adjeroh, D 2011)
          o This proposed method uses proximity-based features to determine
              sentiment: proximity distribution, mutual information between
              proximity types, and proximity patterns.

   •   Based on Annotation (Shukla, A 2011)
          o This proposed method counts all the annotations present and calculates
              sentiment scores for all annotations, including comments, to
              determine sentiment.

   •   Sentence-level Lexical Based Semantic Orientation (Khan, A et al, 2011)
          o This proposed method uses SentiWordNet to calculate the semantic
              ‘score’ of sentences it has classified as subjective from reviews and
              blog comments.

   •   Machine Learning approach to contextual information (Yang, C et al, 2008)
          o This proposed method differentiates itself from others by taking
              context into account when determining the sentiment category. Its
              primary focus and test data sets have been blog posts. Figure 1 below
              shows the framework employed.




                        FIGURE 1 - SENTIMENT ANALYSIS FRAMEWORK


   •   Clustering-Based Sentiment Analysis Approach (Li, G, & Liu, F 2012)

The method deemed most appropriate for this proposed system is based on an
article published in the Journal Of Information Science in April 2012, which outlined
the Clustering-Based Sentiment Analysis approach. It proposed that by applying a
“TF-IDF weighting method, a voting mechanism and importing term scores, an acceptable
and stable clustering result can be obtained” (Li, G, & Liu, F 2012). The evaluation
results were the most impressive of all techniques reviewed as part of this research. It
appears to have performed well in terms of both accuracy and efficiency with no
need for human participation, as can be seen from Figure 2.




                         FIGURE 2 - CLUSTER METHOD ACCURACY/EFFICIENCY


Apart from its accuracy and efficiency, this technique was deemed the most suitable
because it can be applied universally to any data set. The other techniques researched
were developed for particular data types, such as customer reviews or blogs, and their
evaluations appear to suggest they do not perform as well outside of these data types.
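
The three ingredients quoted above (TF-IDF weighting, a voting mechanism and imported term scores) can be illustrated with a small sketch. It is not the implementation of Li and Liu: the use of 2-means, the least-similar-document starting rule, majority voting across differently seeded runs, and naming the clusters with a tiny hand-made `lexicon` are all assumptions made here to keep the example self-contained.

```python
import math
import random
from collections import Counter
from typing import Dict, List

Vector = Dict[str, float]


def unit(vec: Vector) -> Vector:
    """Scale a sparse vector to unit length (all-zero vectors are left as-is)."""
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}


def tfidf_vectors(docs: List[List[str]]) -> List[Vector]:
    """Weight each term by term frequency times inverse document frequency."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        unit({t: (c / len(doc)) * math.log(n / df[t]) for t, c in Counter(doc).items()})
        for doc in docs
    ]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of sparse vectors (cosine similarity when both are unit length)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def kmeans_two(vectors: List[Vector], seed: int, rounds: int = 20) -> List[int]:
    """One 2-means run; the seed only changes which document starts cluster 0."""
    first = random.Random(seed).choice(vectors)
    second = min(vectors, key=lambda v: dot(v, first))  # least similar document
    centroids = [first, second]
    labels: List[int] = []
    for _ in range(rounds):
        new = [0 if dot(v, centroids[0]) >= dot(v, centroids[1]) else 1 for v in vectors]
        if new == labels:
            break
        labels = new
        for k in (0, 1):
            total: Counter = Counter()
            for v, label in zip(vectors, labels):
                if label == k:
                    total.update(v)  # adds the weights term by term
            if total:
                centroids[k] = unit(dict(total))
    return labels


def vote(runs: List[List[int]]) -> List[int]:
    """Align cluster ids to the first run, then take a majority vote per document."""
    reference = runs[0]
    aligned = []
    for labels in runs:
        agree = sum(a == b for a, b in zip(labels, reference))
        aligned.append(labels if agree * 2 >= len(labels) else [1 - l for l in labels])
    return [1 if sum(col) * 2 > len(aligned) else 0 for col in zip(*aligned)]


def cluster_sentiment(docs: List[List[str]], lexicon: Dict[str, float], runs: int = 5) -> List[str]:
    """Cluster tokenised documents, then name the clusters using term scores."""
    vectors = tfidf_vectors(docs)
    labels = vote([kmeans_two(vectors, seed) for seed in range(runs)])
    score = [0.0, 0.0]
    for doc, label in zip(docs, labels):
        score[label] += sum(lexicon.get(t, 0.0) for t in doc)
    positive = 0 if score[0] >= score[1] else 1
    return ["positive" if l == positive else "negative" for l in labels]
```

No labelled training data is needed, which is the property the report highlights: the documents are grouped first, and the term scores are only used afterwards to decide which group is the positive one.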

TIME PARAMETER BASED SEARCH

This sentiment engine would make use of the adaptable Librato API libraries to
allow sentiment returns to be time sensitive. This would allow a user to evaluate
how sentiment is changing over time, or what sentiment was during specific time
periods.
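
As an illustration of time-sensitive returns, the sketch below groups already classified items by calendar day and derives a simple net score per day. It uses only the Python standard library and does not call the Librato API; the `(timestamp, label)` input format and the `net_sentiment` measure are assumptions made for the example.

```python
from collections import Counter, defaultdict
from datetime import date, datetime
from typing import Dict, Iterable, Tuple


def sentiment_by_day(items: Iterable[Tuple[datetime, str]]) -> Dict[date, Counter]:
    """Count sentiment labels per calendar day, returned in date order."""
    buckets: Dict[date, Counter] = defaultdict(Counter)
    for posted_at, label in items:
        buckets[posted_at.date()][label] += 1
    return dict(sorted(buckets.items()))


def net_sentiment(counts: Counter) -> float:
    """(positive - negative) / total, a value between -1 and 1."""
    total = sum(counts.values())
    return (counts["positive"] - counts["negative"]) / total if total else 0.0
```

Plotting `net_sentiment` for each day shows how sentiment changes over time, and restricting the input to a date range answers what sentiment was during a specific period.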

GEOGRAPHICAL EXTRACTION BASED

Adding a geographical element would be a unique feature, allowing sentiment results
to be mapped. Location content would preferably be pulled from the Twitter API, as
it gives access to the Twitter profile location. Comment systems used by news
websites, such as that of the Irish Times, request a location prior to posting a comment.
The Facebook API allows access to a user's location if the relevant privacy setting is
turned on. OAuth would be used to allow users of the sentiment engine to explore
the opinions of their friends and networked associates and see where these fit on the
sentiment scales. Other free-use location APIs may also be needed.
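
Profile locations are typically free text, so some normalisation is needed before results can be mapped. The sketch below is one simple way to do it and makes no calls to the Twitter or Facebook APIs; the rule of keeping only the text before the first comma, and the `"unknown"` bucket, are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def normalise_location(raw: Optional[str]) -> str:
    """Reduce a free-text profile location to a comparable key."""
    if not raw or not raw.strip():
        return "unknown"
    return raw.split(",")[0].strip().lower()


def positive_share_by_location(items: Iterable[Tuple[Optional[str], str]]) -> Dict[str, float]:
    """Fraction of items labelled positive for each normalised location."""
    totals: Dict[str, int] = defaultdict(int)
    positives: Dict[str, int] = defaultdict(int)
    for raw_location, label in items:
        key = normalise_location(raw_location)
        totals[key] += 1
        if label == "positive":
            positives[key] += 1
    return {key: positives[key] / totals[key] for key in totals}
```

A production system would more likely resolve locations through a geocoding service, but the aggregation step would look much the same.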




SOCIAL SENTIMENT EXTRACTION BASED DATA

The content used to create a matrix of information within which to evaluate sentiment
via FLP would likely include, but not be limited to: Twitter; Disqus; Livefyre;
IntenseDebate; Drupal comments; WordPress comments; other blog posts; scraped
open Facebook and fan page comments; the Facebook comment system; text comments;
G+ posts; Slideshare.net; Pinterest pins; Google News articles; comments on various
bookmarking sites such as fark.com and reddit; and other language-relevant wire news
services.

GRAPHICAL DATA GENERATION TOOLS

Graphical representations of the data are generated. The results could be rendered as
web-based Flash objects or, so that results are useful on mobile devices and tablets,
in a way that is compliant with the evolving HTML5 standards and with iOS 5, given
the animosity between Apple and Adobe over Flash. These reports would be
exportable to Crystal Reports.

[Bar chart: counts of Positive, Neutral and Negative items for Candidate A and Candidate B]

                     FIGURE 3 - GRAPHICAL REPRESENTATION OF CONTENT


PROS & CONS OF PROPOSED SYSTEM

The primary argument for sentiment engines built on the Semantic Web and linked data
is the new information and insight that can be gleaned from them. The ability to know
relative and positional sentiment can be useful in many analytical or informational
arbitrage situations.


In terms of the cons, the primary concern would be data quality. Problems with data
quality are a huge issue and can skew any resulting analysis. The extent of the data
quality problem has often been exposed by information activists working in the
open data movement.

Secondly, privacy concerns, and staying within the spirit and letter of the relevant
data privacy laws of the regulatory regime one operates under, may at times be an
issue. This can be tricky given the interconnected nature of the web.

Lastly, inaccuracies in the data, and its being organised in “short sets” rather than
deeper data, may create false sentiments. Is there enough data being looked at to
create a realistic positive or negative sentiment? Some analysis may need additional
parsing to tease out, for example, initial heated emotional responses from the rational
morning-after response.


EVALUATION PLAN

STAGE ONE TESTING - VALIDATION
The evaluation plan would begin with simple software validation. The first test case
would validate the fundamental functionality of the system: its ability to
differentiate between sentiments. The data set to be used is the movie review
data from the Pang et al. experiments [1]. Movie review data is widely regarded as the
most challenging data for sentiment engines to analyse. This can be attributed to the
fact that a positive review may contain descriptions of gory or violent scenes and,
equally, a negative review could contain descriptions of light-hearted, pleasant
scenes. For additional testing, other data sets could be used for each iteration of this
dynamic testing stage.
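
Validation against a labelled data set such as the movie reviews amounts to comparing the engine's output with the known labels. The sketch below shows one way to compute accuracy and a confusion table; the `classify` callable stands in for the sentiment engine and the `(text, expected_label)` input format is an assumption, not the actual layout of the Pang et al. data.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple


def evaluate(
    classify: Callable[[str], str],
    labelled: Iterable[Tuple[str, str]],
) -> Tuple[float, Counter]:
    """Return (accuracy, confusion) where confusion[(expected, predicted)] is a count."""
    confusion: Counter = Counter()
    for text, expected in labelled:
        confusion[(expected, classify(text))] += 1
    total = sum(confusion.values())
    correct = sum(n for (expected, got), n in confusion.items() if expected == got)
    return (correct / total if total else 0.0), confusion
```

The confusion table is worth keeping alongside the accuracy figure, since it shows whether errors come mainly from positive reviews being read as negative or the reverse.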



1   Pang B, Lee L, Vaithyanathan S. Thumbs up? Sentiment classification using
machine learning techniques. In: Conference on empirical methods in natural
language processing (EMNLP). Philadelphia, Pennsylvania, USA, 2002, pp. 79-86.



[Pie chart: shares of Neutral, Positive and Negative classifications (segments of 20%, 39% and 41%)]

                        FIGURE 4 - BASIC VALIDATION TESTING RESULTS


STAGE TWO TESTING – FUNCTIONALITY TESTING
The second stage of testing would validate the multiple-input functionality, ensuring
that data can be retrieved for two or more search terms and that the results can be
accurately differentiated. The test case would build on the first stage of testing, with
added content regarding a second movie.


[Two pie charts, titled "Schindler's List" and "The Usual Suspects", each showing Neutral,
Positive and Negative shares (segments of 39%, 20% and 41%, and of 20%, 21% and 59%,
respectively)]

                          FIGURE 5 - TWO TOPIC VALIDATION TESTING


STAGE THREE TESTING – REAL LIFE DATA
The final stage of the evaluation plan would be to perform testing using previous
high-profile events as the test cases, such as the US Presidential Election of 2008 and
the X-Factor competition from previous years. This validation is more complex, as it
will span the entire internet, not just the staging website.

The testing would be performed over different time intervals: days, weeks, months,
and the entire duration of the event. In the case of political elections, these time
periods could be chosen to coincide with official opinion polls, for example Gallup and
Rasmussen in the US or Red C for Irish-based events.

The geographical sentiment analysis function would be tested to gauge the accuracy
of the location results. In the case of the US Presidential Election, the final voting
percentages for each candidate per state would give an accurate basis for
comparison.

SAMPLE EVALUATION TEST CASE

The test case takes the ten states where each candidate won by the largest percentage
majority and graphs the percentage of votes each candidate received alongside the
percentage of positive, negative and neutral data regarding that candidate. In a
fully evaluated system, one would expect a close correlation between the positive
data and the candidate's percentage of votes, and also between the negative or
neutral data and the other candidate's percentage of votes, as per the sample charts
below for Obama and McCain respectively.
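
The expected "close correlation" can be quantified with the Pearson correlation coefficient between the per-state positive percentage and the per-state vote percentage. The report does not name a specific statistic, so the choice of Pearson's r here is an assumption, and any numbers passed to the function would need to come from the real election results and engine output.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)
```

A value close to 1 for positive data against a candidate's own vote share, computed over the ten selected states, would support the geographical sentiment function; a value near 0 would indicate that the location results are not tracking the actual vote.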

[Chart "Obama’s Data": Obama's percentage of votes, McCain's percentage of votes, and
Positive %, Negative % and Neutral % of data; vertical axis 0 to 90]

                           FIGURE 6 - SAMPLE TEST OUTPUT (OBAMA)

[Chart "McCain’s Data": McCain's percentage of votes, Obama's percentage of votes, and
Positive %, Negative % and Neutral % of data; vertical axis 0 to 70]

                           FIGURE 7 - SAMPLE TEST DATA (MCCAIN)



COMMERCIALISATION POTENTIAL
In an era where both businesses and individuals are moving further and further
towards data-driven decisions, sentiment engine products have a range of
commercial applications.

Some companies have already begun commercialising Semantic Web applications. IBM,
for example, licensed its WebFountain internet analytical engine to Factiva and
Thomson Reuters in 2003 for those interested in corporate reputational data.

There is also a market among those who cannot afford Enterprise Resource Planning
(ERP) add-ons like SAP BusinessObjects, SAS or LexisNexis Analytics, and for
whom the currently available crop of free semantic sentiment tools is insufficient,
too niche, or unscalable (Basu, 2010). Semantic Web products are becoming
important in internal and external Business Informatics.

However, information arbitrage is not merely for professional market traders. This
system would likely be offered as software as a service (SaaS) on the web. It could be
sold on a freemium basis, by monthly subscription or under a yearly licence,
depending on the implementation.


Primary clients would depend on the sentiments needing to be parsed and on the
proprietary and public data sets being used within the sentiment engine.

Examples include: corporate media; the content publishing industry; PR firms;
polling; market research firms; trading platforms; political parties; elections;
government agencies; security services; and bookmakers deciding odds on
novelty bets (reality TV shows, politics etc.).


CONCLUSION AND FURTHER RESEARCH OPPORTUNITIES
Where does the Semantic Web lead to exactly? We don’t really know, but opening
up the segregated data silos and making sense of deeper, dark ‘big data’ in pursuit
of the benefits of a deeper-rooted “hyperdata” would be a nice path. The road will
be long, but it may improve our day-to-day lives immensely.

       "Many applications and services claim to be "semantic" in one manner or another,
       but that does not mean they are "Semantic Web." Semantic applications include any
       applications that can make sense of meaning, particularly in language such as
       unstructured text, or structured data in some cases. By this definition, all search
       engines today are somewhat "semantic" but few would qualify as "Semantic Web"
       apps." (Spivack, 2007)

Getting from the early steps of Web 3.0 to this deeper data web will be a long
process. It will provide countless benefits, many of which we may not even perceive
today. Sentiment engines are merely one way to get the public and the developer
community interested in and excited about all the other benefits that this open
data future could hold. For that reason sentiment engines will remain an important
component in the near-term future, as “big data” holds much of the future promise
of the “web of things” and of making sense and use of it.




REFERENCES
Abbasi, A, Hsinchun, C, & Salem, A 2008, 'Sentiment Analysis in Multiple
Languages: Feature Selection for Opinion Classification in Web Forums', ACM
Transactions On Information Systems, 26, 3, pp. 1-34, Computers & Applied Sciences
Complete, viewed 4 May 2012.

Basu, Saikat 2010. 10 Web Tools To Try Out Sentiment Search & Feel the Pulse. Make
Use Of [Online] 30 April.
http://www.makeuseof.com/tag/10-web-tools-sentiment-search-feel-pulse/
[Accessed 1 May 2012]

Bergman, Mike 2010. I Have Yet to Metadata I Didn’t Like. AI3 [Online] 16 August.
http://www.mkbergman.com/902/i-have-yet-to-metadata-i-didnt-like/ [Accessed
1 May 2012]

Bollen, J, Mao, H, & Zeng, X 2011. Twitter mood predicts the stock market. Journal
of Computational Science, 2(1), pp. 1-8. Available from:
http://arxiv.org/abs/1010.3003

Cai, K, Spangler, S, Ying, C, & Li, Z 2010, 'Leveraging sentiment analysis for topic
detection', Web Intelligence & Agent Systems, 8, 3, pp. 291-302, Academic Search
Complete, viewed 20 April 2012.

Dalton, Jeff 2007. Caffè Java Open Source NLP and Text Mining tools. Jeff's Search
Engine Caffè [Online] 16 March.
http://www.searchenginecaffe.com/2007/03/java-open-source-text-mining-and.html
[Accessed 1 May 2012]

Hamouda, A, Marei, M, & Rohaim, M 2011, 'Building Machine Learning Based
Senti-word Lexicon for Sentiment Analysis', Journal Of Advances In Information
Technology, 2, 4, pp. 199-203, Library, Information Science & Technology Abstracts
with Full Text, viewed 1 May 2012.

Hasan, S, & Adjeroh, D 2011, 'Detecting Human Sentiment from Text using a
Proximity-Based Approach', Journal Of Digital Information Management, 9, 5, pp.
206-212, Library, Information Science & Technology Abstracts with Full Text,
viewed 7 May 2012.

Kang, H, Yoo, S, & Han, D 2012, 'Senti-lexicon and improved Naïve Bayes
algorithms for sentiment analysis of restaurant reviews', Expert Systems With
Applications, 39, 5, pp. 6000-6010, Academic Search Complete, viewed 10 April
2012.

Lévy, Pierre CRC, FRSC 2007. Elements of Semantic Engineering. I3 workshop / WWW
Consortium Conference / Banff 2007. Available from:
http://www.ieml.org/text/semantic_space.pdf

Li, G, & Liu, F 2012, 'Application of a clustering method on sentiment analysis',
Journal Of Information Science, 38, 2, pp. 127-139, Business Source Complete,
viewed 21 April 2012.

Pang B, Lee L, Vaithyanathan S. Thumbs up? Sentiment classification using machine
learning techniques. In: Conference on empirical methods in natural language
processing (EMNLP). Philadelphia, Pennsylvania, USA, 2002, pp. 79-86.

Shukla, A 2011, 'Sentiment Analysis of Document Based on Annotation',
International Journal Of Web & Semantic Technology, 2, 4, pp. 91-103,
Computers & Applied Sciences Complete, viewed 6 May 2012.

Spivack, Nova 2007. The Semantic Web, Collective Intelligence and Hyperdata.
novaspivack.typepad.com [Online] 18 September.
http://novaspivack.typepad.com/nova_spivacks_weblog/2007/09/hyperdata.html
[Accessed 1 May 2012]

Vishwanath, J, & Aishwarya, S 2011, 'User Suggestions Extraction from customer
Reviews: A Sentiment Analysis approach', International Journal On Computer Science
& Engineering, 3, 3, pp. 1203-1206, Academic Search Complete, viewed 1 May 2012.

Yang, C, Lin, K, & Chen, H 2008, 'Sentiment Analysis in Weblog Using
Contextual Information: A Machine Learning Approach', International Journal Of
Computer Processing Of Languages, 21, 4, pp. 331-345, Academic Search Complete,
viewed 27 April 2012.

Young, L, & Soroka, S 2012, 'Affective News: The Automated Coding of Sentiment in
Political Texts', Political Communication, 29, 2, pp. 205-231, Academic Search
Complete, viewed 10 May 2012.




                                                                     17 | P a g e

More Related Content

PDF
IRJET- Stock Market Prediction using Deep Learning and Sentiment Analysis
PDF
Aliano neural
PDF
opinion feature extraction using enhanced opinion mining technique and intrin...
PDF
IRJET- Credit Card Fraud Detection Analysis
PDF
IRJET- The Sentimental Analysis on Product Reviews of Amazon Data using the H...
PDF
EXTRACTING BUSINESS INTELLIGENCE FROM ONLINE PRODUCT REVIEWS
PDF
Object Detection using Deep Learning with Hierarchical Multi Swarm Optimization
PPTX
Doc 20190909-wa0025
IRJET- Stock Market Prediction using Deep Learning and Sentiment Analysis
Aliano neural
opinion feature extraction using enhanced opinion mining technique and intrin...
IRJET- Credit Card Fraud Detection Analysis
IRJET- The Sentimental Analysis on Product Reviews of Amazon Data using the H...
EXTRACTING BUSINESS INTELLIGENCE FROM ONLINE PRODUCT REVIEWS
Object Detection using Deep Learning with Hierarchical Multi Swarm Optimization
Doc 20190909-wa0025

What's hot (18)

PDF
A Clustering Method for Weak Signals to Support Anticipative Intelligence
PDF
D018212428
PPTX
Statistical Modeling in 3D: Describing, Explaining and Predicting
PDF
IRJET- Prediction of Stock Market using Machine Learning Algorithms
PDF
Accenture multi-speed-it-po v
PDF
Empirical Model of Supervised Learning Approach for Opinion Mining
PPTX
Recommender system
PPTX
Regression and correlation
PDF
Sentiment Analysis of Feedback Data
PDF
Anomaly detection Workshop slides
PDF
Sentiment Features based Analysis of Online Reviews
PPTX
Classification
PPTX
To Explain, To Predict, or To Describe?
PPTX
To explain or to predict
PDF
Application of AI in customer relationship management
PDF
Approach to BSA/AML Rule Thresholds
PDF
IRJET- Credit Card Fraud Detection using Isolation Forest
PDF
Demand forecasting
A Clustering Method for Weak Signals to Support Anticipative Intelligence
D018212428
Statistical Modeling in 3D: Describing, Explaining and Predicting
IRJET- Prediction of Stock Market using Machine Learning Algorithms
Accenture multi-speed-it-po v
Empirical Model of Supervised Learning Approach for Opinion Mining
Recommender system
Regression and correlation
Sentiment Analysis of Feedback Data
Anomaly detection Workshop slides
Sentiment Features based Analysis of Online Reviews
Classification
To Explain, To Predict, or To Describe?
To explain or to predict
Application of AI in customer relationship management
Approach to BSA/AML Rule Thresholds
IRJET- Credit Card Fraud Detection using Isolation Forest
Demand forecasting
Ad

Viewers also liked (20)

DOCX
Trabajo cmc copia
DOC
Cuánto fósforo aplico
DOC
Ti nicole karolina_gema_powerpoint
PPT
Presentación D E Q U I M I C A
PDF
Miss HIV
DOCX
DOCX
Felipe hincapié m octavo23
PDF
Case do Grêmio publicado na revista Case Studies nº96 em 2013 - case de mark...
PPTX
PPTX
La huelga
PPS
La Diferencia Que Hace La Diferencia
PPT
CRM - João / Frederico
PPT
Navajo code talkers
PDF
Plano agricola e pecuario 2012 e 2013 mapa
DOCX
Columnas
PDF
PDF
Bernardo33
DOCX
Texto Linux
PDF
Los números naturales
PDF
Reus do mensalão
Trabajo cmc copia
Cuánto fósforo aplico
Ti nicole karolina_gema_powerpoint
Presentación D E Q U I M I C A
Miss HIV
Felipe hincapié m octavo23
Case do Grêmio publicado na revista Case Studies nº96 em 2013 - case de mark...
La huelga
La Diferencia Que Hace La Diferencia
CRM - João / Frederico
Navajo code talkers
Plano agricola e pecuario 2012 e 2013 mapa
Columnas
Bernardo33
Texto Linux
Los números naturales
Reus do mensalão
Ad

Semantic Web Based Sentiment Engine

  • 1. CA652A Semantic Web Based Sentiment Engine A system to determine online sentiment on current affairs for the purpose of analysis and prediction 11210889 52595354 CA652A
  • 2. ABSTRACT
    Sentiment analysis involves classifying opinions expressed in text as "positive", "negative" or "neutral". Its purpose is to help extract valuable information and insight from large amounts of unstructured data. The proposed system will determine online sentiment on current affairs for the purpose of analysis and prediction. For the sentiment analysis a clustering-based approach, a recent advancement in this area, is recommended. Various APIs will assist in extracting other data such as location and time. Evaluation is recommended in two parts: the Pang et al. movie review data sets to validate basic functionality, and real-life data from the 2008 US presidential race to evaluate the full functionality of the system. Multiple industries, from marketing companies to hotels, are identified as potential users, which adds to the commercialisation potential of the system.
  • 3. A report submitted to Dublin City University, School of Computing for module CA652: Information Access, 2011/2012.
    We hereby certify that the work presented and the material contained herein is my/our own except where explicitly stated references to other material are made.
    Student Numbers: 52595354, 11210889
  • 4. TABLE OF CONTENTS
    Abstract
    Introduction
    Concept Overview
    Constraints and Limitations
    Functional Description
    Sentiment Search Functions
    Techniques
    Time parameter Based Search
    Geographical Extraction Based
    Social Sentiment Extraction Based data
    Graphical Data Generation Tools
    Pros & Cons of proposed system
    Evaluation Plan
    Stage One Testing - Validation
    Stage Two Testing – Functionality Testing
    Stage Three Testing – Real Life Data
    Commercialisation Potential
    Conclusion and Further Research Opportunities
    References
  • 5. Table of Figures
    Figure 1 - Sentiment Analysis framework
    Figure 2 - Cluster Method Accuracy/Efficiency
    Figure 3 - Graphical Representation of content
    Figure 4 - Basic Validation Testing Results
    Figure 5 - Two Topic Validation Testing
    Figure 6 - Sample Test Output (Obama)
    Figure 7 - Sample Test Data (McCain)
  • 6. INTRODUCTION
    The "media" as we now conceptualise it has changed dramatically. With the internet, people have an opportunity to "weigh in" on events by providing their opinions and feedback in real time through blogs, forums, social networks and commenting systems on news websites. There is a growing interest in measuring sentiment, which can be attributed to the dramatic increase in the volume of digitised information. "An increasing number of studies in political communication focus on the 'sentiment' or 'tone' of news content, political speeches, or advertisements" (Young, L, & Soroka, S 2012).

    This report discusses the concept of developing a Semantic Web based sentiment engine that will be able to analyse public sentiment on current issues, from politics to reality TV shows. By tracking popular opinion through social media channels and leveraging research in the area of sentiment analysis, accurate predictions could be made on events from presidential elections to the X-Factor competition.

    CONCEPT OVERVIEW
    The proposed system is not a standard sentiment engine that returns static data; it offers increased functionality to assist with data interpretation. End users can customise their search, filter the returned data by multiple parameters and view graphical representations of the results.

    CONSTRAINTS AND LIMITATIONS
    The limitations of this concept are not due to technological constraints but to the volatility of public opinion, which cannot be remedied or corrected by technology. Another limitation is the scope of the opinion being captured. Users of social media and participants in online forums are statistically of a younger age group. The lack of inclusion of the opinion of older age groups could greatly affect the accuracy
  • 7. of the data, as it would not be entirely representative. This imbalance would particularly affect politics, with older groups statistically more likely to vote.

    FUNCTIONAL DESCRIPTION
    SENTIMENT SEARCH FUNCTIONS
    • Users can enter multiple search terms for the purpose of data comparison. Other features would be used to improve the analysis returned.
    • Multiple Search Parameters
      o Time Frame Defined Search - Data retrieved can be limited to a specific time frame.
      o Geographical Location Based Search - Data retrieved can be filtered by the location of users.
      o Narrow Search Scope - Select websites to exclude, or restrict the search to a small number of websites.
    • Graphical representations of the data are generated.

    TECHNIQUES
    Sentiment Analysis Techniques
    There is much research in the area of sentiment analysis, the primary objective being to find a technique where there is no trade-off between speed and accuracy. Several new and emerging techniques were reviewed to identify the best fit for this system.
    • Proximity-Based Approach (Hasan, S, & Adjeroh, D 2011)
      o This proposed method uses proximity-based features to determine sentiment: proximity distribution, mutual information between proximity types, and proximity patterns.
  • 8.
    • Based on Annotation (Shukla, A 2011)
      o This proposed method counts all the annotations present and calculates sentiment scores for all annotations, including comments, to determine sentiment.
    • Sentence-level Lexical Based Semantic Orientation (Khan, A et al, 2011)
      o This proposed method uses SentiWordNet to calculate the semantic "score" of sentences it has classified as subjective from reviews and blog comments.
    • Machine Learning approach to contextual information (YANG, C et al, 2008)
      o This proposed method differentiates itself from others by taking context into account when determining the sentiment category. Its primary focus and test data sets have been blog posts. Figure 1 below shows the framework employed.
    FIGURE 1 - SENTIMENT ANALYSIS FRAMEWORK
    • Clustering-Based Sentiment Analysis Approach (Li, G, & Liu, F 2012)
      The method deemed most appropriate for this proposed system is based on an article published in the Journal Of Information Science in April 2012, which outlines the Clustering-Based Sentiment Analysis approach. It proposes that by applying a "TF-IDF weighting method, a voting mechanism and importing term scores, an acceptable and stable clustering result can be obtained" (Li, G, & Liu, F 2012). The evaluation results
  • 9. were the most impressive of all techniques reviewed as part of this research. The approach appears to perform well in terms of both accuracy and efficiency with no need for human participation, as can be seen from Figure 2.
    FIGURE 2 - CLUSTER METHOD ACCURACY/EFFICIENCY
    Apart from its accuracy and efficiency, this technique was deemed the most suitable because it can be applied to any data set. The other techniques researched were developed for particular data types, such as customer reviews or blogs, and their evaluations suggest they do not perform as well outside of these data types.

    TIME PARAMETER BASED SEARCH
    The sentiment engine would make use of the adaptable Librato API libraries to make sentiment results time sensitive. This would allow a user to evaluate how sentiment is changing over time, or what sentiment was during specific time periods.

    GEOGRAPHICAL EXTRACTION BASED
    Adding a geographical element would be a unique feature, allowing sentiment results to be mapped. Location content would preferably be pulled from the Twitter API, as it gives access to the Twitter profile location. Some comment systems used by news websites, such as that of the Irish Times, request a location before a comment is posted. The Facebook API allows access to the location of a user if the relevant privacy setting is turned on. OAuth would be used to allow users of the sentiment engine to explore the opinions of their friends and networked associates and see where they fit on the sentiment scales. Other free-to-use location APIs may also be needed.
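    The clustering-based approach selected under TECHNIQUES combines TF-IDF weighting, a voting mechanism and imported term scores. The following is only a minimal sketch of how those three ingredients might fit together, not the algorithm published by Li and Liu. It assumes whitespace tokenisation, a two-cluster k-means on cosine similarity, and a small hand-made `term_scores` dictionary; all function names and the lexicon are hypothetical.

```python
import math
import random
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using TF-IDF weighting."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        term_freq = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def kmeans_two_clusters(vectors, rng, iterations=10):
    """Split the vectors into two clusters; returns a list of 0/1 assignments."""
    centroids = [dict(v) for v in rng.sample(vectors, 2)]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        assignment = [
            0 if cosine(v, centroids[0]) >= cosine(v, centroids[1]) else 1
            for v in vectors
        ]
        for k in (0, 1):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:
                total = Counter()
                for v in members:
                    for term, weight in v.items():
                        total[term] += weight
                centroids[k] = {t: w / len(members) for t, w in total.items()}
    return assignment


def label_clusters(docs, assignment, term_scores):
    """Call the cluster with the higher mean term score 'positive'."""
    means = []
    for k in (0, 1):
        scores = [
            sum(term_scores.get(t, 0.0) for t in doc.lower().split())
            for doc, a in zip(docs, assignment) if a == k
        ]
        means.append(sum(scores) / len(scores) if scores else 0.0)
    positive = 0 if means[0] >= means[1] else 1
    return ["positive" if a == positive else "negative" for a in assignment]


def cluster_sentiment(docs, term_scores, runs=5, seed=0):
    """Run the clustering several times and take a majority vote per document."""
    vectors = tfidf_vectors(docs)
    rng = random.Random(seed)
    votes = [Counter() for _ in docs]
    for _ in range(runs):
        assignment = kmeans_two_clusters(vectors, rng)
        for vote, label in zip(votes, label_clusters(docs, assignment, term_scores)):
            vote[label] += 1
    return [vote.most_common(1)[0][0] for vote in votes]
```

    The vote over several randomly seeded runs is what is meant to give a stable result here, since a single k-means run depends on its starting centroids. Using an odd number of runs avoids ties, and at least two documents are required.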
  • 10. SOCIAL SENTIMENT EXTRACTION BASED DATA
    The content used to create a matrix of information within which to evaluate sentiment via FLP would likely include, but not be limited to: Twitter; Disqus; Livefyre; IntenseDebate; Drupal comments; WordPress comments; other blog posts; scraped open Facebook and fan page comments; the Facebook comment system; text comments; G+ posts; Slideshare.net; Pinterest pins; Google News articles; comments on various bookmarking sites such as fark.com and reddit; and other language-relevant wire news services.

    GRAPHICAL DATA GENERATION TOOLS
    Graphical representations of the data are generated. The results could be rendered as web-based Flash objects, or in a way that is compliant with the evolving HTML5 standards and with iOS 5, so that results are useful on mobile devices and tablets given the animosity between Apple and Adobe over Flash. These reports would be exportable to Crystal Reports.
    FIGURE 3 - GRAPHICAL REPRESENTATION OF CONTENT (bar chart of Positive, Neutral and Negative counts for Candidate A and Candidate B; vertical axis 0 to 1600)

    PROS & CONS OF PROPOSED SYSTEM
    The primary argument for why sentiment engines built on the Semantic Web and linked data are useful is the new information and insight that can be gleaned from them. The ability to know relative and positional sentiment can be useful in many analytical or informational arbitrage situations.
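    A chart such as Figure 3 needs per-term counts of each sentiment label, optionally restricted to a time frame as described under the search functions. A minimal sketch, assuming each classified item arrives as a hypothetical `(search_term, label, posted_at)` tuple; neither the record layout nor the function names come from the report.

```python
from collections import Counter
from datetime import datetime

LABELS = ("positive", "neutral", "negative")


def tally_sentiment(records, start=None, end=None):
    """Count labels per search term, keeping items with start <= posted_at < end."""
    counts = {}
    for term, label, posted_at in records:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
        if start is not None and posted_at < start:
            continue
        if end is not None and posted_at >= end:
            continue
        counts.setdefault(term, Counter())[label] += 1
    # Emit every label for every term so chart series always line up.
    return {term: {label: c[label] for label in LABELS} for term, c in counts.items()}


def percentages(row):
    """Convert one term's label counts into percentages of that term's total."""
    total = sum(row.values())
    return {label: (100.0 * n / total if total else 0.0) for label, n in row.items()}
```

    The counts feed a grouped bar chart directly, while `percentages` gives the shares used in a pie chart.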
  • 11. In terms of the cons, the primary concern would be data quality. Problems with data quality are a huge issue and can skew any resulting analysis. The extent of the data quality problem has often been highlighted by information activists working in the open data movement. Secondly, privacy concerns may at times be an issue, as may staying within the spirit and letter of the relevant data privacy laws of the regulatory regime under which the system operates. This can be tricky given the interconnected nature of the web. Lastly, inaccuracies in the data, and its being organised in "short sets" rather than deeper data, may create false sentiments. Is there enough data being looked at to create a realistic positive or negative sentiment? Some analysis may need additional parsing to tease out, for example, initial heated emotional responses from the rational morning-after response.

    EVALUATION PLAN
    STAGE ONE TESTING - VALIDATION
    The evaluation plan would begin with simple software validation. The first test case would validate the fundamental functionality of the system: its ability to differentiate between sentiments. The data set to be used is the movie review data from the Pang et al. experiments (footnote 1). Movie review data is widely regarded as the most challenging data for sentiment engines to analyse. This can be attributed to the fact that a positive review may contain descriptions of gory or violent scenes, and equally a negative review could contain descriptions of light-hearted, pleasant scenes. For additional testing, other data sets could be used for each iteration of this dynamic testing stage.

    1 Pang B, Lee L, Vaithyanathan S. Thumbs up? Sentiment classification using machine learning techniques. In: Conference on empirical methods in natural language processing (EMNLP). Philadelphia, Pennsylvania, USA, 2002, p. 79.
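    The Stage One validation amounts to comparing the engine's labels with the known labels of a test set. A minimal sketch of that comparison, assuming each review comes with a known label and that the engine's predictions are already available as a list; the `evaluate` function is hypothetical.

```python
from collections import Counter


def evaluate(predicted, expected):
    """Return (accuracy, confusion), where confusion counts (expected, predicted) pairs."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    confusion = Counter(zip(expected, predicted))
    correct = sum(n for (want, got), n in confusion.items() if want == got)
    accuracy = correct / len(expected) if expected else 0.0
    return accuracy, confusion
```

    The confusion counts show which kind of error dominates, for example positive reviews with violent scene descriptions being labelled negative.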
  • 12. FIGURE 4 - BASIC VALIDATION TESTING RESULTS (pie chart with Neutral, Positive and Negative segments; values shown: 20%, 39%, 41%)

    STAGE TWO TESTING – FUNCTIONALITY TESTING
    The second stage of testing would validate the multiple-input functionality, ensuring that data can be retrieved for two or more search terms and that the results can be accurately differentiated. The test case would build on the first stage of testing, with added content regarding a second movie.
    FIGURE 5 - TWO TOPIC VALIDATION TESTING (two pie charts, one for Schindler's List and one for The Usual Suspects, each with Neutral, Positive and Negative segments)

    STAGE THREE TESTING – REAL LIFE DATA
    The final stage of the evaluation plan would be to perform testing using previous high-profile events as the test cases, such as the US Presidential Election of 2008 and
  • 13. the X-Factor competition from previous years. This validation is more complex, as it would span the entire internet and not just the staging website. The testing would be performed over different time intervals: days, weeks, months, and the entire duration of the event. In the case of political elections these time periods could be chosen to coincide with official opinion polls, for example Gallup and Rasmussen stateside or RedC for Irish events. The geographical sentiment analysis function would also be tested to gauge the accuracy of the location results. In the case of the US Presidential Election, the final voting percentages for each candidate per state would give an accurate basis for comparison.

    SAMPLE EVALUATION TEST CASE
    Take the ten states where each candidate won by the largest percentage majority, and graph the percentage of votes each candidate received alongside the percentage of positive, negative and neutral data regarding that candidate. In a fully evaluated system one would expect a close correlation between a candidate's positive data and that candidate's percentage of votes, and also between the negative or neutral data and the other candidate's percentage of votes, as per the sample charts for Obama and McCain respectively.
    FIGURE 6 - SAMPLE TEST OUTPUT (OBAMA) (chart titled "Obama's Data"; series: Obama's Percentage of Votes, McCain's Percentage of Votes, Positive %, Negative %, Neutral %; vertical axis 0 to 90)
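    The expected "close correlation" in the sample test case could be checked numerically with a Pearson correlation coefficient between the per-state positive percentage and the per-state vote percentage. This is a minimal sketch; the report does not name a correlation measure, so the choice of Pearson is an assumption, and the figures in the usage note are made up for illustration only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when a sequence is constant")
    return covariance / (spread_x * spread_y)


# Hypothetical per-state values, not real election or sentiment data.
positive_pct = [62.0, 58.5, 55.0, 51.0]
vote_pct = [67.0, 61.0, 57.5, 54.0]
r = pearson(positive_pct, vote_pct)
```

    A value of `r` close to 1 would support the expectation stated above, while a value near 0 would suggest the sentiment data does not track the vote.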
  • 14. FIGURE 7 - SAMPLE TEST DATA (MCCAIN) (chart titled "McCain's Data"; series: McCain's Percentage of Votes, Obama's Percentage of Votes, Positive %, Negative %, Neutral %; vertical axis 0 to 70)

    COMMERCIALISATION POTENTIAL
    In an era where both businesses and individuals are moving further and further towards data-driven decisions, sentiment engine products have a range of commercial potential. Some companies have already begun commercialising Semantic Web applications; for example, IBM licensed its WebFountain internet analytical engine to Factiva and Thomson Reuters in 2003 for those interested in corporate reputational data. A further market is researchers who cannot afford Enterprise Resource Planning (ERP) add-ons such as SAP BusinessObjects, SAS or LexisNexis Analytics, and for whom the currently available crop of free semantic sentiment engine tools is insufficient, too niche or unscalable (Basu, 2010). Semantic Web products are becoming important in internal and external business informatics, and information arbitrage is not merely for professional market traders. This system would likely be offered as software as a service (SaaS) on the web. It could be sold on a freemium basis, by monthly subscription or by yearly licence, depending on the implementation.
  • 15. Primary clients would depend on the sentiments needing to be parsed and the proprietary and public data sets being used within the sentiment engine. Examples include: corporate media; the content publishing industry; PR firms; polling and market research firms; trading platforms; political parties; elections; government agencies; security services; and bookmakers deciding odds on novelty bets such as reality TV shows and politics.

    CONCLUSION AND FURTHER RESEARCH OPPORTUNITIES
    Where does the Semantic Web lead to exactly? We don't really know, but opening up the segregated data silos and making sense of deeper, dark "big data" in pursuit of the benefits of a deeper-rooted "hyperdata" would be a nice path. The road will be long, but it may improve our day-to-day lives immensely.

    "Many applications and services claim to be 'semantic' in one manner or another, but that does not mean they are 'Semantic Web.' Semantic applications include any applications that can make sense of meaning, particularly in language such as unstructured text, or structured data in some cases. By this definition, all search engines today are somewhat 'semantic' but few would qualify as 'Semantic Web' apps." (Spivack, 2007)

    Getting from the early steps of Web 3.0 to this deeper data web will be a long process. It will provide countless benefits, many of which we may not even perceive today. Sentiment engines are merely one way to get the public and the developer community interested in and excited about all the other benefits that this open data future could hold. For that reason sentiment engines will remain an important component in the near-term future, as "big data" holds much of the promise of bringing about the "web of things" and making sense and use of it.
  • 16. REFERENCES
    Abbasi, A, Hsinchun, C, & Salem, A 2008, 'Sentiment Analysis in Multiple Languages: Feature Selection for Opinion Classification in Web Forums', ACM Transactions On Information Systems, 26, 3, pp. 1-34, Computers & Applied Sciences Complete, viewed 4 May 2012.
    Basu, Saikat 2010. 10 Web Tools To Try Out Sentiment Search & Feel the Pulse. Make Use Of [Online] 30 April. http://www.makeuseof.com/tag/10-web-tools-sentiment-search-feel-pulse/ [Accessed 1 May 2012]
    Bergman, Mike 2010. I Have Yet to Metadata I Didn't Like. AI3 [Online] 16 August. http://www.mkbergman.com/902/i-have-yet-to-metadata-i-didnt-like/ [Accessed 1 May 2012]
    Bollen, J, Mao, Huina, Zeng, Xiao-Jun March 2011. Twitter mood predicts the stock market. Journal of Computational Science, 2(1), pp. 1-8. Available from: http://arxiv.org/abs/1010.3003
    Cai, K, Spangler, S, Ying, C, & Li, Z 2010, 'Leveraging sentiment analysis for topic detection', Web Intelligence & Agent Systems, 8, 3, pp. 291-302, Academic Search Complete, viewed 20 April 2012.
    Dalton, Jeff 2007. Java Open Source NLP and Text Mining tools. Jeff's Search Engine Caffè [Online] 16 March. http://www.searchenginecaffe.com/2007/03/java-open-source-text-mining-and.html [Accessed 1 May 2012]
    Hamouda, A, Marei, M, & Rohaim, M 2011, 'Building Machine Learning Based Senti-word Lexicon for Sentiment Analysis', Journal Of Advances In Information Technology, 2, 4, pp. 199-203, Library, Information Science & Technology Abstracts with Full Text, viewed 1 May 2012.
    Hasan, S, & Adjeroh, D 2011, 'Detecting Human Sentiment from Text using a Proximity-Based Approach', Journal Of Digital Information Management, 9, 5, pp. 206-212, Library, Information Science & Technology Abstracts with Full Text, viewed 7 May 2012.
    Kang, H, Yoo, S, & Han, D 2012, 'Senti-lexicon and improved Naïve Bayes algorithms for sentiment analysis of restaurant reviews', Expert Systems With Applications, 39, 5, pp. 6000-6010, Academic Search Complete, viewed 10 April 2012.
    Lévy, Pierre CRC, FRSC 2007. Elements of Semantic Engineering. I3 workshop / WWW Consortium Conference / Banff 2007. Available from: http://www.ieml.org/text/semantic_space.pdf
    Li, G, & Liu, F 2012, 'Application of a clustering method on sentiment analysis', Journal Of Information Science, 38, 2, pp. 127-139, Business Source Complete, viewed 21 April 2012.
    Pang B, Lee L, Vaithyanathan S. Thumbs up? Sentiment classification using machine learning techniques. In: Conference on empirical methods in natural language processing (EMNLP). Philadelphia, Pennsylvania, USA, 2002, p. 79.
    Shukla, A 2011, 'Sentiment Analysis of Document Based on Annotation', International Journal Of Web & Semantic Technology, 2, 4, pp. 91-103, Computers & Applied Sciences Complete, viewed 6 May 2012.
    Spivack, Nova 2007. The Semantic Web, Collective Intelligence and Hyperdata. novaspivack.typepad.com [Online] 18 September. http://novaspivack.typepad.com/nova_spivacks_weblog/2007/09/hyperdata.html [Accessed 1 May 2012]
    Vishwanath, J, & Aishwarya, S 2011, 'User Suggestions Extraction from customer Reviews: A Sentiment Analysis approach', International Journal On Computer Science & Engineering, 3, 3, pp. 1203-1206, Academic Search Complete, viewed 1 May 2012.
    Yang, C, Lin, K, & Chen, H 2008, 'Sentiment Analysis in Weblog Using Contextual Information: A Machine Learning Approach', International Journal Of Computer Processing Of Languages, 21, 4, pp. 331-345, Academic Search Complete, viewed 27 April 2012.
    Young, L, & Soroka, S 2012, 'Affective News: The Automated Coding of Sentiment in Political Texts', Political Communication, 29, 2, pp. 205-231, Academic Search Complete, viewed 10 May 2012.