International Journal of Current Trends in Engineering & Technology
ISSN: 2395-3152
Volume: 02, Issue: 02 (MAR-APR, 2016)
282
SOCIAL DATA ANALYSIS USING APACHE FLUME, HDFS, HIVE
Mrs. Sulochana Panigrahi
PG Scholar, Department of Computer
Science and Engineering,
New Horizon College of Engineering,
Bangalore, Karnataka, India
sulochanap01@gmail.com
Dr. S. Mohan Kumar
Associate Professor, Department of
Computer Science and Engineering,
New Horizon College of Engineering,
Bangalore, Karnataka, India
drsmohankumar@gmail.com
Abstract: - Twitter is one of the most popular microblogging
websites in today's globalized world. Twitter messages can be
mined to gain valuable information. Although Twitter provides
a real-time list of the most popular topics people tweet
about, known as Trending Topics, it is often hard to
understand what these trending topics are about. Therefore,
various efforts are being made to classify these topics into
general categories with high accuracy for better information
retrieval. In this paper, we discuss how sentiment analysis
can be performed effectively on data collected from Twitter
using Flume. Twitter is an online web application whose data
can be structured, semi-structured, or unstructured. We
collect the data from Twitter on the Cloudera VM using the
online streaming tool Flume. Analyzing Twitter data is
difficult because of the informal language used in comments.
Several types of analysis can be performed on the collected
data; here we choose sentiment analysis. We use Hive and its
queries to produce the sentiment data based on the groups
that we define in HQL (Hive Query Language), and we use
Visual Studio to present the results in a user interface.
Keywords: - Analysis, BIGDATA, Comment, Flume, Hive, HQL,
Sentiment Analysis, Structured, Semi-Structured, Twitter,
Tweets, Un-Structured.
I. INTRODUCTION
At present, people express their thoughts through online
blogs, discussion forums, and online applications such as
Facebook and Twitter. Taking Twitter as our example, nearly
1 TB of text data is generated within a week in the form of
tweets. This shows clearly how the Internet is changing the
way people live. These tweets can be categorized by the
hashtags under which people comment and post. Many companies,
including survey companies, now use this data for analytics,
so that they can predict the success rate of their products
or present different views of the data they have collected.
However, computing these views is very difficult with
conventional methods, because the volume of data generated
grows day by day.
Fig. 1: Describes clearly Cloudera VM
The above figure shows the different services available on
the Cloudera VM. The problem described above can be treated
as a Big Data problem and solved with Big Data tools.
Ordinarily, to get data from Twitter one would use a
programming language to crawl the data from Twitter's
database or web pages. Here we instead collect the data
using Flume, an online streaming tool from the Big Data
ecosystem, and we use Apache Hive to shuffle the data and
turn it into structured data in the form of tables.
II. PROBLEM STATEMENT
2.1 Existing System
As discussed above, the older approach obtains the data and
performs sentiment analysis on it with custom code. Crawling
techniques extract the data from the Twitter web pages using
code written in Java, Python, or a similar language. For
this, developers download the libraries provided by Twitter
and use them to crawl the specific data they want [1]. After
getting the raw data, they filter it using older techniques
and identify the positive, negative, and moderate words from
a list of words collected in a text file. All these words
must be collected manually before the data can be filtered
or analyzed for sentiment [2]. These words form a dictionary
set with which sentiment analysis is performed. After all
these steps, the results must be stored in a database. An
RDBMS can be used here, but it has limitations in creating
tables and in accessing them efficiently.
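The dictionary-based approach described above can be sketched
in a few lines of Python. This is only an illustration: the
word lists and the function name classify_tweet are
hypothetical and are not taken from the cited systems, and a
real dictionary would be much larger.

```python
import re

# Hypothetical word lists (assumption); a real dictionary would be far larger.
POSITIVE_WORDS = {"good", "great", "love", "excellent", "happy"}
NEGATIVE_WORDS = {"bad", "poor", "hate", "terrible", "sad"}


def classify_tweet(text: str) -> str:
    """Label a tweet positive, negative, or moderate by counting dictionary hits."""
    # Lowercase the text and split it into simple word tokens.
    words = re.findall(r"[a-z']+", text.lower())
    positive_hits = sum(word in POSITIVE_WORDS for word in words)
    negative_hits = sum(word in NEGATIVE_WORDS for word in words)
    score = positive_hits - negative_hits
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    # Equal counts (including no hits at all) fall into the moderate group.
    return "moderate"


if __name__ == "__main__":
    samples = [
        "I love this phone, great battery",
        "Terrible service, very sad",
        "Launch is on Monday",
    ]
    for tweet in samples:
        print(classify_tweet(tweet), "|", tweet)
```

Counting hits keeps the sketch simple; it ignores negation
("not good") and sarcasm, which is one reason hand-built
dictionaries are hard to maintain.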
2.2 Proposed System
Having seen the drawbacks of the existing system, we
overcome them by treating the task as a Big Data problem.
We use Hadoop and its ecosystem. To get raw data from
Twitter we use Apache Flume, the online streaming tool of
the Hadoop ecosystem. In this tool alone we configure
everything needed to get data from Twitter: we set the
configuration and define what information we want to get
from Twitter. All of this is saved into HDFS (Hadoop
Distributed File System) in our prescribed format. From this
raw data we create a table, filter the information we need,
and store it in a Hive table. From that table we perform
sentiment analysis using UDFs (User Defined Functions).
The following figure shows the architecture of the proposed
system. It illustrates how the project uses the Hadoop
ecosystem: how Flume stores the data, how Hive creates the
tables, and how the sentiment analysis is performed.
Fig. 2: Architecture diagram for proposed system
III. METHODOLOGY
The proposed system shows how to overcome the problems of
the existing system. To achieve this, we follow the methods
below.
3.1 Creating Twitter Application
To perform sentiment analysis on Twitter data, we first
need to get the data. To do so, we create an account on the
Twitter developer site and create an application by clicking
the new application button. After creating the application,
we create the access tokens so that we do not need to
provide our authentication details each time. The new
application also has consumer keys, which are used to access
it when getting Twitter data. The following figure shows how
the application data looks after the application is created;
the consumer details and the access token details are both
visible there. We take these keys and tokens and set them in
the Flume configuration file so that we can get the required
data from Twitter in the form of tweets. In the figure, the
top two keys are the API key and the API secret. The
remaining two keys are the access tokens, which we generate
ourselves by clicking the generate access token button. This
yields the two keys for our account: the access token and
the access token secret.
Fig. 3: Creating Twitter application from Twitter
Developer
3.2 Getting data using Flume
After creating an application on the Twitter developer site,
we use the consumer key and secret along with the access
token and secret values. With these we can
access Twitter and get exactly the information we want.
Everything arrives in JSON format and is stored in HDFS at
the location we specified for saving the data that comes
from Twitter. The following is the configuration file used
to get the data from Twitter.
Fig. 4: Flume configuration files for Twitter data
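Since the configuration file itself is only available as a
figure, the following Python sketch shows roughly what such a
file might contain by rendering one from the four credentials.
It is an assumption, not the authors' file: the agent,
channel, and sink names, the HDFS path in the usage example,
and the capacity values are hypothetical, and the source class
com.cloudera.flume.source.TwitterSource (with its keywords
property) is taken from Cloudera's example Twitter source,
which the paper does not confirm it used. The memory channel
and HDFS sink property names are standard Flume properties.

```python
def render_flume_config(consumer_key, consumer_secret, access_token,
                        access_token_secret, keywords, hdfs_path):
    """Render a Flume agent configuration (properties format) as a string."""
    agent = "TwitterAgent"  # hypothetical agent name
    props = {
        f"{agent}.sources": "Twitter",
        f"{agent}.channels": "MemChannel",
        f"{agent}.sinks": "HDFS",
        # Assumed source class from Cloudera's example project (not confirmed by the paper).
        f"{agent}.sources.Twitter.type": "com.cloudera.flume.source.TwitterSource",
        f"{agent}.sources.Twitter.channels": "MemChannel",
        f"{agent}.sources.Twitter.consumerKey": consumer_key,
        f"{agent}.sources.Twitter.consumerSecret": consumer_secret,
        f"{agent}.sources.Twitter.accessToken": access_token,
        f"{agent}.sources.Twitter.accessTokenSecret": access_token_secret,
        f"{agent}.sources.Twitter.keywords": ", ".join(keywords),
        # In-memory channel buffering events between source and sink.
        f"{agent}.channels.MemChannel.type": "memory",
        f"{agent}.channels.MemChannel.capacity": "10000",
        f"{agent}.channels.MemChannel.transactionCapacity": "100",
        # HDFS sink writing the raw JSON text to the given path.
        f"{agent}.sinks.HDFS.type": "hdfs",
        f"{agent}.sinks.HDFS.channel": "MemChannel",
        f"{agent}.sinks.HDFS.hdfs.path": hdfs_path,
        f"{agent}.sinks.HDFS.hdfs.fileType": "DataStream",
        f"{agent}.sinks.HDFS.hdfs.writeFormat": "Text",
    }
    return "\n".join(f"{key} = {value}" for key, value in props.items()) + "\n"


if __name__ == "__main__":
    # Placeholder credentials and a hypothetical HDFS path.
    print(render_flume_config(
        "YOUR_CONSUMER_KEY", "YOUR_CONSUMER_SECRET",
        "YOUR_ACCESS_TOKEN", "YOUR_ACCESS_TOKEN_SECRET",
        ["hadoop", "bigdata"],
        "hdfs://localhost:8020/user/flume/tweets/",
    ))
```

Generating the file from placeholders keeps the real keys out
of the document; in practice the four values come from the
application page shown in Fig. 3.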
Fig. 5: Twitter data in HDFS (Hadoop Distributed File
System).
Fig. 6: Twitter data in JSON format
3.3 Querying using Hive Query Language (HQL)
After Flume is run with the above configuration, the Twitter
data is saved automatically into HDFS at the storage path we
set. The following figure shows how the data is stored in
HDFS as documents; the raw data we got from Twitter is in
JSON format, as shown in the figure:
Fig. 7: Validating JSON data for HQL.
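The validation step can be illustrated with a short Python
sketch. It assumes that each line of the stored data holds one
JSON object and that a usable tweet carries at least the text
and user fields; the function name parse_tweet_lines and the
sample records are hypothetical.

```python
import json


def parse_tweet_lines(lines):
    """Return (valid_records, rejected_count) for an iterable of JSON lines."""
    valid, rejected = [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines are skipped and not counted as rejected
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            rejected += 1  # malformed JSON, e.g. a truncated record
            continue
        # Keep only JSON objects that carry the fields later steps rely on.
        if isinstance(record, dict) and "text" in record and "user" in record:
            valid.append(record)
        else:
            rejected += 1
    return valid, rejected


if __name__ == "__main__":
    sample = [
        '{"id": 1, "text": "hello", "user": {"screen_name": "a", "followers_count": 10}}',
        '{"id": 2, "text": broken',
        '{"delete": {"status": {"id": 3}}}',
    ]
    records, bad = parse_tweet_lines(sample)
    print(len(records), "valid,", bad, "rejected")
```

Rejecting bad lines before the data reaches Hive avoids query
failures caused by a single malformed record.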
From this data we first create a table in which the filtered
data is arranged in a structured format, so that the
unstructured data is converted into structured form. For
this we use a custom SerDe, which defines how the data in
JSON format is read. With the custom SerDe for JSON, Hive
can read the JSON data [10] and create a table in our
prescribed format. From that table we can perform the
sentiment analysis and obtain the results in a new table;
we can also find out which user has the greatest number of
followers, as shown in Fig. 10.
Fig. 8: HQL Query for creating Tweets table.
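To illustrate the idea of mapping JSON records onto a
structured table and then querying it, the following Python
sketch uses the standard sqlite3 module as a stand-in for
Hive. This is an assumption made so the example can run
anywhere: it is not the HQL of Fig. 8, the column set is
hypothetical, and in Hive the flattening shown in load_tweets
would be done by the JSON SerDe.

```python
import sqlite3


def load_tweets(records):
    """Flatten parsed tweet dicts into an in-memory table (stand-in for a Hive table)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tweets ("
        "id INTEGER, text TEXT, screen_name TEXT, followers_count INTEGER)"
    )
    rows = []
    for record in records:
        user = record.get("user") or {}  # tolerate a missing or null user object
        rows.append((
            record.get("id"),
            record.get("text"),
            user.get("screen_name"),
            user.get("followers_count", 0),
        ))
    conn.executemany("INSERT INTO tweets VALUES (?, ?, ?, ?)", rows)
    return conn


def top_user_by_followers(conn):
    """Return (screen_name, followers) for the user with the most followers."""
    return conn.execute(
        "SELECT screen_name, MAX(followers_count) AS followers "
        "FROM tweets GROUP BY screen_name "
        "ORDER BY followers DESC LIMIT 1"
    ).fetchone()


if __name__ == "__main__":
    sample = [
        {"id": 1, "text": "first", "user": {"screen_name": "alice", "followers_count": 120}},
        {"id": 2, "text": "second", "user": {"screen_name": "bob", "followers_count": 4500}},
        {"id": 3, "text": "third", "user": {"screen_name": "alice", "followers_count": 130}},
    ]
    print(top_user_by_followers(load_tweets(sample)))
```

The grouping takes the highest follower count seen per user,
because the same user can appear in many tweets with slightly
different counts.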
Fig. 9: Inserting data by performing sentiment analysis.
Fig. 10: Sentimental Analysis through Excel Sheet
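One way to move per-group results into a spreadsheet is to
count the tweets in each sentiment group and write the counts
as CSV, which Excel can open. The Python sketch below assumes
each tweet has already been labeled positive, moderate, or
negative; the function name sentiment_counts_csv and the
column headers are hypothetical, and the paper does not state
how its own sheet was produced.

```python
import csv
import io
from collections import Counter

# The three groups used in the paper, in a fixed output order.
SENTIMENT_GROUPS = ("positive", "moderate", "negative")


def sentiment_counts_csv(labels):
    """Count labels per sentiment group and return the result as CSV text."""
    counts = Counter(labels)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sentiment", "tweet_count"])
    for group in SENTIMENT_GROUPS:
        # Groups with no tweets are written with a count of 0;
        # labels outside the three groups are ignored.
        writer.writerow([group, counts.get(group, 0)])
    return buffer.getvalue()


if __name__ == "__main__":
    print(sentiment_counts_csv(["positive", "negative", "positive", "moderate"]))
```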
IV. CONCLUSION & FUTURE SCOPE
There are different ways to get Twitter data or any other
online streaming data, but they require writing many lines
of code. Performing sentiment analysis on the stored data
with such approaches is also complex. In this paper we have
addressed this problem by treating it as a Big Data problem
and solving it with Hadoop and its ecosystem. Finally, we
have performed sentiment analysis on the Twitter data stored
in HDFS. The processing time is also much lower than with
the previous methods, because Hadoop MapReduce and Hive are
well suited to processing large amounts of data in a short
time. This paper has shown one way of performing sentiment
analysis on Twitter data. As future work, a workflow could
be created with a time slot assigned to each task, so that
each task runs at its allocated time. Finally, the word map,
i.e., the most frequent words used in the positive,
moderate, and negative groups, could be visualized using the
R language.
REFERENCES
[1]. Sunil B. Mane, Yashwant Sawant, Saif Kazi, Vaibhav
Shinde, Real Time Sentiment Analysis of Twitter Data Using
Hadoop, International Journal of Computer Science and
Information Technologies, Vol. 5, Issue 3, 2014.
[2]. Penchalaiah C, Murali G, Suresh Babu, Effective
Sentiment Analysis on Twitter Data using: Apache Flume and
Hive, IJISET - International Journal of Innovative Science,
Engineering & Technology, Vol. 1, Issue 8, October 2014.
[3]. Manoj Kumar Danthala, Tweet Analysis: Twitter Data
processing Using Apache Hadoop, International Journal of
Core Engineering & Management (IJCEM), Vol. 1, Issue 11,
February 2015.
[4]. Manoj Kumar Danthala, Dr. Siddhartha Ghosh, Bigdata
Analysis: Streaming Twitter Data with Apache Hadoop and
Visualizing using Big Insights, International Journal of
Engineering Research & Technology (IJERT), Vol. 4, Issue 05,
May 2015.
[5]. Judith Sherin Tilsha S, Shobha, A Survey on Twitter
Data Analysis Techniques to Extract Public Opinion,
International Journal of Advanced Research in Computer
Science and Software Engineering, Vol. 5, Issue 11,
November 2015.
[6]. Munesh Kataria, Ms. Pooja Mittal, Big Data and Hadoop
with Components like Flume, Pig, Hive and Jaql,
International Journal of Computer Science and Mobile
Computing (IJCSMC), Vol. 3, Issue 7, July 2014.
[7]. Kushal Sharma, Prashant Singh, Sachin Mote, Sudarshan
Patil, Twitter Sentimental Analysis using Hadoop,
International Journal of Computer Application, Vol. 2,
Issue 5, January 2015.
[8]. Mr. Agar Nadagoud, Channa basaveshwara, Market
Sentiment Analysis for Popularity of Flipkart, International
Journal of Advanced Research in Computer Engineering &
Technology (IJARCET), Vol. 4, Issue 5, May 2015.
