Analyzing Fire Department Calls in San Francisco:
Using HiveQL and MapReduce
CIS 798: Programming Techniques for Big Data
Nithin Kumar Kakkireni and Sharmila Vegesana
Graduate Students, Dept. of Computer Science, Kansas State University, Manhattan, KS
E-mail: kakkiren@ksu.edu, sharmila@ksu.edu
ABSTRACT: The fire department calls dataset is created and maintained by the San Francisco Fire Department (SFFD). It records every response by fire units to emergency calls. Several news articles suggest that the SFFD has been having trouble with its dispatch system and with organizing its ambulances. Analyzing these emergency calls gives deep insight into how to improve manpower and vehicular requirements and thereby provide better service to citizens. HiveQL and MapReduce programs have been used to query the data. After obtaining the query results, data visualization was performed to observe patterns among the call volumes, the locations and the timeline of these calls.
I. INTRODUCTION
The term ‘Big Data’ describes huge volumes of data in any form, structured or unstructured. The data comes from varied fields and is collected on a day-to-day basis. Various organizations analyze and mine this data to retrieve important information. The volume of data created and stored in the world every day is almost unimaginable, and its ever-growing nature makes cleaning the data a necessity. The sheer amount of data is not always what matters; what matters is the flow of information and the insights that can be obtained from analyzing it, which can lead to better and more strategic business decisions.
Volume, Variety and Velocity have emerged as a regular framework for describing Big Data; they are commonly known as the three Vs. Volume refers to the magnitude of the data, i.e. the amount of data being processed. The problem with data at this scale is that it is too much for a traditional relational database to handle; the ideal way to manage big data is to share the workload among multiple servers. Information can be drawn from a varied range of sources such as images, audio files, video files, statistical representations and text, which gives rise to the second V, Variety. Solutions designed for Big Data need to be able to process raw, unstructured data and extract structured meaning from it. Velocity is the speed at which the data can be shared, processed and delivered. When managing big data this is a major concern, given the amount of data being handled. Another point of focus is the speed at which the data is analyzed and the results are generated. Since the data we deal with gets updated on a daily basis, satisfying all these properties is a significant challenge.
The dataset used in this project is the fire emergency calls dataset. It includes all fire units' responses to calls. Each record contains the call number, incident number, address, unit identifier, call type, and disposition, along with all the relevant time intervals. There are multiple records for each call number because the dataset is based on responses and most calls involve multiple units. Addresses are associated with a block number, intersection or call box, not a specific address. The dataset amounted to around 1.7 GB in size and contained around 5 million records, collected over a span of 16 years from 2000 to 2016. The data for 2016 is not entirely complete, as we downloaded it before the end of the year. It covers calls recorded in and around the city of San Francisco. With so many attributes and records, the dataset presented a good challenge, and the resulting analysis reports can be used to improve the way the SFFD performs and is organized.
II. LITERATURE SURVEY
Technology Overview
HiveQL
All the queries on the data were run using HiveQL. Apache Hive is data warehouse software that enables reading, writing and managing of large datasets residing in distributed storage, using an SQL-like language. A table structure can be projected onto data already present in storage. Queries can be run from the command line, and JDBC drivers are provided to connect users to Hive. Hive is built on top of Apache Hadoop.
Hive provides a number of features, including tools that enable easy access to data via SQL, which supports data warehousing tasks such as extraction, transformation and loading. Hive also includes mechanisms to impose structure on a range of data formats, and it has direct access to files stored in the Hadoop Distributed File System (HDFS). It has a built-in mechanism for translating queries into MapReduce jobs, and it provides standard SQL functionality that can be used to analyze data stored in table format. Users can extend Hive with their own code using user-defined functions, user-defined aggregates and user-defined table functions. Hive's built-in text file format supports delimiters such as commas and tabs, and most of our files are in comma-separated or tab-separated form. Using Hive maximizes scalability, performance and fault tolerance.
The operations in Hive follow the following
routine:
• Data Preparation
• Extraction, Transformation and Loading
• Mining the data
• Optimization
Graphs and other visualizations cannot be produced directly in Hive, as the results Hive returns are in columnar (tabular) form.
MAP-REDUCE
MapReduce is the core component of the Apache Hadoop framework. It has two main functions: it distributes work to various nodes in the map phase, and it then organizes and reduces the results from each node into a reasonable answer to a query. When dealing with big data, the data needs to be distributed and the end results need to be collected. MapReduce performs parallel operations across huge clusters; jobs are split across any number of servers, and the intermediate results pass through partitioners and combiners where necessary before flowing to the reducers. MapReduce programs can be written in many languages, such as C, C++, Java and Python. Many MapReduce libraries are also available, so that programmers can create tasks without dealing with the communication or coordination between nodes.
We used Java to implement the MapReduce
concept on our data set.
Fig 1: Code Snippet: MapReduce Program in Java
Hive is one of the platforms that implements MapReduce at a higher level of abstraction. In practice, it provides an interface that has little to do with the map and reduce concepts directly; the system translates queries written in its higher-level language into a series of MapReduce jobs.
Numerical Summarization Pattern Concept:
The data we deal with is huge, and the dataset used in this project may be updated on a regular basis. Summarization analytics can be described as activities that group similar kinds of data together and perform operations over them, such as calculating statistics, building indexes, or simply counting. One of the best ways to extract the required values is to perform aggregate functions over groups of records. Numerical summarization was used in this project to perform aggregate calculations over the dataset. The greatest care that needs to be taken when performing these operations is to use the combiner properly and to clearly understand the calculations being performed.
Consider θ to be a generic numerical summarization function we wish to execute over some list of values (v1, v2, v3, ..., vn) to find a value λ, i.e. λ = θ(v1, v2, v3, ..., vn). Examples of θ include minimum, maximum, average, median, and standard deviation [13].
Numerical Summarization is applicable when we
are dealing with numerical data and when the
data can be grouped by certain fields or
attributes.
The structure of the numerical summarization
pattern is as follows:
Mapper: The mapper emits key/value pairs in which the key identifies the field (or fields) the records are grouped by and the value carries the numerical fields to be aggregated, much like the columns of a relational table over which an aggregate function is applied.
Combiner: The combiner performs partial aggregation to decrease the number of key/value pairs produced by the mapper before they are sent to the reducer.
Partitioner: The partitioner distributes the intermediate data across the configured number of reducers.
Reducer: The reducer receives the set of values (v1, v2, v3, ..., vn) associated with each group-by key and applies the aggregation function λ = θ(v1, v2, ..., vn).
Some of the numerical summarization examples
used in the project are count, min, max, and
average for calculating the desired values and
also for the analysis of results.
Fig 2: MapReduce flow
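To make the pattern concrete, the following is a minimal sketch of a count-by-call-type job in this style. It assumes the cleaned, comma-separated records described in Section IV (with the call type as the first field) and is only an illustration; the project's actual source code is provided in the supplementary data.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CallTypeCount {

  // Mapper: emit (call_type, 1) for every record; call_type is assumed
  // to be the first comma-separated field of the cleaned dataset.
  public static class CallTypeMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text callType = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      if (fields.length > 0 && !fields[0].isEmpty()) {
        callType.set(fields[0]);
        context.write(callType, ONE);
      }
    }
  }

  // Reducer: apply the aggregation function, here a simple count per call type.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable v : values) {
        total += v.get();
      }
      context.write(key, new IntWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "call type count");
    job.setJarByClass(CallTypeCount.class);
    job.setMapperClass(CallTypeMapper.class);
    job.setCombinerClass(SumReducer.class);   // partial aggregation on the map side
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The reducer doubles as the combiner here because a sum can be computed incrementally, which is exactly the care the pattern calls for when enabling the combiner.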
JFreeCharts
JFreeChart is a free Java chart library that makes it easy for developers to display professional-quality charts in their applications. Its extensive feature list includes a well-documented API, a flexible design that can easily be extended, and support for a variety of output types. We used it to turn our results into graphs, which aided in better visualization.
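For instance, a minimal JFreeChart sketch along these lines, assuming the per-call-type counts have already been read from an output text file (the values and file name below are placeholders, not project results), would be:

import java.io.File;
import java.util.LinkedHashMap;
import java.util.Map;

import org.jfree.chart.ChartFactory;
import org.jfree.chart.ChartUtilities;
import org.jfree.chart.JFreeChart;
import org.jfree.chart.plot.PlotOrientation;
import org.jfree.data.category.DefaultCategoryDataset;

public class CallChart {
  public static void main(String[] args) throws Exception {
    // Placeholder values; in the project these come from the Hive/MapReduce output files.
    Map<String, Integer> counts = new LinkedHashMap<>();
    counts.put("Medical Incident", 1200);
    counts.put("Structure Fire", 300);
    counts.put("Alarms", 450);

    DefaultCategoryDataset dataset = new DefaultCategoryDataset();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      dataset.addValue(e.getValue(), "Calls", e.getKey());
    }

    JFreeChart chart = ChartFactory.createBarChart(
        "Calls by call type", "Call type", "Number of calls",
        dataset, PlotOrientation.VERTICAL, false, true, false);

    // Save the chart as a PNG (ChartUtilities in JFreeChart 1.0.x; ChartUtils in 1.5+).
    ChartUtilities.saveChartAsPNG(new File("calls_by_type.png"), chart, 800, 600);
  }
}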
III. SOFTWARE REQUIREMENTS
Functional requirements
Our project can be divided into 3 parts:
• Loading the data, the installations and the command execution form the first part. The installation of Cloudera, VMWare, Hive, Java and JFreeCharts falls under this category.
• Writing MapReduce programs in Java and performing the Hive queries to study the data.
• Transferring the results obtained from the Hive queries and the MapReduce programs, in the form of text files, to JFreeCharts to produce graphs and other forms of visualization.
Non-Functional Requirements
SOFTWARE REQUIREMENTS
Technologies: Java, Hive, MapReduce, JFreeCharts
Operating System: Windows/Linux
User Interface: Java GUI
Scripting: Hive
Tools: Cloudera, VMWare Player
Data Storage: HDFS
HARDWARE REQUIREMENTS
RAM: 8 GB
Memory: 4 GB
Guest Operating System: 2 GB
Host Operating System: 2 GB
IV. IMPLEMENTATION
Fig 3: Flow of the Project
After going through various news articles from San Francisco newspapers, we were inspired to work on this particular dataset. Through this project we intend to provide analysis reports of the dataset that can be used to improve manpower and vehicular requirements and thereby provide better service to citizens. Based on locality-wise results, a particular area and its Fire Department unit can be better prepared for certain kinds of incidents, and analysis of call volumes will help in improving the dispatch system.
The data was interesting but had a lot of redundancies and garbage values. Data cleaning generally deals with analysing the data and detecting and removing the unwanted and inconsistent entries present in it. These types of problems are found in single data collections, in files and in databases. Some examples of inconsistencies are fields misspelled during data entry, invalid data, and fields that were never entered or are empty. Data cleaning is considered one of the biggest problems in data warehousing. To have access to accurate and consistent data, eliminating errors, inconsistencies and duplicate information becomes necessary, and this was our first priority.
There are many methods for obtaining accurate data, such as filtering. In our project, data cleaning was implemented in Java. The dataset was retrieved in CSV format. We used Java's 'io' BufferedReader API to read the data, and the dataset was then loaded into HDFS using Cloudera. A ',' delimiter was used to separate the entries. The HDFS file was parsed using Java and a 'jar' file was built; the generated output was written to a CSV file and placed back into HDFS. Cleaning was done in such a way that we omitted the columns not necessary for our analysis, such as dispatch time and call number, and removed duplicate calls having the same records. We converted the timestamp into separate, more useful fields such as date, month and year, and retrieved the day of the week for each timestamp using the java.util library. For attributes that had no data in their respective fields, we replaced the missing values with the keyword 'Miscellaneous'.
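A simplified sketch of this cleaning step is shown below. The input and output file names, the column positions and the exact set of retained fields are illustrative assumptions, not the project's actual cleaning code.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.text.SimpleDateFormat;
import java.util.Date;

public class CleanCalls {
  public static void main(String[] args) throws Exception {
    SimpleDateFormat rawDate = new SimpleDateFormat("MM/dd/yyyy");  // assumed raw date format
    SimpleDateFormat dayFmt = new SimpleDateFormat("EEEE");         // day of the week
    try (BufferedReader reader = new BufferedReader(new FileReader("Fire_Department_Calls.csv"));
         PrintWriter writer = new PrintWriter(new FileWriter("new_output.csv"))) {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] f = line.split(",");
        if (f.length < 10) continue;                                 // skip malformed rows
        String callType = f[3].isEmpty() ? "Miscellaneous" : f[3];   // fill empty fields
        Date callDate = rawDate.parse(f[4]);
        String[] mdy = f[4].split("/");                              // split timestamp into month/day/year
        writer.println(callType + "," + dayFmt.format(callDate) + ","
            + mdy[0] + "," + mdy[1] + "," + mdy[2]);
      }
    }
  }
}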
The database for the project was created using
the following command:
CREATE database 798project;
The table is created in the database using the 'create table' command, with its columns matching the newly cleaned dataset:
Create table calls_fire(
  call_type String, day String, month int, day_of_month int, year int,
  Call_Final_disposition String, Street string, Zipcode int,
  Batallion String, Station_area int, Box int, O_priority int,
  F_Priority int, call_type_group String, unit_type String,
  neighborhood_district String, Latitude int, Longitude int)
row format delimited fields terminated by ',';
The dataset is then loaded into the table created in the above step using the LOAD DATA command, giving the path of the cleaned output file, with ',' as the delimiter.
LOAD DATA LOCAL INPATH
'/home/cloudera/798project/new_output.csv'
OVERWRITE INTO TABLE calls_fire;
Different queries are performed and their results are stored in an output directory consisting of one or more files, depending on the number of reducers used to execute the query.
MapReduce programs were written for some of the queries, even though they could be done in Hive, in order to understand the deeper concepts of MapReduce. We compared the running and execution time of the same query in MapReduce and in Hive, and concluded that Hive takes less execution time for those particular queries. There may be various reasons for this. One possibility is the number of mappers and reducers involved: in Hive, the mappers and reducers are decided automatically according to the data volume and the query being executed.
Hive used three mappers and three reducers for this dataset. In the MapReduce code, by contrast, it was quite a challenge to decide on the number of mappers and reducers to keep the execution time low. The optimal number of mappers and reducers has a large impact on performance. The main aim is to balance CPU power, the amount of data processed by each mapper, the data sent to the reducers, and the output generated by the reducers.
Quoting from Hadoop: The Definitive Guide, 3rd Edition: "Because MapReduce jobs are normally I/O-bound, it makes sense to have more tasks than processors to get better utilization. The amount of oversubscription depends on the CPU utilization of jobs you run, but a good rule of thumb is to have a factor of between one and two more tasks (counting both map and reduce tasks) than processors" (where a processor may be taken as one logical core).
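In the driver, these counts can be influenced along the following lines (a minimal sketch against the standard org.apache.hadoop.mapreduce API; the values shown are illustrative, not the settings we finally used):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class TaskTuning {
  // Returns a Job with an explicit number of reducers and a capped input split
  // size; the number of mappers then follows from the number of splits.
  public static Job configure(int reducers, long maxSplitBytes) throws Exception {
    Job job = Job.getInstance(new Configuration(), "tuned job");
    job.setNumReduceTasks(reducers);                    // e.g. 1-3 for a dataset of this size
    FileInputFormat.setMaxInputSplitSize(job, maxSplitBytes);
    return job;
  }
}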
The output files produced by the Hive queries and the MapReduce programs are given as input to a JFreeChart Java program to generate and plot the results and analyse the data. Different kinds of analysis were performed and deductions were made with respect to the output of the JFreeChart graphs.
V. RESULTS
After the query results were obtained, we developed a Graphical User Interface (GUI) in Java. The GUI was built with JFrame and JButtons so that data can be loaded directly from the output files and the results generated with a single button click. Shown below is a simple GUI panel whose buttons retrieve the results in one click.
Fig 4: Java GUI
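The panel itself is ordinary Swing code. A stripped-down sketch is shown below; the button labels and the placeholder click action are illustrative rather than the project's exact code.

import javax.swing.JButton;
import javax.swing.JFrame;
import javax.swing.JPanel;
import javax.swing.SwingUtilities;

public class ResultsGui {
  public static void main(String[] args) {
    SwingUtilities.invokeLater(() -> {
      JFrame frame = new JFrame("SF Fire Calls Analysis");
      JPanel panel = new JPanel();

      JButton allCalls = new JButton("All types of calls in a year");
      // On click, the project's code reads the query output file and plots it
      // with JFreeChart; a placeholder action is used here.
      allCalls.addActionListener(e -> System.out.println("load output file and plot"));
      panel.add(allCalls);

      panel.add(new JButton("Calls during days of the week"));
      panel.add(new JButton("Top 5 call types each year"));

      frame.add(panel);
      frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
      frame.pack();
      frame.setVisible(true);
    });
  }
}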
All types of calls in all years:
A Java MapReduce program and a Hive query were both used to retrieve the results. A code snippet is given below, and the comparison between the MapReduce program and the Hive query can be seen clearly in Fig 5 and Fig 6. As the figures show, the total time taken for the Hive query to execute is around 18 seconds, whereas the MapReduce code with a single mapper and a single reducer takes around 50 seconds.
Fig 5. MapReduce program execution time
Fig 6: Hive Query execution time
The result of the query and the MapReduce program is shown in Fig 7. It displays, for the year 2000, the total number of calls recorded in the dataset for each call type.
Fig 7: Results of 'All types of calls in a year' query
Query performed:
Select count(call_type),
call_type
from calls_fire
Where year = 2000
Group By call_type;
The source code for the MapReduce program is provided in the supplementary data. We created file paths and stored the input and output files for the MapReduce program in HDFS. The program was compiled and executed using the following commands:
$ mkdir -p build
$ javac -cp /usr/lib/hadoop/*:/usr/lib/hadoop-mapreduce/* mr.java -d build -Xlint
$ jar -cvf mr.jar -C build/ .
$ hadoop jar mr.jar org.myorg.mr /user/cloudera/mapreduce/input /user/cloudera/mapreduce/output-monthmr
$ hadoop fs -cat /user/cloudera/798Project/output_ct/*
When multiple output files are created, they are joined using the following command:
$ cat '/home/cloudera/798Project/month'/* > month.txt
Calls during the days of the week:
This analysis checks whether fire incidents tend to occur on a particular day of the week. The analysis covers the past three years, and the days of the week were examined to see which one has the highest number of calls. From the graph it is easy to see how the calls were distributed over the week for the past three years. The result for the query is shown in Fig 8.
Fig 8: Calls during all the days of the week for years 2012-2015
Query performed:
Select count(call_type), day
from calls_fire Group By day;
Count of calls during Office and Non-office hours:
After analysing the calls in a day between 9 am and 5 pm (termed office hours) and from 5 pm to 9 am (termed non-office hours), we noticed that the number of calls during office hours was extremely high compared to non-office hours. The result can be viewed in Fig 9.
Fig 9: Comparison of calls during office and non-office hours
This comparison was computed with a MapReduce program; the full program is attached in the supplementary files.
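A sketch of the map step for this split is given below. It assumes an hour-of-day field is available in the cleaned record (the column index is hypothetical); a summing reducer such as the one in the earlier call-type sketch totals the two buckets.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper: classify each call as "office" (9:00-16:59) or "non-office" and emit a 1.
public class OfficeHoursMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",");
    if (fields.length < 6) return;                // skip malformed rows
    int hour;
    try {
      hour = Integer.parseInt(fields[5].trim());  // hypothetical hour-of-day column
    } catch (NumberFormatException e) {
      return;                                     // skip headers or bad rows
    }
    String bucket = (hour >= 9 && hour < 17) ? "office" : "non-office";
    context.write(new Text(bucket), ONE);
  }
}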
Top 5 call-types each year:
One way to mitigate accidents is to analyze the call types with the highest volumes and their percentage of occurrence each year. The result shown in Fig 11 gives the top incidents occurring in a year in the form of a pie chart. When the 'top 5 call types each year' option is selected in the GUI, a JOptionPane dialog pops up, as shown in Fig 10, prompting the user for the year of interest; a pie chart of the top call types for that year is then produced.
Fig 10: Popup to enter year
After entering the year 2014, the chart shows the most frequent types of incidents that occurred in that particular year.
Fig 11: Top 5 call types in a year
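The Hive side of this analysis can be expressed along the following lines (a sketch against the calls_fire table; the exact statement used in the project may differ slightly):

Select call_type, count(call_type) as total
from calls_fire
Where year = 2014
Group By call_type
Order By total DESC
Limit 5;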
Top-3 types of incidents, grouped locality wise:
This part of the analysis can be considered one of the most important in our report. After obtaining the top-5 incidents from the previous query, we narrowed the focus to the top three incidents, which turned out to be medical incidents, structure fires and alarms. In this query we find the top-5 localities for the past two years, so that awareness can be increased and appropriate safety measures can be taken in those locations. When this option is selected in the GUI panel, a JOptionPane input box asks the user which type of incident to analyze by zipcode. The input dialog and the result are shown in Fig 12 and Fig 13.
Fig 12: JOptionPane to enter type of incident
Fig 13: The top 5 localities where the incident has occurred in 2014 and
2015
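A query sketch for this locality-wise grouping, again illustrative rather than the exact statement used in the project:

Select Zipcode, count(call_type) as total
from calls_fire
Where call_type = 'Medical Incident' and year in (2014, 2015)
Group By Zipcode
Order By total DESC
Limit 5;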
Type of Incidents each year:
Each record has another attribute with entries such as potentially life threatening, non-potentially life threatening and fire only. We analyzed the frequency of each such entry to determine whether the potentially life threatening situations have increased, decreased or remained unchanged from previous years. We can study the reasons behind these occurrences and the kinds of situations surrounding the fires, and we can interlink this with the locality analysis done previously. The results are shown in Fig 14 and show whether the life threatening situations are increasing or decreasing by year; extra measures can then be taken to prevent them from increasing further.
Fig 14: Level of Incidents
The number of calls grouped by month in each year:
This query is important for our analysis because it helps us predict the likely number of calls per month in the future. By reviewing these call statistics, the fire department can be better prepared for the calls and the large volumes. The result shown in Fig 15 is for the years 2014-2016.
Fig 15: Monthly calls
Priorities of Calls and Incidents
Each call is classified into one of 3 priorities (1, 2, 3), with 1 being less severe, 2 severe and 3 most severe. The dataset contains both the initial priority entered by the call receiver at the fire department and the priority to which the situation had actually escalated by the time the fire engine reached the spot. This analysis tells us whether there is a mismatch between the assumed priority and the end result at the location. If there is, the fire department should be more conscientious in training its employees: by assessing the correct priority, the right number of ambulances or fire engines can be dispatched for the situation.
Result: The result depicted in Fig 16 shows the total number of calls with a difference between the assumed and actual priority, in their respective years.
Fig 16: The difference between original and final priorities
For an in-depth understanding of MapReduce methods and concepts, we also implemented this query in MapReduce; snippets of that code can be found in Fig 17 and Fig 18.
Fig 17: Code Snippet: Mapper code
Fig 18: Code Snippet: Reducer code
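For readers without access to the figures, a condensed sketch of the mapper and reducer for this priority comparison follows. The column indices for year and the two priority fields are assumptions based on the table layout in Section IV, not the exact code from Fig 17 and Fig 18.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PriorityMismatch {

  // Mapper: emit (year, 1) whenever the original and final priorities differ.
  public static class MismatchMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] f = value.toString().split(",");
      if (f.length < 13) return;               // assumed cleaned-record layout
      String year = f[4];                      // year column (see Section IV schema)
      String originalPriority = f[11];         // O_priority
      String finalPriority = f[12];            // F_Priority
      if (!originalPriority.equals(finalPriority)) {
        context.write(new Text(year), ONE);
      }
    }
  }

  // Reducer: total the mismatched calls per year.
  public static class MismatchReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable v : values) {
        total += v.get();
      }
      context.write(key, new IntWritable(total));
    }
  }
}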
Checking whether Medical Incidents are increasing year by year:
Having identified the top 3 incidents occurring each year, we analyzed whether these incidents show an increasing or decreasing pattern, so that measures can be prepared in advance to reduce them. The result shown in Fig 19 trends the Medical Incident, Structure Fire and Alarm call types in each year.
Fig 19: Trend of calls on monthly basis
VI. CONCLUSIONS
We have deduced the following conclusions
from analyzing the above results:
• There is an equal chance of fire hazards and accidents occurring on all days of the week, irrespective of weekdays and weekends.
• The majority of the fire incidents occur in the 9:00-17:00 time range. Even though this time frame covers fewer hours than the rest of the day, the call volume during this period is almost twice that of the remaining hours.
• The top 5 incidents occurring in most years are Medical Incidents, Structure Fires, Alarms, Traffic Collisions and Miscellaneous.
• Most of the medical incidents over the past two years happened in the localities 94102 and 94103.
• Alarms occur mostly in the areas 94102, 94103, 94109, 94110 and 94107.
• Observing the patterns, we notice that Non-Life Threatening and Potentially Life Threatening calls have increased while the fire incidents have decreased.
• From the patterns observed in 2014 and 2015, we predict that the calls in December 2016 will increase from the present value.
• Even though the difference in priorities is small, the affected call volume is still high enough to endanger lives and property, and measures should be taken to avoid this.
• Medical incidents show a strictly increasing curve, alarms occur at a roughly constant rate, and the number of structure fires has decreased over the years.
VII. FUTURE ENHANCEMENTS
We could use Hadoop HUE to plot and formulate graphs to aid in the visual analysis. Spark could be used to increase efficiency and ease the computational load. Deeper analysis could be performed, such as comparing the dispatch unit attribute and the incident with the battalion attribute. We could also plot the data geospatially using the street, latitude and longitude attributes given in the dataset.
VIII. ACKNOWLEDGEMENT
This project would not have been completed without the support of our instructor Dr. William H. Hsu, Associate Professor, Department of Computer Science, Kansas State University.
IX. REFERENCES
1. Dataset: https://data.sfgov.org/
2. https://www.firerescue1.com/communications-interoperability/articles/1944513-Is-San-Franciscos-EMS-911-systems-stressed-to-breaking-point/
3. http://www.sfexaminer.com/unreliable-dispatch-system-exacerbates-flaws-in-sfs-emergency-response/
4. http://www.sfgate.com/bayarea/article/Why-S-F-still-counts-on-street-fire-alarm-boxes-3081293.php
5. http://www.thesfnews.com/sffd-engine-1-ranked-busiest-in-nation/22197
6. http://hortonworks.com/hadoop/hive/
7. http://hortonworks.com/wp-content/uploads/downloads/2013/08/Hortonworks.CheatSheet.SQLtoHive.pdf
8. https://www.ijircce.com/upload/2013/october/27Predictive.pdf
9. https://www.ijircce.com/upload/2015/may/74_32_Statistical.pdf
10. https://www.researchgate.net/publication/301801698_Earthquake_Data_Analysis_and_Visualization_using_Big_Data_Tool
11. http://www.ejournalofscience.org/archive/vol3no12/vol3no12_16.pdf
12. http://scientific-journals.org/journalofsystemsandsoftware/archive/vol5no2/vol5no2_3.pdf and http://betterevaluation.org/sites/default/files/data_cleaning.pdf
13. http://barbie.uta.edu/~jli/Resources/MapReduce&Hadoop/MapReduce%20Design%20Patterns.pdf
14. http://stackoverflow.com/questions/20307404/hadoop-number-of-mappers-and-reducers