Analyzing Fire Department Calls in San Francisco:
Using HiveQL and MapReduce
CIS 798: Programming Techniques for Big Data
Nithin Kumar Kakkireni and Sharmila Vegesana
Graduate Students, Dept. of Computer Science, Kansas State University, Manhattan, KS
E-mail: kakkiren@ksu.edu, sharmila@ksu.edu
ABSTRACT: The fire department calls dataset is created and maintained by the San Francisco Fire Department (SFFD). It records every response by fire units to emergency calls. Several news articles suggest that the SFFD has been having trouble with its dispatch system and with organizing its ambulances. Analyzing these emergency calls gives deep insight into how to improve manpower and vehicular requirements and thereby provide better service to citizens. HiveQL and MapReduce programs have been used to query the data. After obtaining the query results, data visualization was performed to observe patterns among the call volumes, the locations and the timeline of these calls.
I. INTRODUCTION
The term ‘Big Data’ describes huge volumes of data in any form, structured or unstructured. The data comes from varied fields and is collected on a day-to-day basis. Various organizations analyze and mine this data to retrieve important information. The volume of data created and stored in the world every day is almost unimaginable, and its ever-growing nature makes cleaning the data a necessity. The sheer amount of data is not always what matters; what matters is the flow of information and the insights that can be obtained from analyzing it, which can lead to better and more strategic business decisions.
Volume, Variety and Velocity have emerged as a regular framework for describing Big Data; they are commonly known as the three Vs. Volume refers to the magnitude of the data, i.e. the amount of data being processed. The problem with data at this scale is that it is too much for a traditional relational database to handle; the ideal way to manage big data is to share the workload among multiple servers. Information can be drawn from a varied range of sources such as images, audio files, video files, statistical representations and text, which gives rise to the second V, Variety. Solutions designed for Big Data need to be able to process raw, unstructured data and extract structured meaning from it. Velocity is the speed at which the data can be shared, processed and delivered. When managing big data this is a major concern, given the amount of data being handled. Another point of focus is the speed at which the data is analyzed and the results are generated. Since the data we deal with gets updated on a daily basis, satisfying all these properties is a significant challenge.
The dataset used in this project is the fire emergency calls dataset. It includes all fire units' responses to calls. Each record contains the call number, incident number, address, unit identifier, call type, and disposition, along with all the relevant time intervals. There are multiple records for each call number because the dataset is based on responses and most calls involve multiple units. Addresses are associated with a block number, intersection or call box, not a specific address. The dataset amounted to around 1.7 GB in size and contained around 5 million records, collected over a span of 16 years from 2000 to 2016. The data for 2016 is not entirely complete, as we downloaded it before the end of the year. It covers calls recorded in and around the city of San Francisco. With so many attributes and records, the dataset presented a good challenge, and the resulting analysis reports can be used to improve the way the SFFD performs and is organized.
II. LITERATURE SURVEY
Technology Overview
HiveQL
All the queries on the data were run using HiveQL. Apache Hive is data warehouse software that enables reading, writing and managing of large datasets residing in distributed storage, using an SQL-like language. A table structure can be projected onto data already present in storage. Queries can be run from the command line, and JDBC drivers are provided to connect users to Hive. Hive is built on top of Apache Hadoop.
Hive provides a number of features, including tools that enable easy access to data via SQL, which supports data warehousing tasks such as extraction, transformation and loading. Hive also includes mechanisms to impose structure on a range of data formats, and it has direct access to files stored in the Hadoop Distributed File System (HDFS). It has a built-in mechanism for translating queries into MapReduce jobs, and it provides standard SQL functionality that can be used to analyze data stored in table format. Users can extend Hive with their own code using user-defined functions, user-defined aggregates and user-defined table functions. Hive's built-in text file format supports delimiters such as commas and tabs, and most of our files are in comma-separated or tab-separated form. Using Hive maximizes scalability, performance and fault tolerance.
The operations in Hive follow the following
routine:
• Data Preparation
• Extraction, Transformation and Loading
• Mining the data
• Optimization
Graphs and other visualizations cannot be produced directly in Hive, as the results Hive returns are in columnar (tabular) form.
MAP-REDUCE
MapReduce is the core component of the Apache Hadoop framework. It has two main functions: it distributes work to various nodes in the map phase, and it then organizes and reduces the results from each node into a reasonable answer to a query. When dealing with big data, the data needs to be distributed and the end results need to be collected. MapReduce performs parallel operations across huge clusters; jobs are split across any number of servers, and the intermediate results pass through partitioners and combiners where necessary before flowing to the reducers. MapReduce programs can be written in many languages, such as C, C++, Java and Python. Many MapReduce libraries are also available, so that programmers can create tasks without dealing with the communication or coordination between nodes.
We used Java to implement the MapReduce
concept on our data set.
Fig 1: Code Snippet: MapReduce Program in Java
Hive is one of the platforms that implements MapReduce at a higher level of abstraction. In practice, it provides an interface that has little to do with the map and reduce concepts directly; the system translates queries written in its higher-level language into a series of MapReduce jobs.
Numerical Summarization Pattern Concept:
The data we deal with is huge, and the dataset used in this project may be updated on a regular basis. Summarization analytics can be described as activities that group similar kinds of data together and perform operations over them, such as calculating statistics, building indexes, or simply counting. One of the best ways to extract the required values is to perform aggregate functions over groups of records. Numerical summarization was used in this project to perform aggregate calculations over the dataset. The greatest care that needs to be taken when performing these operations is to use the combiner properly and to clearly understand the calculations being performed.
Consider θ to be a generic numerical summarization function we wish to execute over some list of values (v1, v2, v3, ..., vn) to find a value λ, i.e. λ = θ(v1, v2, v3, ..., vn). Examples of θ include minimum, maximum, average, median, and standard deviation [13].
Numerical Summarization is applicable when we
are dealing with numerical data and when the
data can be grouped by certain fields or
attributes.
The structure of the numerical summarization
pattern is as follows:
Mapper: The mapper emits key/value pairs in which the key identifies the field (or fields) the records are grouped by and the value carries the numerical fields to be aggregated, much like the columns of a relational table over which an aggregate function is applied.
Combiner: The combiner performs partial aggregation to decrease the number of key/value pairs produced by the mapper before they are sent to the reducer.
Partitioner: The partitioner distributes the intermediate data across the configured number of reducers.
Reducer: The reducer receives the set of values (v1, v2, v3, ..., vn) associated with each group-by key and applies the aggregation function λ = θ(v1, v2, ..., vn).
Some of the numerical summarization examples
used in the project are count, min, max, and
average for calculating the desired values and
also for the analysis of results.
Fig 2: MapReduce flow
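To make the pattern concrete, the following is a minimal sketch of a count-by-call-type job in this style. It assumes the cleaned, comma-separated records described in Section IV (with the call type as the first field) and is only an illustration; the project's actual source code is provided in the supplementary data.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CallTypeCount {

  // Mapper: emit (call_type, 1) for every record; call_type is assumed
  // to be the first comma-separated field of the cleaned dataset.
  public static class CallTypeMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text callType = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      if (fields.length > 0 && !fields[0].isEmpty()) {
        callType.set(fields[0]);
        context.write(callType, ONE);
      }
    }
  }

  // Reducer: apply the aggregation function, here a simple count per call type.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable v : values) {
        total += v.get();
      }
      context.write(key, new IntWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "call type count");
    job.setJarByClass(CallTypeCount.class);
    job.setMapperClass(CallTypeMapper.class);
    job.setCombinerClass(SumReducer.class);   // partial aggregation on the map side
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The reducer doubles as the combiner here because a sum can be computed incrementally, which is exactly the care the pattern calls for when enabling the combiner.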
JFreeCharts
JFreeChart is a free Java chart library that makes it easy for developers to display professional-quality charts in their applications. Its extensive feature list includes a well-documented API, a flexible design that can easily be extended, and support for a variety of output types. We used it to turn our results into graphs, which aided in better visualization.
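For instance, a minimal JFreeChart sketch along these lines, assuming the per-call-type counts have already been read from an output text file (the values and file name below are placeholders, not project results), would be:

import java.io.File;
import java.util.LinkedHashMap;
import java.util.Map;

import org.jfree.chart.ChartFactory;
import org.jfree.chart.ChartUtilities;
import org.jfree.chart.JFreeChart;
import org.jfree.chart.plot.PlotOrientation;
import org.jfree.data.category.DefaultCategoryDataset;

public class CallChart {
  public static void main(String[] args) throws Exception {
    // Placeholder values; in the project these come from the Hive/MapReduce output files.
    Map<String, Integer> counts = new LinkedHashMap<>();
    counts.put("Medical Incident", 1200);
    counts.put("Structure Fire", 300);
    counts.put("Alarms", 450);

    DefaultCategoryDataset dataset = new DefaultCategoryDataset();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      dataset.addValue(e.getValue(), "Calls", e.getKey());
    }

    JFreeChart chart = ChartFactory.createBarChart(
        "Calls by call type", "Call type", "Number of calls",
        dataset, PlotOrientation.VERTICAL, false, true, false);

    // Save the chart as a PNG (ChartUtilities in JFreeChart 1.0.x; ChartUtils in 1.5+).
    ChartUtilities.saveChartAsPNG(new File("calls_by_type.png"), chart, 800, 600);
  }
}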
III. SOFTWARE REQUIREMENTS
Functional requirements
Our project can be divided into 3 parts:
• Loading the data, the installations and the command execution form the first part. The installation of Cloudera, VMWare, Hive, Java and JFreeCharts falls under this category.
• Writing MapReduce programs in Java and performing the Hive queries to study the data.
• Transferring the results obtained from the Hive queries and the MapReduce programs, in the form of text files, to JFreeCharts to produce graphs and other forms of visualization.
Non-Functional Requirements
SOFTWARE REQUIREMENTS
Technologies: Java, Hive, MapReduce, JFreeCharts
Operating System: Windows/Linux
User Interface: Java GUI
Scripting: Hive
Tools: Cloudera, VMWare Player
Data Storage: HDFS
HARDWARE REQUIREMENTS
RAM: 8 GB
Memory: 4 GB
Guest Operating System: 2 GB
Host Operating System: 2 GB
IV. IMPLEMENTATION
Fig 3: Flow of the Project
After going through various news articles from San Francisco newspapers, we were inspired to work on this particular dataset. Through this project we intend to provide analysis reports of the dataset that can be used to improve manpower and vehicular requirements and thereby provide better service to citizens. Based on locality-wise results, a particular area and its Fire Department unit can be better prepared for certain kinds of incidents, and analysis of call volumes will help in improving the dispatch system.
The data was interesting but had a lot of redundancies and garbage values. Data cleaning generally deals with analysing the data and detecting and removing the unwanted and inconsistent entries present in it. These types of problems are found in single data collections, in files and in databases. Some examples of inconsistencies are fields misspelled during data entry, invalid data, and fields that were never entered or are empty. Data cleaning is considered one of the biggest problems in data warehousing. To have access to accurate and consistent data, eliminating errors, inconsistencies and duplicate information becomes necessary, and this was our first priority.
There are many methods for obtaining accurate data, such as filtering. In our project, data cleaning was implemented in Java. The dataset was retrieved in CSV format. We used Java's 'io' BufferedReader API to read the data, and the dataset was then loaded into HDFS using Cloudera. A ',' delimiter was used to separate the entries. The HDFS file was parsed using Java and a 'jar' file was built; the generated output was written to a CSV file and placed back into HDFS. Cleaning was done in such a way that we omitted the columns not necessary for our analysis, such as dispatch time and call number, and removed duplicate calls having the same records. We converted the timestamp into separate, more useful fields such as date, month and year, and retrieved the day of the week for each timestamp using the java.util library. For attributes that had no data in their respective fields, we replaced the missing values with the keyword 'Miscellaneous'.
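A simplified sketch of this cleaning step is shown below. The input and output file names, the column positions and the exact set of retained fields are illustrative assumptions, not the project's actual cleaning code.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.text.SimpleDateFormat;
import java.util.Date;

public class CleanCalls {
  public static void main(String[] args) throws Exception {
    SimpleDateFormat rawDate = new SimpleDateFormat("MM/dd/yyyy");  // assumed raw date format
    SimpleDateFormat dayFmt = new SimpleDateFormat("EEEE");         // day of the week
    try (BufferedReader reader = new BufferedReader(new FileReader("Fire_Department_Calls.csv"));
         PrintWriter writer = new PrintWriter(new FileWriter("new_output.csv"))) {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] f = line.split(",");
        if (f.length < 10) continue;                                 // skip malformed rows
        String callType = f[3].isEmpty() ? "Miscellaneous" : f[3];   // fill empty fields
        Date callDate = rawDate.parse(f[4]);
        String[] mdy = f[4].split("/");                              // split timestamp into month/day/year
        writer.println(callType + "," + dayFmt.format(callDate) + ","
            + mdy[0] + "," + mdy[1] + "," + mdy[2]);
      }
    }
  }
}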
The database for the project was created using
the following command:
CREATE database 798project;
The table is created in the database using the 'create table' command, with its columns matching the newly cleaned dataset:
Create table calls_fire(
  call_type String, day String, month int, day_of_month int, year int,
  Call_Final_disposition String, Street string, Zipcode int,
  Batallion String, Station_area int, Box int, O_priority int,
  F_Priority int, call_type_group String, unit_type String,
  neighborhood_district String, Latitude int, Longitude int)
row format delimited fields terminated by ',';
The dataset is then loaded into the table created in the above step using the LOAD DATA command, giving the path of the cleaned output file, with ',' as the delimiter.
LOAD DATA LOCAL INPATH
'/home/cloudera/798project/new_output.csv'
OVERWRITE INTO TABLE calls_fire;
Different queries are performed and their results are stored in an output directory consisting of one or more files, depending on the number of reducers used to execute the query.
MapReduce programs were written for some of the queries, even though they could be done in Hive, in order to understand the deeper concepts of MapReduce. We compared the running and execution time of the same query in MapReduce and in Hive, and concluded that Hive takes less execution time for those particular queries. There may be various reasons for this. One possibility is the number of mappers and reducers involved: in Hive, the mappers and reducers are decided automatically according to the data volume and the query being executed.
Hive used three mappers and three reducers for this dataset. In the MapReduce code, by contrast, it was quite a challenge to decide on the number of mappers and reducers to keep the execution time low. The optimal number of mappers and reducers has a large impact on performance. The main aim is to balance CPU power, the amount of data processed by each mapper, the data sent to the reducers, and the output generated by the reducers.
Quoting from Hadoop: The Definitive Guide, 3rd Edition: "Because MapReduce jobs are normally I/O-bound, it makes sense to have more tasks than processors to get better utilization. The amount of oversubscription depends on the CPU utilization of jobs you run, but a good rule of thumb is to have a factor of between one and two more tasks (counting both map and reduce tasks) than processors" (where a processor may be taken as one logical core).
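In the driver, these counts can be influenced along the following lines (a minimal sketch against the standard org.apache.hadoop.mapreduce API; the values shown are illustrative, not the settings we finally used):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class TaskTuning {
  // Returns a Job with an explicit number of reducers and a capped input split
  // size; the number of mappers then follows from the number of splits.
  public static Job configure(int reducers, long maxSplitBytes) throws Exception {
    Job job = Job.getInstance(new Configuration(), "tuned job");
    job.setNumReduceTasks(reducers);                    // e.g. 1-3 for a dataset of this size
    FileInputFormat.setMaxInputSplitSize(job, maxSplitBytes);
    return job;
  }
}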
The output files produced by the Hive queries and the MapReduce programs are given as input to a JFreeChart Java program to generate and plot the results and analyse the data. Different kinds of analysis were performed and deductions were made with respect to the output of the JFreeChart graphs.
V. RESULTS
After the query results were obtained, we developed a Graphical User Interface (GUI) in Java. The GUI was built with JFrame and JButtons so that data can be loaded directly from the output files and the results generated with a single button click. Shown below is a simple GUI panel whose buttons retrieve the results in one click.
Fig 4: Java GUI
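The panel itself is ordinary Swing code. A stripped-down sketch is shown below; the button labels and the placeholder click action are illustrative rather than the project's exact code.

import javax.swing.JButton;
import javax.swing.JFrame;
import javax.swing.JPanel;
import javax.swing.SwingUtilities;

public class ResultsGui {
  public static void main(String[] args) {
    SwingUtilities.invokeLater(() -> {
      JFrame frame = new JFrame("SF Fire Calls Analysis");
      JPanel panel = new JPanel();

      JButton allCalls = new JButton("All types of calls in a year");
      // On click, the project's code reads the query output file and plots it
      // with JFreeChart; a placeholder action is used here.
      allCalls.addActionListener(e -> System.out.println("load output file and plot"));
      panel.add(allCalls);

      panel.add(new JButton("Calls during days of the week"));
      panel.add(new JButton("Top 5 call types each year"));

      frame.add(panel);
      frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
      frame.pack();
      frame.setVisible(true);
    });
  }
}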
All types of calls in all years:
A Java MapReduce program and a Hive query were both used to retrieve the results. A code snippet is given below, and the comparison between the MapReduce program and the Hive query can be seen clearly in Fig 5 and Fig 6. As the figures show, the total time taken for the Hive query to execute is around 18 seconds, whereas the MapReduce code with a single mapper and a single reducer takes around 50 seconds.
Fig 5. MapReduce program execution time
Fig 6: Hive Query execution time
The result of the query and the MapReduce program is shown in Fig 7. It displays, for the year 2000, the total number of calls recorded in the dataset for each call type.
Fig 7: Results of 'All types of calls in a year' query
Query performed:
Select count(call_type),
call_type
from calls_fire
Where year = 2000
Group By call_type;
The source code for the MapReduce program is provided in the supplementary data. We created file paths and stored the input and output files for the MapReduce program in HDFS. The program was compiled and executed using the following commands:
$ mkdir -p build
$ javac -cp /usr/lib/hadoop/*:/usr/lib/hadoop-mapreduce/* mr.java -d build -Xlint
$ jar -cvf mr.jar -C build/ .
$ hadoop jar mr.jar org.myorg.mr /user/cloudera/mapreduce/input /user/cloudera/mapreduce/output-monthmr
$ hadoop fs -cat /user/cloudera/798Project/output_ct/*
When multiple output files are created, they are joined using the following command:
$ cat '/home/cloudera/798Project/month'/* > month.txt
Calls during the days of the week:
This analysis checks whether fire incidents tend to occur on a particular day of the week. The analysis covers the past three years, and the days of the week were examined to see which one has the highest number of calls. From the graph it is easy to see how the calls were distributed over the week for the past three years. The result for the query is shown in Fig 8.
Fig 8: Calls during all the days of the week for years 2012-2015
Query performed:
Select count(call_type), day
from calls_fire Group By day;
Count of calls during Office and Non-office hours:
After analysing the calls in a day between 9 am and 5 pm (termed office hours) and from 5 pm to 9 am (termed non-office hours), we noticed that the number of calls during office hours was extremely high compared to non-office hours. The result can be viewed in Fig 9.
Fig 9: Comparison of calls during office and non-office hours
This comparison was computed with a MapReduce program; the full program is attached in the supplementary files.
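A sketch of the map step for this split is given below. It assumes an hour-of-day field is available in the cleaned record (the column index is hypothetical); a summing reducer such as the one in the earlier call-type sketch totals the two buckets.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper: classify each call as "office" (9:00-16:59) or "non-office" and emit a 1.
public class OfficeHoursMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",");
    if (fields.length < 6) return;                // skip malformed rows
    int hour;
    try {
      hour = Integer.parseInt(fields[5].trim());  // hypothetical hour-of-day column
    } catch (NumberFormatException e) {
      return;                                     // skip headers or bad rows
    }
    String bucket = (hour >= 9 && hour < 17) ? "office" : "non-office";
    context.write(new Text(bucket), ONE);
  }
}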
Top 5 call-types each year:
One way to mitigate accidents is to analyze the call types with the highest volumes and their percentage of occurrence each year. The result shown in Fig 11 gives the top incidents occurring in a year in the form of a pie chart. When the 'top 5 call types each year' option is selected in the GUI, a JOptionPane dialog pops up, as shown in Fig 10, prompting the user for the year of interest; a pie chart of the top call types for that year is then produced.
Fig 10: Popup to enter year
After entering the year 2014, the chart shows the most frequent types of incidents that occurred in that particular year.
Fig 11: Top 5 call types in a year
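The Hive side of this analysis can be expressed along the following lines (a sketch against the calls_fire table; the exact statement used in the project may differ slightly):

Select call_type, count(call_type) as total
from calls_fire
Where year = 2014
Group By call_type
Order By total DESC
Limit 5;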
Top-3 types of incidents, grouped locality wise:
This part of the analysis can be considered one of the most important in our report. After obtaining the top-5 incidents from the previous query, we narrowed the focus to the top three incidents, which turned out to be medical incidents, structure fires and alarms. In this query we find the top-5 localities for the past two years, so that awareness can be increased and appropriate safety measures can be taken in those locations. When this option is selected in the GUI panel, a JOptionPane input box asks the user which type of incident to analyze by zipcode. The input dialog and the result are shown in Fig 12 and Fig 13.
Fig 12: JOptionPane to enter type of incident
Fig 13: The top 5 localities where the incident has occurred in 2014 and
2015
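A query sketch for this locality-wise grouping, again illustrative rather than the exact statement used in the project:

Select Zipcode, count(call_type) as total
from calls_fire
Where call_type = 'Medical Incident' and year in (2014, 2015)
Group By Zipcode
Order By total DESC
Limit 5;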
Type of Incidents each year:
Each record has another attribute with entries such as potentially life threatening, non-potentially life threatening and fire only. We analyzed the frequency of each such entry to determine whether the potentially life threatening situations have increased, decreased or remained unchanged from previous years. We can study the reasons behind these occurrences and the kinds of situations surrounding the fires, and we can interlink this with the locality analysis done previously. The results are shown in Fig 14 and show whether the life threatening situations are increasing or decreasing by year; extra measures can then be taken to prevent them from increasing further.
Fig 14: Level of Incidents
The number of calls grouped by month in each year:
This query is important for our analysis because it helps us predict the likely number of calls per month in the future. By reviewing these call statistics, the fire department can be better prepared for the calls and the large volumes. The result shown in Fig 15 is for the years 2014-2016.
Fig 15: Monthly calls
Priorities of Calls and Incidents
Each call is classified into one of 3 priorities (1, 2, 3), with 1 being less severe, 2 severe and 3 most severe. The dataset contains both the initial priority entered by the call receiver at the fire department and the priority to which the situation had actually escalated by the time the fire engine reached the spot. This analysis tells us whether there is a mismatch between the assumed priority and the end result at the location. If there is, the fire department should be more conscientious in training its employees: by assessing the correct priority, the right number of ambulances or fire engines can be dispatched for the situation.
Result: The result depicted in Fig 16 shows the total number of calls with a difference between the assumed and actual priority, in their respective years.
Fig 16: The difference between original and final priorities
For an in-depth understanding of MapReduce methods and concepts, we also implemented this query in MapReduce; snippets of that code can be found in Fig 17 and Fig 18.
Fig 17: Code Snippet: Mapper code
Fig 18: Code Snippet: Reducer code
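For readers without access to the figures, a condensed sketch of the mapper and reducer for this priority comparison follows. The column indices for year and the two priority fields are assumptions based on the table layout in Section IV, not the exact code from Fig 17 and Fig 18.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PriorityMismatch {

  // Mapper: emit (year, 1) whenever the original and final priorities differ.
  public static class MismatchMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] f = value.toString().split(",");
      if (f.length < 13) return;               // assumed cleaned-record layout
      String year = f[4];                      // year column (see Section IV schema)
      String originalPriority = f[11];         // O_priority
      String finalPriority = f[12];            // F_Priority
      if (!originalPriority.equals(finalPriority)) {
        context.write(new Text(year), ONE);
      }
    }
  }

  // Reducer: total the mismatched calls per year.
  public static class MismatchReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable v : values) {
        total += v.get();
      }
      context.write(key, new IntWritable(total));
    }
  }
}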
Checking whether Medical Incidents are increasing year by year:
Having identified the top 3 incidents occurring each year, we analyzed whether these incidents show an increasing or decreasing pattern, so that measures can be prepared in advance to reduce them. The result shown in Fig 19 trends the Medical Incident, Structure Fire and Alarm call types in each year.
Fig 19: Trend of calls on monthly basis
VI. CONCLUSIONS
We have deduced the following conclusions
from analyzing the above results:
• There is an equal chance of fire hazards and accidents occurring on all days of the week, irrespective of weekdays and weekends.
• The majority of the fire incidents occur in the 9:00-17:00 time range. Even though this time frame covers fewer hours than the rest of the day, the call volume during this period is almost twice that of the remaining hours.
• The top 5 incidents occurring in most years are Medical Incidents, Structure Fires, Alarms, Traffic Collisions and Miscellaneous.
• Most of the medical incidents over the past two years happened in the localities 94102 and 94103.
• Alarms occur mostly in the areas 94102, 94103, 94109, 94110 and 94107.
• Observing the patterns, we notice that Non-Life Threatening and Potentially Life Threatening calls have increased while the fire incidents have decreased.
• From the patterns observed in 2014 and 2015, we predict that the calls in December 2016 will increase from the present value.
• Even though the difference in priorities is small, the affected call volume is still high enough to endanger lives and property, and measures should be taken to avoid this.
• Medical incidents show a strictly increasing curve, alarms occur at a roughly constant rate, and the number of structure fires has decreased over the years.
VII. FUTURE ENHANCEMENTS
We could use Hadoop HUE to plot and formulate graphs to aid in the visual analysis. Spark could be used to increase efficiency and ease the computational load. Deeper analysis could be performed, such as comparing the dispatch unit attribute and the incident with the battalion attribute. We could also plot the data geospatially using the street, latitude and longitude attributes given in the dataset.
VIII. ACKNOWLEDGEMENT
This project would not have been completed without the support of our instructor Dr. William H. Hsu, Associate Professor, Department of Computer Science, Kansas State University.
IX. REFERENCES
1. Dataset: https://data.sfgov.org/
2. https://www.firerescue1.com/communications-interoperability/articles/1944513-Is-San-Franciscos-EMS-911-systems-stressed-to-breaking-point/
3. http://www.sfexaminer.com/unreliable-dispatch-system-exacerbates-flaws-in-sfs-emergency-response/
4. http://www.sfgate.com/bayarea/article/Why-S-F-still-counts-on-street-fire-alarm-boxes-3081293.php
5. http://www.thesfnews.com/sffd-engine-1-ranked-busiest-in-nation/22197
6. http://hortonworks.com/hadoop/hive/
7. http://hortonworks.com/wp-content/uploads/downloads/2013/08/Hortonworks.CheatSheet.SQLtoHive.pdf
8. https://www.ijircce.com/upload/2013/october/27Predictive.pdf
9. https://www.ijircce.com/upload/2015/may/74_32_Statistical.pdf
10. https://www.researchgate.net/publication/301801698_Earthquake_Data_Analysis_and_Visualization_using_Big_Data_Tool
11. http://www.ejournalofscience.org/archive/vol3no12/vol3no12_16.pdf
12. http://scientific-journals.org/journalofsystemsandsoftware/archive/vol5no2/vol5no2_3.pdf and http://betterevaluation.org/sites/default/files/data_cleaning.pdf
13. http://barbie.uta.edu/~jli/Resources/MapReduce&Hadoop/MapReduce%20Design%20Patterns.pdf
14. http://stackoverflow.com/questions/20307404/hadoop-number-of-mappers-and-reducers