High Performance Computing on NYC Yellow Taxi Data Set

EXTRACTING INSIGHTS FROM BIG DATA: A CASE OF NEW YORK CITY YELLOW TAXI DATASET
Parag Ahire
January 11, 2020

PRESENTATION OUTLINE
Brief introduction
• Big Data & high performance computing
• Describe a few techniques for high performance computing
 Compare and contrast a few techniques
Sample Dataset introduction – NYC Yellow Taxi
Apply four techniques
• High-level code review
• Demonstration on Hortonworks virtual machine on Azure

BIG DATA
 Data is growing
• Digital Age : 2002 onwards
• 2019 – 1770 Exabytes
• 2020 – 2000 Exabytes
 What is it ?
• A data set that cannot be processed by a “normal” machine in a “reasonable” amount of time
• Three V’s
 Volume
 Velocity
 Variety
• May vary by time and prevalent technology
 Used to be gigabytes/terabytes
 Now petabytes/exabytes
 Future zettabytes/yottabytes
o Zettabyte – 1000 data centers occupying 20% of Manhattan
o Yottabyte – 1M data centers occupying Delaware and Rhode Island

HIGH PERFORMANCE COMPUTING
 The ability to process massive amounts of data and perform complex
calculations at high speed
 New Challenges (7 V’s)
 Previously - Volume
 Now – Velocity, Variety, Variability, Veracity, Visualization, Value
 How to perform ?
• Supercomputers – expensive, require specialized expertise to use, and
solve specialized problems
• Clusters of small or medium-sized business computers
• Modern “supercomputers” are mostly “clusters of computers”

PARALLEL AND DISTRIBUTED COMPUTING
Parallel Computing – All processors have access to shared memory
Distributed Computing – Each processor has its own memory. Information is exchanged by passing messages between processors
Images taken from : Wikipedia

DISTRIBUTED COMPUTING MODELS
 Parallel algorithms
 Shared-memory model
• All processors access shared memory
• Programmer decides what program is executed by each processor
 Message-passing model
• Programmer chooses
o Network structure
o Program executed by each computer
 Distributed algorithms
 Programmer chooses the computer program
 All computers run the same program
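The difference between the shared-memory and message-passing models can be sketched in a few lines of Python. This is only an illustration: Python threads stand in for processors, and the function names `shared_memory_sum` and `message_passing_sum` are hypothetical.

```python
import queue
import threading


def shared_memory_sum(chunks):
    """Shared-memory style: every worker updates one shared total."""
    total = {"value": 0}
    lock = threading.Lock()

    def worker(chunk):
        partial = sum(chunk)
        with lock:  # protect the shared variable from concurrent updates
            total["value"] += partial

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total["value"]


def message_passing_sum(chunks):
    """Message-passing style: workers share nothing and send results as messages."""
    mailbox = queue.Queue()

    def worker(chunk):
        mailbox.put(sum(chunk))  # the message is the only communication

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # One message per worker is waiting in the mailbox at this point.
    return sum(mailbox.get() for _ in chunks)


passenger_chunks = [[1, 2], [3, 4], [5]]
print(shared_memory_sum(passenger_chunks), message_passing_sum(passenger_chunks))
```

Both functions return the same total; what differs is whether the workers coordinate through a shared variable or only through the messages they exchange.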

HIGH PERFORMANCE COMPUTING TECHNIQUES (HPCT)
 Map Reduce
 A framework or programming model
 Suitable for processing large volumes of structured/unstructured data
 Pig
 Procedural rather than declarative coding approach
 Provides a high degree of abstraction for map reduce
 Hive
 A traditional data warehouse interface for map reduce
 Spark
 An open source big data framework
 A unified analytics engine for large-scale data processing

MAP REDUCE
 Map function
 Input – A Key Value pair
• (k1, v1) -> list(k2, v2)
 Output – A list of key value pairs (one or more elements)
 Reduce function
 Input – A Key and a list of values
• (k2, list(v2)) -> list(v2)
 Sort
 Merging and sorting of output produced in the map phase
 Shuffle
 Transfers intermediate output of map phase to reducer
 Passes on intermediate output of one or more keys to a single reducer
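The map, shuffle/sort and reduce signatures above can be illustrated with a small in-memory Python sketch. It is not Hadoop code: everything runs in a single process, and the record layout ("payment_type,passenger_count") is a hypothetical one chosen for the example.

```python
from collections import defaultdict


def map_fn(key, value):
    """(k1, v1) -> list(k2, v2): emit (payment_type, passenger_count)."""
    payment_type, passenger_count = value.split(",")
    return [(payment_type, int(passenger_count))]


def shuffle_and_sort(mapped_pairs):
    """Group all values of the same key together and sort by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_fn(key, values):
    """(k2, list(v2)) -> list(v2): total passengers for one key."""
    return [sum(values)]


def run_job(records):
    mapped = [pair for k, v in records for pair in map_fn(k, v)]
    return {k: reduce_fn(k, vs) for k, vs in shuffle_and_sort(mapped)}


# Keys are line offsets, values are hypothetical "payment_type,passengers" lines.
records = [(0, "CRD,2"), (1, "CSH,1"), (2, "CRD,1")]
passengers_by_payment = run_job(records)
print(passengers_by_payment)
```

In a real cluster the map calls run on many machines and the shuffle moves data over the network; the sketch only shows how the key/value shapes fit together.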

MAP REDUCE
 Concerns
 Map phase – done in parallel, typically 20% of the work
 Reduce phase – executed sequentially for each key, typically 80% of the work
 Tips
 Increase the work done in the map phase and leave less for the reduce phase
 Include the optional combine phase to reduce work done by the reducer
 Combine (Optional)
 A mini-reducer that summarizes mapper output records for a single key
 Reduces data transfer between mapper and reducer
 Decreases the amount of data to be processed by the reducer
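A minimal sketch of what a combiner does, assuming the reduce operation is a sum (which is associative and commutative, so it is safe to apply it early on partial data). The `combine` function and the sample pairs are hypothetical.

```python
from collections import Counter


def combine(mapper_output):
    """Mini-reducer: sum the values per key within one mapper's output."""
    totals = Counter()
    for key, value in mapper_output:
        totals[key] += value
    return sorted(totals.items())


# Output of a single mapper before it is sent to the reducers.
mapper_output = [("CRD", 1), ("CSH", 1), ("CRD", 1), ("CRD", 1)]
combined = combine(mapper_output)

# Four records shrink to two, so less data crosses the network.
print(len(mapper_output), "->", len(combined), combined)
```

The reducer still receives one partial sum per key from each mapper and adds them up, so the final answer is unchanged.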

MAP REDUCE
Image taken from : Data Flair Training Blog

QUESTION : MAP REDUCE
For each day of the month (1 to 31), across all months of the year 2014,
print the maximum monthly total number of passengers (summed over all
eligible trips) boarding a Yellow Taxi between 09:00 am (inclusive) and
10:00 am (exclusive), for trips of less than 3 miles where a tip was paid.
Print the day of the month as a number, followed by the total number of
passengers for that day in the month where this total was highest. The
day of the month should be a number between 1 and 31, respecting the
number of days in each month of the year 2014. Any trip that did not have
a pickup date between 1st January 2014 and 31st December 2014 should be
ignored. The output need not be sorted by day of the month.
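One way to read this question is as two aggregation steps: sum passengers per (day, month), then take the maximum over months for each day. The Python sketch below follows that reading on in-memory rows. It is not the code reviewed in the presentation, and the column names (`pickup_datetime`, `passenger_count`, `trip_distance`, `tip_amount`) and the date format are assumptions about the data set.

```python
from collections import defaultdict
from datetime import datetime


def map_trip(row):
    """Emit ((day, month), passengers) for an eligible trip, else nothing."""
    try:
        pickup = datetime.strptime(row["pickup_datetime"], "%Y-%m-%d %H:%M:%S")
        passengers = int(row["passenger_count"])
        distance = float(row["trip_distance"])
        tip = float(row["tip_amount"])
    except (KeyError, ValueError):
        return []  # malformed rows and invalid dates are ignored
    if pickup.year != 2014 or pickup.hour != 9:  # 09:00 <= time < 10:00
        return []
    if distance >= 3 or tip <= 0:
        return []
    return [((pickup.day, pickup.month), passengers)]


def max_monthly_total_per_day(rows):
    monthly = defaultdict(int)  # (day, month) -> total passengers
    for row in rows:
        for key, passengers in map_trip(row):
            monthly[key] += passengers
    best = defaultdict(int)  # day -> maximum monthly total
    for (day, _month), total in monthly.items():
        best[day] = max(best[day], total)
    return dict(best)


def trip(pickup, passengers, distance, tip):
    return {"pickup_datetime": pickup, "passenger_count": passengers,
            "trip_distance": distance, "tip_amount": tip}


sample = [
    trip("2014-01-05 09:15:00", "2", "1.5", "1.00"),
    trip("2014-01-05 09:45:00", "1", "2.0", "0.50"),
    trip("2014-02-05 09:30:00", "4", "0.8", "2.00"),
    trip("2014-02-05 10:00:00", "6", "0.8", "2.00"),  # outside the hour
    trip("2013-12-31 09:10:00", "5", "1.0", "1.00"),  # outside 2014
]
print(max_monthly_total_per_day(sample))
```

For the sample rows, day 5 totals 3 passengers in January and 4 in February, so the value reported for day 5 is 4.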

ANSWER : MAP REDUCE
Day:CountOfPassengers
1:25
10:25
11:37
12:53
13:25
14:21
15:23
16:27
17:30
18:45
28:39
29:36
3:32
30:26
31:21
4:34
5:30
6:31
7:36
8:25
9:27
19:38
2:33
20:49
21:39
22:38
23:33
24:38
25:33
26:38
27:44

TOP HADOOP VENDORS
Amazon Elastic Map Reduce (EMR)
Cloudera* CDH Hadoop Distribution
Hortonworks* Data Platform (HDP)
MapR Hadoop Distribution
IBM Open Platform
Microsoft Azure HDInsight
Pivotal Big Data Suite
*Cloudera and Hortonworks have merged

PIG
Originated at Yahoo
A platform for analyzing large data sets
Pig Latin – A procedural language
 Provides a sequence of data transformations
• To merge, filter, apply functions, group records
• Supports User Defined Functions (UDF) for special processing
Programs are compiled into map reduce jobs
 Support for Python, Java, Groovy, JavaScript, Ruby

PIG
Abstraction for map reduce programming
 Improves developer productivity
 Suitable for use by data analysts
Lower performance than map reduce
 Use additional machines in cluster to increase performance
Used to perform tasks for
 Data Storage
 Data Execution
 Data Manipulation

QUESTION : PIG
For all data available for the year 2014 (consider all months), which
drop-off location had the maximum total amount collected by credit card
for yellow taxi trips exceeding 1 mile where no toll was paid, a tip was
paid and the standard rate was applied? Any trip that did not have a
drop-off date between 1st January 2014 and 31st December 2014, or that
does not have a valid month or a valid day of the month, should be
ignored. Print the drop-off location ID(s) and the aggregated total
amount for the top location.
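The question maps onto a filter, a group-by and a sum, which is the kind of transformation sequence a Pig Latin script expresses. The sketch below mirrors those steps in plain Python rather than Pig. The column names and the codes `"CRD"` (credit card) and `"1"` (standard rate) are assumptions about the data set, not values confirmed by the slides.

```python
from collections import defaultdict
from datetime import datetime


def parse_trip(row):
    """Return ((latitude, longitude), total_amount) for an eligible trip, else None."""
    try:
        dropoff = datetime.strptime(row["dropoff_datetime"], "%Y-%m-%d %H:%M:%S")
        eligible = (
            dropoff.year == 2014
            and row["payment_type"] == "CRD"      # assumed credit card code
            and row["rate_code"] == "1"           # assumed standard rate code
            and float(row["trip_distance"]) > 1
            and float(row["tolls_amount"]) == 0
            and float(row["tip_amount"]) > 0
        )
        if not eligible:
            return None
        location = (row["dropoff_latitude"], row["dropoff_longitude"])
        return location, float(row["total_amount"])
    except (KeyError, ValueError):
        return None  # malformed rows and invalid dates are ignored


def top_dropoff_location(rows):
    totals = defaultdict(float)  # GROUP BY location, SUM(total_amount)
    for row in rows:
        parsed = parse_trip(row)
        if parsed is not None:
            location, amount = parsed
            totals[location] += amount
    if not totals:
        return None
    return max(totals.items(), key=lambda item: item[1])


def trip(dropoff, lat, lon, total, payment="CRD", tolls="0", tip="1.0"):
    return {"dropoff_datetime": dropoff, "dropoff_latitude": lat,
            "dropoff_longitude": lon, "total_amount": total,
            "payment_type": payment, "rate_code": "1",
            "trip_distance": "2.5", "tolls_amount": tolls, "tip_amount": tip}


sample = [
    trip("2014-03-01 10:00:00", "40.78", "-73.95", "10.0"),
    trip("2014-04-01 11:00:00", "40.78", "-73.95", "20.5"),
    trip("2014-04-02 11:00:00", "40.70", "-74.00", "25.0"),
    trip("2014-04-03 11:00:00", "40.70", "-74.00", "90.0", payment="CSH"),
    trip("2014-04-04 11:00:00", "40.70", "-74.00", "90.0", tolls="5.33"),
]
print(top_dropoff_location(sample))
```

Only the first three sample rows are eligible, so the location (40.78, -73.95) wins with a total of 30.5.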

ANSWER : PIG
Drop-Off Latitude | Drop-Off Longitude | Sum Total Amount
40.78508 | -73.95587 | $65221.65

HIVE
Developed at Facebook
A SQL engine with its own metastore on top of HDFS
 Can be queried through HQL (Hive Query Language)
Provides a traditional data warehouse interface
Hive compiler
 Converts hive queries to map reduce programs
 Executed in parallel across machines in the Hadoop cluster

HIVE
 Abstraction for map reduce programming
 Improves developer productivity
 Suitable for individuals with a SQL background
 Lower performance than map reduce
 Use additional machines in cluster to increase performance
 Supports User Defined Functions (UDFs)
 Used for processing structured data
 Data is loaded in tables
 Unstructured data needs to be structured
 Data is then loaded to tables

QUESTION : HIVE
Which three pairs of pickup location / drop off location had the largest ratio of
total amount paid per passenger for trips taken by a yellow taxi for all data
available for the year 2014? Only trips that utilized a payment type of credit
card and utilized a standard rate code should be considered. Any trip data that
did not have a drop-off date between 1st January 2014 and 31st December
2014, or does not have a valid month or does not have a valid day of the
month should be ignored. Print the rank, pickup location, drop off location and
the ratio of total amount paid to the passenger count for these three top pairs
of pickup / drop off locations. Locations should be printed in descending order
of the ratio of total amount paid to the passenger count. The pickup location
and drop off location should be printed as a string of the form
"latitude:longitude" based on the latitude and longitude of the pick-up location
or drop off location. A dense ranking should be performed.
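Dense ranking means that tied ratios share a rank and the next distinct ratio gets the next consecutive rank, with no gaps. The Python sketch below illustrates that behaviour under two assumptions: the ratio is the summed total amount divided by the summed passenger count for each pair, and the trips passed in have already been filtered for payment type, rate code and date. The field names are hypothetical.

```python
from collections import defaultdict


def top_pairs_by_ratio(trips, top_n=3):
    """Return (rank, pickup, dropoff, ratio) rows for the top_n dense ranks."""
    amount = defaultdict(float)
    passengers = defaultdict(int)
    for t in trips:
        pair = (t["pickup"], t["dropoff"])
        amount[pair] += t["total_amount"]
        passengers[pair] += t["passenger_count"]

    ratios = sorted(
        ((amount[p] / passengers[p], p) for p in amount if passengers[p] > 0),
        key=lambda item: item[0],
        reverse=True,
    )

    ranked, rank, previous = [], 0, None
    for ratio, (pickup, dropoff) in ratios:
        if ratio != previous:  # a new distinct value gets the next rank
            rank += 1
            previous = ratio
        if rank > top_n:
            break
        ranked.append((rank, pickup, dropoff, ratio))
    return ranked


def trip(pickup, dropoff, total, count):
    return {"pickup": pickup, "dropoff": dropoff,
            "total_amount": total, "passenger_count": count}


# Locations use the "latitude:longitude" string form asked for in the question.
sample = [
    trip("40.1:-73.1", "40.2:-73.2", 100.0, 1),
    trip("40.3:-73.3", "40.4:-73.4", 50.0, 1),
    trip("40.3:-73.3", "40.4:-73.4", 50.0, 1),
    trip("40.5:-73.5", "40.6:-73.6", 200.0, 2),
    trip("40.7:-73.7", "40.8:-73.8", 30.0, 1),
    trip("40.9:-73.9", "41.0:-74.0", 10.0, 1),
]
for row in top_pairs_by_ratio(sample):
    print(row)
```

In the sample, two pairs tie at a ratio of 100.0 and both receive rank 1, so four rows are returned for the top three ranks.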

ANSWER : HIVE
RANK | Pickup Latitude | Pickup Longitude | Drop-Off Latitude | Drop-Off Longitude | Ratio of Total Amount to Total Count
1 | 40.72941 | -73.98386 | 41.30529 | -72.92268 | $401.5
2 | 40.73249 | -73.98791 | 40.72129 | -73.95615 | $354.25
3 | 40.67019 | -73.91853 | 40.87084 | -73.90391 | $354.0

COMPARISON – MAP REDUCE, PIG, HIVE
MAP REDUCE | PIG | HIVE
Compiled language | Scripting language | Query language
Lower level of abstraction | Higher level of abstraction | Higher level of abstraction
Higher learning curve | Lower learning curve | Lowest learning curve
Best performance for very large data | Intermediate performance for very large data (50% lower) | Lowest performance for very large data
Programmer writes the most lines of code | Programmer writes an intermediate number of lines of code | Programmer writes the fewest lines of code
Highest code efficiency (more flexibility) | Relatively lower code efficiency (less flexibility) | Relatively lower code efficiency (less flexibility)
Possible to handle unstructured data | Not very friendly with unstructured data like images | Not very friendly with unstructured data like images
Possible to deal with poor schema design of XML, JSON | Cannot deal with poor design of XML, JSON | Not easy to deal with poor design of XML, JSON
More potential for introducing defects due to having to write very custom code | Limited possibility of introducing defects due to fixed syntactic possibilities | Limited possibility of introducing defects due to fixed syntactic possibilities

SPARK
 Developed at UC Berkeley AMPLab
 An open source big data framework
 Utilizes DAG (Directed Acyclic Graph) programming style
 Now maintained by non-profit Apache Software Foundation
 A unified analytics engine for
 Large scale processing
 Faster, general purpose processing
 Reduces read/write operations from/to disk
 Intermediate data stored in memory to achieve speed
 RDDs (Resilient Distributed Datasets)
 DataFrame
 Used to build batch, iterative, interactive, graph and streaming
applications
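The idea of keeping intermediate data in memory while remembering how it was derived can be illustrated with a toy class. This is not the Spark API: `LazyDataset` is a hypothetical stand-in that records a simple linear lineage of transformations (a degenerate DAG), computes only when results are requested, caches them in memory, and recomputes from the lineage if the cached copy is lost.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records lineage and computes only on demand."""

    def __init__(self, source, lineage=()):
        self._source = source
        self._lineage = lineage   # recorded transformations, in order
        self._cache = None        # in-memory copy of the computed result
        self.computations = 0     # how many times the lineage was replayed

    def map(self, fn):
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._lineage + (("filter", fn),))

    def collect(self):
        """Action: replay the lineage unless a cached result is available."""
        if self._cache is None:
            data = list(self._source)
            for kind, fn in self._lineage:
                if kind == "map":
                    data = [fn(x) for x in data]
                else:
                    data = [x for x in data if fn(x)]
            self._cache = data
            self.computations += 1
        return list(self._cache)

    def evict(self):
        """Simulate losing the in-memory copy, for example when a node fails."""
        self._cache = None


fares = LazyDataset([4, 12, 30])
doubled = fares.filter(lambda f: f > 5).map(lambda f: f * 2)  # nothing runs yet

first = doubled.collect()    # computed once and cached
second = doubled.collect()   # served from memory
doubled.evict()
third = doubled.collect()    # recomputed from the recorded lineage
print(first, doubled.computations)
```

The second `collect` does no work because the result is already in memory, while the call after `evict` rebuilds the same result from the recorded steps.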

SPARK
Supports cross-platform development
Programming in Scala, Java, Python, R, SQL – Core APIs
 PySpark (Python)
 SparkR
 Spark SQL (formerly known as Shark)
Rest of the eco-system
 MLlib (Machine Learning)
 GraphX (Graph Computation)
 Spark Streaming

COMPARISON – SPARK, MAP REDUCE
CRITERIA | SPARK | MAP REDUCE
Written In | Scala | Java
License | Apache 2 | Apache 2
OS support | Cross-platform | Cross-platform
Programming Languages | Scala, Java, Python, R, SQL | Java, C, C++, Ruby, Groovy, Python, Perl
Lines of Code (LOC) | Approximately 20,000 | Approximately 120,000
Hardware Requirements | Requires the use of mid to high level hardware | Runs well on commodity hardware
Data Storage | Hadoop Distributed File System (HDFS), Google Cloud Storage, Amazon S3, Microsoft Azure | Hadoop Distributed File System (HDFS), MapR, HBase
Community | Strong community, one of the most active projects at Apache | MapReduce community has shifted to Spark
Scalability | Highly scalable, one of the largest clusters has 8K nodes | Even higher scalability, one of the largest clusters has 14K nodes

CRITERIA | SPARK | MAP REDUCE
Speed | Up to 100x faster than MapReduce in memory, up to 10x faster on disk | Faster than traditional approaches
Difficulty / Ease of use | Easy to program with the use of high level operators (RDDs and data frames) | Difficult due to the need to program each and every operation
Ease of management | Easy since it is a single analytics engine that performs various tasks | It is a batch engine and needs to be coupled with other engines like Storm, Giraph, Impala etc. to achieve various tasks
Fault tolerance | No need to start from scratch (except for programming errors) but some limitations due to in-memory operations | No need to start from scratch (except for programming errors)
Data Processing modes | Batch, Real Time, Iterative, Interactive, Graph, Streaming | Batch
APIs and caching | Caches data in memory | No support for caching
SQL Support | Supported via Spark SQL (formerly known as Shark) | Supported via Hive

CRITERIA | SPARK | MAP REDUCE
Real Time analysis | Possible to handle at scale | No support for real-time analysis
Streaming | Spark Streaming handles streaming | No support for streaming
Interactive mode | Supported | Not supported
Recovery | Allows recovery of failed nodes by re-computation of the DAG | Resilient to system faults or failures. It is a highly fault-tolerant system.
Latency | Low | High
Scheduler | Due to in-memory computations it acts as its own flow scheduler | Requires an external job scheduler like Oozie for its flows
Security / Access Permission | Less secure since the only mechanism supported is shared secret authentication | More secure because of Kerberos and ACLs (access control lists)
Cost | Requires plenty of RAM for in-memory computations, so costs increase as cluster size increases | It is cheaper in terms of cost
Category Choice | Choice of data scientists since it is a complete analytics engine | Choice of data engineers since it is a basic data processing engine

QUESTION : SPARK
Which day (or days) across all months in the year 2014 yielded the largest total
tip amount (across all eligible trips) as a percentage of the total amount (across
all eligible trips) for trips that charged the standard rate on a Yellow Taxi where
the total amount for each trip exceeded 5 and no toll was paid? Print the day
(or days) of the month (only the day ranging from 1 to 31) in 2014 and the
total tip amount as a percentage of the total amount. Utilize the pickup date
time for deciding which day of the month that the trip counts against. The drop
off datetime need not be considered. Any trip data that did not have a pickup
date between 1st January 2014 and 31st December 2014, or does not have a
valid month or does not have a valid day of the month should be ignored.
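The aggregation in this question can be sketched as follows: sum tips and totals per day of the month over eligible trips, divide, and keep the day (or days) with the highest percentage. The sketch uses plain Python on in-memory rows instead of Spark, and the column names and the standard rate code `"1"` are assumptions about the data set.

```python
from collections import defaultdict
from datetime import datetime


def tip_percentage_by_day(rows):
    """Return {day_of_month: 100 * sum(tip) / sum(total)} over eligible trips."""
    tips = defaultdict(float)
    totals = defaultdict(float)
    for row in rows:
        try:
            pickup = datetime.strptime(row["pickup_datetime"], "%Y-%m-%d %H:%M:%S")
            tip = float(row["tip_amount"])
            total = float(row["total_amount"])
            tolls = float(row["tolls_amount"])
            rate_code = row["rate_code"]
        except (KeyError, ValueError):
            continue  # malformed rows and invalid dates are ignored
        if pickup.year != 2014 or rate_code != "1":  # assumed standard rate code
            continue
        if total <= 5 or tolls != 0:
            continue
        tips[pickup.day] += tip
        totals[pickup.day] += total
    return {day: 100.0 * tips[day] / totals[day] for day in totals}


def best_days(rows):
    """Return every (day, percentage) that ties for the highest percentage."""
    percentages = tip_percentage_by_day(rows)
    if not percentages:
        return []
    best = max(percentages.values())
    return sorted((day, pct) for day, pct in percentages.items() if pct == best)


def trip(pickup, tip, total, tolls="0"):
    return {"pickup_datetime": pickup, "tip_amount": tip,
            "total_amount": total, "tolls_amount": tolls, "rate_code": "1"}


sample = [
    trip("2014-01-10 08:00:00", "2", "20"),
    trip("2014-03-10 18:30:00", "1", "10"),
    trip("2014-01-11 12:00:00", "1", "20"),
    trip("2014-01-12 12:00:00", "2", "4"),            # total does not exceed 5
    trip("2014-01-12 13:00:00", "9", "30", "5.33"),   # toll was paid
]
print(best_days(sample))
```

Trips on the 10th of January and the 10th of March are pooled under day 10, which gives 3 in tips on 30 in totals for the sample rows.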

ANSWER : SPARK
PICK UP DAY OF THE MONTH | PERCENTAGE OF SUM TIP AMOUNT TO SUM TOTAL AMOUNT
10 | 9.991076

ALTERNATIVES
TECHNIQUE | ALTERNATIVE TECHNIQUE
Map Reduce | Apache Spark
Pig | Apache Spark
Hive | Apache Spark, Impala, HAWQ, Spark SQL, Shark, Pivotal HDB (formerly known as HAWQ), PrestoDB (Facebook), BigSQL (IBM), BigQuery (Google)
Spark | Apache Storm, Flume, Cassandra, Amazon Kinesis, Splunk, Elasticsearch, Koalas (Databricks), Vaex – Python library for lazy out-of-core data frames

REFERENCES
What is Big Data?
Data Center storage capacity worldwide from 2016 to 2021, by segment
How big is a Yottabyte?
What is High Performance Computing?
The 7 V’s of Big Data
Distributed Computing
Hadoop Combiner – Best Explanation to MapReduce Combiner
Pig Documentation
UC Berkeley AMPLab
NYC TLC Trip Record Data
Map Reduce vs Pig vs Hive
Spark vs Hadoop MapReduce: Which big data framework to choose
Apache Spark vs Hadoop MapReduce – Feature Wise Comparison
Spark vs Hadoop MapReduce
MapReduce vs Spark – 20 Useful Comparisons To Learn
Spark vs Hadoop : Which is the Best Big Data Framework
