EXTRACTING INSIGHTS FROM BIG DATA: A CASE OF NEW YORK CITY YELLOW TAXI DATASET
Parag Ahire
January 11, 2020
PRESENTATION OUTLINE
Brief introduction
• Big Data & high performance computing
• Describe a few techniques for high performance computing
 Compare and contrast a few techniques
Sample Dataset introduction – NYC Yellow Taxi
Apply four techniques
• High-level code review
• Demonstration on Hortonworks virtual machine on Azure
BIG DATA
 Data is growing
• Digital Age : 2002 onwards
• 2019 – 1770 Exabytes
• 2020 – 2000 Exabytes
 What is it ?
• A data set that cannot be processed by a “normal” machine in a “reasonable” amount of time
• Three V’s
 Volume
 Velocity
 Variety
• May vary with time and prevalent technology
 Used to be gigabytes/terabytes
 Now petabytes/exabytes
 Future: zettabytes/yottabytes
o Zettabyte – 1000 data centers occupying 20% of Manhattan
o Yottabyte – 1M data centers occupying Delaware and Rhode Island
HIGH PERFORMANCE COMPUTING
 The ability to process massive amounts of data and perform complex calculations at high speed
 New Challenges (7 V’s)
 Previously - Volume
 Now – Velocity, Variety, Variability, Veracity, Visualization, Value
 How to perform ?
• Supercomputers – expensive, require specialized expertise to use, and solve specialized problems
• Clusters of small or medium-sized business computers
• Modern “supercomputers” are mostly clusters of computers
PARALLEL AND DISTRIBUTED COMPUTING
Parallel Computing – All processors have access to shared memory
Distributed Computing – Each processor has its own memory. Information is exchanged by passing messages between processors
Images taken from : Wikipedia
DISTRIBUTED COMPUTING MODELS
 Parallel algorithms
 Shared-memory model
• All processors access shared memory
• Programmer decides what program is executed by each processor
 Message-passing model
• Programmer chooses
o Network structure
o Program executed by each computer
 Distributed algorithms
 Programmer chooses the computer program
 All computers run the same program
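The two models above can be contrasted with a minimal sketch. It is an illustration only: Python threads inside one process stand in for the processors or computers, and the function names `shared_memory_sum` and `message_passing_sum` are made up for this example.

```python
import queue
import threading


def shared_memory_sum(chunks):
    """Shared-memory model: every worker updates one shared total under a lock."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)
        with lock:  # protect the shared variable from concurrent updates
            total += partial

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def message_passing_sum(chunks):
    """Message-passing model: workers keep private state and send results as messages."""
    mailbox = queue.Queue()

    def worker(chunk):
        mailbox.put(sum(chunk))  # the only communication is the message itself

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The coordinator combines one message per worker.
    return sum(mailbox.get() for _ in chunks)


chunks = [[1, 2], [3, 4], [5]]
print(shared_memory_sum(chunks), message_passing_sum(chunks))
```

Both functions compute the same result. The difference is where coordination happens: a lock around shared state in the first, explicit messages in the second.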
HIGH PERFORMANCE COMPUTING TECHNIQUES (HPCT)
 Map Reduce
 A framework or programming model
 Suitable for processing large volume of structured/unstructured data
 Pig
 Procedural rather than declarative coding approach
 Provides a high degree of abstraction for map reduce
 Hive
 A traditional data warehouse interface for map reduce
 Spark
 An open source big data framework
 A unified analytics engine for large-scale data processing
MAP REDUCE
 Map function
 Input – A Key Value pair
• (k1, v1) -> list(k2, v2)
 Output – A list of key value pairs (one or more elements)
 Reduce function
 Input – A Key and a list of values
• (k2, list(v2)) -> list(v2)
 Sort
 Merging and sorting of output produced in the map phase
 Shuffle
 Transfers intermediate output of map phase to reducer
 Passes on intermediate output of one or more keys to a single reducer
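The map, shuffle/sort and reduce steps can be sketched in plain Python as a single-process word count. This is a minimal illustration of the programming model, not Hadoop code; `map_fn`, `shuffle_and_sort`, `reduce_fn` and `run_job` are hypothetical names chosen for this example.

```python
from collections import defaultdict


def map_fn(key, value):
    """(k1, v1) -> list of (k2, v2): emit (word, 1) for every word in a line."""
    return [(word.lower(), 1) for word in value.split()]


def shuffle_and_sort(pairs):
    """Group intermediate values by key and sort by key, as done between map and reduce."""
    groups = defaultdict(list)
    for k2, v2 in pairs:
        groups[k2].append(v2)
    return sorted(groups.items())


def reduce_fn(key, values):
    """(k2, list(v2)) -> list(v2): sum the counts for one key."""
    return [sum(values)]


def run_job(records):
    intermediate = []
    for k1, v1 in records:  # map phase (parallel in a real cluster)
        intermediate.extend(map_fn(k1, v1))
    return {k2: reduce_fn(k2, vs)[0] for k2, vs in shuffle_and_sort(intermediate)}


lines = [(0, "yellow taxi"), (1, "yellow cab taxi")]
counts = run_job(lines)
print(counts)
```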
MAP REDUCE
 Concerns
 Map phase – done in parallel, typically 20% of the work
 Reduce phase – executed sequentially for each key, typically 80% of the work
 Tips
 Increase the work done in the map phase and leave less for the reduce phase
 Include the optional combine phase to reduce work done by the reducer
 Combine (Optional)
 A mini-reducer that summarizes the mapper output records for a single key
 Reduces data transfer between mapper and reducer
 Decreases the amount of data to be processed by the reducer
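The effect of the optional combine phase can be shown with a small sketch. It assumes a word-count style job in which each input split is handled by one mapper; the helper names are hypothetical.

```python
from collections import Counter


def map_words(line):
    """Mapper: emit (word, 1) for every word in one input split."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Mini-reducer run on a single mapper's output: one (key, partial sum) per key."""
    partial = Counter()
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())


def reduce_counts(pairs):
    """Reducer: sum all (partial) counts per key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


splits = ["taxi taxi taxi cab", "cab taxi cab cab"]

# Records that would cross the network to the reducer, without and with a combiner.
without_combiner = [pair for s in splits for pair in map_words(s)]
with_combiner = [pair for s in splits for pair in combine(map_words(s))]

print(len(without_combiner), len(with_combiner))
```

The final counts are identical in both cases. Only the number of intermediate records sent to the reducer changes, which works here because addition is associative and commutative.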
MAP REDUCE
Image taken from : Data Flair Training Blog
QUESTION : MAP REDUCE
For each day of the month (1 to 31), across all months of the year 2014,
print the maximum monthly total number of passengers (summed over all
eligible trips) boarding a Yellow Taxi between 09:00 am (inclusive) and
10:00 am (exclusive) for a trip distance of less than 3 miles where a tip
was paid. Print the day of the month as a number, followed by the total
number of passengers across all eligible trips on that day in the month
where this total was the largest. The day of the month should be a number
between 1 and 31, taking into account the number of days in each month of
the year 2014. Any trip data that did not have a pickup date between 1st
January 2014 and 31st December 2014 should be ignored. The days of the
month need not be sorted in the output.
ANSWER : MAP REDUCE
Day:CountOfPassengers
1:25
10:25
11:37
12:53
13:25
14:21
15:23
16:27
17:30
18:45
19:38
2:33
20:49
21:39
22:38
23:33
24:38
25:33
26:38
27:44
28:39
29:36
3:32
30:26
31:21
4:34
5:30
6:31
7:36
8:25
9:27
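One possible reading of the question can be sketched as a mapper followed by two aggregation steps. This is not the code from the presentation. The field names (`pickup_datetime`, `passenger_count`, `trip_distance`, `tip_amount`) and the timestamp format are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime


def mapper(trip):
    """Emit ((month, day), passengers) for an eligible trip, otherwise nothing."""
    pickup = datetime.strptime(trip["pickup_datetime"], "%Y-%m-%d %H:%M:%S")
    eligible = (
        pickup.year == 2014
        and pickup.hour == 9            # 09:00 inclusive to 10:00 exclusive
        and trip["trip_distance"] < 3
        and trip["tip_amount"] > 0
    )
    return [((pickup.month, pickup.day), trip["passenger_count"])] if eligible else []


def max_passengers_by_day(trips):
    # First reduce: total passengers per (month, day).
    totals = defaultdict(int)
    for trip in trips:
        for key, passengers in mapper(trip):
            totals[key] += passengers
    # Second reduce: for each day of the month, the maximum total over all months.
    best = defaultdict(int)
    for (_month, day), total in totals.items():
        best[day] = max(best[day], total)
    return dict(best)


trips = [
    {"pickup_datetime": "2014-01-05 09:15:00", "passenger_count": 2, "trip_distance": 1.2, "tip_amount": 1.0},
    {"pickup_datetime": "2014-01-05 09:45:00", "passenger_count": 1, "trip_distance": 2.0, "tip_amount": 0.5},
    {"pickup_datetime": "2014-02-05 09:30:00", "passenger_count": 4, "trip_distance": 0.8, "tip_amount": 2.0},
    {"pickup_datetime": "2014-02-05 10:00:00", "passenger_count": 6, "trip_distance": 0.8, "tip_amount": 2.0},
    {"pickup_datetime": "2013-12-31 09:10:00", "passenger_count": 5, "trip_distance": 1.0, "tip_amount": 1.0},
    {"pickup_datetime": "2014-03-07 09:05:00", "passenger_count": 3, "trip_distance": 4.5, "tip_amount": 1.0},
]
result = max_passengers_by_day(trips)
for day, count in result.items():
    print(f"{day}:{count}")
```

In this made-up sample, the 5th of January totals 3 passengers and the 5th of February totals 4, so day 5 reports 4. Malformed timestamps would raise an error in this sketch and would need explicit handling on real data.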
TOP HADOOP VENDORS
Amazon Elastic Map Reduce (EMR)
Cloudera* CDH Hadoop Distribution
Hortonworks* Data Platform (HDP)
MapR Hadoop Distribution
IBM Open Platform
Microsoft Azure HDInsight
Pivotal Big Data Suite
*Merged
PIG
Grew out of Yahoo
A platform for analyzing large data sets
Pig Latin – A procedural language
 Provides a sequence of data transformations
• To merge, filter, apply functions, group records
• Supports User Defined Functions (UDF) for special processing
Programs are compiled into map reduce jobs
 Support for Python, Java, Groovy, JavaScript, Ruby
PIG
Abstraction for map reduce programming
 Improves developer productivity
 Suitable for use by data analysts
Lower performance than map reduce
 Use additional machines in cluster to increase performance
Used to perform tasks for
 Data Storage
 Data Execution
 Data Manipulation
QUESTION : PIG
For all data available for the year 2014 (consider all months), which drop-off
location had the maximum total amount collected by credit card for trips
exceeding 1 mile where no toll was paid, a tip was paid and the standard rate
was applied for yellow taxi rides? Any trip data that did not have a drop-off
date between 1st January 2014 and 31st December 2014, or does not have a valid
month or a valid day of the month, should be ignored. Print the drop-off
location ID (IDS) and the aggregated total amount for the top location.
ANSWER : PIG
Drop-Off Latitude | Drop-Off Longitude | Sum Total Amount
40.78508 | -73.95587 | $65221.65
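The Pig-style sequence of transformations for this question (filter, group, sum, order, limit) can be sketched in plain Python. This is an illustration only, not the Pig Latin script from the presentation. The field names, the payment code `"CRD"` and the rate code `1` for the standard rate are assumptions.

```python
from collections import defaultdict
from datetime import datetime

CREDIT_CARD = "CRD"   # assumed payment-type code for credit card
STANDARD_RATE = 1     # assumed rate code for the standard rate


def valid_2014(timestamp):
    """True if the timestamp parses (valid month and day) and falls in 2014."""
    try:
        return datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").year == 2014
    except ValueError:
        return False


def top_dropoff(trips):
    # FILTER: keep only eligible trips.
    eligible = [
        t for t in trips
        if valid_2014(t["dropoff_datetime"])
        and t["payment_type"] == CREDIT_CARD
        and t["rate_code"] == STANDARD_RATE
        and t["trip_distance"] > 1
        and t["tolls_amount"] == 0
        and t["tip_amount"] > 0
    ]
    # GROUP BY drop-off location and SUM the total amount.
    totals = defaultdict(float)
    for t in eligible:
        totals[(t["dropoff_latitude"], t["dropoff_longitude"])] += t["total_amount"]
    if not totals:
        return None
    # ORDER BY sum DESC, LIMIT 1.
    location, amount = max(totals.items(), key=lambda item: item[1])
    return location, round(amount, 2)


def trip(**overrides):
    """Build a sample trip; every value here is made up for the example."""
    base = {
        "dropoff_datetime": "2014-06-01 12:00:00", "payment_type": CREDIT_CARD,
        "rate_code": STANDARD_RATE, "trip_distance": 2.5, "tolls_amount": 0.0,
        "tip_amount": 2.0, "total_amount": 20.5,
        "dropoff_latitude": 40.78, "dropoff_longitude": -73.95,
    }
    base.update(overrides)
    return base


sample = [
    trip(),
    trip(total_amount=10.25),
    trip(dropoff_latitude=40.70, dropoff_longitude=-74.00, total_amount=25.0),
    trip(dropoff_latitude=40.70, dropoff_longitude=-74.00, total_amount=100.0, payment_type="CSH"),
    trip(dropoff_latitude=40.70, dropoff_longitude=-74.00, total_amount=100.0, tolls_amount=5.33),
    trip(dropoff_datetime="2014-02-30 08:00:00", total_amount=500.0),
]
print(top_dropoff(sample))
```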
HIVE
Developed at Facebook
A SQL engine with its own metastore, running on top of HDFS
 Can be queried through HQL (Hive Query Language)
Provides a traditional data warehouse interface
Hive compiler
 Converts hive queries to map reduce programs
 Executed in parallel across machines in the Hadoop cluster
HIVE
 Abstraction for map reduce programming
 Improves developer productivity
 Suitable for individuals with a SQL background
 Lower performance than map reduce
 Use additional machines in cluster to increase performance
 Supports User Defined Functions (UDFs)
 Used for processing structured data
 Data is loaded in tables
 Unstructured data needs to be structured
 Data is then loaded to tables
QUESTION : HIVE
Which three pairs of pickup location / drop off location had the largest ratio of
total amount paid per passenger for trips taken by a yellow taxi for all data
available for the year 2014? Only trips that utilized a payment type of credit
card and utilized a standard rate code should be considered. Any trip data that
did not have a drop-off date between 1st January 2014 and 31st December
2014, or does not have a valid month or does not have a valid day of the
month should be ignored. Print the rank, pickup location, drop off location and
the ratio of total amount paid to the passenger count for these three top pairs
of pickup / drop off locations. Locations should be printed in descending order
of the ratio of total amount paid to the passenger count. The pickup location
and drop off location should be printed as a string of the form
"latitude:longitude" based on the latitude and longitude of the pick-up location
or drop off location. A dense ranking should be performed.
ANSWER : HIVE
Rank | Pickup Latitude | Pickup Longitude | Drop-Off Latitude | Drop-Off Longitude | Ratio of Total Amount to Total Count
1 | 40.72941 | -73.98386 | 41.30529 | -72.92268 | $401.5
2 | 40.73249 | -73.98791 | 40.72129 | -73.95615 | $354.25
3 | 40.67019 | -73.91853 | 40.87084 | -73.90391 | $354.0
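The aggregation and dense ranking asked for in the question can be sketched in plain Python. This is an illustration, not the Hive query from the presentation. It assumes the trips have already been filtered (credit card, standard rate, valid 2014 drop-off date) and that locations are already "latitude:longitude" strings. The field names are hypothetical.

```python
from collections import defaultdict


def top_pairs(trips, limit=3):
    """Return (rank, pickup, dropoff, ratio) rows whose dense rank is <= limit."""
    sums = defaultdict(lambda: [0.0, 0])  # pair -> [total amount, passenger count]
    for t in trips:
        key = (t["pickup"], t["dropoff"])
        sums[key][0] += t["total_amount"]
        sums[key][1] += t["passenger_count"]

    ratios = sorted(
        ((amount / passengers, key) for key, (amount, passengers) in sums.items() if passengers > 0),
        key=lambda row: row[0],
        reverse=True,
    )

    ranked, rank, previous = [], 0, None
    for ratio, (pickup, dropoff) in ratios:
        if ratio != previous:  # dense rank: ties share a rank, no gaps afterwards
            rank += 1
            previous = ratio
        if rank > limit:
            break
        ranked.append((rank, pickup, dropoff, ratio))
    return ranked


trips = [
    {"pickup": "40.70:-74.00", "dropoff": "40.80:-73.90", "total_amount": 40.0, "passenger_count": 1},
    {"pickup": "40.70:-74.00", "dropoff": "40.80:-73.90", "total_amount": 20.0, "passenger_count": 2},
    {"pickup": "40.71:-74.01", "dropoff": "40.75:-73.95", "total_amount": 30.0, "passenger_count": 1},
    {"pickup": "40.72:-74.02", "dropoff": "40.76:-73.96", "total_amount": 60.0, "passenger_count": 2},
    {"pickup": "40.73:-74.03", "dropoff": "40.77:-73.97", "total_amount": 10.0, "passenger_count": 1},
    {"pickup": "40.74:-74.04", "dropoff": "40.78:-73.98", "total_amount": 5.0, "passenger_count": 1},
]
ranked = top_pairs(trips)
for row in ranked:
    print(row)
```

With dense ranking, two pairs that tie on the ratio share a rank, so "top three" ranks can cover more than three rows, as in this made-up sample.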
COMPARISON – MAP REDUCE, PIG, HIVE
MAP REDUCE | PIG | HIVE
Compiled Language | Scripting Language | Query Language
Lower level of abstraction | Higher level of abstraction | Higher level of abstraction
Higher learning curve | Lower learning curve | Lowest learning curve
Best performance for very large data | Intermediate performance for very large data (50% lower) | Least performance for very large data
Programmer writes more lines of code | Programmer writes intermediate lines of code | Programmer writes least lines of code
Highest code efficiency (more flexibility) | Relatively less code efficiency (lesser flexibility) | Relatively less code efficiency (lesser flexibility)
Possible to handle unstructured data | Not very friendly with unstructured data like images | Not very friendly with unstructured data like images
Possible to deal with poor schema design of xml, json | Cannot deal with poor design of xml, json | Not easy to deal with poor design of xml, json
More potential of introducing defects due to having to write very custom code | Limited possibility of introducing defects due to fixed syntactic possibilities | Limited possibility of introducing defects due to fixed syntactic possibilities
SPARK
 Developed at UC Berkeley AMPLab
 An open source big data framework
 Utilizes DAG (Directed Acyclic Graph) programming style
 Now maintained by non-profit Apache Software Foundation
 A unified analytics engine for
 Large scale processing
 Faster, general purpose processing
 Reduces read/write operations from/to disk
 Intermediate data stored in memory to achieve speed
 RDDs (Resilient Distributed Datasets)
 DataFrame
 Used to build batch, iterative, interactive, graph and streaming applications
SPARK
Supports cross-platform development
Programming in Scala, Java, Python, R, SQL – Core API’s
 PySpark (Python)
 SparkR
 Spark SQL (fka Shark)
Rest of the eco-system
 MLlib (Machine Learning)
 GraphX (Graph Computation)
 Spark Streaming
COMPARISON – SPARK, MAP REDUCE
CRITERIA | SPARK | MAP REDUCE
Written In | Scala | Java
License | Apache 2 | Apache 2
OS support | Cross-platform | Cross-platform
Programming Languages | Scala, Java, Python, R, SQL | Java, C, C++, Ruby, Groovy, Python, Perl
Lines of Code (LOC) | Approximately 20,000 | Approximately 120,000
Hardware Requirements | Requires the use of mid to high level hardware | Runs well on commodity hardware
Data Storage | Hadoop Distributed File System (HDFS), Google Cloud Storage, Amazon S3, Microsoft Azure | Hadoop Distributed File System (HDFS), MapR, HBase
Community | Strong community, one of the most active projects at Apache | MapReduce community has shifted to Spark
Scalability | Highly scalable, one of the largest clusters has 8K nodes | Even higher scalability, one of the largest clusters has 14K nodes
COMPARISON – SPARK, MAP REDUCE
CRITERIA | SPARK | MAP REDUCE
Speed | 100x faster in memory, 10x faster on disk | Faster than traditional approaches
Difficulty / Ease of use | Easy to program with the use of high level operators (RDD’s and data frames) | Difficult due to the need to program each and every operation
Ease of management | Easy since it is a single analytics engine that performs various tasks | It is a batch engine and needs to be coupled with other engines like Storm, Giraph, Impala etc. to achieve various tasks
Fault tolerance | No need to start from scratch (except for programming errors) but some limitations due to in memory operations | No need to start from scratch (except for programming errors)
Data Processing modes | Batch, Real Time, Iterative, Interactive, Graph, Streaming | Batch
API’s and caching | Caches data in memory | No support for caching
SQL Support | Support via Spark SQL (fka Shark) | Supported via Hive
COMPARISON – SPARK, MAP REDUCE
CRITERIA | SPARK | MAP REDUCE
Real Time analysis | Possible to handle at scale | No support for real-time analysis
Streaming | Spark Streaming handles streaming | No support for streaming
Interactive mode | Supported | Not supported
Recovery | Allows recovery of failed nodes by re-computation of DAG | Resilient to system faults or failures. It is a highly tolerant system.
Latency | Low | High
Scheduler | Due to in-memory computations it acts as its own flow-scheduler | Requires an external job scheduler like Oozie for its flows
Security / Access Permission | Less secure since the only mechanism supported is shared secret authentication | More secure because of Kerberos and ACL’s (access control lists)
Cost | Requires plenty of RAM for in-memory computations, so increases costs as cluster size increases | It is cheaper in terms of cost
Category Choice | Choice of data scientists since it is a complete analytics engine | Choice of data engineers since it is a basic data processing engine
QUESTION : SPARK
Which day (or days) across all months in the year 2014 yielded the largest total
tip amount (across all eligible trips) as a percentage of the total amount (across
all eligible trips) for trips that charged the standard rate on a Yellow Taxi where
the total amount for each trip exceeded 5 and no toll was paid? Print the day
(or days) of the month (only the day ranging from 1 to 31) in 2014 and the
total tip amount as a percentage of the total amount. Utilize the pickup date
time for deciding which day of the month that the trip counts against. The drop
off datetime need not be considered. Any trip data that did not have a pickup
date between 1st January 2014 and 31st December 2014, or does not have a
valid month or does not have a valid day of the month should be ignored.
ANSWER : SPARK
PICK UP DAY OF THE MONTH | PERCENTAGE OF SUM TIP AMOUNT TO SUM TOTAL AMOUNT
10 | 9.991076
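The computation behind this answer (filter, aggregate per day of the month, take the ratio of sums, keep the maximum) can be sketched in plain Python. This is a stand-in for the logic only, not the Spark job from the presentation. The field names and the rate code `1` for the standard rate are assumptions.

```python
from collections import defaultdict
from datetime import datetime

STANDARD_RATE = 1  # assumed rate code for the standard rate


def best_tip_days(trips):
    """Return [(day, tip percentage)] for the day(s) with the largest tip share."""
    tips = defaultdict(float)
    totals = defaultdict(float)
    for t in trips:
        try:
            pickup = datetime.strptime(t["pickup_datetime"], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # invalid month or day of the month: ignore the trip
        if (
            pickup.year != 2014
            or t["rate_code"] != STANDARD_RATE
            or t["total_amount"] <= 5
            or t["tolls_amount"] != 0
        ):
            continue
        tips[pickup.day] += t["tip_amount"]
        totals[pickup.day] += t["total_amount"]

    # Ratio of sums per day, not the average of per-trip ratios.
    percentages = {day: 100.0 * tips[day] / totals[day] for day in totals}
    if not percentages:
        return []
    best = max(percentages.values())
    return sorted((day, pct) for day, pct in percentages.items() if pct == best)


def trip(when, tip, total, rate=STANDARD_RATE, tolls=0.0):
    """Build a sample trip; all values are made up for the example."""
    return {"pickup_datetime": when, "tip_amount": tip, "total_amount": total,
            "rate_code": rate, "tolls_amount": tolls}


sample = [
    trip("2014-01-10 08:00:00", 2.0, 10.0),
    trip("2014-03-10 18:30:00", 1.0, 10.0),
    trip("2014-01-11 08:00:00", 1.0, 10.0),
    trip("2014-01-11 09:00:00", 3.0, 4.0),               # total not above 5
    trip("2014-01-11 10:00:00", 9.0, 20.0, tolls=5.33),  # toll paid
    trip("2014-02-30 10:00:00", 9.0, 20.0),              # invalid date
]
print(best_tip_days(sample))
```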
ALTERNATIVES
TECHNIQUE | ALTERNATIVE TECHNIQUE
Map Reduce | Apache Spark
Pig | Apache Spark
Hive | Apache Spark, Impala, HAWQ, Spark SQL, Shark, Pivotal HDB (fka HAWQ), PrestoDB (Facebook), BigSQL (IBM), BigQuery (Google)
Spark | Apache Storm, Flume, Cassandra, Amazon Kinesis, Splunk, Elasticsearch, Koalas (Databricks), Vaex – Python library for lazy out-of-core data frames
REFERENCES
What is Big Data?
Data Center storage capacity worldwide from 2016 to 2021, by segment
How big is a Yottabyte?
What is High Performance Computing?
The 7 V’s of Big Data
Distributed Computing
Hadoop Combiner – Best Explanation to MapReduce Combiner
Pig Documentation
UC Berkeley AMPLab
NYC TLC Trip Record Data
Map Reduce vs Pig vs Hive
Spark vs Hadoop MapReduce: Which big data framework to choose
Apache Spark vs Hadoop MapReduce – Feature Wise Comparison
Spark vs Hadoop MapReduce
MapReduce vs Spark – 20 Useful Comparisons To Learn
Spark vs Hadoop : Which is the Best Big Data Framework

High Performance Computing on NYC Yellow Taxi Data Set
