MapMyCab
Preetika Kulshrestha
Insight Data Engineering, Feb 2015
Motivation
• Tool for data scientists and cab dispatchers to analyze (by time of day or day of week):
  • cab occupancy
  • miles travelled
  • pickups and drop-offs
• An app for city dwellers to view the real-time status of unoccupied cabs in a given area
Demo
Pipeline
[Pipeline diagram. Components shown: Cab Data, Message Broker, Real-Time Streaming, HDFS, MrJob, HBase, UI. Annotation: 11 million rows.]
Data Aggregation
Input record fields: CabID, Lat, Long, Occ, Timestamp
Aggregate metrics (per cab), computed with MrJob: year, month, day, hour, avocc, pickup, drop off
• Drop-off event: occupancy changes from 1 to 0
• Pickup event: occupancy changes from 0 to 1
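The event rules above can be sketched in plain Python. This is a minimal illustration, not the MrJob code from the project: the function name `aggregate_cab`, the `(timestamp, occ)` tuple layout, the use of UTC epoch seconds, and the output field names are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone


def aggregate_cab(records):
    """Aggregate one cab's records into hourly metrics.

    records: iterable of (timestamp_seconds, occ) tuples for a single cab,
    where occ is 0 (free) or 1 (occupied). Layout is hypothetical.
    """
    stats = defaultdict(lambda: {"occ_sum": 0, "n": 0, "pickup": 0, "dropoff": 0})
    prev_occ = None
    # Sort by timestamp so occupancy transitions are seen in order.
    for ts, occ in sorted(records):
        t = datetime.fromtimestamp(ts, tz=timezone.utc)
        s = stats[(t.year, t.month, t.day, t.hour)]
        s["occ_sum"] += occ
        s["n"] += 1
        if prev_occ == 0 and occ == 1:
            s["pickup"] += 1      # occupancy change from 0 to 1
        elif prev_occ == 1 and occ == 0:
            s["dropoff"] += 1     # occupancy change from 1 to 0
        prev_occ = occ
    return {
        key: {
            "avocc": s["occ_sum"] / s["n"],
            "pickup": s["pickup"],
            "dropoff": s["dropoff"],
        }
        for key, s in stats.items()
    }


if __name__ == "__main__":
    sample = [(0, 0), (60, 1), (120, 1), (180, 0), (3600, 0), (3660, 1)]
    for key, metrics in sorted(aggregate_cab(sample).items()):
        print(key, metrics)
```

In a MapReduce setting the same logic would likely sit in a reducer keyed by cab ID, so that each reducer call sees one cab's records together.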
Computing Trip Durations and Shift Times
• Used a windowing function in Hive to calculate idle times
• The maximum idle time in a day points to a potential shift
• 1 million trips
[Table: idle/shift time (hours). Columns: tripId, hour, idle (s), idle (h)]
[Figure: Occupancy Profile. occ(%) on the y-axis (ticks from 0 to 0.7) against hour of day on the x-axis (0 to 23), with the potential shift time annotated.]
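The idle-time idea can be sketched in Python as a stand-in for the Hive window query, which the slides do not show. The sketch mimics a LAG/LEAD-style window partitioned by cab and ordered by time: each trip's end is compared with the next trip's start. The trip tuple layout `(cab_id, start_ts, end_ts)`, the function names, and the use of UTC epoch seconds are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timezone


def idle_times(trips):
    """Return (cab_id, day, hour, idle_seconds) for each gap between trips.

    trips: list of (cab_id, start_ts, end_ts) in epoch seconds (hypothetical layout).
    The day and hour are those at which the idle period begins.
    """
    by_cab = defaultdict(list)
    for cab, start, end in trips:
        by_cab[cab].append((start, end))
    gaps = []
    for cab, cab_trips in by_cab.items():
        cab_trips.sort()
        # Pair each trip with the next one, like LEAD(start) OVER (PARTITION BY cab).
        for (_, prev_end), (next_start, _) in zip(cab_trips, cab_trips[1:]):
            t = datetime.fromtimestamp(prev_end, tz=timezone.utc)
            gaps.append((cab, t.date().isoformat(), t.hour, next_start - prev_end))
    return gaps


def longest_idle_per_day(gaps):
    """Keep the longest idle gap per (cab, day) as the potential shift time."""
    best = {}
    for cab, day, hour, idle in gaps:
        key = (cab, day)
        if key not in best or idle > best[key][1]:
            best[key] = (hour, idle)
    return best


if __name__ == "__main__":
    trips = [("a", 0, 600), ("a", 900, 1800), ("a", 30000, 31000)]
    for key, (hour, idle) in longest_idle_per_day(idle_times(trips)).items():
        print(key, "hour", hour, "idle (s)", idle, "idle (h)", round(idle / 3600, 2))
```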
Tables
• Hourly data organized by day of week
• Aggregate metrics stored in the same table for fast retrieval
Hourly Aggregates by Day of Week
Row key (y_m_dow): year, month, and day of week, e.g. 2008_01_Mon
Columns c:0 to c:23 (one per hour): pickups, dropoffs, avg_occ, avg_dist
Column c:Totals: sum(pickups), sum(drop offs), avg(occ), avg(dist)
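A rough sketch of how such a row could be assembled before being written to the table. It uses plain dictionaries rather than an HBase client, and the helper names `row_key` and `build_row` are hypothetical. How the totals for avg(occ) and avg(dist) are computed is not stated on the slide; the sketch assumes a simple mean over the hours present.

```python
from datetime import date

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def row_key(d):
    """Build a y_m_dow row key such as 2008_01_Mon from a date."""
    return f"{d.year}_{d.month:02d}_{DAYS[d.weekday()]}"


def build_row(hourly):
    """Build the column map for one row.

    hourly: dict mapping hour (0..23) to a dict with the keys
    pickups, dropoffs, avg_occ, avg_dist.
    """
    row = {f"c:{hour}": metrics for hour, metrics in hourly.items()}
    n = len(hourly)
    row["c:Totals"] = {
        "pickups": sum(m["pickups"] for m in hourly.values()),
        "dropoffs": sum(m["dropoffs"] for m in hourly.values()),
        # Assumption: unweighted mean of the hourly averages.
        "avg_occ": sum(m["avg_occ"] for m in hourly.values()) / n,
        "avg_dist": sum(m["avg_dist"] for m in hourly.values()) / n,
    }
    return row


if __name__ == "__main__":
    hourly = {
        0: {"pickups": 2, "dropoffs": 1, "avg_occ": 0.5, "avg_dist": 2.0},
        1: {"pickups": 4, "dropoffs": 3, "avg_occ": 0.25, "avg_dist": 4.0},
    }
    print(row_key(date(2008, 1, 7)), build_row(hourly))
```

Keeping the totals in the same row as the hourly columns means a single row read returns both, which matches the "fast retrieval" point on the slide.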
Takeaways
• HBase row-level atomicity can be leveraged for transactional operations
• A keyed producer in Kafka assures in-order delivery of messages (by key)
• Starting with simple operations for tool integration and adding complexity incrementally streamlines the development process
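The per-key ordering point can be illustrated with a small simulation that does not use a Kafka client. Messages with the same key are hashed to the same partition, and each partition is an append-only sequence, so messages for one cab keep their send order. The CRC32 hash here is a stand-in chosen for the example (Kafka's own partitioner uses a different hash function), and `partition_for` and `route` are hypothetical names.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition index.

    Stand-in hash (CRC32) for illustration only; not Kafka's partitioner.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(messages, num_partitions):
    """Append each (key, value) message to the partition chosen by its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in messages:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


if __name__ == "__main__":
    messages = [("cab1", "t1"), ("cab2", "t1"), ("cab1", "t2"), ("cab1", "t3")]
    parts = route(messages, 3)
    p = partition_for("cab1", 3)
    # All cab1 messages sit in one partition, in the order they were sent.
    print([v for k, v in parts[p] if k == "cab1"])
```

Ordering holds only within a key's partition; messages for different cabs may interleave in any order across partitions.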
About Me
• Previous Life - Senior Energy Analyst
(EnerNOC Inc.).
• M.S. Electrical Engineering - North Carolina
State University (focus on robotics, control
systems and smart grid).
• https://github.com/PreetikaKuls
• preetika.kulshrestha@gmail.com
Batch Views