Analyzing NYC Transit Data

Analyzing NYC Transit Data:
Taxis, Ubers, and Citi Bikes
Todd Schneider
April 8, 2016
todd@toddwschneider.com

Where to find me
toddwschneider.com
github.com/toddwschneider
@todd_schneider
toddsnyder

Things I’ll talk about
• Taxi, Uber, and Citi Bike data
• Medium data analysis tools and tips
• Where does R fit in?

Taxi and Uber Data
http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/

Citi Bike Data
http://toddwschneider.com/posts/a-tale-of-twenty-two-million-citi-bikes-analyzing-the-nyc-bike-share-system/

NYC Taxi and Uber Data
• Taxi & Limousine Commission released
public, trip-level data for over 1.1 billion
taxi rides 2009–2015
• Some public Uber data available as
well, thanks to a FOIL request by
FiveThirtyEight
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

Citi Bike Data
• Citi Bike releases monthly data for every
individual ride
• Data includes timestamps and locations,
plus rider’s subscriber status, gender,
and age
https://www.citibikenyc.com/system-data

Generic Analysis Overview
1. Get raw data
2. Write code to process raw data into
something more useful
3. Analyze data
4. Write about what you found out

Analysis Tools
• PostgreSQL
• PostGIS
• R
• Command line
• JavaScript
https://github.com/toddwschneider/nyc-taxi-data
https://github.com/toddwschneider/nyc-citibike-data

Raw data processing goals
• Load flat files of varying file formats into a unified,
persistent PostgreSQL database that we can use to
answer questions about the data
• Do some one-time calculations to augment the raw
data
• We want to answer neighborhood-based questions,
so we’ll map latitude/longitude coordinates to NYC
census tracts

Processing raw data:
The reality
• Often messy, raw data can require massaging
• Not fun, takes a while, but is essential
• Specifically: we have to plan ahead a bit,
anticipate usage patterns and the questions we’re
going to ask, then decide on a schema

Specific issues encountered
with raw taxi data
• Some files contain empty lines and unquoted carriage
returns 😐
• Raw data files have different formats even within the
same cab type 😕
• Some files contain extra columns in every row 😠
• Some files contain extra columns in only some rows 😡
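
The issues above can be handled with a small cleaning pass before loading. What follows is a minimal Python sketch, not the code used for the original analysis: the function name `clean_rows` and the policy of skipping short rows are assumptions made for illustration.

```python
import csv
import io


def clean_rows(raw_text, expected_columns):
    """Yield rows with exactly `expected_columns` fields.

    Blank lines are skipped and extra trailing columns are dropped.
    Rows with too few columns are skipped rather than guessed at.
    """
    # Normalize line endings, then remove any stray carriage returns.
    normalized = raw_text.replace("\r\n", "\n").replace("\r", "")
    for row in csv.reader(io.StringIO(normalized)):
        if not row or all(field.strip() == "" for field in row):
            continue  # empty line
        if len(row) < expected_columns:
            continue  # malformed row: too few fields
        yield row[:expected_columns]  # drop extra trailing columns


raw = "id,fare\r\n1,9.5\r\n\r\n2,7.0,extra\n3\n"
print(list(clean_rows(raw, 2)))
# [['id', 'fare'], ['1', '9.5'], ['2', '7.0']]
```

Truncating to a fixed width treats "extra columns in every row" and "extra columns in only some rows" the same way, which keeps the loader simple at the cost of silently discarding those fields.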

How do we load a bunch of
files into a database?
• One at a time!
• A Bash script loops through each raw data file; for
each file, it executes code to process the data and
insert records into a database table
https://github.com/toddwschneider/nyc-taxi-data/blob/master/import_trip_data.sh

How do we map latitude and
longitude to census tracts?
• PostGIS!
• Geographic information system (GIS) for
PostgreSQL
• Can do calculations of the form, “is a point inside a
polygon?”
• Every pickup/drop off is a point, NYC’s census
tracts are polygons

NYC Census
Tracts
• 2,166 tracts
• 196 neighborhood
tabulation areas
(NTAs)

Shapefiles
• Shapefile format describes geometries like points,
lines, polygons
• Many shapefiles publicly available, e.g. NYC
provides a shapefile that contains definitions for all
census tracts and NTAs
• PostGIS includes functionality to import shapefiles

PostGIS: ST_Within()
• ST_Within(geom A, geom B) function returns
true if and only if A is entirely within B
• A = pickup or drop off point
• B = NYC census tract polygon
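
To make the "is a point inside a polygon?" question concrete, here is a minimal Python sketch of the classic ray-casting test. It illustrates the geometric idea only and makes no claim about how PostGIS implements ST_Within(); the function name `point_in_polygon` is an assumption for this example.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is the point (x, y) inside `polygon`?

    `polygon` is a list of (x, y) vertices in order; the last vertex
    connects back to the first. Points lying exactly on an edge are
    not treated specially in this sketch.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray going right from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # each crossing flips inside/outside
    return inside


square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon(2, 2, square))  # True
print(point_in_polygon(5, 2, square))  # False
```

The loop visits every edge, so the cost grows with the number of vertices in the polygon.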

Spatial Indexes
• Problem: determining whether a point is inside an
arbitrary polygon is computationally intensive and
slow
• PostGIS spatial indexes to the rescue!

Spatial indexes in a nutshell
[Figure: a census tract polygon enclosed by its rectangular bounding box]

Spatial Indexes
• Determining whether a point is inside a rectangle is easy!
• Spatial indexes store rectangular bounding boxes for
polygons, then when determining if a point is inside a
polygon, calculate in 2 steps:
1. Is the point inside the polygon’s bounding box?
2. If so, is the point inside the polygon itself?
• Most of the time the cheap first check will be false, so
we can skip the expensive second step
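
The two-step check can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the tract names and shapes are made up, and the loop scans every bounding box in turn, whereas a real spatial index organizes the boxes in a tree so that most are never examined at all.

```python
def bounding_box(polygon):
    """Smallest axis-aligned rectangle containing every vertex."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def in_box(x, y, box):
    """Cheap check: four comparisons."""
    min_x, min_y, max_x, max_y = box
    return min_x <= x <= max_x and min_y <= y <= max_y


def in_polygon(x, y, polygon):
    """Expensive check: ray casting over every edge of the polygon."""
    inside = False
    for i in range(len(polygon)):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % len(polygon)]
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside


def find_tract(x, y, tracts, boxes):
    """Return (tract name or None, number of expensive tests that ran)."""
    exact_tests = 0
    for name, polygon in tracts.items():
        if not in_box(x, y, boxes[name]):
            continue  # step 1 failed, so step 2 is skipped
        exact_tests += 1
        if in_polygon(x, y, polygon):
            return name, exact_tests
    return None, exact_tests


# Hypothetical tracts: a triangle and a square far away from it.
tracts = {
    "A": [(0, 0), (4, 0), (0, 4)],
    "B": [(10, 10), (14, 10), (14, 14), (10, 14)],
}
boxes = {name: bounding_box(polygon) for name, polygon in tracts.items()}

print(find_tract(1, 1, tracts, boxes))      # ('A', 1)
print(find_tract(3.5, 3.5, tracts, boxes))  # (None, 1): in A's box, not in A
print(find_tract(20, 20, tracts, boxes))    # (None, 0): no expensive test ran
```

The second call shows why step 2 is still needed: a point can fall inside a polygon's bounding box without being inside the polygon.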

Putting it all together
• Download NYC census tracts shapefile, import
into database, create spatial index
• Download raw taxi/Uber/Citi Bike data files and
loop through them, one file at a time
• For each file: fix data issues, load into database,
calculate census tracts with ST_Within()
• Wait 3 days and voila!

Analysis, a.k.a. “the fun part”
• Ask fun and interesting
questions
• Try to answer them
• Rinse and repeat

Taxi maps
• Question: what does a map of every taxi pickup
and drop off look like?
• Each trip has a pickup and a drop off location; plot a
dot at each of those locations
• Made entirely in R using ggplot2

Taxi maps preprocess
• Problem: R can’t fit 1.1 billion rows in memory
• Solution: preprocess data by rounding
lat/long to 4 decimal places (~10
meters), count number of trips at each
aggregated point
https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/prepare_analysis.sql#L194-L215
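
The rounding-and-counting idea is small enough to sketch in a few lines of Python. The linked preprocessing is done in SQL; this version only illustrates the same aggregation on made-up coordinates, and the name `aggregate_points` is an assumption.

```python
from collections import Counter


def aggregate_points(points, decimals=4):
    """Count trips per rounded (latitude, longitude) cell."""
    return Counter(
        (round(lat, decimals), round(lon, decimals)) for lat, lon in points
    )


# Made-up pickup coordinates; the first two fall in the same rounded cell.
pickups = [
    (40.750012, -73.990011),
    (40.749991, -73.989987),
    (40.761200, -73.982200),
]
counts = aggregate_points(pickups)
print(counts[(40.75, -73.99)])      # 2
print(counts[(40.7612, -73.9822)])  # 1
```

The plotting step then needs only one row per occupied cell plus a count, instead of one row per trip.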

Render maps in R
https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/analysis.R

Data reliability
Every other comment on reddit:

Citi Bike Animation
• Map the position of every Citi Bike over
the course of a single day
• Google Maps Directions API for cycling
directions
• Leaflet.js for mapping
• Torque.js by CartoDB for animation

Citi Bike Assumptions
• Google Maps cycling directions
have strong bias for dedicated
bike lanes on 1st, 2nd, 8th, and
9th avenues
• Not necessarily true!

Modeling the relationship between
the weather and Citi Bike ridership
• Daily ridership data from Citi Bike
• Daily weather data from National Climatic
Data Center: temperature, precipitation,
snow depth
• Devise and calibrate model in R

Calibration in R
• Uses the nlsLM() function from the
minpack.lm package, which applies the
Levenberg–Marquardt algorithm to
minimize nonlinear squared error
https://gist.github.com/toddwschneider/bac3350f84b2ff99969d

Airport traffic
• Question: how long will my taxi take to get to the
airport?
• LGA, JFK, and EWR are each their own census
tracts
• Get all trips that dropped off in one of those tracts
• Calculate travel times from neighborhoods to airports
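
As a rough illustration of the last two steps, the Python sketch below filters trips by drop off tract and summarizes travel time per pickup neighborhood. The field names, the tract labels, and the choice of the median as the summary statistic are all assumptions for this example, not details taken from the original analysis.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Hypothetical labels standing in for the three airport census tracts.
AIRPORT_TRACTS = {"LGA", "JFK", "EWR"}


def median_minutes_to_airports(trips):
    """Median trip duration in minutes, keyed by (pickup neighborhood, airport)."""
    durations = defaultdict(list)
    for trip in trips:
        if trip["dropoff_tract"] not in AIRPORT_TRACTS:
            continue  # keep only trips that ended at an airport
        elapsed = trip["dropoff_time"] - trip["pickup_time"]
        key = (trip["pickup_neighborhood"], trip["dropoff_tract"])
        durations[key].append(elapsed.total_seconds() / 60)
    return {key: median(values) for key, values in durations.items()}


# Made-up trips for illustration.
sample_trips = [
    {"pickup_neighborhood": "Midtown", "dropoff_tract": "JFK",
     "pickup_time": datetime(2015, 6, 1, 8, 0), "dropoff_time": datetime(2015, 6, 1, 8, 50)},
    {"pickup_neighborhood": "Midtown", "dropoff_tract": "JFK",
     "pickup_time": datetime(2015, 6, 2, 17, 0), "dropoff_time": datetime(2015, 6, 2, 18, 10)},
    {"pickup_neighborhood": "Midtown", "dropoff_tract": "Chelsea",
     "pickup_time": datetime(2015, 6, 2, 9, 0), "dropoff_time": datetime(2015, 6, 2, 9, 12)},
]
print(median_minutes_to_airports(sample_trips))
# {('Midtown', 'JFK'): 60.0}
```

Grouping additionally by hour of day or day of week would follow the same pattern with a longer key.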

More fun stuff in the full posts
• On the realism of Die Hard 3
• Relationship between age, gender, and cycling
speed
• Neighborhoods with most nightlife
• East Hampton privacy concerns
• What time do investment bankers arrive at work?

“Medium data” analysis tips

What is “medium data”?
No clear answer, but my rough thinking:
• Tiny: fits in spreadsheet
• Small: doesn’t fit in spreadsheet, but fits in RAM
• Medium: too big for RAM, but fits on local hard disk
• Big: too big for local disk, has to be distributed across
many nodes

Use the right tool for the job
My personal toolkit (yours may vary!):
• PostgreSQL for storing and aggregating data. Geospatial
calculations with PostGIS extension
• R for modeling and plotting
• Command line tools for looping through files, loading data, text
processing on input data with sed, awk, etc.
• Ruby for making API calls, scraping websites, running web servers,
and sometimes using local rails apps to organize relational data
• JavaScript for interactivity on the web

R + PostgreSQL
• The R ↔ Postgres link is invaluable! Use R and
Postgres for the things they’re respectively best at
• Postgres: persisting data in tables, rote number
crunching
• R: calibrating models, plotting
• RPostgreSQL package allows querying Postgres
from within R

Tip: pre-aggregate
• Think about how you’re going to access the data, and
consider creating intermediate aggregated tables
which can be used as building blocks for later
analysis
• Example: number of taxi trips grouped by pickup
census tract and date/time truncated to the hour
• Resulting table is only 30 million rows, easier to work
with than full trips table, and can still answer lots of
interesting questions

Pre-aggregating example
-- One row per pickup hour, cab type, and pickup census tract
CREATE TABLE hourly_pickups AS
SELECT
  date_trunc('hour', pickup_datetime) AS pickup_hour,
  cab_type_id,
  pickup_nyct2010_gid,
  COUNT(*)
FROM trips
WHERE pickup_nyct2010_gid IS NOT NULL
GROUP BY pickup_hour,
  cab_type_id,
  pickup_nyct2010_gid;
https://github.com/toddwschneider/nyc-taxi-data/blob/master/analysis/prepare_analysis.sql#L30-L38
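
To show how an aggregated table like this can serve as a building block, here is a small Python sketch that rolls hourly rows up into daily totals per census tract. The row layout mirrors the columns of the query above, but the sample values and the function name `daily_pickups_by_tract` are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def daily_pickups_by_tract(hourly_rows):
    """Sum hourly counts into (date, tract) totals, across all cab types."""
    totals = Counter()
    for pickup_hour, cab_type_id, tract_gid, count in hourly_rows:
        totals[(pickup_hour.date(), tract_gid)] += count
    return totals


# Made-up rows: (pickup_hour, cab_type_id, pickup_nyct2010_gid, count)
hourly_rows = [
    (datetime(2015, 1, 1, 8), 1, 101, 40),
    (datetime(2015, 1, 1, 9), 1, 101, 55),
    (datetime(2015, 1, 1, 9), 2, 101, 5),
    (datetime(2015, 1, 2, 8), 1, 101, 30),
]
daily = daily_pickups_by_tract(hourly_rows)
print(daily[(datetime(2015, 1, 1).date(), 101)])  # 100
```

Because the hourly table already holds counts, a roll-up like this never has to touch the full trips table.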

How to get people to read
your work
• It has to be interesting. If you’re not excited,
probably nobody else is either
• Most people are distracted, and they read things in
“fast scroll” mode. Optimize for them
• The questions you ask are more important than
the methods you use to answer them

Specific tips
• Write in short paragraphs with straightforward
language
• Use plenty of section headers
• Good ratio of pictures to text
• Avoid the dreaded “wall of text”

Above all…
• Have fun!
• Keep an inquisitive mind.
Observe stuff happening around
you, ask questions about it, try to
answer those questions

Thanks!
todd@toddwschneider.com
