Data Science
in the Cloud
Stefan Krawczyk
@stefkrawczyk
linkedin.com/in/skrawczyk
November 2016
InfoQ.com: News & Community Site
• Over 1,000,000 software developers, architects and CTOs read the site world-
wide every month
• 250,000 senior developers subscribe to our weekly newsletter
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• 2 dedicated podcast channels: The InfoQ Podcast, with a focus on
Architecture and The Engineering Culture Podcast, with a focus on building
• 96 deep dives on innovative topics packed as downloadable emags and
minibooks
• Over 40 new content items per week
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
stitchfix-cloud
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon San Francisco
www.qconsf.com
Who are Data Scientists?
Means: skills vary wildly
But they’re in
demand and expensive
“The Sexiest Job
of the 21st Century”
- HBR
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
How many
Data Scientists do you have?
At Stitch Fix we have ~80
~85% have not done formal CS
But what do they do?
What is Stitch Fix?
Two Data Scientist facts:
1. Has AWS console access*.
2. End to end,
they’re responsible.
How do we enable this without
?
Make doing the right
thing the easy thing.
Fellow Collaborators
Horizontal team focused on Data Scientist Enablement
1. Eng. Skills
2. Important
3. What they work on
Let’s Start
Will Only Cover
1. Source of truth: S3 & Hive Metastore
2. Docker Enabled DS @ Stitch Fix
3. Scaling DS doing ML in the Cloud
Source of truth:
S3 & Hive Metastore
Want Everyone to Have Same View
This is Usually Nothing to Worry About
● OS handles correct access
● DB has ACID properties
● But it’s easy to outgrow these options with big data or a big team.
S3
● Amazon’s Simple Storage Service
● Infinite* storage
● Can write, read, delete, BUT NOT append.
● Looks like a file system*:
○ URIs: my.bucket/path/to/files/file.txt
● Scales well
* For all intents and purposes
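A minimal sketch of those operations using boto3 (not from the talk; the bucket and key names are illustrative) — note there is no append, only full-object overwrite:

import boto3  # assumes AWS credentials are already configured

s3 = boto3.client("s3")

# write (a full-object overwrite -- S3 has no append)
s3.put_object(Bucket="my.bucket", Key="path/to/files/file.txt", Body=b"hello")

# read
body = s3.get_object(Bucket="my.bucket", Key="path/to/files/file.txt")["Body"].read()

# delete
s3.delete_object(Bucket="my.bucket", Key="path/to/files/file.txt")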
Hive Metastore
● Hadoop service that stores:
○ Schema
○ Partition information, e.g. date
○ Data location for a partition
sold_items:
Partition    Location
20161001     s3://bucket/sold_items/20161001
...          ...
20161031     s3://bucket/sold_items/20161031
But if we’re not careful
● Replacing data in a partition
● S3 is eventually consistent
● These bugs are hard to track down
Hive Metastore to the Rescue
● Use Hive Metastore to control partition source of truth
● Principles:
○ Never delete
○ Always write to a new place each time a partition changes
● Stitch Fix solution:
○ Use an inner directory → called Batch ID
Batch ID Pattern
● Overwriting a partition is just a matter of updating the location
● To the user this is a hidden inner directory
sold_items:
Date         Location
20161001     s3://bucket/sold_items/20161001/20161002002334/
...          ...
20161031     s3://bucket/sold_items/20161031/20161101002256/
             → s3://bucket/sold_items/20161031/20161102234252
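A minimal sketch of the overwrite step, assuming a HiveServer2 connection via PyHive (host, table, and partition names are illustrative, not Stitch Fix’s actual code): write the new data under a fresh batch-ID prefix, then repoint the partition.

from datetime import datetime
from pyhive import hive  # assumes access to a HiveServer2 endpoint

batch_id = datetime.utcnow().strftime("%Y%m%d%H%M%S")
new_location = "s3://bucket/sold_items/20161031/{}/".format(batch_id)

# 1. write the new partition data under new_location (not shown)
# 2. repoint the Hive partition at the fresh batch-ID directory
cursor = hive.connect(host="hive-server").cursor()
cursor.execute(
    "ALTER TABLE sold_items PARTITION (`date`='20161031') "
    "SET LOCATION '{}'".format(new_location)
)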
Enforce via API
API for Data Scientists
Python:
store_dataframe(df, dest_db, dest_table, partitions=['2016'])
df = load_dataframe(src_db, src_table, partitions=['2016'])
R:
sf_writer(data = result,
          namespace = dest_db,
          resource = dest_table,
          partitions = c(as.integer(opt$ETL_DATE)))
sf_reader(namespace = src_db,
          resource = src_table,
          partitions = c(as.integer(opt$ETL_DATE)))
Batch ID Pattern Benefits
● Full partition history
○ Can rollback
■ Data Scientists are less afraid of mistakes
○ Can create audit trails more easily
■ What data changed and when
○ Can anchor downstream consumers to a particular batch ID
Docker Enabled
DS @ Stitch Fix
Ad hoc Infra: In the Beginning... / Evolution I / Evolution II
(figure: options compared on Workstation, Env. Mgmt., and Scalability — Low/Low, Medium/Medium, High/High)
Ad hoc Infra: Evolution III
(figure: same comparison, but the most scalable option now scores Low on Env. Mgmt. — Low/Low, Medium/Medium, Low/High)
Why Does Docker Lower Overhead?
● Control of environment
○ Data Scientists don’t need to worry about env.
● Isolation
○ Can host many docker containers on a single machine.
● Better host management
○ Allowing central control of machine types.
Flotilla UI
Our Docker Image
● Has:
○ Our internal API libraries
○ Jupyter Notebook:
■ PySpark
■ IPython
○ Python libs:
■ scikit-learn, numpy, scipy, pandas, etc.
○ RStudio
○ R libs:
■ dplyr, magrittr, ggplot2, lme4, boot, etc.
● Mounts User NFS
● User has terminal access to file system via Jupyter for git, pip, etc.
Docker Deployment
Our Docker Problems So Far
● Docker tightly integrates with the Linux kernel.
○ Hypothesis:
■ Anything that makes uninterruptible calls to the kernel can:
● Break the ECS agent because the container doesn’t respond.
● Break isolation between containers.
■ E.g. mounting NFS
● Docker Hub:
○ Switched to Artifactory
Scaling
DS doing ML
in the Cloud
1. Data Latency
2. To Batch
or Not To Batch
3. What’s in a Model?
Data Latency
How much time do you spend waiting for data?
Use Compression
*This could be a laptop, a shared system, a batch process, etc.
Use Compression - The Components
[ 1.3234543 0.23443434 … ]
[ 1 0 0 1 0 0 … 0 1 0 0
0 1 0 1 ...
… 1 0 1 1 ]
[ 1 0 0 1 0 0 … 0 1 0 0 ]
[ 1.3234543 0.23443434 … ]
{ 100: 0.56, … ,110: 0.65,
… , … , 999: 0.43 }
Use Compression - Python Comparison
Pickle: 60MB
Zlib+Pickle: 129KB
JSON: 15MB
Zlib+JSON: 55KB
Pickle: 3.1KB
Zlib+Pickle: 921B
JSON: 2.8KB
Zlib+JSON: 681B
Pickle: 2.6MB
Zlib+Pickle: 600KB
JSON: 769KB
Zlib+JSON: 139KB
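A sketch of how such a comparison can be reproduced (the value below is illustrative; the numbers above are from the talk’s own data):

import json
import pickle
import zlib

value = {100: 0.56, 110: 0.65, 999: 0.43}  # e.g. a small sparse feature dict

print("Pickle:     ", len(pickle.dumps(value)))
print("Zlib+Pickle:", len(zlib.compress(pickle.dumps(value))))
print("JSON:       ", len(json.dumps(value).encode("utf-8")))
print("Zlib+JSON:  ", len(zlib.compress(json.dumps(value).encode("utf-8"))))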
Observations
● Naïve scheme of JSON + Zlib works well:
● Double vs Float: do you really need to store that much precision?
● For more inspiration look to columnar DBs and how they compress columns
import json
import zlib
...
# compress (encode to bytes for Python 3)
compressed = zlib.compress(json.dumps(value).encode("utf-8"))
# decompress
original = json.loads(zlib.decompress(compressed))
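One way to act on the precision point (a sketch, not from the slides): round values to roughly float-level precision before serializing, which shrinks the JSON considerably.

import json
import zlib

values = [1.3234543219876, 0.2344343412345, 0.6533231209873]

# keep ~7 significant digits (about float precision) instead of full doubles
rounded = [round(v, 7) for v in values]
compressed = zlib.compress(json.dumps(rounded).encode("utf-8"))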
To Batch or Not To Batch:
When is batch inefficient?
Online & Streamed Computation
● Online:
○ Computation occurs synchronously when needed.
● Streamed:
○ Computation is triggered by an event (or events).
Online & Streamed Computation
Very likely you start with a batch system.
● Do you need to recompute:
○ features for all users?
○ predicted results for all users?
● Are you heavily dependent on your ETL running every night?
● Online vs Streamed depends on in-house factors:
○ Number of models
○ How often they change
○ Cadence of output required
○ In-house eng. expertise
○ etc.
We use an online system for recommendations.
Streamed Example
Online/Streaming Thoughts
● Dedicated infrastructure → More room on batch infrastructure
○ Hopefully $$$ savings
○ Hopefully less stressed Data Scientists
● Requires better software engineering practices
○ Code portability/reuse
○ Designing APIs/Tools Data Scientists will use
● Prototyping on AWS Lambda & Kinesis was surprisingly quick
○ Need to compile C libs on an Amazon Linux instance
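A minimal sketch of a Kinesis-triggered scoring function on AWS Lambda (the handler shape follows AWS’s event format; the payload fields and the “model” are assumptions, not Stitch Fix’s code):

import base64
import json

def handler(event, context):
    # each Kinesis record arrives base64-encoded in the Lambda event
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        features = payload["features"]            # hypothetical field
        score = sum(features) / len(features)     # stand-in for a real model
        print(json.dumps({"user_id": payload.get("user_id"), "score": score}))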
What’s in a Model?
Scaling model knowledge
Ever:
● Had someone leave and then nobody understands how they trained their models?
○ Or you didn’t remember yourself?
● Had performance dip in models and you have trouble figuring out why?
○ Or not known what’s changed between model deployments?
● Wanted to compare model performance over time?
● Wanted to train a model in R/Python/Spark and then deploy it to a webserver?
Produce Model Artifacts
● Isn’t that just saving the coefficients/model values?
○ NO!
● Why?
○ How do you deal with organizational drift?
○ Makes it easy to keep an archive and track changes over time
○ Helps a lot with model debugging & diagnosis!
○ Can more easily use in downstream processes
Produce Model Artifacts
● Analogous to software libraries
● Packaging:
○ Zip/Jar file
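A minimal sketch of what such an artifact can bundle beyond the coefficients (file names and metadata fields are illustrative, not Stitch Fix’s actual format):

import json
import pickle
import time
import zipfile

model = {"coefficients": [0.12, -0.4, 1.7]}  # stand-in for a trained model object
metadata = {
    "trained_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    "training_data": "s3://bucket/sold_items/20161031/20161101002256/",
    "library_versions": {"scikit-learn": "0.18"},
    "evaluation": {"auc": 0.81},  # illustrative number
}

with zipfile.ZipFile("model_artifact.zip", "w") as zf:
    zf.writestr("model.pkl", pickle.dumps(model))
    zf.writestr("metadata.json", json.dumps(metadata, indent=2))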
But all the above
seems complex?
We’re building APIs.
Fin; Questions?
@stefkrawczyk
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
stitchfix-cloud
