A compute infrastructure for Data Scientists
Neelesh Salian
Stitch Fix
About Me
This talk..
● Data @ Stitch Fix
● Data Warehouse
● Compute Infrastructure
● Questions
Data @ Stitch Fix
Stitch Fix
● Personal styling service for men and women, including plus sizes
● Founded in 2011; led by CEO and founder Katrina Lake
● Employs more than 5,800 people nationwide (USA)
● Algorithms + Humans
● Data Science is our culture
Algorithms Team
1. Data Science
○ Vertical teams based on functionality
○ Enablers of Data for the business
○ Cross-functional collaboration
2. Data Platform
○ Building a Platform for Data Science
○ Responsible for Access, Ingestion and Tooling
Read More
https://multithreaded.stitchfix.com/
http://algorithms-tour.stitchfix.com/
Challenges - Usability is Key
● Bringing SQL into Spark
● Training data scientists in Spark basics
● Assisting with tuning recommendations to optimize Spark jobs
● Maintaining the latest releases for features and fixes
Data Warehouse
Data Warehouse Details
● S3 is the source of truth
● Hive Metastore stores table information: schema and metadata
● Python and R based Readers and Writers
● Presto is used for reads, i.e. ad hoc queries
● Spark is used for complex applications, in both testing and production
Compute Infrastructure
Spark
Genie
Genie
● Redirection Layer
● From Netflix
● Open Source - 2.x
● Used for Spark jobs: load balancing, maintaining versions, and maintaining a multi-tenant environment
Genie Data Model - Applications
● Custom Builds of Spark
● Spark 2.x may refer to 2.1.0 or 2.2.0
Applications
Genie Data Model - Commands
● Equivalent of spark-submit for each version
● Tags (aliases)
● Test commands with different configurations
Commands
Genie Data Model - Clusters
● Red/black cluster deployment
● No disruption to data scientists' workflows
Commands
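One way to picture a red/black cutover, assuming jobs find their cluster through a tag rather than a fixed name. The cluster names, the `sched:prod` tag, and both functions are hypothetical; this is a sketch of the idea, not Genie's API.

```python
# Hypothetical cluster registry: cluster name -> set of tags.
clusters = {
    "spark-red": {"sched:prod"},
    "spark-black": set(),
}


def cluster_for(tag: str) -> str:
    """Return the single cluster that currently carries `tag`."""
    matches = [name for name, tags in clusters.items() if tag in tags]
    if len(matches) != 1:
        raise RuntimeError(f"expected one cluster for {tag!r}, found {matches}")
    return matches[0]


def cut_over(tag: str, new: str) -> None:
    """Move `tag` to cluster `new` so that later jobs land there."""
    old = cluster_for(tag)
    if old == new:
        return
    clusters[new].add(tag)
    clusters[old].discard(tag)


before = cluster_for("sched:prod")
cut_over("sched:prod", "spark-black")
after = cluster_for("sched:prod")
print(before, "->", after)
```

Because submissions refer only to the tag, the switch needs no change on the data scientist's side.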
Genie Data Model
(Diagram: Applications, Commands, Clusters, Jobs)
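The data model can be sketched as plain records, assuming a job request names tags and a resolver picks the matching command and cluster. All class names, tag strings, and the `resolve` function are hypothetical illustrations, not Genie's actual schema or client.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Application:
    name: str
    version: str  # e.g. a custom Spark build


@dataclass(frozen=True)
class Command:
    executable: str  # e.g. the spark-submit for one version
    application: Application
    tags: frozenset  # aliases a job can ask for


@dataclass(frozen=True)
class Cluster:
    name: str
    tags: frozenset


def first_match(items, wanted):
    """Return the first item whose tags include every wanted tag."""
    wanted = frozenset(wanted)
    for item in items:
        if wanted <= item.tags:
            return item
    raise LookupError(f"nothing matches tags {sorted(wanted)}")


def resolve(command_tags, cluster_tags, commands, clusters):
    """A job is a command paired with a cluster, both chosen by tag."""
    return first_match(commands, command_tags), first_match(clusters, cluster_tags)


spark_210 = Application("spark", "2.1.0")
spark_220 = Application("spark", "2.2.0")
commands = [
    Command("spark-submit", spark_210, frozenset({"ver:2.1.0"})),
    # The "ver:2.x" alias currently points at the 2.2.0 build.
    Command("spark-submit", spark_220, frozenset({"ver:2.2.0", "ver:2.x"})),
]
clusters = [Cluster("spark-red", frozenset({"sched:prod"}))]

command, cluster = resolve(["ver:2.x"], ["sched:prod"], commands, clusters)
print(command.application.version, cluster.name)
```

Moving the `ver:2.x` alias to another command would change what "Spark 2.x" means without touching any job definition.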
Babylon
Sheriff of Babylon
● Job Submissions
● Genie client library
● Overrides to Spark default values
● Run Commands
1. run_spark
2. run_sql
3. run_with_json
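A minimal sketch of the override idea, assuming user-supplied settings are merged over a set of defaults before submission. The function name `build_submit_args` and the default values are hypothetical and are not taken from the Babylon library; the `--conf key=value` form is the standard spark-submit flag.

```python
# Hypothetical platform defaults; the keys are standard Spark properties.
DEFAULTS = {
    "spark.executor.memory": "4g",
    "spark.executor.cores": "2",
    "spark.sql.shuffle.partitions": "200",
}


def build_submit_args(script, overrides=None):
    """Merge overrides over the defaults and build a spark-submit argv."""
    conf = {**DEFAULTS, **(overrides or {})}
    args = ["spark-submit"]
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    args.append(script)
    return args


args = build_submit_args("job.py", {"spark.executor.memory": "8g"})
print(" ".join(args))
```

Only the overridden key changes; every other setting keeps its default value.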
Additional Tools
● Custom SparkContext
● Metastore Library
● Table Statistics
● Python Query API
● Diagnostic Tool
Spaceman - Centralized Spark History Server
Spaceman
● Archiving Log Data from all clusters
● Single point of access for Spark logs
Spaceman - Features
● Runs for each cluster - easy to add new clusters
● Checks for completed applications at intervals
● Stores History of Apps from all clusters for a month
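The polling and retention behaviour can be sketched as one pass over a central archive, assuming "a month" is approximated as 30 days. The `sync` function and the dict-based archive are hypothetical; this is not Spaceman's implementation.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=30)  # assumption: "a month" taken as 30 days


def sync(archive, completed, now):
    """One polling pass: archive newly completed apps, drop expired ones.

    archive:   dict of app_id -> completion time (the central store)
    completed: list of (app_id, completion time) reported by one cluster
    """
    for app_id, finished_at in completed:
        archive.setdefault(app_id, finished_at)
    cutoff = now - RETENTION
    expired = [app_id for app_id, t in archive.items() if t < cutoff]
    for app_id in expired:
        del archive[app_id]
    return archive


now = datetime(2018, 3, 1)
archive = {"app-old": datetime(2018, 1, 1)}
sync(archive, [("app-new", datetime(2018, 2, 28))], now)
print(sorted(archive))
```

Running the same pass once per cluster at each interval is what makes adding a new cluster cheap: it is one more source of `completed` entries.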
Readers and Writers
Readers and Writers
● Python and R based clients
● Use pyarrow with pandas DataFrames to read and write tabular data
● Backed by Livy - reduces bootstrapping with Genie
● Interfaces with the HiveMetastore
We are Hiring
https://multithreaded.stitchfix.com/careers/
Thank you
