July 10, 2013
Data center &
Backend buildout
Emil Fredriksson
David Poblador i Garcia
@davidpoblador
• Some numbers about Spotify
• Data centers, Infrastructure
and Capacity
• How Spotify works
• What are we working on now?
Some numbers
•1000M+ playlists
•Over 24M active users
•Over 20M songs (adding 20K every day)
•Over 6M paying subscribers
•Available in 28 markets
Operations
in numbers
•90+ backend systems
•23 SRE engineers
•2 locations: NYC and Stockholm
•Around 15 teams building the Spotify Platform
in Operations and Infrastructure
Data centers,
infrastructure
and capacity
Data centers:
our factories
•Input electricity, servers, and software;
get the Spotify services as output
•We have to scale it up as we grow our
business
•Where the software meets the real world and
customers
•If it does not work, the music stops playing
The capacity
challenge
•Supporting our service for a growing number
of users
•New, more complex features require server
capacity
•Keeping up with very fast software
development
Delivering capacity
•We operate four data centers with more than
5,000 servers and 140 Gbps of Internet
capacity
•In 2008 there were 20 servers
•Renting space in large data center facilities
•Owning and operating hardware and network
What we need in a
data center
•Reliable power supply
•Air conditioning
•Secure space
•Network POPs
•Remote hands
•Shipping and handling
Pods – standard
data center units
•Deploying a new data center takes a long
time!
•We need to be agile and fast to keep up with
the product development
•We solve this by standardizing our data
centers and networking into pods and
pre-provisioning servers
•Target is to keep 30% spare capacity at all
times
Pods – standard
data center units
•44 racks in one pod, about 1,500 servers
•Racks redundantly connected with 10GE
uplink to core switches
•Pod is directly connected to the Internet via
multiple 10GE transit links
•Build it the same way every time
•Include the base infrastructure services
Data center
locations
•You cannot go faster than light
•Distance == Latency
•Current locations: Stockholm, London,
Ashburn (US east coast), San Jose (US west
coast)
•Static content on CDN. Dynamic content
comes from our data centers
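As a quick sanity check on "Distance == Latency": light in fiber covers roughly 200 km per millisecond, so geography alone puts a hard floor on round-trip time. A back-of-the-envelope sketch in Python; the distances are rough great-circle figures, not actual fiber routes:

```python
# Minimum theoretical round-trip times over fiber.
# Light in fiber travels at roughly 2/3 of c, i.e. about 200 km/ms.
KM_PER_MS = 300_000 / 1_000 * 2 / 3  # ~200 km per millisecond

# Rough great-circle distances in km (illustrative, not real fiber paths).
routes = {
    "Stockholm -> London": 1_400,
    "Stockholm -> Ashburn": 6_600,
    "Stockholm -> San Jose": 8_600,
}

for route, km in routes.items():
    rtt_ms = 2 * km / KM_PER_MS  # out and back, best case
    print(f"{route}: at least {rtt_ms:.0f} ms round trip")
```

Real routes add detours and queuing on top of this floor, which is why dynamic content has to be served from a data center near the user.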
So what about the
public clouds?
•Commoditization of the data center is
happening now; few companies will need to
build data centers in the future
•We already use both AWS S3 and EC2; usage
will increase
•Challenges that still remain:
•Inter-node network performance
•Cost (at large scale)
•Flexible hardware configurations
Automated
installation
•Information about servers goes into a database:
MAC address, hardware configuration, location,
networks, hostnames, and state (available, in-use)
•Automatic generation of DNS, DHCP and PXE
records
•Cobbler used as an installation server
•Single command installs multiple servers in
multiple data centers
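The slides don't show the tooling around the database, but the generation step is easy to sketch: take inventory rows and emit DHCP host entries (and, similarly, DNS and PXE records) that Cobbler and the network boot environment consume. A minimal illustration; the inventory schema and all field names here are invented for the example:

```python
# Hypothetical inventory rows -> ISC dhcpd host entries.
# The schema and values are invented for illustration.
inventory = [
    {"hostname": "web101.lon.spotify.net", "mac": "00:16:3e:aa:bb:01",
     "ip": "10.1.2.101", "state": "available"},
    {"hostname": "web102.lon.spotify.net", "mac": "00:16:3e:aa:bb:02",
     "ip": "10.1.2.102", "state": "in-use"},
]

def dhcp_entry(server: dict) -> str:
    """Render one host stanza for dhcpd.conf."""
    return (
        f"host {server['hostname']} {{\n"
        f"  hardware ethernet {server['mac']};\n"
        f"  fixed-address {server['ip']};\n"
        f"}}"
    )

print("\n".join(dhcp_entry(s) for s in inventory))
```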
How Spotify works
[Architecture diagram: clients and www.spotify.com talk to access points, which route requests to backend services (storage, search, playlist, user, web api, browse, ads, social, key, ...). Static content is served from a CDN backed by Amazon S3. Record labels feed content ingestion, indexing, and transcoding. Logs flow into Hadoop for analysis. Social features connect to Facebook.]
DNS à la Spotify
•Distribution of clients
•Error reporting by clients
•Service discovery
•DHT ring configuration
DNS: Service
discovery
•_playlist: service name
•_http: protocol
•3600: ttl
•10: prio
•50: weight
•8081: port
•host1.spotify.net: host
_playlist._http.spotify.net 3600 SRV 10 50 8081 host1.spotify.net.
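Any client can then discover an instance with a plain SRV query. A sketch using the dnspython library; the weighted pick is a simplification of full RFC 2782 selection and assumes non-zero weights:

```python
import random
import dns.resolver  # pip install dnspython

def discover(service: str, proto: str = "http", domain: str = "spotify.net"):
    """Resolve _service._proto.domain SRV records and pick one instance."""
    answers = dns.resolver.resolve(f"_{service}._{proto}.{domain}", "SRV")
    # Only records at the best (numerically lowest) priority are eligible.
    best = min(r.priority for r in answers)
    eligible = [r for r in answers if r.priority == best]
    # Weighted random choice among equal-priority records.
    pick = random.choices(eligible, weights=[r.weight for r in eligible])[0]
    return str(pick.target).rstrip("."), pick.port

# discover("playlist") -> e.g. ("host1.spotify.net", 8081)
```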
DNS: DHT rings
Which service instance should I ask
for a resource?
•Configuration
config._key._http.spotify.net 3600 TXT "slaves=0"
config._key._http.spotify.net 3600 TXT "slaves=2 redundancy=host"
•Mapping ring segment to service instance
tokens.8081.host1.spotify.net 3600 TXT "00112233445566778899aabbccddeeff"
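A client answers the "which instance owns this key?" question by hashing the key onto the same 128-bit ring and walking clockwise to the next token. A minimal sketch; using MD5 as the ring hash is an assumption that merely matches the token width shown above:

```python
import bisect
import hashlib

# Token -> instance, as published via tokens.<port>.<host> TXT records.
# Token values here are illustrative.
ring = {
    "00112233445566778899aabbccddeeff": ("host1.spotify.net", 8081),
    "80112233445566778899aabbccddeeff": ("host2.spotify.net", 8081),
}

def owner(key: str):
    """Return the instance whose ring segment covers the key's hash."""
    position = hashlib.md5(key.encode()).hexdigest()  # 128-bit hex, like tokens
    tokens = sorted(ring)
    idx = bisect.bisect_left(tokens, position) % len(tokens)  # wrap the ring
    return ring[tokens[idx]]

# owner("spotify:user:alice") -> host1 or host2, depending on the hash.
```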
Databases:
Cassandra & Postgres
•Critical data where consistency is important:
PostgreSQL
•Huge, growing fast, eventual consistency OK:
Cassandra
Storage:
Production Storage
•Read only
•Large files
•HTTP based
•nginx + storage proxies + Amazon S3
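In spirit the read path is a read-through cache: serve the file locally if a proxy already holds it, otherwise fetch from S3 and keep a copy. A toy sketch of that logic; the cache path and bucket URL are made up, and the real stack puts nginx in front of the proxies:

```python
import os
import urllib.request

CACHE_DIR = "/var/cache/storage"                     # hypothetical local cache
S3_BASE = "https://example-bucket.s3.amazonaws.com"  # hypothetical bucket

def get_object(key: str) -> bytes:
    """Read-through: local copy first, then Amazon S3; cache on miss."""
    path = os.path.join(CACHE_DIR, key)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()
    with urllib.request.urlopen(f"{S3_BASE}/{key}") as resp:  # fetch from S3
        data = resp.read()
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:                               # keep a local copy
        f.write(data)
    return data
```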
Other types of storage
•Hadoop
•Tokyo Cabinet
•CDB
•BDB
Communication protocols
between services: HTTP
•Originally used by every system
•Simple
•Well known
•Battle tested
•Proper implementations in many languages
•Each service defines its own RESTful protocol
Communication protocols
between services: Hermes
Thin layer on top of ØMQ
Data in messages is serialized as protobuf
•Services define their APIs partly as protobuf
Hermes is embedded in the client-AP (access point) protocol
•The AP doesn’t need to translate protocols; it is just a
message router.
In addition to request/reply, we get pub/sub.
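Hermes itself was never published, but "thin layer on top of ØMQ with protobuf payloads" is easy to illustrate with pyzmq. A single-process request/reply sketch; the frame layout and the hm:// URI are stand-ins, and a real message body would be protobuf-serialized:

```python
import zmq

ctx = zmq.Context.instance()

# Reply side: in production the access point routes frames to a service.
service = ctx.socket(zmq.REP)
service.bind("inproc://playlist")

# Request side: what a client (via the AP) would do.
client = ctx.socket(zmq.REQ)
client.connect("inproc://playlist")

# Multipart frames: verb + resource; a real body would be protobuf bytes.
client.send_multipart([b"GET", b"hm://playlist/user/alice"])
verb, resource = service.recv_multipart()
service.send_multipart([b"200", b"serialized-protobuf-reply"])
print(client.recv_multipart())  # [b'200', b'serialized-protobuf-reply']
```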
Configuration management
•We use Puppet
•Installs Debian packages based on recipes
•Teams developing a system write Puppet
manifests
•Hiera: a simple hierarchical database for
service parameters
•Not the most scalable solution
Working on...
Operational responsibility
delegation
•Each feature team takes responsibility for the
entire stack: from developing a system to
running and operating it.
•Mentality shift: from “it works” to “it scales”
•Full responsibility: capacity planning,
monitoring, incident management.
•Risk of reinventing square wheels. Closing the
feedback loop is key.
Service Discovery
•DNS will stay
•We can’t afford to rewrite every system
•We like to be able to use standard tools (dig)
to troubleshoot
•We aim for hands-free zone file management
•Automated registration and deregistration of
nodes is a goal
Unit of deployment
(containers)
•Runs on top of our OS platform
•Consistency between different environments (testing,
production, public cloud, development boxes...)
•Version N always looks the same
•Testability improves
•Deployments are fast. Gradual rollouts FTW!
•Rollbacks are easy
•Configurations could be part of the bundle
Incident management
process improvements
•Main objective: each type of incident happens only once.
•Streamline internal and external communication
•Teams developing a system lead the process for
incidents connected with it
•SRE leads the process for incidents affecting multiple
pieces that require a higher level of coordination
•Mitigation > Post-mortem > Remediation > Resolution
More stuff being done
•Explaining our challenges to the world
•Open-sourcing many of our tools
•Self-service provisioning of capacity
•Improvements in our continuous integration pipeline
•Network platform
•OS platform
•Automation everywhere
•Recruitment
We are hiring
spoti.fi/ops-jobs
Gràcies! (Thanks!) Q & A
spoti.fi/ops-jobs
Emil Fredriksson / David Poblador i Garcia