July 10, 2013
Data center &
Backend buildout
Emil Fredriksson
David Poblador i Garcia
@davidpoblador
• Some numbers about Spotify
• Data centers, Infrastructure
and Capacity
• How Spotify works
• What are we working on now?
Some numbers
•1000M+ playlists
•Over 24M active users
•Over 20M songs (adding 20K every day)
•Over 6M paying subscribers
•Available in 28 markets
Operations
in numbers
•90+ backend systems
•23 SRE engineers
•2 locations: NYC and Stockholm
•Around 15 teams building the Spotify Platform
in Operations and Infrastructure
Data centers,
infrastructure
and capacity
Data centers:
our factories
•Input electricity, servers, and software;
get the Spotify services as output
•We have to scale it up as we grow our
business
•Where the software meets the real world and
customers
•If it does not work, the music stops playing
The capacity
challenge
•Supporting our service for a growing number
of users
•New, more complex features require server
capacity
•Keeping up with very fast software
development
Delivering capacity
•We operate four data centers with more than
5,000 servers and 140 Gbps of Internet
capacity
•In 2008 there were 20 servers
•Renting space in large data center facilities
•Owning and operating hardware and network
What we need in a
data center
•Reliable power supply
•Air conditioning
•Secure space
•Network POPs
•Remote hands
•Shipping and handling
Pods – standard
data center units
•Deploying a new data center takes a long
time!
•We need to be agile and fast to keep up with
the product development
•We solve this by standardizing our data
centers and networking into pods and
pre-provisioning servers
•Target is to keep 30% spare capacity at all
times
Pods – standard
data center units
•44 racks in one pod, about 1,500 servers
•Racks redundantly connected with 10GE
uplink to core switches
•Pod is directly connected to the Internet via
multiple 10GE transit links
•Build it the same way every time
•Include the base infrastructure services
Data center
locations
•You cannot go faster than light
•Distance == Latency
•Current locations: Stockholm, London,
Ashburn (US east coast), San Jose (US west
coast)
•Static content on CDN. Dynamic content
comes from our data centers
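As a quick sanity check on "Distance == Latency": light in fiber covers roughly 200 km per millisecond, so geography alone puts a hard floor on round-trip time. A back-of-the-envelope sketch in Python; the distances are rough great-circle figures, not actual fiber routes:

```python
# Minimum theoretical round-trip times over fiber.
# Light in fiber travels at roughly 2/3 of c, i.e. about 200 km/ms.
KM_PER_MS = 300_000 / 1_000 * 2 / 3  # ~200 km per millisecond

# Rough great-circle distances in km (illustrative, not real fiber paths).
routes = {
    "Stockholm -> London": 1_400,
    "Stockholm -> Ashburn": 6_600,
    "Stockholm -> San Jose": 8_600,
}

for route, km in routes.items():
    rtt_ms = 2 * km / KM_PER_MS  # out and back, best case
    print(f"{route}: at least {rtt_ms:.0f} ms round trip")
```

Real routes add detours and queuing on top of this floor, which is why dynamic content has to be served from a data center near the user.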
So what about the
public clouds?
•Commoditization of the data center is
happening now; few companies will need to
build data centers in the future
•We already use both AWS S3 and EC2; usage
will increase
•Challenges that still remain:
•Inter-node network performance
•Cost (at large scale)
•Flexible hardware configurations
Automated
installation
•Information about servers goes into a database:
MAC address, hardware configuration, location,
networks, hostnames, and state (available, in-use)
•Automatic generation of DNS, DHCP and PXE
records
•Cobbler used as an installation server
•Single command installs multiple servers in
multiple data centers
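The slides don't show the tooling around the database, but the generation step is easy to sketch: take inventory rows and emit DHCP host entries (and, similarly, DNS and PXE records) that Cobbler and the network boot environment consume. A minimal illustration; the inventory schema and all field names here are invented for the example:

```python
# Hypothetical inventory rows -> ISC dhcpd host entries.
# The schema and values are invented for illustration.
inventory = [
    {"hostname": "web101.lon.spotify.net", "mac": "00:16:3e:aa:bb:01",
     "ip": "10.1.2.101", "state": "available"},
    {"hostname": "web102.lon.spotify.net", "mac": "00:16:3e:aa:bb:02",
     "ip": "10.1.2.102", "state": "in-use"},
]

def dhcp_entry(server: dict) -> str:
    """Render one host stanza for dhcpd.conf."""
    return (
        f"host {server['hostname']} {{\n"
        f"  hardware ethernet {server['mac']};\n"
        f"  fixed-address {server['ip']};\n"
        f"}}"
    )

print("\n".join(dhcp_entry(s) for s in inventory))
```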
How Spotify works
[Architecture diagram: clients and www.spotify.com talk to access points, which route requests to backend services (storage, search, playlist, user, web api, browse, ads, social, key, ...). Static content is served from a CDN backed by Amazon S3. Record labels feed content ingestion, indexing, and transcoding. Logs flow into Hadoop for analysis. Social features connect to Facebook.]
DNS à la Spotify
•Distribution of clients
•Error reporting by clients
•Service discovery
•DHT ring configuration
DNS: Service
discovery
•_playlist: service name
•_http: protocol
•3600: ttl
•10: prio
•50: weight
•8081: port
•host1.spotify.net: host
_playlist._http.spotify.net 3600 SRV 10 50 8081 host1.spotify.net.
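Any client can then discover an instance with a plain SRV query. A sketch using the dnspython library; the weighted pick is a simplification of full RFC 2782 selection and assumes non-zero weights:

```python
import random
import dns.resolver  # pip install dnspython

def discover(service: str, proto: str = "http", domain: str = "spotify.net"):
    """Resolve _service._proto.domain SRV records and pick one instance."""
    answers = dns.resolver.resolve(f"_{service}._{proto}.{domain}", "SRV")
    # Only records at the best (numerically lowest) priority are eligible.
    best = min(r.priority for r in answers)
    eligible = [r for r in answers if r.priority == best]
    # Weighted random choice among equal-priority records.
    pick = random.choices(eligible, weights=[r.weight for r in eligible])[0]
    return str(pick.target).rstrip("."), pick.port

# discover("playlist") -> e.g. ("host1.spotify.net", 8081)
```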
DNS: DHT rings
Which service instance should I ask
for a resource?
•Configuration
config._key._http.spotify.net 3600 TXT "slaves=0"
config._key._http.spotify.net 3600 TXT "slaves=2 redundancy=host"
•Mapping ring segment to service instance
tokens.8081.host1.spotify.net 3600 TXT "00112233445566778899aabbccddeeff"
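A client answers the "which instance owns this key?" question by hashing the key onto the same 128-bit ring and walking clockwise to the next token. A minimal sketch; using MD5 as the ring hash is an assumption that merely matches the token width shown above:

```python
import bisect
import hashlib

# Token -> instance, as published via tokens.<port>.<host> TXT records.
# Token values here are illustrative.
ring = {
    "00112233445566778899aabbccddeeff": ("host1.spotify.net", 8081),
    "80112233445566778899aabbccddeeff": ("host2.spotify.net", 8081),
}

def owner(key: str):
    """Return the instance whose ring segment covers the key's hash."""
    position = hashlib.md5(key.encode()).hexdigest()  # 128-bit hex, like tokens
    tokens = sorted(ring)
    idx = bisect.bisect_left(tokens, position) % len(tokens)  # wrap the ring
    return ring[tokens[idx]]

# owner("spotify:user:alice") -> host1 or host2, depending on the hash.
```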
Databases:
Cassandra & Postgres
•Critical data where consistency is important:
PostgreSQL
•Huge, growing fast, eventual consistency OK:
Cassandra
Storage:
Production Storage
•Read only
•Large files
•HTTP based
•nginx + storage proxies + Amazon S3
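In spirit the read path is a read-through cache: serve the file locally if a proxy already holds it, otherwise fetch from S3 and keep a copy. A toy sketch of that logic; the cache path and bucket URL are made up, and the real stack puts nginx in front of the proxies:

```python
import os
import urllib.request

CACHE_DIR = "/var/cache/storage"                     # hypothetical local cache
S3_BASE = "https://example-bucket.s3.amazonaws.com"  # hypothetical bucket

def get_object(key: str) -> bytes:
    """Read-through: local copy first, then Amazon S3; cache on miss."""
    path = os.path.join(CACHE_DIR, key)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()
    with urllib.request.urlopen(f"{S3_BASE}/{key}") as resp:  # fetch from S3
        data = resp.read()
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:                               # keep a local copy
        f.write(data)
    return data
```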
Other types of storage
•Hadoop
•Tokyo Cabinet
•CDB
•BDB
Communication protocols
between services: HTTP
•Originally used by every system
•Simple
•Well known
•Battle tested
•Proper implementations in many languages
•Each service defines its own RESTful protocol
Communication protocols
between services: Hermes
Thin layer on top of ØMQ
Data in messages is serialized as protobuf
•Services define their APIs partly as protobuf
Hermes is embedded in the client-AP (access point) protocol
•The AP doesn’t need to translate protocols; it is just a
message router.
In addition to request/reply, we get pub/sub.
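Hermes itself was never published, but "thin layer on top of ØMQ with protobuf payloads" is easy to illustrate with pyzmq. A single-process request/reply sketch; the frame layout and the hm:// URI are stand-ins, and a real message body would be protobuf-serialized:

```python
import zmq

ctx = zmq.Context.instance()

# Reply side: in production the access point routes frames to a service.
service = ctx.socket(zmq.REP)
service.bind("inproc://playlist")

# Request side: what a client (via the AP) would do.
client = ctx.socket(zmq.REQ)
client.connect("inproc://playlist")

# Multipart frames: verb + resource; a real body would be protobuf bytes.
client.send_multipart([b"GET", b"hm://playlist/user/alice"])
verb, resource = service.recv_multipart()
service.send_multipart([b"200", b"serialized-protobuf-reply"])
print(client.recv_multipart())  # [b'200', b'serialized-protobuf-reply']
```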
Configuration management
•We use Puppet
•Installs Debian packages based on recipes
•Teams developing a system write Puppet
manifests
•Hiera: a simple hierarchical database for
service parameters
•Not the most scalable solution
Working on...
Operational responsibility
delegation
•Each feature team takes responsibility for the
entire stack: from developing a system to
running and operating it.
•Mentality shift: from “it works” to “it scales”
•Full responsibility: capacity planning,
monitoring, incident management.
•Risk of reinventing square wheels. Closing the
feedback loop is key.
Service Discovery
•DNS will stay
•We can’t afford to rewrite every system
•We like to be able to use standard tools (dig)
to troubleshoot
•We aim for hands-free zone file management
•Automated registration and deregistration of
nodes is a goal
Unit of deployment
(containers)
•Runs on top of our OS platform
•Consistency between different environments (testing,
production, public cloud, development boxes...)
•Version N always looks the same
•Testability improves
•Deployments are fast. Gradual rollouts FTW!
•Rollbacks are easy
•Configurations could be part of the bundle
Incident management
process improvements
•Main objective: each type of incident happens only once.
•Streamline internal and external communication
•Teams developing a system lead the process for
incidents connected with it
•SRE leads the process for incidents affecting multiple
pieces that require a higher level of coordination
•Mitigation > Post-mortem > Remediation > Resolution
More stuff being done
•Explaining our challenges to the world
•Open-sourcing many of our tools
•Self-service provisioning of capacity
•Improvements in our continuous integration pipeline
•Network platform
•OS platform
•Automation everywhere
•Recruitment
We are hiring
spoti.fi/ops-jobs
Gràcies! (Thanks!) Q & A
spoti.fi/ops-jobs
Emil Fredriksson / David Poblador i Garcia