The challenges of live events scalability

THE CHALLENGES OF SUPPORTING ONLINE LIVE EVENTS WITH TV PARTICIPATION NUMBERS
A STARTUP PERSPECTIVE

Presentation for B.Sc students from IDC
By Guy Tomer, November 2011

Hello
• I’m Guy Tomer
• Founding and working in start-ups for the last 13 years
• Founder & CTO of attracTV for the last 4 years

• This presentation is about
• Building a scalable system for “a lot” of users
• More specifically, handling the usage peaks of live TV events on the internet
• Even more specifically – how we tackle it as a small start-up

attracTV

Web-based self-service solution and tools for managing viewers’ engagement and interaction on the online screen
Social · Information · Advertisement · eCommerce

Our Use Case – MTV European Music Awards
• One of the biggest online live streams ever
• Can’t expose precise numbers, but:
• 7 digits (> 1,000,000) – number of streams
• 6 digits (> 100,000) – number of concurrent users
• 5 digits (> 10,000) – number of users joining every minute at peak
• In addition:
• International event, 20 sites, viewers from > 150 countries
• 9 languages

What Are The Challenges
1. Scaling for these numbers
2. Handling very steep ramp-up
3. Big data
4. High availability
5. Testing & preparing for such numbers
6. The cost of the above – how to do it and still make money

We’ll discuss mainly 1, 5 & 6

Some Big “Internet Scale” Examples
• Google uses about 900,000 servers
• (MapReduce) Google sorted a ten-petabyte input set in 6 hours and 27 minutes on 8,000 computers
• Facebook serves 1 trillion pages per month
• (2010) 30 billion pieces of content (links, notes, photos, etc.) shared on Facebook per month
• (2010) 2 billion videos watched per day on YouTube
• Akamai, the “CDN to the stars”, has 95,811 servers (Q2 2011) in 1,000 networks across 70 countries

Challenge 1 – Handling The Scale
• We are prepared for 400,000 concurrent viewers
• HTTP polling every N seconds, 10 ≤ N ≤ 30
• This means ~20,000 HTTP R/S (requests per second)
• For comparison:
• Stack Overflow recently reported 800 R/S
• Sify.com (a leading portal in India) reported 3,900 R/S
• Steve Jobs’ death resulted in a record-breaking 10,000 tweets/s (they do have a lot more requests, that’s just to get a feel for the scale)
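
The ~20,000 R/S figure follows from simple arithmetic: if every viewer polls once per interval, the system sees viewers / interval requests per second. A minimal sketch, assuming polling intervals are spread evenly over the 10 to 30 second range so that the average is the midpoint (the function name is illustrative, not taken from attracTV's code):

```python
def requests_per_second(concurrent_viewers: int, poll_interval_s: float) -> float:
    """Average request rate when every viewer polls once per interval."""
    if poll_interval_s <= 0:
        raise ValueError("poll interval must be positive")
    return concurrent_viewers / poll_interval_s


VIEWERS = 400_000                 # planned concurrent viewers (from the slide)
MIN_POLL_S, MAX_POLL_S = 10, 30   # polling interval bounds (from the slide)

# Assumption: intervals are spread evenly, so the average is the midpoint.
avg_interval_s = (MIN_POLL_S + MAX_POLL_S) / 2

typical = requests_per_second(VIEWERS, avg_interval_s)
worst = requests_per_second(VIEWERS, MIN_POLL_S)   # everyone polls fastest
best = requests_per_second(VIEWERS, MAX_POLL_S)    # everyone polls slowest

print(f"typical: {typical:,.0f} R/S")
print(f"range:   {best:,.0f} to {worst:,.0f} R/S")
```

With a 20-second average this gives the 20,000 R/S quoted above; the bounds show why the choice of polling interval matters as much as the number of viewers.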

What Is Scalability
• From Wikipedia: “Scalability is the ability of a system, network, or process, to handle growing amounts of work in a graceful manner or its ability to be enlarged to accommodate that growth.”
Performance ≠ Scalability
The fact that your code runs very fast for X users doesn’t mean your architecture supports 100*X users.

Vertical Scalability (scale up)
• “Get a bigger server”
• “Use faster CPUs”
• Cons
• Can only help so much (with poor scale/$ value)
• A server twice as fast is more than twice as expensive
• Pros
• Easier to manage fewer computers
• Can use virtualization

Horizontal Scalability (scale out)
• “Just add another box” (or another thousand or ...)
• Plan the architecture right first, do micro-optimizations later
• Pros
• Theoretically unlimited
• Works well with the elasticity of cloud services
• Cons
• More complex to manage
• More complex programming models

Challenge #2 – Steep Ramp-up
• Live event: everyone comes at the same time
[Figure: traffic graphs, a steep live-event ramp-up vs. a standard website example (Wikimedia)]

• The fact that a car can drive 250 km/h doesn’t mean it can do 0-100 km/h in 4 seconds (top speed ≠ acceleration)

Challenge #3 – Big Data
• From Wikipedia: “Big data are datasets that grow so large that they become awkward to work with using on-hand database management tools”
• One of the biggest hypes in the industry today
• During this event we had ~10,000,000 records written to our analytics system per hour
• We’re not “Big Data” yet, but it’s coming

Challenge #4 – High Availability
“High availability refers to a system or component that is continuously operational for a desirably long length of time.”
[Figure: High availability in the cloud]

• We need to meet a Service Level of 99.9%
• Backup and failover systems are expensive
• The cloud comes to our aid
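
To get a feel for what a 99.9% service level allows, the percentage can be turned into a downtime budget. A small sketch, assuming the level is measured over calendar time (a 30-day month and a 365-day year); the measurement window is an assumption, since the slide gives only the percentage:

```python
def allowed_downtime_minutes(service_level_pct: float, period_hours: float) -> float:
    """Maximum downtime, in minutes, that still meets the service level."""
    if not 0 <= service_level_pct <= 100:
        raise ValueError("service level must be a percentage between 0 and 100")
    unavailable_fraction = 1 - service_level_pct / 100
    return period_hours * 60 * unavailable_fraction


SERVICE_LEVEL_PCT = 99.9  # from the slide

per_month = allowed_downtime_minutes(SERVICE_LEVEL_PCT, 30 * 24)    # assumed 30-day month
per_year = allowed_downtime_minutes(SERVICE_LEVEL_PCT, 365 * 24)    # assumed 365-day year

print(f"per month: {per_month:.1f} minutes")
print(f"per year:  {per_year / 60:.2f} hours")
```

That is roughly 43 minutes per month, or under 9 hours per year, which leaves very little room for an outage during a live event lasting only a few hours.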

Challenge #5 – Testing
• Simulating hundreds of thousands of concurrent users is not trivial
• Requires tens of powerful servers
• Very difficult to collect the data
• The cloud comes to our aid

Challenge #6 – Handling The Costs Of Such An Event (Hint: Elasticity)
• For production we used ~50 servers, each with 4 cores at 2 GHz and 15 GB RAM (m1.xlarge)
• Some options (rough estimates):
• Buy: ~$3,500 per box = $175,000. Not for us…
• Dedicated servers for a month: ~$1,000 per instance = $50,000
• VPS (virtual private server) for a month: ~$300 per box = $15,000
• Solution: cloud on-demand (Amazon AWS): ~$500 per instance = $25,000 for a month… BUT there is no need to take it for a month. We activate it on demand for 12 hours and it costs $416!
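
The comparison above can be reproduced with a few lines of arithmetic. This is a rough sketch using the per-server estimates from the slide, assuming a 30-day month and that on-demand usage is billed pro rata by the hour; real pricing has more moving parts (bandwidth, storage, minimum billing units):

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month
SERVERS = 50               # production fleet size (from the slide)


def rental_cost(monthly_price: float, servers: int,
                hours: float = HOURS_PER_MONTH) -> float:
    """Cost of renting `servers` machines for `hours`, prorated by the hour.

    Assumption: the monthly estimate can be prorated, which is what
    on-demand billing makes possible.
    """
    return monthly_price * servers * hours / HOURS_PER_MONTH


buy = 3500 * SERVERS                      # one-time purchase, not a rental
dedicated = rental_cost(1000, SERVERS)    # has to be taken for a full month
vps = rental_cost(300, SERVERS)           # has to be taken for a full month
cloud_month = rental_cost(500, SERVERS)
cloud_event = rental_cost(500, SERVERS, hours=12)

for name, cost in [("buy", buy),
                   ("dedicated, 1 month", dedicated),
                   ("VPS, 1 month", vps),
                   ("cloud, 1 month", cloud_month),
                   ("cloud, 12 hours", cloud_event)]:
    print(f"{name:20s} ${cost:>12,.2f}")
```

The cloud is not the cheapest option per month, but it is the only one that can be rented for just the 12 hours of the event, which is where the ~$416 comes from.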

Our #1 Lesson - Think Horizontal!
• Why not vertical?
• We don’t want it to be our business’s bottleneck at any point in time
• We don’t want to buy giant servers
• We wanted a cheap start
• We want elasticity
• We don’t want to buy anything at this point
• How? (deserves a separate lecture)
• Everything is in the architecture
• No state shared between the web/app servers
• No relation between the number of users and the load on the database
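
The slides leave the “how” for a separate lecture, so the following is not attracTV's actual design. It is a minimal sketch of one common way to get both properties: each server answers polls from its own short-lived cache, so the database is queried at most once per TTL per server no matter how many viewers poll. All names (`TtlCache`, `load_event_state`, `handle_poll`, the `"ema-2011"` key) are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class TtlCache:
    """Caches loader results per key for ttl_s seconds."""

    def __init__(self, loader: Callable[[str], str], ttl_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl_s:
            return entry[1]              # still fresh: no database access
        value = self._loader(key)        # expired or missing: one query
        self._entries[key] = (now, value)
        return value


db_queries = 0


def load_event_state(event_id: str) -> str:
    """Stand-in for a real database query (hypothetical)."""
    global db_queries
    db_queries += 1
    return f"state-of-{event_id}"


# A controllable clock keeps the demonstration deterministic.
fake_now = [0.0]
cache = TtlCache(load_event_state, ttl_s=10.0, clock=lambda: fake_now[0])


def handle_poll(event_id: str) -> str:
    """Poll handler: keeps no per-user state, so any server can answer."""
    return cache.get(event_id)


for _ in range(100_000):                 # 100,000 polls within one TTL window
    handle_poll("ema-2011")
queries_after_burst = db_queries

fake_now[0] = 10.0                       # the TTL has elapsed
handle_poll("ema-2011")
queries_after_expiry = db_queries

print(queries_after_burst, queries_after_expiry)
```

In this sketch the database load depends on the number of servers and the TTL, not on the number of viewers, and because the cache is local to each process, adding another box requires no coordination between servers.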

Lesson #2 – KISS
• Keep It Simple, Stupid
• Your system architecture
• Your code
• Your features
• Your business model
[Figure caption: Hug out all the complexity in your system]

• If you don’t, you won’t scale (speaking from personal experience)

Lesson #3 – Load Test Everything, Focus On Real-World Usage Patterns
• We did massive stress testing
• We launched tens of servers just for stress testing
• Automated with JMeter and monitored the same way as production
Why?
• It’s the only way to test your scaling capabilities
• Reading the code and running manual tests are irrelevant here
• Measure the capacity of a single app server
• Example: 1 app server = 5,000 users; we need to support 200,000 users, so we need to prepare at least 40 servers
• Test the specific ramp-up scenario, because handling a load is not the same as reaching it quickly (see Challenge #2)
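
The single-server example above generalizes to a one-line capacity formula. A minimal sketch; the `headroom` parameter is an added assumption (a safety margin for ramp-up spikes and failed instances), not something the slide specifies:

```python
import math


def servers_needed(target_users: int, users_per_server: int,
                   headroom: float = 0.0) -> int:
    """Number of app servers required for target_users concurrent users.

    users_per_server is the capacity measured by load-testing one server.
    headroom is an optional safety margin, e.g. 0.25 for 25% extra capacity.
    """
    if users_per_server <= 0:
        raise ValueError("users_per_server must be positive")
    if headroom < 0:
        raise ValueError("headroom must not be negative")
    return math.ceil(target_users * (1 + headroom) / users_per_server)


minimum = servers_needed(200_000, 5_000)                      # slide's numbers
with_margin = servers_needed(200_000, 5_000, headroom=0.25)   # assumed margin

print(minimum, with_margin)
```

With the slide's numbers this yields the 40 servers quoted as the minimum. The formula is only as good as the measured per-server capacity, which is why the measurement has to come from a load test that mimics real usage.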

Lesson #4 – S*t Happens, Don’t Save On
Real-Time Monitoring and Support
• We had a series of successful big events before this one
• We launched tens of servers just for the stress testing
• And yet we had two problems during the event
Why?
• Murphy is always (eventually) right…
• Because of a feature no one uses (see lesson #2 – KISS) that wasn’t active in the stress tests
• The specific usage of 9 languages caused unexpected load (see lesson #3 – stress real-world scenarios)
Luckily the whole team was in monitoring mode and the issues were quickly handled on the fly.

Lesson #5 – Use The Cloud (startups)
• It’s Elastic, pay on demand
• Flexible when you don’t know your parameters
• Solution for affordable High Availability & Testing
• Focus on development
• I am not getting paid by Amazon – check out other providers as well!

Summary - What To Remember?
• Scalability is the ability of a system to handle a growing amount of work by adding resources
• Think horizontal
• Keep It Simple (Stupid) – everything
• Stress test everything, focus on real-world scenarios
• Real-time monitoring and support
• Cloud is great for start-ups

The End

• Questions? Comments? Consulting? (Preguntas? 问题?)
• Just shy? Think you should be working at attracTV? Contact me:
guy.tomer@gmail.com
www.guytomer.com

Special Thanks (presentations and websites I “borrowed” from)
• Ask Bjørn Hansen (http://groups.google.com/group/scalable)
• High Scalability blog (http://highscalability.com/)
• http://royal.pingdom.com
• Google images
• Entourage (http://www.hbo.com/entourage/index.html)
