Project Skyfall
                        
                        
Matt Abrams (@abramsm)
Agenda


 A bit about AddThis
 Why did we need Skyfall?
 Architecture
 Operations/Performance
Introduction
Fun with Numbers
AddThis JavaScript loads more than 3 billion times per day

Edge Network (Skyfall) receives around 4B hits per day

Either datacenter can handle 100% load (we test this often)

Currently using around 1K servers (will double next year)
Data Center Porn
Why did we need Skyfall?
We couldn’t find anyone else to do it for us
    •  Previous vendors' log aggregation was delayed by a minimum of 3 hours and could take up to 5 days

Minimize impact on our publishers
    •  Combining log collection with remote services means we only need 1 event instead of n

Support near-real-time applications
Why did we call it Skyfall?
Skyfall Goals and Architecture
Skyfall Goals (Technical)
High availability
Low latency
Use for internal and external logging needs
O(1) reads and writes
Smart clients
Handle server and DC failure gracefully
Zero-downtime deployment and configuration
In-session RPC
Support data filtering at the edge
Why speed and robustness matter
Architecture
[Diagram: Web Events pass through Global Traffic Management to two data centers, DC1 and DC2. Each data center runs several Skyfall nodes that feed groups of Consumers and Services. A Repeater sits between DC1 and DC2.]
1.    Messages are placed on a concurrent non-blocking queue (CNBQ) to minimize latency impact on the producer

2.    Messages are then popped from the CNBQ and placed on a disk-backed queue (DBQ)

3.    The DBQ provides temporary storage in case Kafka is down or backed up

4.    Messages are popped from the DBQ and sent to Kafka, where they are persisted to the file system
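
The four steps above can be sketched roughly as follows. This is a minimal illustration, not Skyfall's implementation: `DiskBackedQueue`, `Pipeline`, and the `send` callable are hypothetical names, a `collections.deque` stands in for the CNBQ, and the broker is reduced to a function that raises `ConnectionError` when it is unavailable.

```python
import collections
import json


class DiskBackedQueue:
    """Hypothetical DBQ: one JSON message per line in an append-only file.

    The file is never truncated here; a real queue would also reclaim space.
    """

    def __init__(self, path):
        self.path = path
        self.read_offset = 0  # byte offset of the next unsent message
        open(self.path, "a").close()  # make sure the file exists

    def put(self, message):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(message) + "\n")

    def peek(self):
        """Return (message, next_offset), or None when nothing is pending."""
        with open(self.path, "rb") as f:
            f.seek(self.read_offset)
            line = f.readline()
            if not line:
                return None
            return json.loads(line), f.tell()

    def commit(self, next_offset):
        # Advance only after the message has been handed off successfully.
        self.read_offset = next_offset


class Pipeline:
    def __init__(self, dbq, send):
        self.memory_queue = collections.deque()  # stand-in for the CNBQ
        self.dbq = dbq
        self.send = send  # assumed to raise ConnectionError when the broker is down

    def produce(self, message):
        # Step 1: the producer only appends in memory and returns immediately.
        self.memory_queue.append(message)

    def spill(self):
        # Step 2: move pending messages from memory to disk.
        while self.memory_queue:
            self.dbq.put(self.memory_queue.popleft())

    def drain(self):
        # Steps 3 and 4: forward from disk; on failure, keep the rest for later.
        sent = 0
        while True:
            item = self.dbq.peek()
            if item is None:
                return sent
            message, next_offset = item
            try:
                self.send(message)
            except ConnectionError:
                return sent
            self.dbq.commit(next_offset)
            sent += 1
```

In this sketch `spill` and `drain` would each run on their own background thread, so the producer never waits on disk or network I/O.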
Kafka
Kafka treats persistence as a first-class citizen

Focus is on high throughput rather than lots of bells and whistles

State about what has been consumed is maintained in the client rather than the server

Kafka is explicitly distributed

Supports O(1) reads and writes

Pull rather than push

           http://incubator.apache.org/kafka/design.html
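
To make the "state in the client" and "pull rather than push" points concrete, here is a toy model. It is not the Kafka client API: `AppendOnlyLog` and `PullConsumer` are hypothetical names, and the log is just an in-memory list. The point is that the log keeps no per-consumer bookkeeping; each consumer remembers its own offset and asks for the next batch when it is ready.

```python
class AppendOnlyLog:
    """Toy stand-in for one broker-side log; it stores no consumer state."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # offset of the new message

    def read(self, offset, max_messages):
        # Reads are plain slices by offset; nothing is removed or marked.
        return self._messages[offset:offset + max_messages]


class PullConsumer:
    """Each consumer tracks its own position in the log."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_messages=10):
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch
```

Because the position lives in the consumer, two consumers can read the same log at different speeds, and a consumer can rewind simply by resetting its own `offset`.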
Circuit Breaker for Remote Services
The pattern is used to detect failures and encapsulates the logic for preventing a failure from recurring constantly [1]

If a service instance throws an error, times out, or responds with a failure message, an error event is marked

If the error rate threshold is exceeded, that service instance is removed from the pool of available services

Before re-adding a service to the pool, a test request is made and validated

Internal service failures should not be reflected in the response to the message originator

          [1] - http://en.wikipedia.org/wiki/Circuit_breaker_design_pattern
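
A minimal sketch of the behaviour described above, under stated assumptions: `ServicePool` and its parameters are hypothetical, service instances are modelled as plain callables, the error rate is computed over a fixed window of recent calls, and timeouts are not modelled separately (a timeout would surface as an exception).

```python
import collections


class ServicePool:
    """Hypothetical pool that ejects instances whose recent error rate is too high."""

    def __init__(self, instances, error_threshold=0.5, window=10):
        self.available = list(instances)
        self.ejected = []
        self.error_threshold = error_threshold
        self.window = window
        # One bounded history of True/False outcomes per instance.
        self.history = {i: collections.deque(maxlen=window) for i in instances}

    def record(self, instance, ok):
        outcomes = self.history[instance]
        outcomes.append(ok)
        if len(outcomes) < self.window or instance not in self.available:
            return
        error_rate = outcomes.count(False) / len(outcomes)
        if error_rate > self.error_threshold:
            self.available.remove(instance)
            self.ejected.append(instance)

    def call(self, instance, request, default=None):
        # Failures are recorded but never propagated to the message originator.
        try:
            result = instance(request)
        except Exception:
            self.record(instance, ok=False)
            return default
        self.record(instance, ok=True)
        return result

    def probe(self, instance, test_request, is_valid):
        # A test request must succeed and validate before the instance returns.
        try:
            healthy = is_valid(instance(test_request))
        except Exception:
            healthy = False
        if healthy and instance in self.ejected:
            self.ejected.remove(instance)
            self.history[instance].clear()
            self.available.append(instance)
        return healthy
```

Returning `default` from `call` is one way to keep internal failures out of the response; the caller still gets an answer while the failing instance is being ejected.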
What does a call to our endpoint look like?

 "GET /live/t00/250lo.gif&foo=bar" 200 37 "http://s7.addthis.com/static/r07/sh103.html" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"

 Topic:              live
 Version:            t00
 Resource:           250lo.gif
 URL Parameters:     foo=bar
 Status Code:        200
 Bytes Transferred:  37
 CDN Resource:       http://s7.addthis.com/static/r07/sh103.html
 User Agent:         Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)

 The endpoint also receives header and cookie information not shown here.
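
As an illustration of how the labelled fields line up with the example call, the sketch below splits such a line into its parts. The regular expression and the `parse_call` name are assumptions made for this example, not Skyfall's parser. The slide's example separates the resource from its parameters with `&`, so the sketch accepts either `?` or `&` as the separator.

```python
import re

# Assumed layout: "METHOD /topic/version/resource&params" status bytes "referrer" "user agent"
CALL = re.compile(
    r'"(?P<method>\S+) /(?P<topic>[^/]+)/(?P<version>[^/]+)/(?P<rest>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+) "(?P<cdn_resource>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_call(line):
    """Return a dict of fields, or None if the line does not match the layout."""
    match = CALL.fullmatch(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    parts = re.split(r"[?&]", fields.pop("rest"))
    fields["resource"] = parts[0]
    fields["params"] = dict(p.split("=", 1) for p in parts[1:] if "=" in p)
    fields["status"] = int(fields["status"])
    fields["bytes"] = int(fields["bytes"])
    return fields


example = (
    '"GET /live/t00/250lo.gif&foo=bar" 200 37 '
    '"http://s7.addthis.com/static/r07/sh103.html" '
    '"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"'
)
parsed = parse_call(example)
```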
Zero Downtime Deployment and Configuration
[Diagram: two server groups, Group 1 and Group 2. Each group shows the same sequence of states S1 through S5, with the labels 4, 8, and 16 along the sequence.]
Endpoint Configuration




Each endpoint maps to a ‘topic’

Header elements may be extracted from the HTTP request

Parameters may be mapped to new key names

Variables may be extracted from the URL path
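
The four configuration capabilities above might look something like this in practice. Everything here is hypothetical: the slides do not show Skyfall's configuration format, so the dictionary keys, the `build_event` function, and the `foo` to `campaign` rename are invented purely for illustration.

```python
import re

# Hypothetical endpoint configuration, not Skyfall's actual format.
EXAMPLE_CONFIG = {
    "topic": "live",                                   # endpoint maps to a topic
    "path_pattern": r"^/live/(?P<version>[^/]+)/(?P<resource>[^/]+)$",
    "headers": ["User-Agent", "Referer"],              # headers to extract
    "param_renames": {"foo": "campaign"},              # old key -> new key
}


def build_event(config, path, params, headers):
    """Apply one endpoint configuration to a request; None if the path does not match."""
    match = re.match(config["path_pattern"], path)
    if match is None:
        return None
    event = {"topic": config["topic"]}
    event.update(match.groupdict())                    # variables from the URL path
    for name in config["headers"]:
        if name in headers:
            event[name] = headers[name]
    for key, value in params.items():
        event[config["param_renames"].get(key, key)] = value
    return event
```

Keeping these rules in data rather than code is one way such mappings could be changed without redeploying the edge servers.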
Data Center Repeater

DC Repeater nodes automatically negotiate peering relationships with nodes in the other data center

If a peer node becomes unreachable, the local node will select a new peer

These are special consumers of the Kafka log data created by the local node

[Diagram: repeater nodes N1, N2, and N3 paired with peer nodes N1 and N2 in the other data center]
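
The peer failover rule can be sketched as below. The slides do not describe the actual negotiation protocol, so this is only an assumption-laden illustration: `RepeaterNode` is a hypothetical name, reachability is supplied as a callable, and "negotiation" is reduced to picking the first reachable remote node.

```python
class RepeaterNode:
    """Hypothetical local repeater that keeps one peer in the other data center."""

    def __init__(self, name, remote_nodes, is_reachable):
        self.name = name
        self.remote_nodes = list(remote_nodes)
        self.is_reachable = is_reachable  # callable: remote node name -> bool
        self.peer = None

    def current_peer(self):
        # Keep the existing peer while it is reachable; otherwise select a new one.
        if self.peer is None or not self.is_reachable(self.peer):
            self.peer = next(
                (n for n in self.remote_nodes if self.is_reachable(n)), None
            )
        return self.peer
```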
Skyfall Operations
Requests per second (VA Data Center)
TCP - When do you say goodbye?




      http://upload.wikimedia.org/wikipedia/commons/a/a2/Tcp_state_diagram_fixed.svg
Connection Tracking – what you need to know
Connection information is maintained in memory

The message “ip_conntrack: table full, dropping packet” is BAD

Chrome doesn’t close the connection on FIN

This means that the connection info remains open until it times out, drastically increasing the number of connections your server needs to track

You need some mechanism for timing out the connection in a reasonable time period
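
A back-of-the-envelope calculation shows why the timeout matters so much. In steady state, the number of tracked entries is roughly the arrival rate multiplied by how long each entry lives (Little's law). The sketch below uses the 4B hits per day figure from earlier in the deck as a fleet-wide average and assumes one new connection per hit; the two lifetimes are illustrative values, not measured settings, and real traffic peaks well above the daily average.

```python
SECONDS_PER_DAY = 86_400


def average_rate(hits_per_day):
    """Average new connections per second, assuming one connection per hit."""
    return hits_per_day / SECONDS_PER_DAY


def tracked_entries(new_connections_per_second, entry_lifetime_seconds):
    """Steady-state number of tracked entries: rate multiplied by lifetime."""
    return new_connections_per_second * entry_lifetime_seconds


rate = average_rate(4_000_000_000)
for lifetime in (30, 600):  # illustrative lifetimes in seconds
    entries = tracked_entries(rate, lifetime)
    print(f"{lifetime:>4} s lifetime -> about {entries:,.0f} tracked entries fleet-wide")
```

The relationship is linear, so an entry that lingers twenty times longer costs twenty times the table space.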
HA Proxy
We use a simple round-robin load balancing algorithm with a liveness check

Default connection timeouts are way too high. Reasonable values are used to prevent excessive connection tracking

“http-close” and “http-server-close” are enabled to ensure low latency for clients and fast session reuse for the server

HA Proxy is our solution of choice for our LB needs. We prefer software solutions on commodity hardware over expensive custom LB appliances

They could use a new logo
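
For readers unfamiliar with the balancing strategy mentioned on this slide, round-robin with a liveness check can be sketched in a few lines. This models the idea only; it is not HA Proxy's code or configuration, and `RoundRobinBalancer` and `is_alive` are hypothetical names.

```python
import itertools


class RoundRobinBalancer:
    """Rotate through servers in order, skipping any that fail the liveness check."""

    def __init__(self, servers, is_alive):
        self.servers = list(servers)
        self.is_alive = is_alive  # callable: server -> bool
        self._cycle = itertools.cycle(self.servers)

    def pick(self):
        # Try each server at most once per call so a dead pool cannot loop forever.
        for _ in range(len(self.servers)):
            server = next(self._cycle)
            if self.is_alive(server):
                return server
        return None
```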
