1
Transitioning from Ticketing to LBaaS
November 12, 2019 - Amsterdam
William Dauchy
@wdauchy
Network-LB Team Lead
Pierre Cheynier
@pierrecdn
Network-LB Senior SRE
2
Criteo infrastructure in a nutshell
• Focus on cost-efficiency, agility and regaining control
over infrastructure software
• Examples:
• Commoditize hardware, challenge vendors by establishing
direct ODM relationship
• BIOS, BMCs and switch OSes: move to open-source software (OSS)
3
Baremetal, microservices, containers
register
4
Transparent communication
5
Network is not to be outdone
• CLOS Matrix (RFC7938)
• Network Services still
in specialized racks
6
User, 2016: My container is running, and so what?
• Step 1: application is deployed at git push
• Step 2: Come on, I’m missing some things here…
• Public or private VIP?
• DNS entries?
• TLS certificates?
• IPv6?
• Traffic engineering?
• Security policies enforcement?
• Metrics?
• Step 3: Don’t you want to... file a ticket?
8
API extensions for LB
• Our team writes DSL or API
extensions
• Same primitives everywhere
• Linked to Consul registration
• Features, not technologies
(vendor agnostic)
• Use Consul storage (KV/Metas)
► Ownership of app network
config has been transferred!
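As a rough illustration of LB intent attached to a Consul registration, the sketch below builds a service definition. `Name`, `Port`, `Tags` and `Meta` are fields of Consul's service definition; the meta key names (`lb_visibility`, `lb_tls`) and the helper itself are hypothetical and are not the ones used in the talk.

```python
import json


def build_registration(name, port, visibility="private", tls=True):
    """Build a Consul service definition carrying LB intent as metadata."""
    if visibility not in ("private", "public"):
        raise ValueError("visibility must be 'private' or 'public'")
    return {
        "Name": name,
        "Port": port,
        "Tags": ["http"],
        # Hypothetical meta keys: they describe features, not a vendor setting.
        # Consul meta values are strings, hence "true"/"false".
        "Meta": {
            "lb_visibility": visibility,
            "lb_tls": "true" if tls else "false",
        },
    }


payload = build_registration("checkout", 8080)
print(json.dumps(payload, indent=2))
```

The application owner edits this payload alongside the application, which is one way the ownership transfer described above could look in practice.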
13
Health-checks
• Business HC can be non-trivial
• Multiple HCs on a service
• Will every technology support
running them … remotely?
• Multiple sources of checks = lack
of consistency
► Consul as a state reference (>=1.4)
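A minimal sketch of using Consul as the single health reference: instead of each LB running its own checks, a provisioner could derive the backend list from entries shaped like the response of Consul's health endpoint for a service (`Node`, `Service`, `Checks`). The function name and the sample data are illustrative assumptions.

```python
def passing_backends(health_entries):
    """Return (address, port) for instances whose checks are all passing.

    An instance with no checks at all is treated as passing here.
    """
    backends = []
    for entry in health_entries:
        checks = entry.get("Checks", [])
        if all(check.get("Status") == "passing" for check in checks):
            service = entry["Service"]
            # Fall back to the node address when the service has none.
            address = service.get("Address") or entry["Node"]["Address"]
            backends.append((address, service["Port"]))
    return backends


sample = [
    {"Node": {"Address": "10.0.0.1"},
     "Service": {"Address": "", "Port": 8080},
     "Checks": [{"Status": "passing"}, {"Status": "passing"}]},
    {"Node": {"Address": "10.0.0.2"},
     "Service": {"Address": "", "Port": 8080},
     "Checks": [{"Status": "passing"}, {"Status": "critical"}]},
]
print(passing_backends(sample))
```

Because every LB technology reads the same state, a service with several business checks is in or out of rotation everywhere at once.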
14
Here comes a Control-plane
• Abstract internal dependencies
• Resource reservation logic
• External system provisioning
• Give some controls to admins
• Consume events, produce events
• Device Provisioners:
• One contract
• N implementations
(vendor/technologies)
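The "one contract, N implementations" idea can be sketched with an abstract base class. All class and function names below are hypothetical; the bodies only return strings where a real provisioner would render a configuration or call a device API.

```python
from abc import ABC, abstractmethod


class DeviceProvisioner(ABC):
    """The one contract every LB technology must implement."""

    @abstractmethod
    def apply(self, vip, backends):
        """Configure the device so that `vip` is served by `backends`."""


class HAProxyProvisioner(DeviceProvisioner):
    def apply(self, vip, backends):
        # A real implementation would render and reload a configuration.
        return f"haproxy: {vip} -> {', '.join(backends)}"


class VendorApplianceProvisioner(DeviceProvisioner):
    def apply(self, vip, backends):
        # A real implementation would call the vendor's management API.
        return f"appliance: {vip} -> {', '.join(backends)}"


def provision_everywhere(provisioners, vip, backends):
    """The control plane only talks to the contract, never to a vendor."""
    return [p.apply(vip, backends) for p in provisioners]


results = provision_everywhere(
    [HAProxyProvisioner(), VendorApplianceProvisioner()],
    "192.0.2.10:443",
    ["10.0.0.1:8080", "10.0.0.2:8080"],
)
print(results)
```

Adding a technology then means adding one class, without touching the reservation or event-handling logic.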
17
User experience: metrology
• Get network things done, within seconds.
• Also, with great powers (…)
• Diagnostic API endpoints report the cause of errors
• Subscribe to alerts on bad service health
• Retrieve KPIs and logs autonomously
18
User experience: Zero-config LB
• Add a Consul http tag
• Get a free LB service configured with sane
defaults in seconds
• Controlled namespace
• Private visibility
• TLS enforced with redirects
• Self-diagnostics on errors
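A sketch of how a control plane might turn the `http` tag into the defaults listed above. The function, the returned keys and the `lb.example.internal` domain are assumptions for illustration only.

```python
def default_lb_config(service_name, tags):
    """Return LB defaults for a service, or None if it did not opt in."""
    if "http" not in tags:
        return None
    return {
        # Controlled namespace: the name is derived, not chosen by the user.
        "hostname": f"{service_name}.lb.example.internal",
        "visibility": "private",
        "tls": True,
        "redirect_http_to_https": True,
    }


print(default_lb_config("checkout", ["http", "canary"]))
print(default_lb_config("batch-job", ["cron"]))
```

The point of the design is that opting in costs a single tag, and every other setting has a safe default.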
19
User experience: Zero-config LB
• Users are using it extensively
• Remember East-West without LB?
• Rate limit!
• timeout tarpit 2s
21
Agility at scale
• Ability to transparently switch LB vendor
• Safe and progressive rollout
• 50K servers
• Between 50 and 100 LBs per datacenter
• Easy and frequent platform upgrade
• HAProxy deployments, several times a week
• Transparent Linux kernel upgrades
• > 530 deployments in less than two years(!)
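One possible shape for a safe and progressive rollout is to deploy in waves of growing size, stopping at the first failed wave. The talk does not describe the actual strategy, so the wave sizes and the function below are assumptions.

```python
def rollout_waves(hosts, first=1, factor=2):
    """Split hosts into waves of growing size: first, first*factor, ..."""
    if first < 1 or factor < 1:
        raise ValueError("first and factor must be >= 1")
    waves, start, size = [], 0, first
    while start < len(hosts):
        waves.append(hosts[start:start + size])
        start += size
        size *= factor
    return waves


lbs = [f"lb{i:02d}" for i in range(10)]
for number, wave in enumerate(rollout_waves(lbs), start=1):
    print(f"wave {number}: {wave}")
```

A small first wave limits the blast radius of a bad build, while later waves keep the total rollout time short.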
22
Incidents at scale
• Doubling maxconn, what could go wrong?
• Silently failing changes are not welcome
• strict-limits, introduced in v2.1
23
Probing the general service
• End-to-end probing provides fast feedback
• Revealed many regressions and bugs
• ... despite the simplicity of the checks
24
Log everything
• http-response set-log-level silent if
{ rand(100) ge 1 }
• log 127.0.0.1 format rfc5424 local0 info
• log-format-sd
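To make the effect of the `rand(100) ge 1` condition concrete: `rand(100)` draws an integer from 0 to 99, and the response is silenced whenever the draw is 1 or more, so roughly 1% of responses keep their log line. The small simulation below mirrors that decision; the function name and the fixed seed are illustrative.

```python
import random


def keep_log(rng, sample_range=100):
    """Keep the log line only when the draw is 0, i.e. not `ge 1`."""
    return rng.randrange(sample_range) < 1


rng = random.Random(42)
total = 100_000
kept = sum(keep_log(rng) for _ in range(total))
print(f"kept {kept} of {total} responses ({100 * kept / total:.2f}%)")
```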
25
Observability example
• TLS: %sslv - %sslc
27
Load balancing disaggregation
• GeoDNS > L3 > L7
• “Hyper-converged” LB
28
Load balancing disaggregation
• GeoDNS > L3 > L4 > L7
• Support layer=L4
29
Load balancing disaggregation
• GeoDNS > L3 > L4 > L7
• Let L7 LBs become clients of this
• Translates into “L7 LB, on behalf of VIP X,
asks for an L4 LB service”
• L4 > L7 specifics
• DSR: invest for ingress only ($)
• Consistent hashing
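One common way to get consistent hashing between an L4 tier and its L7 backends is a hash ring with virtual nodes. The talk does not say which scheme is used, so the sketch below is a generic ring with hypothetical names, not the authors' implementation.

```python
import bisect
import hashlib


class HashRing:
    """Map a flow key to a backend so that removing one backend
    only remaps the flows that were assigned to it."""

    def __init__(self, nodes, vnodes=100):
        if not nodes:
            raise ValueError("at least one node is required")
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        digest = hashlib.sha256(value.encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def lookup(self, flow):
        # First virtual node clockwise from the flow's hash, wrapping around.
        idx = bisect.bisect(self._keys, self._hash(flow)) % len(self._ring)
        return self._ring[idx][1]


flows = [f"198.51.100.{i}:50000" for i in range(200)]
before = HashRing(["l7-a", "l7-b", "l7-c"])
after = HashRing(["l7-a", "l7-b"])
moved = sum(before.lookup(f) != after.lookup(f) for f in flows)
print(f"{moved} of {len(flows)} flows moved after removing l7-c")
```

Only flows that were on the removed L7 LB move, which keeps existing connections to the surviving LBs on the same backend.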
31
Load balancing disaggregation
• GeoDNS > L3 > L4 > L7
• Let’s redo this with layer=L3
• Translates into “L4 LB, on behalf of VIP X,
asks for a BGP ECMP configuration”
• Identify switch to peer with, configure it
• L3 > L4 specifics
• BGP ECMP => placement constraints
• DDoS
32
Load balancing disaggregation
• GeoDNS > L3 > L4 > L7
33
Load balancing disaggregation
• Moved from Hyper-converged devices
to commodity hardware
• Consequences:
• Cost efficient
• LBs = plain servers in the CLOS matrix!
34
Edge PoPs
• GeoDNS > Edge PoP > L3 > L4 > L7
• Terminate TLS early = huge latency win
• pool-purge-delay (HAProxy >= v2.0)
35
Edge PoPs
• GeoDNS > Edge PoP > L3 > L4 > L7
• Reach large population base, keep costs
under control
• Control plane role
• Point GeoDNS zones to the closest PoP
• Call third-party API
• Configure HAProxy accordingly
37
GeoDNS on steroids
• Adapt routing based on performance metrics,
a.k.a. traffic engineering
• “Someone told me these subnets perform better
using another AS path”
• Build our own GeoDNS dataset?
• Reduce latency by directing user bases to the closest
location with more accuracy
• A decision-making tool?
• Data-center location and topology assessment
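A sketch of the kind of dataset such a GeoDNS could be built from: given latency samples per client subnet and PoP, pick the PoP with the lowest median latency for each subnet. The input format, the PoP names and the use of the median are assumptions, not details from the talk.

```python
from collections import defaultdict
from statistics import median


def best_pop_per_subnet(samples):
    """samples: iterable of (subnet, pop, latency_ms) measurements.

    Returns {subnet: pop} using the lowest median latency per subnet.
    """
    by_pair = defaultdict(list)
    for subnet, pop, latency_ms in samples:
        by_pair[(subnet, pop)].append(latency_ms)

    best = {}
    for (subnet, pop), values in by_pair.items():
        score = median(values)
        if subnet not in best or score < best[subnet][1]:
            best[subnet] = (pop, score)
    return {subnet: pop for subnet, (pop, _) in best.items()}


samples = [
    ("192.0.2.0/24", "ams", 12), ("192.0.2.0/24", "ams", 14),
    ("192.0.2.0/24", "par", 20),
    ("198.51.100.0/24", "ams", 40), ("198.51.100.0/24", "par", 25),
]
print(best_pop_per_subnet(samples))
```

The same table can double as a decision-making tool: subnets with no good PoP point to places where a new location might help.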
39
Feedback
• Loading more objects at runtime?
• Several changes per second!
• Never flush stats (at least some counters)
40
Feedback
• Overall stability
• very high traffic pressure
• Latency sensitive
• Losing requests during reloads is
not acceptable!
• Small memory footprint
• TLS certificates at scale like a
public hosting service
• Share certificates between bind lines
(in v2.1)
41
Feedback
• Overall stability
• Small memory footprint
• Community + enterprise support
• "I'm investigating an issue we have with some fetch we're
adding on a custom tcp_info structure, and I'm wondering if
I might have discovered a broader issue[...]"
• Deployed worldwide two hours after
Thank you!


HAProxyconf 2019 - Criteo - Transitioning from Ticketing to LBaaS
