IPVS for Docker Containers
Production-level load balancing and request routing without spending a single penny.
What is IPVS
• Stands for IP Virtual Server. Built on top of Netfilter and works in kernel space. No userland copying of network
packets involved.

• It has been in the mainline Linux kernel since 2.4.x. Surprisingly, it’s one of those technologies that has been pumping
your stuff through the world’s networks for more than 15 years, yet almost nobody has ever heard of it.

• Used in many world-scale companies such as Google, Facebook, LinkedIn, Dropbox, GitHub, Alibaba, Yandex
and so on. During my time at Yandex, we used to route millions of requests per second through a few IPVS
boxes without breaking a sweat.

• Tested with fire, water and brass trombones (untranslatable Russian proverb).

• Provides flexible, configurable load-balancing and request routing done in kernel space – it plugs in even before
PCAP layer – making it bloody fucking fast.

• Like all kernel tech, it looks like an incomprehensible magical artifact from outer space, but bear with me – it’s
actually very simple to use.
And why didn’t I hear about it before?
What is IPVS
• Supports TCP, SCTP and UDP (both v4 and v6), works on L4 in kernel space.

• Load balancing using weighted RR, LC, replicated locality-based LC (LBLCR), destination and source-
based hashing, shortest expected delay (SED) and more via plugins:

‐ LBLCR: «sticky» least-connections, picks a backend and sends all jobs to it until it’s full, then picks
the next one. If all backends are full, replicates one with the least jobs and adds it to the service
backend set.

• Supports persistent connections for SSL, FTP and One Packet Mode for connectionless protocols such
as DNS – will trigger scheduling for each packet.

• Supports request forwarding with standard Masquerading (DNAT), Tunneling (IPIP) or Direct Routing (DR).
Supports Netfilter Marks (FWMarks) for virtual service aggregation, e.g. http and https.

• Built-in cluster synchronization daemon – only need to configure one router box, other boxes will keep
their configurations up-to-date automatically.
And what can it do if you say it’s so awesome?
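To make one of the schedulers listed above concrete, here is a minimal Python sketch of shortest expected delay (SED). It assumes the commonly documented rule of picking the backend with the smallest (active connections + 1) / weight. The `Backend` class and `pick_sed` function are hypothetical names used for illustration only; this is not the kernel's implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Backend:
    name: str
    weight: int      # relative capacity; 0 means "do not schedule"
    active: int = 0  # currently active connections


def pick_sed(backends: List[Backend]) -> Optional[Backend]:
    """Pick the backend with the smallest expected delay (active + 1) / weight."""
    candidates = [b for b in backends if b.weight > 0]
    if not candidates:
        return None
    return min(candidates, key=lambda b: (b.active + 1) / b.weight)


if __name__ == "__main__":
    pool = [Backend("a", weight=1, active=0), Backend("b", weight=3, active=1)]
    chosen = pick_sed(pool)
    print(chosen.name if chosen else "no backend available")
```

The point of the sketch is that a heavier backend can win even when it already holds more connections, which is what distinguishes SED from plain least-connections.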
What is IPVS
• DNAT is your ordinary NAT. It will replace each packet’s destination IP with the chosen backend IP and put the packet
back into the networking stack.

‐ Pros: you don’t need to do anything to configure it.

‐ Cons: responses will be martian packets in your network. But you can use a local reverse proxy to work around
this, or point the default gateway on backends to the load balancer’s IP address for IPVS to take care of it.

• DR (or DSR: direct server return) is the fastest forwarding method in the world.

‐ Pros: speed. It will be as fast as possible given your network conditions. In fact, it will be so fast that you will be
able to pump more traffic through your balancers than the interface capacity they have.

‐ Cons: all backends and routers should be in the same L2 segment, need to configure NoARP devices.

• IPIP is a tun/tap alternative to DR. Like IPSEC, but without encryption.

‐ Pros: it can be routed anywhere and won’t trigger martian packets.
‐ Cons: lower MSS. Also, you need to configure NoARP tun devices on all backend boxes, like in DR.
And a little bit more about all these weird acronyms.
What is IPVS
And a little bit more about all these weird acronyms.
• IPIP: encapsulates IP, routable anywhere.
• DNAT: rewrites DST IP, same L4.
• DSR: rewrites DST MAC, same L2.
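The three forwarding methods differ only in what they do to the packet before sending it on. The following Python sketch models that on a toy `Packet` type. All names here (`Packet`, `Backend`, `forward_dnat`, `forward_dr`, `forward_ipip`) are hypothetical, and real packets carry far more state than this.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Packet:
    dst_mac: str
    dst_ip: str
    payload: bytes
    inner: Optional["Packet"] = None  # set only when this packet encapsulates another


@dataclass(frozen=True)
class Backend:
    mac: str
    ip: str


def forward_dnat(pkt: Packet, backend: Backend) -> Packet:
    # DNAT: rewrite the destination IP. L2 headers are rebuilt later by
    # normal routing, which this toy model does not show.
    return replace(pkt, dst_ip=backend.ip)


def forward_dr(pkt: Packet, backend: Backend) -> Packet:
    # DR/DSR: rewrite only the destination MAC; the IP header still carries the
    # virtual IP, which is why backends must sit in the same L2 segment.
    return replace(pkt, dst_mac=backend.mac)


def forward_ipip(pkt: Packet, backend: Backend, next_hop_mac: str) -> Packet:
    # IPIP: wrap the untouched original packet in a new IP packet addressed
    # to the backend, so it can be routed anywhere.
    return Packet(dst_mac=next_hop_mac, dst_ip=backend.ip, payload=b"", inner=pkt)
```

In the DR and IPIP cases the backend still sees the virtual IP as the destination, which is why both need that address configured locally without answering ARP for it.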
What is IPVS
• Works by replacing the job’s destination MAC with the chosen backend MAC and then putting
the packet back into the INPUT queue. Once again: the requirement for this mode is that all
backends should be in the same L2 network segment.

• The response will travel directly to the requesting client, bypassing the router box. This
essentially allows you to cut router box traffic by more than 50%, since responses are usually heavier
than requests.

• This also allows you to load balance more traffic than the load balancer interface capacity
since with a proper setup it will only load balance TCP SYNs.

• It’s a little bit tricky to set up, but if done well is indistinguishable from magic (as well as any
sufficiently advanced technology).

• Fun fact: this is how video streaming services survive and avoid bankruptcy!
And a little bit more about DR since it’s awesome!
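A quick back-of-the-envelope calculation shows where the "more than 50%" figure comes from. The request and response sizes below are made-up illustrative numbers, and `balancer_share` is a hypothetical helper.

```python
def balancer_share(request_bytes: int, response_bytes: int) -> float:
    """Fraction of total traffic that still crosses the balancer in DR mode.

    Only requests pass through the balancer; responses go straight to the client.
    """
    return request_bytes / (request_bytes + response_bytes)


if __name__ == "__main__":
    # Hypothetical workload: 1 KB requests answered with 9 KB responses.
    share = balancer_share(1_000, 9_000)
    print(f"balancer carries {share:.0%} of the bytes")
```

As soon as responses are at least as large as requests, the balancer carries half of the traffic or less.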
I don’t need this
• If you have more than one instance of your application, you need load balancing, otherwise you
won’t be able to distribute requests among those instances.

• If you have more than one application or more than one version deployed at the same time in
production, you need request routing. Typically, only one version is deployed to production at the
same time, and one testing version is waiting in line.

• Imagine what you could do if you could deploy 142 different versions of your application to
production:

‐ A/B testing: route 50% to version A, route 50% to version B.

‐ Intelligent routing: if requester’s cookie is set to «NSA», route to version «HoneyPot».

‐ Experiments: randomly route 1% of users to experimental version «EvenMorePointlessAds»
and stick them to that version.
And why would I load balance and route anything at all?
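The "route 1% of users and stick them to that version" idea can be sketched with deterministic hashing, so that the same user always lands in the same bucket without any stored state. This is a generic illustration of sticky bucketing, not something IPVS does for you; `bucket`, `route`, and the salt value are hypothetical.

```python
import hashlib


def bucket(user_id: str, salt: str = "exp-1") -> int:
    """Map a user to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(user_id: str, experiment_percent: int = 1) -> str:
    """Send roughly experiment_percent% of users to the experimental version."""
    return "experimental" if bucket(user_id) < experiment_percent else "stable"


if __name__ == "__main__":
    for user in ("alice", "bob", "carol"):
        print(user, route(user))
```

Changing the salt reshuffles the buckets, which is a simple way to start a new experiment with a different 1% of users.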
I don’t need this
• In the modern world, instances come and go as they may, sometimes multiple times per second. This rate
will only go up in the future.

• Instances migrate across physical nodes. Instances might be stopped when no load or low load is in
effect. New instances might appear to accommodate the growing request rate and be torn down later.

• Neither nginx (for free) nor haproxy allows for online configuration. Many hacks exist which
essentially regenerate the configuration file on the routing box and then cycle the reverse proxy, but
it’s neither sustainable nor mainstream.

• Neither hipache (written in node.js) nor vulcand (under development) allows for non-RR balancing
methods.

• The fastest userland proxy to date – HAProxy – is ~40% slower than IPVS in DR mode according to
many publicly available benchmark results.
Also, my nginx (haproxy, third option) setup works fine, get off the stage please!
I don’t need this
• Lucky you!

• But running stuff in the cloud is not an option in many circumstances:

‐ Performance and/or reliability critical applications.

‐ Exotic hardware or software requirements. CG or CUDA farms, BSD environments and so on.

‐ Clouds are a commodity in the US and Europe, but in some countries with a thriving IT culture AWS
is more of a fairy tale (e.g. Vietnam).

‐ Security considerations and vendor lock-in.

• Cloud-based load balancers are surprisingly dumb, for example AWS ELB doesn’t really allow you to
configure anything except real servers and some basic stuff like SSL and Health Checks.

• After a certain point, AWS/GCE becomes quite expensive, unless you’re Netflix.
And I run my stuff in the cloud, it takes care of everything – my work is perpetual siesta.
What is IPVS, again
• Make sure you have a properly configured kernel (usually you do):

‐ cat /boot/config-$(uname -r) | grep IP_VS
• Install IPVS CLI to verify that it’s healthy and up:

‐ sudo apt-get install ipvsadm
‐ sudo ipvsadm -l
• The command-line tool allows you to do everything out there with IPVS, but it’s the 21st century and CLIs are only
good to show off your keyboard-fu ;)

• That’s why I’ve coded a dumb but loyal REST API daemon that talks to the kernel and configures IPVS for you. In
Go, because, you know.

• You can use it to add and remove virtual services and backends in runtime and get metrics about existing virtual
services. It’s totally open source and free but might not work (as everything open source and free).
And how do I use it now since it sounds amazing!
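For reference, this is roughly what "doing it with the CLI" looks like. The Python sketch below only builds `ipvsadm` argument lists and prints them; it does not run anything, since the real commands need root. The helper names and the addresses are hypothetical, while the flags (`-A`, `-a`, `-t`, `-r`, `-s`, `-w`, `-g`, `-i`, `-m`) are standard `ipvsadm` options.

```python
from typing import List

# ipvsadm packet-forwarding flags: -g direct routing, -i IPIP tunnel, -m masquerading (NAT)
FORWARDING_FLAGS = {"dr": "-g", "ipip": "-i", "nat": "-m"}


def add_service(vip: str, port: int, scheduler: str = "wlc") -> List[str]:
    """Arguments that create a TCP virtual service with the given scheduler."""
    return ["ipvsadm", "-A", "-t", f"{vip}:{port}", "-s", scheduler]


def add_backend(vip: str, port: int, rip: str, rport: int,
                method: str = "nat", weight: int = 1) -> List[str]:
    """Arguments that attach a real server (backend) to an existing virtual service."""
    return ["ipvsadm", "-a", "-t", f"{vip}:{port}", "-r", f"{rip}:{rport}",
            FORWARDING_FLAGS[method], "-w", str(weight)]


if __name__ == "__main__":
    # Hypothetical addresses; prefix with sudo and run by hand to apply.
    print(" ".join(add_service("10.0.0.1", 80, "sed")))
    print(" ".join(add_backend("10.0.0.1", 80, "10.0.0.11", 8080, "dr", 2)))
```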
GORB
• It’s here: https://github.com/kobolog/gorb

• Based on a native Go netlink library – https://github.com/tehnerd/gnl2go. The library itself talks directly to the
kernel and already supports the majority of IPVS operations. This library is a port of Facebook’s gnlpy –
https://github.com/facebook/gnlpy

• It exposes a very straightforward JSON-based REST API:

‐ PUT or DELETE /service/<vsID> – create or remove a virtual service.

‐ PUT or DELETE /service/<vsID>/<rsID> – add or remove a backend for a virtual service

‐ GET /service/<vsID> – get virtual service metrics (incl. health checks).

‐ GET /service/<vsID>/<rsID> – get backend configuration.

‐ PATCH /service/<vsID>/<rsID> – update backend weight and other parameters without restarting
anything.
Go Routing and Balancing.
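A client for the endpoints listed above could look like the following Python sketch. It only builds the HTTP requests and does not send them. The base URL and the JSON field names (`host`, `port`, `protocol`, `method`, `weight`) are assumptions made for illustration, so check the GORB README for the actual request schema. Port 4672 is the one used in the demo commands.

```python
import json
import urllib.request
from typing import Optional

GORB_URL = "http://127.0.0.1:4672"  # assumed address of a GORB instance


def build_request(method: str, path: str, body: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) a JSON request against the GORB REST API."""
    data = json.dumps(body).encode() if body is not None else None
    return urllib.request.Request(
        f"{GORB_URL}{path}",
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )


# Create a virtual service, then add a backend to it (field names are hypothetical).
create_service = build_request(
    "PUT", "/service/web",
    {"host": "10.0.0.1", "port": 80, "protocol": "tcp", "method": "rr"})
create_backend = build_request(
    "PUT", "/service/web/web-1",
    {"host": "10.0.0.11", "port": 8080, "weight": 100})
remove_backend = build_request("DELETE", "/service/web/web-1")
```

Each request can be sent with `urllib.request.urlopen(request)` once a GORB instance is reachable at `GORB_URL`.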
GORB
• Every time you spin up a new container, it can be automatically registered with
GORB, so that your application’s clients can reach it via the balancer-provided endpoint.

• GORB will automatically do TCP checks (if it’s a TCP service) on your container and inhibit all
traffic coming to it if, for some reason, it disappeared without gracefully de-registering first:
network outage, oom-killer or whatnot.

• HTTP checks are also available, if required – e.g. your app can have a /health endpoint
which will respond 200 only if all downstream dependencies are okay as well.

• Essentially, the only thing you need to do is to start a tiny little daemon on Docker Engine
boxes that will listen for events on Events API and send commands to GORB servers.

• More checks can be added in the future, e.g. Consul, ZooKeeper or etcd!
And why is it cool for Docker Containers.
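A TCP check of the kind described above boils down to "can I open a connection within a timeout?". The Python sketch below shows that idea in isolation. It is not GORB's code (GORB is written in Go), and `tcp_check` is a hypothetical name.

```python
import socket


def tcp_check(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Hypothetical backend address.
    print(tcp_check("127.0.0.1", 80))
```

A balancer would run such a check periodically and stop sending traffic to a backend once it starts failing.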
GORB
kobolog@bulldozer:~$ docker-machine create -d virtualbox gorb
kobolog@bulldozer:~$ docker-machine ssh gorb sudo modprobe ip_vs
kobolog@bulldozer:~$ eval $(docker-machine env gorb)
kobolog@bulldozer:~$ docker build -t gorb src/gorb
kobolog@bulldozer:~$ docker run -d --net=host --privileged gorb -f -i eth1
kobolog@bulldozer:~$ docker build -t gorb-link src/gorb/gorb-docker-link
kobolog@bulldozer:~$ docker run -d --net=host -v /var/run/docker.sock:/var/run/docker.sock gorb-link -r $(docker-machine ip default):4672 -i eth1
kobolog@bulldozer:~$ docker run -d -p 80 nginx    # repeat 4 times
kobolog@bulldozer:~$ curl -i http://$(docker-machine ip gorb):80
And how do I use it? Live demo or GTFO!
A few words about BGP
• Once you have a few router boxes to avoid having a SPOF, you might wonder how clients would find out which one to
use. One option could be DNS RR, where you put them all behind one domain name and let DNS resolvers do their job.

• But there’s a better way – BGP host routes, also known as anycast routing:

‐ BGP is a protocol used by network routers to agree on network topologies.

‐ The idea behind anycast is that each router box announces the same IP address on the network via BGP host
route advertisements.

‐ When a client hits this IP address, it’s automatically routed to one of the GORB boxes based on configured routing
metrics.

‐ You don’t need any special hardware – it’s already built into your routers.

‐ You can do this using one of the BGP packages, such as Bird or Quagga. Notable ecosystem projects in this area:
Project Calico.

‐ Also can be done with IPv6 anycast, but I’ve never seen it implemented.
Black belt in networking is not complete without a few words about BGP
GORB
• No, you cannot blame me, but I’m here to help and answer questions!

• IPVS has been production ready since I was in college.

• GORB is not production ready and is not tested in production, but since it’s just a configuration daemon, it won’t actually
affect your traffic, I hope.

• Here are some nice numbers to think about:

‐ 1 Gbps line rate uses 1% CPU in DR mode.

‐ Can utilize 40G/100G interface.

‐ Costs you €0.

‐ Typical enterprise hardware load-balancer – €25,000.

• Requires generic hardware, software, tools – also easy to learn. Ask your Ops whether they want to read a 1000-page
vendor manual or a 1000-word manpage.

• Also, no SNMP involved. Amazing and unbelievable!
Is it stable? Is it production-ready? Can I blame you if it doesn’t work?
This guy on the stage
• You shouldn’t. In fact, don’t trust anybody on this stage – get your hands dirty and verify
everything yourself. Although last month I turned 30, so I’m trustworthy!

• I’ve been building distributed systems and networking for more than 7 years in companies with
worldwide networks and millions of users, multiple datacenters and other impressive things.

• These distributed systems and networks are still operational and serve billions of user requests
each day.

• I’m the author of the IPVS-based routing and load balancing framework for Yandex’s very own open-source
platform called Cocaine: https://github.com/cocaine.

• As of now, I’m a Senior Infrastructure Software Engineer at Uber in NYC, continuing my self-proclaimed
crusade to make magic infrastructure look less magical and more useful for humans.
Who the hell are you and why should I believe a Russian?
Gracias, Cataluña
• This guy:

‐ Twitter (it’s boring and mostly in Russian): @kobolog

‐ Questions when I’m not around: me@kobology.ru

‐ My name is Andrey Sibiryov.

• IPVS: http://www.linuxvirtualserver.org/software/ipvs.html
• GORB sources: https://github.com/kobolog/gorb
• Bird BGP: http://bird.network.cz
• Stage version of this slide deck: http://bit.ly/1S1A3cT
• Also, it’s right about time to ask your questions!
Some links, more links, some HR and questions!
IPVS for Docker Containers

  • 1. IPVS for Docker Containers Production-level load balancing and request routing without spending a single penny.
  • 2. What is IPVS • Stands for IP Virtual Server. Built on top of Netfilter and works in kernel space. No userland copying of network packets involved. • It’s in the mainline Linux Kernel since 2.4.x. Surprisingly, it’s one of the technologies which successfully pumps your stuff through world’s networks for more than 15 years and almost nobody ever heard of it. • Used in many world-scale companies such as Google, Facebook, LinkedIn, Dropbox, GitHub, Alibaba, Yandex and so on. During my time in Yandex, we used to route millions of requests per second through a few IPVS boxes without a sweat. • Tested with fire, water and brass trombones (untranslatable Russian proverb). • Provides flexible, configurable load-balancing and request routing done in kernel space – it plugs in even before PCAP layer – making it bloody fucking fast. • As all kernel tech, it looks like a incomprehensible magical artifact from outer space, but bear with me – it’s actually very simple to use. And why didn’t I hear about it before?
  • 3. What is IPVS • Supports TCP, SCTP and UDP (both v4 and v6), works on L4 in kernel space. • Load balancing using weighted RR, LC, replicated locality-based LC (LBLCR), destination and source- based hashing, shortest expected delay (SED) and more via plugins: ‐ LBLCR: «sticky» least-connections, picks a backend and sends all jobs to it until it’s full, then picks the next one. If all backends are full, replicates one with the least jobs and adds it to the service backend set. • Supports persistent connections for SSL, FTP and One Packet Mode for connectionless protocols such as DNS – will trigger scheduling for each packet. • Supports request forwarding with standard Masquerading (DNAT), Tunneling (IPIP) or Direct Routing (DR). Supports Netfilter Marks (FWMarks) for virtual service aggregation, e.g. http and https. • Built-in cluster synchronization daemon – only need to configure one router box, other boxes will keep their configurations up-to-date automatically. And what can it do if you say it’s so awesome?
  • 4. What is IPVS • DNAT is your ordinary NAT. It will replace each packet’s destination IP with the chosen backend IP and put the packet back into the networking stack. ‐ Pros: you don’t need to do anything to configure it. ‐ Cons: responses will be a martian packets in your network. But you can use a local reverse proxy to workaround this or point the default gateway on backends to load balancer’s IP address for IPVS to take care of it. • DR (or DSR: direct server response) is the fastest forwarding method in the world. ‐ Pros: speed. It will be as fast as possible given your network conditions. In fact, it will be so fast that you will be able to pump more traffic through your balancers than the interface capacity they have. ‐ Cons: all backends and routers should be in the same L2 segment, need to configure NoARP devices. • IPIP is a tun/tap alternative to DR. Like IPSEC, but without encryption. ‐ Pros: it can be routed anywhere and won’t trigger martian packets. ‐ Cons: lower MSS. Also, you need to configure NoARP tun devices on all backend boxes, like in DR. And a little bit more about all these weird acronyms.
  • 5. What is IPVS And a little bit more about all these weird acronyms. IPIP Encapsulates IP Routable anywhere DNAT Rewrites DST IP Same L4 DSR Rewrites DST MAC Same L2
  • 6. What is IPVS • Works by replacing the job’s destination MAC with the chosen backend MAC and then putting the packet back into the INPUT queue. Once again: the requirement for this mode is that all backends should be in the same L2 network segment. • The response will travel directly to the requesting client, bypassing the router box. This essentially allows to cut router box traffic by more than 50% since responses are usually heavier than requests. • This also allows you to load balance more traffic than the load balancer interface capacity since with a proper setup it will only load balance TCP SYNs. • It’s a little bit tricky to set up, but if done well is indistinguishable from magic (as well as any sufficiently advanced technology). • Fun fact: this is how video streaming services survive and avoid bankruptcy! And a little bit more about DR since it’s awesome!
  • 7. I don’t need this • If you have more than one instance of your application, you need load balancing, otherwise you won’t be able to distribute requests among those instances. • If you have more than one application or more than one version deployed at the same time in production, you need request routing. Typically, only one version is deployed to production at the same time, and one testing version is waiting in line. • Imagine what you can do if you could deploy 142 different versions of your application to production: ‐ A/B testing: route 50% to version A, route 50% to version B. ‐ Intelligent routing: if requester’s cookie is set to «NSA», route to version «HoneyPot». ‐ Experiments: randomly route 1% of users to experimental version «EvenMorePointlessAds» and stick them to that version. And why would I load balance and route anything at all?
  • 8. I don’t need this • In modern world, instances come and go as they may, sometimes multiple times per second. This rate will only go up in the future. • Instances migrate across physical nodes. Instances might be stopped when no load or low load is in effect. New instances might appear to accommodate the growing request rate and be torn down later. • Neither nginx (for free) nor haproxy allows for online configuration. Many hacks exist which essentially regenerate the configuration file on the routing box and then cycle the reverse proxy, but it’s neither sustainable nor mainstream. • Neither hipache (written in node.js) nor vulcand (under development) allows for non-RR balancing methods. • The fastest userland proxy to date – HAProxy – is ~40% slower than IPVS in DR mode according to many publicly available benchmark results. Also, my nginx (haproxy, third option) setup works fine, get off the stage please!
  • 9. I don’t need this • Lucky you! • But running stuff in the cloud is not an option in many circumstances: ‐ Performance and/or reliability critical applications. ‐ Exotic hardware or software requirements. CG or CUDA farms, BSD environments and so on. ‐ Clouds are a commodity in US and Europe, but in some countries with a thriving IT culture AWS is more of a fairy tale (e.g. Vietnam). ‐ Security considerations and vendor lock-in. • Cloud-based load balancers are surprisingly dumb, for example AWS ELB doesn’t really allow you to configure anything except real servers and some basic stuff like SSL and Health Checks. • After a certain point, AWS/GCE becomes quite expensive, unless you’re Netflix. And I run my stuff in the cloud, it takes care of everything – my work is perpetual siesta.
  • 10. What is IPVS, again • Make sure you have a properly configured kernel (usually you do): ‐ cat /boot/config-$(uname -r) | grep IP_VS • Install IPVS CLI to verify that it’s healthy and up: ‐ sudo apt-get install ipvsadm ‐ sudo ipvsadm -l • Command-line tool allows you to do everything out there with IPVS, but it’s the 21st century and CLIs are only good to show off your keyboard-fu ;) • That’s why I’ve coded a dumb but loyal REST API daemon that talks to kernel and configures IPVS for you. In Go, because, you know. • You can use it to add and remove virtual services and backends in runtime and get metrics about existing virtual services. It’s totally open source and free but might not work (as everything open source and free). And how do I use it now since it sounds amazing!
  • 11. GORB • It’s here: https://github.com/kobolog/gorb • Based on native Go netlink library – https://github.com/tehnerd/gnl2go. The library itself talks directly to the Kernel and already supports the majority of IPVS operations. This library is a port of Facebook’s gnl2py – https:// github.com/facebook/gnlpy • It exposes a very straightforward JSON-based REST API: ‐ PUT or DELETE /service/<vsID> – create or remove a virtual service. ‐ PUT or DELETE /service/<vsID>/<rsID> – add or remove a backend for a virtual service ‐ GET /service/<vsID> – get virtual service metrics (incl. health checks). ‐ GET /service/<vsID>/<rsID> – get backend configuration. ‐ PATCH /service/<vsID>/<rsID> – update backend weight and other parameters without restarting anything. Go Routing and Balancing.
  • 12. GORB • Every time you spin up a new container, it can be magically automatically registered with GORB, so that your application’s clients could reach it via the balancer-provided endpoint. • GORB will automatically do TCP checks (if it’s a TCP service) on your container and inhibit all traffic coming to it if, for some reason, it disappeared without gracefully de-registering first: network outage, oom-killer or whatnot. • HTTP checks are also available, if required – e.g. your app can have a /health endpoint which will respond 200 only if all downstream dependencies are okay as well. • Essentially, the only thing you need to do is to start a tiny little daemon on Docker Engine boxes that will listen for events on Events API and send commands to GORB servers. • More checks can be added in the future, e.g. Consul, ZooKeeper or etcd! And why is it cool for Docker Containers.
  • 13. GORB
    kobolog@bulldozer:~$ docker-machine create -d virtualbox gorb
    kobolog@bulldozer:~$ docker-machine ssh gorb sudo modprobe ip_vs
    kobolog@bulldozer:~$ eval $(docker-machine env gorb)
    kobolog@bulldozer:~$ docker build -t gorb src/gorb
    kobolog@bulldozer:~$ docker run -d --net=host --privileged gorb -f -i eth1
    kobolog@bulldozer:~$ docker build -t gorb-link src/gorb/gorb-docker-link
    kobolog@bulldozer:~$ docker run -d --net=host -v /var/run/docker.sock:/var/run/docker.sock gorb-link -r $(docker-machine ip default):4672 -i eth1
    kobolog@bulldozer:~$ docker run -d -p 80 nginx (repeat 4 times)
    kobolog@bulldozer:~$ curl -i http://$(docker-machine ip gorb):80
    And how do I use it? Live demo or GTFO!
  • 14. A few words about BGP
    • Once you have a few router boxes to avoid having a SPOF, you might wonder how clients would find out which one to use. One option is DNS RR, where you put them all behind one domain name and let DNS resolvers do their job.
    • But there’s a better way – BGP host routes, also known as anycast routing:
      ‐ BGP is the protocol network routers use to agree on network topology.
      ‐ The idea behind anycast is that each router box announces the same IP address on the network via BGP host route advertisements.
      ‐ When a client hits this IP address, it’s automatically routed to one of the GORB boxes based on the configured routing metrics.
      ‐ You don’t need any special hardware – it’s already built into your routers.
      ‐ You can do this using one of the BGP packages, such as Bird or Quagga. Notable ecosystem projects in this area: Project Calico.
      ‐ It can also be done with IPv6 anycast, but I’ve never seen it implemented.
    Black belt in networking is not complete without a few words about BGP
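To make the anycast bullets a bit more concrete, here is a toy model of what the upstream router does when several boxes announce the same host route. It is a deliberate simplification and not how any particular router implements BGP best-path selection or ECMP hashing: the lowest metric wins, and ties are split by hashing the flow so that one connection always lands on the same box. The box names, metrics and addresses are made up.

```python
import hashlib


def pick_next_hop(routes, flow):
    """Pick a next hop for one flow among routes to the same anycast address.

    routes: list of (next_hop, metric) announcements for the same host route.
    flow:   tuple identifying a connection, e.g. (src_ip, src_port, dst_ip, dst_port).
    """
    best = min(metric for _, metric in routes)
    candidates = sorted(hop for hop, metric in routes if metric == best)
    # Hash the flow so that every packet of a connection takes the same path.
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return candidates[int.from_bytes(digest[:4], "big") % len(candidates)]


# Three hypothetical GORB boxes announcing the same /32 with different metrics.
routes = [("gorb-a", 10), ("gorb-b", 10), ("gorb-c", 20)]
flow = ("198.51.100.7", 51234, "10.0.0.100", 80)

chosen = pick_next_hop(routes, flow)
# If both preferred boxes withdraw their routes, traffic falls over to gorb-c.
fallback = pick_next_hop([("gorb-c", 20)], flow)
print(chosen, fallback)
```

The failover property is the point: when a box stops announcing the address, the router simply stops choosing it, and clients do not have to change anything.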
  • 15. GORB
    • No, you cannot blame me, but I’m here to help and answer questions!
    • IPVS has been production-ready since I was in college.
    • GORB is not production-ready and has not been tested in production, but since it’s just a configuration daemon, it won’t actually affect your traffic, I hope.
    • Here are some nice numbers to think about:
      ‐ 1 Gbps line rate uses 1% CPU in DR mode.
      ‐ Can utilize a 40G/100G interface.
      ‐ Costs you €0.
      ‐ Typical enterprise hardware load balancer – €25,000.
    • Requires only generic hardware, software and tools, and is easy to learn. Ask your Ops whether they want to read a 1000-page vendor manual or a 1000-word manpage.
    • Also, no SNMP involved. Amazing and unbelievable!
    Is it stable? Is it production-ready? Can I blame you if it doesn’t work?
  • 16. This guy on the stage
    • You shouldn’t. In fact, don’t trust anybody on this stage – get your hands dirty and verify everything yourself. Although last month I turned 30, so I’m trustworthy!
    • I’ve been building distributed systems and networks for more than 7 years at companies with worldwide networks, millions of users, multiple datacenters and other impressive things.
    • These distributed systems and networks are still operational and serve billions of user requests each day.
    • I’m the author of the IPVS-based routing and load balancing framework for Yandex’s very own open-source platform called Cocaine: https://github.com/cocaine.
    • As of now, I’m a Senior Infrastructure Software Engineer at Uber in NYC, continuing my self-proclaimed crusade to make magic infrastructure look less magical and more useful for humans.
    Who the hell are you and why should I believe a Russian?
  • 17. Gracias, Cataluña
    • This guy:
      ‐ Twitter (it’s boring and mostly in Russian): @kobolog
      ‐ Questions when I’m not around: me@kobology.ru
      ‐ My name is Andrey Sibiryov.
    • IPVS: http://www.linuxvirtualserver.org/software/ipvs.html
    • GORB sources: https://github.com/kobolog/gorb
    • Bird BGP: http://bird.network.cz
    • Stage version of this slide deck: http://bit.ly/1S1A3cT
    • Also, it’s right about time to ask your questions!
    Some links, more links, some HR and questions!