Deep Dive in Container Service Discovery
Laurent Bernaille, @lbernail
Staff Engineer, Datadog
Agenda
Service Discovery
Load-balancing
L7 Load-balancing
Service Discovery
“Service discovery is the automatic detection of devices and
services offered by these devices on a computer network”
https://en.wikipedia.org/wiki/Service_discovery
Why has this topic become so important?
Service Discovery
Service discovery in Kubernetes
apiVersion: apps/v1
kind: Deployment
metadata:
name: echodeploy
labels:
app: echo
spec:
replicas: 3
selector:
matchLabels:
app: echo
template:
metadata:
labels:
app: echo
spec:
containers:
- name: echopod
image: lbernail/echo:0.5
apiVersion: v1
kind: Service
metadata:
name: echo
labels:
app: echo
spec:
type: ClusterIP
selector:
app: echo
ports:
- name: http
protocol: TCP
port: 80
targetPort: 5000
Creating a deployment and a service
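Both objects are created by applying the manifests with kubectl (filenames here are illustrative):

kubectl apply -f echo-deployment.yaml
kubectl apply -f echo-service.yaml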
Created Kubernetes objects
Deployment ReplicaSet Pod 1
label: app=echo
Pod 2
label: app=echo
Pod 3
label: app=echo
Service
Selector: app=echo
kubectl get all
NAME AGE
deploy/echodeploy 16s
NAME AGE
rs/echodeploy-75dddcf5f6 16s
NAME READY
po/echodeploy-75dddcf5f6-jtjts 1/1
po/echodeploy-75dddcf5f6-r7nmk 1/1
po/echodeploy-75dddcf5f6-zvqhv 1/1
NAME TYPE CLUSTER-IP
svc/echo ClusterIP 10.200.246.139
The endpoint object
Deployment ReplicaSet Pod 1
label: app=echo
Pod 2
label: app=echo
Pod 3
label: app=echo
kubectl describe endpoints echo
Name: echo
Namespace: datadog
Labels: app=echo
Annotations: <none>
Subsets:
Addresses: 10.150.4.10,10.150.6.16,10.150.7.10
NotReadyAddresses: <none>
Ports:
Name Port Protocol
---- ---- --------
http 5000 TCP
Endpoints
Addresses:
10.150.4.10
10.150.6.16
10.150.7.10
Service
Selector: app=echo
Pod readiness
readinessProbe:
httpGet:
path: /ready
port: 5000
periodSeconds: 2
successThreshold: 2
failureThreshold: 2
● A pod can be started but not ready to serve requests
○ Initialization
○ Connection to backends
● Kubernetes provides an abstraction for this: Readiness Probes
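For context, here is roughly where the probe above sits in the deployment's pod template; the /ready path matches the endpoint toggled in the demo that follows:

spec:
  containers:
  - name: echopod
    image: lbernail/echo:0.5
    readinessProbe:
      httpGet:
        path: /ready
        port: 5000
      periodSeconds: 2
      successThreshold: 2
      failureThreshold: 2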
Demo
kubectl run -it test --image appropriate/curl ash
# while true ; do curl 10.200.246.139 ; sleep 1 ; done
Container: 10.150.7.10 | Source: 10.150.6.17 | Version: v2
Container: 10.150.6.16 | Source: 10.150.6.17 | Version: v2
Container: 10.150.4.10 | Source: 10.150.6.17 | Version: v2
Container: 10.150.7.10 | Source: 10.150.6.17 | Version: v2
Container: 10.150.6.16 | Source: 10.150.6.17 | Version: v2
Container: 10.150.4.10 | Source: 10.150.6.17 | Version: v2
Container: 10.150.7.10 | Source: 10.150.6.17 | Version: v2
Container: 10.150.6.16 | Source: 10.150.6.17 | Version: v2
Container: 10.150.4.10 | Source: 10.150.6.17 | Version: v2
Demo
kubectl exec -it <curl pod> sh
# curl <podip>:5000/ready
Ready : True
# curl <podip>:5000/toggleReady
# curl <podip>:5000/ready
Ready : False
kubectl get pods
NAME READY
echodeploy-75dddcf5f6-jtjts 1/1
echodeploy-75dddcf5f6-r7nmk 1/1
echodeploy-75dddcf5f6-zvqhv 0/1
kubectl describe endpoints echo
Addresses: 10.150.4.10,10.150.6.16
kubectl describe pod echodeploy-75dddcf5f6-zvqhv
Warning Unhealthy (Readiness probe failed)
How does this all work?
API Server
Node
kubelet pod
HC
Status updates
Node
kubelet pod
HC
ETCD
pods
How does this all work?
API Server
Node
kubelet pod
HC
Status updates
Controller Manager
Watch
- pods
- services
endpoint
controller
Node
kubelet pod
HC
Sync endpoints:
- list pods matching selector
- add IP to endpoints
ETCD
pods
services
endpoints
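One way to see the endpoint controller at work is to watch the endpoints object while pods come and go (output abridged and illustrative):

kubectl get endpoints echo --watch
NAME   ENDPOINTS
echo   10.150.4.10:5000,10.150.6.16:5000,10.150.7.10:5000
echo   10.150.4.10:5000,10.150.6.16:5000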
Load-Balancing
DNS Round Robin
● Service has a DNS record with one entry per endpoint
● Many clients will only use the first IP
● Many clients will perform resolution only at startup
Virtual IP + IP based load-balancing
● Service has a single VIP
● Traffic sent to this VIP is load-balanced to endpoints IPs
=> Requires a “process” to perform and configure this load-balancing
Load-balancing solutions
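To illustrate the DNS option in Kubernetes terms: with a headless service (clusterIP: None; echo-headless is a hypothetical variant of the echo service), the cluster DNS returns one A record per ready endpoint instead of a single VIP:

# from a pod, assuming a headless service named echo-headless
nslookup echo-headless.datadog.svc.cluster.local
Name:    echo-headless.datadog.svc.cluster.local
Address: 10.150.4.10
Address: 10.150.6.16
Address: 10.150.7.10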
Load-balancing in Kubernetes
API Server
Node
kube-proxy proxier
Controller Manager
Watch
- pods
- services
endpoint
controller
Sync endpoints:
- list pods matching selector
- add IP to endpoints
ETCD
pods
services
endpoints
Watch
- services
- endpoints
Load-balancing in Kubernetes
API Server
Node
kube-proxy proxier
Controller Manager
endpoint
controller
ETCD
pods
services
endpoints
client
Node B: pod 1
Node C: pod 2
● userspace
Original implementation
Userland TCP/UDP proxy
● iptables
Default since Kubernetes 1.2
Use iptables to load-balance traffic
Faster than userspace
● ipvs
Use Kernel load-balancing
Still relies on iptables for some NAT rules
Faster than iptables, scales better with large number of services/endpoints
Kube-proxy modes
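The mode is selected when starting kube-proxy, either with a flag or in its configuration file (a sketch):

kube-proxy --proxy-mode=ipvs
# or, in the KubeProxyConfiguration passed via --config:
# apiVersion: kubeproxy.config.k8s.io/v1alpha1
# kind: KubeProxyConfiguration
# mode: "ipvs"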
IPTABLES
Load-Balancing
API Server
Node A
kube-proxy iptables
iptables overview
client
Node B
Node C
pod 1
pod 2
Outgoing traffic
1. Client to Service IP
2. DNAT: Client to Pod1 IP
Reverse path
1. Pod1 IP to Client
2. Reverse NAT: Service IP to client
proxy-mode = iptables
PREROUTING / OUTPUT
any / any => KUBE-SERVICES
All traffic is processed by kube chains
proxy-mode = iptables
KUBE-SERVICES
any / VIP:PORT => KUBE-SVC-XXX
Global Service chain
Identify service and jump to appropriate service chain
PREROUTING / OUTPUT
any / any => KUBE-SERVICES
proxy-mode = iptables
KUBE-SERVICES
any / VIP:PORT => KUBE-SVC-XXX
KUBE-SVC-XXX
any / any proba 33% => KUBE-SEP-AAA
any / any proba 50% => KUBE-SEP-BBB
any / any => KUBE-SEP-CCC
PREROUTING / OUTPUT
any / any => KUBE-SERVICES
Service chain (one per service)
Use the statistic iptables module (probability of a rule being applied)
Rules are evaluated sequentially (hence the 33%, 50%, 100%)
proxy-mode = iptables
KUBE-SERVICES
any / VIP:PORT => KUBE-SVC-XXX
KUBE-SVC-XXX
any / any proba 33% => KUBE-SEP-AAA
any / any proba 50% => KUBE-SEP-BBB
any / any => KUBE-SEP-CCC
PREROUTING / OUTPUT
any / any => KUBE-SERVICES
KUBE-SEP-AAA
endpoint IP / any => KUBE-MARK-MASQ
any / any => DNAT endpoint IP:Port
Endpoint Chain
Mark hairpin traffic (client = target) for SNAT
DNAT to the endpoint
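These chains can be inspected directly on any node; the XXX/AAA suffixes stand in for the per-service and per-endpoint hashes kube-proxy generates:

sudo iptables -t nat -L KUBE-SERVICES -n | grep <service VIP>
sudo iptables -t nat -L KUBE-SVC-XXX -n   # one chain per service
sudo iptables -t nat -L KUBE-SEP-AAA -n   # one chain per endpoint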
Edge case: Hairpin traffic
API Server
Node A
kube-proxy iptables
pod 1
Node B
Node C
pod 2
pod 3
Client can also be a destination
After DNAT:
Src IP= Pod1, Dst IP= Pod1
No reverse NAT possible
=> SNAT on host for this traffic
1. Pod1 IP => SVC IP
2. SNAT: HostIP => SVC IP
3. DNAT: HostIP => Pod1 IP
Reverse path
1. Pod1 IP => Host IP
2. Reverse NAT: SVC IP => Pod1IP
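The mark-then-masquerade mechanics can be verified on a node: KUBE-MARK-MASQ only sets a mark (0x4000), and KUBE-POSTROUTING masquerades marked packets (illustrative output):

sudo iptables -t nat -L KUBE-MARK-MASQ -n
MARK       all  --  0.0.0.0/0  0.0.0.0/0  MARK or 0x4000
sudo iptables -t nat -L KUBE-POSTROUTING -n
MASQUERADE all  --  0.0.0.0/0  0.0.0.0/0  mark match 0x4000/0x4000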
Persistency
spec:
type: ClusterIP
sessionAffinity: ClientIP
sessionAffinityConfig:
clientIP:
timeoutSeconds: 600
KUBE-SEP-AAA
endpoint IP / any => KUBE-MARK-MASQ
any / any => DNAT endpoint IP:Port
recent : set rsource KUBE-SEP-AAA
Use “recent” module
Add Source IP to set named KUBE-SEP-AAA
Persistency
KUBE-SEP-AAA
endpoint IP / any => KUBE-MARK-MASQ
any / any => DNAT endpoint IP:Port
recent : set rsource KUBE-SEP-AAA
Use recent module
Add Source IP to set named KUBE-SEP-AAA
KUBE-SVC-XXX
any / any recent: rcheck set KUBE-SEP-AAA => KUBE-SEP-AAA
any / any recent: rcheck set KUBE-SEP-BBB => KUBE-SEP-BBB
any / any recent: rcheck set KUBE-SEP-CCC => KUBE-SEP-CCC
Load-balancing rules
Use recent module
If Source IP is in set named KUBE-SEP-AAA,
jump to KUBE-SEP-AAA
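In iptables-save form, the generated rules look roughly like this (chain names abridged; the 600s matches the timeoutSeconds above):

sudo iptables-save -t nat | grep recent
-A KUBE-SVC-XXX -m recent --rcheck --seconds 600 --reap --name KUBE-SEP-AAA -j KUBE-SEP-AAA
-A KUBE-SEP-AAA -m recent --set --name KUBE-SEP-AAA -j DNAT --to-destination 10.150.4.10:5000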
Demos
kubectl exec echodeploy-xxxx -it sh
# hostname -i
10.1.161.2
# while true ; do wget -q -O - 10.200.20.164 ; sleep 1 ; done
Container: 10.1.162.5 | Source: 10.1.161.2 | Version: Unknown
Container: 10.1.161.2 | Source: 10.1.161.1 | Version: Unknown
Container: 10.1.163.2 | Source: 10.1.161.2 | Version: Unknown
Chains
Hairpin traffic
Persistency
iptables proxy gotchas
Rules synchronization
Every sync flushes and reloads all Kubernetes chains
Performance
Design
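A quick way to gauge the scale of the problem on a node (rule count grows with services × endpoints, and every sync rewrites them all):

sudo iptables-save -t nat | wc -l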
IPVS
Load-Balancing
proxy-mode = ipvs
● L4 load-balancer built into the Linux kernel
● Many load-balancing algorithms
● Very fast
● Still relies on iptables for some use cases (SNAT in particular)
IPVS Demo
$ sudo ipvsadm --list --numeric --tcp-service 10.200.200.68:80
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 10.200.200.68:http rr
-> 10.1.242.2:5000 Masq 1 0 0
-> 10.1.243.2:5000 Masq 1 0 0
Virtual Server
Dummy interface
sudo ip -d addr show kube-ipvs0
3: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noqueue state DOWN group default
link/ether da:c8:87:73:ac:d4 brd ff:ff:ff:ff:ff:ff promiscuity 0
dummy numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
inet 10.200.200.68/32 brd 10.200.200.68 scope global kube-ipvs0
valid_lft forever preferred_lft forever
IPVS Hairpin traffic
$ sudo iptables -t nat -L KUBE-POSTROUTING
Chain KUBE-POSTROUTING (1 references)
target prot opt source destination
MASQUERADE all -- anywhere anywhere mark match 0x4000/0x4000
MASQUERADE all -- anywhere anywhere match-set KUBE-LOOP-BACK dst,dst,src
$ sudo ipset -L KUBE-LOOP-BACK
Name: KUBE-LOOP-BACK
Type: hash:ip,port,ip
Members:
10.1.243.2,tcp:5000,10.1.243.2
10.1.242.2,tcp:5000,10.1.242.2
Same as iptables but uses IPSET
When src & dst == endpoint IP => SNAT
IP sets are much faster than long lists of individual iptables rules
Persistency
$ sudo ipvsadm --list --numeric --tcp-service 10.200.200.68:80
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 10.200.200.68:80 rr persistent 600
-> 10.1.242.2:5000 Masq 1 0 0
-> 10.1.243.2:5000 Masq 1 0 0
Native option of virtual services
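Expressed as a manual ipvsadm call, this is equivalent to adding the virtual service with a persistence timeout (kube-proxy does this automatically):

sudo ipvsadm -A -t 10.200.200.68:80 -s rr -p 600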
Not considered stable yet
Much better performance
● No chain traversal: faster DNAT
● No full reload to add an endpoint / service: much faster updates
● See “Scale Kubernetes to support 50000 services”, Haibin Michael Xie
(Linuxcon China)
Definitely the future of kube-proxy
IPVS status
Alternatives to kube-proxy
Kube-router
● https://github.com/cloudnativelabs/kube-router
● Pod Networking with BGP
● Network Policies
● IPVS-based service proxy
Cilium
● Relies on eBPF to implement service proxying
● Implements security policies with eBPF
● Really promising
Other
● Very dynamic area, expect to see other solutions
API Server
Node A
kube-proxy iptables
What about DNS?
DNS client
Node B
Node C
DNS pod 1
DNS pod 2
Just another Kube Service
DNS pods get DNS info from API server
Access services from outside kube
Run kube-proxy on an external VM
Requires routable pod IPs
DNS
Access services from outside kube
VM
API Server
kube-proxy
iptables
Node
Service pod
Node
Service pod
Service pod
Node
client
Access services from outside kube
VM
API Server
kube-proxy
iptables
Node
Service pod
DNS pod
Node
Service pod
Service pod
Node
client => dnsmasq => DNS pod
L7 Load-balancing
L7 load balancing options
Ingress controllers
Service mesh (Istio)
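As a minimal illustration of the first option: an Ingress object routes HTTP traffic to a service by host/path, and an ingress controller watches these objects to configure the actual L7 proxy. A sketch for the echo service, using the extensions/v1beta1 API current at the time (the hostname is illustrative):

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: echo
spec:
  rules:
  - host: echo.example.com
    http:
      paths:
      - backend:
          serviceName: echo
          servicePort: 80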
Key takeaways
Complicated under the hood
● Helps to know where to look when debugging complex setups
Service discovery
● Challenge: integrate with hosts outside of Kubernetes
Load-Balancing
● L4 is still very dynamic (IPVS, eBPF)
● L7 is only getting started, expect to see a lot of activity
Thank you
We’re hiring!
Questions / comments: @lbernail
https://github.com/lbernail/dockercon2018