Kubernetes Walk-through
by Harry Zhang @resouer
Kubernetes
• Created by Google Borg/Omega team
• Hosted by the CNCF (Linux Foundation)
• Container orchestration, scheduling and management
• One of the most popular open source projects in the world
Project State
Growing Contributors
• 1728+ authors
Architecture
• api-server + etcd: the “desired world”, holding API objects such as pod, replica, namespace, service, job, deployment, volume, petset, …
• controller-manager: a ControlLoop per object type, reconciling desired and real state
• scheduler: assigns pods to Nodes
• each Node: a kubelet (SyncLoop) and a proxy realize the “real world” (containers, network)
Example: from creating a container to running it
1. Container created: the request reaches the api-server and the object is stored in etcd
2. Object added: api-server/etcd now hold the new object
3.1 The scheduler detects the new container
3.2 The scheduler binds the container to a node
4.1 The kubelet on that node detects the bind operation
4.2 The kubelet starts the container on this machine
Takeaways
• Independent control loops
• loosely coupled
• high performance
• easy to customize and extend
• “Watch” object change
• Decide next step based on state change
• level driven (state), not edge driven (event)
{Pod} = a group of containers
Co-scheduling
• Two containers:
• App: generate log files
• LogCollector: read and redirect logs to storage
• Request MEM:
• App: 1G
• LogCollector: 0.5G
• Available MEM:
• Node_A: 1.25G
• Node_B: 2G
• What happens if App is scheduled to Node_A first?
Pod
• Deeply coupled containers
• Atomic scheduling/placement unit
• Shared namespace
• network, IPC etc
• Shared volume
• Process group in container cloud
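A minimal sketch of the App + LogCollector pair from the previous slide as one Pod; image names are illustrative, not from the deck:

apiVersion: v1
kind: Pod
metadata:
  name: app-with-logs
spec:
  containers:
  - name: app
    image: my-app:latest              # hypothetical image
    resources:
      requests:
        memory: 1Gi
    volumeMounts:
    - name: logs
      mountPath: /var/log/app         # App writes log files here
  - name: log-collector
    image: my-log-collector:latest    # hypothetical image
    resources:
      requests:
        memory: 512Mi
    volumeMounts:
    - name: logs
      mountPath: /logs                # LogCollector reads the same files
  volumes:
  - name: logs
    emptyDir: {}                      # shared by both containers

The scheduler now places the 1.5G total as one unit, so the Node_A trap above cannot happen.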
Why co-scheduling?
• It’s about using containers the right way:
• Lesson learnt from Borg: “workloads tend to have tight relationships”
Ensure Container Order
• Decouple web server and application
• war file container
• tomcat container
Multiple Apps in One Container?
• Wrong!
• (Diagram: a “Master Pod” with kube-apiserver, kube-scheduler and controller-manager all packed into one container)
• Logs can’t be seen
• No way to tell whether each process is running
• Day-to-day operations become difficult
• Troubleshooting is painful: you don’t know which process died, and you end up logging into the container again and again
Copy Files from One to Another?
• Wrong!
• (Diagram: a Master Pod where kube-apiserver, kube-scheduler and controller-manager all need /etc/kubernetes/ssl; share it as a Pod volume instead of copying files around)
Connect to Peer Container through IP?
• Wrong!
• (Diagram: kube-apiserver, kube-scheduler and controller-manager share one network namespace; peer containers are reachable via localhost, so per-container IPs are unnecessary)
So this is Pod
• Design pattern in container world
• decoupling
• reuse & refactoring
• Describe more real-world workloads with containers
• e.g. ML
• Parameter server and trainer in same Pod
Kubernetes Control Plane
1. How does Kubernetes schedule workloads?
Resource Model
• Compressible resources
•  Hold no state
•  Can be taken away very quickly
•  “Merely” cause slowness when revoked
•  e.g. CPU
• Non-compressible resources
• Hold state
• Are slower to be taken away
• Can fail to be revoked
• e.g. Memory, disk space
Kubernetes (and Docker) can only handle CPU & memory;
they don’t handle things like memory bandwidth, disk time,
cache, network bandwidth, … (yet)
Resource Model
• Request: amount of a resource allowed
to be used, with a strong guarantee of
availability
•  CPU (seconds/second), RAM (bytes)
•  Scheduler will not over-commit
requests
• Limit: max amount of a resource that
can be used, regardless of guarantees
• scheduler ignores limits
• Mapping to Docker
• --cpu-shares=requests.cpu
• --cpu-quota=limits.cpu
• --cpu-period=100ms
• --memory=limits.memory
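A hedged sketch of how requests and limits look in a container spec (values illustrative):

resources:
  requests:
    cpu: 250m        # guaranteed share; mapped to --cpu-shares
    memory: 64Mi
  limits:
    cpu: 500m        # hard cap; mapped to --cpu-quota (with --cpu-period=100ms)
    memory: 128Mi    # hard cap; mapped to --memory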
QoS Tiers and Eviction
• Guaranteed
• limits is set for all resources, all containers
• limits == requests (if set)
• not killed unless they exceed their limits
• or if the system is under memory pressure and there are no lower-priority containers that can be killed
• Burstable
• requests is set for one or more resources, one or more containers
• limits (if set) != requests
• under memory pressure, killed once they exceed their requests and no Best-Effort pods are left to kill
• Best-Effort
• requests and limits are not set for any resource, in any container
• First to get killed if the system runs out of memory
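Minimal sketches of the three tiers (resource values illustrative):

# Guaranteed: limits set for every resource, limits == requests
resources:
  requests: {cpu: 500m, memory: 128Mi}
  limits:   {cpu: 500m, memory: 128Mi}

# Burstable: requests set, limits (if set) != requests
resources:
  requests: {cpu: 100m, memory: 64Mi}

# Best-Effort: no requests or limits at all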
Scheduler
• Predicates
• NoDiskConflict
• NoVolumeZoneConflict
• PodFitsResources
• PodFitsHostPorts
• MatchNodeSelector
• MaxEBSVolumeCount
• MaxGCEPDVolumeCount
• CheckNodeMemoryPressure
• eviction, QoS tiers
• CheckNodeDiskPressure
• Priorities
• LeastRequestedPriority
• BalancedResourceAllocation
• SelectorSpreadPriority
• CalculateAntiAffinityPriority
• ImageLocalityPriority
• NodeAffinityPriority
• Design tips:
• watch and sync podQueue
• schedule based on cached info
• optimistically bind
• predicates are parallelized across nodes
• priorities are parallelized across scoring functions in a map-reduce way
Multi-Scheduler
The 2nd scheduler
• Tip: system-usage markers like the scheduler name belong in annotations
• Do NOT abuse labels
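A sketch of steering a Pod to the 2nd scheduler. In the Kubernetes of this era the scheduler name went into an annotation (newer releases use spec.schedulerName); the scheduler name here is illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduled
  annotations:
    scheduler.alpha.kubernetes.io/name: my-scheduler   # system usage: an annotation, not a label
spec:
  containers:
  - name: app
    image: nginx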
2. Workload management?
Deployment
• Replicas with control
• Bring up a Replica Set and Pods.
• Check the status of a Deployment.
• Update that Deployment (e.g. new image, labels).
• Rollback to an earlier Deployment revision.
• Pause and resume a Deployment.
Create
• ReplicaSet
• Next generation of ReplicationController
• --record: records the command in the annotation of ‘nginx-deployment’
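A minimal sketch of the nginx Deployment used in these slides (apiVersion varies by release; extensions/v1beta1 in this era):

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80

$ kubectl create -f nginx-deployment.yaml --record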
Check
• DESIRED: .spec.replicas
• CURRENT: .status.replicas
• UP-TO-DATE: replicas that have the latest pod template
• AVAILABLE: replicas whose pods are ready (running)
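Checking a Deployment maps those fields onto the familiar columns (output illustrative):

$ kubectl get deployment nginx-deployment
NAME               DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   3         3         3            3           1m
$ kubectl rollout status deployment/nginx-deployment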
Update
• kubectl set image
• changes the container image
• kubectl edit
• opens an editor to modify your deployment yaml
• RollingUpdateStrategy
• 1 max unavailable
• 1 max surge
• can also be percentage
• Does not kill old Pods until a sufficient
number of new Pods have come up
• Does not create new Pods until a
sufficient number of old Pods have
been killed.
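A sketch of both update triggers and of the strategy block (image tag illustrative):

$ kubectl set image deployment/nginx-deployment nginx=nginx:1.9.1
# or: kubectl edit deployment/nginx-deployment

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # may also be a percentage, e.g. 25%
      maxSurge: 1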
Update Process
• The update process is coordinated by the Deployment Controller
• Create: creates a ReplicaSet (nginx-deployment-2035384211) and scales it up to 3 replicas directly.
• Update:
• creates a new ReplicaSet (nginx-deployment-1564180365) and scales it up to 1
• scales the old ReplicaSet down to 2
• continues scaling the new and the old ReplicaSet up and down, following the same rolling update strategy
• Finally: 3 available replicas in the new ReplicaSet, and the old ReplicaSet scaled down to 0
Rolling Back
• Check revisions
• Roll back to a revision
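The matching kubectl commands (revision number illustrative):

$ kubectl rollout history deployment/nginx-deployment
$ kubectl rollout undo deployment/nginx-deployment                  # back to the previous revision
$ kubectl rollout undo deployment/nginx-deployment --to-revision=2  # back to a specific revision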
Pausing & Resuming
(Canary)
• Tips
• blue-green deployment: duplicated infrastructure
• canary release: share same infrastructure
• rolling back a resumed deployment is WIP
• old way: kubectl rolling-update rc-1 rc-2
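A sketch of a canary-style pause/resume flow with the current commands:

$ kubectl rollout pause deployment/nginx-deployment
$ kubectl set image deployment/nginx-deployment nginx=nginx:1.9.1   # staged, not rolled out yet
$ kubectl rollout resume deployment/nginx-deployment                # rolling update proceeds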
3. Deploy daemon workloads to every Node?
DaemonSet
• Spreads a daemon pod to every node
• DaemonSet Controller
• bypass default scheduler
• even on unschedulable nodes
• e.g. bootstrap
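A minimal DaemonSet sketch (image name illustrative); note there is no replicas field, since the node count decides:

apiVersion: extensions/v1beta1   # apps/v1 in modern clusters
kind: DaemonSet
metadata:
  name: log-agent
spec:
  template:
    metadata:
      labels:
        app: log-agent
    spec:
      containers:
      - name: agent
        image: my-log-agent:latest   # hypothetical image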
4. Automatically scale?
Horizontal Pod Autoscaling
• Tips
• Scale out/in
• TriggeredScaleUp (GCE, AWS, will add more)
• Support for custom metrics
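The one-liner form of the above (thresholds illustrative):

$ kubectl autoscale deployment nginx-deployment --min=2 --max=10 --cpu-percent=80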
Custom Metrics
• Endpoint (Location to collect metrics from)
• Name of metric
• Type (Counter, Gauge, ...)
• Data Type (int, float)
• Units (kbps, seconds, count)
• Polling Frequency
• Regexps (Regular expressions to specify
which metrics to collect and how to parse
them)
• The metric definition is added to the pod as a ConfigMap volume
(Diagram: Prometheus collecting custom metrics from Nginx)
5. Pass information to workloads?
ConfigMap
• Decouple configuration from image
• configuration is a runtime attribute
• Can be consumed by pods through:
• env
• volumes
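A sketch of both consumption paths (names and keys illustrative):

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  log.level: debug

# in the Pod spec: as env …
env:
- name: LOG_LEVEL
  valueFrom:
    configMapKeyRef:
      name: app-config
      key: log.level
# … or as a volume
volumes:
- name: config
  configMap:
    name: app-config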
ConfigMap Volume
• No need to use a Persistent Volume
• Think about etcd: the ConfigMap data already lives there
Secret
• Tip: credentials for accessing the k8s API are automatically added to your pods as a secret
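You can see that automatically mounted secret from inside any pod (pod name illustrative):

$ kubectl exec mypod -- ls /var/run/secrets/kubernetes.io/serviceaccount
ca.crt
namespace
token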
6. Read information from the system itself?
Downward API
• Get these inside your pod as
ENV or volume
• The pod’s name
• The pod’s namespace
• The pod’s IP
• A container’s cpu limit
• A container’s cpu request
• A container’s memory limit
• A container’s memory request
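A sketch of the env form (the container name “app” is an assumption):

env:
- name: POD_NAME
  valueFrom:
    fieldRef:
      fieldPath: metadata.name
- name: POD_IP
  valueFrom:
    fieldRef:
      fieldPath: status.podIP
- name: CPU_LIMIT
  valueFrom:
    resourceFieldRef:
      containerName: app
      resource: limits.cpu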
7. Service discovery?
Service
• The unified portal of replica Pods
• Portal IP:Port
• External load balancer
• GCE
• AWS
• HAproxy
• Nginx
• OpenStack LB
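A minimal Service sketch; the port matches the iptables dump on the next slide, the selector is illustrative:

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: nginx          # the replica Pods behind the portal
  ports:
  - protocol: TCP
    port: 8001          # portal port (cluster IP side)
    targetPort: 80      # container port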
Service Implementation
Tip: the IPVS solution works in NAT mode, which is the same as this iptables approach
$ iptables-save | grep my-service
-A KUBE-SERVICES -d 10.0.0.116/32 -p tcp -m comment --comment "default/my-service: cluster IP" -m tcp --dport 8001 -j KUBE-SVC-KEAUNL7HVWWSEZA6
-A KUBE-SVC-KEAUNL7HVWWSEZA6 -m comment --comment "default/my-service:" -m statistic --mode random --probability 0.5 -j KUBE-SEP-6XXFWO3KTRMPKCHZ
-A KUBE-SVC-KEAUNL7HVWWSEZA6 -m comment --comment "default/my-service:" -j KUBE-SEP-57KPRZ3JQVENLNBRZ
-A KUBE-SEP-6XXFWO3KTRMPKCHZ -p tcp -m comment --comment "default/my-service:" -m tcp -j DNAT --to-destination 172.17.0.2:80
-A KUBE-SEP-57KPRZ3JQVENLNBRZ -p tcp -m comment --comment "default/my-service:" -m tcp -j DNAT --to-destination 172.17.0.3:80
Publishing Services
• Use Service.Type=NodePort
• <node_ip>:<node_port>
• External IP
• IPs route to one or more cluster nodes (e.g. floating IP)
• Use external LoadBalancer
• Require support from IaaS (GCE, AWS, OpenStack)
• Deploy a service-loadbalancer (e.g. HAproxy)
• Official guide: https://github.com/kubernetes/contrib/tree/master/service-loadbalancer
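The NodePort variant from the first bullet is one field on the Service (port number illustrative, from the default 30000-32767 range):

spec:
  type: NodePort
  ports:
  - port: 80
    nodePort: 30080   # reachable as <node_ip>:30080 on every node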
Ingress
• The next generation external Service load
balancer
• Deployed as a Pod on dedicated Node
(with external network)
• Implementation
• Nginx, HAproxy, GCE L7
• External access for service
• SSL support for service
• …
(Diagram: http://foo.bar.com resolves to <IP_of_Ingress_node>; the Ingress routes http://foo.bar.com/foo to backend service s1)
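A sketch of the rule in that diagram (era apiVersion; backend service s1 as shown):

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: foo-ingress
spec:
  rules:
  - host: foo.bar.com
    http:
      paths:
      - path: /foo
        backend:
          serviceName: s1
          servicePort: 80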
Headless Service
(Diagram: Pods labeled app=nginx, each resolvable under *.nginx.default.svc.cluster.local; see also: subdomain)
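A headless Service is an ordinary Service with clusterIP: None (a sketch; selector illustrative):

apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  clusterIP: None    # headless: DNS returns the Pod IPs directly
  selector:
    app: nginx
  ports:
  - port: 80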
8. Stateful applications?
StatefulSet: “clustered applications”
• Ordinal index
• startup/teardown ordering
• Stable hostname
• Stable storage
• linked to the ordinal & hostname
• Databases like MySQL or PostgreSQL
• single instance attached to a persistent volume at any time
• Clustered software like Zookeeper, Etcd, or Elasticsearch, Cassandra
• stable membership.
Update StatefulSet:
• Scale: Pods are created/deleted one at a time
• Scale in: old persistent volumes are not deleted
StatefulSet Example
(Diagram: cassandra-0 with volume 0 and cassandra-1 with volume 1, addressable as
cassandra-0.cassandra.default.svc.cluster.local and
cassandra-1.cassandra.default.svc.cluster.local)
$ kubectl patch petset cassandra -p '{"spec":{"replicas":10}}'
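A hedged sketch of that Cassandra example as a StatefulSet (named PetSet in this era); apiVersion and image tag are illustrative:

apiVersion: apps/v1beta1       # apps/v1 in modern clusters
kind: StatefulSet
metadata:
  name: cassandra
spec:
  serviceName: cassandra       # headless Service providing the stable hostnames
  replicas: 2
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      containers:
      - name: cassandra
        image: cassandra:3     # illustrative tag
        volumeMounts:
        - name: data
          mountPath: /var/lib/cassandra
  volumeClaimTemplates:        # one PVC per ordinal: data-cassandra-0, data-cassandra-1, …
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi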
9. Container network?
One Pod One IP
• Network sharing is important for closely affiliated containers
• Not all containers need an independent network
• The network implementation for a pod is exactly the same as for a single container
(Diagram: the Pod’s infra container owns the network namespace; Container A and Container B join it via --net=container:pause, so /proc/{pid}/ns/net -> net:[4026532483] is the same link for every container in the Pod)
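The same trick reproduced with plain Docker (image and container names illustrative):

$ docker run -d --name pause gcr.io/google_containers/pause
$ docker run -d --name app     --net=container:pause my-app-image
$ docker run -d --name sidecar --net=container:pause my-sidecar-image
# all three containers now share one network namespace:
$ ls -l /proc/{pid}/ns/net    # the same net:[4026532483]-style link for each container's pid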
Kubernetes uses CNI
• CNI plugin
• e.g. Calico, Flannel etc
• The kubelet CNI flags:
• --network-plugin=cni
• --network-plugin-dir=/etc/cni/net.d
• CNI is very simple
1. Kubelet creates a network namespace for the Pod
2. Kubelet invokes the CNI plugin to configure the namespace (interface name, IP, MAC, gateway, bridge name, …)
3. The infra container in the Pod joins this network namespace
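For reference, a minimal CNI config as the plugin would find it under /etc/cni/net.d (a bridge example; all values illustrative):

# /etc/cni/net.d/10-mynet.conf
{
  "name": "mynet",
  "type": "bridge",
  "bridge": "cni0",
  "isGateway": true,
  "ipMasq": true,
  "ipam": {
    "type": "host-local",
    "subnet": "10.244.0.0/16"
  }
}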
Tips
• Overhead ordering (lowest overhead first): host < calico(bgp) < calico(ipip) = flannel(vxlan) = docker(vxlan) < flannel(udp) < weave(udp)
• Test graph comes from: http://cmgs.me/life/docker-network-cloud
• Network model: Calico is a pure layer-3 solution; Flannel uses a VxLAN or UDP channel; Weave uses a VxLAN or UDP channel; Docker Overlay Network uses VxLAN
Calico
• Step 1: Run calico-node image as DaemonSet
Calico
• Step 2: Download and enable calico cni plugin
Calico
• Step 3: Add calico network controller
• Done!
10. Persistent volume?
Persistent Volumes
• Analogous to -v host_path:container_path
1. Attach networked storage to the host (mounted at host_path)
2. Mount the host path as the container volume (bind mount container_path onto host_path)
3. An independent volume control loop drives attach and mount
Officially Supported PVs
• GCEPersistentDisk
• AWSElasticBlockStore
• AzureFile
• FC (Fibre Channel)
• NFS
• iSCSI
• RBD (Ceph Block Device)
• CephFS
• Cinder (OpenStack block storage)
• Glusterfs
• VsphereVolume
• HostPath (single node testing only)
• more than 20 in total
• Write your own volume plugin: FlexVolume
1. Implement 10 methods
2. Put the binary/shell script in the plugin directory
• example: LVM as a k8s volume
Production ENV Volume Model
(Diagram: networked storage or a host path is bound to PersistentVolumes; Pods claim them through PersistentVolumeClaims and mount them at a mountPath)
Key point: separation of responsibilities
PV & PVC
• System Admin:
• $ kubectl create -f nfs-pv.yaml
• creates a volume with access modes, capacity and reclaim policy
• Dev:
• $ kubectl create -f pv-claim.yaml
• requests a volume by access modes, resource requirements and selector
• $ kubectl create -f pod.yaml
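Hedged sketches of the two yaml files named above (NFS server address and sizes illustrative):

# nfs-pv.yaml (System Admin)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: 5Gi
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Recycle
  nfs:
    server: 10.0.0.10
    path: /exports

# pv-claim.yaml (Dev)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pv-claim
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 5Gi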
More …
• GC
• Health check
• Container lifecycle hook
• Jobs (batch)
• Pod affinity and binding
• Dynamic provisioning
• Rescheduling
• CronJob
• Logging and monitoring
• Network policy
• Federation
• Container capabilities
• Resource quotas
• Security context
• Security policies
• GPU scheduling
Summary
• Q: Where do all these control plane ideas come from?
• A: Kubernetes = “Borg” + “Container”
• Kubernetes is a set of methodologies for using containers, based on 10+ years of experience inside Google
• “No need to cross the river by feeling for the stones”: the path has already been charted
• Kubernetes is a container-centric DevOps/workload orchestration system
• Not a “CI/CD”- or “micro-service”-focused container cloud
Growing Adopters
• Public Cloud
• AWS
• Microsoft Azure (acquired Deis)
• Google Cloud
• Tencent Cloud
• Baidu AI
• Alibaba Cloud
Enterprise Users
THE END
@resouer
harryzhang@zju.edu.cn
