Large Scale Overlay Networks with OVN:
Problems and Solutions
Han Zhou (hzhou8@ebay.com)
Open Infrastructure Summit - Denver, 2019
Agenda
● Background
● Control-plane components scaling
○ OVN-Controller
○ South-bound DB
○ OVN-Northd
● Scaling ACL
● Scaling nested workloads (containers on VMs)
Background of OVN
● SDN solution developed by OVS (Open vSwitch) community
● OpenStack support - neutron ML2 plugin: networking-ovn
● Kubernetes support - CNI plugin: ovn-kubernetes
● Main Features
○ Full L2/L3 virtualization with overlay networks (Geneve, STT, VXLAN)
○ L2 gateway, L3 gateway (centralized/distributed) & NAT with HA
○ L4 ACLs (stateful FW) with address-set, port-group and packet logging
○ Distributed Load-Balancer
○ L2/L3 Port-security
○ ARP responder, static/dynamic ARP
○ Flat/VLAN physical networks
○ Native DHCP, Metadata
○ Parent-child ports for nested workloads
○ QoS
○ IPSec
○ Policy-based routing
○ ...
Distributed Control Plane
● Logical/physical separation
● Distributed local controllers
● Database Approach (ovsdb)
[Figure: OVN architecture. CMS (OpenStack/K8S) -> North-bound ovsdb -> Northd -> South-bound ovsdb (Central) -> OVN-Controller and OVS on each HV and GW, connected by the OVSDB protocol (RFC7047). Data moves from Virtual Network Abstractions to Logical Flows to OpenFlows.]
OVN-Controller Scaling Challenges
[Figure: OVN architecture diagram (CMS, North-bound ovsdb, Northd, South-bound ovsdb, OVN-Controller and OVS on HVs and GW)]
● Factors
○ Size of data
○ Rate of changes
● Challenges
○ Large volume of data to process
■ E.g. 10k logical ports generate >40k logical flows and 10k port-bindings
○ Logical flow parsing is CPU intensive
○ Cloud workload changes frequently
○ Lots of inputs for flow computation
Dependency Graph of OVN-Controller
● SB OVSDB input
○ logical_flow, chassis, encap, mc_group, dp_binding, port_binding, mac_binding, dhcp, dhcpv6, dns, gw_chassis, addr_set, port_group
● Local OVSDB input
○ OVS tables: qos, open_vswitch, bridge, port
● Derived nodes
○ Address Sets (converted)
○ Port Groups (converted)
○ MFF OVN Geneve
○ Runtime Data: Local_datapath, Local_lports, Local_lport_ids, Active_tunnels, Ct_zone_bitmap, Pending_ct_zones, Ct_zones
● Output
○ Flow Output: Desired_flow_table, Group_table, Meter_table, Conj_id_ofs
Original Approach - Recomputing
● Compute OVS flows by reprocessing all inputs when
○ Any input changes
○ Or even when there is no change at all (but just unrelated events)
● Benefit
○ Relatively easy to implement and maintain
● Problems
○ 100% CPU of ovn-controller process on all compute nodes
○ High control plane latency
Solution - Incremental Processing Engine
● DAG representing dependencies
● Each node contains
○ Data
○ Links to input nodes
○ Change-handler for each input
○ Full recompute handler
● Engine
○ Traverse the DAG in DFS post-order, starting from the final output node
○ Invoke change-handlers for inputs that
changed
○ Fall back to recompute if for ANY of its inputs:
■ Change-handler is not implemented for that
input, or
■ Change-handler cannot handle the particular
change (returns false)
[Figure: example DAG with input, intermediate, and output nodes]
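The engine rules above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the ovn-controller implementation: `Node`, `change`, `engine_run`, and the two demo handlers are hypothetical names, data is modelled as plain sets, and only input nodes track their pending changes.

```python
class Node:
    """One node of the dependency DAG (hypothetical structure)."""

    def __init__(self, name, inputs=(), recompute=None, handlers=None):
        self.name = name
        self.inputs = list(inputs)      # links to input nodes
        self.recompute = recompute      # full recompute handler
        self.handlers = handlers or {}  # input name -> change handler
        self.data = set()
        self.changed = False
        self.pending = []               # tracked changes (input nodes only)


def change(node, op, item):
    """Apply a change to an input node and track it for the next run."""
    if op == "add":
        node.data.add(item)
    else:
        node.data.discard(item)
    node.pending.append((op, item))


def _visit(node, visited):
    # DFS post-order: every input is processed before the node itself.
    if node.name in visited:
        return
    visited[node.name] = node
    for inp in node.inputs:
        _visit(inp, visited)
    if not node.inputs:
        node.changed = bool(node.pending)
        return
    node.changed = False
    for inp in node.inputs:
        if not inp.changed:
            continue
        node.changed = True
        handler = node.handlers.get(inp.name)
        # No handler, or the handler returns False: fall back to recompute.
        if handler is None or not handler(node, inp):
            node.recompute(node)
            break


def engine_run(output):
    visited = {}
    _visit(output, visited)
    for node in visited.values():
        node.pending.clear()


log = []

def recompute_flows(node):
    log.append("recompute")
    ports, lflows = node.inputs
    node.data = {f"{f}@{p}" for p in ports.data for f in lflows.data}

def on_port_binding(node, inp):
    # This toy handler only understands additions.
    if any(op != "add" for op, _ in inp.pending):
        return False
    log.append("incremental")
    lflows = node.inputs[1]
    for _, port in inp.pending:
        node.data |= {f"{f}@{port}" for f in lflows.data}
    return True


pb = Node("port_binding")
lf = Node("logical_flow")
pb.data, lf.data = {"p1"}, {"f1", "f2"}
out = Node("flow_output", [pb, lf], recompute_flows,
           {"port_binding": on_port_binding})
out.recompute(out)                         # initial full computation
change(pb, "add", "p2"); engine_run(out)   # handled incrementally
change(lf, "add", "f3"); engine_run(out)   # no handler: recompute
change(pb, "del", "p1"); engine_run(out)   # handler declines: recompute
```

In this sketch only the port addition avoids a full recompute; the other two changes take the fallback path, which is why adding more change handlers directly reduces CPU cost.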
Change Handler Implemented
[Figure: the dependency graph of OVN-Controller (same nodes as on the "Dependency Graph of OVN-Controller" slide), with a legend entry "Input with change handler implemented" marking the inputs that are processed incrementally]
CPU Efficiency Improvement
● Create and bind 10k ports on 1k HVs
○ Simulated 1k HVs on 20 BMs x 40 cores (2.50GHz)
○ 10k ports all under the same logical router
○ Batch size 100 lports
○ Bind ports one by one within each batch
○ Wait for all ports to be up before the next batch
Latency Improvement
● End-to-end latency on top of 10k existing logical ports
○ Create one more logical port and bind the port on an HV
○ Wait until northd generates lflows and creates the port-binding in SB
○ Wait until ovn-controller claims the port on the HV
○ Wait until northd generates all lflows
○ Wait until OVS flows are programmed on all HVs
Tests at Larger Scale
● Next bottlenecks:
○ OVS flow installation
○ Port-binding handling when the binding happens locally
What’s next for Incremental-Processing (WIP)
● Incremental flow installation
○ Low hanging fruit - with the help of incremental flow computing
● Implement more change handlers as needed
○ E.g. support incremental processing when port-binding happens locally - further improve
end-to-end latency
● New implementation: Differential Datalog (DDlog)
○ Data-flow approach
○ Reuse the effort taken for Northd improvement (will be discussed in Northd scaling)
● Upstream?
○ Not in upstream, because DDlog is the preferred long term solution
○ For those who need this:
■ Rebased on Master: https://github.com/hzhou8/ovs/tree/ovn-controller-inc-proc
■ Rebased on 2.11: https://github.com/hzhou8/ovs/tree/ip12_rebase_on_2.11
■ Rebased on 2.10: https://github.com/hzhou8/ovs/tree/ip12_rebase_on_2.10
OVN-Controller Other Improvements (WIP)
● Reduce data size per-HV
○ Problem: External Provider Network connects everything
○ Solution: Don’t cross external network boundary when calculating connected datapaths
● On-demand tunnel port creation
○ Problem: Too many OVS ports when there are a lot of HVs
○ Solution: Create a tunnel to a remote host only if it has ports logically connected to ports on the local host.
SB DB Scaling Challenges
● Factors
○ Number of clients (HVs & GWs)
○ Size of data
○ Rate of changes
● Problems
○ Probe handling
○ Data resync during restart/failover
○ Clustered-mode problems
[Figure: OVN architecture diagram (CMS, North-bound ovsdb, Northd, South-bound ovsdb, OVN-Controller and OVS on HVs and GW)]
SB DB Probe
● Default 5 sec probe interval causing connection flapping
○ Ovsdb-server response can occasionally exceed 5 sec
■ DB log compression
■ Large transaction handling
○ Clients reconnecting adds more load to the server - cascade failure
■ Clients resync data from server (solved - see next slide)
● Solution
○ Increase probe interval
■ Client side (on HVs)
● ovs-vsctl set open . external_ids:ovn-remote-probe-interval=60000
■ Server side (DON’T FORGET!!)
● ovn-sbctl -- --id=@conn_uuid create Connection \
target="ptcp:6642:0.0.0.0" \
inactivity_probe=0 -- set SB_Global . connections=@conn_uuid
○ Rely on external monitoring for HV connectivity
Data re-sync during DB reconnect
● Problem
○ OVSDB client caching => NOT a problem
○ Server restart/failover: re-sync data for all
clients. => This is the problem!
● Solution - OVSDB fast re-sync (in master -> v2.12)
○ Track and maintain recent history transactions
in disk and memory.
○ New method monitor_cond_since in OVSDB
protocol, to request changes since last point
before connection lost.
○ Note: now it works for clustered mode only.
● Test Result - 1k HVs, 10k ports
○ Before: SB DB 100% CPU, >30 min to recover.
○ After: No CPU spike, all connections restored in
<1 min (probe interval).
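The idea behind fast re-sync can be illustrated with a toy model. This is a sketch under stated assumptions, not the ovsdb-server code: `HistoryServer`, `Client`, and `monitor_since` are hypothetical names, transaction IDs are plain integers, and the reply shape (found flag, last transaction ID, updates) is only loosely modelled on `monitor_cond_since`.

```python
from collections import deque


class HistoryServer:
    """Toy server that keeps a bounded history of recent transactions."""

    def __init__(self, max_history=100):
        self.rows = {}
        self.last_txn = 0
        self.history = deque(maxlen=max_history)  # (txn_id, key, value)

    def transact(self, key, value):
        """Write a row (value=None deletes it) and record the transaction."""
        self.last_txn += 1
        if value is None:
            self.rows.pop(key, None)
        else:
            self.rows[key] = value
        self.history.append((self.last_txn, key, value))
        return self.last_txn

    def monitor_since(self, last_seen):
        """Return (found, last_txn, updates) for a reconnecting client."""
        oldest = self.history[0][0] if self.history else self.last_txn + 1
        if last_seen is not None and oldest - 1 <= last_seen <= self.last_txn:
            # History covers the gap: send only what the client missed.
            delta = [(k, v) for t, k, v in self.history if t > last_seen]
            return True, self.last_txn, delta
        # Unknown or too-old position: fall back to a full dump.
        return False, self.last_txn, list(self.rows.items())


class Client:
    def __init__(self):
        self.cache = {}
        self.last_txn = None

    def sync(self, server):
        found, txn, updates = server.monitor_since(self.last_txn)
        if not found:
            self.cache.clear()
        for key, value in updates:
            if value is None:
                self.cache.pop(key, None)
            else:
                self.cache[key] = value
        self.last_txn = txn
        return found, len(updates)


server = HistoryServer()
server.transact("p1", "hv1")
server.transact("p2", "hv2")
client = Client()
first = client.sync(server)       # first connect: full dump
server.transact("p3", "hv1")
server.transact("p1", None)
second = client.sync(server)      # reconnect: only the two missed changes
```

With 1k clients reconnecting at once, the difference between sending two missed changes and sending the whole database to each client is what removes the CPU spike described above.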
OVSDB Clustered Mode
● Raft based clustering (experimental support since v2.9)
● Problems at scale
○ High CPU load (solved in master)
○ Follower update latency (solved in master)
○ Leader flapping (WIP, workaround ready)
○ Client reconnect (solved in master)
OVSDB Clustered Mode - High CPU
● OVSDB Raft Implementation
○ Preprocessing on followers before sending to leader - share
some load for leader
○ Send preprocessed transaction to leader together with a
prerequisite version ID
● Problem
○ Lots of prerequisite check failures and retries at large scale
■ Different HVs update chassis/port_binding at the same time
through different follower nodes
○ Continuous retry causes 100% CPU
● Solution (in master -> v2.12)
○ Retry only when the follower has applied the largest local
Raft log index
■ Otherwise, the prerequisite is already out-of-date, so don’t
waste CPU
OVSDB Clustered Mode - Follower Latency
● Original behavior: leader sends Raft log update to follower nodes when:
○ A new change is proposed, or
○ A heartbeat is sent
● Problem
○ Updates sent through a follower node suffer high latency
● Solution (in master -> v2.12)
○ Send log to followers as soon as a new entry is committed
● Test result: 100 updates through same follower from same client
○ Before: >30 sec
○ After: 500 ms
OVSDB Clustered Mode - Leader Flapping
● Problem: heartbeat timeout, triggering re-election
○ Large transaction execution
○ Raft log compression (snapshot)
● Solution
○ Quick and dirty: Increase election timeout (hardcode)
○ Short term: Make election timeout configurable at cluster level (WIP)
○ Longer term: Separate thread for Raft RPC (WIP)
■ Still need to configure timeout for snapshot scenarios
OVSDB Clustered Mode - Client Reconnect
● Problem: during leader failover, all clients of new leader will reconnect
○ DB state changes to “disconnected” when there is no leader (temporarily)
○ Client tries to reconnect to a new node
● Solution (in master -> v2.12)
○ Don’t change state to “disconnected” if
■ Current node is candidate, and
■ Election has not timed out yet
Scale Test for Clustered Mode
● Setup
○ 3-node cluster, 1k HVs
○ Election timeout: 10s (hardcoded in the test)
● Test
○ Keep creating and binding ports up to 10k
○ Periodically kill->wait(10s)->start each ovsdb-server randomly
● Test passed at scale!
○ All port creation and binding completed correctly.
○ Fast-resync helped!
Further Improvement: SB-DB Scale-out Replicas (TODO)
● How to support more HVs - 2k? 5k? 10k?
○ More nodes in cluster? Doesn’t scale.
○ Multi-threading OVSDB? Would help, but...
● Precondition: no write to SB from HV
○ Chassis/Encap/Port-binding update by
CMS/northd only
○ Does not use dynamic ARP (mac-binding)
● How
○ Use replication mode of OVSDB to create N read-only replicas
○ Shard HV connections across the read-only replicas
○ HVs can fail over to other replicas
[Figure: CMS -> NB ovsdb -> Northd -> SB ovsdb -> read-only SB Replica 1, SB Replica 2, ..., SB Replica n, with each replica serving a group of HVs]
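One way to shard HV connections across replicas with failover is a deterministic hash of the chassis name. The sketch below is an assumption for illustration only: the slides do not specify a sharding scheme, and `replica_order`, `pick_replica`, and the replica addresses are hypothetical.

```python
import hashlib


def replica_order(chassis, replicas):
    """Preferred replica first, followed by failover candidates."""
    digest = hashlib.sha256(chassis.encode()).digest()
    start = int.from_bytes(digest[:8], "big") % len(replicas)
    # Rotate the list so every chassis sees a stable, chassis-specific order.
    return replicas[start:] + replicas[:start]


def pick_replica(chassis, replicas, is_up):
    """Return the first reachable replica for this chassis, or None."""
    for replica in replica_order(chassis, replicas):
        if is_up(replica):
            return replica
    return None


# Hypothetical replica endpoints, for illustration only.
replicas = ["tcp:192.0.2.1:6642", "tcp:192.0.2.2:6642", "tcp:192.0.2.3:6642"]
order = replica_order("hv-42", replicas)
primary = pick_replica("hv-42", replicas, lambda r: True)
fallback = pick_replica("hv-42", replicas, lambda r: r != order[0])
```

Because the order depends only on the chassis name, an HV returns to the same replica after a restart, and the HVs of a failed replica fall over to the next replica in their own rotation.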
OVN-Northd Scaling Challenges
[Figure: OVN architecture diagram (CMS, North-bound ovsdb, Northd, South-bound ovsdb, HVs and GW)]
● Factors
○ Size of data
○ Rate of changes
● Problems
○ Recompute
OVN-Northd Incremental Processing (WIP from community)
● OVN-Northd is a perfect target user of Differential Datalog (DDlog)
○ Inputs - NB DB tables (logical routers, switch, port, etc.)
○ Outputs - SB DB tables (logical flows, port-bindings, etc.)
○ Rules to convert inputs to outputs
● Differential Datalog
○ An open-source datalog language for incremental data-flow processing
○ Defining inputs and outputs as relations
○ Defining rules to generate outputs from inputs
● Efforts can be reused by OVN-Controller
○ OVSDB - DDlog wrappers
○ Process framework changes
Recap Scaling Bottlenecks
● OVN-Northd
● OVN-SB DB
● OVN-Controller
[Figure: OVN architecture diagram (CMS, North-bound ovsdb, Northd, South-bound ovsdb, OVN-Controller and OVS on HVs and GW)]
Some More Scaling Problems
● Security Group / Network policy using ACLs
● Nested workloads (K8S containers)
ACLs
● Used by Security Group (OpenStack) / Network Policy (K8S)
● Typical use case: members of same group are allowed to access each other
● Naked => O(N^2)
outport == <port1_uuid> && ip4 && ip4.src == {ip1, ip2, …, ipN}
outport == <port2_uuid> && ip4 && ip4.src == {ip1, ip2, …, ipN}
...
outport == <portN_uuid> && ip4 && ip4.src == {ip1, ip2, …, ipN}
● Using Address Set => O(N)
outport == <port1_uuid> && ip4 && ip4.src == $as_ip4_sg1
outport == <port2_uuid> && ip4 && ip4.src == $as_ip4_sg1
...
outport == <portN_uuid> && ip4 && ip4.src == $as_ip4_sg1
● #Flows in OVS is always O(M*N) (M = number of ports on the HV)
Solution - Port Group (Released in v2.10)
● All-in-one
● Greatly simplified CMS Implementation
○ networking-ovn
○ ovn-kubernetes
● Enables more efficient OVS flow generation with conjunction, when multiple ports on the same HV belong to the same port-group
○ E.g.
■ N members in a port-group, all M ports on HV1 belong to this group
■ Number of OVS flows on HV1 will be M + N, instead of M * N
outport == @port_group1 && ip4 && ip4.src == $port_group1_ip4
○ @port_group1: CMS creates port-group instead of address-set
○ $port_group1_ip4: OVN-Northd generates address-set for you
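The M * N versus M + N difference can be checked with a small count. The sketch below is illustrative only: the flow strings are simplified stand-ins rather than real OpenFlow syntax, and `cross_product_flows` and `conjunction_flows` are hypothetical helpers. It counts clause flows only; a conjunctive match in OVS also needs one flow that matches the conjunction ID and carries the action, so the exact total is M + N + 1.

```python
def cross_product_flows(ports, ips):
    """One flow per (local port, source IP) pair: M * N flows."""
    return [f"outport={p},ip_src={ip}" for p in ports for ip in ips]


def conjunction_flows(ports, ips, conj_id=1):
    """One clause flow per port plus one per source IP: M + N flows."""
    dim1 = [f"outport={p} actions=conjunction({conj_id},1/2)" for p in ports]
    dim2 = [f"ip_src={ip} actions=conjunction({conj_id},2/2)" for ip in ips]
    return dim1 + dim2


# M = 10 local ports in the group, N = 1000 group members.
local_ports = [f"port{i}" for i in range(10)]
member_ips = [f"10.0.{i // 250}.{i % 250 + 1}" for i in range(1000)]

without_conj = len(cross_product_flows(local_ports, member_ips))
with_conj = len(conjunction_flows(local_ports, member_ips))
```

For this example the cross product yields 10 * 1000 flows and the conjunctive form 10 + 1000 clause flows, roughly a tenfold reduction on that HV.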
Further Improvement - Group-ID in Packet (TODO)
● Problem - still too many OVS flows
○ Best case: M + N, if all M ports on the HV belong to the same group.
○ Worst case: M * N, if ports are distributed randomly.
■ M ports on HV, each belongs to a different group, each group has N members
● Solution (just an idea)
○ Encoding port-group in tunnel metadata
■ Only M flows in all cases
■ Best part: no local flow change needed for remote member changes
○ Challenge: what if a port belongs to multiple groups
■ Limit the number of groups for a single port
■ Fall back to the old way if the limit is exceeded
○ Limitation: works for ingress (to-lport) rules only
outport == @port_group1 && src_group_id == <group1 id>
○ src_group_id: from tunnel metadata
Scaling Nested Workloads
● Use Case
○ VM overlay networking with OVN (e.g. using OpenStack networking-ovn)
○ Run Kubernetes on top of the VMs
● Problem
○ How to connect the pods at scale?
ARP Proxy
● OVN doesn’t support MAC-learning (MAC-Port binding
learning), but IP-MAC binding can be learned through
ARP
● How
○ LR sends ARP requests for Pod IPs
○ ARP proxy in the VM replies with VM’s MAC for
all Pod IPs on the VM
● Works, but
○ Requires VM and Pods on same subnet
○ Unreliable when SB DB connection fails
○ Scale: O(N), N = number of pods, usually much
bigger than number of VMs
■ Note: IP-MAC Binding incremental processing
change handler is implemented - no re-compute.
[Figure: an HV runs OVS and OVN Controller and hosts a VM at 10.0.0.2 (aa:bb:cc:dd:ee:ff) with an ARP Proxy and Pods 10.0.0.102, 10.0.0.103, 10.0.0.104, 10.0.0.105; the SB holds the IP-MAC Binding Table]
LR ARP Cache (dynamic):
10.0.0.102 => aa:bb:cc:dd:ee:ff
10.0.0.103 => aa:bb:cc:dd:ee:ff
10.0.0.104 => aa:bb:cc:dd:ee:ff
...
LR Static Route
● Assign Pod subnet(s) per VM (minion)
● How
○ Configure static routes in OVN LR for pod
subnets: next hop = VM IP
● Considerations
○ De-couples VM and Pod subnets
○ Declarative, more reliable than ARP
○ May waste more IPs, but size of subnet is
flexible
○ Scale: O(S), S = number of pod subnets
■ Worst case O(N), N = number of pods, if subnet
size is /32.
[Figure: an HV runs OVS and hosts a VM at 172.0.0.2/24 with Pods 10.0.0.2/25, 10.0.0.3/25, 10.0.0.4/25, 10.0.0.5/25]
LR Routing Table (static):
10.0.0.0/25 => 172.0.0.2
10.0.0.128/25 => 172.0.1.100
10.0.0.1/25 => 172.0.1.3
...
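How a per-VM pod subnet route resolves a pod address can be sketched with a longest-prefix-match lookup. This is a minimal illustration using Python's `ipaddress` module, not OVN's routing code: `next_hop` is a hypothetical helper, and only the first two routes from the table above are used.

```python
import ipaddress


def next_hop(routes, dst):
    """Return the next hop of the most specific route containing dst."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, hop in routes:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, hop)
    return best[1] if best else None


# One static route per pod subnet; the next hop is the IP of the hosting VM.
routes = [
    ("10.0.0.0/25", "172.0.0.2"),
    ("10.0.0.128/25", "172.0.1.100"),
]

hop_a = next_hop(routes, "10.0.0.5")     # pod in the first VM's subnet
hop_b = next_hop(routes, "10.0.0.200")   # pod in the second VM's subnet
hop_c = next_hop(routes, "10.0.1.1")     # no matching route
```

The table holds one entry per subnet rather than one per pod, which is the O(S) versus O(N) difference noted above.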
References
● OVS/OVN
○ http://www.openvswitch.org/
● Networking-OVN
○ https://docs.openstack.org/networking-ovn/latest/
● OVN-Kubernetes
○ https://github.com/openvswitch/ovn-kubernetes/
● OVN-Scale-Test
○ https://github.com/openvswitch/ovn-scale-test
● GO-OVN library
○ https://github.com/eBay/go-ovn

Large scale overlay networks with ovn: problems and solutions

  • 1. Large Scale Overlay Networks with OVN: Problems and Solutions Han Zhou (hzhou8@ebay.com) Open Infrastructure Summit - Denver, 2019
  • 2. Agenda ● Background ● Control-plane components scaling ○ OVN-Controller ○ South-bound DB ○ OVN-Northd ● Scaling ACL ● Scaling nested workloads (containers on VMs)
  • 3. Background of OVN ● SDN solution developed by OVS (Open vSwitch) community ● OpenStack support - neutron ML2 plugin: networking-ovn ● Kubernetes support - CNI plugin: ovn-kubernetes ● Main Features ● Full L2/L3 virtualization with overlay networks (Geneve, STT, VxLAN) ● L2 gateway, L3 gateway (centralized/distributed) & NAT with HA ● L4 ACLs (stateful FW) with address-set, port-group and packet logging ● Distributed Load-Balancer ● L2/L3 Port-security ● ARP responder, static/dynamic ARP ● Flat/Vlan physical networks ● Native DHCP, Metadata ● Parent-child ports for nested workloads ● QoS ● IPSec ● Policy-based routing ● ...
  • 4. ● Logical/physical separation ● Distributed local controllers ● Database Approach (ovsdb) Northd North-bound ovsdb South-bound ovsdb Central North-bound ovsdb Northd South-bound ovsdb Distributed Control Plane OVN-Controller OVS HV HV … OVSDB protocol (RFC7047) HV GW CMS (OpenStack/K8S) Virtual Network Abstractions Logical Flows OpenFlows
  • 5. Northd North-bound ovsdb South-bound ovsdb Central North-bound ovsdb Northd South-bound ovsdb OVN-Controller Scaling Challenges OVN-Controller OVS HV HV … OVSDB protocol (RFC7047) HV GW CMS (OpenStack/K8S) Virtual Network Abstractions Logical Flows OpenFlows ● Factors ○ Size of data ○ Rate of changes ● Challenges ○ Big size of data to be processed ■ E.g. 10k logical ports generates >40k logical flows and 10k port-bindings ○ Logical flow parsing is CPU intensive ○ Cloud workload changes frequently ○ Lots of inputs for flow computation
  • 6. OVS qos Address Sets (converted) MFF OVN Geneve OVS open_vswitch OVS bridge SB logical_flow SB chassis SB encap SB mc_group SB dp_binding SB port_binding SB mac_binding SB dhcp SB dhcpv6 SB dns SB gw_chassis OVS port SB addr_set SB port_group Runtime Data ------------------------------ Local_datapath Local_lports Local_lport_ids Active_tunnels Ct_zone_bitmap Pending_ct_zones Ct_zones Flow Output --------------------------- Desired_flow_table Group_table Meter_table Conj_id_ofs SB OVSDB input Local OVSDB input Dependency Graph of OVN-Controller Port Groups (converted)
  • 7. Original Approach - Recomputing ● Compute OVS flows by reprocessing all inputs when ○ Any input changes ○ Or even when there is no change at all (but just unrelated events) ● Benefit ○ Relatively easy to implement and maintain ● Problems ○ 100% CPU of ovn-controller process on all compute nodes ○ High control plane latency
  • 8. Solution - Incremental Processing Engine ● DAG representing dependencies ● Each node contains ○ Data ○ Links to input nodes ○ Change-handler for each input ○ Full recompute handler ● Engine ○ DFS post-order traverse the DAG from the final output node ○ Invoke change-handlers for inputs that changed ○ Fall back to recompute if for ANY of its inputs: ■ Change-handler is not implemented for that input, or ■ Change-handler cannot handle the particular change (returns false) input intermediate input intermediate output input
  • 9. OVS qos Address Sets (converted) MFF OVN Geneve OVS open_vswitch OVS bridge SB logical_flow SB chassis SB encap SB mc_group SB dp_binding SB port_binding SB mac_binding SB dhcp SB dhcpv6 SB dns SB gw_chassis OVS port SB addr_set SB port_group Runtime Data ------------------------------ Local_datapath Local_lports Local_lport_ids Active_tunnels Ct_zone_bitmap Pending_ct_zones Ct_zones Flow Output --------------------------- Desired_flow_table Group_table Meter_table Conj_id_ofs SB OVSDB input Local OVSDB input Input with change handler implemented Change Handler Implemented Port Groups (converted)
  • 10. ● Create and bind 10k ports on 1k HVs ○ Simulated 1k HVs on 20 BMs x 40 cores (2.50GHz) ○ 10k ports all under the same logical router ○ Batch size 100 lports ○ Bind port one by one for each batch ○ Wait all ports up before next batch CPU Efficiency Improvement
  • 11. ● End to end latency on top of 10k existed logical ports ○ Create one more logical port and bind the port on HV ○ Wait until northd generate lflows and create port-binding in SB ○ Wait until ovn-controller claim the port on HV ○ Wait until northd generate all lflows ○ Wait until OVS flows programmed on all HVs Latency Improvement
  • 12. Tests at Larger Scale ● Next bottle-necks: ○ OVS flow installation ○ Port-binding handling when the binding happens locally
  • 13. What’s next for Incremental-Processing (WIP) ● Incremental flow installation ○ Low hanging fruit - with the help of incremental flow computing ● Implement more change handlers as needed ○ E.g. support incremental processing when port-binding happens locally - further improve end-to-end latency ● New implementation: Differential Datalog (DDlog) ○ Data-flow approach ○ Reuse the effort taken for Northd improvement (will be discussed in Northd scaling) ● Upstream? ○ Not in upstream, because DDlog is the preferred long term solution ○ For those who need this: ■ Rebased on Master: https://github.com/hzhou8/ovs/tree/ovn-controller-inc-proc ■ Rebased on 2.11: https://github.com/hzhou8/ovs/tree/ip12_rebase_on_2.11 ■ Rebased on 2.10: https://github.com/hzhou8/ovs/tree/ip12_rebase_on_2.10
  • 14. OVN-Controller Other Improvements (WIP) ● Reduce data size per-HV ○ Problem: External Provider Network connects everything ○ Solution: Don’t cross external network boundary when calculating connected datapaths ● On-demand tunnel port creation ○ Problem: Too many OVS ports when there are a lot of HVs ○ Solution: Create tunnel to a remote host only if there are ports on these hosts logically connected.
  • 15. ● Factors ○ Number of clients (HVs & GWs) ○ Size of data ○ Rate of changes ● Problems ○ Probe handling ○ Data resync during restart/failover ○ Clustered-mode problems Northd North-bound ovsdb South-bound ovsdb Central North-bound ovsdb Northd South-bound ovsdb SB DB Scaling Challenges OVN-Controller OVS HV HV … OVSDB protocol (RFC7047) HV GW CMS (OpenStack/K8S) Virtual Network Abstractions Logical Flows OpenFlows
  • 16. SB DB Probe ● Default 5 sec probe interval causing connection flapping ○ Ovsdb-server response can occasionally exceed 5 sec ■ DB log compression ■ Large transaction handling ○ Clients reconnecting adds more load to the server - cascade failure ■ Clients resync data from server (solved - see next slide) ● Solution ○ Increase probe interval ■ Client side (on HVs) ● ovs-vsctl set open . external_ids:ovn-remote-probe-interval=60000 ■ Server side (DON’T FORGET!!) ● ovn-sbctl -- --id=@conn_uuid create Connection target="ptcp:6642:0.0.0.0" inactivity_probe=0 -- set SB_Global . connections=@conn_uuid ○ Rely on external monitorings for HVs connectivity
  • 17. Data re-sync during DB reconnect ● Problem ○ OVSDB client caching => NOT a problem ○ Server restart/failover: re-sync data for all clients. => This is the problem! ● Solution - OVSDB fast re-sync (in master -> v2.12) ○ Track and maintain recent history transactions in disk and memory. ○ New method monitor_cond_since in OVSDB protocol, to request changes since last point before connection lost. ○ Note: now it works for clustered mode only. ● Test Result - 1k HVs, 10k ports ○ Before: SB DB 100% CPU, >30 min to recover. ○ After: No CPU spike, all connections restored in <1 min (probe interval).
  • 18. OVSDB Clustered Mode ● Raft based clustering (experimental support since v2.9) ● Problems at scale ○ High CPU load (solved in master) ○ Follower update latency (solved in master) ○ Leader flapping (WIP, workaround ready) ○ Client reconnect (solved in master)
  • 19. OVSDB Clustered Mode - High CPU ● OVSDB Raft Implementation ○ Preprocessing on followers before sending to leader - share some load for leader ○ Send preprocessed transaction to leader together with a prerequisite version ID ● Problem ○ Lots of prerequisite check failure and retry at large scale ■ Different HVs update chassis/port_binding at the same time through different follower nodes ○ Continuous retry causes 100% CPU ● Solution (in master -> v2.12) ○ Retry only when the follower have applied the largest local Raft log index ■ Otherwise, the prerequisite is already out-of-date, so don’t waste CPU
  • 20. OVSDB Clustered Mode - Follower Latency ● Original behavior: leader sends Raft log update to follower nodes when: ○ A new change is proposed, or ○ A heartbeat is sent ● Problem ○ Update from follower node suffers big latency ● Solution (in master -> v2.12) ○ Send log to followers as soon as a new entry is committed ● Test result: 100 updates through same follower from same client ○ Before: >30 sec ○ After: 500 ms
  • 21. OVSDB Clustered Mode - Leader Flapping ● Problem: heartbeat timeout, triggering re-election ○ Large transaction execution ○ Raft log compression (snapshot) ● Solution ○ Quick and dirty: Increase election timeout (hardcode) ○ Short term: Make election timeout configurable at cluster level (WIP) ○ Longer term: Separate thread for Raft RPC (WIP) ■ Still need to configure timeout for snapshot scenarios
  • 22. OVSDB Clustered Mode - Client Reconnect ● Problem: during leader failover, all clients of new leader will reconnect ○ DB state changes to “disconnected” when there is no leader (temporarily) ○ Client tries to reconnect to a new node ● Solution (in master -> v2.12) ○ Don’t change state to “disconnected” if ■ Current node is candidate, and ■ Election didn’t timeout yet
  • 23. Scale Test for Clustered Mode ● Setup ○ 3-node cluster, 1k HVs ○ Election timeout: 10s (hardcoded in the test) ● Test ○ Keep creating and binding ports up to 10k ○ Periodically kill->wait(10s)->start each ovsdb-server randomly ● Test passed at scale! ○ All port creation and binding completed correctly. ○ Fast-resync helped!
  • 24. Further Improvement: SB-DB Scale-out Replicas (TODO) ● How to support more HVs - 2k? 5k? 10k? ○ More nodes in cluster? Doesn’t scale. ○ Multi-threading OVSDB? Would help, but... ● Precondition: no write to SB from HV ○ Chassis/Encap/Port-binding update by CMS/northd only ○ Does not use dynamic ARP (mac-binding) ● How ○ Use replication mode of OVSDB to create N read-only replicas ○ HV connections sharding on read-only replicas ○ HV can failover to other replicas NorthdNorthd SB ovsdb SB Replica 1 SB Replica 2 SB Replica n … HV HV HV … HV HV HV … HV HV HV … CMS NB ovsdb
  • 25. OVN-Northd Scaling Challenges
    ● Factors
      ○ Size of data
      ○ Rate of changes
    ● Problems
      ○ Recompute
    [Diagram: OVN architecture. CMS (OpenStack/K8S) -> North-bound ovsdb (virtual network abstractions) -> Northd -> South-bound ovsdb (logical flows) -> HVs and GW over the OVSDB protocol (RFC7047) -> OpenFlows]
  • 26. OVN-Northd Incremental Processing (WIP from community)
    ● OVN-Northd is a perfect target user of Differential Datalog (DDlog)
      ○ Inputs - NB DB tables (logical routers, switches, ports, etc.)
      ○ Outputs - SB DB tables (logical flows, port-bindings, etc.)
      ○ Rules to convert inputs to outputs
    ● Differential Datalog
      ○ An open-source datalog language for incremental data-flow processing
      ○ Inputs and outputs are defined as relations
      ○ Rules are defined to generate outputs from inputs
    ● Efforts can be reused by OVN-Controller
      ○ OVSDB - DDlog wrappers
      ○ Process framework changes
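The difference between recomputing and processing incrementally can be illustrated with a toy example. This is plain Python, not DDlog, and the rule (two flows per logical port) and all names are made up for illustration; it only shows that applying deltas yields the same output as a full recompute.

```python
def flows_for_port(port):
    """Toy rule: each (name, switch) port yields two logical flows."""
    name, switch = port
    return {(switch, f'inport == "{name}"'), (switch, f'outport == "{name}"')}


def recompute(ports):
    """Full recompute: rebuild every output from all inputs."""
    flows = set()
    for port in ports:
        flows |= flows_for_port(port)
    return flows


class IncrementalEngine:
    """Keeps outputs up to date by applying only the delta of each change."""

    def __init__(self):
        self.ports = set()
        self.flows = set()

    def add_port(self, port):
        if port in self.ports:
            return set()
        self.ports.add(port)
        delta = flows_for_port(port)
        self.flows |= delta
        return delta

    def remove_port(self, port):
        if port not in self.ports:
            return set()
        self.ports.remove(port)
        # Safe here because each toy flow is derived from exactly one port.
        delta = flows_for_port(port)
        self.flows -= delta
        return delta


engine = IncrementalEngine()
for i in range(1000):
    engine.add_port((f"lsp{i}", "ls0"))
delta = engine.add_port(("lsp-new", "ls0"))
print(len(delta), len(engine.flows))  # 2 2002
```

Adding one port touches two flows instead of rebuilding all 2002, which is the property the slide is after.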
  • 27. Recap: Scaling Bottlenecks
    ● OVN-Northd
    ● OVN-SB DB
    ● OVN-Controller
  • 28. Some More Scaling Problems
    ● Security Group / Network policy using ACLs
    ● Nested workloads (K8S containers)
  • 29. ACLs
    ● Used by Security Group (OpenStack) / Network Policy (K8S)
    ● Typical use case: members of the same group are allowed to access each other
    ● Naked (explicit IP list in every ACL) => O(N^2)
        outport == <port1_uuid> && ip4 && ip4.src == {ip1, ip2, …, ipN}
        outport == <port2_uuid> && ip4 && ip4.src == {ip1, ip2, …, ipN}
        ...
        outport == <portN_uuid> && ip4 && ip4.src == {ip1, ip2, …, ipN}
    ● Using Address Set => O(N)
        outport == <port1_uuid> && ip4 && ip4.src == $as_ip4_sg1
        outport == <port2_uuid> && ip4 && ip4.src == $as_ip4_sg1
        ...
        outport == <portN_uuid> && ip4 && ip4.src == $as_ip4_sg1
    ● #Flows in OVS is always O(M*N) (M = number of ports on the HV)
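The O(N^2) versus O(N) claim is about how much ACL data the CMS has to store. The sketch below counts entries for a group of N members under both schemes; the data layout and names are simplified assumptions, not the actual NB schema.

```python
def naked_size(n):
    """N ACLs, each embedding the full list of N member IPs."""
    ports = [f"port{i}" for i in range(n)]
    ips = [f"ip{i}" for i in range(n)]
    acls = [(port, tuple(ips)) for port in ports]
    # Size is dominated by the IPs repeated inside every ACL.
    return sum(len(src) for _, src in acls)


def address_set_size(n):
    """N ACLs that reference one shared address set of N member IPs."""
    ports = [f"port{i}" for i in range(n)]
    address_sets = {"as_ip4_sg1": [f"ip{i}" for i in range(n)]}
    acls = [(port, "$as_ip4_sg1") for port in ports]
    # One reference per ACL plus the IPs stored once in the set.
    return len(acls) + sum(len(ips) for ips in address_sets.values())


print(naked_size(100), address_set_size(100))  # 10000 200
```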
  • 30. Solution - Port Group (Released in v2.10)
    ● All-in-one
        outport == @port_group1 && ip4 && ip4.src == $port_group1_ip4
      ○ CMS creates a port-group instead of an address-set
      ○ OVN-Northd generates the address-set for you
    ● Greatly simplified CMS implementation
      ○ networking-ovn
      ○ ovn-kubernetes
    ● Enables more efficient OVS flow generation with conjunction, when multiple ports on the same HV belong to the same port-group
      ○ E.g.
        ■ N members in a port-group, all M ports on HV1 belong to this group
        ■ Number of OVS flows on HV1 will be M + N, instead of M * N
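The M + N versus M * N difference comes from matching each dimension separately instead of enumerating every (port, source IP) pair. The sketch below only counts flows under a simplified model; the tuple layout is illustrative and is not the OpenFlow that ovn-controller actually emits.

```python
from itertools import product


def cross_product_flows(local_ports, member_ips):
    """Without conjunction: one flow per (local port, member IP) pair."""
    return [(port, ip) for port, ip in product(local_ports, member_ips)]


def conjunctive_flows(local_ports, member_ips):
    """With conjunction: one clause flow per port and one per member IP.

    In OpenFlow terms there is also a flow matching the conjunction ID;
    it is left out here to match the slide's M + N count.
    """
    port_clause = [("clause 1/2", port) for port in local_ports]
    ip_clause = [("clause 2/2", ip) for ip in member_ips]
    return port_clause + ip_clause


local_ports = [f"port{i}" for i in range(10)]    # M = 10 ports on HV1
member_ips = [f"ip{i}" for i in range(100)]      # N = 100 group members

print(len(cross_product_flows(local_ports, member_ips)))  # 1000
print(len(conjunctive_flows(local_ports, member_ips)))    # 110
```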
  • 31. Further Improvement - Group-ID in Packet (TODO)
    ● Problem - still too many OVS flows
      ○ Best case: M + N, if all M ports on the HV belong to the same group.
      ○ Worst case: M * N, if ports are distributed randomly.
        ■ M ports on the HV, each belonging to a different group, each group with N members
    ● Solution (just an idea)
      ○ Encode the port-group in tunnel metadata
          outport == @port_group1 && src_group_id == <group1 id>
        (src_group_id comes from tunnel metadata)
        ■ Only M flows in all cases
        ■ Best part: no local flow change needed for remote member changes
      ○ Challenge: what if a port belongs to multiple groups?
        ■ Limit the number of groups for a single port
        ■ Fall back to the old way if the limit is exceeded
      ○ Limitation: works for ingress (to-lport) rules only
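The best and worst cases can be checked with a small counting model. It assumes every group has the same number of members and each local port belongs to exactly one group; the function names are hypothetical, and the group-ID scheme itself is only an idea in the talk.

```python
from collections import Counter


def conjunction_flows(local_port_groups, members_per_group):
    """Flows with per-group conjunction.

    local_port_groups lists, for each port on this HV, the group it is in.
    Each group present on the HV costs (its local ports + its N members).
    """
    per_group = Counter(local_port_groups)
    return sum(local + members_per_group for local in per_group.values())


def group_id_flows(local_port_groups):
    """Flows when the source group ID travels in tunnel metadata:
    one flow per local port, independent of the number of members."""
    return len(local_port_groups)


M, N = 10, 100
best = ["g1"] * M                      # all local ports in one group
worst = [f"g{i}" for i in range(M)]    # every local port in its own group

print(conjunction_flows(best, N), group_id_flows(best))    # 110 10
print(conjunction_flows(worst, N), group_id_flows(worst))  # 1010 10
```

In the worst case this model gives M * N + M, which is on the order of the slide's M * N, while the group-ID count stays at M.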
  • 32. Scaling Nested Workloads
    ● Use Case
      ○ VM overlay networking with OVN (e.g. using OpenStack networking-ovn)
      ○ Run Kubernetes on top of the VMs
    ● Problem
      ○ How to connect the pods at scale?
  • 33. ARP Proxy
    ● OVN doesn’t support MAC-learning (MAC-Port binding learning), but IP-MAC binding can be learned through ARP
    ● How
      ○ LR sends ARP requests for Pod IPs
      ○ ARP proxy in the VM replies with the VM’s MAC for all Pod IPs on the VM
    ● Works, but
      ○ Requires VM and Pods on the same subnet
      ○ Unreliable when the SB DB connection fails
      ○ Scale: O(N), N = number of pods, usually much bigger than the number of VMs
        ■ Note: the incremental processing change handler for IP-MAC Binding is implemented - no re-compute.
    [Diagram: HV running OVS and OVN Controller; VM 10.0.0.2 (aa:bb:cc:dd:ee:ff) with ARP Proxy, hosting Pods 10.0.0.102, 10.0.0.103, 10.0.0.104, 10.0.0.105; SB IP-MAC Binding Table]
    LR ARP Cache (dynamic):
        10.0.0.102 => aa:bb:cc:dd:ee:ff
        10.0.0.103 => aa:bb:cc:dd:ee:ff
        10.0.0.104 => aa:bb:cc:dd:ee:ff
        ...
  • 34. LR Static Route
    ● Assign Pod subnet(s) per VM (minion)
    ● How
      ○ Configure static routes in OVN LR for pod subnets: next hop = VM IP
    ● Considerations
      ○ De-couples VM and Pod subnets
      ○ Declarative, more reliable than ARP
      ○ May waste more IPs, but size of subnet is flexible
      ○ Scale: O(S), S = number of pod subnets
        ■ Worst case O(N), N = number of pods, if subnet size is /32.
    [Diagram: HV running OVS; VM 172.0.0.2/24 hosting Pods 10.0.0.2/25, 10.0.0.3/25, 10.0.0.4/25, 10.0.0.5/25]
    LR Routing Table (static):
        10.0.0.0/25 => 172.0.0.2
        10.0.0.128/25 => 172.0.1.100
        10.0.0.1/25 => 172.0.1.3
        ...
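How the logical router resolves a pod IP to a VM through static routes can be sketched as a longest-prefix-match lookup. The table below uses the first two routes from the slide; the function name is hypothetical, and this models only the lookup, not how OVN programs it.

```python
import ipaddress

# Pod subnet -> next hop (VM IP), taken from the slide's routing table.
ROUTES = {
    "10.0.0.0/25": "172.0.0.2",
    "10.0.0.128/25": "172.0.1.100",
}


def next_hop(pod_ip, routes=ROUTES):
    """Return the VM IP for a pod IP using longest-prefix match, or None."""
    addr = ipaddress.ip_address(pod_ip)
    matches = []
    for prefix, hop in routes.items():
        network = ipaddress.ip_network(prefix)
        if addr in network:
            matches.append((network.prefixlen, hop))
    if not matches:
        return None
    # The most specific (longest) prefix wins.
    return max(matches)[1]


print(next_hop("10.0.0.5"))    # 172.0.0.2
print(next_hop("10.0.0.200"))  # 172.0.1.100
```

The table has one entry per pod subnet, so its size is O(S) regardless of how many pods run inside each subnet.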
  • 35. References
    ● OVS/OVN
      ○ http://www.openvswitch.org/
    ● Networking-OVN
      ○ https://docs.openstack.org/networking-ovn/latest/
    ● OVN-Kubernetes
      ○ https://github.com/openvswitch/ovn-kubernetes/
    ● OVN-Scale-Test
      ○ https://github.com/openvswitch/ovn-scale-test
    ● GO-OVN library
      ○ https://github.com/eBay/go-ovn