Graphs are at the heart of the cloud
Alejandro Erickson
This work is supported by the EPSRC grant
“INPUT: Interconnection Networks, Practice Unites with Theory”,
with Iain Stewart, Javier Navaridas, Abbas Kiasari
April 28, 2015
ACiD Seminar
School of Engineering and Computer Science,
Durham University
Data Centre Networks
What are they?
A collection of servers and networking hardware (switches, load balancers, etc.) connected together at one physical location.
What are they used for?
Video streaming
Data analysis and scientific applications
Indexing the web
Cloud services. The global cloud computing market is soon to exceed $100 billion.
Huge and growing
Google, Amazon, and Microsoft combined are estimated to house over 3 million servers, and these numbers have been growing exponentially. Microsoft Live, in Chicago, covers over 700,000 ft². In 2010, data centres used between 1.1% and 1.5% of total global electricity, and the US EPA reported that data centre power usage doubled from 2000 to 2006.
“Layered” (Single Tier) Data Centre Architecture
Aggregation Layer: “Server-to-server multi-tier traffic flows through the aggregation layer and can use services, such as firewall and server load balancing, to optimize and secure applications ... services, such as content switching, firewall, SSL offload, intrusion detection, network analysis, and more.”
Access Layer: “Where the servers physically attach to the network...” [1]
[1] http://www.cisco.com/c/en/us/td/docs/solutions/Enterprise/Data_Center/DC_Infra2_5/DCInfra_1.html
Shortcomings. Poor:
Scalability
Fault tolerance
Energy efficiency
Hardware cost
Agility
This is widely considered by the research community to be a legacy architecture.
What is an Architecture?
Architecture for “us”
A list of constraints (formal and informal) on graphs. This can be complex and difficult to determine!
For example:
nodes can be of different types, e.g., switch-nodes and server-nodes, with appropriate degree constraints;
edges can be of different types, and certain edges may be forbidden;
constraints on routing algorithms, diameter, embedding in space, etc.
Traditional data centres are topologically “uninteresting”
Most of the functionality is in the “enterprise-level” aggregation switches, and the network topology is limited.
FrameWork vs NowWork
Overview of Architectures
3-layer architecture; single-tier, multi-tier, a.k.a. hierarchical.
Switch-centric, commodity off-the-shelf (indirect networks): fat-tree (Clos-based multi-rooted tree), Al-Fares, Loukissas, Vahdat, SIGCOMM 2008; Jellyfish (random regular graphs), Singla, USENIX 2012.
Hybrid-optical switch: send certain traffic to an optical switch. Helios, Farrington et al., SIGCOMM 2010.
Server-centric networks: DCell, Chuanxiong Guo, SIGCOMM 2008; BCube, Chuanxiong Guo, SIGCOMM 2009; and many others.
Wavelength division multiplexers to combine links into a single optical cable: Quartz, Yupeng James Liu, SIGCOMM 2014.
Free-space optics on steerable controllers for each rack of servers: Firefly, Hamedazimi, SIGCOMM 2014.
Switch-centric commodity data centre
A network of commodity (Ethernet) switches, typically homogeneous. Servers (or server-racks) attach at certain points and are linked by cables (many details omitted!).
Graph theoretic constraints
A connected graph on server-nodes of degree 1 and switch-nodes of degree at most 100 (degree 48 is typical).
Figures: (left) fat-tree Clos (from Al-Fares et al. 2008); (right) Jellyfish random regular graph (from Godfrey’s blog).
Hybrid-optical switch
Replace some switches in a switch-centric network with optical switches. This provides very low latency connections, with a small setup time (to position mirrors or something).
Graph theoretic constraints
Routing algorithms and 3D embedding (wiring) must account for heterogeneous links and switch-nodes (as regards capability and cost). [1]
[1] Figure: Helios topology (from Farrington et al. 2010)
Quartz Ring
Completely interconnect small sets of servers and/or switches in an existing network, using an optical cable (cycle) and wavelength division multiplexers to combine signals.
Graph theoretic constraints
Edit a graph G into a new network by adding small cliques. Routing algorithms, fault tolerance, wiring, and cost must be reconsidered. [2]
[2] Figure: Quartz (from Liu et al. 2014)
Free space optics (FSO)
Each rack of servers has a Top-of-Rack switch, connected to controllable FSOs that communicate with other racks by bouncing an optical signal off a ceiling mirror (radical...).
Graph theoretic constraints
Switch-node degree at most 100, links have a maximum embedded length, the topology is reconfigurable ... [3]
[3] Figures: Firefly (from Hamedazimi et al. 2014)
Server-centric architecture
A direct network of commodity servers with unintelligent, crossbar, commodity switches. Routing intelligence is programmed into the servers.
Graph theoretic constraints
Switch-node degree at most 100; server-node degree small, perhaps 1–4. No switch-node-to-switch-node links. Switch-nodes with their (server-node) neighbours may sometimes be abstracted as cliques. The programmability of servers as networking devices makes this architecture very flexible; many topologies are possible.
Figure: Commodore 64 image from Wikipedia
Only scratched the surface...
More sophisticated constraints
Table lookups are limited by the number of server-nodes.
Data centre components need to be embedded in 3-space in a practical way (i.e., packaging/wiring).
“Uplinks” to the outside world, etc.
We also want good networking properties...
These do not always translate easily into graph theoretic properties. Furthermore, they depend on how the data centre will be used, and networks researchers do not always agree on what they are.
Desirable properties: Real / in Graphs
High throughput: e.g., high bisection width, efficient routing, load balancing
Low latency: low diameter, efficient routing
Low equipment/power cost: few switches and ports, short wires
Easy wiring: ...!?
Scalability: e.g., useful families
Efficiently computable sub-structures for special communications, data centre management, virtualisation: e.g., spanning trees, symmetry (?)
Fault tolerance, graceful degradation: e.g., internally disjoint paths, efficient routing
It is difficult to find graph theoretic properties that are necessary and sufficient conditions for real world properties.
How data centre design happens
Engineers:
Propose a holistic package based on engineering experience, including basic routing, fault-tolerance, load balancing, operating system integration, etc.
Validation by simulation (or Microsoft’s data centre test bed!).
Theoreticians:
Come in later to improve routing, fault tolerance, throughput, etc., for sometimes simplified (blue sky) abstractions of the data centre.
Validation by proving theorems.
A new approach (for theoreticians):
Stay close to current technology, embrace simulation, collaborate with engineers!
Generalized DCell (2008)
31 copies of level 1 DCell
DCell’s k-level links
Let t_k be the number of server-nodes in D_{k,n}, and let [a, b] be the b-th server in the a-th copy of D_{0,n}. Below is the β-DCell_{k,n} connection rule:
Build D_{1,n}: connect [a, b] ↔ [(a + b + 1) mod (t_0 + 1), t_0 − 1 − b].
Label the servers in D_{1,n} by a·t_0 + b and reuse the connection rule to make D_{2,n} (and, in general, D_{k,n}): [a, b] ↔ [(a + b + 1) mod (t_{k−1} + 1), t_{k−1} − 1 − b].
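Since the rule is purely arithmetic, it is easy to exercise in code. Below is a minimal Python sketch of the level-1 rule, assuming (as in DCell) that D_{0,n} is n servers on a single switch, so t_0 = n and D_{1,n} consists of t_0 + 1 copies of D_{0,n}; the function name and representation are illustrative, not taken from the DCell paper.

```python
# A minimal sketch of the beta-DCell level-1 connection rule quoted above.
# Assumption: D_{0,n} is n servers on one switch, so t_0 = n.

def dcell_level1_links(n):
    """Level-1 server-to-server links of D_{1,n}.

    Server [a, b] is the b-th server in the a-th copy of D_{0,n};
    the rule pairs it with [(a + b + 1) mod (t0 + 1), t0 - 1 - b].
    """
    t0 = n
    links = set()
    for a in range(t0 + 1):              # copies of D_{0,n}
        for b in range(t0):              # servers within a copy
            partner = ((a + b + 1) % (t0 + 1), t0 - 1 - b)
            links.add(frozenset({(a, b), partner}))
    return links

links = dcell_level1_links(3)            # D_{1,3}: 4 copies of 3 servers
assert len(links) == 3 * 4 // 2          # the rule is a perfect matching
```

Applying the rule twice returns the original server, and no server is matched to itself, so each server gains exactly one level-1 link.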
DCell properties
n-port, k-level DCell_{k,n}
Number of servers: N > (n + 1/2)^(2^k) − 1/2
Number of switches: N/n
Diameter: at most 2^(k+1) − 1 (the exact diameter is unknown)
Number of internally disjoint paths: unknown (and how many i.d.p.s are there between two server-nodes that are connected to the same switch-node?)
What about bisection width? ... (unknown)
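To make the size bound concrete, here is a minimal sketch (my own, under the standard DCell construction) of the recurrence behind it: each level takes t_{k−1} + 1 copies of the previous level, so t_k = t_{k−1}(t_{k−1} + 1).

```python
# A minimal sketch of the DCell size recurrence: t_0 = n and
# t_k = t_{k-1} * (t_{k-1} + 1), since D_{k,n} is built from
# t_{k-1} + 1 copies of D_{k-1,n}.

def dcell_servers(k, n):
    t = n
    for _ in range(k):
        t = t * (t + 1)
    return t

# Checking the bound N > (n + 1/2)^(2^k) - 1/2 for n = 3:
for k in (1, 2, 3):
    N = dcell_servers(k, 3)
    assert N > (3 + 0.5) ** (2 ** k) - 0.5

# Consistent with the earlier figure caption: for n = 5, t_1 = 30, so
# D_{2,5} is built from t_1 + 1 = 31 copies of the level-1 DCell.
assert dcell_servers(1, 5) + 1 == 31
```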
DCell (2008)
The first of its kind; tends to be a reference point.
Poor load balancing, difficult to analyse.
BCube (2009)
Based on generalized hypercubes K_n^k: replace each n-clique with a switch-node connected to n server-nodes.
Easier to analyse, low diameter, fault tolerant.
Expensive, small scale, difficult wiring.
FiConn (2009)
Similar to DCell, but server-nodes have degree 1 or 2.
Idea: level-k edges use only half of the degree-1 server-nodes of FiConn_{k−1,n} to build FiConn_{k,n}.
Figure: DCell
“Star”-replaced networks
Let G be a graph, and let G* be the graph that is obtained by subdividing each edge of G twice.
The vertices of G become switch-nodes W, and the new vertices are server-nodes S.
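A minimal sketch of this construction in Python, assuming a plain edge-list input; the node encodings are illustrative.

```python
# A minimal sketch of the star-replaced construction: subdivide every edge
# of G twice, so each vertex of G becomes a switch-node and each edge uv
# contributes two adjacent server-nodes s_uv and s_vu.

def star_replace(edges):
    """Return (switches, servers, adj) of G*, where each edge uv of G
    becomes the path u - s_uv - s_vu - v."""
    adj = {}
    def link(x, y):
        adj.setdefault(x, set()).add(y)
        adj.setdefault(y, set()).add(x)
    switches, servers = set(), set()
    for u, v in edges:
        switches.update({("sw", u), ("sw", v)})
        s_uv, s_vu = ("sv", u, v), ("sv", v, u)
        servers.update({s_uv, s_vu})
        link(("sw", u), s_uv)   # each server attaches to its own switch...
        link(s_uv, s_vu)        # ...and to one other server
        link(s_vu, ("sw", v))
    return switches, servers, adj

# e.g. K_3 yields 3 switch-nodes and 6 dual-port server-nodes
sw, sv, adj = star_replace([(0, 1), (1, 2), (0, 2)])
assert len(sw) == 3 and len(sv) == 6 and all(len(adj[s]) == 2 for s in sv)
```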
Relevance of G*
Every dual-port server-centric data centre network in which each server-node connects to one switch-node and one other server-node can be obtained by a star-replaced construction.
Commodity servers tend to have 2 NIC (network interface controller) ports.
More server-nodes per switch-node than a single subdivision.
Each server-node “belongs” to exactly 1 switch-node.
Retains some aspects of the node/edge/arc symmetry of G.
Many “networking” properties of G* can be derived from those of G.
Properties of G*
Certain properties of G* are a function of properties of G:
Internally disjoint paths (between switch-nodes).
The number of server-nodes is twice |E(G)|.
The diameter of G*, measured in server-to-server hops, is about twice that of G.
Routing algorithms in G translate directly to routing algorithms in G*.
When G is regular, the bisection width of G* can be computed exactly from the solution to the edge-isoperimetric problem on G.
Data Centre Network from K_n^k
Let GQ_{k,n} = K_n^k = K_n × K_n × ... × K_n (k times), a generalized hypercube; e.g., GQ_{2,3} = K_3 × K_3, whose star-replaced network is GQ*_{2,3} = (K_3 × K_3)*.
Each vertex has a label a_0 a_1 ... a_{k−1} with 0 ≤ a_i ≤ n − 1, and two vertices are connected by an edge if their labels differ in exactly one coordinate.
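Here is a minimal sketch generating GQ_{k,n} from that definition (illustrative names; vertices as tuples of digits).

```python
# A minimal sketch of the generalized hypercube GQ_{k,n} = K_n^k: vertices
# are length-k labels over {0, ..., n-1}; two vertices are adjacent iff
# their labels differ in exactly one coordinate.
from itertools import product

def gq_edges(k, n):
    verts = list(product(range(n), repeat=k))
    edges = []
    for v in verts:
        for i in range(k):                   # coordinate that differs
            for a in range(v[i] + 1, n):     # avoid double counting
                edges.append((v, v[:i] + (a,) + v[i + 1:]))
    return verts, edges

verts, edges = gq_edges(2, 3)                # GQ_{2,3} = K_3 x K_3
assert len(verts) == 3 ** 2
assert len(edges) == len(verts) * 2 * (3 - 1) // 2   # k(n-1)-regular
```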
Bisection width vs S-bisection width
Bisection width bw(G)
The minimum number of edges whose removal partitions the vertices into two halves.
Let G be an interconnection network on N nodes with bw(G) = β, under the random traffic pattern.
Throughput of a link = 1 / #(flows using the link).
Throughput of G = min{ throughput of a link }.
On average, half of the flows use a cut-link, so at least one link is involved in at least (N/2)/β flows, and the throughput is at most 2β/N.
So, the higher the bisection width, the higher the (estimated) throughput.
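As a back-of-the-envelope check, the bound is one line of code (link capacity normalised to 1; the numbers below come from the GQ table later in the talk).

```python
# A sketch of the throughput estimate above: under uniform random traffic
# about half of the N flows cross the bisection, so some cut link carries
# at least (N/2)/beta flows and per-flow throughput is at most 2*beta/N.

def throughput_upper_bound(n_servers, bisection_width):
    return 2 * bisection_width / n_servers

# e.g. the k = 4, n = 10 row of the later GQ table:
print(throughput_upper_bound(360_000, 25_000))   # 0.1388..., ~14% of line rate
```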
S-bisection width bw_S(G*)
The minimum number of edges whose removal partitions the nodes into two parts, each containing half of the server-nodes.
The standard text (Dally and Towles) insists on partitioning both switches and processors but, in practice, S-bisection width is what is used. This seems to be the first time it has been formalised!
Can we characterise optimal partitions of the switch-nodes? Can we compute these partitions efficiently?
A slight misfortune...
Lemma
Let G = (V, E) be a regular graph. It is always the case that bw_S(G*) ≤ bw(G).
Lemma
For all n ≥ 2, we have that bw_S(K*_{2n}) < bw(K_{2n}).
Proof.
Let [R, T] be the (minimum) edge-cut that bipartitions switches and servers. The modified edge-cut [R′, T′] is smaller than [R, T] and it bipartitions the servers.
[Figure: the cuts [R, T] and [R′, T′] in K*_{2n}; legend: server, switch]
Edge-isoperimetric problems
Let G = (V, E), and let R ⊆ V.
I_G(R): the number of edges with both ends in R.
Θ_G(R): the number of edges with exactly one end in R.
Edge-isoperimetric subsets R
1. Find I_G(r) = max{ I_G(R) : |R| = r }
2. Find Θ_G(r) = min{ Θ_G(R) : |R| = r }
The study of isoperimetric subsets has a long history (Bezrukov 1999), e.g., n-partite graphs; hypercubes; Cartesian products of cliques, bipartite graphs, the Petersen graph; grids, ...
Muradyan, Piliposjan ’80; Harper ’64; Lindsey ’64; Ahlswede, Cai ’97; Bezrukov, Elsässer ’98; Bollobás, Leader ’91; Golovach ’94; Mohar ...
Theorem (E., Kiasari, Navaridas, Stewart (2015))
Let G = (V, E) be a d-regular graph. The S-bisection width of G* is given by min{ w_r : 0 ≤ r ≤ |V| }, where
w_r = rd − 2⌊|E|/2⌋, if |E| ≤ 2 I_G(r);
w_r = Θ_G(r), if 2 I_G(r) < |E| ≤ 2 I_G(r) + 2 Θ_G(r);
w_r = 2⌈|E|/2⌉ − rd, if 2 I_G(r) + 2 Θ_G(r) < |E|.
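The theorem turns computing bw_S(G*) into evaluating w_r for each r, given the two isoperimetric functions. A minimal sketch follows; my reading of the floors and ceilings in the displayed formula is an assumption, as is everything else about the code.

```python
# A minimal sketch of the theorem above: compute the S-bisection width of
# G* for a d-regular graph G with m edges, given oracles I(r) and theta(r).

def s_bisection_width(n_vertices, d, m, I, theta):
    best = None
    for r in range(n_vertices + 1):
        if m <= 2 * I(r):
            w = r * d - 2 * (m // 2)
        elif m <= 2 * I(r) + 2 * theta(r):
            w = theta(r)
        else:
            w = 2 * ((m + 1) // 2) - r * d
        best = w if best is None else min(best, w)
    return best

# For K_4: I(r) = C(r, 2), theta(r) = r(4 - r); this gives bw_S(K_4*) = 3,
# strictly below bw(K_4) = 4, as the lemma two slides back predicts.
from math import comb
assert s_bisection_width(4, 3, 6, lambda r: comb(r, 2),
                         lambda r: r * (4 - r)) == 3
```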
Proof idea
Let G = (V, E) and let G* = (S ∪ W, E*), with server-nodes S and switch-nodes W. Let (R_W, T_W) be a partition of W, and let B = [R_W ∪ R_S, T_W ∪ T_S] be a minimal S-bisection of G* (i.e., with |R_S| = |T_S| = |E|).
Types of 3-paths:
Type-R: both ends in R_W; contributes 0 or 2 edges to B.
Type-RT: one end in R_W, one end in T_W; contributes 1 edge to B.
Type-T: both ends in T_W; contributes 0 or 2 edges to B.
Claim
[R_W ∪ R_S, T_W ∪ T_S] is a minimal S-bisection (w.r.t. r) if R_W maximises I_G(R_W) (with |R_W| = r), and its size is
w_r = rd − 2⌊|E|/2⌋, if |E| ≤ 2 I_G(r);
w_r = Θ_G(r), if 2 I_G(r) < |E| ≤ 2 I_G(r) + 2 Θ_G(r);
w_r = 2⌈|E|/2⌉ − rd, if 2 I_G(r) + 2 Θ_G(r) < |E|.
S-bisection width of GQ*_{k,n}
Theorem (Lindsey (1964), Nakano (1994))
For 1 ≤ t ≤ n^k, I_{GQ_{k,n}}(t) = Σ_{i=0}^{t−1} w_n(i), where w_n(i) is the sum of the k base-n ‘digits’ of i. Nakano gives a formula.

k    n    servers      bw(GQ_{k,n})   bw_S(GQ*_{k,n})   difference
2    32      63,488         8,192            7,788            404
2    33      69,696         9,248            8,580            668
3    20     456,000        40,000           39,600            400
3    21     555,660        50,930           47,628          3,302
3    22     670,824        58,564           57,856            708
4     9     209,952        16,400           14,580          1,820
4    10     360,000        25,000           25,000              0
4    11     585,640        43,920           39,930          3,990
4    12     912,384        62,208           62,208              0
4    13   1,370,928        99,960           92,274          7,686
4    14   1,997,632       134,456          134,400             56
4    15   2,835,000       202,496          189,000         13,496
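The digit-sum formula is a few lines of Python; the sketch below reuses s_bisection_width from the earlier sketch to confirm one small instance of the hypercube theorem on the next slide (names illustrative).

```python
# A minimal sketch of the Lindsey/Nakano formula: I_{GQ_{k,n}}(t) is the sum
# over i < t of the base-n digit sum of i, taken over k digits.

def digit_sum(i, n, k):
    s = 0
    for _ in range(k):
        i, d = divmod(i, n)
        s += d
    return s

def I_gq(t, n, k):
    return sum(digit_sum(i, n, k) for i in range(t))

# For Q_3 = GQ_{3,2} (d = 3, m = 12 edges), combining with the earlier
# s_bisection_width sketch reproduces bw_S(Q_3*) = 2^(3-1) = 4, matching
# the theorem on the next slide.
d, m = 3, 12
I = lambda r: I_gq(r, 2, 3)
theta = lambda r: r * d - 2 * I(r)    # d-regular: theta(r) = rd - 2 I(r)
assert s_bisection_width(8, d, m, I, theta) == 4
```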
S-bisection width of GQ*_{k,2}: the hypercube
Theorem (E., Kiasari, Navaridas, Stewart (2015))
bw_S(Q*_n) = 2^(n−1) = bw(Q_n)
Proof sketch.
Let [R_W ∪ R_S, T_W ∪ T_S] be an S-bisection of Q*_n. The case |R_W| = |T_W| is easy, so assume r := |R_W| < 2^(n−1) and suppose bw_S(Q*_n) < 2^(n−1).
We have 2^(n−1) > bw_S(Q*_n) ≥ Θ_{Q_n}(R_W) = rn − 2I(R_W) ≥ rn − 2I(r), and we use properties of I(r) (e.g., I(2^(n−2) + a) = I(2^(n−2)) + a + I(a), for a < 2^(n−2)) to show that r < 2^(n−2).
Use the previously mentioned Type-R, RT, and T paths to show that |R_S| = n·2^(n−1) ≤ 2(I(R_W) + Θ(R_W) + b), where b counts servers from Type-T paths, and plug in I(r) for another contradiction.
Practical application: compare with FiConn
[Figure: bisection width vs number of servers (log-log axes), for GQ* with diameter D = 5, 7, 9, and for FiConn_{2,n} and FiConn_{3,n}]

              GQ*_{3,10}   GQ*_{4,6}   FiConn_{2,24}
server-nodes     27,000       25,920        24,648
switch-nodes      1,000        1,296         1,027
switch-radix         27           20            24
links            81,000       77,760        67,782
diameter              7            9             7
bw_S              2,496        1,944         1,560
Results on DCell
Theorem (Wang, E., Fan, Jia, 2015)
DCell is (almost always) Hamiltonian connected.
Re: DCell+ (E., Kiasari, Navaridas, Stewart (2015))
Simultaneously compute shorter (than the best known) paths and balance load by doing an efficient “intelligent” search.
[Figure: percent mean hop-length savings over the best known routing]
DPillar_{n,k}: a dual-port SCDCN
Server-nodes are labelled (c, v_{k−1} v_{k−2} ··· v_0), with column c ∈ {0, 1, ..., k − 1} and v_i ∈ {0, 1, ..., n/2 − 1}.
Servers (c, v_{k−1} ··· v_{c+1} v_c v_{c−1} ··· v_0) and (c + 1, v_{k−1} ··· v_{c+1} ∗ v_{c−1} ··· v_0) connect to the same switch, where ∗ stands for any symbol.
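A minimal sketch of this wiring, encoding the shared switch by wildcarding the coordinate between the two columns; the encoding and function names are my own, illustrative only.

```python
# A minimal sketch of the DPillar wiring above: a switch sitting between
# server columns c and c+1 is identified by a label with coordinate c
# wildcarded, and connects the n/2 matching servers in each of the two
# columns (so n servers per switch, and each server uses both NIC ports).
from itertools import product

def dpillar_switch(c, v, k, side):
    """Switch on the clockwise (side=+1) or anticlockwise (side=-1) port
    of server (c, v)."""
    col = c if side == +1 else (c - 1) % k     # wildcarded coordinate
    return (col, v[:col] + ("*",) + v[col + 1:])

k, n = 3, 4                                    # symbols range over {0, 1}
sw = dpillar_switch(1, (0, 1, 0), k, +1)       # the switch (1, (0, '*', 0))
peers = [(c, v) for c in range(k)
         for v in product(range(n // 2), repeat=k)
         for side in (+1, -1) if dpillar_switch(c, v, k, side) == sw]
assert len(peers) == n    # n/2 servers in column 1 and n/2 in column 2
```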
Shortest path routing on DPillar
In spite of considerable fanfare, no shortest path routing algorithm was known for DPillar.
Theorem (E., Kiasari, Navaridas, Stewart (2015))
There is a routing algorithm that computes shortest paths in O(k) time.
Sketch.
Let src and dst be server-nodes of DPillar_{n,k} which differ at coordinate c. We need to route through (or near) column c to change coordinate c.
Map DPillar routing to visiting marked vertices on a k-cycle: given vertices x and y in the cycle, we need to “cover” the marked vertex c in order to change the symbol v_c.
If such a walk is of minimum length, it changes direction at most twice, and we need to perform up to k steps to find these changes of direction.
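The cycle abstraction can be brute-forced directly, using the structural fact above that a minimum walk changes direction at most twice; the sketch below simply enumerates the two turning points (an O(k^2)-ish illustration, not the O(k) algorithm of the theorem).

```python
# A brute-force sketch of the k-cycle abstraction: the shortest walk from
# x to y on a k-cycle that visits every marked vertex, enumerating walks
# with at most two changes of direction.

def arc(start, steps, direction, k):
    """Vertices visited walking `steps` edges from `start` in `direction`."""
    return {(start + direction * s) % k for s in range(steps + 1)}

def shortest_covering_walk(k, x, y, marks):
    best = None
    for d0 in (+1, -1):                    # initial direction
        for d1 in range(k):                # length of first leg
            for d2 in range(k):            # length of reversed second leg
                p1 = (x + d0 * d1) % k
                p2 = (p1 - d0 * d2) % k
                d3 = (d0 * (y - p2)) % k   # third leg, initial direction again
                covered = (arc(x, d1, d0, k) | arc(p1, d2, -d0, k)
                           | arc(p2, d3, d0, k))
                if set(marks) <= covered:
                    length = d1 + d2 + d3
                    best = length if best is None else min(best, length)
    return best

# k = 8: from column 1 to column 2 while covering column 6, it is cheaper
# to overshoot and come back (1-0-7-6-7-0-1-2, length 7) than to go around.
assert shortest_covering_walk(8, 1, 2, {6}) == 7
```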
There is a lot to do!
Improve on recently proposed designs by finding new routing algorithms or discovering useful properties.
Propose your own designs, soundly based in theory.
Apply graph embedding algorithms to improve data centre virtualisation.
Determine the set of relevant graph theoretic properties for each data centre usage situation.
Relate topology to energy usage.
Consider the wiring problem.
A lot of data centre research is effectively being gifted to theorists. Take hold of it!
Thank you.
More information at
http://alejandroerickson.com
More Related Content

PDF
Gridcomputing
DOCX
Grid computing assiment
PDF
Approach to minimizing consumption of energy in wireless sensor networks
PDF
Corona based energy efficient clustering in wsn 2
PPTX
Unit i introduction to grid computing
PPTX
Cluster computing
PDF
An overview on application of machine learning techniques in optical networks
PDF
Iaetsd survey on big data analytics for sdn (software defined networks)
Gridcomputing
Grid computing assiment
Approach to minimizing consumption of energy in wireless sensor networks
Corona based energy efficient clustering in wsn 2
Unit i introduction to grid computing
Cluster computing
An overview on application of machine learning techniques in optical networks
Iaetsd survey on big data analytics for sdn (software defined networks)

What's hot (19)

PPT
Disambiguating Advanced Computing for Humanities Researchers
PDF
zenoh -- the ZEro Network OverHead protocol
PPT
Tech bash'09
PDF
Cyclone DDS: Sharing Data in the IoT Age
PDF
Optical Network Infrastructure for Grid, Draft-ggf-ghpn-opticalnets-1
PDF
Dynamic Hybrid Topology Design for Integrated Traffic Support in WDM Mesh Net...
PPTX
Cluster computing
PDF
Implementing K-Out-Of-N Computing For Fault Tolerant Processing In Mobile and...
PDF
A Comparison of Cloud Execution Mechanisms Fog, Edge, and Clone Cloud Computing
PDF
Pronet Public Presentation v1 2
DOCX
IEEE 2014 JAVA PARALLEL DISTRIBUTED PROJECTS Constructing load balanced data ...
PPTX
Grid computing the grid
PDF
Grid Computing Frameworks
PPT
Computing Outside The Box
PPT
Grid computing [2005]
PDF
Multicore series-1-0223
PDF
QUAD TREE BASED STATIC MULTI HOP LEACH ENERGY EFFICIENT ROUTING PROTOCOL: A N...
PPTX
Grid computing
Disambiguating Advanced Computing for Humanities Researchers
zenoh -- the ZEro Network OverHead protocol
Tech bash'09
Cyclone DDS: Sharing Data in the IoT Age
Optical Network Infrastructure for Grid, Draft-ggf-ghpn-opticalnets-1
Dynamic Hybrid Topology Design for Integrated Traffic Support in WDM Mesh Net...
Cluster computing
Implementing K-Out-Of-N Computing For Fault Tolerant Processing In Mobile and...
A Comparison of Cloud Execution Mechanisms Fog, Edge, and Clone Cloud Computing
Pronet Public Presentation v1 2
IEEE 2014 JAVA PARALLEL DISTRIBUTED PROJECTS Constructing load balanced data ...
Grid computing the grid
Grid Computing Frameworks
Computing Outside The Box
Grid computing [2005]
Multicore series-1-0223
QUAD TREE BASED STATIC MULTI HOP LEACH ENERGY EFFICIENT ROUTING PROTOCOL: A N...
Grid computing
Ad

Viewers also liked (10)

PPTX
Bisection method
PPTX
Numerical on bisection method
PPTX
Bisection & Regual falsi methods
PPTX
Bisection method
PDF
Intensity Modulated Radiation Therapy (IMRT)
PPT
Newton Raphson method for load flow analysis
PDF
Numerical Methods 1
PDF
Numerical Methods - Oridnary Differential Equations - 1
PPTX
Application of Numerical method in Real Life
PPT
Applications of numerical methods
Bisection method
Numerical on bisection method
Bisection & Regual falsi methods
Bisection method
Intensity Modulated Radiation Therapy (IMRT)
Newton Raphson method for load flow analysis
Numerical Methods 1
Numerical Methods - Oridnary Differential Equations - 1
Application of Numerical method in Real Life
Applications of numerical methods
Ad

Similar to Graphs are at the Heart of the Cloud (20)

PDF
IJSRED-V1I1P5
PDF
Network on Chip Architecture and Routing Techniques: A survey
PPT
UIC Thesis Corbetta
PDF
A Scalable, Commodity Data Center Network Architecture
PDF
Mobile adhoc networks
PDF
ABB Corporate Research: Overview of Wired Industrial Ethernet Switching Solut...
PDF
Proposed Scheme for Secured Routing in MANET
PPTX
Wireless FasterData and Distributed Open Compute Opportunities and (some) Us...
PPTX
Andrew Wiedlea - Wireless FasterData and Distributed Open Compute Opportuniti...
PDF
Backbone Network Design Network Design And Performance Analysis.pdf
PDF
Efficient multicast delivery for data redundancy minimization over wireless d...
PDF
Nt1310 Unit 3 Data Analysis Essay
PDF
Chapter 8 the role of networking in manufacturing
PPTX
Designing network topology.pptx
PPT
Real World Testbeds Emulation for Mobile Ad-hoc Networks
PDF
Interference Revelation in Mobile Ad-hoc Networks and Confrontation
PDF
Essay On Ethernet
PPT
Grid computing
PDF
ICCT2017: A user mode implementation of filtering rule management plane using...
PDF
IRJET- Study of Various Network Simulators
IJSRED-V1I1P5
Network on Chip Architecture and Routing Techniques: A survey
UIC Thesis Corbetta
A Scalable, Commodity Data Center Network Architecture
Mobile adhoc networks
ABB Corporate Research: Overview of Wired Industrial Ethernet Switching Solut...
Proposed Scheme for Secured Routing in MANET
Wireless FasterData and Distributed Open Compute Opportunities and (some) Us...
Andrew Wiedlea - Wireless FasterData and Distributed Open Compute Opportuniti...
Backbone Network Design Network Design And Performance Analysis.pdf
Efficient multicast delivery for data redundancy minimization over wireless d...
Nt1310 Unit 3 Data Analysis Essay
Chapter 8 the role of networking in manufacturing
Designing network topology.pptx
Real World Testbeds Emulation for Mobile Ad-hoc Networks
Interference Revelation in Mobile Ad-hoc Networks and Confrontation
Essay On Ethernet
Grid computing
ICCT2017: A user mode implementation of filtering rule management plane using...
IRJET- Study of Various Network Simulators

Recently uploaded (20)

PPTX
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PDF
cuic standard and advanced reporting.pdf
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PDF
KodekX | Application Modernization Development
PDF
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
PDF
Machine learning based COVID-19 study performance prediction
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PDF
Modernizing your data center with Dell and AMD
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PPT
Teaching material agriculture food technology
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
DOCX
The AUB Centre for AI in Media Proposal.docx
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
cuic standard and advanced reporting.pdf
Building Integrated photovoltaic BIPV_UPV.pdf
KodekX | Application Modernization Development
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
Machine learning based COVID-19 study performance prediction
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
20250228 LYD VKU AI Blended-Learning.pptx
Modernizing your data center with Dell and AMD
The Rise and Fall of 3GPP – Time for a Sabbatical?
Diabetes mellitus diagnosis method based random forest with bat algorithm
Teaching material agriculture food technology
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
Per capita expenditure prediction using model stacking based on satellite ima...
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
Mobile App Security Testing_ A Comprehensive Guide.pdf
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
The AUB Centre for AI in Media Proposal.docx

Graphs are at the Heart of the Cloud

  • 1. Graphs are at the heart of the cloud Alejandro Erickson This work is supported by the EPSRC grant “INPUT: Interconnection Networks, Practice Unites with Theory”, with Iain Stewart, Javier Navaridas, Abbas Kiasari April 28, 2015 ACiD Seminar School of Engineering and Computer Science, Durham University 1/38
  • 2. Data Centre Networks What are they? A collection of servers and networking hardware (switches, load balancers etc.) connected together at one physical location. What are they used for? Video streaming Data analysis and scientific applications Indexing the web Cloud services. Global cloud computing market soon to exceed $100 billion. 2/38
  • 3. Huge and growing Google, Amazon, and Microsoft combined house esimated over 3 million servers, and these numbers have been growing exponentially. Microsoft Live, in Chicago, covers over 700, 000ft2 . In 2010, data centres used between 1.1% and 1.5% of total global electricity, and the US EPA reported data centre power usage doubled from 2000 to 2006. 3/38
  • 4. Huge and growing Google, Amazon, and Microsoft combined house esimated over 3 million servers, and these numbers have been growing exponentially. Microsoft Live, in Chicago, covers over 700, 000ft2 . In 2010, data centres used between 1.1% and 1.5% of total global electricity, and the US EPA reported data centre power usage doubled from 2000 to 2006. 3/38
  • 5. “Layered” (Single Tier) Data Centre Architecture 4/38
  • 6. “Layered” (Single Tier) Data Centre Architecture Aggregation Layer: “Server-to-server multi-tier traffic flows through the aggregation layer and can use services, such as firewall and server load balancing, to optimize and secure applications ... services, such as content switching, firewall, SSL offload, intrusion detection, network analysis, and more.” Access Layer: “Where the servers physically attach to the network...” 1 1 http://www.cisco.com/c/en/us/td/docs/solutions/ Enterprise/Data_Center/DC_Infra2_5/DCInfra_1.html 4/38
  • 7. “Layered” (Single Tier) Data Centre Architecture Faults. Poor: Scalability Fault tolerance Energy efficiency Hardware cost Agility This is widely considered by the research community to be a legacy architecture. 4/38
  • 8. What is an Architecture? Architecture for “us” A list of constraints (formal and informal) on graphs. This can be complex and difficult to determine! For example nodes can be of different types; e.g., switch-nodes, server-nodes, with appropriate degree constraints. edges can be of different types, and certain edges may be forbidden. constraints on routing algorithms, diameter, embedding in space etc... Traditional data centre topologically “uninteresting” Most of the functionality is in the “enterprise-level” aggregation switches, and the network topology is limited. 5/38
  • 13. Overview of Architectures 3-layer architecture; single-tier, multi-tier, a.k.a. hierarchical. Switch-centric, commodity off-the-shelf (indirect networks): fat-tree (Clos-based multi-rooted tree), Al-Fares, Loukissas, Vahdat, SIGCOMM 2008. Jellyfish (random-regular graphs), Singla, USENIX 2012. Hybrid-optical switch; send certain traffic to an optical switch. Helios, Farrington et al., SIGCOMM 2010. Server-centric networks. DCell, Chuanxiong Guo, SIGCOMM 2008. BCube, Chuanxiong Guo, SIGCOMM 2009, and many others. Use wavelength division multiplexers to combine links into a single optical cable. Quartz, Yupeng James Liu, SIGCOMM 2014. Free-space optics on steerable controllers for each rack of servers. Firefly, Hamedazimi, SIGCOMM 2014. 7/38
  • 19. Switch-centric commodity data centre Network of commodity (Ethernet) switches, typically homogeneous. Servers (or server-racks) attach at certain points. Linked by cables. (Many details omitted!) Graph theoretic constraints Connected graph on server-nodes of degree 1, switch-nodes of degree at most 100 (degree 48 is typical). Figures: (left) fat-tree Clos (from Al-Fares et al. 2008); (right) Jellyfish random-regular graph, from Godfrey’s blog. 8/38
  • 20. Hybrid-optical switch Replace some switches in a switch-centric network with optical switches. This provides very low latency connections, with a small setup time (to position mirrors, for example). Graph theoretic constraints Routing algorithms and 3D embedding (wiring) must account for heterogeneous links and switch-nodes (as regards capability and cost). Figure: Helios topology (from Farrington et al. 2010) 9/38
  • 21. Quartz Ring Completely interconnect small sets of servers and/or switches in an existing network using an optical cable (cycle) and wavelength division multiplexers to combine signals. Graph theoretic constraints Edit a graph G into a new network by adding small cliques. Routing algorithms, fault tolerance, wiring, and cost must be reconsidered. Figure: Quartz (from Liu et al. 2014) 10/38
  • 22. Free space optics (FSO) Each rack of servers has a Top-of-Rack switch, connected to controllable FSOs that communicate with other racks by bouncing an optical signal off a ceiling mirror (radical...). Graph theoretic constraints Switch-node degree at most 100, links have a maximum embedded length, topology is reconfigurable ... Figures: Firefly (from Hamedazimi et al. 2014) 11/38
  • 23. Server-centric architecture A direct network of commodity servers with unintelligent, crossbar, commodity switches. Routing intelligence is programmed into the servers. Graph theoretic constraints Switch-node degree at most 100, server-node degree small, perhaps 1–4. No switch-node-to-switch-node links. Switch-nodes with their (server-node) neighbours may sometimes be abstracted as cliques. The programmability of servers as networking devices makes this architecture very flexible; many topologies are possible. Figure: Commodore 64 image from Wikipedia 12/38
  • 25. Only scratched the surface... More sophisticated constraints Table lookups are limited by the number of server-nodes. Data centre components need to be embedded in 3-space in a practical way (i.e. packaging/wiring). “Uplinks” to outside world. etc. We also want good networking properties... These do not always translate easily into graph theoretic properties. Furthermore, they depend on how the data centre will be used, and networks researchers do not always agree on what they are. 13/38
  • 29. Desirable properties: Real / in Graphs High throughput: e.g. High bisection width, efficient routing, load balancing Low latency: Low diameter, efficient routing Low equipment/power cost: Few switches and ports, short wires Easy wiring: ...!? Scalability: e.g. useful families Efficiently computable sub-structures for special communications, data centre management, virtualisation: e.g. spanning trees, symmetry (?) Fault tolerant, graceful degradation: e.g. internally disjoint paths, efficient routing Difficult to find graph theoretic properties that are necessary and sufficient conditions for real world properties. 14/38
  • 36. How data centre design happens Engineers: Propose a holistic package based on engineering experience, including basic routing, fault-tolerance, load balancing, operating system integration, etc. Validation by simulation (or Microsoft’s data centre test bed!). Theoreticians: Come in later to improve routing, fault tolerance, throughput, etc, for sometimes simplified (blue sky) abstractions of the data centre. Validation by proving theorems. A new approach (for theoreticians): Stay close to current technology, embrace simulation, collaborate with engineers! 15/38
  • 43. Generalized DCell (2008) 31 copies of level 1 DCell 16/38
  • 46. DCell’s k-level links Let $t_k$ be the number of server-nodes in $D_{k,n}$. Let $[a, b]$ be the $b$th server in the $a$th copy of $D_{0,n}$. Below is the β-DCell$_{k,n}$ connection rule: Build $D_{1,n}$: connect $[a, b] \leftrightarrow [a + b + 1 \pmod{t_0 + 1},\; t_0 - 1 - b]$. Label servers in $D_{1,n}$ by $a t_0 + b$. Reuse the connection rule to make $D_{2,n}$: $[a, b] \leftrightarrow [a + b + 1 \pmod{t_{k-1} + 1},\; t_{k-1} - 1 - b]$. 17/38
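The connection rule is easy to exercise in code. Below is a minimal sketch (the function name and encoding are mine, not from the DCell paper) that generates the level-k links among the $t_{k-1} + 1$ copies of a level-(k−1) cell:

```python
def dcell_level_links(t_prev):
    """Level-k links among t_prev + 1 copies of a level-(k-1) cell
    with t_prev servers each, using the rule
    [a, b] <-> [a + b + 1 (mod t_prev + 1), t_prev - 1 - b]."""
    links = set()
    for a in range(t_prev + 1):      # index of the level-(k-1) copy
        for b in range(t_prev):      # server index within the copy
            partner = ((a + b + 1) % (t_prev + 1), t_prev - 1 - b)
            links.add(frozenset({(a, b), partner}))
    return links

# The rule is a fixed-point-free involution, so each server gets exactly
# one level-k link: 5 copies x 4 servers = 20 servers -> 10 links.
assert len(dcell_level_links(4)) == 10
```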
  • 49. DCell properties n-port, k-level DCell$_{k,n}$. Number of servers: $N > (n + \frac{1}{2})^{2^k} - \frac{1}{2}$. Number of switches: $N/n$. Diameter: at most $2^{k+1} - 1$ (exact diameter unknown). Number of internally disjoint paths: unknown (and how many i.d.p.s are there between two server-nodes that are connected to the same switch-node?). What about bisection width?... (unknown) 18/38
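For concreteness, a quick sketch of the doubly exponential growth, assuming the standard DCell recurrence $t_0 = n$, $t_k = t_{k-1}(t_{k-1} + 1)$ (each server gains exactly one level-k link, consistent with the connection rule above):

```python
def dcell_servers(n, k):
    """t_k, the number of servers in DCell_{k,n}: t_0 = n, and level k
    joins t_{k-1} + 1 copies of the level-(k-1) cell."""
    t = n
    for _ in range(k):
        t = t * (t + 1)
    return t

# The slide's lower bound N > (n + 1/2)^(2^k) - 1/2 (strict for k >= 1):
for k in range(1, 4):
    assert dcell_servers(6, k) > (6 + 0.5) ** (2 ** k) - 0.5
```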
  • 54. DCell (2008) First one of its kind, tends to be a reference point. Poor load balancing, difficult to analyse. BCube (2009) Based on generalized hypercubes, $K_n^k$; replace each n-clique with a switch-node connected to n server-nodes. Easier to analyse, low diameter, fault tolerant. Expensive, small scale, difficult wiring. FiConn (2009) Similar to DCell, but server-nodes are degree 1 or 2. Idea: level-k edges use only half the server-nodes of degree 1 from FiConn$_{k-1,n}$ to build FiConn$_{k,n}$. 19/38
  • 62. “Star”-replaced networks Let G be a graph, and let $G^*$ be the graph obtained by subdividing each edge of G twice. Vertices of G become switch-nodes W and the new vertices are server-nodes S. 21/38
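A minimal sketch of the star-replaced construction (the node encoding is my own, for illustration):

```python
def star_replace(edges):
    """Build G* from the edge list of G: each edge (u, v) becomes the
    path u - s_uv - s_vu - v, where u and v become switch-nodes and
    s_uv, s_vu are new server-nodes (s_uv 'belongs' to switch u)."""
    adj = {}
    def link(x, y):
        adj.setdefault(x, set()).add(y)
        adj.setdefault(y, set()).add(x)
    for u, v in edges:
        s_uv, s_vu = ("s", u, v), ("s", v, u)  # the two server-nodes
        link(("w", u), s_uv)                   # switch-to-server edge
        link(s_uv, s_vu)                       # server-to-server edge
        link(s_vu, ("w", v))                   # server-to-switch edge
    return adj

# K_3 (a triangle) gains 2 * |E| = 6 server-nodes, as the slides note.
g_star = star_replace([(0, 1), (1, 2), (0, 2)])
assert sum(1 for x in g_star if x[0] == "s") == 6
```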
  • 65. Relevance of $G^*$ Every dual-port server-centric data centre network in which each server-node connects to one switch-node and one other server-node can be obtained by the star-replaced construction. Commodity servers tend to have 2 NIC (network interface controller) ports. More server-nodes per switch-node than a single subdivision would give. Each server-node “belongs” to exactly 1 switch-node. Retains some aspects of the node/edge/arc symmetry of G. Many “networking” properties of $G^*$ can be derived from those of G. 22/38
  • 70. Properties of $G^*$ Certain properties of $G^*$ are a function of properties of G: Internally disjoint paths (between switch-nodes). Number of server-nodes is twice $|E(G)|$. Diameter of $G^*$, measured in server-server hops, is about twice that of G. Routing algorithms in G translate directly to routing algorithms in $G^*$. When G is regular, the bisection width of $G^*$ can be computed exactly from the solution to the edge-isoperimetric problem on G. 23/38
  • 75. Data Centre Network from $K_n^k$ Let $GQ_{k,n} = K_n^k = \underbrace{K_n \times K_n \times \cdots \times K_n}_{k \text{ times}}$, which is a generalized hypercube; e.g., $GQ_{2,3} = K_3 \times K_3$. Each vertex has a label $a_0 a_1 \ldots a_{k-1}$ with $0 \le a_i \le n - 1$, and vertices are connected by an edge if their labels differ in exactly one coordinate. 24/38
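A small sketch of the labelling and adjacency rule, assuming labels encoded as tuples over {0, ..., n−1}:

```python
from itertools import product

def gq_vertices(k, n):
    """Vertices of GQ_{k,n} = K_n^k: all length-k labels over {0..n-1}."""
    return list(product(range(n), repeat=k))

def gq_adjacent(u, v):
    """Edge iff the labels differ in exactly one coordinate."""
    return sum(a != b for a, b in zip(u, v)) == 1

# GQ_{2,3} = K_3 x K_3: 9 vertices, each of degree k(n - 1) = 4.
V = gq_vertices(2, 3)
assert all(sum(gq_adjacent(u, v) for v in V) == 4 for u in V)
```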
  • 77. Bisection width vs S-bisection width Bisection width bw(G): the minimum number of edges whose removal partitions the vertices into two halves. Let G be an interconnection network on N nodes with bw(G) = β. Under the random traffic pattern: throughput of a link = 1/#(flows using the link); throughput of G = min{ throughput of a link }. On average, half the flows use a cut-link, so at least 1 link is involved in at least (N/2)/β flows, and throughput is at most 2β/N. So, the higher the bisection width, the higher the (estimated) throughput. 25/38
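The arithmetic behind this estimate, as a one-liner (link capacity normalised to 1; a back-of-envelope bound, not a simulation):

```python
def throughput_upper_bound(num_nodes, bisection_width):
    """Under uniform random traffic, roughly N/2 flows cross the cut,
    so some cut link carries >= (N/2)/beta flows and per-flow
    throughput is at most 2*beta/N."""
    return 2 * bisection_width / num_nodes

# Hypercube Q_10: N = 1024, beta = 512, so the bound is 1.0, i.e. no
# bisection bottleneck under this estimate.
assert throughput_upper_bound(1024, 512) == 1.0
```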
  • 78. Bisection width vs S-bisection width Bisection width bw(G): the minimum number of edges whose removal partitions the vertices into two halves. S-bisection width $bw_S(G^*)$: the minimum number of edges whose removal partitions the nodes into two parts, each containing half of the server-nodes. The standard text (Dally and Towles) insists on partitioning both switches and processors, but in practice, S-bisection width is used. This seems to be the first time this has been formalised! 26/38
  • 80. Bisection width vs S-bisection width Bisection width bw(G): the minimum number of edges whose removal partitions the vertices into two halves. S-bisection width $bw_S(G^*)$: the minimum number of edges whose removal partitions the nodes into two parts, each containing half of the server-nodes. Can we characterise optimal partitions (below) of the switch-nodes? Can we compute these partitions efficiently? 26/38
  • 81. A slight misfortune... Lemma. Let G = (V, E) be a regular graph. It is always the case that $bw_S(G^*) \le bw(G)$. Lemma. For all $n \ge 2$, we have that $bw_S(K_{2n}^*) < bw(K_{2n})$. [Figure: the partitions [R, T] and [R′, T′]; legend: server, switch.] Proof. Let [R, T] be the (minimum) edge-cut that bipartitions switches and servers. The modified edge-cut [R′, T′] is smaller than [R, T] and it bipartitions servers. 27/38
  • 82. Edge-isoperimetric problems Let G = (V, E), and let R ⊆ V. $I_G(R)$: the number of edges with both ends in R. $\Theta_G(R)$: the number of edges with exactly one end in R. Edge-isoperimetric subsets R: 1. Find $I_G(r) = \max_{R : |R| = r} I_G(R)$. 2. Find $\Theta_G(r) = \min_{R : |R| = r} \Theta_G(R)$. The study of isoperimetric subsets has a long history (Bezrukov 1999), e.g., n-partite graphs; hypercubes; Cartesian products of cliques, bipartite graphs, the Petersen graph; grids,... Muradyan, Piliposjan ’80, Harper ’64, Lindsey ’64, Ahlswede, Cai ’97, Bezrukov, Elsässer ’98, Bollobás, Leader ’91, Golovach ’94, Mohar ... 28/38
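Both quantities can be stated as a brute-force search, which is enough for experimenting with small graphs (exponential in |V|; the literature cited above is precisely about avoiding this):

```python
from itertools import combinations

def edge_iso(V, E, r):
    """Brute-force I_G(r) = max_{|R|=r} I_G(R) and
    Theta_G(r) = min_{|R|=r} Theta_G(R); small graphs only."""
    best_I, best_T = 0, None
    for R in combinations(V, r):
        Rs = set(R)
        inside = sum(1 for u, v in E if u in Rs and v in Rs)
        boundary = sum(1 for u, v in E if (u in Rs) != (v in Rs))
        best_I = max(best_I, inside)
        best_T = boundary if best_T is None else min(best_T, boundary)
    return best_I, best_T

# 4-cycle, r = 2: the best R is an edge, so I_G(2) = 1, Theta_G(2) = 2.
C4_V = [0, 1, 2, 3]
C4_E = [(0, 1), (1, 2), (2, 3), (3, 0)]
assert edge_iso(C4_V, C4_E, 2) == (1, 2)
```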
  • 83. Edge-isoperimetric problems Let G = (V, E), and let R ⊆ V. $I_G(R)$: the number of edges with both ends in R. $\Theta_G(R)$: the number of edges with exactly one end in R. Edge-isoperimetric subsets R: 1. Find $I_G(r) = \max_{R : |R| = r} I_G(R)$. 2. Find $\Theta_G(r) = \min_{R : |R| = r} \Theta_G(R)$. Theorem (E., Kiasari, Navaridas, Stewart (2015)). Let G = (V, E) be a d-regular graph. The S-bisection width of $G^*$ is given by $\min\{w_r : 0 \le r \le |V|\}$, where
$$w_r = \begin{cases} rd - 2\lceil |E|/2 \rceil & \text{if } |E| \le 2 I_G(r) \\ \Theta_G(r) & \text{if } 2 I_G(r) < |E| \le 2 I_G(r) + 2\Theta_G(r) \\ 2\lceil |E|/2 \rceil - rd & \text{if } 2 I_G(r) + 2\Theta_G(r) < |E| \end{cases}$$ 28/38
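The theorem turns directly into a procedure: minimise $w_r$ over r, with $I_G(r)$ and $\Theta_G(r)$ supplied by any means; here, a sanity check using the brute-force edge_iso sketch above (assumed in scope), and reading the garbled middle term of the slide as $2\lceil |E|/2 \rceil$:

```python
from math import ceil

def s_bisection_width(V, E, d):
    """bw_S(G*) for a d-regular graph G via the case analysis above."""
    m = len(E)
    half = 2 * ceil(m / 2)  # assumed reading of the slide's term
    best = None
    for r in range(len(V) + 1):
        I_r, T_r = edge_iso(V, E, r)
        if m <= 2 * I_r:
            w = r * d - half
        elif m <= 2 * I_r + 2 * T_r:
            w = T_r
        else:
            w = half - r * d
        best = w if best is None else min(best, w)
    return best

# C_4 is 2-regular; C_4* is a 12-cycle with 8 server-nodes, and the
# minimum cut splitting the servers in half is 2, matching bw(C_4).
assert s_bisection_width(C4_V, C4_E, 2) == 2
```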
  • 84. Proof idea Let G = (V, E) and let $G^* = (S \cup W, E')$, with server-nodes S and switch-nodes W. Let $(R_W, T_W)$ be a partition of W, and let $B = [R_W \cup R_S, T_W \cup T_S]$ be a minimal S-bisection of $G^*$ (i.e., with $|R_S| = |T_S| = |E|$). Types of 3-paths: Type-R: both ends in $R_W$: contribute 0 or 2 edges to B. Type-RT: one end in $R_W$, one end in $T_W$: 1 edge to B. Type-T: both ends in $T_W$: 0 or 2 edges to B. 29/38
  • 86. Proof idea Claim: $[R_W \cup R_S, T_W \cup T_S]$ is a minimal S-bisection (w.r.t. r) if $R_W$ maximises $I_G(R_W)$ (with $|R_W| = r$), of size
$$w_r = \begin{cases} rd - 2\lceil |E|/2 \rceil & \text{if } |E| \le 2 I_G(r) \\ \Theta_G(r) & \text{if } 2 I_G(r) < |E| \le 2 I_G(r) + 2\Theta_G(r) \\ 2\lceil |E|/2 \rceil - rd & \text{if } 2 I_G(r) + 2\Theta_G(r) < |E| \end{cases}$$ 30/38
  • 87. S-bisection width of $GQ^*_{k,n}$ Theorem (Lindsey (1964), Nakano (1994)). For $1 \le t \le n^k$, $I_{GQ_{k,n}}(t) = \sum_{i=0}^{t-1} w_n(i)$, where $w_n(i)$ is the sum of the k base-n ‘digits’ of i. Nakano gives a formula.

    k   n     servers   bw(GQ_{n,k})   bw_S(GQ*_{n,k})   difference
    2  32      63,488          8,192             7,788          404
    2  33      69,696          9,248             8,580          668
    3  20     456,000         40,000            39,600          400
    3  21     555,660         50,930            47,628        3,302
    3  22     670,824         58,564            57,856          708
    4   9     209,952         16,400            14,580        1,820
    4  10     360,000         25,000            25,000            0
    4  11     585,640         43,920            39,930        3,990
    4  12     912,384         62,208            62,208            0
    4  13   1,370,928         99,960            92,274        7,686
    4  14   1,997,632        134,456           134,400           56
    4  15   2,835,000        202,496           189,000       13,496
31/38
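The digit-sum characterisation is pleasantly concrete; a sketch (Nakano's closed formula avoids the explicit sum):

```python
def w_digits(i, n, k):
    """w_n(i): the sum of the k base-n digits of i."""
    total = 0
    for _ in range(k):
        total += i % n
        i //= n
    return total

def I_gq(t, n, k):
    """Lindsey/Nakano: I_{GQ_{k,n}}(t) = sum_{i=0}^{t-1} w_n(i)."""
    return sum(w_digits(i, n, k) for i in range(t))

# In GQ_{2,3} = K_3 x K_3, the best 3 vertices form one K_3 'row',
# containing 3 edges: I(3) = w(0) + w(1) + w(2) = 0 + 1 + 2 = 3.
assert I_gq(3, 3, 2) == 3
```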
  • 89. S-bisection width of $GQ^*_{k,2}$; the hypercube Theorem (E., Kiasari, Navaridas, Stewart (2015)). $bw_S(Q_n^*) = 2^{n-1} = bw(Q_n)$. Proof sketch. Let $[R_W \cup R_S, T_W \cup T_S]$ be an S-bisection of $Q_n^*$. The case $|R_W| = |T_W|$ is easy, so assume $r := |R_W| < 2^{n-1}$ and suppose $bw_S(Q_n^*) < 2^{n-1}$. We have $2^{n-1} > bw_S(Q_n^*) \ge \Theta_{Q_n}(R_W) = rn - 2I(R_W) \ge rn - 2I(r)$, and we use properties of I(r) (e.g., $I(2^{n-2} + a) = I(2^{n-2}) + a + I(a)$ for $a < 2^{n-2}$) to show that $r < 2^{n-2}$. Use the previously mentioned Type-R, RT, T paths to show that $|R_S| = n 2^{n-1} \le 2(I(R_W) + \Theta(R_W) + b)$, where b counts servers from Type-T paths, and plug in I(r) for another contradiction. 32/38
  • 92. Practical application: Compare with FiConn [Plot: bisection width vs number of servers (log–log) for $GQ^*$ with diameter D = 5, 7, 9, and for FiConn$_{2,n}$, FiConn$_{3,n}$.]

                  GQ*_{3,10}   GQ*_{4,6}   FiConn_{2,24}
    server-nodes      27,000      25,920          24,648
    switch-nodes       1,000       1,296           1,027
    switch-radix          27          20              24
    links             81,000      77,760          67,782
    diameter               7           9               7
    bw_S               2,496       1,944           1,560
33/38
  • 93. Results on DCell Theorem (Wang, E., Fan, Jia, 2015). DCell is (almost always) Hamiltonian connected. Re: DCell+ (E., Kiasari, Navaridas, Stewart (2015)). Simultaneously compute shorter (than best known) paths and balance load by doing an efficient “intelligent” search. [Figure: percent mean hop-length savings over best known routing.] 34/38
  • 94. DPillar$_{n,k}$ dual-port SCDCN Server-nodes are labelled $(c, v_{k-1} v_{k-2} \cdots v_0)$ with column $c \in \{0, 1, \ldots, k-1\}$ and $v_i \in \{0, 1, \ldots, n/2 - 1\}$. Servers $(c, v_{k-1} \cdots v_{c+1} v_c v_{c-1} \cdots v_0)$ and $(c + 1, v_{k-1} \cdots v_{c+1} * v_{c-1} \cdots v_0)$ connect to the same switch (the ∗ matches any value of coordinate c). 35/38
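A sketch of how this rule groups servers onto switches (the label encoding and function name are mine): two servers in adjacent columns share a switch exactly when their labels agree outside coordinate c.

```python
def dpillar_switch(c, label):
    """The switch between columns c and c+1 seen by a server with the
    given label: coordinate c is wildcarded, the rest must match.
    label is the tuple (v_{k-1}, ..., v_0)."""
    k = len(label)
    return (c, tuple("*" if k - 1 - i == c else v
                     for i, v in enumerate(label)))

# n = 6, k = 3: servers (0, (2,1,0)) and (1, (2,1,5)) differ only at
# v_0, so they attach to the same switch between columns 0 and 1.
assert dpillar_switch(0, (2, 1, 0)) == dpillar_switch(0, (2, 1, 5))
```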
  • 95. Shortest path routing on DPillar In spite of considerable fanfare, no shortest path routing algorithm was known for DPillar. Theorem (E., Kiasari, Navaridas, Stewart (2015)). There is a routing algorithm that computes shortest paths in O(k) time. Sketch. Let src and dst be server-nodes of DPillar$_{n,k}$ which differ at coordinate c. We need to route through (or near) column c to change coordinate c. Map DPillar routing to visiting marked vertices on a k-cycle: given vertices x and y on the cycle, we must “cover” each marked vertex c in order to change the symbol $v_c$. If such a walk is of minimum length, it changes direction at most twice, and we need at most k steps to find these changes of direction. 36/38
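The cycle-covering view is easy to check by exhaustive search. The sketch below brute-forces the minimum-length covering walk with BFS over (position, covered-subset) states; it is a validator for the "at most two direction changes" claim, not the O(k) algorithm of the theorem.

```python
from collections import deque

def min_cover_walk(k, x, y, marked):
    """Length of a shortest walk on the k-cycle from x to y that
    visits every vertex in marked; exponential in |marked|."""
    order = sorted(set(marked))
    bit = {m: 1 << i for i, m in enumerate(order)}
    full = (1 << len(order)) - 1
    visit = lambda pos, s: s | bit.get(pos, 0)
    start = (x, visit(x, 0))
    dist = {start: 0}
    queue = deque([start])
    while queue:
        pos, s = queue.popleft()
        if pos == y and s == full:
            return dist[(pos, s)]
        for nxt in ((pos + 1) % k, (pos - 1) % k):
            state = (nxt, visit(nxt, s))
            if state not in dist:
                dist[state] = dist[(pos, s)] + 1
                queue.append(state)
    return None

# k = 8, route 0 -> 1 covering {2, 6}: the long way round
# (0,7,6,5,4,3,2,1), with no direction change, is optimal: 7 steps.
assert min_cover_walk(8, 0, 1, {2, 6}) == 7
```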
  • 99. There is a lot to do! Improve on recently proposed designs by finding new routing algorithms or discovering useful properties. Propose your own designs, soundly based in theory. Apply graph embedding algorithms to improve data centre virtualisation. Determine the set of relevant graph theoretic properties for each data centre usage situation. Relate topology to energy usage. Consider the wiring problem. A lot of data centre research is effectively being gifted to theorists. Take hold of it! 37/38
  • 106. Thank you. More information at http://alejandroerickson.com 38/38