On the feasibility of 40 Gbps network data capture and
retention with general purpose hardware
SAC 2018 | Pau, France
Guillermo Julián-Moreno
Rafael Leira
Jorge E. López de Vergara
Francisco Gómez-Arribas
Iván González
April 10, 2018
Naudit HPCN &
Escuela Politécnica Superior, Universidad
Autónoma de Madrid
Outline
1. Introduction
2. Design and implementation
3. Results
4. Conclusions
Introduction
Motivation
Why would we want to capture and store traffic?
• Online analysis and monitoring (e.g., flow records, traffic volume dashboards, IDS).
• Data retention for specialized analysis or policy requirements (e.g., GDPR).
The 10 GbE standard is widespread, and higher speeds (40 GbE, but also 100 GbE) are now appearing:
• 10 GbE is the “last” standard that we can process with a single core.
• 40 GbE and higher speeds require parallelism: how?
Purpose of the system
• Receive the traffic at 40 Gbps as efficiently as possible.
• Timestamp the incoming traffic.
• Store the network frames on disk at 40 Gbps.
• Use commercial off-the-shelf hardware to reduce costs.
Design and implementation
Previous architecture
[Diagram: NIC writes frames via DMA into a descriptor ring (head/tail pointers) and the data buffer]
• Single thread copying frames to the intermediate buffer.
• Write files by blocks and use padding at the end of each file.
• Return the descriptor’s ownership to the card after the copy, so no allocations are needed (a condensed sketch of this loop follows).
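The sketch below condenses that single-thread copy loop. It assumes a generic descriptor ring; the struct fields and names (rx_desc, rx_copy_loop) are illustrative, not the actual driver API.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Illustrative receive descriptor: the NIC sets `done` once the frame
     * pointed to by `data` has been written via DMA. */
    struct rx_desc {
        void    *data;
        uint16_t len;
        bool     done;
    };

    /* Single-thread copy loop: copy each ready frame into the intermediate
     * buffer, then return the descriptor to the card so it can be reused
     * without any new allocation. */
    static void rx_copy_loop(struct rx_desc *ring, size_t ring_size,
                             uint8_t *buffer, size_t buf_size)
    {
        size_t head = 0, offset = 0;

        for (;;) {
            struct rx_desc *d = &ring[head];
            if (!d->done)
                continue;                       /* busy-wait for the NIC        */

            if (offset + d->len > buf_size)
                offset = 0;                     /* wrap the intermediate buffer */
            memcpy(buffer + offset, d->data, d->len);
            offset += d->len;

            d->done = false;                    /* give ownership back to NIC   */
            head = (head + 1) % ring_size;
        }
    }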
Reading from NIC and copying to buffer
[Diagram: RX threads copying frames from the NIC descriptor ring into the intermediate buffer]
• The usual approach to parallelism is RSS queues; the problem is ensuring a uniform traffic distribution between queues.
• Given our limited scope, we switch to a single queue with fixed descriptor assignments per thread: uniform distribution and no synchronization required for reading.
• A single atomic counter provides the buffer write offset: as fast as possible and no deadlocks possible (see the sketch below).
• Add padding to the beginning and end of the files to avoid frames overrunning file boundaries.
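Below is a minimal sketch of the single atomic write offset. The names (buffer_reserve, BUF_SIZE) are ours and the real driver is more involved, but the point is that one fetch-and-add is the only synchronization the RX threads need.

    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdint.h>

    #define BUF_SIZE (1ULL << 30)   /* illustrative 1 GiB intermediate buffer */

    /* Write offset shared by all RX threads. */
    static atomic_uint_fast64_t write_offset;

    /* Reserve `len` bytes in the intermediate buffer. A single fetch-and-add
     * is lock-free, so threads never wait on each other and cannot deadlock.
     * Returns the position (modulo BUF_SIZE) where the frame can be copied. */
    static uint64_t buffer_reserve(size_t len)
    {
        uint64_t off = atomic_fetch_add_explicit(&write_offset, len,
                                                 memory_order_relaxed);
        return off % BUF_SIZE;
    }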
Client reading
[Diagram: kernel RX threads 0–3 allocate buffer space and copy frames; the userspace client is notified of new data and reads it]
• Userspace clients get last written byte via syscalls and set their last read byte.
• RX thread 0 updates s, the space available in the buffer. No thread writes more than ⌊s/n⌋ bytes in a batch (a sketch of this budget rule follows).
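A minimal sketch of that budget rule follows; the names (avail_space, NUM_RX_THREADS, batch_budget) are illustrative. RX thread 0 recomputes the free space s from the client's read position, and every RX thread limits its next batch to ⌊s/n⌋ bytes so the writers cannot overrun an active reader.

    #include <stdatomic.h>
    #include <stdint.h>

    #define NUM_RX_THREADS 4            /* n: RX threads sharing the buffer    */
    #define BUF_SIZE (1ULL << 30)       /* illustrative buffer size            */

    static atomic_uint_fast64_t write_offset;  /* total bytes written so far   */
    static atomic_uint_fast64_t read_offset;   /* last byte read by the client */
    static atomic_uint_fast64_t avail_space;   /* s: free bytes, set by RX 0   */

    /* RX thread 0 only: recompute the space left between writers and reader. */
    static void update_available_space(void)
    {
        uint64_t w = atomic_load(&write_offset);
        uint64_t r = atomic_load(&read_offset);
        atomic_store(&avail_space, BUF_SIZE - (w - r));
    }

    /* Any RX thread: maximum number of bytes it may copy in its next batch. */
    static uint64_t batch_budget(void)
    {
        return atomic_load(&avail_space) / NUM_RX_THREADS;    /* floor(s / n)  */
    }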
Writing to disk
Two options for the write process:
• Regular files written in 4 MB blocks; needs a fast filesystem.
• Writes distributed across several NVMe disks with SPDK.
Features to reduce hardware requirements (both sketched below):
• A simple filtering system that matches byte “strings” at fixed positions.
• Selective storage: only store the first N bytes of each frame.
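Both features are simple enough to sketch in a few lines of C. The rule struct and function names below are hypothetical, meant only to illustrate "match fixed-offset byte strings, then keep at most the first N bytes of each frame".

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* One filter rule: the frame must contain the given bytes at `offset`. */
    struct byte_rule {
        size_t         offset;
        const uint8_t *bytes;
        size_t         len;
    };

    /* Returns true if the frame matches every rule. */
    static bool frame_matches(const uint8_t *frame, size_t frame_len,
                              const struct byte_rule *rules, size_t n_rules)
    {
        for (size_t i = 0; i < n_rules; i++) {
            if (rules[i].offset + rules[i].len > frame_len)
                return false;
            if (memcmp(frame + rules[i].offset, rules[i].bytes, rules[i].len))
                return false;
        }
        return true;
    }

    /* Selective storage: store at most the first `snaplen` bytes of a frame. */
    static size_t stored_length(size_t frame_len, size_t snaplen)
    {
        return frame_len < snaplen ? frame_len : snaplen;
    }

For example, a rule with offset 12 and bytes {0x08, 0x00} would keep only frames whose EtherType is IPv4.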
Results
Hardware used
             Traffic generator       RX Server 1             RX Server 2
CPU          Intel Xeon E5-1620 v2   Intel Xeon E5-1620 v2   2 × Intel Xeon E5-2630 v4
Clock        3.70 GHz                3.70 GHz                2.20 GHz
Cores        4                       4                       2 × 10
Memory       32 GB                   32 GB                   2 × 64 GB
NIC          Intel XL710             Intel XL710             Intel XL710
Storage      SATA RAID               SATA RAID               6 × NVMe
Est. cost    7,000 €                 7,000 €                 10,000 €
Table 1: Specifications of the servers used for testing. HyperThreading was disabled.
Storage speed
[Plot: disk write rate (Gbps) vs. number of disks (2–6), comparing software RAID, SPDK and the theoretical maximum speed]
Figure 1: Performance of the NVMe disk array.
Traffic generation
[Plot: generation rate (Gbps) vs. frame size (100–1500 bytes), showing the send rate against the theoretical maximum rate]
Figure 2: Synthetic traffic rates achieved with our custom, DPDK-based traffic generator. We also made a version capable of sending large PCAP files at line rate.
Timestamping accuracy
Frames                   Mean      Std. dev.
All                      1738 ns   3296 ns
One out of every eight   55 ns     287 ns
Table 2: Timestamping accuracy. The Intel NIC posts descriptors in batches of eight, so we have to take that into account for the accuracy (see the sketch below).
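One way to read the table: descriptors are returned in groups of eight, so a single clock read ends up shared by a whole batch and only the first frame of each batch carries a timestamp close to its true arrival time. A hedged sketch of that per-batch clock read (the helper name is ours, not the actual driver code):

    #include <stdint.h>
    #include <time.h>

    /* Hypothetical per-batch timestamping: take one clock sample when a batch
     * of eight descriptors is collected and assign it to all frames in it, so
     * only the first frame of each batch gets a near-exact arrival time. */
    static uint64_t batch_timestamp_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
    }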
Traffic capture
[Plot: capture rate (Gbps) and loss / port-drop percentage vs. frame size (100–1500 bytes), against the send rate and the theoretical maximum rate]
Figure 3: Results of the first test: retrieval of the frames from the NIC. The bottleneck is the card for small frame sizes.
[Plot: capture rate (Gbps) and loss / port-drop percentage vs. frame size (100–1500 bytes), against the send rate and the theoretical maximum rate]
Figure 4: Results of the second test: writing of the frames to a null device.
Traffic storage
[Plot: capture rate (Gbps) and loss percentage vs. frame size (100–1500 bytes), against the send rate and the theoretical maximum rate]
Figure 5: Results of the third test: traffic storage using SPDK.
Traffic storage
Name         Size     Avg. frame size   Send rate    Loss %
CAIDA        222 GB   787.91 B          39.78 Gbps   < 0.01
University   4.3 GB   910.08 B          39.82 Gbps   0
Table 3: Performance on reception of traffic capture files.
Conclusions
Results
• We have created and open-sourced a system capable of capturing, timestamping
and storing network traffic at 40 Gbps.
• Not using RSS parallelism is feasible and useful in our limited-scope system.
• The one-copy mechanism and synchronization algorithms allow our system to store
line-rate traffic at frame sizes of 300 bytes and above (enough for the majority of
environments).
• We have created a testbed capable of saturating 40 GbE links for frames of size 96
bytes or greater.
Future work
• Improve the selective-storage approach: more effective filters (ASCII/BPF) or limits
based on RX rate.
• A detailed comparison of the frame reordering and timestamp inaccuracies between our approach and RSS queues.
• Port this system to virtual machines with SR-IOV virtual functions.
Questions?
