Real-Time Media: P2P, Server, Infrastructure, and Platform
Dr. Alex Gouaillard
P2P: In the beginning
Originally, the WebRTC use case was 1-to-1 and p2p, à la FaceTime or Skype.
The media flows directly between the two peers, which have a direct connection
and thus can know almost everything about the connection of their respective
remote peer.
Bandwidth estimation, encryption, and bandwidth adaptation are all relatively trivial to
support in that setting, and the only things left to do are the discovery and the
handshake. Discovery is the process through which peers find each other.
The handshake is the negotiation process that precedes media establishment;
it lets each peer assess the other's capabilities and exchange needed information,
such as the encryption keys (sketched below).
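For illustration, here is a minimal sketch of that handshake in the browser. The `Signaling` interface is a hypothetical wrapper around whatever discovery channel (e.g. a WebSocket to the signalling server) the application provides; it is not part of WebRTC itself.

```ts
// Minimal offer/answer handshake sketch. `Signaling` is a hypothetical
// wrapper around your discovery channel (e.g. a WebSocket to the server).
interface Signaling {
  send(msg: any): void;
  onmessage: (msg: any) => void;
}
declare const signaling: Signaling;

const pc = new RTCPeerConnection();

// Trickle ICE: forward local candidates to the remote peer as they are gathered.
pc.onicecandidate = ({ candidate }) => {
  if (candidate) signaling.send({ type: "candidate", candidate });
};

// Caller side: create and send the offer.
async function call() {
  await pc.setLocalDescription(await pc.createOffer());
  signaling.send({ type: "offer", sdp: pc.localDescription });
}

// Both sides: react to the remote peer's messages.
signaling.onmessage = async (msg) => {
  if (msg.type === "offer") {
    await pc.setRemoteDescription(msg.sdp);
    await pc.setLocalDescription(await pc.createAnswer());
    signaling.send({ type: "answer", sdp: pc.localDescription });
  } else if (msg.type === "answer") {
    await pc.setRemoteDescription(msg.sdp);
  } else if (msg.type === "candidate") {
    await pc.addIceCandidate(msg.candidate);
  }
};
```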
P2P: In the beginning
A lot of first-generation servers are thus signalling servers: servers that enable
the discovery and facilitate the handshake for peers. Most of the &yet portfolio, for
example, is based on p2p connections (simplepeer, talky, …), as is the first
version of TokBox, appear.in, and almost every WebRTC platform.
This design was very appealing to PaaS designers because the cost of the media
bandwidth, up to 85% of the operational cost, is fully externalized to the end user.
However, the main limitation of this design is scalability. Extending the logic to
multiparty chats is trivial, but the design forces, e.g., multiple encodings of the same
stream, one per remote peer, and uses a lot of CPU. The complexity is then
quadratic with respect to the number of users (see the mesh sketch below).
Beyond scalability, features like recording, or interconnection with other protocols
for VoIP or streaming, all require a media server.
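The quadratic cost is easiest to see in code. In a full mesh, every participant keeps one connection per remote peer, so every local track is encoded and uploaded once per peer. This is a sketch only; `peerIds` stands in for whatever roster your signalling server provides.

```ts
// Full-mesh sketch: one RTCPeerConnection per remote peer, so each local
// track is encoded and uploaded (n - 1) times. With n participants the
// total number of connections in the room grows as n * (n - 1) / 2.
const connections = new Map<string, RTCPeerConnection>();

async function joinMesh(peerIds: string[], localStream: MediaStream) {
  for (const id of peerIds) {
    const pc = new RTCPeerConnection();
    // Each addTrack here means another independent upstream encoding.
    localStream.getTracks().forEach((t) => pc.addTrack(t, localStream));
    connections.set(id, pc);
    // Offer/answer then proceeds per peer over the signalling channel, as above.
  }
}
```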
Note on Turn / Stun / ICE
ICE is a solution to firewall and NAT traversal. It is an algorithm based on the
use of either a STUN or a TURN server (purists will tell you a TURN server is
actually a STUN server, but not the other way around). Because of that, TURN and
STUN servers are usually called ICE servers in implementations, but in
commercial offers almost everybody speaks about TURN services (Twilio, Xirsys,
callstats.io, …).
What is interesting is that TURN servers are always useful: they are independent
of your back-end choice (p2p, SFU, infra, platform), they act only on the client side,
and they are the only part of a WebRTC solution that can be completely separated
from the rest.
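This independence shows up in the API: STUN and TURN servers go in the same `iceServers` list of the peer connection configuration, regardless of what the back end looks like. The URLs and credentials below are placeholders, not real endpoints.

```ts
// Passing STUN/TURN servers to the browser: both go in the same `iceServers`
// list, which is why implementations just call them "ICE servers".
const pc = new RTCPeerConnection({
  iceServers: [
    { urls: "stun:stun.example.com:3478" },
    {
      urls: "turn:turn.example.com:3478?transport=udp",
      username: "user",          // placeholder credentials
      credential: "secret",
    },
  ],
});
```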
Media Servers
Roughly two kinds of media servers exist: SFUs and MCUs. Roughly, the distinction
is that an MCU processes the media while an SFU only relays/routes it.
NOTE: Medooze is a media server toolkit. You have “bricks” at your disposal to
put together to end up with a media server with the desired capabilities. It can be
an SFU, an MCU, or any variation thereof. There is no such thing as a
Medooze server, per se.
Less granular: Janus is a core engine plus plugins that can be used to add the
desired functionality, so you speak of a Janus server with a given plugin.
“Video Room” is the plugin that implements SFU functionality, and the one we
have SDKs and native clients for.
Not granular: mediasoup. Absolutely monolithic: Jitsi (red5, Ant Media).
Media Servers: MCUs
MCUs need to decrypt the stream, reconstruct the encoded frame, decode it to get the raw media frame,
apply some transformation to it, and then do everything in reverse.
● MCUs always remove the encryption and are therefore not secure by design.
● This process represents an extra 80% CPU consumption.
● Transcoding at least doubles the media processing latency on the path.
● If the incoming and outgoing codecs are different, it’s called transcoding.
● If the incoming and outgoing protocols are different, it’s called transmuxing.
● It is very versatile, and can reduce bandwidth usage to a minimum (linear in the # of users).
MCUs are mandatory when transcoding RTMP, RTSP, or HLS/MPEG-DASH to/from WebRTC. We are
using an MCU configuration of Medooze for the corresponding media servers.
For a very long time, bitrate adaptation for streaming was done by reducing the resolution on the server.
That requires transcoding: scaling down the frame in the middle of the process. They call it ABR. ABR is
thus intrinsically insecure, and costly both in terms of CPU footprint and overall latency.
In VoIP, where a single video stream is accepted by default, MCUs are used to stitch several streams
together and send the result to a SIP client.
Media Servers: SFUs
SFUs (Selective Forwarding Units) do just that: choose a stream and relay it.
● It is very lightweight since it does not process media, i.e. no encoding or decoding.
● Adding support for a new codec is trivial, since no encoding/decoding takes place.
● It is very well suited for multiparty conferences, as it just duplicates the incoming packets, rewrites the
destination address, and pushes them out (see the relay sketch after this list).
● It is also insecure by default: to be able to send to multiple peers, it needs to establish a separate
encrypted connection with each, each with different keys. End-to-end encryption solves this problem
and is implementable with an SFU, while impossible with an MCU. Unless you want to lose the “end-to-end
encrypted” WebRTC claim, you need to add an additional end-to-end encryption layer when using
an SFU. The original claim is null and void if using an MCU.
● Recording would require decoding the media stream. It is either implemented as a packet dump (the RTP
packets themselves are recorded, which allows post-mortem network debugging as well as on-demand
media file generation), or, in a multi-server environment, delegated to a separate MCU.
● Most SFUs have extra reliability features, like a per-outgoing-stream packet cache and a jitter
buffer, and can help improve real-time reliability on bad networks.
● They are especially powerful when coupled with simulcast (any codec) or SVC (special codecs).
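To make the "just relay it" point concrete, here is a deliberately simplified forwarding loop in Node.js. It only fans packets out byte for byte; a real SFU also terminates DTLS-SRTP per peer (hence the separate keys mentioned above), rewrites SSRCs and sequence numbers, handles RTCP, and keeps a packet cache. The addresses and ports are hypothetical.

```ts
// Minimal "selective forwarding" sketch: receive RTP over UDP on an ingest
// port and fan each packet out to every subscriber, with no decode/encode.
import * as dgram from "node:dgram";

type Subscriber = { address: string; port: number };
const subscribers: Subscriber[] = [
  { address: "203.0.113.10", port: 40000 },
  { address: "203.0.113.11", port: 40002 },
];

const sock = dgram.createSocket("udp4");
sock.on("message", (packet) => {
  // The payload is never decoded; the server only routes it. A production
  // SFU would re-encrypt per subscriber and rewrite RTP header fields here.
  for (const sub of subscribers) {
    sock.send(packet, sub.port, sub.address);
  }
});
sock.bind(5004); // ingest port for the source's RTP
```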
Media Server: Recording solution: SFUs + MCUs
● Keep the main path SFU-only (transcoding-free) for minimum latency!
● Push the storage and the MCU (mixer) out to the outgoing leg that requires it, and only that one.
● Keep pcap as the main storage format: it allows debugging, contains media AND network info, and can be stored E2E-encrypted.
● RTMP ingest, if video-only and single-resolution, can be done with an SFU. For other codec support, and/or audio support, and/or simulcast support, you need an MCU. Here again, keep it off the main path.
Media Servers: Bandwidth Adaptation: ABR
● ABR makes separate, parallel encodings of the same media source and delivers the best one a
viewer can consume. By chunking the encoded media, one can adapt the delivered resolution
dynamically, with a reaction time equivalent to the chunk size (90s by default with HLS). It’s a server-side
feature for viewer-side optimization. Transcoding is needed, doubling the media latency. It
usually requires storage of the chunks prior to delivery, adding to the latency. (A sample manifest follows.)
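For illustration, an ABR origin exposes its parallel renditions through a manifest such as this hypothetical HLS multivariant playlist; the player picks the variant that fits its measured bandwidth (bitrates and paths here are made up):

```
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
low/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2500000,RESOLUTION=1280x720
mid/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080
high/index.m3u8
```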
Media Servers: Bandwidth Adaptation: Simulcast
● Simulcast is exactly the same thing, but done on the sender side. It removes the need to manipulate
the media on the server (so one can now use an SFU). It’s a sender-side feature for viewer-side
optimization. Its reaction time depends on the time to the next full frame, usually 10s by default, but
smart implementations generate a full frame on demand. It also removes the extra
decoding/encoding cycle in the server, halving the media latency. It does not require storage.
It still requires a separate encoder per resolution, which can be problematic on mobile devices where
only one hardware encoder is present. It works with any codec, even old ones (sketched below).
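In the browser, simulcast is requested through `sendEncodings` on the transceiver. A minimal sketch, with illustrative RID names, bitrates, and scale factors:

```ts
// Sender-side simulcast sketch: ask the browser for three parallel encodings
// of the same camera track; the SFU then picks one per viewer.
async function publishSimulcast() {
  const stream = await navigator.mediaDevices.getUserMedia({ video: true });
  const pc = new RTCPeerConnection();
  pc.addTransceiver(stream.getVideoTracks()[0], {
    direction: "sendonly",
    sendEncodings: [
      { rid: "h", maxBitrate: 2_500_000 },                          // full size
      { rid: "m", maxBitrate: 800_000, scaleResolutionDownBy: 2 },  // half size
      { rid: "l", maxBitrate: 250_000, scaleResolutionDownBy: 4 },  // quarter size
    ],
  });
  // Offer/answer with the SFU follows as usual.
}
```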
Media Servers: Bandwidth Adaptation: SVC
● SVC achieves the same thing as simulcast, getting multiple resolutions from a single source, but
using only one encoder, with all the resolutions layered into a single bitstream. The layering also
allows for easier and more robust network resilience. The reaction time in the server to switch
from one resolution to another is on the order of one RTP packet read, around 10ms.
● It requires a special codec. H.264 (AVC), H.265 (HEVC), and VP9 all have an SVC mode. AV1 supports
SVC by default (see the sketch below).
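Browsers expose SVC through the WebRTC-SVC extension: a single encoding with a `scalabilityMode` string instead of several simulcast encodings. This is a sketch; support (mainly VP9/AV1 in Chromium) and TypeScript typings vary by browser version.

```ts
// SVC sketch using the WebRTC-SVC extension: one encoder, one bitstream,
// three spatial and three temporal layers ("L3T3"). The SFU can then drop
// layers per viewer at RTP-packet granularity.
async function publishSvc() {
  const stream = await navigator.mediaDevices.getUserMedia({ video: true });
  const pc = new RTCPeerConnection();
  pc.addTransceiver(stream.getVideoTracks()[0], {
    direction: "sendonly",
    // Cast because older lib.dom typings lack scalabilityMode.
    sendEncodings: [{ scalabilityMode: "L3T3" } as RTCRtpEncodingParameters],
  });
}
```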
Streaming and Media Infrastructure
Old protocols like RTMP and RTSP are not encrypted by default, have no capacity to traverse NATs or
firewalls, and are not adaptive either. Practically, this means that:
● anybody with access can watch,
● you need to open a port in your firewall and NAT ahead of time,
● you need to allocate and reserve enough bandwidth; if you go below it, the call fails.
Streaming protocols solve this by offloading all the traffic onto HTTP, which is never blocked, leaving the
security to the user, and using ABR for adaptability. The storage allows for scalability through CDNs. Using
CDNs and HTTP (cache) then achieved a much lower cost of operation. Each of those solutions
addresses the original problems by adding latency.
WebRTC solved those problems in real time, using respectively ICE, encryption by default, and
simulcast/SVC, but with a p2p model that does not scale so well.
Nowadays, the cost of CDNs and the cost of bandwidth are on par.
Streaming and Media Infrastructure
WebRTC is much closer to RTMP and RTSP in nature than to HLS and MPEG-DASH. They are media
transport protocols that use neither HTTP nor a file format/storage; everything is done in memory and
on the wire.
Given the limitations of the protocols themselves, the only thing you could do with RTMP and RTSP was
to cluster and (auto-)scale. Since there was no firewall / NAT traversal capacity, you would ignore them.
Since there was no adaptability, you would not try to probe the bandwidth or to be a good network
citizen by applying congestion control to yourself. You would assume x Mbps available and reserved for
you at any time, and perfect network conditions. Scaling then becomes a simple matter of connecting
servers to one another in chains from source to consumer/viewer of the media. It’s so simple that free
HTTP proxies like NGINX also implement RTMP routing, relaying, and recording (via the nginx-rtmp module).
With today’s cloud services, providing the capacity to start and stop servers, to distribute the sources
and the viewers across servers, and to route media from source to viewer is enough. You can automate
all of that through what is called “load balancing” and “auto-scaling”.
Streaming and Media Infrastructure: Stadia
[Stadia architecture diagrams]
Streaming and Media Infrastructure
A - Single, stand-alone server (red5, ant media, medooze)
[Diagram: source → stand-alone SFU → viewers. Max: 1K viewers. No load balancing, internal routing, or scaling.]
● If WebRTC, you still need a signalling server for discovery
and handshake, but usually this is integrated in the media server itself (as in all the servers above). Each source,
respectively each viewer, needs to discover and do a handshake with the media server separately.
● If RTMP or RTSP, you just need to provide the IP and PORT of the target computer, and you’re done. Of course,
the port needs to be open, the network needs to be almost perfect, bandwidth needs to be plentiful, and you cannot
upgrade codecs (RTMP and RTSP at best support H.264).
● Load balancing is the process of distributing the streams across several servers. In the case of a single
stand-alone server there is no need for distribution, as both source and viewer will always connect to the same
server. It makes things much simpler, but limits your audience to roughly 1,000 viewers.
● Routing, which is the process of connecting the source and the viewers, is trivial and can be done by the
media server.
● Scaling, which is adding or removing media servers in a cluster or pool of servers, is not needed, since
there is only one media server.
● You need end-to-end encryption if you want to be secure. The distribution of the keys is off-topic here.
Streaming and Media Infrastructure
B - 2 servers in path, no e2e media layer (red5, ant, phenixRTS)
[Diagram: source → ingest SFU → N egress SFUs → viewers, with load balancing, internal routing, and scaling. Max: 1M viewers.]
● Same problems with handshake and discovery:
○ RTMP just connects by IP:PORT.
○ WebRTC needs a signaling server for the source and the viewers to connect to the infrastructure.
● Everything else being equal, this design can handle 1M viewers.
● Problems start to arise here, as the WebRTC bandwidth estimation and congestion control
algorithms fail when used with multi-hop media paths. Noisy-neighbor effects also start to appear.
● The easy solution is to switch back to ABR. It solves all those problems, at the cost of breaking
real-time bandwidth adaptability, adding latency, disabling end-to-end encryption, and basically turning WebRTC
into nothing more than RTMP. It is easy to implement and does not “work” worse than RTMP, while providing
better codecs than RTMP, and native browser support.
Streaming and Media Infrastructure
B - 2 servers in path, no e2e media layer (red5, ant, phenixRTS)
[Same diagram as above, now with a pool of N ingest SFUs in front of the N egress SFUs. Max: 1M viewers.]
● Here you assume that, instead of a single ingest SFU, you have a pool of N of them.
● In simple designs (LUXON v1), one simply adds or removes instances to the pool manually. In more advanced
designs (LUXON v2, red5, ant, phenixRTS), the pools auto-scale, i.e. servers are added or removed
on demand depending, e.g., on the load.
● Load balancing manages how sources, respectively viewers, are directed to a specific server. In simple designs
(LUXON v1), one uses a round-robin algorithm which sends each new connection to the next server in a list of
servers (in order) and, on reaching the end of the list, resumes at the beginning (see the sketch below). It works well when
the number of servers is fixed and all servers are idempotent.
● Routing becomes more complicated here. In simple designs (LUXON v1), we would just relay the ingest
stream to all egress SFUs. We would then be sure a viewer can connect to any egress SFU and have any
stream available. Obviously, that does not scale well and wastes a lot of bandwidth. In LUXON v2, we have a
“director API” which does the routing and the load balancing on demand. (slides exist)
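The round-robin policy mentioned above fits in a few lines. A minimal sketch; the server names are hypothetical, and a real auto-scaling pool would need a load-aware policy instead, since the list changes under it.

```ts
// Round-robin load balancer sketch: each new connection goes to the next
// server in a fixed list, wrapping around at the end.
class RoundRobinBalancer {
  private next = 0;
  constructor(private servers: string[]) {}

  pick(): string {
    const server = this.servers[this.next];
    this.next = (this.next + 1) % this.servers.length;
    return server;
  }
}

// Hypothetical usage: hand each incoming viewer an egress SFU.
const lb = new RoundRobinBalancer(["sfu-1.example", "sfu-2.example", "sfu-3.example"]);
const assigned = lb.pick(); // "sfu-1.example", then "sfu-2.example", ...
```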
Streaming and Media Infrastructure
C - 2+ servers in path, no e2e media layer
[Diagram: source → N ingest SFUs → N relay SFUs → N egress SFUs → viewers, with load balancing, internal routing, and scaling at each level. Max: 1G viewers.]
● Having an extra media server in the path allows for extra scalability
(up to 1 trillion viewers theoretically).
● It also allows for smarter routing and better connection management, including multi-cloud.
● We have all 3 configurations with Medooze. We have a version of LUXON like this.
Streaming and Media Infrastructure
D - 2 servers in path, with e2e media layer (LUXON)
● The problems start to arise here, as the WebRTC bandwidth estimation and congestion control algorithms
fail when used with multi-hop media paths. Noisy-neighbor effects also start to appear (slide).
● The solution is to take into account the network status and capacity at each server in the media path.
● One such algorithm is called BBR (Bottleneck Bandwidth and Round-Trip Time). It’s conceptually easy to
understand: you should not send more than the weakest network in the path can handle (sketched below).
● This is, e.g., what Google uses in Stadia, and what is used in Facebook’s QUIC implementation.
● It requires a full-monitoring, interactive, E2E media layer on top of the media servers, in addition to
scaling/routing. It cannot be in a single server, by design. It’s a media-infrastructure-level feature.
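The core idea reduces to a minimum over per-hop estimates, which is why it needs visibility into every server on the path. A sketch with made-up numbers; a real BBR implementation also tracks round-trip propagation time and keeps probing for changes.

```ts
// Bottleneck-bandwidth sketch: the sustainable rate of a multi-hop media
// path is the minimum of the per-hop bandwidth estimates, so the sender
// should pace itself to the weakest link.
interface HopEstimate {
  hop: string; // e.g. "source->ingest", "ingest->relay"
  bandwidthKbps: number;
}

function pathBottleneck(hops: HopEstimate[]): HopEstimate {
  // Assumes a non-empty path; returns the weakest hop.
  return hops.reduce((min, h) => (h.bandwidthKbps < min.bandwidthKbps ? h : min));
}

const hops: HopEstimate[] = [
  { hop: "source->ingest", bandwidthKbps: 8000 },
  { hop: "ingest->relay", bandwidthKbps: 50000 },
  { hop: "relay->egress", bandwidthKbps: 50000 },
  { hop: "egress->viewer", bandwidthKbps: 3500 },
];
// Here the viewer's last mile (3500 kbps) caps the whole path.
const bottleneck = pathBottleneck(hops);
```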
Streaming and Media Infrastructure: Stadia
[Diagram: a full-monitoring, interactive, E2E media layer, separated from the media servers and separated from scaling/routing.]
Streaming and Media Infrastructure
D - 2+ servers in path, with e2e media layer (LUXON)
[Diagram: source → N ingest SFUs → N relay SFUs → N egress SFUs → viewers, with load balancing (LB), smart routing, and scaling at each level. Overlaid layers:]
● Server (Medooze™) & clients/SDKs layer, including E2E encryption and watermarking (very differentiating).
● Infrastructure scalability layer: “Director API”™ (not differentiating).
● Infrastructure BWE and CC layer, based on bandwidth bottleneck and RTT + other black magic: “RT Media Enabler”™ (UNIQUE).
LUXON covers the media only.
Media Streaming PaaS - Millicast
[Diagram: the same media topology (source → N ingest SFUs → N relay SFUs → N egress SFUs → viewers, with LB and smart routing), plus the platform layers:]
● Server (Medooze™) & clients/SDKs layer, including E2E encryption and watermarking (very differentiating, as traditionally most vendors are server vendors with no native SDK capacity).
● Infrastructure scalability/logic layer: “Director API”™ (not differentiating).
● Infrastructure BWE and CC layer, based on bandwidth bottleneck and RTT + other black magic: “RT Media path manager”™ (UNIQUE).
● Platform layer: “Influxis Synapse”™ (little differentiation). EXT: OAuth, tokens, analytics, billing, payment, all APIs. INT: monitoring, notifications, recording/storage. Hosting / cloud management.
RTMP Ingest support - in production
Core WebRTC media path for minimum latency; RTMP first mile.
[Diagram: an RTMP/RTSP source feeds N ingest MCUs, which join the core WebRTC path: ingest SFU → N relay SFUs → N egress SFUs → viewers. A WebRTC source feeds the ingest SFU directly.]
● NOW: No video transcoding (=> no simulcast, no SVC support) (=> only supports H.264).
● NOW: Audio transcoding to Opus stereo at best (can lose quality if the input is 5.1 AAC).
● NOW: Added latency from media processing: 1 frame at most (1/30 s ≈ 33 ms).
● PLANNED: Video transcoding with simulcast / SVC, and the capacity to choose the codec.
● PLANNED: Audio transcoding to multiOpus if the input is 5.1 or 6.1. No loss of spatial quality.
RTMP/HLS/MPEG-DASH egress support
Core WebRTC media path for minimum latency.
[Diagram: WebRTC source → ingest SFU → N relay SFUs → N egress SFUs → viewers. Off the main path, N RTMP MCUs feed social sites (Facebook, Twitch) and N HLS MCUs feed a CDN.]
● The media part (the Medooze transcoding MCU) is done.
● The difficult part now is the logic and the routing needed to integrate it in LUXON.
● Then, wiring it into Millicast with the business logic (accounting, billing, authentication, …).
● Connecting to FB or a CDN could and should be automated as well.
Server-Side Ad-Insertion
Core WebRTC media path for minimum latency.
● The media part (the Medooze transcoding MCU) is done.
● The difficult part now is the logic and the routing needed to integrate it in LUXON.
● Then, wiring it into Millicast with the business logic (accounting, billing, authentication, …).
● The ad targeting is missing, pending viewer analytics availability. Collaboration project with DataZoom.
[Diagram: the live stream flows from the source over WebRTC through the SFU (Medooze) to the viewer. In parallel, an ad request/response goes to an ad-decision service (VAST); the ad content is pulled from ad storage over HLS/MPEG-DASH, transcoded by a Medooze transcoder to WebRTC, and delivered to the viewer as live stream + ad over WebRTC (steps 1-7 in the original figure).]
ANNEX 4: Other Features
● E2EE with SFrame. Ready to integrate. Native apps only for now (see the sketch below).
● multiOpus. Ready to integrate. Web and native. Hidden in Chrome; risk that they remove it.
● Watermarking. Trivial to integrate (client side only).
Native apps only for now. Needs more research on the watermarking
algorithm, currently not robust enough. Needs Ludo time. Slow (R&D).
● AV1: not ready to integrate; wait for SVC from Google.
● WebXR with RTP-timestamp-based sync. No POC yet.
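On the web side, the usual hook for an SFrame-style E2EE layer is encoded insertable streams. A sketch only: the `encodedInsertableStreams` flag and `createEncodedStreams()` are Chromium-specific and non-standard (Safari/Firefox expose `RTCRtpScriptTransform` instead), and `encryptFrame` is a placeholder where the actual SFrame transform would go.

```ts
// E2EE sketch with encoded insertable streams: encoded frames are
// additionally encrypted before RTP packetization, so the SFU only ever
// relays ciphertext it cannot decode.
const pc = new RTCPeerConnection({
  encodedInsertableStreams: true, // non-standard Chromium flag
} as RTCConfiguration);

function encryptFrame(frame: any, controller: TransformStreamDefaultController) {
  // Placeholder: real code would apply an SFrame transform to frame.data.
  controller.enqueue(frame);
}

async function publishE2ee(track: MediaStreamTrack, stream: MediaStream) {
  const sender = pc.addTrack(track, stream);
  // Chromium's non-standard encoded-streams API.
  const { readable, writable } = (sender as any).createEncodedStreams();
  readable
    .pipeThrough(new TransformStream({ transform: encryptFrame }))
    .pipeTo(writable);
}
```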