Real-Time Media: P2P, Server, Infrastructure, and Platform
Dr. Alex Gouaillard
P2P: In the beginning
Originally, the WebRTC use case was 1-to-1 and p2p, à la FaceTime or Skype.
The media flows directly between the two peers, which have a direct connection
and thus can know almost everything about the connection of their respective
remote peer.
Bandwidth estimation, encryption, and bandwidth adaptation are all relatively trivial to
support in that setting, and the only things left to do are the discovery and the
handshake. Discovery is the process through which peers find each other.
The handshake is the negotiation process that precedes media establishment;
it lets each peer assess the other's capabilities and exchange needed information,
such as the encryption keys (sketched below).
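For illustration, here is a minimal sketch of that handshake in the browser. The `Signaling` interface is a hypothetical wrapper around whatever discovery channel (e.g. a WebSocket to the signalling server) the application provides; it is not part of WebRTC itself.

```ts
// Minimal offer/answer handshake sketch. `Signaling` is a hypothetical
// wrapper around your discovery channel (e.g. a WebSocket to the server).
interface Signaling {
  send(msg: any): void;
  onmessage: (msg: any) => void;
}
declare const signaling: Signaling;

const pc = new RTCPeerConnection();

// Trickle ICE: forward local candidates to the remote peer as they are gathered.
pc.onicecandidate = ({ candidate }) => {
  if (candidate) signaling.send({ type: "candidate", candidate });
};

// Caller side: create and send the offer.
async function call() {
  await pc.setLocalDescription(await pc.createOffer());
  signaling.send({ type: "offer", sdp: pc.localDescription });
}

// Both sides: react to the remote peer's messages.
signaling.onmessage = async (msg) => {
  if (msg.type === "offer") {
    await pc.setRemoteDescription(msg.sdp);
    await pc.setLocalDescription(await pc.createAnswer());
    signaling.send({ type: "answer", sdp: pc.localDescription });
  } else if (msg.type === "answer") {
    await pc.setRemoteDescription(msg.sdp);
  } else if (msg.type === "candidate") {
    await pc.addIceCandidate(msg.candidate);
  }
};
```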
P2P: In the beginning
A lot of first-generation servers are thus signalling servers: servers that enable
the discovery and facilitate the handshake for peers. Most of the &yet portfolio, for
example, is based on p2p connections (simplepeer, talky, …), as is the first
version of TokBox, appear.in, and almost every WebRTC platform.
This design was very appealing to PaaS designers because the cost of the media
bandwidth, up to 85% of the operational cost, is fully externalized to the end user.
However, the main limitation of this design is scalability. Extending the logic to
multiparty chats is trivial, but the design forces, e.g., multiple encodings of the same
stream, one per remote peer, and uses a lot of CPU. The complexity is then
quadratic with respect to the number of users (see the mesh sketch below).
Beyond scalability, features like recording, or interconnection with other protocols
for VoIP or streaming, all require a media server.
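The quadratic cost is easiest to see in code. In a full mesh, every participant keeps one connection per remote peer, so every local track is encoded and uploaded once per peer. This is a sketch only; `peerIds` stands in for whatever roster your signalling server provides.

```ts
// Full-mesh sketch: one RTCPeerConnection per remote peer, so each local
// track is encoded and uploaded (n - 1) times. With n participants the
// total number of connections in the room grows as n * (n - 1) / 2.
const connections = new Map<string, RTCPeerConnection>();

async function joinMesh(peerIds: string[], localStream: MediaStream) {
  for (const id of peerIds) {
    const pc = new RTCPeerConnection();
    // Each addTrack here means another independent upstream encoding.
    localStream.getTracks().forEach((t) => pc.addTrack(t, localStream));
    connections.set(id, pc);
    // Offer/answer then proceeds per peer over the signalling channel, as above.
  }
}
```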
Note on Turn / Stun / ICE
ICE is a solution to firewall and NAT traversal. It is an algorithm based on the
use of either a STUN or a TURN server (purists will tell you a TURN server is
actually a STUN server, but not the other way around). Because of that, TURN and
STUN servers are usually called ICE servers in implementations, but in
commercial offers almost everybody speaks about TURN services (Twilio, Xirsys,
callstats.io, …).
What is interesting is that TURN servers are always useful: they are independent
of your back-end choice (p2p, SFU, infra, platform), they act only on the client side,
and they are the only part of a WebRTC solution that can be completely separated
from the rest.
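This independence shows up in the API: STUN and TURN servers go in the same `iceServers` list of the peer connection configuration, regardless of what the back end looks like. The URLs and credentials below are placeholders, not real endpoints.

```ts
// Passing STUN/TURN servers to the browser: both go in the same `iceServers`
// list, which is why implementations just call them "ICE servers".
const pc = new RTCPeerConnection({
  iceServers: [
    { urls: "stun:stun.example.com:3478" },
    {
      urls: "turn:turn.example.com:3478?transport=udp",
      username: "user",          // placeholder credentials
      credential: "secret",
    },
  ],
});
```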
Media Servers
Roughly two kinds of media servers exist: SFUs and MCUs. Roughly, the distinction
is that an MCU processes the media while an SFU only relays/routes it.
NOTE: Medooze is a media server toolkit. You have “bricks” at your disposal to
put together to end up with a media server with the desired capabilities. It can be
an SFU, an MCU, or any variation thereof. There is no such thing as a
Medooze server, per se.
Less granular: Janus is a core engine plus plugins that can be used to add the
desired functionality, so you speak of a Janus server with a given plugin.
“Video Room” is the plugin that implements SFU functionality, and the one we
have SDKs and native clients for.
Not granular: mediasoup. Absolutely monolithic: Jitsi (red5, Ant Media).
Media Servers: MCUs
MCUs need to decrypt the stream, reconstruct the encoded frame, decode it to get the raw media frame,
apply some transformation to it, and then do everything in reverse.
● MCUs always remove the encryption and are therefore not secure by design.
● This process represents an extra 80% CPU consumption.
● Transcoding at least doubles the media processing latency on the path.
● If the incoming and outgoing codecs are different, it’s called transcoding.
● If the incoming and outgoing protocols are different, it’s called transmuxing.
● It is very versatile, and can reduce bandwidth usage to a minimum (linear in the # of users).
MCUs are mandatory when transcoding RTMP, RTSP, or HLS/MPEG-DASH to/from WebRTC. We are
using an MCU configuration of Medooze for the corresponding media servers.
For a very long time, bitrate adaptation for streaming was done by reducing the resolution on the server.
That requires transcoding: scaling down the frame in the middle of the process. They call it ABR. ABR is
thus intrinsically insecure, and costly both in terms of CPU footprint and overall latency.
In VoIP, where a single video stream is accepted by default, MCUs are used to stitch several streams
together and send the result to a SIP client.
Media Servers: SFUs
SFUs (Selective Forwarding Units) do just that: choose a stream and relay it.
● It is very lightweight since it does not process media, i.e. no encoding or decoding.
● Adding support for a new codec is trivial, since no encoding/decoding takes place.
● It is very well suited for multiparty conferences, as it just duplicates the incoming packets, rewrites the
destination address, and pushes them out (see the relay sketch after this list).
● It is also insecure by default: to be able to send to multiple peers, it needs to establish a separate
encrypted connection with each, each with different keys. End-to-end encryption solves this problem
and is implementable with an SFU, while impossible with an MCU. Unless you want to lose the “end-to-end
encrypted” WebRTC claim, you need to add an additional end-to-end encryption layer when using
an SFU. The original claim is null and void if using an MCU.
● Recording would require decoding the media stream. It is either implemented as a packet dump (the RTP
packets themselves are recorded, which allows post-mortem network debugging as well as on-demand
media file generation), or, in a multi-server environment, delegated to a separate MCU.
● Most SFUs have extra reliability features, like a per-outgoing-stream packet cache and a jitter
buffer, and can help improve real-time reliability on bad networks.
● They are especially powerful when coupled with simulcast (any codec) or SVC (special codecs).
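To make the "just relay it" point concrete, here is a deliberately simplified forwarding loop in Node.js. It only fans packets out byte for byte; a real SFU also terminates DTLS-SRTP per peer (hence the separate keys mentioned above), rewrites SSRCs and sequence numbers, handles RTCP, and keeps a packet cache. The addresses and ports are hypothetical.

```ts
// Minimal "selective forwarding" sketch: receive RTP over UDP on an ingest
// port and fan each packet out to every subscriber, with no decode/encode.
import * as dgram from "node:dgram";

type Subscriber = { address: string; port: number };
const subscribers: Subscriber[] = [
  { address: "203.0.113.10", port: 40000 },
  { address: "203.0.113.11", port: 40002 },
];

const sock = dgram.createSocket("udp4");
sock.on("message", (packet) => {
  // The payload is never decoded; the server only routes it. A production
  // SFU would re-encrypt per subscriber and rewrite RTP header fields here.
  for (const sub of subscribers) {
    sock.send(packet, sub.port, sub.address);
  }
});
sock.bind(5004); // ingest port for the source's RTP
```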
Media Server: Recording solution: SFUs + MCUs
● Keep the main path SFU-only (transcoding-free) for minimum latency!
● Push the storage and the MCU (mixer) out to the outgoing leg that requires it, and only that one.
● Keep pcap as the main storage format: it allows debugging, contains media AND network info, and can be stored E2E-encrypted.
● RTMP ingest, if video-only and single-resolution, can be done with an SFU. For other codec support, and/or audio support, and/or simulcast support, you need an MCU. Here again, keep it off the main path.
Media Servers: Bandwidth Adaptation: ABR
● ABR makes separate, parallel encodings of the same media source and delivers the best one a
viewer can consume. By chunking the encoded media, one can adapt the delivered resolution
dynamically, with a reaction time equivalent to the chunk size (90s by default with HLS). It’s a server-side
feature for viewer-side optimization. Transcoding is needed, doubling the media latency. It
usually requires storage of the chunks prior to delivery, adding to the latency. (A sample manifest follows.)
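For illustration, an ABR origin exposes its parallel renditions through a manifest such as this hypothetical HLS multivariant playlist; the player picks the variant that fits its measured bandwidth (bitrates and paths here are made up):

```
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
low/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2500000,RESOLUTION=1280x720
mid/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080
high/index.m3u8
```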
Media Servers: Bandwidth Adaptation: Simulcast
● Simulcast is exactly the same thing, but done on the sender side. It removes the need to manipulate
the media on the server (so one can now use an SFU). It’s a sender-side feature for viewer-side
optimization. Its reaction time depends on the time to the next full frame, usually 10s by default, but
smart implementations generate a full frame on demand. It also removes the extra
decoding/encoding cycle in the server, halving the media latency. It does not require storage.
It still requires a separate encoder per resolution, which can be problematic on mobile devices where
only one hardware encoder is present. It works with any codec, even old ones (sketched below).
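In the browser, simulcast is requested through `sendEncodings` on the transceiver. A minimal sketch, with illustrative RID names, bitrates, and scale factors:

```ts
// Sender-side simulcast sketch: ask the browser for three parallel encodings
// of the same camera track; the SFU then picks one per viewer.
async function publishSimulcast() {
  const stream = await navigator.mediaDevices.getUserMedia({ video: true });
  const pc = new RTCPeerConnection();
  pc.addTransceiver(stream.getVideoTracks()[0], {
    direction: "sendonly",
    sendEncodings: [
      { rid: "h", maxBitrate: 2_500_000 },                          // full size
      { rid: "m", maxBitrate: 800_000, scaleResolutionDownBy: 2 },  // half size
      { rid: "l", maxBitrate: 250_000, scaleResolutionDownBy: 4 },  // quarter size
    ],
  });
  // Offer/answer with the SFU follows as usual.
}
```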
Media Servers: Bandwidth Adaptation: SVC
● SVC achieves the same thing as simulcast, getting multiple resolutions from a single source, but
using only one encoder, with all the resolutions layered into a single bitstream. The layering also
allows for easier and more robust network resilience. The reaction time in the server to switch
from one resolution to another is on the order of one RTP packet read, around 10ms.
● It requires a special codec. H.264 (AVC), H.265 (HEVC), and VP9 all have an SVC mode. AV1 supports
SVC by default (see the sketch below).
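Browsers expose SVC through the WebRTC-SVC extension: a single encoding with a `scalabilityMode` string instead of several simulcast encodings. This is a sketch; support (mainly VP9/AV1 in Chromium) and TypeScript typings vary by browser version.

```ts
// SVC sketch using the WebRTC-SVC extension: one encoder, one bitstream,
// three spatial and three temporal layers ("L3T3"). The SFU can then drop
// layers per viewer at RTP-packet granularity.
async function publishSvc() {
  const stream = await navigator.mediaDevices.getUserMedia({ video: true });
  const pc = new RTCPeerConnection();
  pc.addTransceiver(stream.getVideoTracks()[0], {
    direction: "sendonly",
    // Cast because older lib.dom typings lack scalabilityMode.
    sendEncodings: [{ scalabilityMode: "L3T3" } as RTCRtpEncodingParameters],
  });
}
```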
Streaming and Media Infrastructure
Old protocols like RTMP and RTSP are not encrypted by default, have no capacity to traverse NATs or
firewalls, and are not adaptive either. Practically, this means that:
● anybody with access can watch,
● you need to open a port in your firewall and NAT ahead of time,
● you need to allocate and reserve enough bandwidth; if you go below it, the call fails.
Streaming protocols solve this by offloading all the traffic onto HTTP, which is never blocked, leaving the
security to the user, and using ABR for adaptability. The storage allows for scalability through CDNs. Using
CDNs and HTTP (cache) then achieved a much lower cost of operation. Each of those solutions
addresses the original problems by adding latency.
WebRTC solved those problems in real time, using respectively ICE, encryption by default, and
simulcast/SVC, but with a p2p model that does not scale so well.
Nowadays, the cost of CDNs and the cost of bandwidth are on par.
Streaming and Media Infrastructure
WebRTC is much closer to RTMP and RTSP in nature than to HLS and MPEG-DASH. They are media
transport protocols that use neither HTTP nor a file format/storage; everything is done in memory and
on the wire.
Given the limitations of the protocols themselves, the only thing you could do with RTMP and RTSP was
to cluster and (auto-)scale. Since there was no firewall / NAT traversal capacity, you would ignore them.
Since there was no adaptability, you would not try to probe the bandwidth or to be a good network
citizen by applying congestion control to yourself. You would assume x Mbps available and reserved for
you at any time, and perfect network conditions. Scaling then becomes a simple matter of connecting
servers to one another in chains from source to consumer/viewer of the media. It’s so simple that free
HTTP proxies like NGINX also implement RTMP routing, relaying, and recording (via the nginx-rtmp module).
With today’s cloud services, providing the capacity to start and stop servers, to distribute the sources
and the viewers across servers, and to route media from source to viewer is enough. You can automate
all of that through what is called “load balancing” and “auto-scaling”.
Streaming and Media Infrastructure: Stadia
[Stadia architecture diagrams]
Streaming and Media Infrastructure
A - Single, stand-alone server (red5, ant media, medooze)
[Diagram: source → stand-alone SFU → viewers. Max: 1K viewers. No load balancing, internal routing, or scaling.]
● If WebRTC, you still need a signalling server for discovery
and handshake, but usually this is integrated in the media server itself (as in all the servers above). Each source,
respectively each viewer, needs to discover and do a handshake with the media server separately.
● If RTMP or RTSP, you just need to provide the IP and PORT of the target computer, and you’re done. Of course,
the port needs to be open, the network needs to be almost perfect, bandwidth needs to be plentiful, and you cannot
upgrade codecs (RTMP and RTSP at best support H.264).
● Load balancing is the process of distributing the streams across several servers. In the case of a single
stand-alone server there is no need for distribution, as both source and viewer will always connect to the same
server. It makes things much simpler, but limits your audience to roughly 1,000 viewers.
● Routing, which is the process of connecting the source and the viewers, is trivial and can be done by the
media server.
● Scaling, which is adding or removing media servers in a cluster or pool of servers, is not needed, since
there is only one media server.
● You need end-to-end encryption if you want to be secure. The distribution of the keys is off-topic here.
Streaming and Media Infrastructure
B - 2 servers in path, no e2e media layer (red5, ant, phenixRTS)
[Diagram: source → ingest SFU → N egress SFUs → viewers, with load balancing, internal routing, and scaling. Max: 1M viewers.]
● Same problems with handshake and discovery:
○ RTMP just connects by IP:PORT.
○ WebRTC needs a signaling server for the source and the viewers to connect to the infrastructure.
● Everything else being equal, this design can handle 1M viewers.
● Problems start to arise here, as the WebRTC bandwidth estimation and congestion control
algorithms fail when used with multi-hop media paths. Noisy-neighbor effects also start to appear.
● The easy solution is to switch back to ABR. It solves all those problems, at the cost of breaking
real-time bandwidth adaptability, adding latency, disabling end-to-end encryption, and basically turning WebRTC
into nothing more than RTMP. It is easy to implement and does not “work” worse than RTMP, while providing
better codecs than RTMP, and native browser support.
Streaming and Media Infrastructure
B - 2 servers in path, no e2e media layer (red5, ant, phenixRTS)
[Same diagram as above, now with a pool of N ingest SFUs in front of the N egress SFUs. Max: 1M viewers.]
● Here you assume that, instead of a single ingest SFU, you have a pool of N of them.
● In simple designs (LUXON v1), one simply adds or removes instances to the pool manually. In more advanced
designs (LUXON v2, red5, ant, phenixRTS), the pools auto-scale, i.e. servers are added or removed
on demand depending, e.g., on the load.
● Load balancing manages how sources, respectively viewers, are directed to a specific server. In simple designs
(LUXON v1), one uses a round-robin algorithm which sends each new connection to the next server in a list of
servers (in order) and, on reaching the end of the list, resumes at the beginning (see the sketch below). It works well when
the number of servers is fixed and all servers are idempotent.
● Routing becomes more complicated here. In simple designs (LUXON v1), we would just relay the ingest
stream to all egress SFUs. We would then be sure a viewer can connect to any egress SFU and have any
stream available. Obviously, that does not scale well and wastes a lot of bandwidth. In LUXON v2, we have a
“director API” which does the routing and the load balancing on demand. (slides exist)
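The round-robin policy mentioned above fits in a few lines. A minimal sketch; the server names are hypothetical, and a real auto-scaling pool would need a load-aware policy instead, since the list changes under it.

```ts
// Round-robin load balancer sketch: each new connection goes to the next
// server in a fixed list, wrapping around at the end.
class RoundRobinBalancer {
  private next = 0;
  constructor(private servers: string[]) {}

  pick(): string {
    const server = this.servers[this.next];
    this.next = (this.next + 1) % this.servers.length;
    return server;
  }
}

// Hypothetical usage: hand each incoming viewer an egress SFU.
const lb = new RoundRobinBalancer(["sfu-1.example", "sfu-2.example", "sfu-3.example"]);
const assigned = lb.pick(); // "sfu-1.example", then "sfu-2.example", ...
```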
Streaming and Media Infrastructure
C - 2+ servers in path, no e2e media layer
[Diagram: source → N ingest SFUs → N relay SFUs → N egress SFUs → viewers, with load balancing, internal routing, and scaling at each level. Max: 1G viewers.]
● Having an extra media server in the path allows for extra scalability
(up to 1 trillion viewers theoretically).
● It also allows for smarter routing and better connection management, including multi-cloud.
● We have all 3 configurations with Medooze. We have a version of LUXON like this.
Streaming and Media Infrastructure
D - 2 servers in path, with e2e media layer (LUXON)
● The problems start to arise here, as the WebRTC bandwidth estimation and congestion control algorithms
fail when used with multi-hop media paths. Noisy-neighbor effects also start to appear (slide).
● The solution is to take into account the network status and capacity at each server in the media path.
● One such algorithm is called BBR (Bottleneck Bandwidth and Round-Trip Time). It’s conceptually easy to
understand: you should not send more than the weakest network in the path can handle (sketched below).
● This is, e.g., what Google uses in Stadia, and what is used in Facebook’s QUIC implementation.
● It requires a full-monitoring, interactive, E2E media layer on top of the media servers, in addition to
scaling/routing. It cannot be in a single server, by design. It’s a media-infrastructure-level feature.
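The core idea reduces to a minimum over per-hop estimates, which is why it needs visibility into every server on the path. A sketch with made-up numbers; a real BBR implementation also tracks round-trip propagation time and keeps probing for changes.

```ts
// Bottleneck-bandwidth sketch: the sustainable rate of a multi-hop media
// path is the minimum of the per-hop bandwidth estimates, so the sender
// should pace itself to the weakest link.
interface HopEstimate {
  hop: string; // e.g. "source->ingest", "ingest->relay"
  bandwidthKbps: number;
}

function pathBottleneck(hops: HopEstimate[]): HopEstimate {
  // Assumes a non-empty path; returns the weakest hop.
  return hops.reduce((min, h) => (h.bandwidthKbps < min.bandwidthKbps ? h : min));
}

const hops: HopEstimate[] = [
  { hop: "source->ingest", bandwidthKbps: 8000 },
  { hop: "ingest->relay", bandwidthKbps: 50000 },
  { hop: "relay->egress", bandwidthKbps: 50000 },
  { hop: "egress->viewer", bandwidthKbps: 3500 },
];
// Here the viewer's last mile (3500 kbps) caps the whole path.
const bottleneck = pathBottleneck(hops);
```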
Streaming and Media Infrastructure: Stadia
[Diagram: a full-monitoring, interactive, E2E media layer, separated from the media servers and separated from scaling/routing.]
Streaming and Media Infrastructure
D - 2+ servers in path, with e2e media layer (LUXON)
[Diagram: source → N ingest SFUs → N relay SFUs → N egress SFUs → viewers, with load balancing (LB), smart routing, and scaling at each level. Overlaid layers:]
● Server (Medooze™) & clients/SDKs layer, including E2E encryption and watermarking (very differentiating).
● Infrastructure scalability layer: “Director API”™ (not differentiating).
● Infrastructure BWE and CC layer, based on bandwidth bottleneck and RTT + other black magic: “RT Media Enabler”™ (UNIQUE).
LUXON covers the media only.
Media Streaming PaaS - Millicast
[Diagram: the same media topology (source → N ingest SFUs → N relay SFUs → N egress SFUs → viewers, with LB and smart routing), plus the platform layers:]
● Server (Medooze™) & clients/SDKs layer, including E2E encryption and watermarking (very differentiating, as traditionally most vendors are server vendors with no native SDK capacity).
● Infrastructure scalability/logic layer: “Director API”™ (not differentiating).
● Infrastructure BWE and CC layer, based on bandwidth bottleneck and RTT + other black magic: “RT Media path manager”™ (UNIQUE).
● Platform layer: “Influxis Synapse”™ (little differentiation). EXT: OAuth, tokens, analytics, billing, payment, all APIs. INT: monitoring, notifications, recording/storage. Hosting / cloud management.
RTMP Ingest support - in production
Core WebRTC media path for minimum latency; RTMP first mile.
[Diagram: an RTMP/RTSP source feeds N ingest MCUs, which join the core WebRTC path: ingest SFU → N relay SFUs → N egress SFUs → viewers. A WebRTC source feeds the ingest SFU directly.]
● NOW: No video transcoding (=> no simulcast, no SVC support) (=> only supports H.264).
● NOW: Audio transcoding to Opus stereo at best (can lose quality if the input is 5.1 AAC).
● NOW: Added latency from media processing: 1 frame at most (1/30 s ≈ 33 ms).
● PLANNED: Video transcoding with simulcast / SVC, and the capacity to choose the codec.
● PLANNED: Audio transcoding to multiOpus if the input is 5.1 or 6.1. No loss of spatial quality.
RTMP/HLS/MPEG-DASH egress support
Core WebRTC media path for minimum latency.
[Diagram: WebRTC source → ingest SFU → N relay SFUs → N egress SFUs → viewers. Off the main path, N RTMP MCUs feed social sites (Facebook, Twitch) and N HLS MCUs feed a CDN.]
● The media part (the Medooze transcoding MCU) is done.
● The difficult part now is the logic and the routing needed to integrate it in LUXON.
● Then, wiring it into Millicast with the business logic (accounting, billing, authentication, …).
● Connecting to FB or a CDN could and should be automated as well.
Server-Side Ad-Insertion
Core WebRTC media path for minimum latency.
● The media part (the Medooze transcoding MCU) is done.
● The difficult part now is the logic and the routing needed to integrate it in LUXON.
● Then, wiring it into Millicast with the business logic (accounting, billing, authentication, …).
● The ad targeting is missing, pending viewer analytics availability. Collaboration project with DataZoom.
[Diagram: the live stream flows from the source over WebRTC through the SFU (Medooze) to the viewer. In parallel, an ad request/response goes to an ad-decision service (VAST); the ad content is pulled from ad storage over HLS/MPEG-DASH, transcoded by a Medooze transcoder to WebRTC, and delivered to the viewer as live stream + ad over WebRTC (steps 1-7 in the original figure).]
ANNEX 4: Other Features
● E2EE with SFrame. Ready to integrate. Native apps only for now (see the sketch below).
● multiOpus. Ready to integrate. Web and native. Hidden in Chrome; risk that they remove it.
● Watermarking. Trivial to integrate (client side only).
Native apps only for now. Needs more research on the watermarking
algorithm, currently not robust enough. Needs Ludo time. Slow (R&D).
● AV1: not ready to integrate; wait for SVC from Google.
● WebXR with RTP-timestamp-based sync. No POC yet.
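On the web side, the usual hook for an SFrame-style E2EE layer is encoded insertable streams. A sketch only: the `encodedInsertableStreams` flag and `createEncodedStreams()` are Chromium-specific and non-standard (Safari/Firefox expose `RTCRtpScriptTransform` instead), and `encryptFrame` is a placeholder where the actual SFrame transform would go.

```ts
// E2EE sketch with encoded insertable streams: encoded frames are
// additionally encrypted before RTP packetization, so the SFU only ever
// relays ciphertext it cannot decode.
const pc = new RTCPeerConnection({
  encodedInsertableStreams: true, // non-standard Chromium flag
} as RTCConfiguration);

function encryptFrame(frame: any, controller: TransformStreamDefaultController) {
  // Placeholder: real code would apply an SFrame transform to frame.data.
  controller.enqueue(frame);
}

async function publishE2ee(track: MediaStreamTrack, stream: MediaStream) {
  const sender = pc.addTrack(track, stream);
  // Chromium's non-standard encoded-streams API.
  const { readable, writable } = (sender as any).createEncodedStreams();
  readable
    .pipeThrough(new TransformStream({ transform: encryptFrame }))
    .pipeTo(writable);
}
```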