WebRTC for Audio Spaces: Key Architecture Considerations

git commit -m "always keep learning"
Well, I’m no expert, but I have been playing with WebRTC and related technologies, such as Asterisk, ever since I joined tech professionally.
Along the way I have seen experiments, different strategies for scaling web conferences, and plenty of “fancy” tools. Throughout my career, I’ve come to truly value understanding the fundamentals, especially the workings of the underlying infrastructure. In this blog, we’ll dive into the specifics of scaling audio streaming applications or features, such as X (formerly Twitter) Spaces.

WebRTC Architecture
Without wasting much of your time: SFU vs MCU is a war that can be debated till Christ returns to the world for the believers.
A Selective Forwarding Unit (SFU), unlike a P2P mesh, is a media server that receives media from each party in a conference call, decides which streams should be forwarded to the other parties, and then forwards them. So if you have three clients on a call, each client will have two remote streams and one local stream.
A Multipoint Control Unit (MCU) works by mixing multiple media streams into a single stream. The transcoding and mixing mean heavy server-side processing, hence you need a big server for this, but the result is lower bandwidth usage for clients at the cost of higher latency. An SFU will save you on server bills and keep latency low, while pushing more bandwidth onto clients.
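The stream-count arithmetic above can be sketched in a few lines. This is a minimal illustration, assuming every participant both sends and receives audio; the numbers are stream counts, not bitrates:

```typescript
// Per-client stream counts for a call of n participants.
function streamCounts(n: number, mode: "SFU" | "MCU") {
  if (mode === "SFU") {
    // Each client uploads 1 local stream and downloads one remote
    // stream from each of the other n - 1 participants.
    return { uplinkPerClient: 1, downlinkPerClient: n - 1 };
  }
  // MCU: the server mixes everything into a single stream,
  // so each client uploads 1 stream and downloads exactly 1.
  return { uplinkPerClient: 1, downlinkPerClient: 1 };
}

// Three clients on an SFU call: 2 remote streams + 1 local each.
console.log(streamCounts(3, "SFU"));
// Fifty clients on an MCU call: still just 1 downlink stream each.
console.log(streamCounts(50, "MCU"));
```

Notice how the SFU's downlink grows linearly with participants while the MCU's stays flat; that is the whole trade-off in one function.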
Industry Recommendations

For calls with fewer than 5 participants on average, use an SFU for its efficiency and low latency. If the call involves more than 5 participants, opt for an MCU to prioritize bandwidth efficiency.
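As a sketch, that recommendation boils down to a one-line mode selector. The 5-participant cut-off is the rule of thumb from above, not a hard constant; tune it against your own measurements:

```typescript
type Mode = "SFU" | "MCU";

// Hypothetical threshold taken from the rule of thumb above.
const SFU_MAX_PARTICIPANTS = 5;

function selectMode(participants: number): Mode {
  return participants <= SFU_MAX_PARTICIPANTS ? "SFU" : "MCU";
}

console.log(selectMode(3));  // small late-night room
console.log(selectMode(50)); // peak-hours room
```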
Wondering why I’m not talking about audio vs video vs audio + video? Focus, guys, we are building the next-generation X audio spaces.
Our Million dollar Audio Rooms app
Problem: sometimes our rooms have fewer than 5 people in them. It’s the middle of the night and the night owls want to catch up; there is no shame in that. Automatically you would recommend an SFU. But there is a catch: during peak hours, rooms have over 50 people in them on average, which automatically rules out the SFU in favour of the MCU.
My Implementation Strategy: Use both SFU and MCU and upgrade or downgrade based on the number of participants. This dynamic mode selection isn’t straightforward, as two things need to be catered for:
Peer connections: separate SFU and MCU connections are required.
Signaling Server: another important piece of the puzzle. Use your signaling server to send control messages to clients indicating when to switch modes, then trigger WebRTC renegotiation to switch between SFU and MCU.
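The signaling side of that switch can be sketched as follows. The message shape and the `send` callback are my own assumptions for illustration, not any standard; the point is that the server decides the mode and tells every client to renegotiate:

```typescript
type Mode = "SFU" | "MCU";

// Hypothetical control-message shape for our signaling protocol.
interface SwitchModeMessage {
  type: "switch-mode";
  mode: Mode;
}

// Called whenever the room's participant count changes. Returns the
// mode the room should now be in, and broadcasts a control message
// only when the mode actually changes.
function maybeSwitchMode(
  currentMode: Mode,
  participants: number,
  broadcast: (msg: SwitchModeMessage) => void,
  threshold = 5,
): Mode {
  const wanted: Mode = participants <= threshold ? "SFU" : "MCU";
  if (wanted !== currentMode) {
    // Clients react to this by tearing down the current peer
    // connection and renegotiating (new offer/answer) against the
    // other media path.
    broadcast({ type: "switch-mode", mode: wanted });
  }
  return wanted;
}
```

Returning the new mode (rather than mutating room state inside the function) keeps this easy to unit-test and keeps the room object the single source of truth.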
For our use case, here is what I would recommend:
Primary Model: SFU (Selective Forwarding Unit):
Speakers (e.g., 10–15 active participants): Use SFU to forward streams directly to other participants. This ensures low latency and reduces server-side processing.
Listeners (e.g., 85+ passive participants): Instead of routing individual streams via SFU, mix speaker streams into a single audio stream and broadcast it (similar to an MCU for listeners only). This approach minimizes bandwidth usage for passive participants.
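A back-of-the-envelope sketch shows why mixing for listeners pays off. This compares total server egress, counted in streams, between forwarding every speaker stream to everyone and sending listeners a single mix (my own simplification, ignoring bitrate differences):

```typescript
// Total outbound streams the server must push for a room with
// `speakers` active and `listeners` passive participants.
function egressStreams(
  speakers: number,
  listeners: number,
  mixedForListeners: boolean,
): number {
  // Each speaker still receives the other speakers' streams via SFU.
  const toSpeakers = speakers * (speakers - 1);
  const toListeners = mixedForListeners
    ? listeners // one mixed stream per listener
    : listeners * speakers; // every speaker stream to every listener
  return toSpeakers + toListeners;
}

// 15 speakers, 85 listeners:
console.log(egressStreams(15, 85, false)); // 15*14 + 85*15 = 1485
console.log(egressStreams(15, 85, true));  // 15*14 + 85    = 295
```

Roughly a 5x reduction in egress for this room shape, and the gap widens as the listener count grows.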
Dynamic Role Switching: allow participants to move between "listener" and "speaker" roles dynamically:
Listener → Speaker: Promote them to the SFU layer for real-time audio.
Speaker → Listener: Drop their individual SFU stream and include them in the mixed broadcast.
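The two transitions above can be modeled as a tiny state machine that returns the actions the media layer needs to perform. The action names here are illustrative placeholders, not a real API:

```typescript
type Role = "listener" | "speaker";

// Returns the ordered list of media-layer actions needed to move a
// participant from one role to the other.
function switchRole(from: Role, to: Role): string[] {
  if (from === to) return []; // nothing to do
  return to === "speaker"
    ? // Promote: start producing into the SFU layer, detach from the mix.
      ["detach-from-mixed-broadcast", "create-sfu-producer"]
    : // Demote: stop the individual SFU stream, rejoin the mixed broadcast.
      ["close-sfu-producer", "attach-to-mixed-broadcast"];
}
```

Keeping the transition logic pure like this makes it trivial to test, and the media server (mediasoup, Janus, etc.) only has to interpret the resulting action list.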
Backup Model: MCU for fallback (if you anticipate >100 participants regularly).
WebRTC Media Server
Use a media server that supports both SFU and optional mixing (MCU-like) capabilities:
Mediasoup: A powerful SFU framework with simulcast support. You can simulate an MCU by mixing streams for listeners using additional components like FFmpeg.
Janus: Another flexible SFU with plugins for audio mixing.
Kurento: Supports both SFU and MCU features natively.
Honorable mention: the XDN architecture seems to be doing what I’m proposing above. I have yet to test it personally, but I will in the coming days.

That’s a wrap, see you on the next one… Adios ✌️


