The Ultimate Guide to WebRTC
What is WebRTC?
Web Real-Time Communication (WebRTC) is both an open-source project and specification that enables real-time media communications like voice, video and data transfer natively between browsers and devices. This provides users with the ability to communicate from within their primary web browser without the need for complicated plug-ins or additional hardware.
The WebRTC project was first announced by Google in May 2011 as a means of developing a common set of protocols for enabling high-quality RTC applications within browsers, mobile platforms and IoT devices. At the time, Flash and plug-ins were the only methods of offering real-time communication. Two years and considerable work later, the first cross-browser video call was established between Chrome and Firefox. Support for WebRTC in the developer community has since skyrocketed as more and more organizations add support for the specification. Today, WebRTC is available natively (to varying degrees) in Chrome, Firefox, Safari, Edge, Android and iOS and is a widely popular tool for video calling.
There are three primary components of the WebRTC API (MediaStream, RTCPeerConnection and RTCDataChannel), and each plays a unique role in the WebRTC specification:
The Peer Connection is the core of the WebRTC standard. It provides a way for participants to create direct connections with their peers without the need for an intermediary server (beyond signalling). Each participant takes the media acquired from the media stream API and plugs it into the peer connection to create an audio or video feed. The PeerConnection API has a lot going on behind the scenes—it handles SDP negotiation, codec implementations, NAT Traversal, packet loss, bandwidth management and media transfer.
The RTCDataChannel API was set up to allow bi-directional transfer of any type of data, media or otherwise, directly between peers. It was designed to mimic the WebSocket API; however, rather than relying on a TCP connection (which is reliable but also high in latency and prone to bottlenecks), data channels use UDP-based streams with the configurability of the Stream Control Transmission Protocol (SCTP). This design allows the best of both worlds: reliable delivery like in TCP but with reduced congestion on the network like in UDP.
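The reliability trade-off described above is chosen per channel. As a rough sketch, the Python dataclass below mirrors the `ordered`, `maxRetransmits` and `maxPacketLifeTime` fields of the browser's `RTCDataChannelInit` dictionary. The class name `DataChannelOptions` and the `mode()` helper are hypothetical and exist only for illustration; they are not part of any WebRTC library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataChannelOptions:
    """Hypothetical stand-in for the browser's RTCDataChannelInit dictionary."""
    ordered: bool = True                        # deliver messages in send order
    max_retransmits: Optional[int] = None       # cap on retransmission attempts
    max_packet_life_time: Optional[int] = None  # cap on retransmission time, in ms

    def __post_init__(self) -> None:
        # The WebRTC API rejects a channel that sets both limits at once.
        if self.max_retransmits is not None and self.max_packet_life_time is not None:
            raise ValueError("set max_retransmits or max_packet_life_time, not both")

    def mode(self) -> str:
        """Describe the delivery guarantee these options ask SCTP for."""
        if self.max_retransmits is None and self.max_packet_life_time is None:
            reliability = "reliable"
        else:
            reliability = "partially reliable"
        order = "ordered" if self.ordered else "unordered"
        return f"{reliability}, {order}"


# TCP-like behaviour: every message arrives, and in order.
chat = DataChannelOptions()
# UDP-like behaviour: never retransmit, never wait for earlier messages.
game_state = DataChannelOptions(ordered=False, max_retransmits=0)
```

Here `chat.mode()` describes a reliable, ordered channel suited to text messages, while `game_state` gives up both guarantees in exchange for lower latency.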
Before a peer-to-peer video call can begin, a connection between the two clients needs to be established. This is accomplished through signalling. Signalling falls outside of the realm of the WebRTC specification but is the vital first step in establishing an audio/video connection.
Signalling allows two endpoints (senders, receivers or both) to exchange metadata to coordinate communication and set up a call. For example, before two endpoints can start a video call, one side has to call the other and the called side has to respond. This call-and-response message flow (also known as offer-answer message flow) contains critical details about the streaming that will take place (e.g. the number and types of streams, how the media will be encoded, etc.) and is often formatted using the Session Description Protocol (SDP). SDP is a standard format used by many real-world systems, including VoIP and WebRTC.
This is needed for two reasons:
- Generally, the peers do not know each other’s capabilities.
- Generally, the peers do not know each other’s network addresses.
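To make the offer-answer exchange more concrete, here is a minimal sketch in Python of how an answering endpoint might read the codecs out of an offer and keep only the ones it supports. The SDP text is abbreviated and hand-written, and the function names `offered_codecs` and `choose_codecs` are hypothetical; real offers carry many more attributes and real stacks apply more rules than a simple intersection.

```python
import re
from typing import Dict, List, Optional

# Abbreviated, hand-written SDP; real offers carry many more lines.
OFFER_SDP = """\
v=0
m=audio 9 UDP/TLS/RTP/SAVPF 111 0
a=rtpmap:111 opus/48000/2
a=rtpmap:0 PCMU/8000
m=video 9 UDP/TLS/RTP/SAVPF 96 102
a=rtpmap:96 VP8/90000
a=rtpmap:102 H264/90000
"""

# Matches lines such as "a=rtpmap:111 opus/48000/2".
RTPMAP = re.compile(r"^a=rtpmap:(\d+) ([^/]+)/(\d+)")


def offered_codecs(sdp: str) -> Dict[str, List[str]]:
    """Map each media kind to its codec names, in the order the rtpmap lines appear."""
    codecs: Dict[str, List[str]] = {}
    kind: Optional[str] = None
    for line in sdp.splitlines():
        if line.startswith("m="):
            # "m=audio 9 ..." starts a new media section of kind "audio".
            kind = line[2:].split(" ", 1)[0]
            codecs[kind] = []
        elif kind is not None:
            match = RTPMAP.match(line)
            if match:
                codecs[kind].append(match.group(2))
    return codecs


def choose_codecs(offer: Dict[str, List[str]],
                  supported: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Keep only the offered codecs that the answering side also supports."""
    return {
        kind: [name for name in names if name in supported.get(kind, [])]
        for kind, names in offer.items()
    }


offer = offered_codecs(OFFER_SDP)
# Hypothetical answering endpoint that only supports PCMU audio and VP8 video.
answer = choose_codecs(offer, {"audio": ["PCMU"], "video": ["VP8"]})
```

The result in `answer` is the set of codecs both sides can use, which is the kind of information the answer carries back to the caller.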
NAT Traversal – ICE, TURN and STUN
Once the initial signalling for a streaming connection has taken place, the two endpoints need to begin the process of NAT (Network Address Translation) traversal. When NAT assigns a public address to a computer inside a private network, it can cause difficulties for setting up a real-time video connection. NAT traversal is a method for getting around the issues associated with IP address translation.
In a WebRTC-enabled video call, unless the two endpoints are on the same local network, there will be one or more intermediary network devices (routers/gateways) between the two. There are three key specifications that are used in WebRTC to overcome these hurdles:
- Interactive Connectivity Establishment (ICE) – ICE is used to find all the ways for two computers to talk to each other. Its two main roles are gathering candidates and checking connectivity. If there is a path for two clients to communicate, ICE is designed to find it and to prefer the most efficient one, using two protocols: STUN and TURN.
- Session Traversal Utilities for NAT (STUN) – STUN is a lightweight and simple method for NAT traversal. It allows WebRTC clients to find their own public IP address by making a request to a STUN server.
- Traversal Using Relays around NAT (TURN) – The TURN server assists in the NAT traversal by helping the endpoints learn about the routers on their local networks, as well as blindly relaying data for one of the endpoints where a direct connection is not possible due to firewall restrictions.
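Two small pieces of this machinery can be sketched in Python: the priority formula that the ICE specification (RFC 8445) recommends for ranking candidates, and the 20-byte header of a STUN binding request. This is a simplified illustration. Real ICE agents rank candidate pairs rather than single candidates, and the candidate list below is made up using documentation-only IP addresses. The function names are hypothetical.

```python
import os
import struct

# Type preferences recommended by the ICE specification (RFC 8445).
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}

STUN_BINDING_REQUEST = 0x0001
STUN_MAGIC_COOKIE = 0x2112A442


def candidate_priority(kind: str, local_preference: int = 65535,
                       component_id: int = 1) -> int:
    """priority = 2^24 * type preference + 2^8 * local preference + (256 - component ID)."""
    return (TYPE_PREFERENCE[kind] << 24) + (local_preference << 8) + (256 - component_id)


def stun_binding_request(transaction_id: bytes) -> bytes:
    """Build a STUN binding request with no attributes (header only)."""
    if len(transaction_id) != 12:
        raise ValueError("a STUN transaction ID is exactly 12 bytes")
    # Message type, message length (0: no attributes), magic cookie, transaction ID.
    return struct.pack("!HHI12s", STUN_BINDING_REQUEST, 0,
                       STUN_MAGIC_COOKIE, transaction_id)


# Made-up candidates: a TURN relay, a local address and a STUN-discovered address.
candidates = [
    ("relay", "203.0.113.7"),
    ("host", "192.168.1.20"),
    ("srflx", "198.51.100.4"),
]
# Highest priority first: direct paths are preferred over relayed ones.
ranked = sorted(candidates, key=lambda c: candidate_priority(c[0]), reverse=True)

request = stun_binding_request(os.urandom(12))
```

With these preferences the host candidate ranks first, the STUN-discovered (server reflexive) candidate second and the TURN relay last, which matches the idea that a relay is used only when a direct connection is not possible.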
Before sending the media over a peer connection, it has to be compressed. Raw audio and video are simply too large to send efficiently over our current Internet infrastructure. Likewise, after receiving media over a peer connection, it has to be decompressed. A media codec (coder-decoder) does exactly this.
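A quick back-of-the-envelope calculation shows why compression is needed. The Python sketch below assumes 8-bit YUV 4:2:0 video (12 bits per pixel on average) and 16-bit PCM audio; both are common formats for uncompressed media, but they are assumptions made for this example rather than something WebRTC mandates.

```python
def raw_video_bps(width: int, height: int, fps: int, bits_per_pixel: int = 12) -> int:
    """Bits per second of uncompressed video (12 bits/pixel assumes 8-bit YUV 4:2:0)."""
    return width * height * fps * bits_per_pixel


def raw_audio_bps(sample_rate_hz: int, channels: int, bits_per_sample: int = 16) -> int:
    """Bits per second of uncompressed PCM audio."""
    return sample_rate_hz * channels * bits_per_sample


video_bps = raw_video_bps(1280, 720, 30)  # 720p at 30 frames per second
audio_bps = raw_audio_bps(48_000, 2)      # 48 kHz stereo

print(f"raw 720p30 video: {video_bps / 1_000_000:.1f} Mbps")
print(f"raw 48 kHz stereo audio: {audio_bps / 1_000_000:.2f} Mbps")
```

Under these assumptions a single uncompressed 720p stream needs roughly 332 Mbps and stereo audio roughly 1.5 Mbps, far more than most connections can spare for one participant.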
WebRTC has mandated three audio codecs and two video codecs:
- Audio – PCMU (G.711μ) running at 8,000Hz with a single channel (mono).
- Audio – PCMA (G.711a) running at 8,000Hz with a single channel (mono).
- Audio – Opus running at 48,000Hz with two channels (stereo).
- Video – VP8.
- Video – H.264/AVC using Constrained Baseline Profile Level 1.2.
Newer media codecs like VP9 and H.265 could be added to the WebRTC standard in the future, but for now they are not mandatory. RTC experts such as LiveSwitch’s Professional Services team are often able to add support for custom and newer codecs to meet a customer’s requirements.
The peer-to-peer (mesh) topology is the only connection type that is covered in the WebRTC specification. However, there are many use cases where a mesh topology is insufficient. Server-based topologies can help address these drawbacks and are often used within the world of WebRTC for transferring media. The best topology for any given application depends largely on the expected use cases, as each one has its own unique set of benefits and drawbacks.
Peer-to-Peer (P2P) Architecture
- Lowest operating cost and excellent for simple use cases.
- Peers open connections directly to each other.
- Video is sent to each peer individually.
- Servers only involved for signalling and TURN/TURNS.
- CPU-intensive as conference size grows.
- Recording is difficult without a central server.
- Each participant uses more network bandwidth.
In a peer-to-peer or mesh topology, each participant in a session directly connects to all other participants without the use of a server. This type of connection is perfect for small video conferences as it is the lowest cost and easiest to set up. However, when conferences grow larger, maintaining direct connections between all participants becomes unsustainable as it can become too CPU-intensive. Since the connections are direct between peers, a mesh topology also doesn’t work well for recording.
For these reasons, a mesh topology is best for simple applications that connect two to three participants where low latency is important and where recording isn’t required.
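The scaling problem with a mesh comes down to simple arithmetic, sketched below in Python. The function names are hypothetical; the formulas follow directly from every participant connecting to every other participant.

```python
def mesh_connections(participants: int) -> int:
    """Total peer connections in a full mesh: one per pair of participants."""
    return participants * (participants - 1) // 2


def mesh_streams_per_peer(participants: int) -> int:
    """Copies of its own video each peer must encode and upload."""
    return participants - 1


for n in (2, 3, 6, 10):
    print(f"{n} participants: {mesh_connections(n)} connections, "
          f"{mesh_streams_per_peer(n)} uploads per peer")
```

Three participants need only 3 connections in total, but ten participants need 45, and each of the ten has to upload nine copies of its own stream.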
Examples of potential use cases include:
Selective Forwarding (SFU) Architecture
- Every participant sends their stream to the server (upstream).
- Every peer can choose to open a downstream connection to receive another participant’s video.
- Most popular connection type: allows older devices and rural participants with poor internet connectivity to actively participate.
- Requires less server CPU power than an MCU because streams are forwarded rather than mixed.
In a selective forwarding topology, each participant in a session connects to a server that acts as a selective forwarding unit (SFU). Each participant uploads their encrypted video stream to the server once, and the server then forwards those streams to each of the other participants. This reduces latency and also permits transcoding, recording, and other server-side integrations such as SIP that would be much more difficult in a peer-to-peer connection.
The SFU topology does have limits. While having a single upstream connection makes it more upload-efficient than a mesh topology, having multiple downstream connections means each client will eventually run out of resources once a certain number of participants is active in the session.
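The upload saving and the remaining download cost can be estimated with a short Python sketch. The figure of 1,000 kbps per video stream is a hypothetical value chosen for the example, and the function names are illustrative only.

```python
from typing import NamedTuple


class ClientLoad(NamedTuple):
    upload_kbps: int
    download_kbps: int


def mesh_load(participants: int, stream_kbps: int) -> ClientLoad:
    """In a mesh, each client sends to and receives from every other client."""
    others = participants - 1
    return ClientLoad(others * stream_kbps, others * stream_kbps)


def sfu_load(participants: int, stream_kbps: int) -> ClientLoad:
    """With an SFU, each client uploads once but still downloads every other stream."""
    return ClientLoad(stream_kbps, (participants - 1) * stream_kbps)


STREAM_KBPS = 1000  # hypothetical bitrate of one encoded video stream

for n in (4, 8, 16):
    print(n, "mesh:", mesh_load(n, STREAM_KBPS), "sfu:", sfu_load(n, STREAM_KBPS))
```

With eight participants, for instance, the SFU cuts each client's upload from 7,000 kbps to 1,000 kbps, while the download stays at 7,000 kbps. That download side is what eventually limits session size.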
For these reasons, SFU topology is best for applications that connect 4 to 10 participants, where low latency is important or where recording is required and integrity is critical. This topology is generally considered the most balanced.
Examples of potential use cases include:
Multi-Point Control (MCU) Architecture
- Reduces required participant upload bandwidth.
- Permits transcoding, recording and wider device support.
- Every participant uploads their stream to the server.
- The server processes all the streams and sends one back to each participant.
- Shifts some CPU load from participants’ devices to the server.
In a Multipoint Control Topology, each participant in a session connects to a server that acts as a multipoint control unit (MCU). The MCU receives media from each participant and decodes it, mixing the audio and video from the participants together into a single stream which is in turn encoded and sent to each participant. This requires less bandwidth usage and device CPU, but it does require additional server CPU for mixing audio/video into single streams. MCUs are also a great option for dealing with poor network conditions as they provide the lowest possible bandwidth usage for each individual participant.
For these reasons, a multipoint control topology is best for large-scale applications that connect large numbers of participants, need to accommodate poor network conditions, or where recording is required and integrity is critical.
Examples of potential use cases include:
- Large-scale broadcasting of multiple input streams
- Virtual classrooms
- Video conferencing with clients in remote areas or with underpowered devices
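The extra server CPU that an MCU needs, compared with a selective forwarding unit, can be sketched as a count of media operations. The Python below is a simplified model with hypothetical names. It assumes one decode per uploaded stream and either one shared mix for everyone or one personal mix per participant; real servers vary.

```python
from typing import NamedTuple


class ServerWork(NamedTuple):
    decodes: int  # uploaded streams the server must decode
    encodes: int  # mixed streams the server must encode
    relayed: int  # stream copies passed through without re-encoding


def sfu_work(participants: int) -> ServerWork:
    """An SFU passes each upload on to every other participant untouched."""
    return ServerWork(decodes=0, encodes=0, relayed=participants * (participants - 1))


def mcu_work(participants: int, shared_mix: bool = True) -> ServerWork:
    """An MCU decodes every upload and encodes the mixed result.

    With shared_mix=True everyone receives the same composite, so one
    encode is enough; otherwise each participant gets a personal mix.
    """
    encodes = 1 if shared_mix else participants
    return ServerWork(decodes=participants, encodes=encodes, relayed=0)


print("SFU, 12 participants:", sfu_work(12))
print("MCU, 12 participants:", mcu_work(12))
```

Decoding and encoding are far more expensive than relaying packets, which is why the MCU trades server CPU for the lowest per-participant bandwidth.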
Hybrid architectures allow you to maintain a mix of Selective Forwarding and Multipoint Control (Mixing) architectures. In a hybrid environment, topologies can change as participant counts increase and decrease. If recording is critical, for example, starting with a selective forwarding topology and then switching to a multipoint control topology around the 10-participant count could make the most sense. If cost is more important than the integrity of recording, your application could start with a mesh topology and graduate as needed. Our LiveSwitch Server stack is a great example of a hybrid topology and is one of the few hybrid Media Servers on the market today.
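A hybrid switch-over rule could look like the following Python sketch. The thresholds (up to 3 participants for mesh, up to 10 for selective forwarding, multipoint control beyond that) are taken from the guidance in this guide. The function itself is hypothetical and does not describe how LiveSwitch Server makes this decision.

```python
def choose_topology(participants: int, recording_required: bool = False) -> str:
    """Pick a topology from the participant count, following this guide's rules of thumb."""
    if participants < 2:
        raise ValueError("a session needs at least two participants")
    # Mesh suits two or three participants when nothing has to be recorded.
    if participants <= 3 and not recording_required:
        return "mesh"
    # Selective forwarding covers small groups and sessions that need recording.
    if participants <= 10:
        return "sfu"
    # Beyond roughly ten participants, mixing keeps client bandwidth low.
    return "mcu"


for n in (2, 6, 25):
    print(n, "participants ->", choose_topology(n))
```

Calling the function again whenever someone joins or leaves gives the "graduate as needed" behaviour: a cost-sensitive session starts as a mesh, while a session with `recording_required=True` starts on the server from the first call.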
For more information on how you can scale your WebRTC application, check out our post How to Successfully Scale Your WebRTC Application.
WebRTC is inherently secure and employs a number of protective measures to ensure your data remains secure. These include:
WebRTC runs directly in the browser without the need for plugins. This makes WebRTC inherently safer, because it provides an extra level of protection against malware or other undesirable software installations that may be disguised as a plug-in. And because WebRTC is offered as a part of a browser, any potential security threats or vulnerabilities tend to be addressed quickly via auto-updates from the browser vendors.
The WebRTC specification has addressed potential concerns about allowing access to media resources by requiring explicit permission for the camera or microphone to be used. It is not possible for a WebRTC application to gain access to a device without consent. Furthermore, whenever a device is in use, it will be indicated in the client’s UI and on their hardware.
Encryption is a mandatory part of WebRTC and is enforced on all parts of establishing and maintaining a connection. The preferred method for this is to use perfect forward secrecy (PFS) ciphers in a DTLS (Datagram Transport Layer Security) handshake to securely exchange key data. For audio and video, key data can then be used to generate AES (Advanced Encryption Standard) keys that are in turn used by SRTP (Secure Real-time Transport Protocol) to encrypt and decrypt the media. This acronym-rich stack of technologies translates to extremely secure connections that are considered infeasible to break with current technology. Both WebRTC and ORTC mandate this particular stack, which is backwards-compatible and interoperable with VoIP systems.
While the basis of WebRTC has historically been peer-to-peer video conferencing, there are many promising add-ons that can help make WebRTC even more powerful as a real-time communications tool.
WebRTC and Broadcasting
When combined with efficient server scaling, WebRTC can be used to deliver sub-second latency broadcasts to large audiences. With plugin-free support now from every major browser vendor on desktop and mobile combined with intelligently designed Media Server clusters, it’s possible to scale to thousands and even millions of concurrent users while keeping latency under a second.
WebRTC and Telephony - SIP and PSTN
VoIP-based systems and the public switched telephone network (PSTN) are still fundamental components in many enterprises and often include large historical investments into PBXs, gateways and SBCs. While the traditional landline may be slipping away, mobile phones are still ubiquitous and VoIP deployments in businesses are still standard. Because of this, it’s imperative in some applications for users to be able to dial into an active WebRTC-based session from a phone or have their phone ring when they are invited to join.
To do this, you need a gateway or switch that can speak the protocol used by VoIP phones everywhere—the Session Initiation Protocol, or SIP. Open source products like Asterisk and FreeSWITCH, which support WebRTC, can be helpful for small-scale deployments. Leveraging a flexible WebRTC stack such as LiveSwitch Server or Cloud is crucial for the creation of a seamless user experience when integrating such systems.
Launching a WebRTC Project
Before you jump into a WebRTC project it is important to have a good understanding of your organization's needs, current infrastructure and possible limitations. Having a good grasp of your present state and future needs will allow you to determine what options you have for developing your own video conferencing platform.
Do you need a media server?
Before you can fully explore your options, you need to have a good understanding of your session requirements. Particularly, what is the maximum number of users that need to be connected in a session at any given time, and what are the capabilities of the network and devices you expect them to connect from? If you only need to connect two or three people in a video conference and the users are all expected to have powerful devices on high-speed and uncongested networks, then a media server may not be required. If you need to connect 4 or more participants or if you will be connecting participants in remote areas with older laptops or smartphones, then a media server is necessary. Note that session needs often change over time, so consider carefully whether you might need to host larger sessions in the future.
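The questions above can be condensed into a small decision helper. This Python sketch is only an illustration of the rule of thumb in this section. The function and parameter names are hypothetical, and `expected_future_participants` is included to reflect the advice to plan for larger sessions later.

```python
def needs_media_server(max_participants: int,
                       powerful_devices: bool,
                       fast_networks: bool,
                       expected_future_participants: int = 0) -> bool:
    """Return True when this guide's rule of thumb calls for a media server."""
    # Plan for the largest session you expect, now or later.
    largest_session = max(max_participants, expected_future_participants)
    # Four or more participants, weak devices or poor networks all call for a server.
    return largest_session >= 4 or not (powerful_devices and fast_networks)


# Three colleagues on modern laptops and office networks.
print(needs_media_server(3, powerful_devices=True, fast_networks=True))
# The same team, but expecting to grow to eight participants.
print(needs_media_server(3, True, True, expected_future_participants=8))
```

A result of `False` only means a media server may not be required today; signalling and TURN servers are still needed for peers to find and reach each other.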
Do you want to host your application in the cloud or on-premises?
If you determine a media server is required, your next step is to decide whether you want to host your application in the cloud or on-premise. Each option is valid depending on your specific requirements.
Many commercial WebRTC-based video conferencing products on the market today are cloud-based offerings. In general, clouds are great options for organizations that are looking for a solution that can be easily deployed and can be scaled up quickly with minimal oversight from your team. It is important that organizations considering cloud-based video conferencing options carefully check the data security protocols of the provider to determine the risk, and whether the offering complies with data protection laws like GDPR. For a fully secure and fully managed cloud-based product, check out LiveSwitch Cloud.
Media servers can also be hosted on-premise. This is a great option for organizations seeking maximum control over their data. On-premise solutions often appeal to those in risk-averse industries like governments, financial institutions or healthcare.
It is important to note “on-premise” can be a bit of a misnomer. While some media servers are physically hosted on your own premises, others could be hosted in a private cloud controlled by your operations team. The difference is your organization has the ability to choose your own cloud infrastructure provider based on your own unique specifications and risk tolerance. LiveSwitch’s Server SDK is completely compatible with all cloud infrastructure providers including AWS, Azure, Oracle, Digital Ocean and more. In some instances, this could be a more cost-effective option as you are dealing with the cloud provider directly, rather than through a third party. However, owning, operating and maintaining your own media server on-site also has costs that may or may not work with your business model.
What type of WebRTC-enabled video conferencing will best match your needs and the expertise you have in-house?
There is a wide spectrum of options out there for creating a video conferencing system: from creating it from scratch using open-source, to ready-made platforms and everything in between. Below we will go through the five most common types of services.
- Skilled WebRTC developers required
- Time available for development
- Comfortable sourcing media servers if required
The first option available to developers is to build your own platform from scratch using open-source code. The source code for WebRTC is freely available to the developer community to use or modify as they see fit. Creating your own application with the open-source code available can be a good option if you have enough skilled WebRTC developers available to take on the project (there are fewer than 12,000 worldwide), have the time to devote to the application’s development, are comfortable sourcing media and signalling servers, and are in a position to navigate the uncertainties that come with any development project.
- Skilled developers required, but they do not need to be the experts in WebRTC
- Need more control over their solution
- Comfortable sourcing media servers (if required)
A software development kit (SDK) is a set of tools provided by hardware and software providers that are used for developing applications. SDKs operate on the client side of an application and usually comprise application programming interfaces (APIs), sample code, and documentation. Using an SDK to build your video conferencing application is a good choice for developers who are looking for more control over their solution. It is also ideal for organizations that wish to manage their own data centers or cloud deployment. LiveSwitch Server and LiveSwitch Cloud are great examples of commercial SDKs that provide varying levels of WebRTC capabilities that can be built into applications.
Software as a Service (SaaS)
- Need a specific cloud-based service
- Needs to be combined with other offerings to build video conferencing software
Some organizations may wish to outsource pieces of their WebRTC development to others. These pieces are offered as a service, known as Software as a Service (SaaS). SaaS is a cloud-based software solution that hosts applications and makes them available to customers over the internet. It is possible to combine an SDK or open-source project with some SaaS offerings. Some examples of pieces that may be purchased as SaaS include TURN servers, NAT traversal and signalling servers. LiveSwitch Cloud includes these pieces, which can be customized easily within its web-based platform.
Communications Platform as a Service (CPaaS)
- Skilled developer required (but they do not need to be experts in WebRTC)
- Need more control over their solution
- Want a cloud-based solution
Communications Platform as a Service (CPaaS) refers to a cloud-based platform that enables developers to add real-time communications features like voice, video and messaging capabilities in their own applications without building back end infrastructure and interfaces. Using a CPaaS is a great option for those who are looking to reduce their time to market and their upfront development costs while maintaining a large degree of control over the design and development of their solutions. As it is a cloud-based offering, customers need to do their due diligence to ensure the cloud meets all security requirements. LiveSwitch Cloud is a great option for anyone who is looking for a highly flexible CPaaS platform with advanced functionality and analytics.
- No skilled WebRTC developers required
- OK with the cost
- Do not need flexibility in the application
The last option is purchasing a turn-key video conferencing solution. These solutions are designed to be fully complete and are ready for use by the average consumer on purchase. This can be a good option for organizations that don't have access to software developers and who are willing to pay more for a product with limited flexibility.
If you are unsure which WebRTC option is right for you, consider investing in an architecture assessment. This is a great first step for those who are looking for a path forward for their WebRTC project.
How can LiveSwitch help me with WebRTC?
With more than 10 years of WebRTC experience, we provide our partners with world-class real-time communications technology strategically engineered to their exact specifications. Our flexible platform allows our team to engineer fully customizable live-streaming applications tailored to your specifications.