SIP and Audio

SIPSorcery SIP Audio

A guide to understanding how SIP controls audio streams and the common things that go wrong.

SIP calls typically use the Real-time Transport Protocol (RTP) to transmit the packets that carry the audio (or other types of media such as video) between the call end points. To find out more about how SIP, RTP and various other protocols are interrelated, I’d recommend reading the SIP Survival Guide.

This guide goes into more detail about how RTP works with SIP and some of the common problems that can occur. It also describes a diagnostics tool that I have made available at diags.sipsorcery.com, which performs a cursory check of whether an RTP packet can be sent to and received from a single SIP device.

The Problem: Network Address Translation aka NAT

The problem boils down to two facts: SIP and RTP were written with a deliberate ignorance of NAT, and just about every home and small business user on the internet is behind a NAT.

NAT stands for Network Address Translation. It became popular in the mid-1990s as a way to share a single IPv4 address amongst multiple devices on a single private network. It allowed ISPs to allocate only one IPv4 address per customer (or sometimes none, if they were particularly stingy and double NATted their customers). It was a short-term fix for the problem of scarce IPv4 addresses. Unfortunately, the consequence of using NAT is that lots of protocols designed to work on the internet break and require fixes or workarounds of their own, so each affected protocol ends up needing a workaround for the NAT workaround.

SIP is one of the protocols that is severely affected by NAT. SIP’s main job is to set up a media stream between two or more end points, and to do that it transmits the internet addresses of the media stream end points. That works fine in a non-NAT world, but as soon as NAT is introduced into the equation the addresses will often end up being private network addresses that are unusable by the destination SIP devices.
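To make this concrete, here is a minimal sketch of how the media address carried in a SIP message body (SDP) could be checked. The sample SDP, the function names and the addresses are all made up for illustration, and the parser is deliberately simplified: it only looks at the first `c=` line and the first `m=audio` line.

```python
import ipaddress

# Illustrative SDP body; the address and port are invented for this example.
SAMPLE_SDP = """v=0
o=- 0 0 IN IP4 192.168.1.20
s=-
c=IN IP4 192.168.1.20
t=0 0
m=audio 49170 RTP/AVP 0 8
a=rtpmap:0 PCMU/8000
"""

# The private IPv4 ranges reserved by RFC 1918.
RFC1918_NETS = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def advertised_audio_endpoint(sdp):
    """Return (address, port) advertised for the first audio stream."""
    address = None
    port = None
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith("c=") and address is None:
            # c=<nettype> <addrtype> <connection-address>
            address = line[2:].split()[2]
        elif line.startswith("m=audio") and port is None:
            # m=<media> <port> <proto> <format list>
            port = int(line.split()[1])
    return address, port


def is_rfc1918(address):
    """True if the address falls inside one of the RFC 1918 private ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in RFC1918_NETS)


address, port = advertised_audio_endpoint(SAMPLE_SDP)
if is_rfc1918(address):
    print(f"{address}:{port} is a private address; a remote device cannot send RTP to it")
```

A device on the public internet that receives this SDP would try to send its audio to 192.168.1.20, an address that only has meaning inside the sender's own private network.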

The Solution: STUN, ICE & TURN (aka alphabet soup)

Lots of work has been done to help SIP work with NAT. The original STUN standard, Simple Traversal of User Datagram Protocol (UDP) Through Network Address Translators (NATs), provided a way for devices to determine their public IP address, with the intention that they would substitute it for their private IP address when sending a SIP packet to the internet. That helped somewhat, but it couldn’t help in cases where NATs did both network and port translation, and it could also cause problems if two devices on the same private network attempted a SIP call.
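As a rough sketch of what a STUN exchange looks like on the wire, the code below builds a Binding Request and extracts the public address from a Binding Response. It assumes the packet layout from the later STUN revision (RFC 5389), which keeps the same 20-byte header as the original, and it handles IPv4 only. The function names are invented for this example and nothing here contacts a real server.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442    # fixed value defined by RFC 5389
BINDING_REQUEST = 0x0001
MAPPED_ADDRESS = 0x0001      # attribute used by the original STUN
XOR_MAPPED_ADDRESS = 0x0020  # attribute added by the later revision


def build_binding_request():
    """Return (packet, transaction_id) for a Binding Request with no attributes."""
    transaction_id = os.urandom(12)
    # Header: message type, message length (0, no attributes), magic cookie.
    header = struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE)
    return header + transaction_id, transaction_id


def parse_mapped_address(response):
    """Return (ip, port) from a Binding Response, or None if not present."""
    offset = 20  # skip the 20-byte STUN header
    while offset + 4 <= len(response):
        attr_type, attr_len = struct.unpack_from("!HH", response, offset)
        value = response[offset + 4:offset + 4 + attr_len]
        is_address = attr_type in (MAPPED_ADDRESS, XOR_MAPPED_ADDRESS)
        # value[1] is the address family; 0x01 means IPv4.
        if is_address and len(value) >= 8 and value[1] == 0x01:
            (port,) = struct.unpack_from("!H", value, 2)
            (addr,) = struct.unpack_from("!I", value, 4)
            if attr_type == XOR_MAPPED_ADDRESS:
                # The XOR form obscures the values with the magic cookie.
                port ^= MAGIC_COOKIE >> 16
                addr ^= MAGIC_COOKIE
            return socket.inet_ntoa(struct.pack("!I", addr)), port
        # Attribute values are padded to a multiple of 4 bytes.
        offset += 4 + attr_len + (-attr_len % 4)
    return None
```

To use it against a real server, the request would be sent over a UDP socket to a STUN server (the standard port is 3478) and the reply passed to `parse_mapped_address`. The address and port that come back are the ones the NAT assigned for that particular socket, which is why the result is of limited use when the NAT assigns a different mapping per destination.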

To provide a more comprehensive solution to SIP’s NAT affliction, additional protocols were developed. A new version of STUN was written, this time standing for Session Traversal Utilities for NAT, and it was joined by ICE, TURN and others. The problem with this more comprehensive solution was that the only way it could reliably solve the problem was by introducing the requirement for a new server that would always be available to proxy the media stream between the two SIP end points. Servers that proxy the media on SIP calls are commonplace and are the default way that most internet-based SIP providers function. While using a proxy for the media does solve NAT issues, it introduces a whole set of other issues such as latency, security and codec restrictions, amongst others.

The Consequences: one-way audio

The most common consequence of a SIP call being affected by NAT is one-way audio, where the caller cannot hear the callee or vice versa. The cause is one of the RTP streams not being able to reach its required destination. A SIP call between two end points involves creating two RTP streams: one carries the caller’s audio to the callee and the other carries the callee’s audio to the caller. Because the end points for each of the two streams need to be transmitted within SIP packets, and by devices that could be on a private network, there is always potential for the end point addresses to be wrong.
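The relationship between the two streams and the advertised addresses can be sketched as below. This is a deliberately simplified model: it assumes the two devices are on different networks, that no media proxy or other NAT workaround is in play, and that a private address is always unreachable from the other side. The function name and addresses are made up for illustration.

```python
import ipaddress

# The private IPv4 ranges reserved by RFC 1918.
PRIVATE_NETS = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def reachable(address):
    """Simplified assumption: anything outside the private ranges is reachable."""
    ip = ipaddress.ip_address(address)
    return not any(ip in net for net in PRIVATE_NETS)


def predict_audio(caller_advertised, callee_advertised):
    """Return (caller_hears_callee, callee_hears_caller).

    Each party sends its RTP stream to the address the OTHER party
    advertised, so a bad address breaks the audio heading towards
    the device that advertised it.
    """
    caller_hears_callee = reachable(caller_advertised)
    callee_hears_caller = reachable(callee_advertised)
    return caller_hears_callee, callee_hears_caller


# Caller advertises a private address, callee a public one.
print(predict_audio("192.168.1.20", "203.0.113.5"))
```

In this example the callee's audio is sent to 192.168.1.20 and never arrives, so the caller hears nothing while the callee hears the caller normally.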