Transmitting VoIP depends on three pillars:
■ Signaling is used for call setup and teardown. Common protocols
include H.323, SIP, and MGCP.
■ Packetization sends voice samples inside IP packets.
■ QoS prioritizes VoIP traffic.
There are three reasons users will throw new VoIP phones at you and beg for
old analog headsets: packet loss, delay, and echo. The biggest cause of
packet loss is tail drop in queues, which QoS addresses. The biggest
issue with delay is variation in delay (called jitter), which forces the use of
large dejitter buffers that in turn add more delay. The solution to jitter is also QoS.
Echo is handled by a technique called echo cancellation (G.168), which
is on by default and compensates for delay.
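To make jitter concrete, the sketch below computes the running interarrival jitter estimate defined in RFC 3550, the RTP specification. This estimator is not described in the text above; it is shown only as one common way to put a number on delay variation. The function name and the 8000 Hz default clock rate are assumptions for illustration.

```python
def interarrival_jitter(packets, clock_rate=8000):
    """Return the smoothed jitter estimate in milliseconds.

    packets: iterable of (arrival_seconds, rtp_timestamp) pairs, in arrival order.
    clock_rate: RTP timestamp ticks per second (assumed 8000 for narrowband voice).
    Timestamp wraparound is ignored to keep the sketch short.
    """
    jitter = 0.0      # running estimate, in timestamp ticks
    previous = None
    for arrival, rtp_ts in packets:
        arrival_ticks = arrival * clock_rate
        if previous is not None:
            prev_arrival_ticks, prev_ts = previous
            # D: how much the transit time changed between two packets
            d = (arrival_ticks - prev_arrival_ticks) - (rtp_ts - prev_ts)
            # RFC 3550 smoothing: move 1/16 of the way toward |D|
            jitter += (abs(d) - jitter) / 16
        previous = (arrival_ticks, rtp_ts)
    return jitter * 1000 / clock_rate
```

A stream that arrives exactly every 20 ms yields an estimate of zero; any packet that arrives early or late pushes the estimate up, which is the signal a dejitter buffer has to absorb.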
Voice samples are encapsulated in Real-time Transport Protocol (RTP) packets. Voice
does not need the reliability provided by TCP; by the time a retransmission
arrived, the moment to play the sound would have passed. Voice does
need a way to order samples and recognize the time between samples, which
UDP by itself doesn’t provide. RTP is carried inside UDP and adds the
necessary features: a sequence number and a time stamp on every packet.
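As an illustration of what RTP adds on top of UDP, the following sketch packs and unpacks the fixed 12-byte RTP header (layout per RFC 3550). The helper names are hypothetical, and optional header fields (CSRC list, extensions, padding) are left out.

```python
import struct

# Fixed RTP header: 2 flag bytes, 16-bit sequence, 32-bit timestamp, 32-bit SSRC
RTP_HEADER = struct.Struct("!BBHII")  # network byte order, 12 bytes total

def build_rtp_header(sequence, timestamp, ssrc, payload_type=0, marker=False):
    """Build a header; payload_type 0 is G.711 mu-law in the standard RTP audio profile."""
    first = 2 << 6  # version 2, no padding, no extension, CSRC count 0
    second = (int(marker) << 7) | (payload_type & 0x7F)
    return RTP_HEADER.pack(first, second, sequence & 0xFFFF,
                           timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)

def parse_rtp_header(data):
    """Decode the first 12 bytes of an RTP packet into a dict."""
    first, second, sequence, timestamp, ssrc = RTP_HEADER.unpack(data[:12])
    return {
        "version": first >> 6,
        "marker": bool(second >> 7),
        "payload_type": second & 0x7F,
        "sequence": sequence,    # lets the receiver reorder and detect loss
        "timestamp": timestamp,  # lets the receiver space samples correctly
        "ssrc": ssrc,
    }
```

With 20 ms G.711 packets, the sequence number would rise by 1 per packet and the time stamp by 160 (one tick per 8 kHz sample).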
A complete VoIP packet needs to include a data link header (Ethernet has a
14-byte header and a 4-byte CRC), a 20-byte IP header, an 8-byte UDP
header, and 12 bytes for RTP. Each 20 ms sample therefore carries 58
bytes of overhead. G.711 sends 8000 bytes per second (20 ms would therefore
need 160 bytes), so about a quarter of the transmission (58 of 218 bytes) is headers!
Figure 2-3 shows the header overhead graphically and Table 2-1 shows the
bandwidth consumed by the various CODECs, including headers. If the
phone uses 20 ms samples (50 samples per second), then it sends 50
sets of headers per second. G.711, instead of being 64 kbps, turns out to be:
(Headers + Sample) * 50/s =
(14B + 20B + 8B + 12B + 160B + 4B) * 50/s =
218B * 50/s = 10,900 B/s = 87,200 b/s = 87.2 kbps
Figure 2-3 Protocol Layout for Voice Transmission over IP
G.711 frame: Frame Header, Ethernet (14B) | IP Header (20B) | UDP Header (8B) | RTP Header (12B) | Voice Sample, G.711 20 ms = 160B | Frame CRC, Ethernet (4B)
G.729 frame: Frame Header, Ethernet (14B) | IP Header (20B) | UDP Header (8B) | RTP Header (12B) | Voice Sample, G.729 20 ms = 20B | Frame CRC, Ethernet (4B)
Note that G.729 uses 20-byte samples, so it needs only (58B + 20B) * 50/s = 31.2 kbps.
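The per-call arithmetic can be sketched in a few lines of Python. The constants are the header sizes quoted in this section; the function name is an assumption made for illustration.

```python
# Header sizes in bytes, as given in the text
ETHERNET_HEADER = 14
ETHERNET_CRC = 4
IP_HEADER = 20
UDP_HEADER = 8
RTP_HEADER = 12

def voip_bandwidth_bps(payload_bytes, packets_per_second=50):
    """Bits per second for one voice stream in one direction over Ethernet."""
    frame_bytes = (ETHERNET_HEADER + IP_HEADER + UDP_HEADER + RTP_HEADER
                   + payload_bytes + ETHERNET_CRC)
    return frame_bytes * 8 * packets_per_second

if __name__ == "__main__":
    print(voip_bandwidth_bps(160))  # G.711, 20 ms sample: 87200
    print(voip_bandwidth_bps(20))   # G.729, 20 ms sample: 31200
```

Changing packets_per_second models a different sample interval, provided payload_bytes is scaled to match.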
At this point, you may have sticker shock. If G.729 is billed as 8 kbps per
conversation, 31.2 kbps seems extreme. There are ways to mitigate the difference,
although the techniques do not completely erase the need for headers.
One way is to use RTP header compression (cRTP). Header compression is configured
per link; it remembers previous IP, UDP, and RTP headers and substitutes
a 2-byte or 4-byte label in subsequent packets. By shrinking the header set from 40B to 4B,
cRTP delivers G.729 with 22B of headers and a consumption of 16.8 kbps!
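The cRTP figure can be checked the same way. This is a rough sketch that keeps the 18 bytes of Ethernet framing used in the earlier calculations; the function and parameter names are hypothetical.

```python
def crtp_bandwidth_bps(payload_bytes, compressed_header=4,
                       packets_per_second=50, layer2_overhead=18):
    """Bits per second when the 40B IP/UDP/RTP header set is replaced by a label.

    compressed_header: 2 or 4 bytes, the label sizes mentioned in the text.
    layer2_overhead: 14B Ethernet header + 4B CRC (assumed unchanged by cRTP).
    """
    if compressed_header not in (2, 4):
        raise ValueError("cRTP label is assumed to be 2 or 4 bytes")
    frame_bytes = layer2_overhead + compressed_header + payload_bytes
    return frame_bytes * 8 * packets_per_second
```

For G.729, crtp_bandwidth_bps(20) gives 16,800 b/s, matching the 16.8 kbps above; with a 2-byte label the same call would come to 16,000 b/s.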
Voice Activity Detection (VAD) is a technology that recognizes when you
are not talking and ceases transmission, thus saving bandwidth. In normal
speech, one person or the other is talking less than 65 percent of the time
(there are those long, uncomfortable silences right after you say, “You did
what?”). VAD can therefore dramatically reduce demands for bandwidth.
The bad news with VAD is that it doesn’t help with music (such as hold
music) and that it creates “dead air,” which can be mistaken for disconnection.
Some phones, in fact, will play soft static to reinforce that the line is
still live (this is called comfort noise).
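The idea behind VAD can be illustrated with a toy energy detector. This is only a sketch under stated assumptions, not how any particular phone or gateway implements VAD: the threshold of 500 and the two-frame "hangover" (continuing to send briefly after speech stops, so word endings are not clipped) are made-up values.

```python
def vad_filter(frames, threshold=500, hangover=2):
    """Yield (frame, send) pairs; send is False for frames judged silent.

    frames: iterable of lists of signed 16-bit PCM samples, one 20 ms frame each.
    threshold: hypothetical mean absolute amplitude that counts as speech.
    hangover: quiet frames still transmitted after the last speech frame.
    """
    quiet_run = hangover  # start in the silent state
    for frame in frames:
        level = sum(abs(s) for s in frame) / len(frame) if frame else 0
        if level >= threshold:
            quiet_run = 0          # speech detected, reset the counter
        else:
            quiet_run += 1         # one more quiet frame in a row
        yield frame, quiet_run <= hangover
```

Frames marked send=False are the ones a sender would suppress, and they are also where a receiving phone would fill in comfort noise.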