Learning Guide: How does VoIP work?

Section 3 of Roger Freeman's VoIP series shows the basic operational elements of a VoIP network.

(Continued from How does VoIP Work? A technical guide to functional VoIP) 

Figure 2 shows a simplified block diagram of VoIP operation. An analog signal from a standard telephone is digitized and transmitted over the Internet via a conversion device. At the distant end, a similar device converts it back to an analog signal suitable for input to a standard telephone. A "gateway" is placed between the voice codec and the digital data transport circuit. An identical device will also be found at the far end of the link. This equipment carries out the signaling role on a telephone call, among other functions.

Moving from left to right in Figure 2, we have the spurty analog voice signal developed by the standard telephone. The signal is then converted to a digital counterpart using one of the seven or so codecs from which the VoIP system designer has to select. Some of the more popular codecs for this application are listed in Table 1. The binary output of the codec is then applied to a conversion device (i.e., a packetizer) that loads these binary 1's and 0's into an IP payload between 20 and 40 octets in length.


Figure 2. Elements of basic operation of VoIP, where the input signal derives from a conventional analog telephone

The output of this converter consists of IP packets* that are transmitted over the Internet or another data circuit for delivery to the distant end.

At the far end, the IP packets are input to a converter that strips off the IP header, stores the payload, and then releases it in a constant bit stream to a codec. Of course, this codec must be compatible with its near-end counterpart. The codec converts the digital bit stream back to an analog signal that is input to a standard telephone.

The insightful reader will comment that many steps of translation and interface have been left out. Most of these considerations will be covered in Section 3.1, in our discussion of the "gateway."

*The output may be ATM cells (see Chapter 16, Telecommunication System Engineering, 4th ed., Wiley, N.Y., 2004) if the intervening network is an ATM network.

 VoIP gateway
Gateways are defined in different ways by different people. A gateway is a server; it may also be called a "media gateway." Figure 3 illustrates a typical gateway. It sits on the edge of the network and carries out a switching function of a local, tandem or toll-connecting PSTN switch.


Figure 3. A media gateway, from one perspective. API = applications program interface.  From IEC online (Jan. 2003)

Media gateways are part of the physical transport layer. They are regulated by a call control function housed in a media gateway controller. A media gateway, with its associated gateway controller, is necessary for the network transformation to packetized voice. Several of the media gateway functions are listed below:

  • Carries out A/D conversion of the analog voice channel (called compression in many texts);
  • Converts a DS0 or E0 to a binary signal compatible with IP or ATM;
  • Supports several types of access networks, including media such as copper (including various DSL regimes), fiber, radio (wireless), and CATV cable. It is also able to support various formats found in PDH and SDH hierarchies;
  • Competitive availability (99.999%);
  • Capable of handling several voice and data interface protocols;
  • Multi-vendor interoperability;
  • It must provide interface between the media gateway control device and the media gateway. This involves one of four protocols: SIP (Ref. 2), H.323 (Ref. 3), MGCP and Megaco (H.248);
  • It can handle switching and media processing based on standard network PCM, ATM and traditional IP; and
  • Transport of voice. There are four transmission categories that may be involved:
    1. Standard PCM (E0/E1 or DS0/DS1)
    2. ATM over AAL1/AAL2
    3. IP-based RTP/RTCP
    4. Frame relay


Table 1 – Characteristics of speech codecs used on packet networks

Coding algorithm              Voice bit rate (kbps)  Voice frame size (bytes)  Header (bytes)  Packets per second  Packet bit rate (kbps)
G.711 8-bit PCM (Ref. 1)      64                     80                        40              100                 96
G.723.1 MPMLQ (1) (Ref. 4)    6.3                    30                        40              26                  14.6
G.723.1 ACELP (2) (Ref. 5)    5.3                    30                        40              22                  12.3
G.726 ADPCM (3) (Ref. 6)      32                     40                        40              100                 64
G.728 LD-CELP (4) (Ref. 7)    16                     20                        40              100                 48
G.729a CS-ACELP (5) (Ref. 8)  8                      10                        40              100                 40

(1) MPMLQ – Multi-pulse Maximum Likelihood Quantization
(2) ACELP – Algebraic Code-Excited Linear Prediction
(3) ADPCM – Adaptive Differential PCM
(4) LD-CELP – Low Delay Code-Excited Linear Prediction
(5) CS-ACELP – Conjugate Structure Algebraic Code-Excited Linear Prediction
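
The packet bit-rate column of Table 1 follows from the other columns: each packet carries one voice frame plus the 40-byte header, so the rate is (frame + header) x 8 bits x packets per second. A minimal Python sketch of that arithmetic, using only the figures in the table (the function and variable names are illustrative, not from the original text):

```python
def packet_bit_rate_kbps(frame_bytes, header_bytes, packets_per_second):
    """Bit rate of one packetized voice stream, in kbps."""
    bits_per_packet = (frame_bytes + header_bytes) * 8
    return bits_per_packet * packets_per_second / 1000


# (voice frame bytes, header bytes, packets per second), copied from Table 1
TABLE_1 = {
    "G.711": (80, 40, 100),
    "G.723.1 MPMLQ": (30, 40, 26),
    "G.723.1 ACELP": (30, 40, 22),
    "G.726": (40, 40, 100),
    "G.728": (20, 40, 100),
    "G.729a": (10, 40, 100),
}

for codec, row in TABLE_1.items():
    print(f"{codec}: {packet_bit_rate_kbps(*row):.1f} kbps")
```

The results reproduce the last column of the table. Note how the fixed 40-byte header dominates for the low-rate codecs: G.729a carries 8 kbps of voice in a 40 kbps packet stream.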

The most powerful gateway supports the PSTN, requiring a high-reliability device to meet the PSTN availability requirements. It will be required to process many thousands of digital circuits. As shown in Figure 3, it has a network management capability most often based on simple network management protocol (SNMP -- see Chapter 21 of Telecommunication System Engineering, 4th ed., Wiley, N.Y.).

A somewhat less formidable gateway is employed to provide VoIP for small and medium-sized businesses. Some texts call this type of gateway an "integrated access device" (IAD) if it can handle data and video products as well. An IAD will probably be remotely configurable.

The least powerful and most economic gateways are residential. They can be deployed in at least five settings:

  • POTS (telephony);
  • Set-top box (CATV), which provides telephony as well;
  • PC/modem;
  • xDSL termination; and
  • Broadband last-mile connectivity (to the digital network).

Figure 4 shows gateway interface functions via a block diagram. On the left are time slots of a PCM bit stream (T1, in this case). The various signal functions are shown to develop a stream of data packets carrying voice or data. The output on the right consists of IP packets.

Figure 4. A simplified functional block diagram of a gateway providing an interface between a PCM bit stream deriving from the PSTN on the left and an IP network.

The first functional block of the gateway analyzes the content on a time-slot basis. The time slot may contain an 8-bit data sequence where we must be hands-off regarding the content. A gateway senses the presence of data by the presence of a 2100 Hz tone in the time slot. The next signal type in the time slot it looks for is DTMF signaling tones (see Chapter 4). If there is no modem tone or DTMF tones in the time slot, the gateway assumes the time slot contains human speech. Three actions now have to be accomplished. "Silence" is removed, the standard PCM compression algorithm is applied, and an echo canceller is switched in. There are three digital formats used for voice over packet:

  1. IP (Internet protocol, Chapter 11, Section 7);
  2. Frame relay (Chapter 15); and
  3. ATM (asynchronous transfer mode, Chapter 16).

Reference: Telecommunication System Engineering, Wiley, N.Y., 2004.
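
The time-slot analysis described for Figure 4 is a short decision sequence: look for the 2100 Hz tone first, then for DTMF tones, and otherwise assume speech. The Python sketch below shows only that ordering. The tone detectors themselves are not modeled; their results are passed in as booleans, and the step labels for the data and DTMF branches are assumptions made for illustration, since the text spells out the processing steps only for speech.

```python
from enum import Enum


class SlotContent(Enum):
    MODEM_DATA = "modem data"
    DTMF = "DTMF signaling"
    SPEECH = "speech"


def classify_time_slot(has_2100_hz_tone, has_dtmf_tones):
    """Apply the checks in the order the text gives them."""
    if has_2100_hz_tone:
        return SlotContent.MODEM_DATA
    if has_dtmf_tones:
        return SlotContent.DTMF
    return SlotContent.SPEECH


def processing_steps(content):
    """Steps applied to a slot; only the speech branch is from the text."""
    if content is SlotContent.SPEECH:
        return ["remove silence", "apply compression", "switch in echo canceller"]
    if content is SlotContent.MODEM_DATA:
        # "Hands-off": the 8-bit data sequence must not be altered.
        return ["pass through unmodified"]
    # Assumption: DTMF tones are handed to signaling, not to the voice coder.
    return ["handle as signaling"]


slot = classify_time_slot(has_2100_hz_tone=False, has_dtmf_tones=False)
print(slot.value, "->", processing_steps(slot))
```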

  An IP packet, as used for VoIP
Assume for argument's sake that we use either a G.711 or G.726 IP packet. The packet consists of a header and a payload. Figure 5 shows a typical IP packet. Of interest, as one may imagine, is its payload.

Figure 5. A typical IP packet. Based on RFC 791, Ref. 22. (also see Figure 11.28).

In the case of G.711, standard PSTN PCM, there may be a transmission rate of 100 packets per second with 80 bytes in the payload of each packet. Of course, our arithmetic comes out just right and we get 8,000 samples per second, the Nyquist sampling rate for a 4 kHz analog voice channel. Another transmission rate for G.711 is 50 packets per second, where each packet will have 160 bytes, again achieving 8,000 samples per second per voice channel.

The total raw bytes per channel come out as follows: Layers 3 and 4 overhead (IP): 40 bytes plus 8 bytes for Layer 2 (link layer) overhead. So we add 48 to 80 or 160 bytes and get 128 or 208 bytes for a raw packet. The efficiency is nothing to write home about. Keep in mind that the primary concern of the VoIP designer is delay.
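
The arithmetic in the two paragraphs above can be checked with a few lines of Python. This is a sketch using only the figures given in the text (8,000 one-byte samples per second, 40 bytes of Layer 3 and 4 overhead, 8 bytes of Layer 2 overhead); the function name and the definition of efficiency as payload divided by raw packet size are illustrative choices.

```python
SAMPLE_RATE_HZ = 8000        # G.711: 8,000 samples/s, one byte per sample
L3_L4_OVERHEAD_BYTES = 40    # Layers 3 and 4 overhead
L2_OVERHEAD_BYTES = 8        # link-layer overhead


def g711_packet(packets_per_second):
    """Return (payload bytes, raw packet bytes, payload efficiency)."""
    payload = SAMPLE_RATE_HZ // packets_per_second
    raw = payload + L3_L4_OVERHEAD_BYTES + L2_OVERHEAD_BYTES
    return payload, raw, payload / raw


for pps in (100, 50):
    payload, raw, efficiency = g711_packet(pps)
    print(f"{pps} packets/s: payload {payload} B, raw {raw} B, "
          f"efficiency {efficiency:.1%}")
```

At 100 packets per second only 62.5% of each raw packet is voice; at 50 packets per second the figure rises to about 77%, at the cost of the longer packetization delay discussed in the next section.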

  The delay tradeoff
Human beings are intolerant of delay on a full-duplex circuit, typical of standard PSTN telephony. ITU-T Rec. G.114 (Ref. 10) recommends the total delay -- one-way -- in voice connectivity as follows:

  • 0-150 msec acceptable;
  • 150-400 msec acceptable but not desirable. Connectivity through a geostationary satellite falls into this category; and
  • Above 400 msec unacceptable.

The delay objective -- one-way -- for VoIP voice connectivity is less than 100 msec. With bridging for conference calls, that value doubles, owing to the very nature of bridging.

One-way components of delay are as follows:

  • Packetization or encapsulation delay based on G.711 or other compression algorithm. In the case of G.711, we must build from 80 PCM samples at 125 µsec per sample, so we have consumed 80 x 125 µsec, or 10,000 µsec (10 msec), plus time for the header, 48 x 125 µsec, or 6 msec, for a total of 16 msec. If we use 160 PCM samples in the payload, then allow 20 msec plus 6 msec for the header, or 26 msec. This is a fixed delay.
  • Buffer delay is variable. As a minimum, there must be buffering of one frame or packet period. By definition, routers have buffers. Buffer delay varies with the number of routers in tandem. For G.711, the packet buffer size is 16 or 26 msec.
  • Look-ahead delay: This is used by the coder to help in compression. Look-ahead is a period of time when the coder looks at packet n+1 for patterns on which it can compress while coding packet n. With G.711, the look-ahead is 0.
  • De-jitterizer: This is a buffer installed at the destination. It injects at least 1 frame duration (1-20 msec) in the total delay to smooth out the apparent arrival times of packets.
  • Queuing delay: This is time spent in the queue because it is a shared network. One method to reduce this delay is to prioritize voice packets over data, with an objective of less than 50 msec.
  • Propagation delay: Variable. Major contributor to total delay. Geostationary satellite relay of circuits is a special problem. The trip to the satellite and back is budgeted at 250 msec.
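
A one-way delay budget is simply the sum of the components listed above, checked against the G.114 categories. The Python sketch below reproduces the G.711 packetization figures from the text (125 µsec per byte, 48 header bytes). The remaining values in the example budget are hypothetical placeholders chosen for illustration, not measurements or recommendations.

```python
SAMPLE_PERIOD_US = 125   # one G.711 sample (one byte) every 125 microseconds
HEADER_BYTES = 48        # header bytes counted in the packetization delay


def packetization_delay_ms(payload_samples):
    """Fixed delay to build one packet: payload plus header, 125 us per byte."""
    return (payload_samples + HEADER_BYTES) * SAMPLE_PERIOD_US / 1000


def g114_category(one_way_delay_ms):
    """Classify a one-way delay using the G.114 ranges quoted in the text."""
    if one_way_delay_ms <= 150:
        return "acceptable"
    if one_way_delay_ms <= 400:
        return "acceptable but not desirable"
    return "unacceptable"


# Hypothetical budget in msec; only the packetization and look-ahead
# entries are derived from figures given in the text.
budget = {
    "packetization": packetization_delay_ms(80),  # 16 msec
    "buffer": 16,
    "look-ahead": 0,        # G.711 has no look-ahead
    "de-jitter": 20,
    "queuing": 10,
    "propagation": 25,
}
total = sum(budget.values())
print(f"terrestrial: {total} msec, {g114_category(total)}")

# Same budget over a geostationary satellite hop (250 msec propagation).
satellite_total = total - budget["propagation"] + 250
print(f"satellite: {satellite_total} msec, {g114_category(satellite_total)}")
```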

One way to speed things up is to increase the bit rate per voice data stream. To do this, the aggregate bit rate may have to be increased or the number of voice streams may be reduced on the aggregate bit rate so that each stream can be transmitted at a faster rate.

  Lost packet rate
A second concern of the VoIP designer is lost packet rate. There are several ways a packet can be lost.

For example, Section 3.3 described a de-jitterizing buffer. It has a finite size. Once the time is exceeded by a late packet, the packet in question is lost. In the case of G.711, this would be the time equivalent to 16 or 26 msec -- duration of a packet including its header. Another cause of packet loss may be excessive error rate on a packet, whereby it is deleted. When the lost or discarded packet rate begins to exceed 10%, quality of voice starts to deteriorate. If high-compression algorithms -- such as G.723 or G.729 -- are employed, it is desirable to maintain the packet loss rate below 1%. Router buffer overflow is another source of packet loss.
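
The late-packet case can be illustrated with a small Python sketch. It assumes a very simple playout schedule, which is not spelled out in the text: playout starts one buffer depth after the first packet arrives, each later packet is due one frame period after its predecessor, and a packet that arrives after its playout time is counted as lost. All names and sample values are illustrative.

```python
def late_packets(arrivals_ms, frame_ms, buffer_ms):
    """Return sequence numbers that arrive after their playout deadline.

    arrivals_ms maps sequence number -> arrival time in msec.
    """
    first_seq = min(arrivals_ms)
    playout_start = arrivals_ms[first_seq] + buffer_ms
    late = []
    for seq, arrival in sorted(arrivals_ms.items()):
        deadline = playout_start + (seq - first_seq) * frame_ms
        if arrival > deadline:
            late.append(seq)
    return late


# Four 10-msec packets; packet 2 is delayed in the network by extra queuing.
arrivals = {0: 0, 1: 12, 2: 45, 3: 31}
lost = late_packets(arrivals, frame_ms=10, buffer_ms=16)
loss_rate = len(lost) / len(arrivals)
print(f"late packets: {lost}, loss rate {loss_rate:.0%}")
```

A deeper buffer would save packet 2 here, but every msec of added depth is added to the one-way delay for all packets, which is the tradeoff the designer has to balance.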

IP through TCP has excellent retransmission capabilities for errored frames or packets, but they are not practical for VoIP because of the additional delay involved. When a packet is in error, the receiving end of the link transmits a request (RQ) to the transmitting end for a retransmission, which incurs its own propagation delay. To this must be added the transmission delay and some processing delay needed to send the offending packet to the receiver again.

Concealment of lost packets
A lost packet causes a gap in the reception stream. For a single packet, we are looking at a 20 to 40 msec gap. The simplest measure to take for lost packets and the resulting gaps is to disregard them. The absolute silence of a gap may disturb a listener. In this case, artificial noise is often inserted.

There are packet loss concealment (PLC) procedures that can camouflage gaps in the output voice signal. The simplest techniques require a little extra processing power, and the most sophisticated techniques can restore speech to a level approximating the quality of the original signal. Concealment techniques are most effective for about 40 to 60 msec of missing speech. Gaps longer than 80 msec usually have to be muted.

One of the most elementary PLCs simply smooths the edges of gaps to eliminate audible clicks. A more advanced algorithm replays the previous packet in place of the lost one, but this can cause harmonic artifacts such as tones or beeps. Good concealment methods use variation in the synthesized replacement speech to make the output more like natural speech. There are better PLCs that preserve the spectral characteristics of the talker's voice and maintain a smooth transition between the estimated signal and the surrounding original. The most sophisticated PLCs use CELP (code-excited linear prediction) or a similar technique to determine the content of the missing packet by examining the previous one (Ref. 11). Lost packets can be detected by packet sequence numbering.
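
As a rough illustration of two ideas from this section, detecting loss from gaps in the sequence numbers and replaying the previous packet, here is a minimal Python sketch. It is not a production PLC: it does no edge smoothing or speech synthesis, and the 20-msec frame length and 80-msec mute threshold are taken from the figures quoted above as default assumptions.

```python
def conceal(received, frame_ms=20, max_conceal_ms=80):
    """Fill gaps in a packet stream.

    received maps sequence number -> list of samples. Returns a list of
    (sequence number, samples, status) covering every sequence number from
    the first to the last received packet.
    """
    first, last = min(received), max(received)
    frame_len = len(received[first])
    output = []
    previous = None
    gap_ms = 0
    for seq in range(first, last + 1):
        if seq in received:
            previous = received[seq]
            gap_ms = 0
            output.append((seq, previous, "received"))
            continue
        # A missing sequence number means the packet was lost.
        gap_ms += frame_ms
        if gap_ms <= max_conceal_ms:
            output.append((seq, list(previous), "replayed"))
        else:
            output.append((seq, [0] * frame_len, "muted"))
    return output


stream = {1: [1, 2], 2: [3, 4], 4: [5, 6]}   # packet 3 never arrived
for seq, samples, status in conceal(stream):
    print(seq, samples, status)
```

In this example packet 3 is replaced by a copy of packet 2. If five or more consecutive 20-msec packets were missing, the fifth and later ones would be muted rather than replayed.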

  Echo and echo control
Echo is commonly removed by the use of echo cancellers that are incorporated on the same DSP chips that perform the voice coding. A good source for information and design of echo cancellers is ITU-T Rec. G.168 (Ref. 12). However, most vendors of VoIP equipment have their own proprietary designs. A common design approach is to have the echo canceller store the outgoing speech in a buffer. It then monitors the incoming speech to see whether it contains a component that matches the stored speech after a delay. If it does, that component of the incoming speech is cancelled out instead of being passed back to the user, since it is an echo of what the user originally said. Echo cancellers can be tuned or can tune themselves to the echo delay on any particular connection. Each echo canceller design has a limit to the maximum delay of echo it can identify. Echo cancellers are bypassed if a fax signal or modem data is on the line.
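
The store-and-compare approach described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not a G.168 design or any vendor's method: it assumes the echo is a single delayed, attenuated copy of the outgoing speech, searches delays up to a fixed limit for the best least-squares match, and subtracts it. Production cancellers typically use adaptive filters instead. All names are hypothetical.

```python
def estimate_echo(outgoing, incoming, max_delay):
    """Return (delay, gain) of the delayed copy of outgoing that best
    matches incoming, searching delays 1..max_delay (in samples)."""
    best = (0, 0.0)
    best_score = 0.0
    for delay in range(1, max_delay + 1):
        pairs = [(outgoing[i - delay], incoming[i])
                 for i in range(delay, len(incoming))]
        energy = sum(o * o for o, _ in pairs)
        if energy == 0:
            continue
        correlation = sum(o * r for o, r in pairs)
        # Incoming energy explained by the best-fit scaled copy at this delay.
        score = correlation * correlation / energy
        if score > best_score:
            best_score = score
            best = (delay, correlation / energy)
    return best


def cancel_echo(outgoing, incoming, max_delay):
    """Subtract the estimated echo of outgoing from incoming."""
    delay, gain = estimate_echo(outgoing, incoming, max_delay)
    cleaned = list(incoming)
    for i in range(delay, len(incoming)):
        cleaned[i] -= gain * outgoing[i - delay]
    return cleaned


# Stored outgoing speech, and an incoming signal that is a pure echo of it:
# delayed by 3 samples and attenuated to half amplitude.
outgoing = [1.0, -2.0, 3.0, 0.5, -1.5, 2.5, 0.0, 0.0, 0.0, 0.0]
incoming = [0.0] * 3 + [0.5 * s for s in outgoing[:7]]

print(estimate_echo(outgoing, incoming, max_delay=5))
print(cancel_echo(outgoing, incoming, max_delay=5))
```

The `max_delay` parameter corresponds to the design limit mentioned above: an echo arriving later than the search window is never matched and therefore never cancelled.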

