User satisfaction with the dialing experience and perception of voice quality are the foremost factors when making a phone call. Users assign descriptive adjectives (good, OK, poor, terrible) to
The elements of voice quality
Good clarity is the primary description of an acceptable voice call. Clarity is the speech clearness, fidelity, intelligibility and lack of distortion. The following five components define the elements of sound quality for one direction of a call:
The speech volume level cannot be too low (whispering) or too high (shouting).
All speech is distorted, even in the PSTN conversion of analog to digital speech. Is the distortion perceivable to the listener? The greater the distortion, the poorer the comprehension of the conversation will be. You may not even be able to recognize the speaker.
Background noise exists in the form of static and hum in all calls. This is known as noise level. The noise may, however, be at a level low enough for the listener not to notice it at all.
The signal level (loudness) may change, increasing or decreasing during the call.
Crosstalk occurs when another conversation on a separate call can be heard on the user's call.
The next four elements finish the list of factors to be considered for voice quality. The first five sound-quality elements in combination with the following four can be termed voice quality or conversation quality:
Echo is the sound of the speaker's voice returning to -- and being heard by -- the speaker. Think of echo as a problem of long round-trip delay. The speaker may not perceive short-delay echoes, but the longer the round-trip delay, the more difficult the echo is to ignore. The speaker will probably pause so that the echo does not interfere with the speech.
Latency (end-to-end delay)
Latency is the time it takes for speech to travel from the speaker's mouthpiece to the listener's earpiece. The PSTN, within the U.S., usually has a delay of 30 ms or less. The latency goal is to have a one-way delay of 100 ms or less in VoIP calls, with an upper limit of 150 ms. Very long latency will cause the speakers to pause because they are not sure when the other speaker has finished, or they may barge in on each other's conversation.
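The delay targets above can be expressed as a simple check. The following is a minimal Python sketch, not taken from any VoIP product: the function name and the category labels are illustrative assumptions, and only the 100 ms goal and the 150 ms upper limit come from the text.

```python
# Thresholds taken from the text: one-way delay goal and upper limit for VoIP.
GOAL_MS = 100
UPPER_LIMIT_MS = 150


def classify_one_way_delay(delay_ms):
    """Classify a measured one-way (mouth-to-ear) delay in milliseconds.

    The returned labels are hypothetical names chosen for this example.
    """
    if delay_ms < 0:
        raise ValueError("delay cannot be negative")
    if delay_ms <= GOAL_MS:
        return "meets goal"
    if delay_ms <= UPPER_LIMIT_MS:
        return "within upper limit"
    return "exceeds upper limit"


# Example measurements (made-up values): a PSTN-like call and two VoIP calls.
for delay in (30, 120, 200):
    print(delay, "ms:", classify_one_way_delay(delay))
```

A check like this would normally be fed by measured one-way delay, which is harder to obtain than round-trip delay because it requires synchronized clocks at both ends.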
Silence suppression/Voice Activity Detection (VAD) performance
Silence suppression is used in VoIP to reduce bandwidth consumption. When silence suppression and VAD are used, the beginnings and ends of words tend to be clipped off, especially the "T" and "S" sounds at the end of a word.
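To see why quiet word endings get clipped, consider a minimal sketch of energy-based voice activity detection in Python. This is a simplified illustration, not how any particular codec or gateway implements VAD; the function names, frame size, threshold and sample values are all assumptions chosen for the example.

```python
def frame_energy(frame):
    """Return the mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def detect_voice(samples, frame_size=4, threshold=0.01):
    """Return one flag per frame: True if the frame would be sent,
    False if it would be suppressed as silence.

    frame_size and threshold are hypothetical values for illustration.
    """
    flags = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        flags.append(frame_energy(frame) >= threshold)
    return flags


# Made-up samples: a loud vowel, a quiet trailing consonant, then silence.
word = [0.5, -0.4, 0.6, -0.5,
        0.05, -0.04, 0.06, -0.05,
        0.0, 0.0, 0.0, 0.0]

print(detect_voice(word))  # [True, False, False]
```

The second frame carries real speech, but its energy falls below the threshold, so it is dropped along with the true silence. Lowering the threshold would keep more of the word ending at the cost of sending more background noise.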
Echo canceller performance
The longer the latency, the more the echo needs to be eliminated. Echoes may occur in only one direction or in both directions. The echo cancellers may not work, or they may not be able to effectively compensate when there is significant jitter during the VoIP connection.
The combination of these nine elements will contribute to the clarity of a voice call. An excellent tutorial on these factors can be found in Voice Quality (VQ) in Converging Telephony and Internet Protocol (IP) Networks.
Mean Opinion Score (MOS)
Mean Opinion Score (MOS) is a standard numeric value used to measure and report on voice quality. The MOS has a range from a maximum score of 5, which is considered to be the same as speaking directly into the person's ear, to a value of 1, which is an unacceptable voice quality to all users. MOS does not include what has been defined as the call experience, only the sound or voice quality.
An MOS of 4.4 to 4.5 is considered equivalent to a toll-quality call as experienced on the PSTN. Users who experience an MOS of 4.5 will be very satisfied. An MOS of 4.0 is still considered acceptable to the vast majority of users. When the MOS decreases to 3.5, some users may find the voice quality unacceptable. Most cellular calls have an MOS rating of 3.8 to 4.0, where speaker and word recognition may be impaired.
When the MOS falls below 3.5, users will be dissatisfied and hang up. An MOS below 2.6 is considered to be an awful call. A user on such a call will need to find an alternative network for it -- for example, by ending a wireless call and moving to the PSTN.
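The MOS bands described above can be summarized as a lookup. The following Python sketch is one reading of the text, not a standard mapping: the function name and the wording of each label are assumptions, and the text gives no explicit treatment of scores that fall between its stated values (4.4, 4.0, 3.5 and 2.6), so treating each value as the lower edge of a band is also an assumption.

```python
def describe_mos(mos):
    """Map an MOS value (1 to 5) to the user reaction described in the text.

    Band edges follow the values quoted in the article; using each one as
    an inclusive lower bound is an assumption made for this example.
    """
    if not 1.0 <= mos <= 5.0:
        raise ValueError("MOS must be between 1 and 5")
    if mos >= 4.4:
        return "toll quality, users very satisfied"
    if mos >= 4.0:
        return "acceptable to the vast majority of users"
    if mos >= 3.5:
        return "some users may find the quality unacceptable"
    if mos >= 2.6:
        return "users dissatisfied, likely to hang up"
    return "awful, alternative network needed"


# Made-up scores for illustration.
for score in (4.5, 3.8, 3.0, 2.0):
    print(score, "->", describe_mos(score))
```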
The P.800 standard from the International Telecommunication Union (ITU), which defines the MOS measuring technique, was last updated in the mid-1990s, and the technique continues to be a subjective exercise. Thirty or more people are asked to listen to 8 to 10 seconds of speech under controlled conditions. The listeners are asked to rate their opinions of the calls from very satisfied to awful, scoring the calls from 5 to 1.
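As the name suggests, the score is simply the arithmetic mean of the listeners' ratings. A minimal Python sketch follows; the function name and the panel ratings are hypothetical and are not taken from any real listening test.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average a panel's ratings, each an integer from 1 (worst) to 5 (best)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        if rating not in (1, 2, 3, 4, 5):
            raise ValueError(f"invalid rating: {rating!r}")
    return mean(ratings)


# Hypothetical panel of 30 listeners rating the same speech sample.
panel = [5] * 6 + [4] * 15 + [3] * 7 + [2] * 2

print(round(mean_opinion_score(panel), 2))  # 3.83
```

Because the result is an average, a few very low ratings can pull the score down noticeably, which is one reason the test calls for a sizable panel and controlled listening conditions.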
The industry started to move to objective machine measurement of voice quality several years ago, with the advent of cellular phone networks. There are algorithms for calculating and predicting the MOS for VoIP communications; these will be covered in later tips.
About the author
Gary Audin has more than 40 years of computer, communications and security experience. He has planned, designed, specified, implemented and operated data, LAN and telephone networks. These have included local area, national and international networks as well as VoIP and IP convergent networks in the U.S., Canada, Europe, Australia and Asia.
This was first published in November 2007