
VoIP natural fit with text-to-speech information delivery

Text-to-speech translation software enables VoIP users to listen to e-mails, text messages, or other textual information. Understanding what's involved in text-to-speech also helps to highlight some of the potential technical hurdles involved.

The individual components vary widely with the computing platforms that make Voice over IP work, but there's no inherent reason why automated text-to-speech software can't plug into the mix. It lets VoIP users listen to e-mails, text messages, or other forms of textual information (including documents of many kinds) instead of reading them on a display. As you can imagine, this can be particularly helpful for those using hand-held devices with limited screen real estate and equally limited text display capabilities.

Understanding how text-to-speech works also helps to highlight some of the potential technical hurdles. The technology, which also goes by the name of speech synthesis, has several interesting components. The process of converting text into sound tends to work something like this:

  • An application generates a text file that serves as the input to the speech synthesis process.
  • A special program called a speech synthesis engine converts words into phonetic (sound elements) and prosodic (speech elements for emphasis and inflection) symbols.
  • The phonetic and prosodic elements are rendered into a digital audio stream.
  • A sound card converts the digital audio stream into an analog audio signal.
  • The analog signal is amplified and played back through one or more loudspeakers (which may be in a handset or headset, especially for VoIP-related uses).
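The software steps in this list (text input, conversion to phonetic and prosodic symbols, and rendering to digital audio) can be sketched in Python. This is a toy illustration under loud assumptions, not how any real speech synthesis engine is built: the three-word lexicon, the function names, the pitch values, and the use of a plain sine tone in place of real speech waveforms are all made up for the example. The sound card and loudspeaker steps are hardware and are left out.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 8000   # Hz; a narrowband telephony rate, chosen for illustration
PHONEME_MS = 80      # assumed fixed length of every phoneme, in milliseconds

# Hypothetical toy lexicon. Real engines rely on large pronunciation
# dictionaries plus letter-to-sound rules.
TOY_LEXICON = {
    "you": ["Y", "UW"],
    "have": ["HH", "AE", "V"],
    "mail": ["M", "EY", "L"],
}

def to_phonemes(text):
    """Map each word to phonetic symbols; unknown words are spelled out."""
    words = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        words.append(TOY_LEXICON.get(word, list(word.upper())))
    return words

def add_prosody(words, text):
    """Attach a pitch (Hz) to every phoneme: the last word rises for a
    question and falls otherwise. Returns (phoneme, pitch) pairs."""
    rising = text.rstrip().endswith("?")
    marked = []
    for i, phonemes in enumerate(words):
        pitch = 120.0
        if i == len(words) - 1:
            pitch = 150.0 if rising else 100.0
        marked.extend((ph, pitch) for ph in phonemes)
    return marked

def render_pcm(marked):
    """Render symbols to 16-bit mono PCM. A sine tone per phoneme stands in
    for real waveform generation."""
    n = SAMPLE_RATE * PHONEME_MS // 1000
    samples = []
    for _phoneme, pitch in marked:
        for k in range(n):
            samples.append(int(8000 * math.sin(2 * math.pi * pitch * k / SAMPLE_RATE)))
    return struct.pack("<%dh" % len(samples), *samples)

def synthesize(text):
    """Run the whole toy pipeline and return the audio as WAV file bytes."""
    pcm = render_pcm(add_prosody(to_phonemes(text), text))
    buf = io.BytesIO()
    with wave.open(buf, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(SAMPLE_RATE)
        out.writeframes(pcm)
    return buf.getvalue()
```

Calling `synthesize("You have mail.")` returns a short WAV file of tones, one per phoneme. The point is the shape of the pipeline: each stage consumes the previous stage's output, so any stage can be replaced without touching the others.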

The interesting parts of the process that come into play with VoIP require some understanding of the type of access and services involved. Given the right kind of hardware and services on the call handling end (where the VoIP user goes to pick up data), text-to-speech may be performed on demand and on the fly, when a user requests it, or it may be performed in advance and stored in audio format, when pre-defined services exist and can be configured to do such work ahead of time.
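The two modes described here can be sketched as a small service that checks a store of pre-rendered audio first and falls back to on-the-fly conversion. This is only a sketch: `SpeechService`, `prerender`, `fetch`, and `fake_synthesize` are hypothetical names invented for the example, and the stand-in synthesizer just tags the text rather than producing real audio.

```python
class SpeechService:
    """Serves audio for a piece of text, either from a store that was
    filled in advance or by converting the text when it is requested."""

    def __init__(self, synthesize):
        self._synthesize = synthesize   # any callable: text -> audio bytes
        self._prerendered = {}          # text -> audio rendered in advance
        self.on_demand_count = 0        # how often we had to convert on the fly

    def prerender(self, text):
        """Convert predictable text ahead of time and keep the audio."""
        self._prerendered[text] = self._synthesize(text)

    def fetch(self, text):
        """Return stored audio if it exists, otherwise convert on demand."""
        audio = self._prerendered.get(text)
        if audio is None:
            self.on_demand_count += 1
            audio = self._synthesize(text)
        return audio


def fake_synthesize(text):
    """Stand-in for a real speech synthesis engine (an assumption)."""
    return ("AUDIO:" + text).encode("utf-8")


service = SpeechService(fake_synthesize)
service.prerender("The office is closed.")            # predictable: done in advance
greeting = service.fetch("The office is closed.")     # served from the store
message = service.fetch("You have 3 new messages.")   # converted on demand
```

Nothing is stored for the on-demand request, which reflects the trade-off in the text: unpredictable content is converted when asked for instead of being kept around as audio.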

Audio takes far more space than the text it is generated from. At the 64 kbit/s rate of uncompressed G.711 telephony audio, one minute of speech occupies about 480 KB, while the words spoken in that minute (assuming a typical pace of about 150 words per minute) amount to roughly 1 KB of text. Because it's also not possible to predict that a user will always want certain files delivered in audible rather than visual form, the overwhelming technology emphasis is on performing text-to-speech conversion in real time, initiated by a user's specific request for such service. Only for eminently predictable and manageable conversions, such as time reporting, does it prove worthwhile to generate the library of possible utterances in speech form in advance and then stitch them together when a user requests the time.
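The time-reporting case can be sketched as follows. The prompt names, the phrasing ("the time is two oh five pm"), and the idea of holding each recording as raw bytes in a dictionary are assumptions made for illustration; an actual service would choose its own wording and audio format.

```python
import datetime

NUMBER_WORDS = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty"}

# Every prompt that has to be recorded (or synthesized) once, in advance.
VOCABULARY = (
    set(NUMBER_WORDS[1:])
    | set(TENS.values())
    | {"the_time_is", "oclock", "oh", "am", "pm"}
)

def number_prompts(n):
    """Return the prompt names that speak a number from 1 to 59."""
    if n < 20:
        return [NUMBER_WORDS[n]]
    tens, ones = divmod(n, 10)
    return [TENS[tens]] + ([NUMBER_WORDS[ones]] if ones else [])

def time_prompts(t):
    """Return the ordered prompt names for a datetime.time value."""
    hour12 = t.hour % 12 or 12
    prompts = ["the_time_is"] + number_prompts(hour12)
    if t.minute == 0:
        prompts.append("oclock")
    else:
        if t.minute < 10:
            prompts.append("oh")
        prompts += number_prompts(t.minute)
    prompts.append("am" if t.hour < 12 else "pm")
    return prompts

def stitch(prompts, library):
    """Concatenate pre-rendered audio clips (raw bytes) in order."""
    return b"".join(library[name] for name in prompts)
```

For example, `time_prompts(datetime.time(14, 5))` yields `["the_time_is", "two", "oh", "five", "pm"]`. Under this scheme 28 stored prompts are enough to speak all 1,440 minutes of the day, which is why a small, predictable vocabulary is worth rendering ahead of time.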

Possible uses for this technology are legion, but the most commonly available implementations stick to widely used, obviously important things such as accessing text e-mail messages in voice form via telephone (and VoIP) connections. Thus, Outlook- and Exchange-savvy messaging environments such as Asterisk, Cisco Unity (through the somewhat paradoxically named ViewMail facility), and Avaya Unified Messenger all provide varying degrees of capability to enterprises. For those with needs at a more modest scale, numerous voicemail plug-ins for Outlook, such as CallAudit Voice, PhoneMax, RVS-COM, and Simply BitWare, offer varying degrees of e-mail and voice integration for phone access to Outlook inbox contents.

But a key ingredient in the underlying technology that makes such solutions work is rendering text into speech, and that's where speech synthesis engines come into play. To some degree, listening to text data through a VoIP link simply represents remote access to such a capability, but at the lowest level of detail there's much more to it than that.

About the author
Ed Tittel is a regular contributor to numerous TechTarget Web sites, and the author of over 100 books on a wide range of computing subjects from markup languages to information security. He's also a contributing editor for Certification Magazine, and edits Que Publishing's Exam Cram 2 and Training Guide series of cert prep books.
