When SIP trunks fail: Disaster recovery for SIP trunking

When your SIP trunks fail, do you have a disaster recovery plan for SIP trunking to support troubleshooting and resolution of SIP trunk failures, or are you solely dependent on what the carriers and/or ITSPs tell you?

When SIP trunks fail, knowing your network will help you devise a disaster recovery SIP trunking strategy to gain alternative and diverse routing and high availability, and to retain the cost savings of SIP trunking.

The many solutions and network designs range from simple to complex. Ultimately, even with complex mesh networks, hammering out simple designs for SIP trunk failover will require effort, resources and a lot of testing and re-testing to ensure that all the details are exposed to best plan for expedient disaster recovery of SIP trunk failures. In part 1, When SIP trunks fail: ITSP disaster recovery plans, I outlined the key criteria for selecting an Internet telephony service provider (ITSP) for SIP trunking services and what you need to know about an ITSP's disaster recovery plan for SIP trunk failures.

Since networks are configured differently, part 2 of this series will cover the generic concerns and considerations for crafting a solid disaster recovery plan for SIP trunking in hopes of encouraging thoughtful consideration of your individual situation. (See the illustration of basic SIP trunks below.)

Before venturing into a disaster recovery plan for SIP trunk failures, you first need to assess on which end of the SIP trunks you want relief. On your end, you need to determine what level of redundancy and failover is sufficient for your business. This could be a mobile gateway card, a POTS line, a T1 used for overflow and redundancy, or a standby data center with diverse routes. Each solution requires different adaptations of overflow, alternative or least-cost routing, or load balancing and must also be cost effective for your specific business. Diverse routing for your WAN connection may be optional, but when provided, it must be integrated into the SIP trunk design upfront.

When the WAN link fails, some customers have firewalls, routers or integrated access devices (IADs) configured with auto failover and load-balancing mechanisms that use probes and timers. Once the probe fails to reach a preloaded external IP address within a specified time, the failover route kicks in to another WAN link. This overflow and failover mechanism in firewalls should not interfere with existing calls unless the WAN link fails. Again, when your SIP voice traffic originates from a different public IP address, your ITSP must know about the secondary IP address before building the SIP trunks for your company. Whether or not your SIP voice traffic remains live during primary WAN failures depends on your telephony solution capabilities, such as:

  • Providing multiple SIP servers for each SIP trunk. This means separate uniform resource identifiers (URIs) for primary and secondary SIP server addresses so that if the primary address does not respond, the secondary address can.
  • Routing between SIP trunks during a SIP trunk failure. Re-routing will occur only if the telephony solution fails to receive an initial response to the SIP INVITE. (Here Iptel.org explains SIP messages, including SIP INVITE, used to initiate a SIP session.) If an alert, pre-connect, or connect is received, the call will stay with that trunk and will not be re-routed. When there is no response, the telephony solution should be able to route to another trunk.
  • Supporting SIP keepalives in the form of INFO or OPTIONS messages to tear down an active call if there is no response. This prevents a call from becoming stuck if the SIP server quits responding mid-call but does not alter call routing.
  • Keeping SIP trunk routing resilient when fully qualified domain names (FQDNs) are used and DNS lookups fail.
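The trunk re-routing rule in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not any vendor's implementation: `place_call`, the `send_invite` callback and the trunk names are hypothetical, and the sketch assumes the telephony solution reports either the first SIP status code it receives or `None` when its timer expires.

```python
from typing import Callable, Optional, Sequence

# Hypothetical callback: sends an INVITE on the named trunk and returns the
# first SIP status code received, or None if nothing arrives before the timer expires.
SendInvite = Callable[[str, str], Optional[int]]


def place_call(number: str, trunks: Sequence[str], send_invite: SendInvite) -> Optional[str]:
    """Return the trunk that responded to the INVITE, or None if none did."""
    for trunk in trunks:
        status = send_invite(trunk, number)
        if status is None:
            # No response at all: treat the trunk as failed and try the next one.
            continue
        # Any response (100 Trying, 180 Ringing, 200 OK, or an error) means the
        # far end is alive, so the call stays with this trunk and is not re-routed.
        return trunk
    return None


def fake_send(trunk: str, number: str) -> Optional[int]:
    # Simulated network for the demo: the primary trunk is silent.
    return None if trunk == "primary" else 180


print(place_call("2025550100", ["primary", "secondary"], fake_send))  # prints: secondary
```

The design point is that only silence triggers a move to the next trunk. A busy or error response still proves the SIP server is reachable, so it is handled by normal call treatment rather than by failover.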

In a fully meshed network, SIP trunking failover and diverse routing can become much more complicated. Last year, Verizon's BEST (Burstable Enterprise Shared Trunks) capability was introduced. BEST allows idle capacity to be used at another location. In addition, the customer will get unlimited site-to-site calling.

SIP REFER -- already supported in Verizon's VoIP products -- provides overflow routing when one site is too busy or when there's not enough available bandwidth. SIP REFER will route the call to another call center or site. The features and services of Verizon's BEST provide the ability to pack more calls in the pipes, get unlimited on-net calling, and re-route calls from busy or full pipes to other call centers or sites. Also, customers can transform their business operations through a centralized design by utilization of failover and load-sharing capabilities that aren't available in a TDM environment.

In the voice world, we are constantly aware of traffic-blocking situations. Blocking is defined by the inability to complete a telephone call -- getting dial tone, dialing digits, hearing ringback or busy tone -- but not a network fast-busy or dead air.

In a traditional telephony or TDM environment, there are several reasons for blocking: processor load, too few or failed trunks, software registers, system capacities/limitations, dial-plan misconfigurations, PSTN and carrier issues, or traffic loads. In an IP network, additional causes of blocking are network congestion, hardware or link failure, WAN or ITSP congestion, DNS misconfigurations, and system capacities/limitations.

Illustration: Basic SIP Trunks.

Call flow mapping for SIP trunking

An intensive process to reveal potential pitfalls in your network and even basic call routing designs is call flow mapping of every type of telephone call to and from your network. My suggestion is to use whatever visual tool or graphic aid you are comfortable with, even if it means writing it down using just paper and pencil. Map the calls and how they flow in and out of your network, and then do the same in a failover situation.

You may have two data centers -- each backing up the other. You may have just one IP PBX or telephony solution in place. Your network could be MPLS-powered or you could have a hybrid-hosted environment. Whatever your solution, mapping the call flows should reveal potential problems with dial plans, including preservation of caller ID (inbound or outbound) and resolution of DNS/URI addresses, conflicts in routing, potential bandwidth issues, supporting equipment weaknesses, and call blocking situations. It is an exercise that should pay off.

In the illustration below, Call mapping: What is missing? note what is missing and then break down each connection point in the call path as it applies to your network. There is a cellular connection to the telephony solution but no gateway shown or supporting equipment. When calls are routed through the dial plan to devices or destinations, are there other conditions such as overflow routing and call forwarding?

Illustration: Call mapping: What is missing?

Also note that the firewall or even the telephony solution, such as Mitel's or Zultys' IP PBX, may also include a session border controller (SBC). Large enterprises deploy SBCs to centralize call routing in and out of their networks. (See Acme Packet illustration below.)

Illustration: SBC graphic.

Some companies are having issues with fax traffic or fax over IP (FoIP) over SIP trunks. By prepending a few extra digits to fax calls so that the ITSP can recognize them as fax, you allow the ITSP to strip those digits and route the call using only the G.711 codec, for example, instead of G.729. In your call flow mapping efforts, be sure to document codec choices and priorities for each call type.
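As a rough illustration of the fax-prefix approach, the sketch below strips an agreed prefix and returns a codec list for the call type. Everything specific here is an assumption for the example: the prefix `8888`, the function name `classify_call` and the codec priority lists are hypothetical, and a real prefix has to be agreed with your ITSP and must not collide with anything in your dial plan.

```python
from typing import List, Tuple

FAX_PREFIX = "8888"                 # hypothetical prefix agreed with the ITSP
VOICE_CODECS = ["G.729", "G.711"]   # example priority for ordinary voice calls
FAX_CODECS = ["G.711"]              # fax calls are offered G.711 only


def classify_call(dialed: str) -> Tuple[str, List[str]]:
    """Return (number to route, codec priority list) for the dialed digits."""
    if dialed.startswith(FAX_PREFIX):
        # Fax call: strip the marker digits and restrict the codec offer.
        return dialed[len(FAX_PREFIX):], FAX_CODECS
    return dialed, VOICE_CODECS


print(classify_call("88882025550123"))
print(classify_call("2025550123"))
```

Writing the rule down this way, one entry per call type, is also a convenient format for the codec documentation that call flow mapping should produce.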

The network inventory needs to address whether or not the elements that the voice traffic touches are sufficient to handle the voice traffic load and Real-time Transport Protocol (RTP) packets. SIP-aware firewalls are a must. Next, determine whether or not the telephony solution can overflow calls to other routes such as the PSTN, T1/PRI, cellular and/or POTS.

When SIP trunks fail or are all occupied, the device, such as the IP PBX, will ideally auto lock out those failed SIP trunks or overflow and route calls over other facilities. In another mindset, if the SIP trunks are all busy (concurrent call sessions), then the telephony solution must overflow the traffic to other facilities, preferably based on least-cost routing first and highest-cost routes last. So old features like least-cost routing and time-of-day routing capabilities are still key in the new telephony model. Remember to examine processor loads, memory and throughput for telephony applications residing on servers.
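The overflow behavior described above, locking out failed facilities and preferring the cheapest route that still has room, might look something like the following. This is a simplified sketch: the `Route` class, its fields and the relative cost ranks are hypothetical, and a real IP PBX would also apply time-of-day rules and dial-plan restrictions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Route:
    name: str
    cost: int                  # relative cost rank, lower is cheaper (hypothetical)
    capacity: int              # maximum concurrent calls on this facility
    active: int = 0            # calls currently in progress
    locked_out: bool = False   # set when the facility has failed

    def available(self) -> bool:
        return not self.locked_out and self.active < self.capacity


def select_route(routes: List[Route]) -> Optional[Route]:
    """Pick the cheapest route that is neither locked out nor full."""
    candidates = [r for r in routes if r.available()]
    return min(candidates, key=lambda r: r.cost) if candidates else None


routes = [
    Route("SIP trunk", cost=1, capacity=2, active=2),   # all sessions busy
    Route("T1/PRI", cost=2, capacity=23),
    Route("cellular gateway", cost=3, capacity=1),
]
print(select_route(routes).name)   # prints: T1/PRI

routes[1].locked_out = True        # simulate a failed PRI
print(select_route(routes).name)   # prints: cellular gateway
```

Treating "full" and "failed" as two separate conditions keeps the overflow case and the failover case visible in testing, even though both end with the call leaving on a different facility.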

But this doesn't really address a higher order of redundancy. Having multiple facilities addresses alternative and overflow routing acting as the basic failover. When the primary WAN link fails, the SIP server at the ITSP must expect a different originating public IP address from your hardware. Your IP PBX or other device IP address remains the same, and this is also reported to the SIP server. When you reroute traffic through a different IP PBX or device on your network back to the SIP server (ITSP), then you may be routing a different account and authentication data to different IP addresses. This really gets you into thinking about your network and how to develop a strategy to gain alternative and diverse routing, high availability and best cost.

For the two data centers serving the enterprise, each data center's IP address becomes the redundant link for the other -- meaning that if the primary data center WAN link fails, the secondary data center takes over. This failover scenario using a backup data center also signifies that synchronization must be taking place between the two sites, and this indicates that potential timing issues during failover could still occur.

Still, the ITSP must be aware of public IP addresses for primary and all other alternative routes from which traffic originates on your network. Otherwise, they may see the traffic as unauthorized and block the traffic. For single-site locations typical in the SMB space, there is usually no redundant system -- if the IP PBX, firewall, router or IAD is down, then the ITSP simply reroutes traffic to a landline, cell or POTS number.
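One way to catch an unregistered failover address before the ITSP blocks it is to compare your own list of egress addresses against what the ITSP has on file. The sketch below assumes you maintain both lists yourself; the function name `unregistered_sources` is hypothetical and the addresses come from the ranges reserved for documentation.

```python
import ipaddress
from typing import Iterable, List


def unregistered_sources(egress_ips: Iterable[str],
                         registered_with_itsp: Iterable[str]) -> List[str]:
    """Return the egress addresses the ITSP has not been told about."""
    # Entries on file may be single addresses or CIDR blocks.
    allowed = [ipaddress.ip_network(entry) for entry in registered_with_itsp]
    missing = []
    for ip in egress_ips:
        addr = ipaddress.ip_address(ip)
        if not any(addr in net for net in allowed):
            missing.append(ip)
    return missing


# Primary WAN, backup WAN and standby data center (documentation addresses).
egress = ["203.0.113.10", "198.51.100.20", "192.0.2.5"]
on_file = ["203.0.113.10", "192.0.2.0/28"]
print(unregistered_sources(egress, on_file))   # prints: ['198.51.100.20']
```

Any address the check reports is a route that would carry traffic during failover but that the ITSP may reject as unauthorized.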

SIP trunking failover and overflow can be a simple or complex configuration

There's one really basic concept about telephones that should also guide you toward either simple or complex configurations for failover and overflow routing. Key system or square telephony solutions allow users to see all outside line appearances, and they must press a telephone button to select a specific line.

In SIP trunking, these are just concurrent call sessions (CCS). The IP PBX, for example, provides dial tone to each button depressed: line 1, line 2, and so on. In PBX mode, by contrast, trunks or lines are put in trunk groups or line pools, and users dial an access code, such as "9." Unlike the square or key system modes of operation, the facilities are not mapped to individual buttons.

Not all telephony solutions can support two or more uniform resource identifiers (URIs) for the same SIP trunks (redundancy) and still provide individual line appearances on user telephones. Granted, most of these configurations reside in the SMB space, and if the ITSP has diverse routing and local and geographic redundancy, then you need to focus on failover. Another consideration is anything offsite connecting to the network, such as telecommuters.

Any 800 traffic also needs evaluation -- how to treat the calls and routing to ensure that continuity remains. Unlike traditional telephone box testing, new rules apply -- you will still test the old ways/methods from each location during normal operation and in failover mode (even forcing overflow and failover conditions), but you will also need to test fully qualified domain name (FQDN) resolution in each condition. Keep simplicity in mind and build off each success.
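FQDN resolution is easy to script so that the same check can be repeated in normal operation and again in each forced failover condition. The following is a minimal sketch using Python's standard `socket.getaddrinfo`; the host names are placeholders, the function names are hypothetical, and it only covers ordinary address lookups, so if your ITSP publishes SRV or NAPTR records you would need a DNS library to test those.

```python
import socket
from typing import Callable, Dict, Iterable, List


def resolve(host: str) -> List[str]:
    """Resolve a host name with the system resolver; return [] on failure."""
    try:
        infos = socket.getaddrinfo(host, 5060, proto=socket.IPPROTO_UDP)
    except socket.gaierror:
        return []
    # Each entry's sockaddr starts with the IP address as a string.
    return sorted({info[4][0] for info in infos})


def check_fqdns(fqdns: Iterable[str],
                resolver: Callable[[str], List[str]] = resolve) -> Dict[str, List[str]]:
    """Map each FQDN to the addresses it resolves to (empty list = failed)."""
    return {name: resolver(name) for name in fqdns}


def fake_resolver(name: str) -> List[str]:
    # Stand-in for a failover condition in which the secondary name stops resolving.
    return ["192.0.2.10"] if name == "sip1.example.com" else []


results = check_fqdns(["sip1.example.com", "sip2.example.com"], resolver=fake_resolver)
failed = [name for name, addrs in results.items() if not addrs]
print(failed)   # prints: ['sip2.example.com']
```

Calling `check_fqdns` without the `resolver` argument uses the system DNS. Running it from each location and over each WAN link shows whether the names your SIP trunks depend on still resolve when the primary path is down.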

Lastly, combining traditional telephone tools such as alternative and least-cost routing with IT network features of overflow and load balancing blends the overall solution. It's easy to confuse them because they do overlap, arguably to the point of creating redundancy that is acceptable for some companies. By combining diverse routes with enterprise data centers, the solution quickly becomes robust. Deciding whether or not to use SBCs to simplify and centralize call routing remains a key point for larger and distributed enterprises to consider. The promise of SIP trunking without enterprise-grade data centers still takes customers a step above the old telephony model.


  • What is the monetary cost of traffic when SIP trunks fail?
  • Are your failover routes/links rated at the same capacity as your primary routes/links?
  • How will the dial plan be affected?
  • Is more than one SIP trunk provider necessary?
  • Is the voice traffic service or revenue affected?
  • Will failover be transparent to E911, telecommuters and to called/calling parties?
  • Is the failover configuration sustainable/manageable?
  • Is the solution for failover cost effective?
  • Document everything, especially digit manipulation, dial plans and SIP URI configs, and then back up configurations building on each success.
  • Test, test, test, including every call type.
  • Your redundant gear and capabilities need to match.
  • Is your solution transparent to the user?
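For the capacity question in the list above, a back-of-the-envelope estimate of how many calls a failover link can carry is often enough to expose a mismatch. The sketch below computes per-call bandwidth at the IP layer from the codec bit rate and packet interval. It assumes IPv4 with no header compression, counts one direction only and ignores layer-2 overhead, so treat the results as optimistic; the function names and the 1,536 kbps link figure are illustrative.

```python
RTP_UDP_IP_OVERHEAD_BYTES = 12 + 8 + 20   # RTP + UDP + IPv4 headers per packet


def call_kbps(codec_kbps: float, packet_ms: int = 20) -> float:
    """IP-layer bandwidth of one call, one direction, in kbps."""
    payload_bytes = codec_kbps * 1000 / 8 * packet_ms / 1000
    packets_per_second = 1000 / packet_ms
    return (payload_bytes + RTP_UDP_IP_OVERHEAD_BYTES) * 8 * packets_per_second / 1000


def max_calls(link_kbps: float, codec_kbps: float, packet_ms: int = 20) -> int:
    """Number of simultaneous calls that fit on the link, voice only."""
    return int(link_kbps // call_kbps(codec_kbps, packet_ms))


print(call_kbps(64))          # G.711 at 20 ms: 80.0 kbps
print(call_kbps(8))           # G.729 at 20 ms: 24.0 kbps
print(max_calls(1536, 64))    # G.711 calls on a 1,536 kbps link: 19
```

If the failover link shares bandwidth with data traffic, subtract that load before comparing the result with the number of concurrent call sessions on the primary route.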

Resolving SIP trunk failures in the real world

Our IP PBX (Panasonic NCP1000) works reasonably well with our SIP trunks from Broadvox. We currently do not have dual SIP server entries (URIs) using the same provider because the services will not work in that manner without having two different accounts. Having two separate accounts means having dual SIP trunk appearances -- meaning that for every SIP trunk we have, we'd also have a second SIP trunk, and this isn't practical for small telephony systems in the SMB space.

The other issue is that we have all the lines mapped to individual buttons on telephones. (They are not physical lines -- remember, CCS is just a maximum number of simultaneous conversations supported. The IP PBX emulates "dial tone" when each "line button" on a phone is depressed.)

While our IP PBX doesn't work in this manner, ADTRAN's IPT 7100 series solution does. It is an IAD that was built from the ground up with these considerations.

Whenever our SIP trunks fail, the provider automatically routes (fails over) all inbound calls destined to our IP PBX to a cell phone number that we use with a Multitech GSM gateway connected to our IP PBX. Considering our business size and the way in which we use telephone services, this is cost effective, and the risks are acceptable to us.

Our ISP is Verizon FiOS. We do not have a failover route for the WAN link for either voice or data. Our outbound voice traffic fails over to the GSM gateway, and we have our iPhones for outbound dialing.

Since November 2005, we have experienced two outages. The first occurred when Verizon discontinued use of Point-to-Point Protocol (PPP) for connecting/authenticating our router and did not notify us of the change. The second, in 2007, occurred when we found through monitoring our ADTRAN Integrated Access Device (IAD) that the Verizon Optical Network Terminal (ONT) had a defective network interface card (NIC), causing dropped packets along with input, symbol and alignment errors that rapidly increased and created intermittent connectivity issues for our users.

Our configuration relies on a highly reliable WAN link (Verizon's FiOS for business). We are taking advantage of that reliability while keeping our costs down and using the ITSP's failover capabilities to protect our inbound calls, which are more important to us than outbound dialing. And we have not forsaken 911.

Regardless of your SIP trunking configuration, when your SIP trunk fails, do you have mechanisms in place to provide the metrics to support troubleshooting and resolution of SIP trunk failures, or are you solely dependent on what the carriers and/or providers tell you?

About the author: Matt Brunk is a 35-year veteran of the telecommunications industry and president of Telecomworx, a Washington, D.C., area interconnect company. Previously, he was chief network engineer for Amtrak. In 2000, Brunk founded the NBX Group, whose members included dealers, users, 3Com, the media and consultants spanning 14 countries, to develop solutions for the 3Com NBX 100 and create a web portal directed at the IP PBX. He has presented at VoiceCon, authored articles for Business Communications Review and has written for the former VoIPLoop blog. He now writes weekly for the NoJitter blog.
