EBGP Multihop vs IP SLA for failover detection
Recently I have started using a new approach to failover detection that does not rely on IP SLA on Cisco routers. The problems with IP SLA and route tracking objects are non-trivial. In one particular case I have a customer with a 10 Mbps Ethernet feed that goes through a large national carrier. The carrier has a next hop router in the building, connected by 100 ft or so of Cat5/Cat6. From that router the connection is fiber/SONET all the way to the customer edge routers at the carrier office. We have experienced several failures over the last 2 years, but in both of the most recent ones IP SLA took the circuit down when it was not actually broken.
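For readers who have not worked with it, an IP SLA plus tracked-route setup on IOS generally has the shape sketched below. This is not the customer's actual configuration: every address, interface name and object number is a placeholder, and the timer values are assumptions chosen so that roughly three failed probes in a row are needed before the route is pulled.

```
! Illustrative only: all addresses, interface names and object numbers
! are placeholders (assumptions), not taken from the real network.
!   198.51.100.1 = next hop on the primary carrier circuit
!   203.0.113.1  = next hop on the backup cable connection
!   192.0.2.1    = whichever address is chosen as the ping target

! Send an ICMP echo to the target every 10 seconds.
ip sla 1
 icmp-echo 192.0.2.1 source-interface GigabitEthernet0/0
 frequency 10
ip sla schedule 1 life forever start-time now
!
! Track the probe result. Waiting 25 s before declaring "down" means
! about three 10-second probes must fail in a row.
track 1 ip sla 1 reachability
 delay down 25 up 60
!
! Pin the probe target to the primary circuit so the test never
! succeeds by going out the backup path.
ip route 192.0.2.1 255.255.255.255 198.51.100.1
!
! The primary default route is removed while track 1 is down; the
! floating static (distance 250) via the backup then takes over.
ip route 0.0.0.0 0.0.0.0 198.51.100.1 track 1
ip route 0.0.0.0 0.0.0.0 203.0.113.1 250
```

Everything hinges on the single address in the icmp-echo line, which is why the choice of ping target matters so much.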
IP SLA pings an IP address in the carrier's regional infrastructure, and when that address fails to respond for 3 consecutive tests it removes the route and switches over to a backup cable connection. The problem is that picking the IP address to ping is very tricky/fiddly. Here are the options - none are that good:
(1) Ping the next hop router. The problem is that this only tests that the carrier router down the hall in the closet is up. That's testing 100 ft of cable and the power in the closet. That being said, a lightning strike / power surge did damage the carrier gear once and it had to be replaced. But when the fiber itself is down, the carrier router will still ping fine, and of course IP packets won't go anywhere.
(2) Ping the carrier edge access router. This is a better choice since it tests through the last mile of fiber all the way to the carrier office. Still, it's not perfect: we did have one outage where that path was fine but the routing beyond it was broken. The carrier stated that it was a damaged card and moved the customer to a new port, which of course meant a new router with a new IP address to ping. The old router stopped answering pings for some reason, so IP SLA still did not bring the path back up even after the carrier had fixed their side.
(3) Ping a well known IP such as the old BBN/GTE 4.2.2.2 DNS server or the Google 8.8.8.8 or 8.8.4.4 anycasted DNS servers. This has the problem of making the customer pipe dependent on an unrelated service, and said service is not intended to be a pingable service IP - it's a DNS server. (Pings could be de-preferenced if network congestion occurs at the DNS provider side, for example.)
(4) EBGP multihop to a stable, secure datacenter. A BGP router at the datacenter sends a network announcement to the customer edge router, and the customer edge router ties its default route to this route learned via BGP. Private (internal) ASNs are used, setup is simple and fast, and the BGP protocol is well suited to passing the route in a robust manner - small blips or micro-outages don't take the session down, as can sometimes happen with IP SLA. No traffic is passed across the BGP session other than the small updates/keepalives that keep the session and the single route active. Unlike a GRE tunnel, where the customer data traffic would be tunnelled over to the datacenter, this simple setup is only used for learning a route, not for tunnelling any traffic.
We also use a static route to the BGP peering IP address so that the BGP session can only come up when the pipe it is supposed to run across is up.
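As a rough illustration of option (4), the customer-side configuration might look something like the sketch below. The post does not say exactly how the default route is tied to the BGP-learned route; a tracked static route is one way to do it on IOS and is an assumption here, as are all addresses, ASNs, the prefix-list name, the multihop TTL and the object numbers.

```
! Illustrative only: addresses, ASNs, names and object numbers are
! placeholders (assumptions), not taken from the real network.
!   198.51.100.1   = next hop on the primary carrier circuit
!   203.0.113.1    = next hop on the backup cable connection
!   192.0.2.10     = BGP router at the datacenter
!   192.0.2.128/25 = the single prefix that router announces

! Reach the BGP peer only via the primary circuit, so the session
! cannot establish over the backup path.
ip route 192.0.2.10 255.255.255.255 198.51.100.1
!
! Accept nothing from the peer except the one expected prefix.
ip prefix-list BEACON-ONLY seq 5 permit 192.0.2.128/25
!
! Private ASNs on both sides; the peer is several hops away, so
! ebgp-multihop raises the TTL above the default of 1.
router bgp 64512
 neighbor 192.0.2.10 remote-as 64513
 neighbor 192.0.2.10 ebgp-multihop 20
 neighbor 192.0.2.10 update-source GigabitEthernet0/0
 neighbor 192.0.2.10 prefix-list BEACON-ONLY in
!
! Track the presence of the BGP-learned prefix in the routing table.
track 10 ip route 192.0.2.128 255.255.255.128 reachability
!
! The primary default route exists only while the learned prefix does;
! otherwise the floating static via the backup takes over.
ip route 0.0.0.0 0.0.0.0 198.51.100.1 track 10
ip route 0.0.0.0 0.0.0.0 203.0.113.1 250
```

In this sketch the announced prefix deliberately does not cover the peer address itself, so the /32 static route remains the only way to reach the peer. How quickly a failure is noticed depends on the BGP keepalive and hold timers, which are left at their defaults here.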