The AT&T Domain Name System (DNS) outage of Aug. 15, 2012 demonstrates why a non-cached method of DNS monitoring results in a faster time-to-repair (TTR), and in some cases zero downtime from a DNS issue.
To Cache or Not-to-Cache – that is the DNS Question
It is not widely known that external, HTTP-request-based website monitoring, like coffee at your local java joint, comes in different “grades”: cache-based and non-cache-based.
Along those lines, think of cache-based monitoring as “decaffeinated”: the multiple steps of the DNS resolution process are cached and therefore skipped. Whether a user of a cache-based monitoring service would have been alerted to the Aug. 15, 2012 AT&T DNS outage would largely have been a matter of luck.
Now think of non-cache monitoring as a caffeinated, double depth charge French Roast. Non-cache monitoring is what it sounds like: monitoring that does not cache any part of the DNS resolution process. Like a hearty fresh-ground French Roast, the non-cache monitoring process conducts a “fresh” resolution, i.e., a complete walk of the DNS process from root server to final IP, with every instance of monitoring. In fact, the initial intermittent AT&T DNS issues were first detected at around 5:20 AM PST, over a full hour before news reports indicate there was a full outage at the AT&T DNS service.
Based on these early alerts, website administrators could have decided to switch from the AT&T DNS service to another DNS provider a full hour before the AT&T DNS service went down at around 6:30 AM PST. These website administrators would therefore have had little to no downtime or disruption to their websites.
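A single step of the non-cached check described above might be sketched as follows. This is a minimal illustration, not the method of any particular monitoring product: it builds a DNS query with the "recursion desired" flag cleared and sends it straight to one name server, so no resolver cache can answer on that server's behalf. The function names `build_query` and `timed_query` and the 5-second timeout are assumptions made for the example. A full trace would repeat this step from a root server down to the authoritative servers, following each referral, which is omitted here.

```python
import socket
import struct
import time

QTYPE_NS = 2   # DNS record type NS (name server)
QCLASS_IN = 1  # DNS class IN (Internet)


def build_query(name: str, qtype: int = QTYPE_NS, query_id: int = 0x1234) -> bytes:
    """Build a DNS query packet with the RD (recursion desired) flag cleared."""
    # Header: ID, flags (all zero = standard query, no recursion),
    # then question, answer, authority and additional counts.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # The name is encoded as length-prefixed labels ending with a zero byte.
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack("!HH", qtype, QCLASS_IN)
    return header + question


def timed_query(server_ip: str, name: str, timeout: float = 5.0):
    """Send one query directly to server_ip over UDP.

    Returns the elapsed time in milliseconds, or None if no reply arrived
    (timeout or network error). The reply itself is not parsed here.
    """
    packet = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(packet, (server_ip, 53))
            sock.recvfrom(512)
        except OSError:  # includes socket.timeout
            return None
        return (time.monotonic() - start) * 1000.0
```

Because each check talks to the name servers directly, a `None` result or a sharply rising response time is visible on the very next monitoring instance rather than after a cached answer expires.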
The DNS trace below was taken at 5:23 AM PST, the earliest instance of the AT&T DNS servers intermittently timing out on DNS query requests:
1 A.ROOT-SERVERS.NET [220.127.116.11]: Type=NS [time 62 ms]
2 L.GTLD-SERVERS.NET [18.104.22.168]: Type=NS [time 31 ms]
3 cmtu.mt.ns.els-gms.att.net [22.214.171.124]: Type=NS [time 17628 ms] error Receive timeout.
4 cbru.br.ns.els-gms.att.net [126.96.36.199]: Type=NS [time 17628 ms] error Receive timeout.
5 A.ROOT-SERVERS.NET [188.8.131.52]: Type=NS [time 62 ms]
6 E.GTLD-SERVERS.NET [184.108.40.206]: Type=NS [time 109 ms]
7 cmtu.mt.ns.els-gms.att.net [220.127.116.11]: Type=NS [time 17628 ms] error Receive timeout.
8 cbru.br.ns.els-gms.att.net [18.104.22.168]: Type=NS [time 17628 ms] error Receive timeout.
The two AT&T DNS servers, cmtu.mt.ns.els-gms.att.net and cbru.br.ns.els-gms.att.net, show the timeout issue. AT&T DNS service server info based on: http://dpt.ip.att.net/dpt_helphome/dns_seczones.htm
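Trace output in the format shown above can be scanned automatically for failing servers. The sketch below is only an illustration: the regular expression is inferred from the lines in this article and is an assumption about the format, and the names `parse_trace` and `failing_servers` are hypothetical. The sample lines are copied from steps 1 to 4 of the trace.

```python
import re
from typing import List, NamedTuple, Optional


class Hop(NamedTuple):
    step: int
    server: str
    elapsed_ms: int
    error: Optional[str]  # None when the server answered


# Pattern inferred from the trace lines in this article (an assumption).
LINE_RE = re.compile(
    r"^(?P<step>\d+)\s+(?P<server>\S+)\s+\[[^\]]*\]:\s+Type=\w+\s+"
    r"\[time\s+(?P<ms>\d+)\s+ms\](?:\s+error\s+(?P<error>.+))?$"
)

SAMPLE_TRACE = """\
1 A.ROOT-SERVERS.NET [220.127.116.11]: Type=NS [time 62 ms]
2 L.GTLD-SERVERS.NET [18.104.22.168]: Type=NS [time 31 ms]
3 cmtu.mt.ns.els-gms.att.net [22.214.171.124]: Type=NS [time 17628 ms] error Receive timeout.
4 cbru.br.ns.els-gms.att.net [126.96.36.199]: Type=NS [time 17628 ms] error Receive timeout.
"""


def parse_trace(text: str) -> List[Hop]:
    """Turn trace lines into Hop records, skipping lines that do not match."""
    hops = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            hops.append(
                Hop(int(match["step"]), match["server"],
                    int(match["ms"]), match["error"])
            )
    return hops


def failing_servers(hops: List[Hop]) -> List[str]:
    """Return the sorted, de-duplicated names of servers that reported an error."""
    return sorted({hop.server for hop in hops if hop.error})


print(failing_servers(parse_trace(SAMPLE_TRACE)))
# ['cbru.br.ns.els-gms.att.net', 'cmtu.mt.ns.els-gms.att.net']
```

In this sample the root and TLD servers answer in well under a second, while both AT&T servers time out, which points to the authoritative DNS service rather than the monitored website.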
How to Effectively Monitor for the next DNS Outage Situation
In the case of the AT&T DNS outage, several key factors help to speed up Time-to-Repair (TTR) or avoid downtime altogether:
- Error detection method: Use a monitoring solution that uses a non-cache method to resolve DNS queries all the way from the root name servers with each monitoring instance. A cache-based service caches DNS results and therefore may not detect a secondary DNS issue at all, or may take days or even weeks to detect it.
- Frequency of monitoring: Use a higher frequency of non-cache monitoring, such as once per minute versus once per hour. The sooner the non-cache monitoring solution detects a failing DNS service and alerts the administrator of an affected website, the sooner a switch can be made to a failover DNS provider.
- Value of Time-to-Live (TTL) setting: The TTL tells resolvers how long to cache a domain's IP address obtained from the primary authoritative name server. The smaller the TTL value set by the DNS administrator, the faster a failover to another DNS provider can take effect. The TTL is typically set to 86,400 seconds (1 day) or more; in disaster recovery planning it can be set as low as 300 seconds (5 minutes). However, the lower the setting, the higher the load on the authoritative name server.
- Diagnostics: Use a monitoring solution that provides diagnostics, such as an automatic traceroute at the time of the detected DNS problem (keep in mind that many basic monitoring services do not provide any diagnostic info).
- Repair: Continue monitoring during the error condition to further pinpoint the issue, and send the monitoring results to your DNS provider. You can also run free manual DNS traceroutes here (select Trace Style “DNS”) to verify the issue as needed.
- Prevent: Keep an eye on “soft error” DNS issues, such as DNS slowdowns and intermittent DNS outages, so you can take action before a “soft error” becomes a “hard error” such as customer-facing downtime.
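The frequency and TTL factors in the list above can be combined into a rough upper bound on how long visitors might be affected. The sketch below uses a deliberately simple additive model that is an assumption of this example, not a formula from any DNS standard: a failure may go unnoticed for up to one monitoring interval, the switch to another provider takes some time, and cached records may persist for up to one TTL afterward. The function name and the `switch_time_s` parameter are hypothetical.

```python
def worst_case_exposure_seconds(check_interval_s: int, ttl_s: int,
                                switch_time_s: int = 0) -> int:
    """Rough upper bound on exposure under a simple additive model (an assumption).

    check_interval_s: time between monitoring instances
    ttl_s:            TTL on the cached DNS records
    switch_time_s:    time needed to decide on and make the provider switch
    """
    return check_interval_s + switch_time_s + ttl_s


# Hourly checks with a 1-day TTL versus 1-minute checks with a 300-second TTL.
slow = worst_case_exposure_seconds(check_interval_s=3600, ttl_s=86400)
fast = worst_case_exposure_seconds(check_interval_s=60, ttl_s=300)

print(slow)  # 90000 seconds, i.e. 25 hours
print(fast)  # 360 seconds, i.e. 6 minutes
```

Under this model the TTL dominates the slow case, which is why a short TTL matters as much as frequent checks in disaster recovery planning.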
Thanks, I’ll take the Caffeinated Double Depth Charge, Non-cached
It's clear, then, that a combination of non-cache monitoring and the other factors above limits downtime exposure from issues like the AT&T Domain Name System (DNS) outage of Aug. 15, 2012. A non-cached method of DNS monitoring is a critical factor in achieving a faster TTR, and even zero downtime. Finally, it is important to remember that TTR determines the loss due to downtime: the longer it takes to detect, diagnose, and repair a DNS problem, the worse the impact of the DNS issue. Conversely, the more a monitoring solution speeds up TTR, the more the loss is reduced or avoided completely. Like a good strong cup of caffeinated coffee, a non-cache method can make the difference between a downtime day and a fast, productive day.
Brad Canham (@BradCanham) is the VP of Sales and Marketing at Dotcom-Monitor. He's passionate about leading an organization that believes deeply in the principle of “constant improvement” and delivers that to its users via web performance monitoring. He blogs at http://www.dotcom-monitor.com/blog/ and organizes the Web Performance-Minneapolis/St. Paul Meetup group. When he's not talking and writing about constantly improving performance, he enjoys spending time with his family, racing in Ironman triathlons and snowshoe races, sipping craft beers, and reading everything, everywhere, all the time.