I haven’t been able to find a lot of hard information on whether using round robin DNS will help make a web site more available in the face of server failures. Suppose I have two web servers in geographically separate data centers for maximum robustness in the face of problems. If one of those servers (or its entire data center) is down, I want users to automatically be directed to the other one.
At a low level, it’s clear that round robin DNS alone wouldn’t help with this. In round robin DNS you advertise multiple IP addresses for a single name. For example, you might say that “mysite.example.com” is at both 192.168.2.3 (the server in one data center) and 192.168.3.2 (the server in the other). But when a program tries to connect to mysite.example.com, it asks the network API for the IP address of that name, gets back just one of those addresses, and connects to it. If that address happens to belong to an unavailable server, the program’s request fails, even though there is a healthy server at one of the other addresses.
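You can see both halves of this at the API level. Here is a small Python sketch (using “localhost” as a stand-in, since a real round robin name would just yield a longer list): the resolver happily returns every address it knows, but a naive client only ever looks at the first one.

```python
import socket

def resolve_all(host, port):
    """Return every IP address the resolver advertises for a name.

    With a round robin DNS entry, this list has one entry per
    advertised address; a naive client just uses the first.
    """
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return [info[4][0] for info in infos]

# "localhost" stands in for a round robin name here; a real entry
# like "mysite.example.com" would yield its full address list.
addresses = resolve_all("localhost", 80)
print(addresses)  # a naive client connects only to addresses[0]
```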
Of course, if you write the client program yourself, you can make it work in this situation. You’d have your program ask the network API for all of the IP addresses associated with a name, and then try them one at a time until one of them accepts a connection. But in the case of a web site, you haven’t written the client program. Microsoft or Mozilla or Google or Apple or Opera or some other group wrote it.
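The smart-client behavior described above is only a few lines. A minimal Python sketch (the function name is my own; error handling is simplified):

```python
import socket

def connect_first_available(host, port, timeout=3.0):
    """Try each address advertised for `host` until one accepts a connection.

    If the first address points at a dead server, fall through to the
    next, rather than giving up the way a naive client would.
    """
    last_error = None
    for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(sockaddr)
            return sock  # first healthy server wins
        except OSError as err:
            last_error = err
            sock.close()
    raise last_error or OSError(f"no addresses found for {host}")
```

Python’s standard library actually does this same loop for you in `socket.create_connection`, which may be why urllib behaves well in the tests later in this post.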
Wouldn’t it be great if those web browsers all worked that way? Wouldn’t it be great if you could find clear indications that they worked that way?
As it happens, it appears that they all do work that way, even though I have found it very hard to get clear confirmation that they are supposed to. I’ve found a few web pages that talk about browsers performing “client retry”, but not any kind of specification or promise. I’ve found many more pages saying to forget about using round robin DNS for this, and to use a load balancer or some other kind of proxy to distribute web requests to available servers. The problem with that is that you now have a new single point of failure (the load balancer) at a single location. It can be made very reliable, but it can still fail and leave your users unable to connect. You can change your DNS entry to point to a new location, but that takes time to propagate (often even longer than your records’ TTL suggests, since some internet service providers cache addresses more aggressively than they should). There are routing protocols that can force traffic for a specific IP address to a different location, but they’re too complicated for me and require low level routing privileges that we can’t expect to have. No, round robin DNS with clients smart enough to try each address if they need to would be a real help here.
Since I couldn’t get clear indications that this would work where I need it to, I set up a simple experiment to see how web browsers respond in this situation. I created web servers in Amazon’s Virginia and California regions, each returning a single web page. The one in Virginia returns a page saying “Eastern Server”, and the one in California returns a page saying “Western Server”. I then set up a round robin DNS entry pointing to those two IP addresses.
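For reference, the round robin entry itself is nothing exotic: it’s just multiple A records for the same name. A zone-file sketch (the name, TTL, and addresses here are illustrative, reusing the example addresses from earlier):

```
; round robin: two A records for the same name
mysite.example.com.  300  IN  A  192.168.2.3   ; server in one data center
mysite.example.com.  300  IN  A  192.168.3.2   ; server in the other
```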
I opened the web page for the round robin name in the Chrome web browser, and got the page saying “Eastern Server”. I then shut down the web server that hosts that page, and refreshed the page. It instantly changed to a page showing “Western Server”. Which is exactly what I want! So I checked other web browsers, and every one I could easily check worked the same way:
- Chrome 11 on Windows 7
- Firefox 4.0 on Windows 7
- Internet Explorer 8 on Windows 7
- Opera 11 on Windows 7
- Safari 5 on Windows 7
- Internet Explorer 7 on Windows XP (after noticeable delay)
- Firefox 4.0 on Windows XP (after noticeable delay)
- Android native browser on Android 2.3.3
- iPhone native browser on iOS 4.3.3
- curl on Windows 7
- curl on Linux
- wget on Linux
- Python with urllib on Windows 7
Wow. Maybe the operating systems were doing this, not the clients? No. wget was talkative, and reported that the connection attempt to one IP address failed and that it retried on another. Chrome’s developer tools Network tab showed the same thing: a request failing, then being repeated successfully against the other address. I was also able to find an HTTP-aware client that did not work this way: Perl with LWP::Simple on Windows 7.
So my conclusion: round robin DNS is not certain to always cause a web browser to fail over successfully when one of the servers is down, but it is very likely to work. If you want reliability over geographically separate server locations it seems like a good way to go. When you discover a server is down you should fix it immediately or update your DNS to no longer point to it, but until that happens, most of your users will continue to be able to connect and use your site via one of the other servers.
[Update] When I got home from the office, I tested a few more web clients:
- Logitech Revue Google TV
- Chromebook CR-48
- Samsung Galaxy Tab 10.1 running Honeycomb
- Nintendo Wii
- Amazon Kindle
It worked in every case but two. Make that every case but one and a half. The Wii browser reported an unavailable page when I refreshed while the previously displayed server was down. A second click on the refresh button did switch it to the live server, which I call at least a partial success. But the Kindle failed completely: turning off the server it had connected to and then refreshing the browser got a message about an unreachable page, no matter how many times I clicked the reload button.
So if you’ve got a mission critical web application that you offer through round robin DNS, be sure to tell your Wii users to hit refresh a second time if there’s a page failure. And warn them to not rely on the Kindle’s web browser (which, to be fair, is still marked as an “Experimental” feature).