Recently, my computer failed, and I had to have it repaired and my data restored. It took several days to identify the specific problem (in this case, a catastrophic hard drive failure) and restore access to my applications and data. In the meantime, my productivity dropped dramatically, and it was hard to work with the alternative tools available to me.
This is often what happens in a disaster recovery (DR) scenario for IT networks. Separate backup tools are configured and set aside for use when the primary systems fail. Unfortunately, these backup tools have limited functionality, are not kept up to date, and are infrequently tested to ensure that they will work in a real emergency. Instead of focusing on a DR plan, it is smarter to design an IT infrastructure that reduces the need to fall back on those limited DR tools and facilities.
Key failure points
Last week, I talked about what it takes to build a resilient network architecture. This week, let’s look at the different possible failure points in the design and how the various technologies can mitigate the problems.
Layer 2 switch – Layer 2 switches provide endpoint and inter-device connectivity. To be fault tolerant, we multi-home them and use the Rapid Spanning Tree Protocol (RSTP) to eliminate loops in the network. In an ideal network, the RSTP domains are small and converge in less than one second. When a switch fails, only the single-homed devices and servers directly attached to that switch lose connectivity.
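To illustrate the core idea (this is not RSTP itself, which also handles bridge priorities, port roles, and sub-second reconvergence), here is a minimal Python sketch that prunes a multi-homed, looped switch topology down to a loop-free forwarding tree. The switch names are hypothetical:

```python
from collections import deque

def spanning_tree(links, root):
    """Prune a redundant (looped) topology down to a loop-free tree,
    the way a spanning tree protocol blocks alternate links.
    links: iterable of (switch_a, switch_b) pairs."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    active, seen, queue = set(), {root}, deque([root])
    while queue:
        node = queue.popleft()
        for peer in sorted(neighbors.get(node, ())):
            if peer not in seen:
                seen.add(peer)
                active.add(tuple(sorted((node, peer))))
                queue.append(peer)
    return active  # forwarding links; everything else stays blocked

# A triangle of switches has a loop; one of the three links gets blocked
tree = spanning_tree([("sw1", "sw2"), ("sw2", "sw3"), ("sw1", "sw3")],
                     root="sw1")
```

In the triangle topology, the sw2–sw3 link ends up blocked, which is exactly how a redundant uplink sits idle until a failure forces the tree to recompute.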
Application server – It would be nice to multi-home the server to multiple switches and rely on a failover mechanism, but it is often easier to build multiple application servers and provide the reliability, scalability, and availability through the server load balancing (SLB) technologies that I talk about so much. The application delivery controllers (ADC) are redundant and can automatically detect a problem with the server and/or application within a few seconds. The ADC uses a variety of health-checks, or probes, to determine the status of the server and the applications it is hosting.
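As a rough sketch of how such probes work, a basic TCP health check might look like the following Python. The backend names are made up, and a real ADC layers HTTP and application-level checks, retry counts, and connection draining on top of this:

```python
import socket

def tcp_probe(host, port, timeout=2.0):
    """Most basic health check: does the server accept a TCP connection?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def select_backends(backends, probe=tcp_probe):
    """Keep only backends that pass the probe; the load balancer sends
    new connections to these and takes the rest out of rotation."""
    return [(host, port) for host, port in backends if probe(host, port)]

# Simulated probe so the example runs offline: only "app1" answers
pool = select_backends([("app1", 8080), ("app2", 8080)],
                       probe=lambda host, port: host == "app1")
```

A failed server simply stops appearing in the pool within one probe interval, which is where the "few seconds" detection time comes from.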
Layer 3 router – Routing protocols have been around almost since the beginning of the Internet. Today, most resilient networks rely on an IGP (Interior Gateway Protocol) like OSPF (Open Shortest Path First) or IS-IS (Intermediate System-to-Intermediate System) to reroute traffic dynamically around a router or link failure. These protocols typically converge in under 30 seconds, depending on the size and complexity of the network.
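The shortest-path-first computation at the heart of OSPF and IS-IS is Dijkstra's algorithm. A minimal Python sketch (hypothetical router names and link costs) shows how a link failure simply yields a new best path on the next computation:

```python
import heapq

def shortest_path(graph, src, dst):
    """Dijkstra's algorithm -- the SPF computation used by link-state
    routing protocols. graph: {node: {neighbor: cost}}."""
    dist, prev = {src: 0}, {}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:                      # rebuild path from prev map
            path = [dst]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return d, path[::-1]
        if d > dist.get(node, float("inf")):
            continue                         # stale heap entry
        for nbr, cost in graph.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr], prev[nbr] = nd, node
                heapq.heappush(heap, (nd, nbr))
    return float("inf"), []

graph = {"r1": {"r2": 1, "r3": 5}, "r2": {"r4": 1},
         "r3": {"r4": 1}, "r4": {}}
cost, path = shortest_path(graph, "r1", "r4")    # best path via r2

# The r1-r2 link fails: remove it and recompute; traffic shifts to r3
del graph["r1"]["r2"]
cost2, path2 = shortest_path(graph, "r1", "r4")
```

In a real router, detecting the failure and flooding the updated link state is what consumes most of the convergence time; the SPF recomputation itself is fast.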
BGP (Border Gateway Protocol) is an EGP (Exterior Gateway Protocol) that connects individual networks, or autonomous systems (AS). The mesh of BGP-connected networks is what we typically consider the Internet. When a failure affects the BGP network, the protocol is designed to reroute traffic within several minutes.
Data centers – If a broader issue causes an entire data center to go offline, such as a power failure, natural disaster, or fiber cut, a redundant data center with the same functionality and services should be able to absorb the increased load. DNS-manipulating technologies like global server load balancing (GSLB) and IP routing functions like IP anycast can help ensure that traffic is steered to the appropriate site. Ideally, these technologies are always working and all data centers are active at the same time. If an entire site goes offline, the network technologies can adjust within seconds to minutes.
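A GSLB decision can be sketched in a few lines of Python: answer the DNS query with the virtual IP of the nearest healthy site, and let an unhealthy site fall out of the answer automatically. The site records and regions below are made up (the VIPs use the reserved documentation address range):

```python
def gslb_answer(sites, client_region):
    """Sketch of a GSLB decision: return the VIP of the closest healthy
    data center; if the preferred site is down, the next healthy site
    absorbs the traffic automatically.
    sites: list of dicts with 'vip', 'region', and 'healthy' keys."""
    healthy = [s for s in sites if s["healthy"]]
    if not healthy:
        return None
    local = [s for s in healthy if s["region"] == client_region]
    return (local or healthy)[0]["vip"]

sites = [
    {"vip": "203.0.113.10", "region": "us-east", "healthy": True},
    {"vip": "203.0.113.20", "region": "us-west", "healthy": True},
]
answer = gslb_answer(sites, "us-west")       # local site answers
sites[1]["healthy"] = False                  # west coast site goes dark
failover = gslb_answer(sites, "us-west")     # traffic steered east
```

In practice, DNS time-to-live values bound how quickly clients see the new answer, which is why site failover lands in the seconds-to-minutes range rather than being instantaneous.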
Data and databases – Databases are hard to make redundant and resilient. They are updated frequently, and keeping multiple copies in sync is a challenge at best. Clustering and data assurance technologies help address the reliability of any single database instance. Ideally, if a portion of a database fails, the replication and synchronization technologies make the failure invisible to the end-user. For user-generated data, cloud technologies have enabled the customer to keep the data separate from the application and client devices. Depending on the data in question, the time to adjust to a failure can range from under a second to hours if the data has to be restored from an archived backup.
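One common pattern behind that invisibility is log-shipping replication: the primary appends every write to an ordered log, and replicas replay it to stay in sync. A toy Python sketch of the idea, not any particular database's mechanism:

```python
class Replica:
    """Stays in sync by replaying the primary's write log."""
    def __init__(self):
        self.data = {}
        self.applied = 0          # how far into the log we have replayed

    def apply(self, log):
        for key, value in log[self.applied:]:
            self.data[key] = value
        self.applied = len(log)

log = []                          # the primary's ordered write log

def write(key, value):
    log.append((key, value))

a, b = Replica(), Replica()
write("user:1", "alice")
write("user:2", "bob")
a.apply(log)
b.apply(log)
# If replica a is lost, b holds an identical copy and can be promoted.
```

The hard part in production is everything this sketch omits: replication lag, conflicting writes, and deciding when a replica is safe to promote.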
End-user devices – Assuming that the data is stored in the cloud or a networked cluster as previously described, then the end-user device is providing access to the application. If the client device (laptop, tablet, smartphone, etc.) fails, then the end-user needs to find a new device that provides access to the application. The time to recover can be from seconds to minutes depending on the end-user’s access to other devices.
These are the core components of a standard IT architecture. As long as these technologies are applied, we can build a flexible model that can withstand the failure of any single component or group of components. It is possible to create a fine-tuned application delivery architecture that can dynamically adjust for almost any failure scenario and be functional anywhere from seconds to minutes. When operational, this capability requires no human intervention since we have designed the intelligence and mitigation functions into the technologies.
Failure is not the only option
Ultimately, it is not failures of network and application availability that cause the most problems. It is the responsiveness and performance of the applications across the network infrastructure. Performance degradation of the application delivery infrastructure is also the hardest problem to identify and solve. Different applications have different performance requirements, even though they all share the same infrastructure.
Next week, we will look into application performance. We will determine the metrics associated with application performance and understand how to monitor them. Based on the metrics, we will adjust the environment to create an optimally performing application delivery network.