Pets vs. Cattle
The “Pets vs. Cattle” analogy was popularized by Randy Bias in an inspiring article about the fundamentals of cloud hosting and how to use it. Bias wanted a simple way to explain how applications should run in the cloud, so that users would not assume that hosting an application on a cloud server automatically gives them full redundancy.
The analogy is simple. Are you operating in the type of environment in which everything goes down if one server crashes (“Pet”)? Or are you in a scenario in which the loss of a server changes nothing, because the “herd” still exists and performs just as before (“Cattle”)?
In the “High Availability” (“HA”) world, multiple approaches exist, and opinions differ on which is best. Competing ideas and methodologies go head to head, and everybody thinks their own approach is the correct one.
Many people are under the impression that an application hosted in the “cloud” is safe from downtime. Unfortunately, that’s far from true. An application and its infrastructure must be designed with HA in mind.
Pets
It’s called “Pets” because when a pet dies, it’s a terrible event, and the pet owner (in this case, the sysadmin) normally goes “all-hands-on-deck” (and probably also experiences the five stages of grief!) to bring the lost server back and make everything function as before.
Here are a few examples of Pet deployment:
1. Single-server hosting a website/application: If the server is lost, everything goes down.
2. A database, storage or web service whose setup limits its scalability or redundancy, for example an Active/Passive or Master/Slave pair that cannot scale out and depends on a failover when the primary is lost, or a service with no redundancy at all.
Cattle
In the “Cattle” analogy, the idea is that if one server (the “cow”) becomes sick, you can “shoot” it, and nothing happens. Your application and infrastructure are designed to scale out, and everything functions normally.
Here are a few examples of “Cattle deployment”:
1. A storage solution such as Ceph: if one node dies, the cluster stays up, and it can scale out to thousands of nodes without a problem.
2. A database solution such as Cassandra.
3. A pool of webservers behind load/traffic balancing.
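To make the third example concrete, here is a minimal sketch of a pool of web servers behind a round-robin balancer. The class name `ServerPool` and the server names are hypothetical and only serve the illustration; a real load balancer would also run health checks and handle in-flight connections.

```python
class ServerPool:
    """A toy pool of interchangeable web servers (the "herd")."""

    def __init__(self, servers):
        self.healthy = list(servers)
        self._next = 0

    def mark_failed(self, server):
        # "Shoot" the sick cow: take it out of rotation, no repair attempt.
        if server in self.healthy:
            self.healthy.remove(server)

    def add(self, server):
        # Scale out by adding another identical server.
        self.healthy.append(server)

    def route(self):
        # Plain round-robin over whatever servers are currently healthy.
        if not self.healthy:
            raise RuntimeError("no healthy servers left")
        server = self.healthy[self._next % len(self.healthy)]
        self._next += 1
        return server


pool = ServerPool(["web-1", "web-2", "web-3"])
pool.mark_failed("web-2")  # one server is lost
served_by = [pool.route() for _ in range(4)]
print(served_by)  # ['web-1', 'web-3', 'web-1', 'web-3']
```

Requests keep being served by the remaining servers, which is the whole point of the Cattle approach: losing one member of the herd does not require an emergency.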
Sometimes, it’s not possible to run applications or an infrastructure in a Cattle setting because of a lack of know-how, a small IT team, or budgetary restrictions.
High-Availability
When designing a High-Availability plan, the key step is to review all possible single points of failure and assess the risk of each. Here are a few examples of single points of failure:
• Hosting on a single server.
• Hosting on multiple servers (cluster), but with only one switch.
• Hosting on multiple servers (cluster), but in the same server room or datacenter location.
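A review like this can be sketched as a simple inventory check. The snippet below is only an illustration: the function name, the layer names and the inventory contents are all hypothetical, and it flags any layer that has fewer than two independent units.

```python
def single_points_of_failure(inventory):
    """Return the layers that have fewer than two independent units."""
    return sorted(
        layer for layer, units in inventory.items() if len(set(units)) < 2
    )


# Hypothetical inventory: a three-node cluster on one switch in one datacenter.
inventory = {
    "server": ["node-a", "node-b", "node-c"],
    "switch": ["sw-1"],
    "datacenter": ["dc-east"],
}

spofs = single_points_of_failure(inventory)
print(spofs)  # ['datacenter', 'switch']
```

The cluster itself passes the check, but the switch and the datacenter are still single points of failure, which matches the second and third examples in the list.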
Each remedy will add to your uptime, but they won’t all have the same impact. For example, the risk of a datacenter becoming unavailable is far lower than the risk of a single server going offline. This means that when building your HA plan (including your Disaster Recovery planning), you need to match each remedy to the level of risk it addresses. Operating in two different locations, for example, increases both the complexity of an infrastructure (data or database synchronization) and its price.
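A rough way to compare remedies is to estimate availability with the standard series/parallel formulas. The sketch below assumes that failures are independent, which is rarely fully true in practice, and the availability figures for a server and a datacenter are made-up numbers used only for illustration.

```python
HOURS_PER_YEAR = 24 * 365


def redundant(availability, copies):
    """Availability of `copies` independent units where one is enough."""
    return 1 - (1 - availability) ** copies


def in_series(*availabilities):
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def downtime_hours(availability):
    """Expected downtime per year for a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


# Hypothetical figures, not measurements.
server, datacenter = 0.99, 0.9995

single = in_series(server, datacenter)                  # one server, one site
clustered = in_series(redundant(server, 2), datacenter)  # two servers, one site
two_sites = redundant(clustered, 2)                      # two such sites

for name, value in [("single", single), ("clustered", clustered), ("two sites", two_sites)]:
    print(f"{name:10s} {value:.6f}  ~{downtime_hours(value):.2f} h/year")
```

With these assumed numbers, adding a second server removes most of the expected downtime, while adding a second location removes a much smaller remainder at a higher cost, which is the trade-off the risk assessment has to weigh.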
Remedying a single point of failure with an external solution also needs careful thought. While it removes headaches from a management standpoint, it can be disastrous when that external solution fails. When an AWS S3 (object storage) location in Virginia went down on February 28, 2017, several websites went down with it for roughly four hours. Many of those websites probably trusted the uptime of the service, which had previously been beyond reproach; they may never have considered S3 a single point of failure. In this case, human error brought down the location. This single point of failure could have been remedied by using another S3 location as a failsafe (although S3 would still be a single point of failure if AWS S3 as a whole had issues), and/or by using another object storage service.
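The failsafe idea can be sketched as a read path that tries several storage backends in order. This is a sketch under stated assumptions, not real S3 client code: `fetch_with_fallback`, `StorageUnavailable` and the two backend functions are hypothetical stand-ins for whatever storage clients you actually use.

```python
class StorageUnavailable(Exception):
    """Raised by a backend that cannot serve requests."""


def fetch_with_fallback(key, backends):
    """Try each (name, fetch) pair in order; return (name, data) on first success."""
    errors = []
    for name, fetch in backends:
        try:
            return name, fetch(key)
        except StorageUnavailable as exc:
            errors.append(f"{name}: {exc}")
    raise StorageUnavailable("all backends failed: " + "; ".join(errors))


# Hypothetical backends: the primary location is down, the secondary works.
def primary_region(key):
    raise StorageUnavailable("region outage")


def secondary_region(key):
    return f"contents of {key}".encode()


source, data = fetch_with_fallback(
    "logo.png",
    [("primary", primary_region), ("secondary", secondary_region)],
)
print(source, data)  # secondary b'contents of logo.png'
```

Note that this only helps if the data is already replicated to the second location or service, so the synchronization cost mentioned earlier still applies.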
A company that has an in-house IT management team has multiple ways to approach this issue. For example, you can:
• Spread/cluster your infrastructure on public cloud servers.
• Be in a single tenant environment, with either dedicated servers or a private cloud.
• Have your own private cloud (to maximize your resource usage without noisy neighbor issues).
Mitigating Single Points of Failure
When running in Cattle mode isn’t an option, there are still solutions that can help reduce downtime. In a VMware virtual environment, for example, there are two types of HA options you can add to a Pet virtual machine:
1. VMware High Availability: When a VMware host becomes unavailable, VMware restarts the affected VM on another host.
2. VMware Fault Tolerance: While the High Availability option just described can give an acceptable recovery time, it is slower than Fault Tolerance, in which a secondary VM is created on another host and runs in tandem with the primary VM. If the primary VM becomes unavailable, the secondary VM takes over with minimal interruption of service (normally, the switch can occur in milliseconds).
Processes similar to VMware High Availability exist on other virtualization stacks, such as oVirt, Proxmox, Hyper-V and others.
Popular cloud platform OpenStack currently doesn’t have a native High-Availability (failover) feature in the main release when used with a hypervisor such as Xen or KVM. Blueprints for a future implementation are available, and an alternative project adds this feature, but it does not come officially from the OpenStack community. This means that in most cases, a user deploying on an OpenStack private cloud should operate in Cattle mode for production or mission-critical applications.
Please note that those systems only act when a node is detected as failed. If the VM itself still appears to be running but has run out of memory, for example, it won’t fail over, and your workload will be unavailable.
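The gap between the two kinds of failure can be shown with a small model. Everything here is hypothetical (the `VmState` fields and both check functions); it only illustrates why a check at the host level does not see an application that has stopped responding.

```python
from dataclasses import dataclass


@dataclass
class VmState:
    host_heartbeat: bool  # does the hypervisor host still respond?
    app_responding: bool  # does the application inside the VM answer requests?


def host_level_failover_needed(vm):
    # What the HA features described above react to: loss of the host.
    return not vm.host_heartbeat


def app_level_alert_needed(vm):
    # What an additional application-level check would catch.
    return vm.host_heartbeat and not vm.app_responding


# The host is fine, but the VM ran out of memory and the app is stuck.
out_of_memory_vm = VmState(host_heartbeat=True, app_responding=False)

print(host_level_failover_needed(out_of_memory_vm))  # False
print(app_level_alert_needed(out_of_memory_vm))      # True
```

In other words, host-level HA is best paired with monitoring of the application itself.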
Conclusion
We’ve only scratched the surface of High Availability in the cloud (or in general) in this article. We will be releasing additional articles that explain the general concepts, as well as articles on specific tools or platforms. Please let us know if something in the article isn’t clear or if you want more information.