Server downtime should, in theory, be a thing of the past. Cloud-based server clusters and the falling cost of equipment mean that if a server fails, an alternative system should always be ready to pick up the slack. Unfortunately, in practice this is often not the case, as many IT managers will know.
There are many possible reasons for this: your company’s IT budget may have been cut, or the cloud services provider you use may suffer an outage or data loss. Whatever the cause, server downtime remains an issue for modern businesses, one with high costs and high stakes.
Preventing server downtime issues requires a few different approaches, but a good start is to make sure key stakeholders in the business understand the common causes of downtime, and exactly how taxing downtime can be on both a business’ financial and human resources.
Common causes
In my experience working with companies of all sizes, from tiny startups to large enterprises, some of the most common causes of server downtime are obvious, easily avoidable issues.
One problem that popped up time and again was lack of disk space on servers. For such a simple, easily resolved problem, the potential dangers of running out of disk space are serious: applications running low on disk space can behave unpredictably, causing freezes, crashes or data corruption.
Address the underlying cause of storage space issues by making sure your applications are designed efficiently. The most common culprit is log files, which can suddenly grow very quickly or are not rotated often enough. In purely financial terms, using pay-as-you-go cloud tools to run your applications can run up unexpected costs when things go wrong, so it’s crucial to catch these issues before they spiral out of control. Effective monitoring of resources, with the right alerts configured before things get too bad, can help you avoid unnecessarily large bills. To use one high-profile example, Snap’s cloud bill this year will be higher than its total revenue figure for 2016.
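As a minimal sketch of what that kind of monitoring might look like, the Python snippet below checks how full a set of volumes is and raises an alert before they fill up completely. The paths and threshold here are illustrative assumptions, and in a real deployment the alert would go through your monitoring tool rather than standard output.

```python
# Minimal disk-space check: warn before a volume fills up and
# applications start failing. Paths and threshold are illustrative.
import shutil

ALERT_THRESHOLD = 0.90        # alert at 90% full (example value)
VOLUMES = ["/", "/var/log"]   # /var/log is where runaway logs usually live

def used_fraction(path):
    """Return how full the volume containing `path` is, as a fraction."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

for vol in VOLUMES:
    used = used_fraction(vol)
    if used >= ALERT_THRESHOLD:
        # In practice this would page someone via your monitoring tool;
        # printing stands in for an alert here.
        print(f"ALERT: {vol} is {used:.0%} full")
```

The point is less the specific script than the principle: the check runs continuously and fires well before the disk is actually full, while there is still time to rotate logs or add capacity.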
Putting a price tag on downtime
It can be tricky to work out exactly how much server downtime costs you, so it’s best to start with an estimate. If your website has an uptime rate of 99% over the course of a year, your website or online service has been inaccessible for 3.65 days. For many businesses, 3-4 days of lost revenue can be a critical issue, especially in smaller companies. At Server Density we’ve created a cost of downtime calculator to help businesses understand more clearly exactly how much of their bottom line is put at risk by unstable server infrastructure.
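To make the arithmetic concrete, here is a back-of-the-envelope version of that estimate in Python. The revenue figure is a hypothetical placeholder, not data from any particular business.

```python
# Rough downtime cost estimate from an uptime percentage.
HOURS_PER_YEAR = 365 * 24          # 8,760 hours
uptime = 0.99                      # 99% uptime, as in the example above

downtime_hours = HOURS_PER_YEAR * (1 - uptime)   # 87.6 hours
downtime_days = downtime_hours / 24              # 3.65 days

annual_online_revenue = 2_000_000  # hypothetical placeholder figure
revenue_at_risk = annual_online_revenue * (1 - uptime)

print(f"{downtime_days:.2f} days offline, ~£{revenue_at_risk:,.0f} at risk")
```

Even this crude model assumes downtime is spread evenly across the year; a single outage during a peak trading period can cost far more than its share of hours suggests.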
Startups and small businesses often won’t have the money for built-in redundancies. While this might make financial sense in the short term, it can also mean a protracted wait to resume services after an outage while the IT team works out where the problem stems from. The impact of downtime on a startup’s revenue can therefore be huge, making it vital that business leaders understand the risk involved.
Larger businesses are generally able to mitigate that risk, as they have the financial flexibility to purchase ‘insurance’ measures like large-scale redundancy. The cost of downtime for these businesses is orders of magnitude greater than for a small startup, and so are the costs of mitigating against it.
The human toll
Working on-call in IT brings with it personal costs that can affect wider business goals. A 2014 study by Tel Aviv University suggests that having your sleep interrupted can be worse than getting no sleep at all. When workers don’t know when they will need to respond to an emergency, that uncertainty creates stress, and the stress can affect the quality of their work. This is a shame, as there are easy ways to minimise stress for on-call workers.
Many existing system monitoring products generate large quantities of alerts: ‘noise’ that may not be directly relevant to the task at hand or require any action. Nevertheless, these alerts drain our concentration and eat away at our time. Our own research suggests it takes an average of 23 minutes to regain intense focus after being interrupted, and our data from December 2016 shows that 1.5 million individual alerts were triggered across all our customers’ servers. These unfiltered alerts created a total of 165 years’ worth of interruptions for our customers.
Excessive task switching among employees, combined with the mental and physical stresses that on-call work creates, is often overlooked as a cost affecting business performance. These factors need to be managed and monitored more rigorously to ensure customers receive a high-quality service.
Create a plan
Not all server downtime is avoidable. Sudden problems such as power cuts, fire or flooding are difficult to anticipate and often result in catastrophic, and expensive, downtime for businesses. No amount of server monitoring will help you in this situation, but through careful planning the damage can be mitigated. GitLab’s recent challenges provide a great lesson in what not to do: the company lost 300GB of production data due to a mistyped command, and all five of its backup techniques failed.
Identify your different backup methodologies in your plan, be clear about the order of priority in which they will be used, and regularly check that these backups are actually working, as the sketch below illustrates. Make it clear who should be contacted in an emergency, and develop a simple checklist of issues to tackle. These steps might sound obvious, but in a crisis people can behave unpredictably, so it’s important to have as many processes as possible codified and tested before the event.
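As an illustration of what “regularly check that these backups are actually working” might mean in practice, here is a hedged Python sketch of a scheduled verification job: it treats a backup as valid only if a test restore succeeds, which is exactly the check that would have caught GitLab’s failed backups. The backup directory, file pattern and `pg_restore` command are assumptions for the example, not a prescription; adapt them to whatever backup tooling you actually use.

```python
# Sketch of a scheduled backup verification: a backup only counts if a
# test restore succeeds. Paths and the restore command are illustrative
# assumptions (a PostgreSQL dump restored into a throwaway database).
import subprocess
import sys
from datetime import datetime, timedelta
from pathlib import Path

BACKUP_DIR = Path("/backups/db")   # hypothetical backup location
MAX_AGE = timedelta(hours=25)      # expect at least one backup per day

def latest_backup():
    backups = sorted(BACKUP_DIR.glob("*.dump"),
                     key=lambda p: p.stat().st_mtime)
    return backups[-1] if backups else None

def restore_succeeds(backup):
    # Restore into a disposable database; failure here means the backup
    # file is unusable even though it exists on disk.
    result = subprocess.run(
        ["pg_restore", "--dbname=restore_test", str(backup)],
        capture_output=True,
    )
    return result.returncode == 0

backup = latest_backup()
if backup is None:
    sys.exit("ALERT: no backups found")
age = datetime.now() - datetime.fromtimestamp(backup.stat().st_mtime)
if age > MAX_AGE:
    sys.exit(f"ALERT: newest backup is {age} old")
if not restore_succeeds(backup):
    sys.exit(f"ALERT: test restore of {backup.name} failed")
print(f"OK: {backup.name} restored successfully")
```

Run on a schedule, a check like this turns “we have backups” from an assumption into something verified every day, and its alerts slot into the same escalation process as the rest of your monitoring.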
IT leaders often face a communication challenge: backups and redundancies are sorely needed, but it can be difficult to justify the additional budget to put them in place. An understanding of the true cost of server downtime, to both a company’s finances and its employees, can help pave the way to a properly resourced plan for dealing with what remains a very real risk.
David Mytton, Server Density
David Mytton is founder and CEO of Server Density, a scalable infrastructure monitoring software company. Server Density offers a SaaS product featuring the graphing, dashboards and low management overhead that modern businesses need. Server Density has more than 700 customers, including the NHS, Drupal, Firebox and Greenpeace, and has offices in London and New York.