ITSM | 9 MIN READ

How to Prevent Catastrophic IT Service Downtime

Streamlining your ITSM processes isn’t just about speed and efficiency. When it comes to delivering amazing customer service, nothing stops those efforts in their tracks like downtime.


 

It’s easy to see why too. Prolonged outages, like Amazon’s recent Prime Day debacle, can cost the world’s tech giants millions of dollars every minute. Even short periods of downtime, like Google’s 5-minute-long outage in 2013, can snowball and result in global web traffic plummeting.

Then there are the ancillary costs associated with downtime. 69% of consumers will stop using an app or online service if it takes more than 15 minutes to rectify outages. On top of that, the average company loses 545 hours of staff productivity every year due to downtime.

The bottom line is simple. If you’re not taking steps to avoid downtime before it happens or quickly resolve issues when they do occur, your user base and revenue streams will crater.

Good news? I’m here to help. In this blog post, I’m going to walk you through the true costs of downtime, as well as the main contributors to issues and what you can do to prevent them from happening.

Ready, set, let’s go!

 

The True Cost of Downtime (Hint: It’s a Lot)

No matter how sophisticated your ITSM tools or IT infrastructure are, you’ll still be on the receiving end of an unplanned data outage–probably sooner than you think. In fact, a U.S.-based study revealed that 91% of data center workers had dealt with a sudden period of downtime in the previous 24 months.

The duration of those periods continues to grow as well. In a 2017 poll of 400 IT professionals, nearly 50% of them said they’d experienced more than four hours of downtime in a 12-month period. As of 2019, the average per-hour cost of enterprise server downtime worldwide has ballooned to over $300,000.

That’s just the tip of the financial iceberg. Here are some additional highlights:

  • When Delta Air Lines had to cancel 280 flights due to a 2016 outage, related losses totaled over $150 million.
  • IDC’s survey of the Fortune 1000 puts the “average total cost of unplanned application downtime per year” anywhere between $1.25 billion and $2.5 billion.
  • Assuming its entire IT team is involved in resolving downtime issues, labor costs for the average Fortune 500 company would total $896,000 per week, translating to more than $46 million per year.
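The yearly labor figure follows from simple multiplication, and the same arithmetic works for sizing up a single outage. Here is a minimal Python sketch of both calculations. The function names and the four-hour outage are illustrative assumptions, not figures taken from the cited studies.

```python
WEEKS_PER_YEAR = 52


def annualize_weekly_cost(weekly_cost: float) -> float:
    """Scale a weekly cost to a yearly one, assuming 52 weeks per year."""
    return weekly_cost * WEEKS_PER_YEAR


def outage_cost(hours_down: float, cost_per_hour: float) -> float:
    """Estimate the direct cost of an outage as duration times hourly cost."""
    return hours_down * cost_per_hour


if __name__ == "__main__":
    # The $896,000-per-week labor figure quoted above
    print(f"${annualize_weekly_cost(896_000):,.0f} per year")
    # A hypothetical four-hour outage at $300,000 per hour
    print(f"${outage_cost(4, 300_000):,.0f} for a four-hour outage")
```

Plugging in your own hourly cost and typical outage length gives a rough, back-of-the-envelope number to bring to budget conversations.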

Beyond the purely fiscal, there are less tangible costs associated with downtime, ones that, if I can use a sports analogy, don’t typically show up on the stat sheet. One of the worst indirect costs is the impact that interrupting an organization’s daily operations can have on productivity and innovation.

A study by UC Irvine says that it takes around 23 minutes to get refocused on a task following an interruption (this includes someone simply knocking on your door and telling you about an IT service outage). And, according to the Washington Post, interruptions suck up 238 minutes per day. Add on related stress and you’re looking at a whopping 31 hours every week.

Time spent firefighting eats away at hours that could be spent on improving the customer experience or deploying innovative new features. As a result, your organization risks falling behind the development curve in your industry.

 

Another big contributing factor to innovative tech companies is a productive (and happy) IT service desk team. Find out how to keep yours happy on our blog!

 

The Main Causes of Downtime (And How to Prevent Them)

So, downtime is devastatingly expensive. That much is clear. The next logical question is: What causes those dreaded IT service outages?

The answer will vary depending on the size, scope, components, and overall intricacy of your IT system or network. Below is a collection of the most common issues that can lead to aggravating bouts of downtime:

Hardware, software, or other component failures

This category encompasses basically any technical aspect of your IT services that may have a hand in creating an outage. This includes server, storage, and application errors, as well as any computing hardware or software used to run the services. Failures related to any of those components can also open your organization up to rampant IT security issues.

Human error

If it’s not a technical issue, then oftentimes the error at the root of an outage is a human one. And, while to err is certainly human, it’s a more common occurrence than most organizations realize. Nearly half of IT network professionals claim that human mistakes cause outages at least some of the time, while only 3% say errors are caught before deployment.

Power outages/other weather events

While this is far less likely than either human or technical errors, freak accidents, usually to do with nasty weather conditions, still happen. From power failures to water leaks damaging hardware, there are all sorts of scenarios that fall under this category.

Planned downtime for upgrades

Sometimes, IT service or network downtime is planned by developers. This usually occurs when a big update, one that will take 12 or more hours to implement, is needed. End users will almost always get an email or in-app notification that maintenance is being performed at a specific date and time, and to adjust their usage accordingly.

 

One of the best ways to discern whether there are any problem areas within your IT service system, or whether recurring instances of downtime can be pinned on one source, is to run stress tests. These can help your development team, engineers, and other IT professionals get to the heart of the matter as quickly as possible.

Here’s more from Forbes’ Kolton Andrus:

“The truth is that no one has all of the answers, and we can learn a lot from each other. As a place to start, I recommend getting in the habit of dedicating at least an hour a week to running chaos engineering experiments. Similar to a fire drill, it will give your engineering teams time to better understand their systems and practice responding to issues in a controlled setting, which will only make things better when real problems inevitably occur.”
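To make the idea of a stress test a bit more concrete, here is a minimal Python sketch. It fires a batch of concurrent calls at a target and reports how many failed and how slow the slowest one was. Everything in it is an assumption for illustration: `stress_test` and `FlakyService` are hypothetical names, and the "service" is a stand-in that fails on purpose rather than a real endpoint. It is a simple load test, not a full chaos engineering tool.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def stress_test(call: Callable[[], None], requests: int, workers: int) -> Dict[str, float]:
    """Invoke `call` `requests` times across `workers` threads and summarize the results."""
    if requests <= 0:
        raise ValueError("requests must be positive")

    def timed_call(_: int):
        start = time.perf_counter()
        try:
            call()
            ok = True
        except Exception:
            ok = False
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(timed_call, range(requests)))

    failures = sum(1 for ok, _ in results if not ok)
    return {
        "requests": requests,
        "failures": failures,
        "error_rate": failures / requests,
        "max_latency_s": max(elapsed for _, elapsed in results),
    }


class FlakyService:
    """Hypothetical stand-in for a real endpoint: every `fail_every`-th call raises."""

    def __init__(self, fail_every: int):
        self._fail_every = fail_every
        self._calls = 0
        self._lock = threading.Lock()

    def __call__(self) -> None:
        with self._lock:  # keep the call counter consistent across threads
            self._calls += 1
            n = self._calls
        if n % self._fail_every == 0:
            raise RuntimeError("simulated failure")


if __name__ == "__main__":
    print(stress_test(FlakyService(fail_every=10), requests=100, workers=8))
```

In practice you would swap `FlakyService` for a function that calls a staging copy of your own service, then watch how the error rate and latency change as you raise `requests` and `workers`.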

Not doing one's due diligence is just one of the many preventable mistakes that many agile project management teams make. To discover which other ones to avoid, read the full blog post!

 

How to Stop Downtime From Piling Up

Once downtime issues are identified, the next step is extinguishing them through either in-the-moment problem solving or preventative measures. While this won’t fully eradicate your organization’s chances of suffering an outage, it will minimize the number of occurrences and, ultimately, ensure that your customer experience doesn’t dip.

Since the in-the-moment aspect of this section will vary based on the precise issue at hand as well as the affected component, let’s focus on precautionary steps you can take to sidestep a significant amount of downtime.

Make Sure Your Basics Are Covered

It may sound like an overly obvious tip, but too many businesses suffer outages because they forget to do routine service or resource checks. These are easy things, like renewing domain names and hosting plans or regularly scanning hardware and software for issues. If you’re going to avoid suffering from downtime, your basics must be covered at all times.
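One low-effort way to keep renewals from slipping is to track expiry dates in one place and flag anything coming due. The Python sketch below shows the idea. The `expiring_soon` helper, the 30-day warning window, and the sample inventory are all assumptions made up for illustration.

```python
from datetime import date, timedelta
from typing import Dict, List


def expiring_soon(renewals: Dict[str, date], today: date, warn_days: int = 30) -> List[str]:
    """Return the items whose renewal date is within `warn_days` of `today` or already past."""
    cutoff = today + timedelta(days=warn_days)
    return sorted(name for name, expires in renewals.items() if expires <= cutoff)


if __name__ == "__main__":
    # Hypothetical inventory of things that need renewing
    inventory = {
        "example.com domain": date(2020, 1, 10),
        "hosting plan": date(2020, 6, 1),
        "TLS certificate": date(2019, 12, 1),
    }
    for item in expiring_soon(inventory, today=date(2019, 11, 19)):
        print(f"Renew soon: {item}")
```

Run on a schedule, a check like this turns a forgotten renewal into a reminder instead of an outage.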

Ensure That Your IT Security Meets Current Standards

I linked to our post about IT security fails earlier, but it bears mentioning again here. Making sure that your security protocols are a) up to date, b) practiced by the entire organization, and c) performing at a high standard is an essential part of a safe, solid user experience.

Get a DNS Backup Service

Lots of outages can be traced back to DNS-related problems. A DNS backup service continually copies your DNS records and can act as a carbon copy of your existing DNS, should it fail.

Invest in a Monitoring Service

Similarly, investing in a monitoring tool can reduce the possibility of oversight headaches. Various services will notify you via email, text, and more if your site or IT services happen to bite the dust. Google Search Console (formerly Google Webmaster Tools) is also an asset in this area, keeping you in the know about any search engine crawling issues.
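Under the hood, the simplest uptime monitors just request a URL on a schedule and raise the alarm when it stops answering. Here is a rough Python sketch of that core check using only the standard library. The `is_up` and `find_outages` names and the example URL are assumptions for illustration, and a commercial monitoring service will do far more (retries, multiple regions, alert routing).

```python
import http.client
import urllib.request
from typing import Callable, Iterable, List


def is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if `url` answers with a non-error HTTP status within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError, http.client.HTTPException):
        # URLError, HTTPError, and socket timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return False


def find_outages(urls: Iterable[str], probe: Callable[[str], bool] = is_up) -> List[str]:
    """Probe each URL once and return the ones that appear to be down."""
    return [url for url in urls if not probe(url)]


if __name__ == "__main__":
    # Hypothetical list of endpoints to watch
    for url in find_outages(["https://example.com/"]):
        print(f"ALERT: {url} looks down")
```

The `probe` parameter is there so the check can be swapped out, for example with a fake in tests or with a deeper health check that inspects the response body.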

Have Regular Check-Ins With Your IT Team

Not sure if your IT services can handle a surge in traffic? Can’t tell if your network, hardware, or other components are in need of an upgrade? Are there any bugs that threaten your organization’s uptime? The answers to those questions and more can be had by simply checking in with your IT team. This way, you empower them to plan for crises instead of constantly fighting them.

Back Up Your Database(s) Regularly

Let’s say a technical glitch does occur at the worst possible time. Is your data backed up and ready to be reuploaded or refreshed at a moment’s notice? If not, it’s a sign that your database(s) and other information hubs need to be backed up, ideally to a separate cloud instance or server. You can automate this process via workflows or software.
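As a small illustration of what an automated backup can look like, the Python sketch below copies a SQLite database to a timestamped file using the standard library's `sqlite3` backup support. SQLite is an assumption chosen to keep the example self-contained, and `backup_sqlite` and the file paths are hypothetical. Other databases ship their own tools for this job, such as `pg_dump` for PostgreSQL or `mysqldump` for MySQL.

```python
import sqlite3
from datetime import datetime, timezone
from pathlib import Path


def backup_sqlite(source_path: str, backup_dir: str) -> Path:
    """Copy a SQLite database into `backup_dir` under a timestamped file name."""
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = target_dir / f"{Path(source_path).stem}-{stamp}.db"

    source = sqlite3.connect(source_path)
    dest = sqlite3.connect(str(target))
    try:
        # Connection.backup copies the whole database into the destination connection
        source.backup(dest)
    finally:
        dest.close()
        source.close()
    return target


if __name__ == "__main__":
    # Hypothetical paths; point these at your own database and backup location
    print(f"Backup written to {backup_sqlite('app.db', 'backups')}")
```

Scheduling a script like this and shipping the resulting file to a separate server or cloud instance covers the "ready at a moment’s notice" part. It is also worth restoring from a backup now and then to confirm it actually works.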

If You Spot a Potential Issue, Be Proactive

Above all else, if you see a potential issue on the horizon, be proactive now instead of reactive later. The former gives you the keys to your IT services and allows you to steer them in the right direction, while the latter puts you in the passenger seat with no way of grabbing the steering wheel before the vehicle crashes.

 

Avoiding downtime is a crucial part of building customer-centric IT services that consumers love. To read more on this topic, click here!

 


Conclusion

As you can see, preventing catastrophic IT service downtime doesn’t have to be a struggle. From ensuring that your most obvious bases are covered to performing regular scans and backups, strong precautionary measures are definitely within your reach. All you need to do is take them seriously.

We all know that downtime can negatively affect a consumer’s perception of and trust in your brand, which can, in turn, put a squeeze on your bottom line. By knowing the kinds of issues that can lead to outages, you’ll be better equipped to solve those problems when they do arise or, in the best-case scenario, avoid them before they even happen.

For Jira admins and end users, a trusted app in the quest to minimize downtime is Insight. Because it gives you a complete picture of your most important assets and IT infrastructure, you can easily pinpoint IT service issues and resolve them quickly and effectively, much to the delight of your customers.

For more information on how Insight can take your organization to the next level when it comes to productivity and efficiency, or to see the product in action, hit the link below!

Read More About ITSM

Originally published Nov 19, 2019 6:00:00 AM

Topics: ITSM