DevOps has become a movement that is driving organizations at full speed. Collaboration and communication are now much more than goals; they are a reality that improves current technologies and services. However, businesses still need to monitor and assess their systems efficiently. How? MTTR is the answer!

Around the world, organizations across many industries rely on complex services, work methods, and systems to keep track of critical aspects of their businesses. Monitoring these systems is therefore essential, and one helpful measurement is Mean Time to Recovery (MTTR).

What’s Mean Time to Recovery?

Mean Time to Recovery, or MTTR, is a software term for the time from when a service is detected as “down” until it is “available” once again, from a user’s standpoint. “Mean Time To” refers to a standard measurement of the average duration between two events. This measurement can later be used to determine the financial effect on the business. Although the “R” in MTTR refers to Recovery, there are 5 Rs to consider:

  • Review: How can we stop this from happening again?
  • Return: From a customer’s perspective, is everything working again, and how satisfied are they with our performance and, more importantly, with how the incident was handled?
  • Repair: The amount of effort, time and overall investment spent resolving the problem.
  • Respond: Reacting to the problem, specifically, the time it takes to notice the problem, determine whether it can be fixed, and identify what will be needed to solve it.
  • Recovery: The task of bringing the service back.

Why is MTTR useful?

Monitoring is key to assessing a system, its performance, and its potential improvements, as well as its flaws. MTTR also allows us to monitor and measure important metrics such as response time and errors, amongst others. Measuring these metrics provides data on speed, reliability, system weaknesses, and production, giving teams valuable information. Knowing what questions to ask will lead to the right solutions, reducing the overall investment in time, team effort, and cost.

DevOps also has plenty to gain from MTTR, and it all starts with collaboration. DevOps focuses on cooperation: people talking to each other, understanding what’s needed, and solving problems, which leads to more efficient testing and monitoring procedures. In addition, MTTR helps drive the move to virtual or cloud environments.

Measuring MTTR

Many organizations use IT Service Management tools to create tickets that report failures. Once the ticket is created, the clock starts, and it stops only when the problem is solved and the ticket is closed. Why is this important? The ticket’s creation time serves as the start of the problem and its resolution time as the end. It’s crucial that teams stop the clock as soon as the service is back on in order to report the MTTR correctly.
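To make the ticket-based clock concrete, here is a minimal Python sketch. The `Ticket` class and its `opened_at` and `resolved_at` fields are hypothetical names chosen for illustration; real IT Service Management tools expose their own fields and export formats.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Ticket:
    """Hypothetical incident ticket; field names are illustrative only."""
    opened_at: datetime    # ticket created: the clock starts
    resolved_at: datetime  # service back on: the clock stops


def downtime(ticket: Ticket) -> timedelta:
    """Return how long the clock ran for a single ticket."""
    if ticket.resolved_at < ticket.opened_at:
        raise ValueError("ticket resolved before it was opened")
    return ticket.resolved_at - ticket.opened_at


def mean_time_to_recovery(tickets: list[Ticket]) -> timedelta:
    """Average the downtime over all tickets in the reporting period."""
    if not tickets:
        raise ValueError("no tickets in the reporting period")
    total = sum((downtime(t) for t in tickets), timedelta())
    return total / len(tickets)
```

With this sketch, a ticket that is closed late inflates the result, which is why stopping the clock promptly matters.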

MTTR is calculated by dividing the total unplanned maintenance time spent on an asset by the total number of failures that asset experienced over a specific period of time.

For example, let’s say that one of your assets broke down seven times last year. The total time it took to repair the asset across all seven failures was 33 hours. The MTTR calculation would be as follows:

MTTR = 33 hours ÷ 7 failures/breakdowns
MTTR = 33 ÷ 7
MTTR = 4.71 hours
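The same arithmetic can be sketched in a few lines of Python. The function name `mttr_hours` and its parameters are assumptions made for this example, not part of any particular tool.

```python
def mttr_hours(total_unplanned_hours: float, failure_count: int) -> float:
    """Divide total unplanned maintenance time by the number of failures."""
    if failure_count <= 0:
        raise ValueError("failure_count must be a positive integer")
    if total_unplanned_hours < 0:
        raise ValueError("total_unplanned_hours cannot be negative")
    return total_unplanned_hours / failure_count


# The example above: 33 hours of repair across 7 failures.
print(f"MTTR = {mttr_hours(33, 7):.2f} hours")  # prints "MTTR = 4.71 hours"
```

The guard on `failure_count` matters in practice: an asset with no failures in the period has no MTTR to report, rather than an MTTR of zero.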

Creating an incident-management plan

When a problem arises, knowing what to do, whom to call, and which procedures to follow is critical. Teams need to know how to proceed, and that can only happen when an incident-management plan already exists. There are three common approaches:

  1. Ad hoc – When a problem arises, the team figures out who knows that specific system or technology and assigns someone to solve it. This is typical of newer or smaller companies where there isn’t a lot of structure in place.
  2. Rigid – A more traditional approach used by larger, more structured companies, focusing on IT service management (ITSM). With this plan, the IT department takes charge of incident-management procedures, following detailed and strict protocols within a structure organized to tackle new and potential problems.
  3. Fluid – A response used by modern companies. This approach tailors actions to specific incidents, incorporating cross-functional collaboration in order to solve problems in the most efficient manner. This response is founded on Lean principles, a methodology of regular experimentation and learning.

Ultimately, MTTR is an important metric for measuring incident response. Equally important is knowing that improvement cannot exist without measuring and testing. This understanding is key to better performance and allows organizations to adopt progressive approaches built around reliable systems, procedures, and best practices.