13th Annual International Conference on Industrial Engineering and Operations Management

Lowering Mean Time to Recovery (MTTR) in Responding to System Downtime or Outages: An Application of Lean Six Sigma Methodology

Arnold Aguilar
Publisher: IEOM Society International
Track: Six Sigma

The longer a software or computer system service is down, the more it costs the IT organization and the more frustrated its users become. The purpose of this study is to identify the factors responsible for a high mean time to recovery (MTTR) when responding to system downtime or outages. Large software companies often use the Lean Six Sigma DMAIC framework to improve the speed at which incidents are resolved. DMAIC stands for Define, Measure, Analyze, Improve, and Control. The study discusses how the Six Sigma methodology could improve the efficiency of an IT incident management process. Qualitative and quantitative research methods were used to analyze the results of the company's case study, and the analysis showed that using both methods was beneficial. One of the main reasons for high MTTR when an incident is functionally escalated to engineers or specialists is the availability and skill set of these individuals. This constraint is a major cause of delay in acknowledging paging alerts. It leads to repeated reassignment of the critical ticket until someone is available to go online and respond, and it contributes heavily to resolution time. To reduce MTTR and identify the influencing factors, a cause-and-effect diagram and a data analysis were prepared from recorded incident tickets flagged as critical or system-wide issues. The statistical analysis showed which factors influence the effectiveness of the process. Within the DMAIC methodology, a predictive model was developed to estimate the resolution time of incident tickets at the company. This model helps quantify how other factors (such as the severity of the incident, the number of involved parties, and the amount of information available) affect the time it takes to resolve an incident.
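MTTR is conventionally computed as the total recovery time across incidents divided by the number of incidents. The sketch below illustrates that calculation and a simple grouped comparison of tickets that were reassigned against tickets that were not, which mirrors the abstract's point that reassignments lengthen resolution time. It is a minimal illustration, not the paper's predictive model: the `Ticket` record, its field names (`opened`, `resolved`, `reassignments`), the helper functions, and the sample timestamps are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Dict, Hashable, Iterable, List


@dataclass
class Ticket:
    """Hypothetical critical-incident record; field names are illustrative."""
    ticket_id: str
    opened: datetime      # when the outage ticket was raised
    resolved: datetime    # when service was restored
    reassignments: int    # times the page was handed to another engineer


def mttr(tickets: Iterable[Ticket]) -> timedelta:
    """Mean time to recovery: total recovery time divided by incident count."""
    durations = [t.resolved - t.opened for t in tickets]
    if not durations:
        raise ValueError("MTTR is undefined for zero incidents")
    return sum(durations, timedelta()) / len(durations)


def mttr_by(tickets: Iterable[Ticket],
            key: Callable[[Ticket], Hashable]) -> Dict[Hashable, timedelta]:
    """Group tickets by `key` and compute the MTTR of each group."""
    groups: Dict[Hashable, List[Ticket]] = defaultdict(list)
    for t in tickets:
        groups[key(t)].append(t)
    return {k: mttr(v) for k, v in groups.items()}


# Hypothetical sample tickets (made-up values, not data from the study).
t0 = datetime(2023, 1, 1, 0, 0)
tickets = [
    Ticket("INC-1", t0, t0 + timedelta(minutes=30), 0),
    Ticket("INC-2", t0, t0 + timedelta(minutes=90), 2),
    Ticket("INC-3", t0, t0 + timedelta(minutes=60), 1),
]

overall = mttr(tickets)
by_reassignment = mttr_by(tickets, lambda t: t.reassignments > 0)
print(overall)  # 1:00:00
```

Grouping by a key function keeps the same helper usable for other factors named in the abstract, such as incident severity or the number of involved parties, provided those fields are recorded on each ticket.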

Published in: 13th Annual International Conference on Industrial Engineering and Operations Management, Manila, Philippines

Date of Conference: March 7-9, 2023

ISBN: 979-8-3507-0543-0
ISSN/E-ISSN: 2169-8767