Production Troubleshooting Methodology

Software issues in Production can be among the most demanding challenges a team faces, especially when they occur on a system outside of your control.

Adapted from the OODA loop — a military strategy used to describe the decision cycle of observe, orient, decide and act — this process formalizes an approach to take control, establish facts, and progress towards a resolution for even the most difficult problems.

Steps

The following is a method to investigate & resolve Production/Customer Environment problems, which has been found effective in complex and difficult situations.

Given the high-stakes nature of production issues, it is important to have a formal procedure — helping teams to “work the problem” in an effective way, rather than chasing scattershot fixes & potentially missing crucial evidence in the hope the problem is already solved.

The aim of this framework is to optimize analysis & investigative action through several key principles. These include keeping analysis & hypotheses open, involving a group, and issuing a regular prioritized “action list” to progress investigation/resolution of the problem.

The procedure is as follows:

  1. Context.

    • obtain contextual information – technical & business.
  2. Evidence.

    • obtain it: monitoring outputs, logs, exception traces, screenshots.
    • verifiable facts, with context, are preferred over assuming that whoever tells you is correct or is looking at the right machine.
  3. Analyze the Evidence.

    • analyze the evidence; this means really analyze it.
    • pore over it, note observations & all anomalies, count & categorize them (e.g. in Notepad++), find any clues or red herrings you can.
    • list all the observations & anomalies to form Hypotheses and bring to Review.
  4. Hypotheses.

    • form hypotheses out of your analysis.
  5. Convene & Review.

    • convene, review & suggest hypotheses.
    • do this as a group; it can’t be just one person responsible for finding (or failing to find) the problem.
    • this should be a cross between a brainstorming & prioritization session; keep a note of even less-likely hypotheses, and avoid dismissing these in case one turns out to be the problem.
  6. Actionable Investigative/Resolution Steps.

    • produce a list of Actionable Steps to further the investigation or potentially resolve the problem.
    • list them in priority order.
    • typically give the customer 4-6 per cycle, and ask for confirmation in a format (screenshot, log capture, etc.) that provides concrete verification.
    • keep repeating the ones that are outstanding.
  7. Next cycle: repeat from step 1/2.
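
The counting and categorizing in step 3 can also be scripted when the logs are too large to work through by hand. The following is a minimal Python sketch, not part of the original method: the log format (timestamp, level, message), the `LINE` pattern, and the `normalize` and `categorize` helpers are all assumptions for illustration and would need adapting to the real logs.

```python
import re
from collections import Counter

# Assumed log format: "<timestamp> <LEVEL> <message>". Adjust to the real logs.
LINE = re.compile(r"^\S+ (?P<level>ERROR|WARN) (?P<message>.*)$")


def normalize(message: str) -> str:
    """Replace digits so messages that differ only in IDs share one category."""
    return re.sub(r"\d+", "<n>", message)


def categorize(lines):
    """Count ERROR/WARN lines per (level, normalized message) category."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line.strip())
        if match:
            counts[(match["level"], normalize(match["message"]))] += 1
    return counts


# Made-up sample lines, for illustration only.
sample = [
    "2024-01-01T10:00:00 ERROR Timeout after 30s calling order 1234",
    "2024-01-01T10:00:05 ERROR Timeout after 30s calling order 5678",
    "2024-01-01T10:00:09 INFO Heartbeat ok",
    "2024-01-01T10:01:00 WARN Pool exhausted, 0 connections free",
]

# Most frequent categories first.
for (level, message), count in categorize(sample).most_common():
    print(f"{count:4d}  {level:5s}  {message}")
```

Sorting by count surfaces the dominant failure, but the rare categories at the bottom of the list may be exactly the anomalies worth bringing to Review.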

For on-premises customers, cycles are typically daily, with customer responses and evidence evaluated first thing in the morning.

For DevOps and SaaS deployments, cycles may be an hour or two, or however long it takes to gather evidence and work through a couple of hypotheses.
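
Whatever the cycle length, the action list from step 6 has to be rebuilt each cycle: outstanding items carry forward, in priority order, capped at a handful. A minimal Python sketch of that bookkeeping follows. The `Action` class, its fields, and `next_cycle_list` are hypothetical names for illustration, and the default limit of 6 simply mirrors the "4-6 per cycle" guideline.

```python
from dataclasses import dataclass


@dataclass
class Action:
    description: str
    priority: int        # lower number = more urgent (an assumed convention)
    verification: str    # evidence requested, e.g. "screenshot", "log capture"
    done: bool = False   # set only once concrete verification has arrived


def next_cycle_list(actions, limit=6):
    """Return the outstanding actions for the next cycle, in priority order."""
    outstanding = [a for a in actions if not a.done]
    return sorted(outstanding, key=lambda a: a.priority)[:limit]


# Made-up actions, for illustration only.
actions = [
    Action("Send application log for the incident window", 3, "log capture"),
    Action("Capture thread dump during a slowdown", 1, "attached dump file"),
    Action("Confirm which host the logs came from", 2, "screenshot of hostname"),
]
actions[2].done = True  # verified with evidence in the previous cycle

for number, action in enumerate(next_cycle_list(actions), start=1):
    print(f"{number}. {action.description} (confirm with: {action.verification})")
```

Marking an item done only when its verification has been received keeps unconfirmed items on the list, so they are repeated until the evidence comes back.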

Artefacts

Artefacts supporting the process should include:

  1. Technical & Business Context about the system. Ideally this should have been documented early in the project lifecycle & shared on the Wiki.
  2. Detailed analyses.
    • these can be in Notepad++.
  3. JIRA ticket.
    • keep this focused on the main problem.
    • spin other minor issues off to separate tickets;
    • keep reiterating the major problem & a summary of position if necessary to bring it back on topic.
  4. Wiki page for the Problem.
    • it may be useful to build up & track Event History, Facts, Questions/Hypotheses, and Actionable Items on a Wiki page.
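
One way to keep such a page consistent from cycle to cycle is to hold the four sections in a small structure and regenerate the text each time. The sketch below is an assumption-laden illustration in Python: `ProblemPage` and its `render` method are hypothetical, the output is plain text rather than any particular Wiki's markup, and the sample entries are invented.

```python
from dataclasses import dataclass, field


@dataclass
class ProblemPage:
    title: str
    event_history: list = field(default_factory=list)
    facts: list = field(default_factory=list)
    hypotheses: list = field(default_factory=list)
    action_items: list = field(default_factory=list)

    def render(self) -> str:
        """Render the page as plain text, one headed section per list."""
        sections = [
            ("Event History", self.event_history),
            ("Facts", self.facts),
            ("Questions/Hypotheses", self.hypotheses),
            ("Actionable Items", self.action_items),
        ]
        lines = [self.title]
        for heading, entries in sections:
            lines.append("")
            lines.append(heading)
            # Show empty sections explicitly so gaps stay visible.
            lines.extend(f"  - {entry}" for entry in entries or ["(none yet)"])
        return "\n".join(lines)


# Invented example content.
page = ProblemPage("Checkout timeouts")
page.facts.append("Timeouts appear in the application log on the reported host")
page.hypotheses.append("Connection pool exhaustion (less likely, kept open)")
print(page.render())
```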
