Software issues in Production can be some of the most demanding challenges a team can face, especially when they occur on a system outside your control.
Adapted from the OODA loop — a military strategy used to describe the decision cycle of observe, orient, decide and act — this process formalizes an approach to take control, establish facts, and progress towards a resolution for even the most difficult problems.
Steps
The following is a method to investigate & resolve Production/Customer Environment problems, which has been found effective in complex and difficult situations.
Given the high-stakes nature of production issues, it is important to have a formal procedure that helps teams to “work the problem” in an effective way, rather than chasing scattershot fixes & potentially missing crucial evidence in the hope the problem is already solved.
The aim of this framework is to optimize analysis & investigative action through several key principles. These include keeping analysis & hypotheses open, involving a group, and issuing a regular prioritized “action list” to progress investigation/resolution of the problem.
The procedure is as follows:
1. Context.
   - Obtain contextual information, both technical & business.
2. Evidence.
   - Obtain it: monitoring outputs, logs, exception traces, screenshots.
   - Prefer verifiable facts, with context, rather than assuming that whoever tells you is correct or is looking at the right machine.
3. Analyze the Evidence.
   - Analyze the evidence. This means really analyze it.
   - Pore over it, note observations & all anomalies, count & categorize them (e.g. in Notepad++), and find any clues or red herrings you can.
   - List all the observations & anomalies to form Hypotheses and bring to Review.
4. Hypotheses.
   - Form hypotheses out of your analysis.
5. Convene & Review.
   - Convene, review & suggest hypotheses.
   - Do this as a group; no single person can be solely responsible for finding (or failing to find) the problem.
   - This should be a cross between a brainstorming & prioritization session; keep a note of even less-likely hypotheses, and avoid dismissing these in case one turns out to be the problem.
6. Actionable Investigative/Resolution Steps.
   - Produce a list of Actionable Steps to further the investigation or potentially resolve the problem.
   - List them in priority order.
   - Typically give the customer 4-6 per cycle, and ask for confirmation in a format (screenshot, log capture etc.) that provides concrete verification.
   - Keep repeating the ones that are outstanding.
   - Next cycle, repeat from step 1/2.
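The "count & categorize" part of analyzing the evidence can also be scripted when log volumes are too large to work through by hand. The following is only a minimal sketch: the log format, the sample lines, and the names `SAMPLE_LOG`, `LINE_PATTERN`, `normalize` and `categorize` are all assumptions made for illustration, and would need adapting to whatever the system in question actually emits.

```python
import re
from collections import Counter

# Hypothetical log excerpt; real evidence would come from the affected system.
SAMPLE_LOG = """\
2024-03-01 09:14:02 ERROR OrderService - Timeout calling payment gateway after 30000 ms
2024-03-01 09:14:05 WARN  PoolMonitor - Connection pool exhausted (50/50 in use)
2024-03-01 09:14:09 ERROR OrderService - Timeout calling payment gateway after 30012 ms
2024-03-01 09:15:40 INFO  Scheduler - Nightly job finished
2024-03-01 09:16:11 ERROR ReportService - NullPointerException at ReportBuilder.java:88
"""

# Assumed line layout: "<date> <time> <LEVEL> <source> - <message>".
# Only ERROR and WARN lines are treated as anomalies here.
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+ \S+)\s+(?P<level>ERROR|WARN)\s+(?P<source>\S+) - (?P<message>.*)$"
)


def normalize(message: str) -> str:
    """Replace digit runs so messages differing only in numbers group together."""
    return re.sub(r"\d+", "<n>", message)


def categorize(log_text: str) -> Counter:
    """Count anomalies by (level, source, normalized message)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_PATTERN.match(line)
        if match:
            key = (match["level"], match["source"], normalize(match["message"]))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # Most frequent categories first, ready to paste into the analysis notes.
    for (level, source, message), count in categorize(SAMPLE_LOG).most_common():
        print(f"{count:>3}  {level:<5} {source}: {message}")
```

The resulting counts are a starting point for the list of observations & anomalies, not a substitute for reading the evidence; rare one-off entries can matter as much as frequent ones.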
For on-premises customers, cycles are typically daily, with customer responses and evidence evaluated first thing in the morning.
For DevOps and SaaS deployments, cycles may be an hour or two, or however long it takes to gather evidence and work through a couple of hypotheses.
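Carrying outstanding items forward from one cycle to the next is easy to get wrong when the action list lives in email threads. As a rough sketch of the bookkeeping, assuming a hypothetical `Action` record and a `next_cycle_list` helper (neither is part of any tool named in this article), the list for the next cycle could be built like this:

```python
from dataclasses import dataclass


@dataclass
class Action:
    priority: int          # 1 = most important
    description: str
    evidence_format: str   # e.g. "screenshot" or "log capture"
    confirmed: bool = False


def next_cycle_list(previous, new_actions, limit=6):
    """Carry over unconfirmed actions, merge in new ones, keep the top `limit`.

    The default limit of 6 follows the 4-6 actions per cycle suggested above.
    """
    outstanding = [a for a in previous if not a.confirmed]
    # sorted() is stable, so at equal priority outstanding items come first.
    merged = sorted(outstanding + list(new_actions), key=lambda a: a.priority)
    return merged[:limit]


if __name__ == "__main__":
    last_cycle = [
        Action(1, "Capture thread dump during a slowdown", "log capture", confirmed=True),
        Action(2, "Confirm hostname of the affected server", "screenshot"),
    ]
    new = [Action(1, "Send application log covering the incident window", "log capture")]
    for action in next_cycle_list(last_cycle, new):
        print(f"P{action.priority}: {action.description} (confirm via {action.evidence_format})")
```

An action only drops off the list once its confirmation has arrived in the requested format, which mirrors the rule of repeating whatever is still outstanding.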
Artefacts
Artefacts supporting the process should include:
- Technical & Business Context about the system. Ideally this should have been documented early in the project lifecycle & shared on the Wiki.
- Detailed analyses.
  - These can be in Notepad++.
- JIRA ticket.
  - Keep this focused on the main problem.
  - Spin other minor issues off to separate tickets.
  - Keep reiterating the major problem & a summary of position if necessary to bring it back on topic.
- Wiki page for the Problem.
  - It may be useful to build up & track Event History, Facts, Questions/Hypotheses, and Actionable Items on a Wiki page.
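If the Wiki page is generated or updated by a script, its four sections map naturally onto a small record. This is a sketch only: the `ProblemPage` class and its plain-text `render` output are hypothetical, and the markup would need to match whatever Wiki is in use.

```python
from dataclasses import dataclass, field


@dataclass
class ProblemPage:
    title: str
    event_history: list[str] = field(default_factory=list)
    facts: list[str] = field(default_factory=list)
    hypotheses: list[str] = field(default_factory=list)
    action_items: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the page as plain text, one bulleted section per heading."""
        sections = [
            ("Event History", self.event_history),
            ("Facts", self.facts),
            ("Questions/Hypotheses", self.hypotheses),
            ("Actionable Items", self.action_items),
        ]
        lines = [self.title]
        for heading, entries in sections:
            lines.append("")
            lines.append(heading)
            # Empty sections stay visible so gaps in knowledge are obvious.
            lines.extend(f"- {entry}" for entry in entries or ["(none yet)"])
        return "\n".join(lines)


if __name__ == "__main__":
    page = ProblemPage(
        title="Problem: intermittent order timeouts",
        facts=["Timeouts appear only in the application log of one server"],
    )
    print(page.render())
```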