Robotics Incident Reviews Should Produce Better State Models, Not Just Action Items

Robotics · Reliability · Incident Response · Operations · Systems · Physical AI

Most incident reviews end with a list:

  • add a check
  • fix a timeout
  • improve a log
  • write a test

Those are useful. But for robotics systems, I think the strongest incident reviews produce something more durable:

a better state model

If the team leaves the review with a clearer understanding of the system’s modes, transitions, and recovery points, the review has done deeper work.

Symptoms Are Often State Ambiguity

Many robotics incidents look messy because the system’s state model is weak or incomplete:

  • the robot was “kind of degraded”
  • the planner was “sort of paused”
  • the operator path was “partly manual”

Those phrases are a sign that the system’s real behavior space is larger than its explicit state model.

That gap makes:

  • debugging harder
  • operator expectations fuzzier
  • observability weaker
  • recovery behavior less predictable

Reviews Should Ask State Questions

In addition to root-cause questions, I like reviews to ask:

1. what state was the system actually in?
2. was that state modeled explicitly?
3. should it have been?
4. what transition was missing, ambiguous, or invisible?

This is often where the most reusable engineering insight lives.
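One way to make these questions concrete is to replay the states observed during an incident against the states and transitions the team has actually modeled. The following is a minimal sketch, not a prescribed tool: the function `review_transitions`, the state names, and the example data are all hypothetical, and it assumes the incident timeline can be reduced to a list of (from, to) state pairs.

```python
from typing import Iterable, List, Set, Tuple

Transition = Tuple[str, str]
Finding = Tuple[str, str, str]


def review_transitions(
    observed: Iterable[Transition],
    modeled_states: Set[str],
    modeled_transitions: Set[Transition],
) -> List[Finding]:
    """Flag observed behavior that the explicit state model does not cover."""
    findings: List[Finding] = []
    for src, dst in observed:
        if src not in modeled_states or dst not in modeled_states:
            # Questions 1-3: the system was in a state nobody modeled.
            findings.append(("unmodeled_state", src, dst))
        elif (src, dst) not in modeled_transitions:
            # Question 4: both states exist, but this path between them does not.
            findings.append(("unmodeled_transition", src, dst))
    return findings


# Hypothetical model and incident timeline, for illustration only.
MODELED_STATES = {"nominal", "planner_paused", "manual"}
MODELED_TRANSITIONS = {
    ("nominal", "planner_paused"),
    ("planner_paused", "nominal"),
    ("nominal", "manual"),
    ("manual", "nominal"),
}
observed = [
    ("nominal", "planner_paused"),
    ("planner_paused", "manual"),
    ("manual", "partly_manual"),
]

findings = review_transitions(observed, MODELED_STATES, MODELED_TRANSITIONS)
for kind, src, dst in findings:
    print(f"{kind}: {src} -> {dst}")
```

Each finding is a prompt for discussion rather than a verdict: the review still has to decide whether the state or transition should be modeled, or whether the system should be prevented from reaching it.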

Better States Make Better Signals

When a state becomes explicit, several things improve at once:

  • operators can see it
  • telemetry can record it
  • triggers can depend on it
  • exit conditions can be defined

That is why state-model improvements tend to pay back more than isolated fixes.
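As a rough illustration of how one explicit state can feed all four of those at once, here is a small sketch. The names `StateSpec` and `ModeTracker`, the signal keys, and the state names are assumptions made for the example; a real robot stack would likely use its own state machine framework.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class StateSpec:
    # Explicit exit condition: a predicate over the current signals.
    exit_condition: Callable[[dict], bool]
    next_state: str


@dataclass
class ModeTracker:
    specs: Dict[str, StateSpec]
    current: str  # what an operator display would show
    history: List[Tuple[str, str]] = field(default_factory=list)
    listeners: List[Callable[[str, str], None]] = field(default_factory=list)

    def step(self, signals: dict) -> str:
        """Leave the current state only when its exit condition holds."""
        spec = self.specs[self.current]
        if spec.exit_condition(signals):
            previous, self.current = self.current, spec.next_state
            self.history.append((previous, self.current))
            # Telemetry and triggers subscribe to transitions here.
            for notify in self.listeners:
                notify(previous, self.current)
        return self.current


# Hypothetical two-state model, for illustration only.
specs = {
    "nominal": StateSpec(lambda s: s.get("planner_stalled", False), "planner_paused"),
    "planner_paused": StateSpec(lambda s: s.get("operator_ack", False), "nominal"),
}

telemetry: List[str] = []
tracker = ModeTracker(specs=specs, current="nominal")
tracker.listeners.append(lambda prev, new: telemetry.append(f"{prev}->{new}"))

tracker.step({"planner_stalled": True})  # enters planner_paused
tracker.step({})                         # stays paused: exit condition not met
tracker.step({"operator_ack": True})     # returns to nominal
```

The point of the sketch is that "paused" stops being a vague description and becomes a named state with a recorded entry, a defined way out, and a hook that other components can react to.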

Example: “Degraded” Is Often Too Broad

A team may discover that a single degraded mode is hiding multiple realities:

  • reduced sensor trust
  • operator hold with healthy perception
  • planner disabled but locomotion active
  • inference fallback active but control healthy

Those are not the same thing operationally. Splitting them into more meaningful states can improve future triage immediately.
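A sketch of what that split might look like in code, assuming the four conditions above can be derived from a few subsystem health flags. The `Mode` and `Health` names, the flag names, and especially the priority order in `classify` are hypothetical choices for illustration; a real system may need to represent combinations rather than pick one.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    NOMINAL = "nominal"
    REDUCED_SENSOR_TRUST = "reduced_sensor_trust"
    OPERATOR_HOLD = "operator_hold"
    PLANNER_DISABLED = "planner_disabled"
    INFERENCE_FALLBACK = "inference_fallback"


@dataclass(frozen=True)
class Health:
    sensors_trusted: bool
    operator_hold: bool
    planner_enabled: bool
    inference_primary: bool


def classify(h: Health) -> Mode:
    """Map health flags to one named mode instead of a single 'degraded'.

    The priority order below is an assumption, not a recommendation.
    """
    if not h.sensors_trusted:
        return Mode.REDUCED_SENSOR_TRUST
    if h.operator_hold:
        # Reached only when perception is healthy.
        return Mode.OPERATOR_HOLD
    if not h.planner_enabled:
        return Mode.PLANNER_DISABLED
    if not h.inference_primary:
        return Mode.INFERENCE_FALLBACK
    return Mode.NOMINAL


mode = classify(Health(sensors_trusted=True, operator_hold=True,
                       planner_enabled=True, inference_primary=True))
print(mode.value)
```

With named modes like these, a triage question such as "was perception healthy when the operator took over?" can be answered from the recorded mode rather than reconstructed from logs.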

Action Items Still Matter

I am not arguing against concrete fixes. I am arguing that the review should also ask whether the incident exposed a missing or weak conceptual model of the system.

The best postmortems give you both:

  • a bug fix
  • a stronger way to describe future system behavior

The Practical Standard

If a robotics incident review ends only with tasks, you may fix the local bug and still leave the system hard to reason about.

If it ends with:

  • clearer states
  • clearer transitions
  • better entry and exit conditions
  • better observability around those states

then future incidents become easier to understand, because the vocabulary for describing them exists before they happen.

That is why I think good incident reviews should improve the state model, not just the backlog.
