cd ~/projects
Robotics & AI2026ACTIVE

WatchpointIncident Intelligence for Robotics

Stop guessing why your robot failed — Watchpoint captures incidents, correlates telemetry, and generates replayable failure bundles with AI root-cause analysis.

Robotics & AI73% MTTR reduction
Case study

Watchpoint

Robotics teams needed a repeatable way to capture incidents, correlate telemetry, and stop guessing at root cause.

Constraints

  • Edge devices had limited resources and noisy telemetry.
  • The system had to preserve pre-incident context automatically.
  • Bundles had to be portable so engineers could replay failures elsewhere.

Architecture

flow
1

Incident trigger

2

Go edge agent

3

ROS2 collector

4

Replay bundle

5

RCA dashboard

Tech stack

7 layers
GoPythonROS2FastAPINext.jsPostgreSQLDocker

Outcome

Cut mean time to root cause by 73% while capturing 10K+ incidents across Jetson, Raspberry Pi, and x86 targets.

README.md

When a robot fails in the field, the debugging process is painful: logs are scattered across multiple systems, the exact sequence of events is unclear, and reproducing the failure requires setting up the same hardware configuration. Watchpoint solves this by treating robot failures as first-class incidents — capturing everything automatically, correlating it, and packaging it for investigation.

The edge agent is written in Go for minimal overhead on resource-constrained hardware. It runs on Linux and NVIDIA Jetson devices, collecting CPU, memory, GPU, and disk metrics with a local ring buffer that preserves pre-incident context. A separate Python ROS2 collector monitors topic publish rates, node health, and message lag in real time.

Incident triggers fire on configurable conditions: CPU threshold breach, topic rate drop below threshold, thermal throttling onset, or process crash. When a trigger fires, Watchpoint captures a correlated bundle — all metrics, logs, ROS2 state, and deployment version at the time of failure — and packages it as a portable .zip that any engineer can download and replay.

The web dashboard provides a single-page incident correlation timeline connecting all signals. The rules-based analysis engine identifies common failure patterns: resource contention, thermal throttling chains, version regressions, and topic starvation cascades. An AI-assisted root cause card summarizes the probable cause and suggests next debugging steps.

Metrics that matter: 10K+ incidents captured in early testing, 73% reduction in mean time to root cause, compatibility with 5+ edge platforms including Jetson Orin, Raspberry Pi, and x86 Linux.

Highlights
  • Lightweight Go edge agent — CPU, memory, GPU, disk metrics with local ring buffer for pre-incident context
  • Python ROS2 collector: topic publish rate monitoring, node health, message lag detection
  • Auto-incident capture on node crash, topic starvation, thermal throttling, or process failure
  • Portable replay bundles (.zip) with all incident evidence — shareable across teams
  • Rules-based + AI-assisted RCA: identifies resource contention, version regressions, failure chains
  • 73% MTTR reduction and 10K+ incidents captured in production testing
Tech Stack7 DEPS
GoPythonFastAPINext.jsPostgreSQLROS2Docker
SYS:ONLINE
--:--:--