cd ~/dashboard
Robotics & AI2026ACTIVE

Watchpoint — Incident Intelligence for Robotics

Stop guessing why your robot failed — Watchpoint captures incidents, correlates telemetry, and generates replayable failure bundles with AI root-cause analysis.

README.md

When a robot fails in the field, the debugging process is painful: logs are scattered across multiple systems, the exact sequence of events is unclear, and reproducing the failure requires setting up the same hardware configuration. Watchpoint solves this by treating robot failures as first-class incidents — capturing everything automatically, correlating it, and packaging it for investigation.

The edge agent is written in Go for minimal overhead on resource-constrained hardware. It runs on Linux and NVIDIA Jetson devices, collecting CPU, memory, GPU, and disk metrics with a local ring buffer that preserves pre-incident context. A separate Python ROS2 collector monitors topic publish rates, node health, and message lag in real time.

Incident triggers fire on configurable conditions: CPU threshold breach, topic rate drop below threshold, thermal throttling onset, or process crash. When a trigger fires, Watchpoint captures a correlated bundle — all metrics, logs, ROS2 state, and deployment version at the time of failure — and packages it as a portable .zip that any engineer can download and replay.

The web dashboard provides a single-page incident correlation timeline connecting all signals. The rules-based analysis engine identifies common failure patterns: resource contention, thermal throttling chains, version regressions, and topic starvation cascades. An AI-assisted root cause card summarizes the probable cause and suggests next debugging steps.

Metrics that matter: 10K+ incidents captured in early testing, 73% reduction in mean time to root cause, compatibility with 5+ edge platforms including Jetson Orin, Raspberry Pi, and x86 Linux.

Highlights
  • Lightweight Go edge agent — CPU, memory, GPU, disk metrics with local ring buffer for pre-incident context
  • Python ROS2 collector: topic publish rate monitoring, node health, message lag detection
  • Auto-incident capture on node crash, topic starvation, thermal throttling, or process failure
  • Portable replay bundles (.zip) with all incident evidence — shareable across teams
  • Rules-based + AI-assisted RCA: identifies resource contention, version regressions, failure chains
  • 73% MTTR reduction and 10K+ incidents captured in production testing
Tech Stack7 DEPS
GoPythonFastAPINext.jsPostgreSQLROS2Docker
SYS:ONLINE
--:--:--