Inside Watchpoint: Architecture of a Robotics Incident Intelligence Platform
When a robot fails in the field, the debugging process is brutal. Logs are scattered across multiple systems, the exact sequence of events is fuzzy, and reproducing the failure requires recreating the same hardware state. After working through this pain across several deployments, I built Watchpoint to solve it.
Here's how the architecture actually works — and the decisions that mattered most.
The Core Problem
Modern robots aren't monolithic. A typical Jetson-based system runs:
- Multiple ROS2 nodes (perception, planning, control)
- Custom inference pipelines (TensorRT engines)
- System services (networking, telemetry)
- Userspace applications (UI, logging)
When something fails, the root cause might be in any of these — and the symptoms often appear far from the cause. A camera frame drop at the perception node might be caused by thermal throttling in a totally different process.
Traditional logging captures symptoms in isolation. Watchpoint captures the full system state at the moment of failure.
Three-Tier Architecture
┌──────────────────────────────────────────────────────────┐
│                      Web Dashboard                       │
│     (Next.js + Tailwind + correlation timeline UI)       │
└──────────────────────┬───────────────────────────────────┘
                       │ HTTPS
┌──────────────────────▼───────────────────────────────────┐
│                     FastAPI Backend                      │
│  • Ingestion service (incident bundles)                  │
│  • PostgreSQL (incidents, deployments, metadata)         │
│  • Rules engine (correlation, root cause)                │
│  • Replay bundle generator                               │
└──────────────────────┬───────────────────────────────────┘
                       │ HTTP/WebSocket
┌──────────────────────▼───────────────────────────────────┐
│               Edge Agent (Linux / Jetson)                │
│  • Go agent: system metrics, log tailing                 │
│  • Python ROS2 collector: topic/node health              │
│  • Local ring buffer (pre-incident context)              │
│  • Incident trigger engine                               │
└──────────────────────────────────────────────────────────┘
The Edge Agent: Why Go
The edge agent runs on the robot itself, on hardware that's often resource-constrained (Jetson Nano has 4GB RAM, Orin Nano has 8GB). Three requirements drove the choice of Go:
1. Single static binary — no Python interpreter, no runtime dependencies, just a binary you drop on the device
2. Predictable memory footprint — Go's runtime is small and the GC behavior is well-understood
3. Native concurrency — capturing metrics from multiple sources simultaneously is what goroutines are built for
The agent's main loop:
func (a *Agent) Run(ctx context.Context) error {
    var wg sync.WaitGroup

    // Each collector runs in its own goroutine
    wg.Add(4)
    go a.runMetricsCollector(ctx, &wg) // CPU, memory, GPU, disk
    go a.runLogTailer(ctx, &wg)        // dmesg, systemd, app logs
    go a.runROS2Collector(ctx, &wg)    // topic rates, node health
    go a.runTriggerEngine(ctx, &wg)    // incident detection

    wg.Wait()
    return nil
}
All collectors write to a single in-memory ring buffer. When a trigger fires, the buffer is snapshotted — that snapshot becomes the "pre-incident context" in the eventual replay bundle.
The Ring Buffer Decision
The most important design decision: pre-incident context.
When an incident fires, you don't just want to know what happened at that moment — you want to know what was happening before. A topic rate that drops to zero at T+0 was probably degrading at T-5 seconds, T-10 seconds, T-30 seconds.
The ring buffer maintains rolling metrics with a configurable retention window (default 60 seconds, max 300):
type RingBuffer struct {
    mu       sync.RWMutex
    capacity int
    samples  []MetricSample
    head     int
}

// NewRingBuffer pre-allocates the sample slice so Push can index into it directly.
func NewRingBuffer(capacity int) *RingBuffer {
    return &RingBuffer{
        capacity: capacity,
        samples:  make([]MetricSample, capacity),
    }
}

func (rb *RingBuffer) Push(sample MetricSample) {
    rb.mu.Lock()
    defer rb.mu.Unlock()
    rb.samples[rb.head] = sample
    rb.head = (rb.head + 1) % rb.capacity
}

func (rb *RingBuffer) Snapshot() []MetricSample {
    rb.mu.RLock()
    defer rb.mu.RUnlock()
    // Return samples in chronological order, oldest first, skipping
    // slots that have never been written
    result := make([]MetricSample, 0, rb.capacity)
    for i := 0; i < rb.capacity; i++ {
        idx := (rb.head + i) % rb.capacity
        if !rb.samples[idx].IsZero() {
            result = append(result, rb.samples[idx])
        }
    }
    return result
}
A 60-second buffer at 10Hz sampling rate = 600 samples per metric. At 8 metrics, that's 4,800 samples in memory. Each sample is ~32 bytes, so ~150KB total — trivial overhead.
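To make the data path concrete, here is a minimal sketch of a collector loop feeding that buffer. The MetricSample fields, the readCPUPercent() helper, and the free-function shape (the real collectors are methods on the agent) are assumptions for illustration, not Watchpoint's actual code:

import (
    "context"
    "sync"
    "time"
)

// Assumed sample shape; the real type isn't shown in this post.
type MetricSample struct {
    Name  string
    Value float64
    At    time.Time
}

// IsZero lets Snapshot() skip ring slots that were never written.
func (s MetricSample) IsZero() bool { return s.At.IsZero() }

// readCPUPercent is a stand-in; a real implementation would parse /proc/stat.
func readCPUPercent() float64 { return 0 }

// A collector samples at a fixed rate and pushes into the shared ring buffer.
func runMetricsCollector(ctx context.Context, wg *sync.WaitGroup, buf *RingBuffer) {
    defer wg.Done()
    ticker := time.NewTicker(100 * time.Millisecond) // 10Hz sampling
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case now := <-ticker.C:
            buf.Push(MetricSample{Name: "cpu.percent", Value: readCPUPercent(), At: now})
        }
    }
}

When a trigger fires, the trigger engine calls Snapshot() on the same buffer and hands the chronological sample list to the bundle writer.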
ROS2 Collector: Python Was the Right Call
The ROS2 collector is the one place I broke the "all Go" rule. Reason: rclpy (the ROS2 Python client library) is significantly better-supported than any Go ROS2 binding. For a tool that needs to integrate cleanly with arbitrary ROS2 graphs, that mattered.
The collector polls ROS2's graph introspection APIs:
from rclpy.node import Node

class ROS2Collector(Node):
    def __init__(self):
        super().__init__('watchpoint_collector')
        # Sample all topics with sensor data QoS
        self.create_timer(1.0, self.poll_topics)
        self.create_timer(5.0, self.poll_nodes)

    def poll_topics(self):
        topic_names_and_types = self.get_topic_names_and_types()
        for topic, types in topic_names_and_types:
            info = self.get_publishers_info_by_topic(topic)
            # Calculate publish rate from sample timestamps
            rate = self.calculate_publish_rate(topic)
            self.metrics.publish_topic_metric(topic, rate, len(info))

    def poll_nodes(self):
        for node_name in self.get_node_names():
            status = self.check_node_health(node_name)
            self.metrics.publish_node_metric(node_name, status)
Metrics are streamed via Unix domain socket to the Go agent — keeping ROS2 isolated from the main metrics pipeline. If rclpy crashes or hangs, the Go agent keeps running.
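On the Go side, the agent only needs to accept local connections and decode whatever the Python collector sends. Here is a minimal sketch of a Unix-socket listener; the socket path, the newline-delimited JSON framing, and the ros2Metric fields are assumptions, since the actual wire format isn't pinned down here:

import (
    "bufio"
    "encoding/json"
    "log"
    "net"
)

// Assumed message shape for one topic-rate sample.
type ros2Metric struct {
    Topic string  `json:"topic"`
    Rate  float64 `json:"rate"`
}

// listenROS2 accepts local connections and forwards decoded samples so they
// can join the same ring-buffer path as the native Go metrics.
func listenROS2(socketPath string, out chan<- ros2Metric) error {
    ln, err := net.Listen("unix", socketPath)
    if err != nil {
        return err
    }
    defer ln.Close()
    for {
        conn, err := ln.Accept()
        if err != nil {
            return err
        }
        go func(c net.Conn) {
            defer c.Close()
            scanner := bufio.NewScanner(c)
            for scanner.Scan() { // one JSON object per line
                var m ros2Metric
                if err := json.Unmarshal(scanner.Bytes(), &m); err != nil {
                    log.Printf("bad metric line: %v", err)
                    continue
                }
                out <- m
            }
        }(conn)
    }
}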
Incident Triggers: Rules, Not ML
I considered using ML for anomaly detection. I'm glad I didn't. Three reasons:
1. Training data: Robots fail in long-tail, infrequent ways. Building a useful training set requires months of operational data.
2. Explainability: When an incident fires, operators need to know why. ML black boxes are exactly wrong for this.
3. False positives: An ML model that fires incorrectly 1% of the time is fine for recommendations, terrible for incident detection.
Rules engine instead. Configurable per-deployment:
triggers:
  - name: "Camera topic starvation"
    type: "topic_rate"
    topic: "/camera/rgb"
    threshold:
      below: 20.0      # frames per second
      duration: 2.0    # seconds
    severity: "high"

  - name: "Jetson thermal throttling"
    type: "metric_threshold"
    metric: "gpu.thermal_state"
    threshold:
      equals: "throttling"
    severity: "critical"

  - name: "Node crashed"
    type: "node_status"
    node: "*"          # any node
    status: "missing"
    severity: "high"
Triggers compose. A "deployment regression" trigger combines node status with deployment metadata: a node disappeared within 60 seconds of a new deployment? That's a regression, not just a crash.
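Under the hood, composition is just boolean logic over recent events. A rough sketch of how a deployment-regression check might look inside the trigger engine; the Event type, its fields, and the 60-second window wiring are hypothetical, not Watchpoint's actual implementation:

import "time"

// Assumed event shape for illustration.
type Event struct {
    Kind string // e.g. "deployment", "node_missing"
    Node string
    At   time.Time
}

// deploymentRegression fires when a node goes missing within 60 seconds of
// the most recent deployment event.
func deploymentRegression(events []Event) (Event, bool) {
    var lastDeploy time.Time
    for _, e := range events {
        if e.Kind == "deployment" && e.At.After(lastDeploy) {
            lastDeploy = e.At
        }
    }
    if lastDeploy.IsZero() {
        return Event{}, false // no recent deployment: a crash is just a crash
    }
    for _, e := range events {
        if e.Kind == "node_missing" && e.At.Sub(lastDeploy) >= 0 && e.At.Sub(lastDeploy) <= 60*time.Second {
            return e, true // regression, not just a crash
        }
    }
    return Event{}, false
}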
Replay Bundles: The .zip That Saves Hours
When a trigger fires, the agent generates a portable bundle:
incident_20260513T143022_camera_starvation.zip
├── manifest.json          # bundle metadata
├── trigger.json           # what fired and why
├── metrics/
│   ├── cpu.csv            # 60 seconds of data
│   ├── memory.csv
│   ├── gpu.csv
│   └── disk.csv
├── ros2/
│   ├── topics.json        # topic rates and types
│   ├── nodes.json         # node status
│   └── tf_tree.txt        # transform tree at incident
├── logs/
│   ├── dmesg.log          # kernel messages
│   ├── journal.log        # systemd journal
│   └── app/               # application logs
└── system/
    ├── deployment.json    # what version was running
    └── hardware.json      # GPU, RAM, kernel version
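Assembling the bundle on the agent is deliberately boring: serialize each section, then zip it with the standard library. A minimal sketch using Go's archive/zip, where the files map of entry name to bytes is an assumption for illustration:

import (
    "archive/zip"
    "os"
)

// writeBundle zips pre-serialized sections (e.g. "manifest.json",
// "metrics/cpu.csv") into a single portable incident bundle. Nested entry
// names become folders inside the archive, giving the layout shown above.
func writeBundle(path string, files map[string][]byte) error {
    f, err := os.Create(path)
    if err != nil {
        return err
    }
    defer f.Close()

    zw := zip.NewWriter(f)
    for name, data := range files {
        w, err := zw.Create(name)
        if err != nil {
            return err
        }
        if _, err := w.Write(data); err != nil {
            return err
        }
    }
    return zw.Close()
}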
The bundle is portable. An engineer downloads it, opens the dashboard, drags it onto the timeline view — and gets the full incident context without needing access to the original robot.
This single design choice — portable bundles — eliminated 80% of the back-and-forth that previously made remote robot debugging painful.
Correlation Timeline: The UI That Mattered
The dashboard's primary view is a single-page correlation timeline:
TIME            CPU   GPU   MEM     TOPICS         NODES        EVENTS
───────────────────────────────────────────────────────────────────────────────────────
T-30s           45%   62%   3.2GB   ● 30Hz cam     ● 12 alive   -
T-20s           48%   78%   3.2GB   ● 28Hz cam     ● 12 alive   -
T-10s           52%   91%   3.3GB   ◐ 22Hz cam     ● 12 alive   GPU thermal warning
T-5s            54%   95%   3.3GB   ◐ 18Hz cam     ● 12 alive   GPU throttling started
T+0  INCIDENT   56%   91%   3.3GB   ✗ 8Hz cam      ◐ 11 alive   Camera topic rate drop
                                                                perception_node missing
You can see the chain: GPU thermal pressure → throttling → perception node CPU starvation → node crash → topic rate drop. The root cause is at T-10s, but the symptom appears at T+0. Without pre-incident context, that chain is invisible.
Rules-Based RCA
The analysis engine takes the bundle and runs correlation rules:
from typing import Optional

@rule(name="thermal_chain")
def thermal_throttling_chain(bundle: IncidentBundle) -> Optional[RootCause]:
    """Detect GPU throttling → node crash → topic drop chain."""
    thermal_event = bundle.find_event(
        type="thermal", state="throttling", before=bundle.trigger_time
    )
    if not thermal_event:
        return None

    node_crash = bundle.find_event(
        type="node_missing", after=thermal_event.timestamp
    )
    if not node_crash:
        return None

    return RootCause(
        primary="GPU thermal throttling caused process scheduling delays",
        chain=[thermal_event, node_crash, bundle.trigger_event],
        suggested_actions=[
            "Check GPU thermal sensors",
            "Reduce inference workload",
            "Verify cooling system",
        ],
    )
Rules accumulate. New failure modes get new rules. The engine doesn't try to be smart — it tries to be explicit.
What's Next
Watchpoint is in early production deployment with one robotics company. Open questions I'm still working through:
- Multi-robot fleet correlation: when 10 robots throttle simultaneously, that's a fleet-level issue, not a single-robot issue. Need correlation across deployments.
- Predictive triggers: not "this metric dropped below X" but "this metric trajectory predicts failure in N seconds." Still rules-based, but with trend analysis; a rough sketch follows this list.
- Bundle replay in simulator: take a real incident bundle, replay it in a sim environment, test fixes before deploying.
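For the predictive-trigger idea, one rules-friendly approach is a simple linear trend over the ring buffer: fit a slope to the recent samples and fire when the projected time to a threshold drops below a bound. A sketch that reuses the hypothetical MetricSample from earlier; none of this exists in Watchpoint yet:

// secondsToThreshold fits a least-squares line to the samples and estimates
// how long until the metric crosses the given floor. Returns false when the
// trend is flat, improving, or there isn't enough data.
func secondsToThreshold(samples []MetricSample, floor float64) (float64, bool) {
    n := len(samples)
    if n < 2 {
        return 0, false
    }
    t0 := samples[0].At
    var sumX, sumY, sumXY, sumXX float64
    for _, s := range samples {
        x := s.At.Sub(t0).Seconds()
        sumX += x
        sumY += s.Value
        sumXY += x * s.Value
        sumXX += x * x
    }
    fn := float64(n)
    denom := fn*sumXX - sumX*sumX
    if denom == 0 {
        return 0, false
    }
    slope := (fn*sumXY - sumX*sumY) / denom
    if slope >= 0 {
        return 0, false // not degrading
    }
    last := samples[n-1]
    eta := (floor - last.Value) / slope // seconds until crossing the floor
    return eta, eta > 0
}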
The core architecture has held up well. Edge agent in Go, Python ROS2 collector, portable bundles, rules-based RCA, correlation timeline UI. Each layer can be replaced independently. None of them are smart — but together they make incident response 73% faster.
That's the whole goal.