Series: Building Infrastructure for an Autonomous Drone Fleet (3/4)

Part 1: Device Identity · Part 2: Telemetry · Part 3: Monitoring · Part 4: Battery Management

The Monitoring Challenge

Drones are remote, intermittently connected, and resource-constrained. Monitoring them involves two distinct needs that are often conflated: real-time awareness (“Is it flying right now?”) and historical analysis (“What happened during last Tuesday’s flight?”). Standard observability tools lack domain-specific concepts like flights, battery assignments, and device connectivity patterns.

Two Services, Clear Boundaries

  • Ingestion service (Go): real-time data intake, device status, live alerts
  • Analysis service (Python): historical storage, categorized logs, flight detection, dashboards

They share no database. The analysis service polls the ingestion API using a cursor. Each service uses the language best suited to its job, and the two scale and fail independently.

Cursor-Based Polling — The Glue

The analysis service tracks exactly which data it has already processed per device. Every 5 minutes, it requests “everything since my last cursor.” Logs are routed into category-specific tables. Database constraints handle deduplication, so the design tolerates overlapping fetches. The result is simple, debuggable, and resilient to downtime on either side: after an outage, the next poll resumes from the stored cursor.
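A minimal sketch of the polling loop described above, using SQLite from the Python standard library. Everything named here is an assumption for illustration: the `seq` cursor field, the `logs_*` table names, the `category` values, and `fetch_since`, which stands in for the real ingestion API call. The point it shows is that a uniqueness constraint makes a repeated or overlapping fetch harmless.

```python
import sqlite3

# Hypothetical mapping from log category to table; anything else goes to logs_other.
CATEGORY_TABLES = {"battery": "logs_battery", "gps": "logs_gps"}


def fetch_since(source, device_id, cursor):
    """Stand-in for the ingestion API: records for one device with seq > cursor."""
    return [r for r in source if r["device_id"] == device_id and r["seq"] > cursor]


def init_db(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cursors "
        "(device_id TEXT PRIMARY KEY, last_seq INTEGER NOT NULL)"
    )
    for table in ("logs_battery", "logs_gps", "logs_other"):
        # UNIQUE (device_id, seq) is the constraint that deduplicates overlapping fetches.
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table} "
            "(device_id TEXT NOT NULL, seq INTEGER NOT NULL, payload TEXT, "
            "UNIQUE (device_id, seq))"
        )


def poll_device(conn, source, device_id):
    """Run one polling cycle for one device; return the number of new rows stored."""
    row = conn.execute(
        "SELECT last_seq FROM cursors WHERE device_id = ?", (device_id,)
    ).fetchone()
    cursor = row[0] if row else 0

    inserted = 0
    for record in fetch_since(source, device_id, cursor):
        table = CATEGORY_TABLES.get(record["category"], "logs_other")
        result = conn.execute(
            f"INSERT OR IGNORE INTO {table} (device_id, seq, payload) VALUES (?, ?, ?)",
            (device_id, record["seq"], record["payload"]),
        )
        inserted += result.rowcount  # 0 when the row already existed
        cursor = max(cursor, record["seq"])

    # Advance the cursor only after the rows are written, in the same transaction.
    conn.execute(
        "INSERT OR REPLACE INTO cursors (device_id, last_seq) VALUES (?, ?)",
        (device_id, cursor),
    )
    conn.commit()
    return inserted
```

In a real deployment a scheduler would call something like `poll_device` for every known device on the 5-minute interval. If the cursor is ever rewound, the same rows are fetched again and silently ignored.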

Automatic Flight Detection

ARM event + GPS movement + DISARM event = flight detected. No pilot input is needed; flights are derived purely from the telemetry stream. For each flight, the service records start time, end time, duration, max altitude, distance covered, and GPS path. This became the most-used feature: the operations team checks flights, not raw logs.
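The happy path of that rule can be sketched in a few lines. This is an illustration under stated assumptions, not the production detector: the event shape (`ts`, `type`, `lat`, `lon`, `alt`), the `Flight` record, and the 5-meter movement threshold are all hypothetical, and distance is approximated with the haversine formula over consecutive GPS fixes.

```python
import math
from dataclasses import dataclass

MIN_DISTANCE_M = 5.0  # assumed threshold separating a flight from a bench test


@dataclass
class Flight:
    start: float
    end: float
    max_altitude_m: float
    distance_m: float
    path: list  # (lat, lon, alt) tuples

    @property
    def duration_s(self):
        return self.end - self.start


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    radius_m = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))


def detect_flights(events):
    """Scan time-ordered events; emit a Flight for each ARM..DISARM span with GPS movement."""
    flights = []
    armed_at = None
    path = []
    for event in events:
        if event["type"] == "ARM":
            armed_at, path = event["ts"], []
        elif event["type"] == "GPS" and armed_at is not None:
            path.append((event["lat"], event["lon"], event["alt"]))
        elif event["type"] == "DISARM" and armed_at is not None:
            distance = sum(
                haversine_m(a[0], a[1], b[0], b[1]) for a, b in zip(path, path[1:])
            )
            if distance >= MIN_DISTANCE_M:  # armed but never moved: not a flight
                flights.append(
                    Flight(armed_at, event["ts"], max(p[2] for p in path), distance, path)
                )
            armed_at = None
    return flights
```

The sketch deliberately ignores the hard cases: a missing DISARM after a connectivity drop, GPS jitter while the drone sits on the ground, or a re-ARM in mid-sequence.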

The Operational Dashboard

The dashboard offers a device overview with connectivity, battery level, and last activity; a flight timeline with drill-down; and a telemetry explorer for historical data per device. It is designed for operations: “show me problems” rather than “show me metrics.”

Alerts — Work in Progress

Configurable alert rules are evaluated on each polling cycle: battery degradation, device offline, GPS anomaly, unexpected disarm. Alerts are currently stored in the database and displayed in the frontend. Next step: integration with team chat / email / push notifications.
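One way to make rules configurable is to store them as data and evaluate them against a per-device snapshot on each cycle. The sketch below assumes a simple threshold model; the `AlertRule` fields, the metric names, and the snapshot dictionary are hypothetical. Rules such as GPS anomaly or battery degradation would likely need more than a single threshold, so a plain low-battery check stands in here.

```python
import operator
from dataclasses import dataclass

# Comparison operators a rule may reference by name.
OPS = {">": operator.gt, "<": operator.lt}


@dataclass(frozen=True)
class AlertRule:
    name: str         # label stored with the alert
    metric: str       # key to read from the device snapshot
    op: str           # one of the keys in OPS
    threshold: float


def evaluate(rules, snapshot):
    """Return one alert dict per rule that fires for this device snapshot."""
    alerts = []
    for rule in rules:
        value = snapshot.get(rule.metric)
        if value is None:
            continue  # metric not reported this cycle; skip rather than guess
        if OPS[rule.op](value, rule.threshold):
            alerts.append(
                {"device_id": snapshot["device_id"], "rule": rule.name, "value": value}
            )
    return alerts


rules = [
    AlertRule("device_offline", "seconds_since_last_seen", ">", 900),
    AlertRule("battery_low", "battery_pct", "<", 20),
]
snapshot = {"device_id": "d1", "seconds_since_last_seen": 1200, "battery_pct": 55}
fired = evaluate(rules, snapshot)
```

Because rules are plain records, they can live in a database table and be edited without redeploying the service.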

What I Learned

  • Separating real-time from historical was the best architectural decision — fundamentally different query patterns and retention needs
  • Cursor-based polling is boring but reliable — simpler than WebSockets or event streaming at 5-minute granularity
  • Flight detection from telemetry patterns is surprisingly nuanced — the happy path is easy, edge cases are where the complexity lives
  • Database migrations on a live system with 8-figure row counts require respect (and statement timeouts)