Vision Telemetry and Profiling Spec (v1)¶
Status¶
- Owner: vision pipeline implementation
- Version: v1
- Scope: producer + consumer runtime telemetry, emitted asynchronously
- Last updated: 2026-02-17
1) Purpose¶
Define a low-coupling, low-overhead telemetry architecture for the vision pipeline that:
- keeps producer and consumer hot paths non-blocking
- keeps profiling optional at runtime
- emits structured stage timing only when telemetry is active
- supports real-time and offline latency/jitter analysis
- preserves detector-stage latency/FPS under load
- avoids rework when consumer telemetry is added
2) Design Goals¶
- Optional profiling:
  - if telemetry is disabled, no stage timing work is done
  - pipelines still keep cheap status counters
- No data-plane stalls:
  - producer and consumer must never block on telemetry
  - queue overflow drops telemetry samples, never frames
- Loose coupling:
  - pipelines depend on a small telemetry interface, not sink details
- Cross-thread safety:
  - producer and consumer each publish to the telemetry thread through bounded queues
- ROS2-friendly timestamps:
  - monotonic durations for measurement correctness
  - realtime timestamp for cross-system correlation
- Low and predictable overhead:
  - fixed-shape samples on the hot path (no `std::optional`, no formatting)
  - one top-level timing gate per tick, then scoped stage timing inside that branch
3) Non-Goals (v1)¶
- perfect reliability of telemetry delivery
- dynamic attach/detach of telemetry while running
- profiling of preflight/startup path
- multi-process telemetry transport
- high-cost online analytics UI inside data-plane threads
4) Threading Model¶
The system has up to four threads:
- Orchestrator/App thread
  - owns lifecycle of telemetry hub and pipeline threads
- Producer thread
  - runs producer pipeline tick loop
  - emits producer telemetry samples asynchronously
- Consumer thread
  - runs consumer pipeline tick loop
  - emits consumer telemetry samples asynchronously
- Telemetry thread
  - drains producer and consumer queues
  - writes JSONL sink
  - computes optional rolling stats snapshots
5) Architecture¶
5.1 Interface boundary¶
Pipelines depend on one minimal interface (name can be adjusted):
```cpp
class ITelemetry {
public:
    virtual ~ITelemetry() = default;
    // True only when stage timing + sample emission are active.
    virtual bool timing_enabled() const noexcept = 0;
    // Always non-blocking, never throws, best-effort.
    virtual void emit_producer(const ProducerSample& sample) noexcept = 0;
    virtual void emit_consumer(const ConsumerSample& sample) noexcept = 0;
};
```
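As a sketch of the disabled end of this contract, a null-object implementation can satisfy the interface with zero hot-path work. The minimal `ProducerSample`/`ConsumerSample` stubs and the `NullTelemetry` name below are illustrative assumptions; the real sample structs are specified in section 5.4.

```cpp
#include <cstdint>

// Illustrative stand-ins for the real fixed-shape sample structs (section 5.4).
struct ProducerSample { std::uint64_t tick_id = 0; };
struct ConsumerSample { std::uint64_t tick_id = 0; };

class ITelemetry {
public:
    virtual ~ITelemetry() = default;
    virtual bool timing_enabled() const noexcept = 0;
    virtual void emit_producer(const ProducerSample& sample) noexcept = 0;
    virtual void emit_consumer(const ConsumerSample& sample) noexcept = 0;
};

// Null-object sketch: reports timing off and silently drops all samples,
// so pipelines can hold a valid pointer even when telemetry is configured off.
class NullTelemetry final : public ITelemetry {
public:
    bool timing_enabled() const noexcept override { return false; }
    void emit_producer(const ProducerSample&) noexcept override {}
    void emit_consumer(const ConsumerSample&) noexcept override {}
};
```

Whether to use a null object or a plain `nullptr` check (section 5.2) is a style choice; the spec's v1 convention is the `nullptr` check.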
5.2 Nullability and activation¶
- `telemetry == nullptr` means telemetry integration is absent.
- `telemetry != nullptr && telemetry->timing_enabled() == true` means stage timing + sample emission are active.
- Counters do not require active timing.
5.3 Counters vs samples¶
- Always-on cheap counters:
  - per-pipeline status counters (e.g., produced, no-frame, preprocess-error)
  - no formatting, no allocations, no file writes
- Conditional timing + samples:
  - only when `timing_enabled() == true`
  - stage duration measurement + sample object creation + queue push
5.4 Hot-path sample representation¶
For producer/consumer in-memory sample structs:
- use fixed-width fields (`uint64_t` durations default to `0`)
- include a `stage_mask` bitset indicating which stages executed
- avoid `std::optional` and avoid formatting work in data-plane threads
- JSON conversion (`stage_mask` -> `null` for missing stages) happens only in the telemetry thread
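A fixed-shape producer sample following these rules might look like the sketch below. The exact field set comes from sections 8.2 and 8.3; the stage bit assignments are an assumption for illustration.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stage bit assignments; bit order mirrors section 8.3.
enum ProducerStage : std::uint32_t {
    kDequeue      = 1u << 0,
    kAcquireWrite = 1u << 1,
    kPreprocess   = 1u << 2,
    kPublishReady = 1u << 3,
    kRequeue      = 1u << 4,
};

// Fixed-shape, trivially copyable sample: no std::optional, no strings,
// no heap. Durations default to 0; stage_mask says which ones are valid.
struct ProducerSample {
    std::uint64_t tick_id = 0;
    std::uint64_t frame_id = 0;       // meaningful only on the produced path
    std::uint32_t stage_mask = 0;     // bit set => stage executed this tick
    std::uint64_t dequeue_ns = 0;
    std::uint64_t acquire_write_ns = 0;
    std::uint64_t preprocess_ns = 0;
    std::uint64_t publish_ready_ns = 0;
    std::uint64_t requeue_ns = 0;
    std::uint64_t total_ns = 0;
};

// Trivially copyable => safe to push by value through a lock-free queue.
static_assert(std::is_trivially_copyable_v<ProducerSample>);
```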
5.5 Timing instrumentation style¶
Implementation pattern for each pipeline tick:
- Evaluate `timing_on = (telemetry != nullptr && telemetry->timing_enabled())` once at tick start.
- If `timing_on == false`, run pipeline logic with counters only.
- If `timing_on == true`, use scoped RAII stage timers that write directly into sample fields.
This keeps instrumentation readable while preserving near-zero disabled-path overhead.
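The RAII stage timer can be sketched as below. The names in the usage comment (`sample`, `kPreprocess`, `preprocess()`) are hypothetical; the essential property is that the timer writes the elapsed monotonic time into a caller-owned field and sets the stage bit on scope exit.

```cpp
#include <chrono>
#include <cstdint>

// On scope exit, writes the elapsed monotonic time into a caller-owned
// duration field and marks the stage as executed in the sample's stage mask.
class ScopedStageTimer {
public:
    ScopedStageTimer(std::uint64_t& out_ns, std::uint32_t& mask,
                     std::uint32_t stage_bit) noexcept
        : out_ns_(out_ns), mask_(mask), stage_bit_(stage_bit),
          start_(std::chrono::steady_clock::now()) {}

    ~ScopedStageTimer() {
        const auto elapsed = std::chrono::steady_clock::now() - start_;
        out_ns_ = static_cast<std::uint64_t>(
            std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count());
        mask_ |= stage_bit_;
    }

private:
    std::uint64_t& out_ns_;
    std::uint32_t& mask_;
    std::uint32_t stage_bit_;
    std::chrono::steady_clock::time_point start_;
};

// Hypothetical per-tick usage:
//   const bool timing_on = (telemetry != nullptr && telemetry->timing_enabled());
//   if (timing_on) {
//       ScopedStageTimer t(sample.preprocess_ns, sample.stage_mask, kPreprocess);
//       preprocess(frame);
//   } else {
//       preprocess(frame);  // counters only, no clock calls
//   }
```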
6) Queueing and Backpressure¶
6.1 Queue topology¶
- producer -> telemetry thread: one bounded SPSC queue
- consumer -> telemetry thread: one bounded SPSC queue
This avoids MPSC complexity and aligns with current lock-free SPSC primitives.
6.2 Queue behavior¶
- `try_push` only; never block producer/consumer
- on full queue: drop the sample and increment a drop counter
- no retries on hot path
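The hot-path emission rule reduces to a single non-blocking attempt. The toy bounded queue below is a single-threaded stand-in for the project's lock-free SPSC primitive (whose real `try_push` signature this assumes); only the drop-on-full pattern is the point.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Toy bounded queue with a try_push that fails when full; stands in for the
// real lock-free SPSC queue (this sketch is not itself thread-safe).
template <typename T, std::size_t N>
class BoundedQueue {
public:
    bool try_push(const T& v) {
        if (size_ == N) return false;  // full: caller drops the sample
        buf_[tail_] = v;
        tail_ = (tail_ + 1) % N;
        ++size_;
        return true;
    }

private:
    T buf_[N]{};
    std::size_t tail_ = 0;
    std::size_t size_ = 0;
};

struct Sample { std::uint64_t tick_id = 0; };

// Hot-path emission: one non-blocking push attempt, no retries; on overflow,
// count the drop and move on without ever stalling the data plane.
template <typename Queue>
void emit(Queue& q, const Sample& s, std::atomic<std::uint64_t>& dropped) {
    if (!q.try_push(s)) {
        dropped.fetch_add(1, std::memory_order_relaxed);
    }
}
```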
6.3 Sink behavior¶
- telemetry thread batches writes to sink (JSONL)
- flush by size and/or time interval
- sink I/O must not run on producer/consumer threads
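A batched JSONL sink on the telemetry thread could follow the sketch below; the class name and thresholds are assumptions, and real code would add the fault handling from section 7.

```cpp
#include <chrono>
#include <cstddef>
#include <fstream>
#include <string>

// Accumulates JSONL lines and flushes by batch size or elapsed interval.
// Runs only on the telemetry thread; data-plane threads never touch it.
class JsonlBatchSink {
public:
    JsonlBatchSink(const std::string& path, std::size_t max_batch,
                   std::chrono::milliseconds max_age)
        : out_(path, std::ios::app), max_batch_(max_batch), max_age_(max_age),
          last_flush_(std::chrono::steady_clock::now()) {}

    void write_line(const std::string& json) {
        pending_ += json;
        pending_ += '\n';
        ++count_;
        const auto now = std::chrono::steady_clock::now();
        if (count_ >= max_batch_ || now - last_flush_ >= max_age_) flush();
    }

    void flush() {
        if (count_ == 0) return;
        out_ << pending_;
        out_.flush();
        pending_.clear();
        count_ = 0;
        last_flush_ = std::chrono::steady_clock::now();
    }

private:
    std::ofstream out_;
    std::string pending_;
    std::size_t count_ = 0;
    std::size_t max_batch_;
    std::chrono::milliseconds max_age_;
    std::chrono::steady_clock::time_point last_flush_;
};
```

Batching amortizes syscall cost; the time threshold bounds how stale the file can be when the sample rate is low.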
6.4 Capacity sizing rule¶
- size each SPSC queue from the worst burst rate and the sink pause budget:
  `capacity >= worst_burst_rate_hz * max_sink_pause_s`
- default v1 starting point: `512` per queue unless measured data suggests otherwise
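As a worked instance of the sizing rule (the 60 Hz rate and 2 s pause are illustrative numbers, not measurements):

```cpp
#include <cstdint>

// Capacity rule from section 6.4: the queue must absorb the worst sample
// burst for as long as the sink may pause.
constexpr std::uint64_t required_capacity(std::uint64_t worst_burst_rate_hz,
                                          double max_sink_pause_s) {
    return static_cast<std::uint64_t>(worst_burst_rate_hz * max_sink_pause_s);
}

// Example: a 60 Hz producer with a 2 s worst-case sink stall needs >= 120
// slots, so the default capacity of 512 leaves ample headroom.
static_assert(required_capacity(60, 2.0) == 120);
static_assert(required_capacity(60, 2.0) <= 512);
```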
7) Failure Policy¶
v1 uses fail-open policy:
- telemetry/sink errors must not stop producer/consumer pipeline execution
- on sink fault, telemetry hub marks itself faulted, increments sink-error counter, and disables timing emission
- pipeline keeps running with cheap counters
Rationale: pipeline availability is prioritized over telemetry completeness.
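The fault transition can be sketched as a small piece of atomic state inside the hub; the class and member names below are assumptions about the eventual implementation.

```cpp
#include <atomic>
#include <cstdint>

// Fail-open hub state: a sink fault flips timing emission off and counts the
// error, but never propagates an exception or a stall into the data plane.
class TelemetryFaultState {
public:
    bool timing_enabled() const noexcept {
        return enabled_.load(std::memory_order_relaxed) &&
               !faulted_.load(std::memory_order_relaxed);
    }

    // Called by the telemetry thread when a sink write fails.
    void on_sink_error() noexcept {
        sink_errors_.fetch_add(1, std::memory_order_relaxed);
        faulted_.store(true, std::memory_order_relaxed);
    }

    std::uint64_t sink_errors() const noexcept {
        return sink_errors_.load(std::memory_order_relaxed);
    }

private:
    std::atomic<bool> enabled_{true};
    std::atomic<bool> faulted_{false};
    std::atomic<std::uint64_t> sink_errors_{0};
};
```

After a fault, pipelines observe `timing_enabled() == false` on their next tick and fall back to counters only.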
8) Producer Emission Rules (v1)¶
8.1 Tick inclusion¶
- Emit samples for:
  - `Produced`
  - `NoWritableBuffer`
  - `CaptureRetryableError`
  - `CaptureFatalError`
  - `PreprocessError`
- Do not emit a per-tick `NoFrame` sample by default (counter-only), to avoid high-rate log noise.
8.2 Identifiers¶
- `frame_id`: producer-assigned monotonic ID at `publish_ready`, stored in buffer metadata, and propagated into consumer telemetry
- `tick_id`: always present (monotonic counter in producer thread)
- `sequence`: present when available from the captured frame

`frame_id` is the cross-thread correlation key for end-to-end latency analysis.
8.3 Stage partitioning¶
Producer stage durations are partitioned as:
- `dequeue_ns`
- `acquire_write_ns`
- `preprocess_ns`
- `publish_ready_ns`
- `requeue_ns`
- `total_ns`

If a stage did not execute due to an early return, its duration is null in JSON.
8.4 Early-return semantics¶
- Early returns are normal outcomes, not process-fatal.
- Sample contains status fields that explain path taken.
9) Clock and Time Semantics¶
- Stage durations:
  - measured with the monotonic clock
  - stored as nanoseconds
- Event timestamp:
  - realtime nanoseconds for cross-correlation with ROS2 stamps/logs
- Duration math:
  - always uses the monotonic domain
  - never compute durations from realtime values
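In C++ terms this maps onto `std::chrono::steady_clock` for durations and `std::chrono::system_clock` for the correlation stamp; the struct and function names below are illustrative.

```cpp
#include <chrono>
#include <cstdint>

// Per-tick clock capture: durations come from the monotonic clock, while the
// realtime stamp is taken once purely for cross-system correlation.
struct TickClocks {
    std::chrono::steady_clock::time_point mono_start;
    std::uint64_t event_ts_real_ns;
};

inline TickClocks begin_tick_clocks() {
    return TickClocks{
        std::chrono::steady_clock::now(),
        static_cast<std::uint64_t>(
            std::chrono::duration_cast<std::chrono::nanoseconds>(
                std::chrono::system_clock::now().time_since_epoch()).count()),
    };
}

// Duration math stays entirely in the monotonic domain; mixing in
// system_clock values would break under NTP steps or clock adjustments.
inline std::uint64_t elapsed_ns(const TickClocks& c) {
    return static_cast<std::uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::steady_clock::now() - c.mono_start).count());
}
```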
10) Data Contract (JSONL v1)¶
Each line is one JSON object. Required top-level fields:
```json
{
  "schema_version": 1,
  "source": "producer",
  "frame_id": 4421,
  "tick_id": 12345,
  "sequence": 9988,
  "event_ts_real_ns": 1739700000000000000,
  "producer_status": "produced",
  "capture_status": "ok",
  "preprocess_status": "ok",
  "capture_errno": 0,
  "dur_ns": {
    "dequeue": 12000,
    "acquire_write": 900,
    "preprocess": 1450000,
    "publish_ready": 700,
    "requeue": 11000,
    "total": 1490000
  }
}
```
Notes:
- `frame_id` may be `null` for paths that never reached `publish_ready`.
- `sequence` may be `null` for paths without a valid dequeued frame.
- any non-executed stage in `dur_ns` is `null`.
- consumer events use `source: "consumer"` and consumer-specific status/stage fields.
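The `stage_mask`-to-`null` conversion on the telemetry thread might look like the sketch below. Building JSON by hand here is a deliberate simplification (no library dependency); the helper names are assumptions, and the bit order mirrors the stage list in section 8.3 with `total` always measured.

```cpp
#include <cstdint>
#include <string>

// Appends one dur_ns entry; a stage whose bit is clear serializes as null,
// matching the JSONL contract in section 10.
inline void append_dur(std::string& out, const char* key,
                       bool executed, std::uint64_t ns, bool last) {
    out += '"'; out += key; out += "\":";
    out += executed ? std::to_string(ns) : "null";
    if (!last) out += ',';
}

// Telemetry-thread-only conversion of the fixed-shape durations into the
// "dur_ns" JSON object. ns[] holds the six durations in section 8.3 order.
inline std::string dur_ns_json(std::uint32_t mask, const std::uint64_t ns[6]) {
    const char* keys[6] = {"dequeue", "acquire_write", "preprocess",
                           "publish_ready", "requeue", "total"};
    std::string out = "{";
    for (int i = 0; i < 6; ++i) {
        // total (index 5) is always measured; the rest depend on stage_mask.
        const bool executed = (i == 5) || ((mask >> i) & 1u);
        append_dur(out, keys[i], executed, ns[i], i == 5);
    }
    out += '}';
    return out;
}
```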
11) Interface and Ownership¶
11.1 Ownership¶
- app/orchestrator owns telemetry hub and sink objects
- producer/consumer receive non-owning pointer/reference to interface
11.2 Lifetime¶
- telemetry hub is created before starting producer/consumer loops
- telemetry hub is stopped and joined after producer/consumer loops exit
11.3 No runtime toggling (v1)¶
- telemetry configuration is fixed for process lifetime
- no hot attach/detach requirement
12) Flush and Shutdown Policy¶
- Flush trigger:
  - batch size threshold (e.g., `N` samples), or
  - periodic timer threshold (e.g., `T` milliseconds)
- Shutdown:
  - stop producer and consumer loops
  - drain remaining telemetry queues
  - flush sink
  - stop telemetry thread and join
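The drain-then-exit step can be sketched as below. The mutex-guarded vector stands in for the real SPSC queues, and `DrainDemo` is an illustrative name; the point is the exit condition, which requires both a stop request and an empty queue so no enqueued samples are lost.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Shutdown order from section 12: data-plane loops stop first, then the
// telemetry thread drains whatever remains before exiting.
struct DrainDemo {
    std::mutex m;
    std::vector<std::uint64_t> queue;  // stand-in for the SPSC queues
    std::atomic<bool> stop{false};
    std::uint64_t drained = 0;

    void telemetry_loop() {
        for (;;) {
            bool had_work = false;
            {
                std::lock_guard<std::mutex> lk(m);
                if (!queue.empty()) {
                    queue.pop_back();
                    ++drained;
                    had_work = true;
                }
            }
            // Exit only once stop is requested AND the queue is fully drained.
            if (!had_work && stop.load()) break;
            if (!had_work) std::this_thread::yield();
        }
    }
};
```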
13) Performance Requirements¶
When telemetry inactive:
- no stage clock calls
- no event allocations
- no queue operations
- only cheap status counters
When telemetry active:
- one fixed-shape sample build per emitted tick (stack/local object; no heap required)
- one non-blocking queue push attempt per sample
- stage timing guarded by a single top-level
timing_onbranch - no blocking I/O on data-plane threads
14) Minimal Test Requirements¶
- Interface disabled path:
  - no timing samples emitted when telemetry is null or inactive
- Producer stage coverage:
  - produced path populates expected durations and statuses
  - early-return paths correctly set missing stages to null
- `frame_id` propagation:
  - producer assigns `frame_id` at publish
  - consumer sample for the same image carries the same `frame_id`
- Queue overflow:
  - a full queue causes a sample drop counter increment, no blocking
- Sink failure:
  - sink error transitions telemetry to faulted/disabled mode
  - producer/consumer loops continue
- Schema stability:
  - JSON keys/types match the spec and remain backward compatible within v1
15) Suggested Implementation Slices¶
- Introduce telemetry interface + no-op/null behavior + cheap counters.
- Add `frame_id` to image metadata and propagate it producer -> consumer.
- Implement telemetry hub thread with dual SPSC ingest queues.
- Implement JSONL sink with batched flush.
- Wire producer/consumer stage timing with a top-level `timing_on` gate + RAII scoped timers.
- Add tests for inactive/active/overflow/fault/`frame_id`/schema.
16) Open Questions for v2+¶
- strict mode option (fail-closed) for CI/debug builds
- online p50/p95/p99 export endpoint
- compression/rotation strategy for long-running JSONL logs
- additional queue/buffer pressure fields if needed for debugging