Vision Pipeline¶
This document describes the architecture and implementation details of my multi-stage, low-latency, zero-copy-ish, real-time computer vision pipeline, which transforms raw pixel data of a scene into detections of the objects within it. These detections drive the robot's navigation and behavior policy.
TODO: once online, record measurements here (e.g. inference achieves X FPS at Y format), and maybe link to the profiler JSONL dump, etc.¶
Major Design Considerations¶
- Focus on latency, not throughput: the pipeline is "latest-wins". Only the freshest processed image frame is used; older frames are dropped.
- Accelerator-first: all suitable ops are offloaded to accelerators on board the SoC instead of clogging the CPU.
- Zero-copy-ish: direct memory access buffers (DMA-BUF) connect pipeline stages to avoid CPU memcpys of full frames. Sharing these buffers between devices is done via a borrow token/ownership leasing strategy to avoid data races.
- Observable: performance information is emitted at every stage, including FPS and pipeline stage latency distributions.
- Multithreaded: a producer thread performs image capture and preprocess, a consumer thread performs inference + postprocess, a telemetry thread performs intermediate data aggregation and export, and an orchestrator/application thread manages the lifecycle.
Glossary¶
- ISP: Image Signal Processor in the camera path that produces sensor frames.
- RGA: Raster Graphic Accelerator used for hardware image preprocess operations (resize, color conversion, letterbox) before inference.
- NPU: Neural Processing Unit used to run neural network model inference.
- RKNN: Rockchip Neural Network runtime/API that loads .rknn models and executes inference on the NPU.
- V4L2: Video4Linux2, the Linux camera/video API for configuring image capture, streaming frames, and dequeuing/requeuing buffer slots.
- V4L2 ring slot: One kernel-managed capture buffer in the camera queue; slots are dequeued, used, then requeued.
- DMA-BUF: File-descriptor-backed shared memory buffer that lets V4L2, RGA, and NPU access the same image data without full CPU copies.
- FrameDescriptor: Borrow-token style handle to a dequeued V4L2 slot (essentially a view consisting of fd + layout + metadata + slot index).
- ImageBuffer: Application-owned DMA-BUF used as RGA output and RKNN input.
- ImageBufferPool: Lock-free single-producer/single-consumer handoff structure for reusable ImageBuffer slots. Boundary between producer and consumer threads.
Pipeline Stages¶
| Stage | Thread | Input | Output | Primary Code |
|---|---|---|---|---|
| Capture (V4L2) | Producer | /dev/video* | FrameDescriptor | v4l2_capture.hpp/cpp |
| Preprocess (RGA) | Producer | FrameDescriptor | ImageBuffer | rga_preprocess.hpp/cpp |
| Buffering/Drop policy | Cross-thread boundary | ImageBufferPool indices | “latest wins” ImageBufferPool index | image_buffer_pool.hpp/cpp |
| Inference (NPU) | Consumer | ImageBuffer (RGB DMA-BUF) | model outputs | vision/include/omniseer/vision/rknn_runner.hpp (TODO), vision/src/rknn_runner.cpp (TODO) |
| Postprocess/Publish | Consumer | model outputs | ROS msgs, telemetry | TODO |
Diagram¶
                                   photons
                                      |
                                      v
                           +----------+-----------+
                           | Camera Sensor + ISP  |
                           | NV12 frames          |
                           +----------+-----------+
                                      |
                                      v
                  +-------------------+--------------------+
                  | V4L2 ring (kernel-owned ring slots)    |
                  | N buffers, each exported as DMA-BUF fd |
                  +-------------------+--------------------+
                                      |
                                      | VIDIOC_DQBUF -> FrameDescriptor (borrowed slot)
                                      v
    ========================== Producer Thread ==========================
                          +-----------+------------+
                          | V4l2Capture            |
                          | owns exported slot fds |
                          +-----------+------------+
                                      |
                                      v
                           +----------+-------------------------+
                           | RgaPreprocess (RGA)                |
                           | NV12 DMA-BUF -> RGB DMA-BUF        |
                           | resize + letterbox + color convert |
                           +----------+-------------------------+
                                      |
                                      | publish_ready(pool_idx)
                                      v
    ======================= Cross-thread Boundary =======================
                           +----------+-------------------------------+
                           | ImageBufferPool (SPSC)                   |
                           | free_ring + ready_idx (latest-wins)      |
                           | new publish can evict older ready buffer |
                           +----------+-------------------------------+
                                      |
                                      | acquire_read(pool_idx)
                                      v
    ========================== Consumer Thread ==========================
                           +----------+-------------------------+
                           | RKNN Runner (RKNN on NPU)          |
                           | reads RGB ImageBuffer DMA-BUF      |
                           +----------+-------------------------+
                                      |
                                      v
             detections -> postprocess -> ROS publish / behavior

    Return paths:
    - Producer: requeue(v4l2_index) -> V4L2 ring (after RGA completes)
    - Consumer: publish_release(pool_idx) -> ImageBufferPool free_ring
    - Producer + Consumer telemetry samples -> Telemetry thread -> JSONL/metrics sink
Interfaces¶
Core data types¶
Defined in types.hpp:
- FrameDescriptor: content description of a V4L2 ring buffer slot (owned by the kernel driver). Handle for ISP output/RGA input. Critically, this is a view of the buffer, not the data itself.
- ImageBuffer: DMA-BUF fd-backed buffer (owned by the application). Handle for RGA output/RKNN (NPU) input. This is also a view.
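To make the "view, not data" distinction concrete, here is a rough sketch of what the two handles could look like. The field names, the PixelFormat enumerators, and the frame_bytes helper are assumptions for illustration only; the actual definitions live in types.hpp and may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical pixel formats; the real enum in types.hpp may differ.
enum class PixelFormat : std::uint8_t { NV12, RGB888 };

// Borrowed view of a dequeued V4L2 ring slot. It carries no pixel data and
// does not own the fd; the kernel driver owns the underlying memory.
struct FrameDescriptor {
    int           dmabuf_fd    = -1;  // exported DMA-BUF fd (borrowed)
    std::uint32_t v4l2_index   = 0;   // ring slot to requeue when done
    std::uint32_t width        = 0;
    std::uint32_t height       = 0;
    std::uint32_t stride_bytes = 0;   // bytes per row of the luma plane
    PixelFormat   format       = PixelFormat::NV12;
    std::uint64_t timestamp_ns = 0;
};

// View of an application-owned DMA-BUF used as RGA output / RKNN input.
struct ImageBuffer {
    int           dmabuf_fd    = -1;
    std::uint32_t width        = 0;
    std::uint32_t height       = 0;
    std::uint32_t stride_bytes = 0;   // bytes per row (3 bytes per pixel)
    PixelFormat   format       = PixelFormat::RGB888;
    std::uint64_t timestamp_ns = 0;
    std::uint64_t sequence     = 0;
};

// Bytes needed for one frame. NV12 is a full-size luma plane followed by a
// half-height interleaved chroma plane (assumed to share the luma stride).
constexpr std::size_t frame_bytes(PixelFormat fmt, std::uint32_t stride_bytes,
                                  std::uint32_t height) {
    return fmt == PixelFormat::NV12
               ? static_cast<std::size_t>(stride_bytes) * height * 3 / 2
               : static_cast<std::size_t>(stride_bytes) * height;
}
```

Because both structs are plain views, copying them is cheap and never duplicates pixels; ownership of the memory stays with the driver or the allocator.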
Capture: V4l2Capture¶
Manages the V4L2 streaming lifecycle and defines a borrow-token style API for accessing ISP output from downstream devices in a zero-copy fashion.
Defined/implemented in v4l2_capture.hpp/cpp.
- start(): Opens the device path (e.g. /dev/video12), negotiates the image format, allocates a driver-managed ring buffer, exports each slot as a DMA-BUF fd, queues all slots, and starts streaming.
- dequeue(FrameDescriptor& out): Dequeues the most recently filled V4L2 ring slot and populates out with a borrow-token (v4l2_index, DMA-BUF fd, layout, metadata). After performing its work, the caller must call requeue(out.v4l2_index) to return the slot to the driver so it can be refilled with a fresh frame.
- requeue(uint32_t index): Returns the V4L2 ring buffer slot at index to the driver so it can be filled with a fresh frame.
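One way to make the "every dequeue() needs a matching requeue()" rule hard to violate is a small scope guard. The sketch below is not part of v4l2_capture.hpp: SlotLease, FakeCapture, and demo_lease are hypothetical names, and the fake's dequeue() hands out a bare index instead of a full FrameDescriptor so that the example runs without a camera.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for V4l2Capture (hypothetical): records which slots were requeued.
struct FakeCapture {
    bool dequeue(std::uint32_t& index) { index = next_index++; return true; }
    void requeue(std::uint32_t index) { requeued.push_back(index); }
    std::uint32_t next_index = 0;
    std::vector<std::uint32_t> requeued;
};

// Hypothetical guard: returns the borrowed slot to the driver on scope exit,
// including early returns and exceptions.
template <typename Capture>
class SlotLease {
public:
    SlotLease(Capture& cap, std::uint32_t index) : cap_(&cap), index_(index) {}
    ~SlotLease() { if (cap_ != nullptr) cap_->requeue(index_); }

    SlotLease(const SlotLease&) = delete;
    SlotLease& operator=(const SlotLease&) = delete;
    SlotLease(SlotLease&& other) noexcept : cap_(other.cap_), index_(other.index_) {
        other.cap_ = nullptr;  // the moved-from lease must not requeue
    }
    SlotLease& operator=(SlotLease&&) = delete;

    std::uint32_t index() const { return index_; }

private:
    Capture*      cap_;
    std::uint32_t index_;
};

// Usage: borrow two slots in turn; each is requeued exactly once.
inline std::vector<std::uint32_t> demo_lease() {
    FakeCapture cap;
    for (int i = 0; i < 2; ++i) {
        std::uint32_t idx = 0;
        if (!cap.dequeue(idx)) continue;
        SlotLease<FakeCapture> lease(cap, idx);
        // ... RGA would read the borrowed slot here ...
    }
    return cap.requeued;
}
```

The guard trades a little indirection for the guarantee that a slot cannot be leaked on an error path, which would otherwise starve the driver of buffers.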
Preprocess: RgaPreprocess¶
Manages the RGA (2D blitter) transformations from ISP output to correct model input.
Defined/implemented in rga_preprocess.hpp/cpp.
- run(const FrameDescriptor& src_nv12, ImageBuffer& dst_rgb, ...): Runs the RGA hardware pipeline to convert a captured NV12 DMA-BUF frame into an RGB888 destination buffer. Synchronous.
- prefill(ImageBuffer& dst_rgb): The RK3588's RGA device does not support color-filling an RGB888 buffer, so callers must do it themselves. This must only be called once, at buffer init time.
Buffering: ImageBufferPool¶
This is the boundary between the producer and consumer threads. It facilitates the data handoff between the RGA output and the NPU input in a lock-free fashion. It implements a "freshest-first" policy, where the consumer only has access to the latest processed image.
Defined/implemented in image_buffer_pool.hpp/cpp
The usage is as follows:
- Producer: acquire_write() -> RGA writes -> publish_ready()
- Consumer: acquire_read() -> RKNN reads -> publish_release()

- acquire_write(int& idx)
  - Obtains a free buffer index into idx for the producer to write into.
- publish_ready(int idx)
  - Publishes idx as the newest ready buffer.
  - Should be called after performing the write.
- acquire_read(int& idx)
  - Atomically grabs the currently ready buffer index into idx.
- publish_release(int idx)
  - Returns the consumed buffer index idx back to the pool.
  - Should be called after performing the read + consumption.
- buffer_at(int idx)
  - Accessor function for the buffer at pool[idx].
  - Required to access data once ownership is established.
  - Comes in non-const/const flavours for producer/consumer.
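The handoff protocol above can be sketched with one atomic "ready" index plus a single-producer/single-consumer free ring. This is a minimal illustration, not the contents of image_buffer_pool.cpp: the class name LatestWinsIndexPool, the producer-private stash for evicted indices, and the chosen memory orderings are all assumptions. It only tracks indices; the buffers themselves (buffer_at) are left out.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical latest-wins SPSC index pool. Method names follow this
// document; the internals are assumptions made for illustration.
template <std::size_t N>
class LatestWinsIndexPool {
public:
    LatestWinsIndexPool() {
        // All indices start out free and producer-owned.
        for (std::size_t i = 0; i < N; ++i) stash_[stash_size_++] = static_cast<int>(i);
    }

    // ---- Producer thread only ----
    bool acquire_write(int& idx) {
        if (stash_size_ > 0) {  // indices reclaimed by eviction, or initial ones
            idx = stash_[--stash_size_];
            return true;
        }
        // Acquire pairs with the release in publish_release(): the consumer's
        // reads of the buffer happen-before the producer overwrites it.
        if (tail_ == head_.load(std::memory_order_acquire)) return false;
        idx = free_ring_[tail_ % N];
        ++tail_;
        return true;
    }

    void publish_ready(int idx) {
        // Release publishes the written pixels; the old value is a ready
        // buffer the consumer never picked up, so the producer reclaims it.
        const int evicted = ready_idx_.exchange(idx, std::memory_order_acq_rel);
        if (evicted != kNone) stash_[stash_size_++] = evicted;
    }

    // ---- Consumer thread only ----
    bool acquire_read(int& idx) {
        const int got = ready_idx_.exchange(kNone, std::memory_order_acq_rel);
        if (got == kNone) return false;
        idx = got;
        return true;
    }

    void publish_release(int idx) {
        // The ring can never overflow: only N distinct indices exist.
        const std::size_t head = head_.load(std::memory_order_relaxed);
        free_ring_[head % N] = idx;
        head_.store(head + 1, std::memory_order_release);
    }

private:
    static constexpr int kNone = -1;
    std::atomic<int>         ready_idx_{kNone};
    std::array<int, N>       free_ring_{};     // consumer pushes, producer pops
    std::atomic<std::size_t> head_{0};         // written by the consumer
    std::size_t              tail_ = 0;        // producer-private
    std::array<int, N>       stash_{};         // producer-private free list
    std::size_t              stash_size_ = 0;
};

// Single-threaded walk-through: two publishes, then one read.
inline bool demo_consumer_sees_newest() {
    LatestWinsIndexPool<3> pool;
    int first = -1, second = -1, seen = -1;
    pool.acquire_write(first);
    pool.publish_ready(first);
    pool.acquire_write(second);
    pool.publish_ready(second);  // evicts `first`, which returns to the stash
    const bool ok = pool.acquire_read(seen);
    if (ok) pool.publish_release(seen);
    return ok && seen == second;
}
```

With this layout, three buffers are the minimum that lets the producer always find a free index: one may be held by the consumer, one may be sitting in the ready slot, and one is being written.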
Buffer Allocation: DmaHeapAllocator + DmaHeapAllocation¶
"Video malloc" allocator + allocation classes that create shareable RGB image buffers for zero-copy-ish data movement between RGA and RKNN. Resource-safe bridge between kernel memory and accelerators.
Defined and implemented in dma_heap_alloc.hpp/cpp.
DmaHeapAllocator:
- DmaHeapAllocator()
  - Creates the factory.
- allocate(int width, int height, PixelFormat fmt)
  - Allocates a DMA-BUF suitable for RGA write / RKNN read and returns an ImageBuffer and descriptor that points at it.
Ownership & Lifetime Rules for Buffers¶
V4L2 ring slots (FrameDescriptor)¶
- Owned by: kernel driver.
- Userspace handle lifetime:
  - The exported DMA-BUF fds are owned by V4l2Capture for the duration of streaming.
  - Each dequeue() borrows one ring slot at index v4l2_index.
  - A requeue(v4l2_index) must occur for every successful dequeue() to allow the slot to be refilled. This should happen after RGA is finished using the buffer.
Model input buffers (ImageBufferPool)¶
- Owned by: ImageBufferPool (backing DmabufAllocations are RAII).
- Cross-thread rule:
  - Producer may only write to a buffer index after acquire_write(idx) returns true.
  - Consumer may only read from a buffer index after acquire_read(idx) returns true.
  - Consumer must call publish_release(idx) once it is done reading.
Overview of Producer Responsibilities¶
- Own the upstream clock: drive the loop cadence (dequeue frames) and decide when to drop work to maintain “latest-wins” latency.
- Dequeue from V4L2: call DQBUF, receive the newest captured slot, and package it into a FrameDescriptor (fd(s), strides, w/h, format, timestamp, slot index).
- Respect V4L2 slot lifetime: treat the dequeued slot as borrowed from the driver; do not hold it longer than necessary.
- Acquire an output buffer: get a writable ImageBuffer slot from ImageBufferPool::acquire_write(idx) (or skip processing if none is available).
- Run preprocess on accelerators: invoke RGA to transform NV12 DMA-BUF → RGB/BGR DMA-BUF, including resize + letterbox/stretch policy, and produce LetterboxMeta if needed.
- Write output metadata: fill ImageBuffer fields (fd, stride, dims, pixel format, timestamp, letterbox params, sequence number).
- Publish the newest buffer: call publish_ready(idx) with release semantics so the consumer sees a fully written frame.
- Recycle old ready frames: if publish_ready “steals” the previous ready buffer (because of latest-wins), ensure it goes back into the producer/free path so buffers don’t leak.
- Return camera buffers promptly: QBUF the V4L2 slot back to the driver as soon as RGA is done with it (or immediately if the frame is dropped).
- Maintain steady-state buffer hygiene: perform one-time prefill/padding initialization of destination buffers (because the RGA cannot color-fill RGB888, this may fall back to a CPU prefill).
- Instrumentation: emit per-stage timings (DQBUF wait, RGA submit/complete, publish cost), drop counters, and queue depths.
- Error containment: handle transient failures (EINTR/EAGAIN, occasional RGA errors) without wedging the pipeline; perform a clean shutdown (stop streaming, close fds, free allocations).
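For the instrumentation item, a scoped timer plus a small per-stage sample store is often enough to get latency distributions and drop counts. The sketch below is hypothetical: StageStats and ScopedStageTimer are illustrative names, and for brevity the samples are aggregated in place, whereas this pipeline hands samples to a separate telemetry thread.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-stage sample store (single-threaded for simplicity).
class StageStats {
public:
    void record_us(std::uint64_t us) { samples_us_.push_back(us); }
    void count_drop() { ++drops_; }
    std::size_t count() const { return samples_us_.size(); }
    std::uint64_t drops() const { return drops_; }

    // Nearest-rank percentile for p in (0, 100]; returns 0 when empty.
    std::uint64_t percentile_us(double p) const {
        if (samples_us_.empty()) return 0;
        std::vector<std::uint64_t> sorted(samples_us_);
        std::sort(sorted.begin(), sorted.end());
        const double raw = std::ceil(p / 100.0 * static_cast<double>(sorted.size()));
        std::size_t rank = raw < 1.0 ? 1 : static_cast<std::size_t>(raw);
        rank = std::min(rank, sorted.size());
        return sorted[rank - 1];
    }

private:
    std::vector<std::uint64_t> samples_us_;
    std::uint64_t drops_ = 0;
};

// Records the elapsed wall time of the enclosing scope into a StageStats.
class ScopedStageTimer {
public:
    explicit ScopedStageTimer(StageStats& stats)
        : stats_(stats), start_(std::chrono::steady_clock::now()) {}
    ~ScopedStageTimer() {
        const auto elapsed = std::chrono::steady_clock::now() - start_;
        stats_.record_us(static_cast<std::uint64_t>(
            std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count()));
    }
    ScopedStageTimer(const ScopedStageTimer&) = delete;
    ScopedStageTimer& operator=(const ScopedStageTimer&) = delete;

private:
    StageStats& stats_;
    std::chrono::steady_clock::time_point start_;
};
```

Wrapping a stage is then a one-liner such as { ScopedStageTimer t(rga_stats); /* run RGA */ }. steady_clock is used because it is monotonic, so measured intervals are unaffected by wall-clock adjustments.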
Consumer Responsibilities¶
- Acquire the newest frame (latest-wins).
- Run RKNN inference (NPU):
  - Initialize RKNN once at startup (load .rknn, init runtime).
  - Per frame: feed the input tensor (usually uint8 NHWC or NCHW, depending on how the model was exported/configured), call inference, and read the output tensors.
  - Why this structure matters: inference FPS is expected to be lower than camera FPS, so the consumer naturally drops frames and always processes the most recent state, which is the desired behavior for robotics.
- Postprocess (CPU, usually):
  - Decode YOLO head outputs → candidate boxes + scores + class ids.
  - Apply thresholding + NMS (non-max suppression).
  - Undo letterbox/resize to map boxes back to 1280×720 (or whatever the original frame size is).
- Publish results downstream:
  - Provide a simple struct such as: timestamp, list of {class_id, score, x1, y1, x2, y2} in original image coordinates.
  - Feed tracking / “seek-and-capture” logic.
- Release buffer:
  - pool.publish_release(idx) so RGA can reuse it.

A minimal consumer loop looks like:
acquire → infer → decode → publish → release (no queue buildup, no waiting on stale frames)
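The thresholding, NMS, and letterbox-undo steps of postprocessing are plain CPU arithmetic and can be sketched independently of the model. The code below assumes boxes have already been decoded into corner form in model-input coordinates; the Detection struct, the greedy per-class NMS variant, and the function names are illustrative choices, not the pipeline's actual postprocess code (which is still TODO).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical detection record in corner format.
struct Detection {
    int   class_id;
    float score;
    float x1, y1, x2, y2;
};

inline float iou(const Detection& a, const Detection& b) {
    const float iw = std::max(0.0f, std::min(a.x2, b.x2) - std::max(a.x1, b.x1));
    const float ih = std::max(0.0f, std::min(a.y2, b.y2) - std::max(a.y1, b.y1));
    const float inter = iw * ih;
    const float uni = (a.x2 - a.x1) * (a.y2 - a.y1) +
                      (b.x2 - b.x1) * (b.y2 - b.y1) - inter;
    return uni > 0.0f ? inter / uni : 0.0f;
}

// Score threshold followed by greedy per-class non-max suppression.
inline std::vector<Detection> nms(std::vector<Detection> dets, float score_thresh,
                                  float iou_thresh) {
    dets.erase(std::remove_if(dets.begin(), dets.end(),
                              [score_thresh](const Detection& d) {
                                  return d.score < score_thresh;
                              }),
               dets.end());
    std::sort(dets.begin(), dets.end(),
              [](const Detection& a, const Detection& b) { return a.score > b.score; });
    std::vector<Detection> kept;
    for (const Detection& d : dets) {
        bool suppressed = false;
        for (const Detection& k : kept) {
            if (k.class_id == d.class_id && iou(k, d) > iou_thresh) {
                suppressed = true;
                break;
            }
        }
        if (!suppressed) kept.push_back(d);
    }
    return kept;
}

// Map a box from letterboxed model-input coordinates back to the original
// frame: subtract the padding, divide by the scale, clamp to the frame.
inline Detection undo_letterbox(Detection d, float scale, int pad_x, int pad_y,
                                int orig_w, int orig_h) {
    const float w = static_cast<float>(orig_w);
    const float h = static_cast<float>(orig_h);
    d.x1 = std::max(0.0f, std::min((d.x1 - static_cast<float>(pad_x)) / scale, w));
    d.x2 = std::max(0.0f, std::min((d.x2 - static_cast<float>(pad_x)) / scale, w));
    d.y1 = std::max(0.0f, std::min((d.y1 - static_cast<float>(pad_y)) / scale, h));
    d.y2 = std::max(0.0f, std::min((d.y2 - static_cast<float>(pad_y)) / scale, h));
    return d;
}
```

The scale and padding passed to undo_letterbox must be the same values used when the frame was letterboxed, which is why the producer stores the letterbox parameters alongside each ImageBuffer.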
Threading Model (Current Intended)¶
Two threads:
1) Capture/Preprocess thread (producer):
- cap.dequeue(frame) (nonblocking loop/poll)
- pool.acquire_write(idx); if false, immediately cap.requeue(frame.v4l2_index) and continue
- rga.run(frame, pool.buffer_at(idx), &meta)
- pool.publish_ready(idx)
- cap.requeue(frame.v4l2_index)
2) Inference thread (consumer):
- pool.acquire_read(idx) (nonblocking loop/condition variable)
- rknn.infer(pool.buffer_at(idx), ...) (TODO)
- pool.publish_release(idx)
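The producer steps listed above, including the immediate requeue when no output buffer is free, can be sketched as a single loop iteration. The function is templated so that it runs against small fakes; producer_step, StepResult, Frame, and the Fake* types are hypothetical stand-ins for V4l2Capture, ImageBufferPool, and RgaPreprocess, the run() signature is simplified (no meta argument), and error handling for a failed preprocess is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Frame { std::uint32_t v4l2_index = 0; };  // stand-in for FrameDescriptor

enum class StepResult { NoFrame, Dropped, Published };

// One iteration of the producer loop.
template <typename Capture, typename Pool, typename Preprocess>
StepResult producer_step(Capture& cap, Pool& pool, Preprocess& rga) {
    Frame frame;
    if (!cap.dequeue(frame)) return StepResult::NoFrame;  // nothing captured yet

    int idx = -1;
    if (!pool.acquire_write(idx)) {
        cap.requeue(frame.v4l2_index);  // drop: give the slot back right away
        return StepResult::Dropped;
    }
    rga.run(frame, pool.buffer_at(idx));  // synchronous preprocess
    pool.publish_ready(idx);
    cap.requeue(frame.v4l2_index);        // RGA is done reading the slot
    return StepResult::Published;
}

// ---- Minimal fakes so the sketch runs without hardware ----
struct FakeCapture {
    explicit FakeCapture(int frames) : frames_left(frames) {}
    bool dequeue(Frame& out) {
        if (frames_left == 0) return false;
        --frames_left;
        out.v4l2_index = next_index++;
        return true;
    }
    void requeue(std::uint32_t index) { requeued.push_back(index); }
    int frames_left;
    std::uint32_t next_index = 0;
    std::vector<std::uint32_t> requeued;
};

struct FakePool {
    explicit FakePool(int n) : buffers(static_cast<std::size_t>(n), 0), free_count(n) {}
    bool acquire_write(int& idx) {
        if (free_count == 0) return false;
        idx = --free_count;
        return true;
    }
    int& buffer_at(int idx) { return buffers[static_cast<std::size_t>(idx)]; }
    void publish_ready(int idx) { published.push_back(idx); }
    std::vector<int> buffers;
    int free_count;
    std::vector<int> published;
};

struct FakeRga {
    void run(const Frame& src, int& dst) {
        dst = static_cast<int>(src.v4l2_index) + 100;  // pretend to write pixels
        ++runs;
    }
    int runs = 0;
};
```

Note that the camera slot is requeued on both the drop path and the publish path, so the driver gets every buffer back regardless of what the pool does.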