Vision Pipeline¶

This document describes the architecture and implementation details of my multi-stage, low-latency, zero copy-ish, real-time computer vision pipeline that transforms raw pixel data of a scene into object detections within it. These detections drive the robot's navigation and behavior policy.

TODO when online, write some measurements here, e.g. inference achieves xfps at Y format here, maybe link to profiler JSONL dump etcs¶

Major Design Considerations¶

Focus on latency, not throughput: pipeline is "latest-wins", only the freshest processed image frame is used, older frames are dropped.
Accelerator-first: all suitable ops are offloaded to accelerators on-board the SoC instead of clogging the CPU.
Zero-copy-ish: direct memory access buffers (DMA-BUF) connect pipeline stages to avoid CPU memcpys of full frames. Sharing these buffers between devices is done via a borrow token/ownership leasing strategy to avoid data races.
Observable: performance information emitted at every stage, including FPS, pipeline stage latency distributions.
Multithreaded: a producer thread performs image capture and preprocess, a consumer thread performs inference + postprocess, a telemetry thread performs intermediate data aggregation and export, and an orchestrator/application thread manages the lifecycle.

Glossary¶

ISP: Image Signal Processor in the camera path that produces sensor frames.
RGA: Raster Graphic Accelerator used for hardware image preprocess operations (resize, color conversion, letterbox) before inference.
NPU: Neural Processing Unit used to run neural network model inference.
RKNN: Rockchip Neural Network runtime/API that loads .rknn models and executes inference on the NPU.
V4L2: Video4Linux2, the Linux camera/video API for configuring image capture, streaming frames, and dequeue/requeue buffer slots.
V4L2 ring slot: One kernel-managed capture buffer in the camera queue; slots are dequeued, used, then requeued.
DMA-BUF: File-descriptor-backed shared memory buffer that lets V4L2, RGA, and NPU access the same image data without full CPU copies.
FrameDescriptor: Borrow-token style handle to a dequeued V4L2 slot (essentially a view consisting of fd + layout + metadata + slot index).
ImageBuffer: Application-owned DMA-BUF used as RGA output and RKNN input.
ImageBufferPool: Lock-free single-producer/single-consumer handoff structure for reusable ImageBuffer slots. Threshold between producer and consumer threads.

Pipeline Stages¶

Stage	Thread	Input	Output	Primary Code
Capture (V4L2)	Producer	`/dev/video*`	`FrameDescriptor`	`v4l2_capture.hpp/cpp`
Preprocess (RGA)	Producer	`FrameDescriptor`	`ImageBuffer`	`rga_preprocess.hpp/cpp`
Buffering/Drop policy	Cross-thread boundary	`ImageBufferPool` indices	“latest wins” `ImageBufferPool` index	`image_buffer_pool.hpp/cpp`
Inference (NPU)	Consumer	`ImageBuffer` (RGB DMA-BUF)	model outputs	`vision/include/omniseer/vision/rknn_runner.hpp` (TODO), `vision/src/rknn_runner.cpp` (TODO)
Postprocess/Publish	Consumer	model outputs	ROS msgs, telemetry	TODO

Diagram¶

                         photons
                            |
                            v
                  +----------------------+
                  | Camera Sensor + ISP  |
                  | NV12 frames          |
                  +----------+-----------+
                             |
                             v
         +-------------------+--------------------+
         | V4L2 ring (kernel-owned ring slots)    |
         | N buffers, each exported as DMA-BUF fd |
         +-------------------+--------------------+
                             |
                             | VIDIOC_DQBUF -> FrameDescriptor (borrowed slot)
                             v

  ========================== Producer Thread ==========================
                  +----------+-----------+
                  | V4l2Capture          |
                  | owns exported slot fds|
                  +----------+-----------+
                             |
                             v
                  +----------+-------------------------+
                  | RgaPreprocess (RGA)               |
                  | NV12 DMA-BUF -> RGB DMA-BUF       |
                  | resize + letterbox + color convert|
                  +----------+-------------------------+
                             |
                             | publish_ready(pool_idx)
                             v

  ======================= Cross-thread Boundary =======================
                  +----------+-------------------------------+
                  | ImageBufferPool (SPSC)                   |
                  | free_ring + ready_idx (latest-wins)      |
                  | new publish can evict older ready buffer |
                  +----------+-------------------------------+
                             |
                             | acquire_read(pool_idx)
                             v

  ========================== Consumer Thread ==========================
                  +----------+-------------------------+
                  | RKNN Runner (RKNN on NPU)         |
                  | reads RGB ImageBuffer DMA-BUF     |
                  +----------+-------------------------+
                             |
                             v
            detections -> postprocess -> ROS publish / behavior

Return paths:
  - Producer: requeue(v4l2_index) -> V4L2 ring (after RGA completes)
  - Consumer: publish_release(pool_idx) -> ImageBufferPool free_ring
  - Producer + Consumer telemetry samples -> Telemetry thread -> JSONL/metrics sink

Interfaces¶

Core data types¶

Defined in types.hpp:

FrameDescriptor: content description of a V4L2 ring buffer slot (owned by the kernel driver). Handle for ISP output/RGA input. This is critically a view of the buffer, not the data itself.
ImageBuffer: DMA-BUF fd-backed buffer (owned by the application). Handle for RGA output/RKNN(NPU) input. This is also a view.

Capture: `V4l2Capture`¶

Manages the V4L2 streaming lifecycle and defines a borrow-token style API for accessing ISP output from downstram devices in a zero-copy fashion.

Defined/implemented in v4l2_capture.hpp/cpp.

start():
Opens device path (e.g. /dev/video12), negotiates image format, allocates a driver-managed ring buffer, exports each slot as a DMA-BUF fd, queues all slots, and starts streaming.
dequeue(FrameDescriptor& out):
Dequeues the most recently filled V4L2 ring slot, populates out with a borrow-token (v4l2_index, DMA-BUF fd, layout, metadata). Thse caller must, after performing its work, requeue(out.v4l2_index) to return the slot to the driver so it can refill it with a fresh frame.
requeue(uint32_t index):
Return the specified V4L2 ring buffer slot at index to the driver so it can be filled with a fresh frame.

Preprocess: `RgaPreprocess`¶

Manages the RGA (2D blitter) transformations from ISP output to correct model input.

Defined/implemented in rga_preprocess.hpp/cpp.

run(const FrameDescriptor& src_nv12, ImageBuffer& dst_rgb, ...):
Runs the RGA hardware pipeline to convert a captured NV12 DMA-BUF frame into a RGB888 destination buffer. Synchronous.
prefill(ImageBuffer& dst_rgb):
The RK3588's RGA device does not support colorfilling a RGB888 buffer, so callers must do it themselves. This must only be called once at buffer init time.

Buffering: `ImageBufferPool`¶

This is the boundary between the producer and consumer threads. It facilitates the data handoff between the RGA output and the NPU input in a lock-free fashion. It implements a "freshest-first" policy, where the consumer only has access to the latest processed image.

Defined/implemented in image_buffer_pool.hpp/cpp

The usage is as follows:

Producer: acquire_write() -> RGA writes -> publish_ready()
Consumer: acquire_read() -> RKNN reads -> publish_release()
acquire_write(int& idx)
Obtains a free buffer index into idx for the producer to write into.
publish_ready(int idx)
Publishes idx as the newest ready buffer.
Should be called after performing the write.
acquire_read(int& idx)
Atomically grabs the currently ready buffer index into idx.
publish_release(int idx)
Returns the consumed buffer index idx back to the pool.
Should be called after performing the read + consumption.
buffer_at(int idx)
Accessor function for buffer at pool[idx]
Required to access data once ownership established
Comes in non-const/const flavours for producer/consumer

Buffer Allocation: `DmaHeapAllocator` + `DmaHeapAllocation`¶

"Video malloc" allocator + allocation classes that create shareable RGB image buffers for zero-copy-ish data movement between RGA and RKNN. Resource-safe bridge between kernel memory and accelerators.

Defined and implemented in dma_heap_alloc.hpp/cpp.

DmaHeapAllocator: - DmaHeapAllocator() - Create factory

allocate(int width, int height, PixelFormat fmt)
Allocate a DMA-BUF suitable for RGA write / RKNN read and return an ImageBuffer and descriptor that points at it.

Ownership & Lifetime Rules for Buffers¶

V4L2 ring slots (`FrameDescriptor`)¶

Owned by: kernel driver.
Userspace handle lifetime:
The exported DMA-BUF fds are owned by V4l2Capture for the duration of streaming.
Each dequeue() borrows one ring slot at index v4l2_index.
A requeue(v4l2_index) must occur for every successful dequeue() to allow slot to be refilled. This should happen after RGA is finished using the buffer.

Model input buffers (`ImageBufferPool`)¶

Owned by: ImageBufferPool (backing DmabufAllocations are RAII).
Cross-thread rule:
Producer may only write to a buffer index after acquire_write(idx) returns true.
Consumer may only read from a buffer index after acquire_read(idx) returns true.
Consumer must call publish_release(idx) once it is done reading.

Overview of Producer Responsibilities¶

Own the upstream clock: drive the loop cadence (dequeue frames) and decide when to drop work to maintain “latest-wins” latency.

Dequeue from V4L2: call DQBUF, receive the newest captured slot, and package it into a FrameDescriptor (fd(s), strides, w/h, format, timestamp, slot index).

Respect V4L2 slot lifetime: treat the dequeued slot as borrowed from the driver; do not hold it longer than necessary.

Acquire an output buffer: get a writable ImageBuffer slot from ImageBufferPool::acquire_write(idx) (or decide to skip processing if none are available).

Run preprocess on accelerators: invoke RGA to transform NV12 DMA-BUF → RGB/BGR DMA-BUF, including resize + letterbox/stretch policy, and produce LetterboxMeta if needed.

Write output metadata: fill ImageBuffer fields (fd, stride, dims, pixel format, timestamp, letterbox params, sequence number).

Publish the newest buffer: call publish_ready(idx) with release semantics so the consumer sees a fully-written frame.

Recycle old ready frames: if publish_ready “steals” the previous ready buffer (because latest-wins), ensure it goes back into the producer/free path so buffers don’t leak.

Return camera buffers promptly: QBUF the V4L2 slot back to the driver as soon as RGA is done with it (or immediately if you drop the frame).

Maintain steady-state buffer hygiene: one-time prefill/padding initialization for destination buffers (your RGB888 imfill limitation means you may do a CPU prefill fallback).

Instrumentation: emit per-stage timings (DQBUF wait, RGA submit/complete, publish cost), drop counters, and queue depths.

Error containment: handle transient failures (EINTR/EAGAIN, occasional RGA errors) without wedging the pipeline; perform clean shutdown (stop streaming, close fds, free allocations).

Consumer Responsibilities¶

Acquire the newest frame (latest-wins)
Run RKNN inference (NPU)

Initialize RKNN once at startup (load .rknn, init runtime).

Per frame:

Feed the input tensor (usually uint8 NHWC or NCHW depending on how you exported/configured).

Call inference.

Read output tensors.

Why this structure matters: your inference FPS will be lower than camera FPS, so consumer naturally drops frames and always processes “most recent state,” which is exactly what you want for robotics.

Postprocess (CPU, usually)

Decode YOLO head outputs → candidate boxes + scores + class ids

Apply thresholding + NMS (non-max suppression)

Undo letterbox/resize to map boxes back to 1280×720 (or whatever your original frame is)

Publish results downstream

Provide a simple struct like:

timestamp, list of {class_id, score, x1,y1,x2,y2} in original image coordinates

Feed tracking / “seek-and-capture” logic.

Release buffer

pool.release(idx) so RGA can reuse it.

A minimal consumer loop looks like:

acquire → infer → decode → publish → release (no queue buildup, no waiting on stale frames)

Threading Model (Current Intended)¶

Two threads:

1) Capture/Preprocess thread (producer): - cap.dequeue(frame) (nonblocking loop/poll) - pool.acquire_write(idx); if false, immediately cap.requeue(frame.v4l2_index) and continue - rga.run(frame, pool.buffer_at(idx), &meta) - pool.publish_ready(idx) - cap.requeue(frame.v4l2_index)

2) Inference thread (consumer): - pool.acquire_read(idx) (nonblocking loop/condition variable) - rknn.infer(pool.buffer_at(idx), ...) (TODO) - pool.publish_release(idx)