New - Claude agents and Rust optimizations.

Building a real-time video analytics pipeline in Rust — ONNX Runtime, YOLO inference, and a Claude agent layer for semantic reasoning over detection events.

Published March 2026.

Built with Rust, AI.

What I’m building

Real-time video analytics in Rust. YOLO detection running through ONNX Runtime, with a Claude agent sitting above it to make sense of what the detections actually mean. Rust owns the hot path, Claude handles the reasoning.


Inference

Using ort v2.x for ONNX Runtime bindings. The 2.0 API is a lot cleaner than 1.x — Session::builder() instead of the old environment globals, execution providers configured inline. The big thing for video is IoBinding, which lets you bind CUDA memory directly to the session so frame data stays on-GPU between decode and inference. No round-tripping through host memory per frame.
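For reference, a minimal session setup under the ort 2.x API might look like the sketch below. This assumes the ort crate with its CUDA/TensorRT features enabled; the model filename is a placeholder, and exact module paths may shift between 2.0 release candidates.

```rust
use ort::execution_providers::{CUDAExecutionProvider, TensorRTExecutionProvider};
use ort::session::{builder::GraphOptimizationLevel, Session};

fn build_session() -> ort::Result<Session> {
    Session::builder()?
        .with_execution_providers([
            // Registered in priority order: TensorRT first, CUDA as the
            // fallback for any nodes TensorRT can't take.
            TensorRTExecutionProvider::default().build(),
            CUDAExecutionProvider::default().build(),
        ])?
        .with_optimization_level(GraphOptimizationLevel::Level3)?
        .commit_from_file("yolov10.onnx") // placeholder path
}
```

No environment globals anywhere: the builder carries all configuration, which is the cleanup over 1.x mentioned above.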

With TensorRT EP at FP16, a 640×640 frame is sitting around 2–3ms once the engine cache is warm. First run is slow — TensorRT builds the engine, ~200–400ms — but it serializes to disk so you only pay that once.

Running YOLOv10, which is nice because NMS is baked into the model graph. The ONNX output comes out already deduplicated, so there’s no NMS step to write in Rust. Pipeline is just: decode → letterbox → normalize → infer → threshold. ndarray ArrayViews map directly into ORT tensor buffers for the pre/post processing, no copies.
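The letterbox step is just an aspect-preserving scale plus symmetric padding. A standalone sketch of the geometry (the 640×640 target matches the input size above; the function name is mine):

```rust
/// Letterbox geometry: scale the source to fit inside a square `dst`×`dst`
/// input while preserving aspect ratio, then center it with padding.
/// Returns (scaled_w, scaled_h, pad_x, pad_y).
fn letterbox_dims(src_w: u32, src_h: u32, dst: u32) -> (u32, u32, u32, u32) {
    let scale = (dst as f32 / src_w as f32).min(dst as f32 / src_h as f32);
    let new_w = (src_w as f32 * scale).round() as u32;
    let new_h = (src_h as f32 * scale).round() as u32;
    (new_w, new_h, (dst - new_w) / 2, (dst - new_h) / 2)
}

fn main() {
    // A 1080p frame into a 640×640 model input: 640×360 image, 140px bands.
    assert_eq!(letterbox_dims(1920, 1080, 640), (640, 360, 0, 140));
    println!("ok");
}
```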


Decode pipeline

GStreamer via gstreamer-rs. The appsink callback fires per frame in GStreamer’s streaming thread, frames go into a bounded mpsc channel (4–8 frame capacity) before hitting the inference stage. The bounded channel is the backpressure — when inference can’t keep up, the source blocks instead of frames piling up in memory.
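The backpressure behavior falls out of a bounded channel directly; a toy sketch using std's sync_channel (frame type, capacity, and function name are illustrative, not the actual pipeline code):

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Producer pushes `frames` frame ids through a bounded channel with `cap`
// slots. `send` blocks whenever the consumer lags, so frames never pile up
// in memory — the same property the appsink callback relies on.
fn run_pipeline(frames: u64, cap: usize) -> u64 {
    let (tx, rx) = sync_channel::<u64>(cap);
    let producer = thread::spawn(move || {
        for frame_id in 0..frames {
            tx.send(frame_id).unwrap(); // blocks while the channel is full
        }
    });
    let mut processed = 0;
    while rx.recv().is_ok() {
        processed += 1; // stand-in for the inference stage
    }
    producer.join().unwrap();
    processed
}

fn main() {
    println!("processed {} frames, none dropped", run_pipeline(32, 4));
}
```

If dropping stale frames were preferable to blocking the source, `try_send` on the same channel gives the opposite policy with a one-line change.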

Bottleneck is always YUV→RGB conversion, not the demux. NVDEC handles the actual H.264/H.265 decode easily above real-time.


Claude layer

The Rust pipeline outputs structured JSON per frame — boxes, labels, confidence, timestamps. Batches of that go to a Claude agent as tool results. The agent handles the stuff that’s awkward to express as pipeline logic: zone crossing queries, anomaly explanation, incident summaries.
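A sketch of what a per-frame event like that can look like — field names are my guess at the payload shape, not the actual schema, and the JSON is hand-formatted only to keep the example dependency-free:

```rust
// Hypothetical per-frame detection event, serialized to JSON by hand.
struct Detection {
    label: &'static str,
    confidence: f32,
    // Pixel-space box: x, y, width, height.
    bbox: [f32; 4],
}

fn frame_event_json(ts_ms: u64, dets: &[Detection]) -> String {
    let boxes: Vec<String> = dets
        .iter()
        .map(|d| {
            format!(
                r#"{{"label":"{}","confidence":{:.2},"bbox":[{},{},{},{}]}}"#,
                d.label, d.confidence, d.bbox[0], d.bbox[1], d.bbox[2], d.bbox[3]
            )
        })
        .collect();
    format!(r#"{{"ts_ms":{},"detections":[{}]}}"#, ts_ms, boxes.join(","))
}

fn main() {
    let dets = [Detection {
        label: "person",
        confidence: 0.91,
        bbox: [12.0, 40.0, 80.0, 200.0],
    }];
    println!("{}", frame_event_json(1700000000000, &dets));
}
```

In a real pipeline this is the kind of thing serde_json handles; the point is just that the agent sees flat, timestamped, per-frame facts rather than pixels.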

Playing with the skills primitive in the Agent SDK — wrapping sequences of tool calls into named, reusable units so the agent can dispatch things like “count zone B events” or “flag anomaly” without the orchestrator re-explaining the task each time. Still early but it’s a cleaner pattern than stuffing everything into the system prompt.


Rough edges

The full zero-copy chain from NVDEC → CUDA memory → ORT IoBinding → output still needs unsafe FFI glue. Nothing in crate-land abstracts it end to end yet, so that’s being written by hand.

Also running a ReID model alongside YOLO for cross-camera person tracking. ORT and candle don’t share memory so there’s a host copy between them on each detection. Fine for now, will need to revisit.


More as things settle.