Squeezing 46 FPS from YOLOv8 on Cheap Edge Hardware

A highly optimized C++ pipeline bypasses the CPU to run real-time object detection on low-cost Rockchip NPUs.

Rachel Goldstein

Dev Tools Editor · Jun 15, 2026 · 4 min read

Edge AI has a packaging problem. Too many "real-time" computer vision demos are actually bloated Python scripts running on expensive, power-hungry developer kits, barely scraping past 15 frames per second while turning the host CPU into a space heater.

When deploying to the field, however, constraints are real. Power is limited, budgets are tight, and hardware must be cheap. A recent open-source project, khadas_yolov8n_multithread, demonstrates how to break through these bottlenecks. By running dual YOLOv8 nano (YOLOv8n) models on a low-cost Rockchip RK3588S system-on-chip (SoC), the pipeline achieves up to 46 FPS, saturating the camera sensor's frame-rate ceiling, while holding memory use flat at roughly 140 MB of RAM per stream.

This isn't achieved through magic, but through aggressive hardware offloading, zero-copy memory management, and smart multi-core NPU scheduling.

Bypassing the CPU with Fixed-Function Silicon

In a typical naive computer vision pipeline, the CPU is the middleman. It pulls frames from the camera, decodes them, resizes them, converts the color space (e.g., YUV to RGB), and shoves them into the neural network. On low-power edge devices, this CPU-bound preprocessing is where performance goes to die.

The RK3588S SoC, featured on boards like the Khadas Edge2, contains dedicated hardware blocks designed to handle these exact tasks: an Image Signal Processor (ISP), a Raster Graphic Acceleration (RGA) engine, and a Neural Processing Unit (NPU).

To keep the CPU entirely free for application logic, the optimized pipeline offloads every step of the frame's journey to this fixed-function silicon:

Capture: The OS08A10 MIPI camera stream is captured directly by the hardware ISP.
Preprocessing: The RGA engine handles color conversion and resizing to the 640x640 input dimensions required by YOLOv8n.
Inference: The processed frame is passed directly to the NPU.
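
The resize step is easier to reason about with numbers. As a minimal sketch, assuming the 1080p frame is scaled with its aspect ratio preserved and padded to a square (the source does not say whether the project letterboxes or stretches), the geometry works out as follows. The names Letterbox and fit_letterbox are illustrative, not taken from the project.

```cpp
#include <cassert>

// Hypothetical helper: geometry of an aspect-preserving fit into a
// dst x dst square. In the real pipeline the RGA engine does the pixel work;
// this only computes the numbers you would hand to it.
struct Letterbox {
    int scaled_w;  // width of the resized image
    int scaled_h;  // height of the resized image
    int pad_x;     // padding on the left (and right)
    int pad_y;     // padding on the top (and bottom)
};

Letterbox fit_letterbox(int src_w, int src_h, int dst) {
    assert(src_w > 0 && src_h > 0 && dst > 0);
    Letterbox box;
    if (src_w >= src_h) {
        // Landscape or square: width fills the target, height is rounded.
        box.scaled_w = dst;
        box.scaled_h = (src_h * dst + src_w / 2) / src_w;
    } else {
        // Portrait: height fills the target, width is rounded.
        box.scaled_h = dst;
        box.scaled_w = (src_w * dst + src_h / 2) / src_h;
    }
    if (box.scaled_w < 1) box.scaled_w = 1;
    if (box.scaled_h < 1) box.scaled_h = 1;
    box.pad_x = (dst - box.scaled_w) / 2;
    box.pad_y = (dst - box.scaled_h) / 2;
    return box;
}
```

For a 1920x1080 frame and a 640x640 target, this gives a 640x360 image with 140 rows of padding above and below. The same offsets are what you would subtract when mapping bounding boxes back to the original frame.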

Instead of allocating and freeing memory for every frame, which invites heap fragmentation and unpredictable allocator latency, the pipeline uses a fixed pool of pre-allocated buffers (BufPool). This keeps memory consumption flat over time. A single 1080p stream needs only ~137–152 MB of resident set size (RSS). Running two streams side by side scales linearly to ~276–304 MB.
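
To make the buffer-pool idea concrete, here is a minimal sketch of a fixed pool. It is not the project's BufPool: the class name FixedBufferPool, its methods, and the use of ordinary heap memory are all assumptions for illustration (a real pipeline on this SoC would more likely hand out DMA-capable buffers).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a pre-allocated frame buffer pool.
// All memory is reserved once in the constructor; acquire() and release()
// only move slot indices around, so steady-state operation never allocates.
class FixedBufferPool {
public:
    FixedBufferPool(std::size_t count, std::size_t bytes_each)
        : storage_(count * bytes_each), bytes_each_(bytes_each), count_(count) {
        assert(count > 0 && bytes_each > 0);
        free_.reserve(count);
        for (std::size_t i = 0; i < count; ++i) {
            free_.push_back(static_cast<int>(i));
        }
    }

    // Returns a slot index, or -1 when every buffer is currently in flight.
    int acquire() {
        std::lock_guard<std::mutex> lock(mu_);
        if (free_.empty()) return -1;
        int slot = free_.back();
        free_.pop_back();
        return slot;
    }

    // Hands a slot back to the pool so another frame can reuse it.
    void release(int slot) {
        std::lock_guard<std::mutex> lock(mu_);
        assert(slot >= 0 && static_cast<std::size_t>(slot) < count_);
        free_.push_back(slot);
    }

    // Pointer to the start of the slot's memory.
    std::uint8_t* data(int slot) {
        return storage_.data() + static_cast<std::size_t>(slot) * bytes_each_;
    }

    std::size_t available() {
        std::lock_guard<std::mutex> lock(mu_);
        return free_.size();
    }

private:
    std::vector<std::uint8_t> storage_;
    std::size_t bytes_each_;
    std::size_t count_;
    std::vector<int> free_;
    std::mutex mu_;
};
```

When the pool is exhausted, acquire() returns -1 rather than growing. That is the design choice that keeps RSS flat: a slow consumer causes dropped or delayed frames, not a growing heap.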

Because the memory footprint is so small, developers don't need high-end 8 GB or 16 GB developer kits. The entire pipeline runs comfortably on the cheapest 2 GB Rockchip RK3588S boards, which retail for as little as €90.

Tri-Core NPU Squeezing via Multi-Threading

The RK3588S NPU is not a single monolithic processor; it consists of three distinct physical cores. A naive, single-threaded inference loop will only target a single core, capping YOLOv8n throughput at roughly 31.2 FPS.

To saturate the hardware, the pipeline implements a three-thread inference pool. Using the Rockchip Runtime SDK (librknnrt v2.3.2) and the 2D Raster Graphic Acceleration library (librga v1.10.5_[8]), the code duplicates the model context across the cores using rknn_dup_context and binds each thread to a specific core mask via rknn_set_core_mask.
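
The threading pattern can be sketched without the board. The example below is a simplified stand-in, not the project's code: run_on_three_cores and FrameResult are hypothetical names, and the RKNN calls appear only as comments because they need the Rockchip runtime to link. Each worker represents one NPU core and pulls the next frame index from a shared counter.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical record of which worker (standing in for an NPU core)
// handled which frame.
struct FrameResult {
    int frame;
    int core;
};

std::vector<FrameResult> run_on_three_cores(int frame_count) {
    assert(frame_count >= 0);
    const int kCores = 3;
    FrameResult empty = {-1, -1};
    std::vector<FrameResult> results(static_cast<std::size_t>(frame_count), empty);
    std::atomic<int> next(0);
    std::vector<std::thread> workers;

    for (int core = 0; core < kCores; ++core) {
        workers.emplace_back([&results, &next, frame_count, core]() {
            // On the real hardware, this is where a worker would obtain its
            // own context via rknn_dup_context and pin it to one core with
            // rknn_set_core_mask before entering the loop.
            for (;;) {
                int frame = next.fetch_add(1);
                if (frame >= frame_count) break;
                // Placeholder for running inference on this worker's context.
                // Each frame index is written by exactly one worker.
                results[static_cast<std::size_t>(frame)].frame = frame;
                results[static_cast<std::size_t>(frame)].core = core;
            }
        });
    }
    for (std::size_t i = 0; i < workers.size(); ++i) workers[i].join();
    return results;
}
```

Results are stored by frame index, so downstream stages can consume them in capture order even though the workers finish out of order.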

By overlapping frame capture and RGA preprocessing with NPU inference spread across all three cores, throughput jumps from 31 FPS to 46 FPS. At this point, the bottleneck is no longer the computational pipeline or the NPU; it is the physical 46 FPS ceiling of the OS08A10 camera sensor itself.
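
One way to see why the sensor becomes the limit is a back-of-the-envelope model. The sketch below assumes per-core throughput adds up linearly across workers, which is an idealisation the source does not claim; real scaling is usually lower because the cores share memory bandwidth. effective_fps is a hypothetical helper.

```cpp
#include <algorithm>
#include <cassert>

// Idealised upper bound: the pipeline can never deliver more frames than the
// sensor produces, and never more than the workers can process combined.
double effective_fps(int workers, double per_worker_fps, double sensor_fps) {
    assert(workers > 0 && per_worker_fps > 0.0 && sensor_fps > 0.0);
    return std::min(sensor_fps, workers * per_worker_fps);
}
```

With the article's figures, one worker at 31.2 FPS stays below the 46 FPS sensor limit, while three workers have more than enough headroom, so the result is clamped to 46.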

The Unix Way: Composable IPC and LLM Hand-off

Rather than building a monolithic C++ application that tries to do everything in a single process, the project adopts a classic Unix-style architecture. The system is split into small, independent processes that communicate via Unix-domain sockets.
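
As a rough illustration of what passing detections between processes can look like, the sketch below sends one fixed-size record over a Unix-domain socket. Everything specific here is an assumption: the Detection layout is invented, and socketpair with datagram semantics stands in for the named sockets that separate processes would actually connect to.

```cpp
#include <cassert>
#include <cstdint>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical wire format for one bounding box. A fixed-size plain struct
// is only safe because both ends run on the same machine and ABI.
struct Detection {
    std::uint32_t frame_id;
    std::uint32_t class_id;
    float x;
    float y;
    float w;
    float h;
    float score;
};

// Sends one detection as a single datagram; true on a complete write.
bool send_detection(int fd, const Detection& d) {
    assert(fd >= 0);
    return send(fd, &d, sizeof d, 0) == static_cast<ssize_t>(sizeof d);
}

// Receives one detection; true only if a whole record arrived.
bool recv_detection(int fd, Detection& d) {
    assert(fd >= 0);
    return recv(fd, &d, sizeof d, 0) == static_cast<ssize_t>(sizeof d);
}

// Creates a connected AF_UNIX pair, pushes one record through it, and
// cleans up. In a real deployment each end would live in its own process.
bool round_trip(const Detection& in, Detection& out) {
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, fds) != 0) return false;
    bool ok = send_detection(fds[0], in) && recv_detection(fds[1], out);
    close(fds[0]);
    close(fds[1]);
    return ok;
}
```

Datagram sockets preserve message boundaries, so each recv yields exactly one record. A stream socket would work too, but the reader would then need to handle partial reads.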

The pipeline flows downstream through several distinct stages:

Detection: The multi-threaded YOLOv8n process outputs bounding boxes.
Tracking: A downstream process runs ByteTrack to maintain object identities across frames.
Analysis: Temporal-feature extraction and a presence Finite State Machine (FSM) monitor the behavior of tracked objects (such as UAVs entering or leaving the frame).
Summarization: When an object leaves the scene, an on-demand Large Language Model (Qwen2.5-0.5B) generates a natural-language assessment of the event.
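
The presence FSM in the analysis stage can be pictured as a small two-state machine with hysteresis. The sketch below is an assumption about how such a machine might look, not the project's implementation: PresenceFsm, its thresholds, and the Event names are all hypothetical.

```cpp
#include <cassert>

// Events a downstream stage could react to, e.g. triggering a summary on Left.
enum class Event { None, Entered, Left };

// Hypothetical presence FSM: an object counts as present after
// `enter_frames` consecutive detections and as gone after `exit_frames`
// consecutive misses, which filters out single-frame flicker.
class PresenceFsm {
public:
    PresenceFsm(int enter_frames, int exit_frames)
        : enter_frames_(enter_frames), exit_frames_(exit_frames),
          hits_(0), misses_(0), present_(false) {
        assert(enter_frames > 0 && exit_frames > 0);
    }

    // Feed one frame's detection result; returns the transition, if any.
    Event update(bool detected) {
        if (detected) {
            if (hits_ < enter_frames_) ++hits_;
            misses_ = 0;
        } else {
            if (misses_ < exit_frames_) ++misses_;
            hits_ = 0;
        }
        if (!present_ && hits_ >= enter_frames_) {
            present_ = true;
            return Event::Entered;
        }
        if (present_ && misses_ >= exit_frames_) {
            present_ = false;
            return Event::Left;
        }
        return Event::None;
    }

    bool present() const { return present_; }

private:
    int enter_frames_;
    int exit_frames_;
    int hits_;
    int misses_;
    bool present_;
};
```

In this model, the Left event is the natural hook for the summarization stage: it fires once per departure rather than on every frame without a detection.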

Running an LLM on the same edge hardware as a real-time object detection pipeline presents a massive resource scheduling challenge. If both models try to share the NPU simultaneously, frame rates will crater.

To solve this, the pipeline implements a clever "blackout/resume" control plane. When the presence FSM triggers the LLM, the camera pipeline temporarily pauses NPU inference and frees up the NPU contexts. The Qwen2.5-0.5B model (running via the RKLLM SDK) takes over the NPU, running at full speed to generate the summary. Once the LLM finishes writing its assessment, the control plane hands the NPU back to the camera threads, resuming real-time tracking without requiring a system reboot or dropping connections.
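
A hand-off like this needs some way to stop new inference work and wait for frames already on the NPU to finish. The following sketch shows one common pattern, a gate built from a mutex and a condition variable. It is an assumption about the general shape of such a control plane; NpuGate and its methods are hypothetical, and releasing or re-creating the NPU contexts is left out.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical gate between the camera's inference threads and the LLM.
class NpuGate {
public:
    NpuGate() : open_(true), in_flight_(0) {}

    // Inference thread: call before each frame. Blocks during a blackout.
    void enter() {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return open_; });
        ++in_flight_;
    }

    // Inference thread: call after each frame completes.
    void leave() {
        std::lock_guard<std::mutex> lock(mu_);
        assert(in_flight_ > 0);
        --in_flight_;
        cv_.notify_all();
    }

    // Control plane: close the gate, then wait until in-flight frames drain.
    // After this returns, the NPU is idle and can be given to the LLM.
    void blackout() {
        std::unique_lock<std::mutex> lock(mu_);
        open_ = false;
        cv_.wait(lock, [this] { return in_flight_ == 0; });
    }

    // Control plane: reopen the gate and wake any waiting inference threads.
    void resume() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            open_ = true;
        }
        cv_.notify_all();
    }

    bool is_open() {
        std::lock_guard<std::mutex> lock(mu_);
        return open_;
    }

    int in_flight() {
        std::lock_guard<std::mutex> lock(mu_);
        return in_flight_;
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    bool open_;
    int in_flight_;
};
```

The control process would call blackout() before starting the LLM and resume() once the summary is written. Inference threads need no special shutdown logic; they simply block in enter() until the gate reopens.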

The Economics of Edge Deployments

For developers building physical security, wildlife monitoring, or drone-tracking systems, this architecture represents a massive shift in unit economics.

Instead of deploying expensive x86 SBCs with dedicated GPUs or relying on pricey, power-hungry edge modules, teams can build highly responsive, multi-camera systems on sub-€100 ARM hardware. By writing clean C++ that respects the underlying hardware boundaries and utilizes fixed-function silicon, you can achieve production-grade performance on a hobbyist budget.

Sources & further reading

Show HN: Dual YOLOv8n UAV Detection on RK3588S at 42 FPS Using NPU — github.com
