The $75 HAT That Outruns a $500 Jetson
- Patrick Duggan
DugganUSA lab notebook — April 10, 2026
Here's the number that made me open a text editor at 1 AM:
309 frames per second of YOLOv8s object detection at 640×640, running on a Raspberry Pi 5 with a Hailo-8 AI HAT+.
For context: NVIDIA's reference benchmark for YOLOv8s at the same resolution on a Jetson Orin Nano 8GB is around 60 FPS in FP16. The Hailo-8, at INT8, on a HAT that costs one-seventh of that Jetson, delivered five times the throughput tonight. End to end — including the non-max suppression post-processing, which runs on the Hailo chip itself and never touches the Pi's CPU.
That last part is the whole point, and the whole bet.
And yes — before anyone drops the obligatory "but the Jetson does INT8 too" reply: I know. NVIDIA's own TensorRT INT8 numbers for YOLOv8s on an Orin Nano 8GB land somewhere in the 100–170 FPS range depending on firmware revision and which Super reflash you're riding. 309 is still ahead of that, on hardware that costs about a third of what the Jetson does, with a toolchain that isn't a subtle business model for locking you to CUDA-on-ARM. The reply-guy fight is not a fight.
The split-brain theory
The conventional way to put AI on a mobile robot is to buy a single fat system-on-module that tries to do everything — a Jetson, an RK3588, a Coral dev board — then cram the vision pipeline, the planning stack, the DDS middleware, and a language model all on top of one SoC, where they fight for cache lines.
I've always thought that was the wrong shape. Modern AI workloads don't actually want a single general-purpose processor. They want different processors. Vision inference is embarrassingly parallel, extremely regular, and looks the same every frame — it loves a dataflow architecture with stationary weights and streaming activations. Reasoning and planning are irregular, state-heavy, branchy, and love a fast out-of-order CPU with large caches. SLAM and Nav2 live somewhere in between, with tight real-time deadlines and lots of small floating-point updates across cost-map tiles. Each of these workloads has a correct silicon, and they're not the same silicon.
The split-brain theory says: give each workload the hardware it actually wants.
For the Butterbot Tank, that decomposes to:
Vision inference → Hailo-8 (26 TOPS INT8, dataflow architecture, on-chip SRAM, baked-in NMS)
SLAM, Nav2, robot state → Raspberry Pi 5 CPU (4× Cortex-A76 @ 2.4 GHz, 8 GB LPDDR4X-4267)
Reasoning / LLM → Pi 5 CPU again, via llama.cpp (int4-quantized 7B-class models, CPU inference)
Motor control + sensor timing → STM32 on the Hiwonder robot controller board (1 MHz UART link to the Pi)
Depth imaging → Deptrum Aurora 930 over USB (dedicated stereo ASIC, does its own disparity compute, hands us aligned depth + RGB frames)
Four distinct processors, five workloads. Each chip doing exactly the thing it's shaped for, connected by buses wide enough to carry the data between them without becoming the bottleneck.
That's the bet. Tonight I got to find out whether the hardest part of it — the NPU half — actually delivers the throughput that makes the rest of the architecture make sense.
The rough parts
Because this is a lab notebook and not a keynote, here are the three things that didn't go smoothly.
1. The Hailo didn't come up on first boot. lspci saw the chip cleanly — Hailo Technologies Ltd, board 1e60-2864, revision 01, sitting on the PCIe bus like it owned the place. But /dev/hailo0 didn't exist. modprobe hailo_pci came back "module not found." The Debian package hailort-pcie-driver was installed; the source was staged in /usr/src/hailort-pcie-driver/; hailortcli was on $PATH; but the .ko itself was nowhere on the system.
Root cause: the Pi OS kernel had been bumped from 6.12.47 to 6.12.75 at some earlier apt upgrade, and the prebuilt kernel object only existed for the old version. The hailort-pcie-driver postinst script is supposed to handle that automatically — it tries make install_dkms first, and falls back to a plain make && make install if DKMS isn't available. But neither had run, because DKMS itself wasn't installed on the system, and the package had been installed before the kernel upgrade, so there'd been nothing to trigger a rebuild since.
The fix was three commands, in order: install DKMS, reinstall hailort-pcie-driver so its postinst would run again with DKMS present, then modprobe hailo_pci. The reinstall fired the postinst, which this time found DKMS, built the module against the current 6.12.75 headers, installed it into /lib/modules/$(uname -r)/updates/dkms/, and populated modules.dep. First modprobe succeeded. dmesg scrolled through the probe sequence — 64-bit DMA enabled, firmware hailo8_fw.4.23.0.bin uploaded into the chip's NNC in 151 ms, board registered as /dev/hailo0. Added hailo_pci to /etc/modules-load.d/hailo.conf so the module loads at every boot; and with DKMS now in place, the next kernel bump triggers a rebuild automatically instead of silently dropping the module.
2. The tank's treads refused to move. hailortcli worked. The depth camera worked. But commanding the motors through Hiwonder's Python SDK did nothing. The LEDs on the lower robot-controller board blinked on command. The buzzer chirped. The IMU returned plausible accelerometer and gyro readings. Motor commands went out, got acknowledged, and produced a quiet electrical hiss from the coils — but zero torque.
We chased four hypotheses. Duty cycle too low. Wrong motor type code (Hiwonder has four: JGB520/JGB37/JGB27/JGB528, and picking the wrong one throws off the firmware's internal PWM scaling). Battery voltage mis-set. Encoder feedback required. None of them.
The actual cause, found by Patrick with his fingers on the hardware: the Hiwonder MentorPi T1 uses a stacked two-board power topology, and nobody bothered to put that in the getting-started guide. The lower board has the STM32 that talks to the Pi over UART — it has its own 5 V logic rail fed parasitically from the USB bus. Everything we were commanding — LEDs, buzzer, IMU, motor signals — lives on that board. The H-bridges themselves are on an upper board, which needs its own 5 V logic rail, fed through a completely separate USB-A cable from the buck converter. That cable was unseated. PWM signals were reaching the upper board's H-bridge input pins, but the H-bridges had no power, so the output stage was dead. The coils hissed because the input capacitance was being charged and discharged, but no current was actually flowing through the motor windings.
When a vendor ships a two-board sandwich with two independent 5 V rails, and one of those rails is an unseated USB cable away from "mysteriously silent," that vendor owes its customers one sentence in the quickstart. The sentence does not exist. We wrote our own.
Lesson logged: if commands get through but the actuator doesn't move, trace physical power paths before you start tuning firmware.
3. The Pi locked up mid-build. I made the mistake of kicking off a Hailo benchmark while a colcon build --parallel-workers 2 was in the middle of compiling the Nav2 C++ dependencies — nav2_costmap_2d, teb_local_planner, costmap_converter, all C++ with heavy template instantiation. The Pi 5 went silent. No ping, no SSH, no serial console, no auto-reboot. Had to power-cycle.
Post-mortem: each g++ process compiling those nav packages resident-sets somewhere around 500–900 MB. Two in parallel is 1.5 GB of compiler alone. Loading a YOLOv8s .hef into the Hailo NPU allocates roughly 200 MB of DMA-able scatter/gather buffers in the kernel, plus a big transient PMIC current spike as the chip wakes and starts streaming through its 16 MB on-chip SRAM. The Pi 5's 8 GB of LPDDR4X was fine memory-wise, but the combination of that PMIC transient, sustained thermal load from two gcc workers, and the NPU power draw was apparently enough to either brown out the SoC rail or crash a USB PHY or something equally unrecoverable.
The Pi 5 can handle either a heavy C++ compile load or a sustained 26 TOPS NPU load. It cannot handle both simultaneously. Lesson logged: heavy build load and heavy inference load are not concurrent workloads on this hardware. Serialize them.
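A minimal guard against recreating that lockup — a sketch using an advisory file lock so the build and the benchmark can never overlap. The lock path and helper name are mine, not part of any Hailo or ROS tooling:

```python
import fcntl
import subprocess

LOCK_PATH = "/tmp/heavy-load.lock"  # hypothetical lock file, shared by both jobs

def run_exclusive(cmd):
    """Hold an exclusive advisory lock for the duration of a heavy job.

    Launch both `colcon build ...` and `hailortcli benchmark ...` through
    this wrapper and the second one blocks until the first finishes,
    instead of both landing on the Pi 5 at once.
    """
    with open(LOCK_PATH, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)   # blocks while another job holds the lock
        return subprocess.run(cmd)         # lock released when the file closes
```

Wrapping both commands this way costs nothing when only one job is running and serializes them when two show up.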
The good part
After the reboot, I rebuilt the ROS2 workspace clean. Twenty-four packages — Hiwonder's MentorPi base stack, a half-dozen third-party drivers (apriltag, ldlidar, sllidar, ydlidar, ascamera, teb_local_planner, rf2o_laser_odometry), and our Butterbot overlay — compiled in one minute forty seconds from a cold build/ directory. That's a fair snapshot of how much less code there is to maintain when you vendor the work the Hiwonder team already did, rather than reinventing motor driver nodes and Nav2 parameter tunings from scratch.
Then I ran the Hailo benchmarks, one at a time, with no other load on the box.
Running hailortcli benchmark on yolov8s_h8.hef cranked through 4,644 inference iterations and printed the three numbers that matter: an hw_only FPS of 309.86, a streaming FPS of 309.48, and a hardware latency of 6.66 ms per frame.
Those two different FPS numbers are the thing to read carefully.
hw_only is the throughput the Hailo chip achieves when it is fed as fast as possible — i.e., the chip is never waiting on input. It's the compute-bound FPS.
streaming is the throughput of the whole pipeline, including transferring input frames from host DRAM over PCIe to the Hailo NPU, and transferring outputs back. When streaming and hw_only are nearly equal — as they are here, 309.48 vs 309.86 — it means the PCIe link is not the bottleneck and the chip is genuinely compute-bound at this number. If streaming were significantly lower, it would mean we were PCIe-limited.
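The comparison rule above can be written down as a tiny helper. The function name and the 5% threshold are my own choices, not anything hailortcli reports:

```python
def classify_bottleneck(hw_only_fps: float, streaming_fps: float,
                        tolerance: float = 0.05) -> str:
    """Compare hailortcli's two throughput numbers.

    If streaming FPS is within `tolerance` of hw_only FPS, the PCIe
    transfers are keeping up and the chip is compute-bound; a large
    gap means the host<->NPU link is the pacing element.
    """
    if streaming_fps >= hw_only_fps * (1 - tolerance):
        return "compute-bound"
    return "pcie-bound"

print(classify_bottleneck(309.86, 309.48))  # tonight's numbers → compute-bound
```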
That's a good signal. Let's back-of-envelope it to see why.
PCIe bandwidth budget. The Pi 5's PCIe port runs at Gen 3 x1 once you put dtparam=pciex1_gen=3 in config.txt — which this build has. Gen 3 x1 is 8 GT/s × 128/130 encoding × 1 lane = 985 MB/s usable in each direction.
Per-frame data volume. YOLOv8s input is 640 × 640 × 3 bytes UINT8 = 1.229 MB per input frame. The output tensor from the on-chip NMS unit is up to 160,320 bytes per frame (80 classes × (a per-class box count plus 100 max boxes × 20 bytes each)), but in practice much smaller because most class slots are empty.
At 309 FPS, input bandwidth is 309 × 1.229 MB/s = 380 MB/s, plus maybe 10–20 MB/s of outputs. Total is roughly 400 MB/s — about 40% of the Gen 3 x1 link. The interconnect isn't the ceiling; the Hailo-8 compute pipeline itself is the pacing element, which is exactly what hw_only ≈ streaming told us.
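The bandwidth budget above, written out so it can be re-run with different frame rates or resolutions (values are the post's approximations; the output term here uses the worst-case NMS size, so the utilization lands a few points above the ~40% quoted with typical outputs):

```python
# Back-of-envelope PCIe budget for the Pi 5's Gen 3 x1 link.
GEN3_X1_BYTES_PER_S = 8e9 * (128 / 130) / 8   # 8 GT/s, 128b/130b encoding -> ~985 MB/s
FRAME_IN_BYTES = 640 * 640 * 3                # UINT8 RGB input: ~1.229 MB per frame
FRAME_OUT_BYTES = 160_320                     # worst-case NMS output per frame
FPS = 309

input_bw = FPS * FRAME_IN_BYTES               # ~380 MB/s
output_bw = FPS * FRAME_OUT_BYTES             # ~50 MB/s worst case, far less in practice
link_use = (input_bw + output_bw) / GEN3_X1_BYTES_PER_S

print(f"link: {GEN3_X1_BYTES_PER_S/1e6:.0f} MB/s, "
      f"input: {input_bw/1e6:.0f} MB/s, utilization: {link_use:.0%}")
```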
Compute budget sanity check. YOLOv8s has about 28.6 GFLOPs per inference at 640×640. The Hailo-8 is rated at 26 TOPS INT8 — which is operations, not FLOPs, and TOPS-to-FPS conversion on quantized CNNs is never clean because utilization of the MAC array depends on the model's layer shapes, memory access pattern, and data reuse. A rough ceiling is 26e12 / 28.6e9 = 909 FPS at 100% utilization. We got 309 — about 34% of theoretical, which is genuinely respectable for a real compiled model running full end-to-end with NMS included. The other 66% is the tax for the parts of the model that don't map cleanly onto the Hailo chip's dataflow topology (anchor-free head operations, concat layers, activations that want FP16 precision rather than INT8) plus memory bandwidth between on-chip SRAM and the MAC clusters.
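The same ceiling arithmetic as a two-line check, following the post's convention of dividing rated TOPS by per-frame GFLOPs directly (with the stated caveat that ops and FLOPs are not the same thing):

```python
# Theoretical FPS ceiling from the chip's rated INT8 throughput.
HAILO8_OPS = 26e12        # rated INT8 ops/s
YOLOV8S_OPS = 28.6e9      # ~ops per 640x640 YOLOv8s inference

ceiling_fps = HAILO8_OPS / YOLOV8S_OPS      # ~909 FPS at 100% MAC utilization
measured_fps = 309
utilization = measured_fps / ceiling_fps    # ~0.34

print(f"ceiling: {ceiling_fps:.0f} FPS, utilization: {utilization:.0%}")
```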
Latency. 6.66 ms per frame, as a hardware-only measurement. That's lower than a single vertical refresh of a 144 Hz display. By the time a robot has physically moved any meaningful distance, the inference is done.
Now the other models:
| Model | Resolution | FPS (hw) | Latency | GFLOPs | Utilization |
|-------|------------|----------|---------|--------|-------------|
| yolov8s_h8 | 640×640 | 309 | 6.66 ms | ~28.6 | ~34% |
| yolov6n_h8 | 640×640 | 587 | 3.18 ms | ~11.8 | ~27% |
| yolov8s_pose_h8 | 640×640 | 232 | 10.02 ms | ~35.0 | ~31% |
| scrfd_2.5g_h8l | 640×640 | 177 | 5.10 ms | ~2.5 | ~1.7% |
| yolov5n_seg_h8 | 640×640 | 123 | 11.88 ms | ~4.5 | ~2.1% |
The two models at the bottom of that table have low utilization numbers because their heads don't vectorize cleanly onto the Hailo MAC array — SCRFD's face-detection head has anchor-free operations that break the clean dataflow, and YOLOv5n-seg's prototype-mask generation wants layer shapes the chip doesn't love. Both are latency-bound on specific layers rather than throughput-bound on compute. Useful thing to remember when we start compiling our own custom models: some head architectures leave a lot of NPU utilization on the table even when the overall FLOP count is tiny.
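The utilization column is just measured FPS × per-frame ops against the 26 TOPS rating. A sketch that reproduces it from the table's own numbers (same TOPS-versus-FLOPs caveat as above):

```python
# Reproduce the table's utilization column from FPS and ~GFLOPs per frame.
HAILO8_OPS = 26e12
models = {                      # name: (measured FPS, ~GFLOPs per inference)
    "yolov8s_h8":      (309, 28.6),
    "yolov6n_h8":      (587, 11.8),
    "yolov8s_pose_h8": (232, 35.0),
    "scrfd_2.5g_h8l":  (177, 2.5),
    "yolov5n_seg_h8":  (123, 4.5),
}
for name, (fps, gflops) in models.items():
    util = fps * gflops * 1e9 / HAILO8_OPS
    print(f"{name:16s} {util:5.1%}")
```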
Every model here is 640×640 RGB input, full reference resolution. Not a compromise, not a cropped variant. The Hailo Dataflow Compiler handles quantization calibration, layer fusion, weight packing, and lowering to the chip's native dataflow graph, and the result is a single .hef binary that the userspace driver streams inference requests to over the PCIe link.
Concurrent models: the reason this is interesting
Here's where the numbers become architecturally interesting.
A typical robot-vision camera runs at 30 FPS. The Hailo-8 can run YOLOv8s at 309 FPS, which means at a 30 FPS input rate the chip is running the model for 30/309 = 9.7% of wall-clock time. The rest is idle, waiting for the next frame to arrive.
Same math for the other models:
YOLOv8s @ 30 FPS: 9.7% NPU utilization
YOLOv8s-pose @ 30 FPS: 12.9%
SCRFD face @ 30 FPS: 16.9%
YOLOv5n-seg @ 30 FPS: 24.4%
Total if we run all four concurrently on the same 30 FPS camera feed: 63.9% utilization. The Butterbot Tank's entire survey-robot perception stack — object detection, human pose, face recognition, instance segmentation — fits inside one Hailo-8 chip on a Pi 5, with roughly 35% NPU headroom to spare, and still leaves the Pi's four A76 cores almost entirely free for the rest of the workload.
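The duty-cycle math for the concurrent case, using tonight's benchmark FPS numbers (the sum lands at 64.0% rather than 63.9% because the per-model percentages above were rounded first):

```python
# Duty cycle for running all four perception models on one 30 FPS camera feed.
CAMERA_FPS = 30
model_fps = {                 # measured hw FPS from tonight's benchmarks
    "yolov8s":      309,
    "yolov8s_pose": 232,
    "scrfd_face":   177,
    "yolov5n_seg":  123,
}
total = sum(CAMERA_FPS / fps for fps in model_fps.values())
print(f"combined NPU utilization: {total:.1%}, headroom: {1 - total:.1%}")
```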
"The rest of the workload" is not a small thing. It's ROS2 with its DDS discovery and inter-process communication, Nav2's lifecycle manager and behavior tree, slam_toolbox doing real-time pose graph optimization against the lidar scans, robot_state_publisher maintaining the URDF transform tree, a custom survey pipeline scanning for WiFi/BT beacons and logging to a spatial database, and — whenever we want to invoke it — a llama.cpp-hosted 7B quantized language model doing command parsing, scene description, and lightweight planning. That's a lot of software for a Pi. It gets to have a Pi more or less to itself, because the perception half has moved off the Pi and onto the chip whose shape it actually wanted.
That's the split-brain theory made real.
What it cost
Rough BOM for the vision half:
Raspberry Pi 5 8GB: ~$80
Pi AI HAT+ with Hailo-8 26 TOPS: ~$75 (Raspberry Pi Limited, direct)
PCIe FPC cable: included
Active cooling: ~$10
$165 total. For 26 TOPS INT8 of dedicated inference compute and the CPU to orchestrate an entire ROS2 stack on top of it.
For comparison:
Jetson Orin Nano 8GB dev kit: $499. Also, buying into NVIDIA's CUDA-on-ARM toolchain, which is a business model, not a free choice.
Coral Dev Board: ~$130. Only 4 TOPS, weaker CPU.
Rockchip RK3588 dev boards with NPUs: $150–250. Fragmented toolchains, vendor SDKs of varying quality.
The Pi + Hailo combination is a genuine new sweet spot on the price / TOPS / toolchain-sanity curve. Not a marginal improvement over the alternatives. A different shape of answer.
Where this goes
The Butterbot Tank is the first mobile platform we're building the Butterbot AI stack onto. The perception pipeline's hardware layer is now empirically real. The software layer — the ROS2 overlay, the Hailo inference nodes wired into ROS2 topics, the integration with the Butterbot reasoning brain — is the work for tomorrow.
There's a specific demo I want to get to, which is the real test. Point the tank's depth camera at a room full of objects. Let YOLOv8s and YOLOv5n-seg run on the RGB frames while the Aurora 930's depth pipeline produces aligned distance fields. Fuse them into a spatial object inventory. Then ask a llama.cpp-hosted Butterbot, "what's the closest thing you can see that a human would sit on?" — and have the whole stack, vision inference through depth fusion through object reasoning through natural-language response, complete in under a second. Entirely on-device. No cloud. On a Raspberry Pi that costs less than a nice dinner for four.
That demo needed the NPU bet to work. Tonight it did.
Also the motors still don't move. That's a problem for tomorrow.
— Patrick
