Coral NPU Hardware Architecture
Design Philosophy
The Coral NPU is not a general-purpose CPU with ML acceleration bolted on. It is an ML accelerator with enough scalar capability to run control flow. The design starts from domain-specific matrix operations and adds just enough RISC-V scalar + vector capability to avoid needing a separate host processor.
ISA: rv32imf_zve32x_zicsr_zifencei_zbb
The ISA string encodes exactly what instructions the hardware supports:
| Extension | Purpose |
|---|---|
| I | Base 32-bit integer instructions (ALU, load/store, branches) |
| M | Hardware integer multiply/divide — needed for address calculations in tiled loops |
| F | Single-precision hardware float (FP32) — the NPU's native compute precision. No f16 hardware support; f16 values must be promoted to f32 |
| Zve32x | 128-bit SIMD vector extension — processes 4×f32 or 8×i16 or 16×i8 per instruction |
| Zicsr | Control/status registers — enables reading cycle counters, configuring vector/float enables |
| Zifencei | Instruction-fetch fence — needed for self-modifying code and cache coherency |
| Zbb | Bit manipulation — instructions such as count-leading-zeros, popcount, rotates, and min/max |
The C extension encoding space is reclaimed (per the RISC-V spec’s provision for this) to provide 6-bit vector register indices (supporting 64 vector registers v0..v63) and flexible type encodings for SIMD instructions.
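The encoding trade-off behind reclaiming the C space can be checked with simple arithmetic. This is an illustrative sketch: only the 6-bit register-index width and the 64-register count come from the text above; the three-operand instruction format is an assumption.

```python
# Illustrative encoding-budget arithmetic for a 32-bit instruction word.
# The 6-bit index width and 64 vector registers come from the text;
# the three-operand (vd, vs1, vs2) format is assumed for illustration.
WORD_BITS = 32
OPERANDS = 3

standard_index_bits = OPERANDS * 5  # 5-bit indices address only 32 registers
coral_index_bits = OPERANDS * 6     # 6-bit indices address v0..v63

assert 2 ** 6 == 64                 # 6 bits are exactly enough for 64 registers
remaining = WORD_BITS - coral_index_bits
print(remaining)                    # bits left over for opcode/type encodings
```

The three extra index bits per instruction are what the reclaimed C-extension encoding space pays for, alongside the flexible SIMD type encodings.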
Scalar Core
The scalar core is a 4-stage in-order pipeline:
- Fetch: Reads instructions from the L1 I-Cache. Branch prediction is static — backward branches are predicted taken (loops), forward branches predicted not-taken. Misprediction costs 1 cycle.
- Decode: 4-way dispatch — scalar instructions go to the scalar execute unit, vector/SIMD instructions are posted to the command FIFO. The scalar core can dispatch up to 4 instructions per cycle when there are no data hazards.
- Execute: ALU, FPU, load/store. The scalar core handles all control flow, address calculations, and loop iteration. It is deliberately simple — no speculation, no out-of-order execution.
- Writeback: Results written to the 31 scalar registers (x1..x31; x0 is hardwired to zero).
The scalar core is designed to be small — it is overhead that steals area and power from the ML backend. Its job is to feed the vector/MAC units with work as efficiently as possible.
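The static prediction rule in the fetch stage (backward taken, forward not-taken) can be sketched as a tiny cost model. The 1-cycle misprediction penalty comes from the pipeline description above; the function names and addresses are illustrative.

```python
def predict_taken(pc: int, target: int) -> bool:
    """Static prediction: backward branches (loop back-edges) are predicted
    taken; forward branches are predicted not-taken."""
    return target < pc

def branch_cost(pc: int, target: int, actually_taken: bool) -> int:
    """1-cycle penalty on misprediction, per the pipeline description."""
    return 0 if predict_taken(pc, target) == actually_taken else 1

# A loop back-edge that is taken: predicted correctly, no penalty.
assert branch_cost(pc=0x100, target=0x0F0, actually_taken=True) == 0
# The final loop exit (backward branch falls through): one mispredict cycle.
assert branch_cost(pc=0x100, target=0x0F0, actually_taken=False) == 1
```

Under this rule a loop that iterates N times mispredicts only once, on exit, which is why static prediction is a good fit for the tiled-loop code the NPU runs.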
Vector/SIMD Engine
The vector core is decoupled from the scalar frontend via a FIFO. This means the scalar core can run ahead, queuing vector instructions, while the vector engine processes them independently.
Registers
| Register Bank | Count × Width | Usage |
|---|---|---|
| Vector registers | 64 × 256-bit | v0..v63. Each holds 8×i32, 16×i16, or 32×i8. Used for SIMD ALU ops and MAC input staging. |
| Accumulator | 8×8 × 32-bit | 2D accumulator array for the MAC engine. Holds partial sums during outer-product matrix multiplies. |
Stripmining
To reduce instruction dispatch pressure, SIMD instructions include an implicit stripmine mechanism. A single vadd v0 instruction in the dispatch stage expands into four serialized issue events:
vadd v0 → vadd v0 : vadd v1 : vadd v2 : vadd v3
This provides 4× throughput from a single dispatch slot, supporting instruction-level tiling patterns through consecutive vector registers.
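The dispatch-to-issue expansion above can be modeled as a simple rewrite. This is a behavioral sketch; the textual instruction format is illustrative, and the 4× stripmine factor comes from the text.

```python
def stripmine(op: str, reg: int, factor: int = 4) -> list[str]:
    """Expand one dispatched SIMD instruction into `factor` serialized
    issue events targeting consecutive vector registers."""
    return [f"{op} v{reg + i}" for i in range(factor)]

# One dispatch slot yields four issue events:
assert stripmine("vadd", 0) == ["vadd v0", "vadd v1", "vadd v2", "vadd v3"]
```

This is why kernels tile their data across groups of four consecutive vector registers: a single dispatched instruction then covers the whole group.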
MAC Engine (Matrix Multiply Unit)
The MAC engine is the core ML accelerator. It implements an outer-product multiply-accumulate, which maximizes compute density relative to memory bandwidth.
Why Outer Product?
An outer-product engine provides a two-dimensional broadcast structure:
- Wide axis (parallel broadcast): convolution weights are broadcast to all compute lanes simultaneously
- Narrow axis (transposed, shifted): input activations from multiple spatial positions (or batches) feed different lanes
This means the MAC engine computes an 8×8 tile of the output matrix per cycle. With each of the 64 lanes performing a 4-element int8 dot product, that is 256 multiply-accumulate operations from only 16 packed-word reads (8 weight words + 8 activation words), a 16:1 compute-to-memory ratio.
The VDOT operation performs 4 int8 multiplies accumulated into a 32-bit result — the fundamental operation for quantized neural networks.
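A behavioral sketch of one MAC-engine cycle, assuming each of the 8×8 lanes performs a VDOT-style 4-element int8 dot product accumulated into int32. Register loading, saturation, and rounding behavior are not modeled, and the function name is illustrative.

```python
def mac_cycle(weights, activations, acc):
    """One outer-product MAC cycle.

    weights, activations: 8 lists of 4 int8 values each (one packed
    32-bit word per lane). acc: 8x8 list of int32 partial sums,
    updated in place.
    """
    for r in range(8):
        for c in range(8):
            # VDOT: 4 int8 multiplies accumulated into a 32-bit sum.
            acc[r][c] += sum(w * a for w, a in zip(weights[r], activations[c]))
    return acc

weights = [[1, 2, 3, 4]] * 8          # 8 packed weight words
activations = [[1, 1, 1, 1]] * 8      # 8 packed activation words
acc = [[0] * 8 for _ in range(8)]
mac_cycle(weights, activations, acc)
assert acc[0][0] == 10                # 1*1 + 2*1 + 3*1 + 4*1

# 16 packed-word reads drive 8*8*4 = 256 multiply-accumulates: 16:1.
assert (8 * 8 * 4) // 16 == 16
```

The accumulator persists across calls, which mirrors how partial sums stay resident in the 8×8 accumulator array while the K dimension of a matrix multiply is streamed through.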
Memory Hierarchy
Memory Constraints for Model Deployment
The memory sizes impose hard constraints on what models can run:
| Memory | Size | Practical Limit |
|---|---|---|
| ITCM | 8 KB | ~2000 instructions. The generated code for RGB-to-grayscale is ~200 instructions. Complex models with many distinct operations may not fit. |
| DTCM | 32 KB | ~8000 float32 values. A 4×4 RGB image (48 floats = 192 bytes) fits easily. A 224×224 RGB image (150,528 floats = 602 KB) does NOT fit — requires tiling or EXTMEM. |
| EXTMEM | 4 MB | Usable for larger tensors but at higher latency. The linker script places |
For real-world model deployment, this means:
- Weights must be quantized (int8 uses 4× less memory than float32)
- Activations must be tiled (process a spatial region at a time, not the whole tensor)
- Models must be architecture-aware (MobileNet-style depthwise separable convolutions are preferred over standard convolutions)
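These constraints reduce to back-of-the-envelope arithmetic. The sizes below come from the table above; the 32×32 tile shape is an illustrative choice, not a documented parameter.

```python
DTCM_BYTES = 32 * 1024  # 32 KB data tightly-coupled memory

def tensor_bytes(h: int, w: int, c: int, dtype_bytes: int) -> int:
    """Footprint of an HxWxC tensor with the given element size."""
    return h * w * c * dtype_bytes

# A 4x4 RGB float32 image fits easily; a 224x224 one does not.
assert tensor_bytes(4, 4, 3, 4) == 192
assert tensor_bytes(224, 224, 3, 4) > DTCM_BYTES   # 602,112 bytes

# int8 quantization cuts the footprint 4x, but 224x224 still needs tiling:
assert tensor_bytes(224, 224, 3, 1) > DTCM_BYTES   # 150,528 bytes

# An illustrative 32x32x3 float32 tile (12,288 bytes) fits with room
# to spare for weights and output buffers.
assert tensor_bytes(32, 32, 3, 4) < DTCM_BYTES
```

Running the same arithmetic at model-design time is how tile shapes and quantization choices get picked before any code is generated.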
Execution Model
The NPU runs a run-to-completion model:
- No operating system — bare-metal execution
- No interrupts — the program runs until `ebreak`
- No dynamic memory allocation — all arrays are statically sized at compile time
- No I/O syscalls — data is passed via shared memory (the simulator writes to symbol addresses)
This simplicity is deliberate. An ML inference workload has a fixed, known computation graph. There is no need for OS-level scheduling, virtual memory, or interrupt handling. Removing these features saves area, power, and latency.
Simulators
Two simulators are available for testing without physical hardware:
- MPACT Behavioral Simulator: Fast, symbolic execution. Runs programs in milliseconds. Provides cycle counts and memory access traces. Used for functional verification and rapid iteration. A Python API is available for scripted testing.
- Verilator Cycle-Accurate Simulator: Slow but cycle-accurate. Simulates the actual Verilog/Chisel hardware design. Used for timing analysis and hardware/software co-design. Execution takes minutes to hours depending on program complexity.
Both accept ELF binaries. The MPACT simulator is the standard development tool; Verilator is used for final hardware validation.