Coral NPU Hardware Architecture

Design Philosophy

The Coral NPU is not a general-purpose CPU with ML acceleration bolted on. It is an ML accelerator with enough scalar capability to run control flow. The design starts from domain-specific matrix operations and adds just enough RISC-V scalar + vector capability to avoid needing a separate host processor.

Coral NPU Core (block diagram)

Scalar Frontend:

  • L1 I-Cache: 8KB, 4-way
  • Fetch: backward-taken branch prediction
  • Decode: 4-way dispatch
  • Execute: rv32imf

Vector/SIMD Backend (receives vector instructions through a command FIFO, decoupled from the scalar core):

  • Vector ALU: 128-bit SIMD
  • MAC Engine: 8×8 outer product, 256 MACs/cycle
  • Vector Regfile: v0..v63 × 256-bit
  • Accumulator: 8×8 × 32-bit

Memory Hierarchy:

  • L1 D-Cache: 16KB, 4-way, dual-bank
  • ITCM: 8KB
  • DTCM: 32KB
  • EXTMEM: 4MB

ISA: rv32imf_zve32x_zicsr_zifencei_zbb

The ISA string encodes exactly what instructions the hardware supports:

  • rv32i: Base 32-bit integer instructions (ALU, load/store, branches)
  • m: Hardware integer multiply/divide, needed for address calculations in tiled loops
  • f: Single-precision hardware float (FP32), the NPU’s native compute precision. There is no f16 hardware support; f16 values must be promoted to f32
  • zve32x: 128-bit SIMD vector extension, processing 4×f32, 8×i16, or 16×i8 per instruction
  • zicsr: Control/status registers, enabling cycle-counter reads and vector/float enable configuration
  • zifencei: Instruction-fetch fence, needed for self-modifying code and instruction-cache coherence
  • zbb: Bit manipulation (clz, ctz, popcount, min/max, rotate), used in address generation and loop control
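As an illustration of where zbb pays off in address generation, a power-of-two round-up for sizing aligned tile buffers can be written with a count-leading-zeros. This is a sketch using GCC/Clang builtins, not code from the NPU toolchain; on an rv32 zbb target the builtin compiles to a single clz instruction.

```c
#include <stdint.h>

// Round x up to the next power of two, e.g. for aligning tile buffers.
// With zbb, __builtin_clz becomes one clz instruction; without it the
// compiler emits a multi-instruction fallback sequence.
static uint32_t next_pow2(uint32_t x) {
    if (x <= 1) return 1;
    return 1u << (32 - (uint32_t)__builtin_clz(x - 1));
}
```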

The C extension encoding space is reclaimed (per the RISC-V spec’s provision for this) to provide 6-bit vector register indices (supporting 64 vector registers v0..v63) and flexible type encodings for SIMD instructions.

Scalar Core

The scalar core is a 4-stage in-order pipeline:

  1. Fetch: Reads instructions from L1 I-Cache. Branch prediction is static — backward branches are predicted taken (loops), forward branches predicted not-taken. Misprediction costs 1 cycle.

  2. Decode: 4-way dispatch — scalar instructions go to the scalar execute unit, vector/SIMD instructions are posted to the command FIFO. The scalar core can dispatch up to 4 instructions per cycle when there are no data hazards.

  3. Execute: ALU, FPU, load/store. The scalar core handles all control flow, address calculations, and loop iteration. It is deliberately simple — no speculation, no out-of-order execution.

  4. Writeback: Results written to the 31 scalar registers (x1..x31, x0 is hardwired zero).

The scalar core is designed to be small — it is overhead that steals area and power from the ML backend. Its job is to feed the vector/MAC units with work as efficiently as possible.

Vector/SIMD Engine

The vector core is decoupled from the scalar frontend via a FIFO. This means the scalar core can run ahead, queuing vector instructions, while the vector engine processes them independently.
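The decoupling can be pictured as a small ring-buffer command queue. The sketch below is a host-side behavioral model; the queue depth and command encoding are illustrative assumptions, not documented hardware values.

```c
#include <stdint.h>
#include <stdbool.h>

#define FIFO_DEPTH 16u  // illustrative depth, not the actual hardware value

// One queued vector command: opcode plus register operands.
typedef struct { uint8_t op, vd, vs1, vs2; } vcmd_t;

typedef struct {
    vcmd_t slots[FIFO_DEPTH];
    uint32_t head, tail;  // scalar core pushes at tail, vector engine pops at head
} cmd_fifo_t;

// Scalar frontend: post a command and keep running; returns false when full
// (the real frontend would stall the dispatch stage instead).
static bool fifo_push(cmd_fifo_t *f, vcmd_t c) {
    if (f->tail - f->head == FIFO_DEPTH) return false;
    f->slots[f->tail % FIFO_DEPTH] = c;
    f->tail++;
    return true;
}

// Vector backend: drain the next command when ready.
static bool fifo_pop(cmd_fifo_t *f, vcmd_t *out) {
    if (f->head == f->tail) return false;
    *out = f->slots[f->head % FIFO_DEPTH];
    f->head++;
    return true;
}
```

Because producer and consumer touch only their own index, the scalar core can run ahead and queue work while the vector engine drains it at its own pace.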

Registers

  • Vector registers, 64 × 256-bit: v0..v63. Each holds 8×i32, 16×i16, or 32×i8. Used for SIMD ALU ops and MAC input staging.
  • Accumulator, 8×8 × 32-bit: 2D accumulator array for the MAC engine. Holds partial sums during outer-product matrix multiplies.

Stripmining

To reduce instruction dispatch pressure, SIMD instructions include an implicit stripmine mechanism. A single vadd v0 instruction in the dispatch stage expands into four serialized issue events:

vadd v0 → vadd v0 : vadd v1 : vadd v2 : vadd v3

This provides 4× throughput from a single dispatch slot, supporting instruction-level tiling patterns through consecutive vector registers.
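A behavioral sketch of the stripmine expansion, using the register widths from the table above (the actual issue logic is hardware-internal; this only models the visible effect):

```c
#include <stdint.h>

#define NREGS 64
#define LANES 8  // one 256-bit register = 8 x i32 lanes

static int32_t vreg[NREGS][LANES];  // behavioral model of v0..v63

// One dispatched "vadd vd, vs1, vs2" expands into four serialized issue
// events covering vd..vd+3, vs1..vs1+3, vs2..vs2+3.
static void vadd_stripmined(int vd, int vs1, int vs2) {
    for (int m = 0; m < 4; m++)          // the four issue events
        for (int l = 0; l < LANES; l++)  // SIMD lanes within one issue
            vreg[vd + m][l] = vreg[vs1 + m][l] + vreg[vs2 + m][l];
}
```

One dispatch slot thus retires 32 lane-additions, which is why tiling code is laid out across groups of four consecutive vector registers.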

MAC Engine (Matrix Multiply Unit)

The MAC engine is the core ML accelerator. It implements an outer-product multiply-accumulate, which maximizes compute density relative to memory bandwidth:

The outer-product datapath is staged as follows:

  • Input staging: weights enter on the wide axis, broadcast in parallel across 8 lanes; activations enter on the narrow axis, shifted in and batched across 8 lanes
  • 8 VDOT units: each performs a 4× int8 multiply → int32 accumulate
  • 8×8 accumulator array: 256 MACs/cycle

Why Outer Product?

An outer-product engine provides two-dimensional broadcast structures:

  • Wide axis (parallel broadcast): convolution weights are broadcast to all compute lanes simultaneously

  • Narrow axis (transposed, shifted): input activations from multiple spatial positions (or batches) feed different lanes

This means the MAC engine computes an 8×8 tile of the output matrix per cycle, performing 256 multiply-accumulate operations with only 16 memory reads (8 weights + 8 activations). This is a 16:1 compute-to-memory ratio.

The VDOT operation performs 4 int8 multiplies accumulated into a 32-bit result — the fundamental operation for quantized neural networks.
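A behavioral C sketch of one MAC-engine cycle, with vdot4 as the per-cell operation. This models the dataflow described above, not the hardware implementation:

```c
#include <stdint.h>

// VDOT: four int8 x int8 products accumulated into one int32.
static int32_t vdot4(const int8_t a[4], const int8_t b[4], int32_t acc) {
    for (int k = 0; k < 4; k++)
        acc += (int32_t)a[k] * (int32_t)b[k];
    return acc;
}

// One outer-product cycle: 8 weight words x 8 activation words (each word
// packing 4 int8 values) update an 8x8 tile of partial sums -- 256 MACs
// from only 16 memory reads.
static void mac_cycle(int8_t w[8][4], int8_t x[8][4], int32_t acc[8][8]) {
    for (int r = 0; r < 8; r++)      // weight broadcast axis
        for (int c = 0; c < 8; c++)  // activation batch axis
            acc[r][c] = vdot4(w[r], x[c], acc[r][c]);
}
```

Calling mac_cycle repeatedly over the reduction dimension accumulates the full dot products in place, which is exactly the role of the 8×8×32-bit accumulator array.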

Memory Hierarchy

On-chip TCMs (fast, small):

  • ITCM: 8KB at 0x00000000, holds .text and .rodata
  • DTCM: 32KB at 0x00010000, holds .data, .bss, heap, and stack

Cache layer:

  • L1 I-Cache: 8KB, 4-way, 256B blocks
  • L1 D-Cache: 16KB, 4-way, dual-bank, with an alignment buffer

Off-chip (slow, larger):

  • EXTMEM: 4MB at 0x20000000, holds .extdata and .extbss (the overflow target when the TCMs fill)

Memory Constraints for Model Deployment

The memory sizes impose hard constraints on what models can run:

  • ITCM (8 KB): ~2000 instructions. The generated code for RGB-to-grayscale is ~200 instructions; complex models with many distinct operations may not fit.
  • DTCM (32 KB): ~8000 float32 values. A 4×4 RGB image (48 floats = 192 bytes) fits easily; a 224×224 RGB image (150,528 floats = 602 KB) does NOT fit and requires tiling or EXTMEM.
  • EXTMEM (4 MB): Usable for larger tensors but at higher latency. The linker script places .extdata and .extbss here when DTCM overflows.

For real-world model deployment, this means:

  • Weights must be quantized (int8 uses 4× less memory than float32)

  • Activations must be tiled (process a spatial region at a time, not the whole tensor)

  • Models must be architecture-aware (MobileNet-style depthwise separable convolutions are preferred over standard convolutions)
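The tiling constraint can be made concrete with a small budget calculation. This is a sketch under simplifying assumptions: the full 32 KB of DTCM is treated as available (ignoring stack and heap), and the workload is the RGB-to-grayscale case, so the working set per pixel is 3 input floats plus 1 output float (16 bytes).

```c
#include <stddef.h>

#define DTCM_BYTES (32 * 1024)

// Largest square tile side whose working set (3 float32 input channels +
// 1 float32 output channel = 16 bytes per pixel) fits in the byte budget.
// Helper name and budget model are illustrative, not from the toolchain.
static size_t max_tile_side(size_t budget_bytes) {
    size_t side = 1;
    while ((side + 1) * (side + 1) * (3 + 1) * sizeof(float) <= budget_bytes)
        side++;
    return side;
}
```

Under these assumptions a 45×45 tile fits in DTCM while a full 224×224 frame does not, which is why the 224×224 case above must be streamed tile by tile (or spilled to EXTMEM).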

Execution Model

The NPU follows a run-to-completion execution model:

  1. No operating system — bare-metal execution

  2. No interrupts — the program runs until ebreak

  3. No dynamic memory allocation — all arrays are statically sized at compile time

  4. No I/O syscalls — data is passed via shared memory (the simulator writes to symbol addresses)

This simplicity is deliberate. An ML inference workload has a fixed, known computation graph. There is no need for OS-level scheduling, virtual memory, or interrupt handling. Removing these features saves area, power, and latency.
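A minimal bare-metal kernel in this style might look as follows. Names and buffer shapes are illustrative; the BT.601 luma weights are an assumption for the RGB-to-grayscale example, and the final ebreak is target-specific, so it appears only as a comment.

```c
#include <stdint.h>

// All buffers statically sized at compile time; the simulator reads and
// writes these symbols directly via shared memory -- no syscalls, no malloc.
static float input[4 * 4 * 3];  // 4x4 RGB image, interleaved channels
static float output[4 * 4];     // 4x4 grayscale result

// RGB-to-grayscale, assuming standard BT.601 luma weights.
static void grayscale(void) {
    for (int i = 0; i < 16; i++)
        output[i] = 0.299f * input[3 * i]
                  + 0.587f * input[3 * i + 1]
                  + 0.114f * input[3 * i + 2];
}

// On hardware, main() would call grayscale() and then execute ebreak to
// signal run-to-completion to the simulator.
```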

Simulators

Two simulators are available for testing without physical hardware:

MPACT Behavioral Simulator

Fast, symbolic execution. Runs programs in milliseconds. Provides cycle counts and memory access traces. Used for functional verification and rapid iteration. Python API available for scripted testing.

Verilator Cycle-Accurate Simulator

Slow but cycle-accurate. Simulates the actual Verilog/Chisel hardware design. Used for timing analysis and hardware/software co-design. Execution takes minutes to hours depending on program complexity.

Both accept ELF binaries. The MPACT simulator is the standard development tool; Verilator is used for final hardware validation.