pccx: Parallel Compute Core eXecutor

Notice: Active Development in Progress. pccx is a scalable, modular Neural Processing Unit (NPU) architecture designed to accelerate Transformer-based large language models (LLMs) on resource-constrained edge devices.

1. Architecture Overview

pccx is a hardware-software co-design framework for autoregressive Transformer-LLM decoding on resource-constrained edge devices. The core architecture is sized at synthesis time to match the DSP, BRAM, and URAM budget of each target device. The primary target is the Xilinx Kria KV260 SoM (Zynq UltraScale+ ZU5EV).

1.1 Ecosystem Structure

The project is structured in three layers so that the same logic can be resynthesized for a different device or driven by a different host stack.

/architecture (Logic Layer) — core RTL and generate parameters.
- Defines the logical pipeline, instruction scheduling, and the custom 64-bit ISA.
- Independent of any specific hardware vendor or interface protocol.
/device (Implementation Layer) — maps the pccx architecture onto a specific hardware target.
- Adjusts core count, systolic-array dimensions, and memory port widths to the available resource budget (DSP count, local memory size, etc.).
/driver (Software Layer) — a C/C++ hardware abstraction layer (HAL) and high-level API.
- Handles instruction dispatch and memory mapping, bridging high-level AI models with the pccx hardware.
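The synthesis-time sizing described for the /device layer can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the `DeviceBudget` fields, the resource figures for the KV260, the one-DSP-per-MAC cost, the reserved fraction, and the power-of-two rounding are hypothetical and do not describe the actual pccx generate parameters.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceBudget:
    """Resource budget of a target device (hypothetical structure)."""
    name: str
    dsp_slices: int
    bram_blocks: int
    uram_blocks: int


def size_systolic_array(budget: DeviceBudget,
                        dsp_per_mac: int = 1,
                        reserve_fraction: float = 0.25) -> int:
    """Return the side length of the largest power-of-two square array
    whose MAC units fit in the DSP budget.

    dsp_per_mac and reserve_fraction are illustrative assumptions: the
    reserve models DSPs held back for other cores and glue logic.
    """
    usable_dsps = int(budget.dsp_slices * (1.0 - reserve_fraction))
    max_macs = usable_dsps // dsp_per_mac
    side = math.isqrt(max_macs)          # largest square that fits
    if side == 0:
        return 0
    return 1 << (side.bit_length() - 1)  # round down to a power of two


# Assumed figures for illustration; confirm against the device datasheet.
KV260 = DeviceBudget(name="kv260", dsp_slices=1248,
                     bram_blocks=144, uram_blocks=64)

if __name__ == "__main__":
    print(KV260.name, size_systolic_array(KV260))
```

A real implementation layer would also have to weigh BRAM and URAM capacity and memory port widths; this sketch considers the DSP count alone.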

2. Key Technical Features

2.1 Decoupled Dataflow & Custom ISA

pccx uses a custom 64-bit ISA tuned for matrix and vector operations. A decoupled-dataflow pipeline separates instruction decode from execution to reduce dispatch-side stalls.
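To make the idea of a fixed-width 64-bit instruction word concrete, the sketch below packs and unpacks one. The field layout (8-bit opcode, three 8-bit operand fields, 32-bit immediate) and the names `encode` and `decode` are invented for this example; the real pccx encoding is defined in the ISA specification and may differ entirely.

```python
# Hypothetical field layout, most significant bits first:
#   [63:56] opcode | [55:48] dst | [47:40] src0 | [39:32] src1 | [31:0] imm
FIELDS = (
    ("opcode", 56, 8),
    ("dst", 48, 8),
    ("src0", 40, 8),
    ("src1", 32, 8),
    ("imm", 0, 32),
)


def encode(**values: int) -> int:
    """Pack named fields into a single 64-bit instruction word."""
    word = 0
    for name, shift, bits in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
        word |= value << shift
    return word


def decode(word: int) -> dict:
    """Unpack a 64-bit instruction word into its named fields."""
    if not 0 <= word < (1 << 64):
        raise ValueError("instruction word must fit in 64 bits")
    return {name: (word >> shift) & ((1 << bits) - 1)
            for name, shift, bits in FIELDS}


if __name__ == "__main__":
    word = encode(opcode=0x01, dst=2, src0=3, src1=4, imm=128)
    print(f"{word:#018x}", decode(word))
```

A fixed-width word of this kind keeps decode simple, which fits a pipeline where decode runs separately from execution.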

2.2 W4A8 Dynamic Precision Promotion

pccx keeps dense compute in low-precision integer arithmetic for efficiency and raises precision only where accuracy is most sensitive:

Compute: a parallel 2D systolic array executes dense INT4 (weight) × INT8 (activation) operations.
Promotion: during non-linear operations (Softmax, RMSNorm, GELU), the CVO core automatically promotes precision to BF16 / FP32 so numerical integrity is preserved.
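The compute-then-promote flow can be sketched in software as follows. This is a numerical illustration only, not the pccx datapath: the per-tensor scales, the function names, and the wide integer accumulator are assumptions, and Python floats (FP64) stand in for the BF16 / FP32 used by the hardware.

```python
import math

INT4_MIN, INT4_MAX = -8, 7
INT8_MIN, INT8_MAX = -128, 127


def quantize(values, scale, lo, hi):
    """Symmetric quantization: divide by the scale, round, clamp to [lo, hi]."""
    return [max(lo, min(hi, round(v / scale))) for v in values]


def w4a8_dot(weights_q, acts_q):
    """INT4 x INT8 dot product accumulated in a wide integer."""
    acc = 0
    for w, a in zip(weights_q, acts_q):
        if not (INT4_MIN <= w <= INT4_MAX and INT8_MIN <= a <= INT8_MAX):
            raise ValueError("operand outside INT4/INT8 range")
        acc += w * a
    return acc


def promote(acc, w_scale, a_scale):
    """Rescale the integer accumulator to floating point."""
    return acc * w_scale * a_scale


def softmax(logits):
    """Numerically stable softmax, evaluated in floating point."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    w_scale, a_scale = 0.125, 0.5                        # assumed scales
    acts = quantize([1.0, -2.0, 3.5], a_scale, INT8_MIN, INT8_MAX)
    rows = [[0.5, -1.0, 0.25], [0.125, 0.75, -0.5]]      # two weight rows
    logits = [
        promote(w4a8_dot(quantize(r, w_scale, INT4_MIN, INT4_MAX), acts),
                w_scale, a_scale)
        for r in rows
    ]
    print(logits, softmax(logits))
```

The integer part is exact; rounding error enters only at quantization and in the floating-point non-linear step, which is why that step benefits from the wider format.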

2.3 Tiered Memory Hierarchy

Matrix core: dedicated GEMM, with a scalable array size.
Vector core: GEMV and element-wise operations.
Shared interconnect: a flexible bus that lets cores and local caches access each other concurrently without arbitration overhead.
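One way to see why GEMM and GEMV place different demands on hardware is to compare their arithmetic intensity, that is, operations per byte of weight data fetched. The sketch below is a back-of-the-envelope model, not a statement of the pccx design rationale: it counts weight traffic only, assumes 4-bit weights packed two per byte, and the function names are made up for this example.

```python
def weight_bytes(n: int, k: int, weight_bits: int = 4) -> float:
    """Bytes needed to stream an n x k weight matrix once."""
    return n * k * weight_bits / 8


def arithmetic_intensity(m: int, n: int, k: int, weight_bits: int = 4) -> float:
    """Operations per weight byte for an (m x k) @ (k x n) product.

    Each multiply-accumulate counts as 2 operations. Activation and
    output traffic are ignored (a simplifying assumption).
    """
    ops = 2 * m * n * k
    return ops / weight_bytes(n, k, weight_bits)


if __name__ == "__main__":
    # m = 1 models single-token decoding (GEMV); larger m models batched GEMM.
    for m in (1, 16, 64):
        print(m, arithmetic_intensity(m, 4096, 4096))
```

Under this model the intensity is 4·m operations per byte, so a single-token GEMV is limited by memory bandwidth while a multi-row GEMM can keep a large array busy.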

3. Documentation

Detailed technical specifications for the active v002 line live under pccx v002 Architecture:

Instruction Set Architecture (ISA) — 64-bit custom instruction set.
Hardware Architecture — hardware architecture and floorplan.
Software Stack — driver and SDK documentation.

Working tracks for the next release lines:

pccx v003: LLM line continued on a separate RTL repository (working name pccx-LLM-v003, Gemma 4 E4B foundation; no release branch yet).
pccx vision-v001: parallel CNN inference track on the same KV260 substrate (working name pccx-vision-v001; first model candidates ResNet18 / YOLOv8n / MobileNetV3).

The Roadmap summarises how the three tracks relate, and the pccx family-tree figure on that page links them visually.

The v001 architecture is archived at Archive: v001 Experimental Architecture.

4. License

Licensed under the Apache License 2.0, which permits use, modification, and redistribution and includes an express patent license from contributors. That patent grant reduces patent-related risk for users and keeps the ecosystem safe for open-source hardware development.

5. Ecosystem

RTL Implementation

github.com/pccxai/pccx-FPGA-NPU-LLM-kv260

The active v002 SystemVerilog sources — ISA package, controller, compute cores (GEMM / GEMV / CVO), memory hierarchy. Target device is the Xilinx Kria KV260 (Zynq UltraScale+ ZU5EV).

Every v002 RTL reference page on this site links back to the exact .sv file in that repository.

pccx-LLM-v003 (working)

github.com/pccxai/pccx-LLM-v003

LLM line continued on a separate RTL repository. Foundation Gemma 4 E4B; no release branch yet. See pccx v003 — LLM line, separate RTL repository.

pccx-vision-v001 (working)

github.com/pccxai/pccx-vision-v001

Parallel CNN inference track on the same KV260 substrate. First model candidates ResNet18 / YOLOv8n / MobileNetV3. See pccx vision-v001 — CNN inference track on KV260.

Documentation source

github.com/pccxai/pccx — the Sphinx project powering this site.

pccx-lab (verify / profile)

pccx-lab is a Tauri 2 IDE that provides a .pccx trace loader, a run_verification runner, Roofline / Bottleneck cards, and a Vivado synthesis report view. See the verification workflow guide.

Author portfolio

hkimw.github.io/hkimw — blog, other projects, about.
