Shared Library

The v002 RTL isolates common numeric operations and data structures under the Library/ directory. In the compile order recorded by filelist.f, these files appear immediately after the package tier (A–D) and before isa_pkg. Compute cores that need these operations import them from the shared library rather than duplicating the implementation.

Algorithms Package

algorithms_pkg (Library/Algorithms/Algorithms.sv): Defines the QUEUE status struct queue_stat_t, a packed struct with two 1-bit fields, empty and full. External logic that needs to inspect queue state uses this type rather than reading the raw signals directly. A STACK entry is reserved as a commented-out stub.

Listing 27 Library/Algorithms/Algorithms.sv
package algorithms_pkg;

  /*─────────────────────────────────────────────
  QUEUE
  ─────────────────────────────────────────────*/
  typedef struct packed {
    logic empty;
    logic full;
  } queue_stat_t;

  /*─────────────────────────────────────────────
  STACK
  ─────────────────────────────────────────────*/
  // typedef struct packed { ... } stack_stat_t;

bf16_math_pkg (Library/Algorithms/BF16_math.sv): Provides BF16 arithmetic as a SystemVerilog package. The file header documents the bit layout: [15]=sign, [14:7]=exp(8b), [6:0]=mantissa(7b). The hidden bit (implicit leading 1) is not stored.

Exposed types and functions:

  • bf16_t — Packed struct with a 1-bit sign, 8-bit exponent, and 7-bit mantissa.

  • bf16_aligned_t — Packed struct holding an 8-bit emax and a 24-bit two’s-complement aligned value.

  • to_bf16(raw[15:0]) — Automatic function that casts a raw 16-bit value to bf16_t.

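  • align_to_emax(val, emax) — Aligns a BF16 value to a given emax and returns a 24-bit two’s-complement integer. Restores the hidden bit, shifts the 23-bit magnitude {1, mantissa, 15'd0} right by diff = emax - val.exp (bits shifted out are dropped), then negates the result if the sign bit is set.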

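  • bf16_add(a[15:0], b[15:0]) — Adds two packed BF16 values and returns a packed BF16 result. Aligns both operands to the larger exponent, sign-extends the 24-bit aligned values into a 25-bit signed sum, then renormalises by locating the leading 1 of the magnitude. The result mantissa is truncated, not rounded, and an exact zero sum returns 16'd0. Denormal, NaN, and Inf handling are not included; per the source comment, softmax on the autoregressive decode path uses normalised BF16 operands, so these corner cases are not expected to occur.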
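
The field split and the alignment step can be cross-checked against a small software model. The following Python sketch mirrors to_bf16 and align_to_emax from the package bit for bit. It is not part of the RTL: the use of plain Python integers and the helper as_signed24 are assumptions made for illustration.

```python
def to_bf16(raw):
    """Split a raw 16-bit word into (sign, exp, mantissa) fields."""
    raw &= 0xFFFF
    return (raw >> 15) & 0x1, (raw >> 7) & 0xFF, raw & 0x7F


def align_to_emax(raw, emax):
    """Return the 24-bit two's-complement pattern of `raw` aligned to `emax`."""
    sign, exp, mantissa = to_bf16(raw)
    diff = (emax - exp) & 0xFF                      # 8-bit unsigned difference, as in the RTL
    mag = ((1 << 22) | (mantissa << 15)) >> diff    # hidden bit at bit 22; shifted-out bits are dropped
    return (-mag) & 0xFFFFFF if sign else mag       # negate in 24 bits when the sign bit is set


def as_signed24(value):
    """Hypothetical inspection helper: read a 24-bit pattern as a signed integer."""
    return value - (1 << 24) if value & 0x800000 else value


if __name__ == "__main__":
    # 0x3F80 is +1.0 (exp = 127, mantissa = 0).
    print(hex(align_to_emax(0x3F80, 127)))
    print(hex(align_to_emax(0x3F80, 128)))
    print(as_signed24(align_to_emax(0xBF80, 127)))
```

Aligning +1.0 to its own exponent leaves the hidden bit at bit 22; each step of emax above the operand's exponent halves the aligned magnitude, and a difference of 23 or more flushes it to zero.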

Listing 28 Library/Algorithms/BF16_math.sv
package bf16_math_pkg;

  /*─────────────────────────────────────────────
  BF16 struct
  [15]=sign  [14:7]=exp(8b)  [6:0]=mantissa(7b)
  hidden bit is implicit (not stored)
  ─────────────────────────────────────────────*/
  typedef struct packed {
    logic       sign;
    logic [7:0] exp;
    logic [6:0] mantissa;
  } bf16_t;

  /*─────────────────────────────────────────────
  Aligned output
  24-bit 2's complement
  ─────────────────────────────────────────────*/
  typedef struct packed {
    logic [7:0]  emax;
    logic [23:0] val;
  } bf16_aligned_t;

  /*─────────────────────────────────────────────
  cast raw 16-bit → bf16_t
  ─────────────────────────────────────────────*/
  function automatic bf16_t to_bf16(input logic [15:0] raw);
    return bf16_t'{sign: raw[15], exp: raw[14:7], mantissa: raw[6:0]};
  endfunction

  /*─────────────────────────────────────────────
  align one BF16 value to a given emax
  returns 24-bit 2's complement
  ─────────────────────────────────────────────*/
  function automatic logic [23:0] align_to_emax(input bf16_t val, input logic [7:0] emax);
    logic [ 7:0] diff;
    logic [22:0] mag;
    logic [23:0] result;

    diff   = emax - val.exp;
    mag    = ({1'b1, val.mantissa, 15'd0}) >> diff;
    result = val.sign ? (~{1'b0, mag} + 24'd1) : {1'b0, mag};
    return result;
  endfunction

  /*─────────────────────────────────────────────
  BF16 add: a + b as packed 16-bit values
  - aligns to the larger exponent
  - signed-adds the 24-bit aligned mantissas
  - renormalizes by counting the leading one
  - repacks to BF16
  First-pass implementation: no denormal / NaN / Inf handling; softmax
  uses normalized BF16 operands so the subtle corner cases don't fire
  on the autoregressive decode path. Used by CVO_top's sub-emax stage.
  ─────────────────────────────────────────────*/
  function automatic logic [15:0] bf16_add(input logic [15:0] a,
                                           input logic [15:0] b);
    bf16_t         av, bv;
    logic [7:0]    emax;
    logic [23:0]   aa, ba;
    logic signed [24:0] sum;
    logic               out_sign;
    logic [23:0]   mag;
    int            lead;
    logic [7:0]    out_exp;
    logic [6:0]    out_mant;

    av   = to_bf16(a);
    bv   = to_bf16(b);
    emax = (av.exp > bv.exp) ? av.exp : bv.exp;

    aa = align_to_emax(av, emax);
    ba = align_to_emax(bv, emax);
    sum = $signed({aa[23], aa}) + $signed({ba[23], ba});

    out_sign = sum[24];
    mag      = out_sign ? (~sum[23:0] + 24'd1) : sum[23:0];

    if (mag == 24'd0) return 16'd0;

    // Find the position of the leading 1 (MSB-first).
    lead = 23;
    while (lead > 0 && mag[lead] == 1'b0) lead = lead - 1;

    // Re-bias exponent. The mantissa's implicit leading-1 is at bit 15
    // before alignment; "lead - 15" is the net exponent correction.
    out_exp  = emax + 8'(lead - 15);

    // 7 mantissa bits immediately below the leading 1.
    if (lead >= 7)
      out_mant = mag[lead-1 -: 7];
    else
      out_mant = 7'(mag << (7 - lead));

    return {out_sign, out_exp, out_mant};
  endfunction
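
The align, add, and leading-one steps of bf16_add can be traced with a short software model. The Python sketch below follows those steps and the 7-bit mantissa extraction, but stops before the exponent re-bias and repacking, so it returns intermediate values rather than a packed word. The function names, the use of signed Python integers in place of 24-bit patterns, and the None return for the zero-sum early exit are assumptions made for illustration.

```python
def fields(raw):
    """Split a raw 16-bit word into (sign, exp, mantissa)."""
    raw &= 0xFFFF
    return (raw >> 15) & 0x1, (raw >> 7) & 0xFF, raw & 0x7F


def align(raw, emax):
    """Aligned value as a signed Python int (the RTL holds it as 24-bit two's complement)."""
    sign, exp, mantissa = fields(raw)
    mag = ((1 << 22) | (mantissa << 15)) >> ((emax - exp) & 0xFF)
    return -mag if sign else mag


def add_steps(a, b):
    """Return (emax, out_sign, lead, out_mant), or None where the RTL returns 16'd0."""
    emax = max(fields(a)[1], fields(b)[1])      # align to the larger exponent
    total = align(a, emax) + align(b, emax)     # fits in the RTL's 25-bit signed sum
    if total == 0:
        return None                             # exact cancellation
    out_sign = 1 if total < 0 else 0
    mag = abs(total)
    lead = mag.bit_length() - 1                 # position of the leading 1, MSB-first search
    if lead >= 7:
        out_mant = (mag >> (lead - 7)) & 0x7F   # 7 bits immediately below the leading 1
    else:
        out_mant = (mag << (7 - lead)) & 0x7F
    return emax, out_sign, lead, out_mant


if __name__ == "__main__":
    print(add_steps(0x3F80, 0x3F00))   # 1.0 + 0.5
    print(add_steps(0x3F80, 0x3F80))   # 1.0 + 1.0
    print(add_steps(0x3F80, 0xBF80))   # 1.0 + (-1.0)
```

For 1.0 + 0.5 the leading 1 stays at bit 22, the position the hidden bit occupies in an unshifted aligned operand, and the extracted mantissa is 1000000b. For 1.0 + 1.0 the carry moves the leading 1 up to bit 23 with a zero mantissa.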

QUEUE Interface

The QUEUE primitive is split across two files: an interface (IF_queue) and a module (QUEUE).

IF_queue (Library/Algorithms/QUEUE/IF_queue.sv): A parameterised SystemVerilog interface with parameters DATA_WIDTH (default 32) and DEPTH (default 8). The interface itself takes clk and rst_n as ports. The pointer width PTR_W = $clog2(DEPTH) is derived internally. The storage array mem[0:DEPTH-1] and the pointers wr_ptr and rd_ptr are declared inside the interface, and the empty and full flags are assigned combinationally.

The interface defines three modports:

  • producer — Imports the push() task only. Drives push_data/push_en; reads empty/full.

  • consumer — Imports the pop() task only. Reads pop_data/empty/full; drives pop_en.

  • owner — Used by the QUEUE module itself. Receives all handshake signals as inputs; drives wr_ptr/rd_ptr and references mem via ref.

Listing 29 Library/Algorithms/QUEUE/IF_queue.sv
  modport producer(import push, input empty, full, clk, rst_n, output push_data, push_en);

  // consumer : only pops
  modport consumer(import pop, input empty, full, pop_data, clk, rst_n, output pop_en);

  // owner : the FIFO module itself. Reads producer/consumer handshake
  // signals, updates its own pointers + memory contents.
  modport owner(input  clk, rst_n, push_data, push_en, pop_en, full, empty,
                output wr_ptr, rd_ptr, ref mem);

QUEUE (Library/Algorithms/QUEUE/QUEUE.sv): A module with a single port, IF_queue.owner q. It re-derives the pointer width as PTR_W = $clog2($size(q.mem)) because modports cannot export parameters. The always_ff block initialises both pointers to zero on reset, writes a word when push_en && !full, and advances the read pointer when pop_en && !empty.
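
The pointer behaviour described above can be expressed as a cycle-level software model. The Python sketch below is a hypothetical model, not a translation of QUEUE.sv, and it makes several assumptions that the source excerpt does not confirm. It assumes that DEPTH is a power of two and that wr_ptr advances after each accepted write. It assumes that full is wr_ptr + 1 == rd_ptr (modulo DEPTH), which leaves one slot unused. It also assumes that a pop returns the word at the pre-edge read pointer.

```python
class QueueModel:
    """Hypothetical cycle-level model of the QUEUE pointer updates."""

    def __init__(self, depth=8):
        self.depth = depth            # assumed to be a power of two
        self.mem = [0] * depth
        self.reset()

    def reset(self):
        """Reset: both pointers return to zero."""
        self.wr_ptr = 0
        self.rd_ptr = 0

    @property
    def empty(self):
        return self.wr_ptr == self.rd_ptr

    @property
    def full(self):
        # ASSUMPTION: one-slot-unused convention; IF_queue.sv may define full differently.
        return (self.wr_ptr + 1) % self.depth == self.rd_ptr

    def clock(self, push_en=False, push_data=0, pop_en=False):
        """Apply one rising clock edge and return the popped word, or None."""
        was_full, was_empty = self.full, self.empty   # flags sampled before the edge
        popped = None
        if pop_en and not was_empty:
            popped = self.mem[self.rd_ptr]            # ASSUMPTION: word at the old rd_ptr
        if push_en and not was_full:
            self.mem[self.wr_ptr] = push_data
            self.wr_ptr = (self.wr_ptr + 1) % self.depth
        if pop_en and not was_empty:
            self.rd_ptr = (self.rd_ptr + 1) % self.depth
        return popped


if __name__ == "__main__":
    q = QueueModel(depth=4)
    for word in (0xA, 0xB, 0xC):
        q.clock(push_en=True, push_data=word)
    print(q.full, q.clock(pop_en=True))
```

A model of this kind is mainly useful as a testbench scoreboard: pushes that arrive while full and pops that arrive while empty are ignored, matching the push_en && !full and pop_en && !empty guards.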

Quantizations

Quantize_BF16.sv (Library/Quantizations/BF16/Quantize_BF16.sv): Currently an empty placeholder. It marks the intended location for BF16 quantization helpers that will provide a common conversion path between the offline quantization pipeline and the RTL datapath.

Usage Patterns

Table 11 lists the package imports and the interface and module instantiations confirmed directly in each source file.

Table 11 Library dependencies by compute core

Module (core)                                  algorithms_pkg  bf16_math_pkg  IF_queue  QUEUE
CVO_top (CVO_CORE)                             -               o              -         -
AXIL_CMD_IN (sub-module of ctrl_npu_frontend)  o               -              o         o

o = import or instantiation confirmed in source. - = not present in that file.

CVO_top declares import bf16_math_pkg::*; directly. Per the source comment, the FLAG_SUB_EMAX path (the sub-emax stage of the CVO softmax) uses this package’s BF16 arithmetic. AXIL_CMD_IN imports algorithms_pkg and instantiates IF_queue and QUEUE; it buffers AXI4-Lite commands in a FIFO and is itself instantiated by ctrl_npu_frontend. GEMM_systolic_top, GEMV_top, and the PREPROCESS modules do not import any library package; they use only `define headers.


Last verified against

Commit 8c09e5e @ pccxai/pccx-FPGA-NPU-LLM-kv260 (2026-04-29).