feat: GPU witness generation (RV32IM + Keccak + ShardRam) #1259
GPU Witness Generation — Invasive Changes to Existing Codebase

This document lists all changes to existing ceno structures, traits, and flows.

1. `#[repr(C)]` on Emulator Types
| Type | File | Size | Purpose |
|---|---|---|---|
| `StepRecord` | tracer.rs | 136B | Per-step emulator output, bulk H2D |
| `Instruction` | rv32im.rs | 12B | Opcode encoding embedded in StepRecord |
| `InsnKind` | rv32im.rs | 1B | `#[repr(u8)]` enum discriminant |
| `MemOp<T>` | tracer.rs | 16/24B | Read/Write ops embedded in StepRecord |
| `Change<T>` | tracer.rs | 2×T | Before/after pair |
Impact: These were previously #[derive(Debug, Clone)] with compiler-chosen layout.
Adding #[repr(C)] pins field order and padding. No behavioral change for CPU code,
but field reordering or insertion now requires updating the CUDA mirror structs.
Layout test
test_step_record_layout_for_gpu verifies byte offsets of all StepRecord fields
at compile time. CUDA side has matching static_assert(sizeof(...)).
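Such a layout check can be sketched with `std::mem::offset_of!`. The struct below is a hypothetical, simplified stand-in for `StepRecord` (the real one is 136 bytes with many more fields); the offsets asserted here are for this toy layout only:

```rust
use std::mem::{offset_of, size_of};

// Hypothetical, simplified stand-in for StepRecord; the real struct
// is 136 bytes with many more fields.
#[repr(C)]
#[allow(dead_code)]
struct StepRecord {
    cycle: u64,
    pc_before: u32,
    pc_after: u32,
}

fn main() {
    // Each offset must match the CUDA mirror struct's static_assert()s;
    // reordering or inserting a field makes these assertions fail.
    assert_eq!(offset_of!(StepRecord, cycle), 0);
    assert_eq!(offset_of!(StepRecord, pc_before), 8);
    assert_eq!(offset_of!(StepRecord, pc_after), 12);
    assert_eq!(size_of::<StepRecord>(), 16);
}
```

Because `#[repr(C)]` pins field order and padding, the same offsets can be asserted on the CUDA side with `static_assert(offsetof(...))`, which is what makes the bulk H2D copy safe.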
2. Instruction<E> Trait — New Methods and Constants
File: ceno_zkvm/src/instructions.rs
| Addition | Purpose |
|---|---|
| `const GPU_LK_SHARDRAM: bool = false` | Opt-in flag: does this chip have GPU LK+shardram support? |
| `fn collect_lk_and_shardram(...)` | CPU companion: collect all LK multiplicities + shard RAM records (without witness replay) |
| `fn collect_shardram(...)` | CPU companion: collect shard RAM records only (GPU handles LK) |
Default implementations return Err(...) — chips must explicitly opt in.
Impact: Existing chips that don't implement GPU support are unaffected (defaults).
The trait's existing assign_instance and assign_instances are unchanged.
Three macros reduce per-chip boilerplate:
- `impl_collect_lk_and_shardram!` — wraps the unsafe `CpuLkShardramSink` prologue
- `impl_collect_shardram!` — one-line delegate to `insn_config`
- `impl_gpu_assign!` — `#[cfg(feature = "gpu")]` `assign_instances` override
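The opt-in pattern can be sketched as follows; the trait name matches the description above, but the signatures and the `NotSupported` error type are simplified stand-ins, not the actual ceno API:

```rust
// Simplified sketch of the trait extension: defaults return Err, so
// chips that never opt in compile and behave exactly as before.
pub struct NotSupported;

pub trait Instruction {
    /// Opt-in flag: chips without GPU LK+shardram support keep the default.
    const GPU_LK_SHARDRAM: bool = false;

    fn collect_lk_and_shardram(&self) -> Result<(), NotSupported> {
        Err(NotSupported)
    }
    fn collect_shardram(&self) -> Result<(), NotSupported> {
        Err(NotSupported)
    }
}

// An existing chip needs no changes at all.
struct LegacyChip;
impl Instruction for LegacyChip {}

fn main() {
    assert!(!<LegacyChip as Instruction>::GPU_LK_SHARDRAM);
    assert!(LegacyChip.collect_lk_and_shardram().is_err());
    println!("defaults ok");
}
```

Defaulted associated constants and methods are what keep this change purely additive: the dispatcher can check `GPU_LK_SHARDRAM` at compile time and fall back to the CPU path when a chip has not opted in.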
3. Gadgets — New emit_lk_and_shardram / emit_shardram Methods
File: ceno_zkvm/src/instructions/riscv/insn_base.rs (+253 lines)
Every base gadget (ReadRS1, ReadRS2, WriteRD, ReadMEM, WriteMEM, MemAddr)
gained two new methods:
| Method | What it does |
|---|---|
| `emit_lk_and_shardram(sink, ctx, step)` | Emit LK ops + RAM send events through `LkShardramSink` |
| `emit_shardram(shard_ctx, step)` | Directly write shard RAM records to `ShardContext` (no LK) |
Impact: Additive only — existing assign_instance methods are unchanged.
The new methods extract the same logic that assign_instance performed inline,
but route through the LkShardramSink trait instead of directly calling
lk_multiplicity.assert_ux(...).
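The sink indirection can be illustrated as below. All names except `LkShardramSink` and `assert_ux` are invented for this sketch; the point is that one `emit_*` body can feed either a CPU multiplicity table or, behind the same trait, a GPU-bound buffer:

```rust
use std::collections::HashMap;

// Illustrative trait mirroring the description; not the actual ceno API.
trait LkShardramSink {
    fn assert_ux(&mut self, bits: u32, value: u64);
}

// CPU-side sink: bumps lookup multiplicities directly, as the old
// inline assign_instance logic did.
struct CpuSink {
    lk_counts: HashMap<(u32, u64), u64>,
}

impl LkShardramSink for CpuSink {
    fn assert_ux(&mut self, bits: u32, value: u64) {
        *self.lk_counts.entry((bits, value)).or_insert(0) += 1;
    }
}

// A gadget's emit logic is written once against the trait, so a GPU
// buffering sink can be swapped in without touching the gadget.
fn emit_lk(sink: &mut impl LkShardramSink, limb: u64) {
    sink.assert_ux(16, limb);
}

fn main() {
    let mut sink = CpuSink { lk_counts: HashMap::new() };
    emit_lk(&mut sink, 0xBEEF);
    emit_lk(&mut sink, 0xBEEF);
    assert_eq!(sink.lk_counts[&(16, 0xBEEF)], 2);
    println!("sink ok");
}
```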
Intermediate configs (r_insn.rs, i_insn.rs, b_insn.rs, s_insn.rs, j_insn.rs, im_insn.rs)
Each gained corresponding emit_lk_and_shardram / emit_shardram methods that
compose their gadgets' methods + emit LkOp::Fetch.
4. Per-Chip Circuit Files — GPU Opt-in (+792 / -129 lines across ~20 files)
Each v2 circuit file (arith.rs, logic_circuit.rs, div_circuit_v2.rs, etc.) gained:
```rust
const GPU_LK_SHARDRAM: bool = true; // or conditional match

impl_collect_lk_and_shardram!(r_insn, |sink, step, _config, _ctx| {
    // chip-specific LK ops
});
impl_collect_shardram!(r_insn);
impl_gpu_assign!(dispatch::GpuWitgenKind::Add);
```

Impact: Additive — existing assign_instance and construct_circuit unchanged.
The #[cfg(feature = "gpu")] assign_instances override is only compiled with the
gpu feature flag.
5. ShardContext — New Fields and Methods
File: ceno_zkvm/src/e2e.rs (+616 / -199 lines)
New methods
| Method | Purpose |
|---|---|
| `new_empty_like()` | Clone shard metadata with empty record storage (for debug comparison) |
| `insert_read_record()` / `insert_write_record()` | Direct record insertion (GPU D2H path) |
| `push_addr_accessed()` | Direct addr insertion (GPU D2H path) |
Renamed method
send() → split into record_send_without_touch() (no addr_accessed tracking) and
send() (which calls record_send_without_touch + push_addr_accessed).
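The split preserves the original `send()` behavior as a composition of the two new pieces. A minimal sketch, with invented field and parameter types (the real `ShardContext` records are richer):

```rust
// Hypothetical, simplified ShardContext; only the send() split is shown.
struct ShardContext {
    records: Vec<u64>,
    addr_accessed: Vec<u64>,
}

impl ShardContext {
    // Record only, no addr_accessed tracking: the GPU D2H path fills
    // addr_accessed separately via push_addr_accessed().
    fn record_send_without_touch(&mut self, rec: u64) {
        self.records.push(rec);
    }

    fn push_addr_accessed(&mut self, addr: u64) {
        self.addr_accessed.push(addr);
    }

    // Original behavior preserved: record + addr tracking in one call.
    fn send(&mut self, rec: u64, addr: u64) {
        self.record_send_without_touch(rec);
        self.push_addr_accessed(addr);
    }
}

fn main() {
    let mut ctx = ShardContext { records: vec![], addr_accessed: vec![] };
    ctx.send(7, 0x1000);
    ctx.record_send_without_touch(8); // GPU path: addr comes later via D2H
    assert_eq!(ctx.records, vec![7, 8]);
    assert_eq!(ctx.addr_accessed, vec![0x1000]);
    println!("send split ok");
}
```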
Pipeline hooks (in generate_witness shard loop)
```rust
#[cfg(feature = "gpu")]
flush_shared_ec_buffers(&mut shard_ctx); // D2H shared GPU buffers
#[cfg(feature = "gpu")]
invalidate_shard_steps_cache(); // free GPU memory
```

Pipeline mode (in create_proofs_streaming)

New overlap pipeline (default when the GPU feature is enabled but CENO_GPU_ENABLE_WITGEN is unset):
CPU witgen on thread A, GPU prove on thread B, connected by a crossbeam::bounded(0) channel.
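A minimal sketch of that overlap, assuming simplified stand-ins for both stages. The real code uses `crossbeam::bounded(0)`; std's `sync_channel(0)` gives the same rendezvous semantics for this illustration:

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Run `n` shards through a witgen-thread / prove-thread rendezvous
// pipeline and return the order in which the "prover" received them.
fn run_pipeline(n: u32) -> Vec<u32> {
    // Capacity 0 = rendezvous: witgen blocks until the prover takes a
    // shard, bounding memory to at most one in-flight witness.
    let (tx, rx) = sync_channel::<u32>(0);

    // Thread A: CPU witness generation.
    let witgen = thread::spawn(move || {
        for shard in 0..n {
            tx.send(shard).unwrap(); // blocks until the prover is ready
        }
    });

    // Thread B (here: the calling thread): proving, one shard at a time.
    let proven: Vec<u32> = rx.into_iter().collect();
    witgen.join().unwrap();
    proven
}

fn main() {
    assert_eq!(run_pipeline(4), vec![0, 1, 2, 3]);
    println!("pipeline ok");
}
```

The zero-capacity channel is the design choice worth noting: it overlaps the two stages in time without ever buffering more than one shard's witness, which keeps peak memory flat.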
6. ZKVMWitnesses — GPU ShardRam Pipeline
File: ceno_zkvm/src/structs.rs (thin wrapper only)
assign_shared_circuit — GPU fast path
try_assign_shared_circuit_gpu() delegates to gpu/chips/shard_ram::try_gpu_assign_shared_circuit().
The full GPU pipeline logic and gpu_ec_records_to_shard_ram_inputs conversion have been
moved to instructions/gpu/ — structs.rs only contains the wrapper that inserts results
into self.witnesses.
Two helper methods made pub(crate) for GPU access: mem_addresses(), make_cross_shard_record().
7. ShardRamCircuit — GPU Witness Generation
File: ceno_zkvm/src/tables/shard_ram.rs (+491 / -14 lines)
New GPU functions
| Function | Purpose |
|---|---|
| `try_gpu_assign_instances()` | H2D path: CPU records → GPU kernel → D2H witness |
| `try_gpu_assign_instances_from_device()` | Device path: records already on GPU → kernel → D2H |
Both run a two-phase GPU pipeline:
- Per-row kernel: basic fields + Poseidon2 trace (344 witness columns)
- EC tree kernel: layer-by-layer binary tree EC summation
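The tree-summation shape can be illustrated with scalar addition standing in for septic-curve point addition (the associativity that makes the layer-by-layer reduction valid is the same):

```rust
// Illustrative layer-by-layer binary tree reduction, as the EC tree
// kernel performs per layer on the GPU. u64 `+` stands in for EC point
// addition; each pass halves the layer until one root remains.
fn tree_sum(mut layer: Vec<u64>) -> u64 {
    while layer.len() > 1 {
        layer = layer
            .chunks(2)
            // a lone trailing element carries up to the next layer unchanged
            .map(|pair| pair.iter().sum())
            .collect();
    }
    layer[0]
}

fn main() {
    let leaves: Vec<u64> = (1..=5).collect();
    assert_eq!(tree_sum(leaves), 15);
    println!("tree sum ok");
}
```

On the GPU each `while` iteration is one kernel launch over the current layer, so a layer of n points needs about log2(n) launches rather than a serial fold.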
Visibility change
ShardRamConfig fields changed from private to pub(crate) to allow
column map extraction in gpu/chips/shard_ram.rs.
8. SepticCurve — New Math Utilities
File: ceno_zkvm/src/scheme/septic_curve.rs (+307 lines)
New CPU-side math for EC point computation (mirrored in CUDA):
| Function | Purpose |
|---|---|
| `SepticExtension::frobenius()` | Frobenius endomorphism for norm computation |
| `SepticExtension::sqrt()` | Cipolla's algorithm for field square roots |
| `SepticPoint::from_x()` | Lift x-coordinate to curve point (used by nonce-finding loop) |
| `QuadraticExtension<F>` | Auxiliary type for Cipolla's algorithm |
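Cipolla's algorithm is the same technique regardless of field, so a toy version over a small prime field shows the mechanics that `SepticExtension::sqrt()` applies in the degree-7 extension. Everything below (the prime, helper names) is illustrative only:

```rust
const P: u64 = 10007; // small illustrative prime

fn pow_mod(mut b: u64, mut e: u64, m: u64) -> u64 {
    let mut r = 1;
    b %= m;
    while e > 0 {
        if e & 1 == 1 { r = r * b % m; }
        b = b * b % m;
        e >>= 1;
    }
    r
}

fn is_residue(n: u64) -> bool {
    // Euler's criterion
    n == 0 || pow_mod(n, (P - 1) / 2, P) == 1
}

/// Cipolla: find x with x^2 ≡ n (mod P), assuming n is a residue.
fn cipolla_sqrt(n: u64) -> u64 {
    if n == 0 { return 0; }
    // Find a such that a^2 - n is a non-residue; ω^2 = a^2 - n defines
    // the quadratic extension (the QuadraticExtension<F> role above).
    let (mut a, mut w2) = (0, 0);
    for cand in 0..P {
        let t = (cand * cand % P + P - n) % P;
        if !is_residue(t) { a = cand; w2 = t; break; }
    }
    // Compute (a + ω)^((P+1)/2) in F_P[ω]/(ω^2 - w2); the ω part
    // vanishes and the F_P part is the square root.
    let (mut x0, mut x1) = (1u64, 0u64); // accumulator = 1
    let (mut b0, mut b1) = (a, 1u64);    // base = a + ω
    let mut e = (P + 1) / 2;
    while e > 0 {
        if e & 1 == 1 {
            let t0 = (x0 * b0 + x1 * b1 % P * w2) % P;
            let t1 = (x0 * b1 + x1 * b0) % P;
            x0 = t0; x1 = t1;
        }
        let t0 = (b0 * b0 + b1 * b1 % P * w2) % P;
        let t1 = 2 * b0 * b1 % P;
        b0 = t0; b1 = t1;
        e >>= 1;
    }
    x0
}

fn main() {
    let n = 1234 % P;
    let sq = n * n % P;
    let r = cipolla_sqrt(sq);
    assert_eq!(r * r % P, sq);
    println!("sqrt ok");
}
```

The `from_x()` nonce-finding loop uses exactly this primitive: try x-candidates until x³ + ax + b is a square, then take its `sqrt()` to lift to a curve point.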
9. Minor Touches
| File | Change |
|---|---|
| Cargo.toml | gpu feature flag, crossbeam dependency |
| gkr_iop/src/gadgets/is_lt.rs | `AssertLtConfig.0.diff` field access (already pub) |
| gkr_iop/src/utils/lk_multiplicity.rs | Minor: `LkMultiplicity::increment` |
| ceno_zkvm/src/gadgets/signed_ext.rs | `pub(crate) fn msb()` accessor for GPU column map |
| ceno_zkvm/src/gadgets/poseidon2.rs | Column contiguity constants for GPU |
| ceno_zkvm/src/tables/*.rs | `pub(crate)` visibility on config fields for GPU column map access |
| ceno_zkvm/src/scheme/{cpu,gpu,prover,verifier} | Minor plumbing for GPU proving path |
| ceno_host/tests/test_elf.rs | E2E test adjustments |
Summary
| Category | Nature | Risk |
|---|---|---|
| `#[repr(C)]` on emulator types | Layout pinning | Low — additive, but field changes now need CUDA sync |
| `Instruction<E>` trait extensions | Additive (defaults provided) | None — existing chips unaffected |
| Gadget `emit_*` methods | Additive | None — existing `assign_instance` unchanged |
| `ShardContext` new methods | Additive | Low — existing methods unchanged |
| `send()` → `record_send_without_touch()` + `send()` | Rename + split | Low — `send()` still works identically |
| `ShardRamConfig` visibility | private → `pub(crate)` | None |
| Pipeline overlap mode | New default behavior | Medium — CPU witgen + GPU prove on separate threads |
| septic_curve.rs math | Additive | None — new functions, existing unchanged |
hero78119 left a comment:

A quick review regarding the tracer & SortedNextAccesses field
ceno_emul/src/tracer.rs
```rust
mmio_min_max_access: Option<BTreeMap<WordAddr, (WordAddr, WordAddr, WordAddr, WordAddr)>>,
latest_accesses: LatestAccesses,
next_accesses: NextCycleAccess,
next_accesses_vec: Vec<PackedNextAccessEntry>,
```
We can rebuild SortedNextAccesses from the self.next_accesses map, so we can avoid introducing this new vector field and PackedNextAccessEntry.
Rebuilding is slower than this method (roughly 50ms vs 200ms): CENO_NEXT_ACCESS_SOURCE=preflight vs CENO_NEXT_ACCESS_SOURCE=hashmap.
I believe the slowness comes from the sequential traversal of the hashmap; how about par_extend()?
```rust
info_span!("next_access_from_hashmap").in_scope(|| {
    let total_entries: usize =
        addr_future_accesses.values().map(|pairs| pairs.len()).sum();
    let mut entries = Vec::with_capacity(total_entries);
    entries.par_extend(addr_future_accesses.par_iter().flat_map_iter(
        |(cycle, pairs)| {
            pairs.iter().map(move |&(addr, next_cycle)| {
                PackedNextAccessEntry::new(*cycle, addr.0, next_cycle)
            })
        },
    ));
    entries
})
```

If this works, we can remove the vector version from the tracer and only rebuild here, gated by the gpu feature.
```
# using preflight-appended vec (9321311 entries)
┝━ [ceno] preflight-execute [ 7.85s | 6.11% ]
   ┝━ next_access_presort [ 52.6ms | 0.00% / 0.04% ]
   │  ┝━ i [info]: [next-access presort]
   │  ┝━ next_access_par_sort [ 52.5ms | 0.04% ] n: 9321311

# converting from HashMap - serial
┝━ [ceno] preflight-execute [ 7.85s | 6.17% ]
   ┝━ next_access_presort [ 110ms | 0.01% / 0.09% ]
   │  ┝━ next_access_from_hashmap [ 68.4ms | 0.05% ]
   │  ┝━ next_access_par_sort [ 33.1ms | 0.03% ] n: 9321311

# converting from HashMap - parallel
┝━ [ceno] preflight-execute [ 7.76s | 4.49% ]
   ┝━ next_access_presort [ 174ms | 0.00% / 0.10% ]
   │  ┝━ next_access_from_hashmap [ 138ms | 0.08% ]
   │  ┝━ next_access_par_sort [ 28.5ms | 0.02% ] n: 9321311
```
- Each element's work (PackedNextAccessEntry::new) is a few assignments and bit-shifts — too trivial for the parallelism gains to outweigh Rayon's scheduling/synchronization overhead.
- The cost is mainly memory allocation and HashMap random access, which are memory-bandwidth-bound and only get worse with multiple threads contending for cache.
Perhaps we could switch to the rebuild-from-hashmap approach first, without introducing interface changes, and address performance concerns later.
related: #1265
GPU Witness Generation
Accelerate witness generation by offloading computation from CPU to GPU.
This module (ceno_zkvm/src/instructions/gpu/) contains all GPU-side dispatch, caching, and utility code for the witness generation pipeline. The CUDA backend lives in the sibling repo ceno-gpu/ (cuda_hal/src/common/witgen/).

Architecture
Module Layout
Data Flow
Per-Shard Pipeline
Within generate_witness() (e2e.rs), each shard executes:
- Vec<StepRecord> (cached, shared across all chips)
- gpu_fill_witness matches GpuWitgenKind → 22 kernel variants
- ShardContext

GPU/CPU Decision (dispatch.rs)
Keccak Dispatch
Keccak has a dedicated GPU dispatch path (chips/keccak.rs::gpu_assign_keccak_instances), separate from try_gpu_assign_instances, because it uses:
- new_by_rotation
- packed_instances with syscall_witnesses

The LK/shardram collection logic is identical to the standard path.
Lk and Shardram Collection
After GPU computes the witness matrix, LK multiplicities and shard RAM records
are collected through one of several paths (priority order):
- cpu_collect_shardram
- cpu_collect_lk_and_shardram
- assign_instance
- assign_instance

Currently all non-Keccak kinds use Path A. Paths B-E are fallback/debug paths.
E2E Pipeline Modes (e2e.rs)
Environment Variables
- CENO_GPU_ENABLE_WITGEN
- CENO_GPU_DISABLE_WITGEN_KINDS — e.g. add,keccak,lw; falls back to CPU for those chips.
- CENO_GPU_DEBUG_COMPARE_WITGEN

CENO_GPU_DEBUG_COMPARE_WITGEN Coverage

When set, all failures are collected into a DebugCompareReport (thread-local). Detailed mismatches are logged via tracing::error! in real time; at pipeline end assert_debug_compare_report() prints a summary table and panics if any failures exist.

Per-chip (in dispatch.rs, for each opcode circuit):
- debug_compare_final_lk — GPU LK multiplicity vs CPU assign_instance baseline (all 8 lookup tables)
- debug_compare_witness — GPU witness matrix vs CPU witness (element-by-element)
- debug_compare_shardram — GPU shard records (read_records, write_records, addr_accessed) vs CPU
- debug_compare_shard_ec — GPU compact EC records vs CPU-computed EC points (nonce, x[7], y[7])

Per-chip, Keccak-specific (in chips/keccak.rs):
- debug_compare_keccak — Combined witness + LK + shard comparison for keccak's rotation-aware layout

ShardRamCircuit (in chips/shard_ram.rs):
- debug_compare_shard_ram_witness — GPU ShardRam witness vs CPU baseline (from ShardRamInput)
- debug_compare_shard_ram_witness_from_device — GPU ShardRam witness vs CPU baseline (D2H device buffer → convert → CPU assign)

Per-shard, E2E level (in e2e.rs, all chips combined):
- log_shard_ctx_diff — Aggregated addr_accessed comparison (write/read_records skipped when GPU witgen enabled)
- log_combined_lk_diff — Merged LK multiplicities after finalize_lk_multiplicities() (catches cross-chip merge issues)

Tests
79 tests total (`cargo test --features gpu,u16limb_circuit -p ceno_zkvm --lib -- "gpu"`):
- chips/*.rs — 31 via test_colmap! macro + 2 manual
- chips/*.rs — assign_instance (element-by-element witness comparison)
- gpu/mod.rs — collect_lk_and_shardram/collect_shardram vs assign_instance baseline
- utils/mod.rs — LkOp::encode_all() produces correct table/key pairs
- scheme/septic_curve.rs — to_ec_point
- scheme/septic_curve.rs
- scheme/septic_curve.rs — septic_point_from_x vs CPU

Running Tests
Per-Chip Boilerplate Macros
Three macros in instructions.rs reduce per-chip GPU integration to ~3 lines: