feat: GPU witness generation (RV32IM + Keccak + ShardRam) #1259
GPU Witness Generation — Invasive Changes to Existing Codebase

This document lists all changes to existing ceno structures, traits, and flows.

1. `#[repr(C)]` on Emulator Types
| Type | File | Size | Purpose |
|---|---|---|---|
| `StepRecord` | tracer.rs | 136B | Per-step emulator output, bulk H2D |
| `Instruction` | rv32im.rs | 12B | Opcode encoding embedded in StepRecord |
| `InsnKind` | rv32im.rs | 1B | `#[repr(u8)]` enum discriminant |
| `MemOp<T>` | tracer.rs | 16/24B | Read/Write ops embedded in StepRecord |
| `Change<T>` | tracer.rs | 2×T | Before/after pair |
Impact: These were previously #[derive(Debug, Clone)] with compiler-chosen layout.
Adding #[repr(C)] pins field order and padding. No behavioral change for CPU code,
but field reordering or insertion now requires updating the CUDA mirror structs.
Layout test
test_step_record_layout_for_gpu verifies byte offsets of all StepRecord fields
at compile time. CUDA side has matching static_assert(sizeof(...)).
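Such a layout check can be sketched with `std::mem::offset_of!`. The struct below is a hypothetical, simplified stand-in for `StepRecord` (the real one is 136 bytes with many more fields); the offsets asserted here are for this toy layout only:

```rust
use std::mem::{offset_of, size_of};

// Hypothetical, simplified stand-in for StepRecord; the real struct
// is 136 bytes with many more fields.
#[repr(C)]
#[allow(dead_code)]
struct StepRecord {
    cycle: u64,
    pc_before: u32,
    pc_after: u32,
}

fn main() {
    // Each offset must match the CUDA mirror struct's static_assert()s;
    // reordering or inserting a field makes these assertions fail.
    assert_eq!(offset_of!(StepRecord, cycle), 0);
    assert_eq!(offset_of!(StepRecord, pc_before), 8);
    assert_eq!(offset_of!(StepRecord, pc_after), 12);
    assert_eq!(size_of::<StepRecord>(), 16);
}
```

Because `#[repr(C)]` pins field order and padding, the same offsets can be asserted on the CUDA side with `static_assert(offsetof(...))`, which is what makes the bulk H2D copy safe.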
2. Instruction<E> Trait — New Methods and Constants
File: ceno_zkvm/src/instructions.rs
| Addition | Purpose |
|---|---|
| `const GPU_LK_SHARDRAM: bool = false` | Opt-in flag: does this chip have GPU LK+shardram support? |
| `fn collect_lk_and_shardram(...)` | CPU companion: collect all LK multiplicities + shard RAM records (without witness replay) |
| `fn collect_shardram(...)` | CPU companion: collect shard RAM records only (GPU handles LK) |
Default implementations return Err(...) — chips must explicitly opt in.
Impact: Existing chips that don't implement GPU support are unaffected (defaults).
The trait's existing assign_instance and assign_instances are unchanged.
Three macros reduce per-chip boilerplate:
- `impl_collect_lk_and_shardram!` — wraps the unsafe `CpuLkShardramSink` prologue
- `impl_collect_shardram!` — one-line delegate to `insn_config`
- `impl_gpu_assign!` — `#[cfg(feature = "gpu")]` `assign_instances` override
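The opt-in pattern can be sketched as follows; the trait name matches the description above, but the signatures and the `NotSupported` error type are simplified stand-ins, not the actual ceno API:

```rust
// Simplified sketch of the trait extension: defaults return Err, so
// chips that never opt in compile and behave exactly as before.
pub struct NotSupported;

pub trait Instruction {
    /// Opt-in flag: chips without GPU LK+shardram support keep the default.
    const GPU_LK_SHARDRAM: bool = false;

    fn collect_lk_and_shardram(&self) -> Result<(), NotSupported> {
        Err(NotSupported)
    }
    fn collect_shardram(&self) -> Result<(), NotSupported> {
        Err(NotSupported)
    }
}

// An existing chip needs no changes at all.
struct LegacyChip;
impl Instruction for LegacyChip {}

fn main() {
    assert!(!<LegacyChip as Instruction>::GPU_LK_SHARDRAM);
    assert!(LegacyChip.collect_lk_and_shardram().is_err());
    println!("defaults ok");
}
```

Defaulted associated constants and methods are what keep this change purely additive: the dispatcher can check `GPU_LK_SHARDRAM` at compile time and fall back to the CPU path when a chip has not opted in.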
3. Gadgets — New emit_lk_and_shardram / emit_shardram Methods
File: ceno_zkvm/src/instructions/riscv/insn_base.rs (+253 lines)
Every base gadget (ReadRS1, ReadRS2, WriteRD, ReadMEM, WriteMEM, MemAddr)
gained two new methods:
| Method | What it does |
|---|---|
| `emit_lk_and_shardram(sink, ctx, step)` | Emit LK ops + RAM send events through `LkShardramSink` |
| `emit_shardram(shard_ctx, step)` | Directly write shard RAM records to `ShardContext` (no LK) |
Impact: Additive only — existing assign_instance methods are unchanged.
The new methods extract the same logic that assign_instance performed inline,
but route through the LkShardramSink trait instead of directly calling
lk_multiplicity.assert_ux(...).
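The sink indirection can be illustrated as below. All names except `LkShardramSink` and `assert_ux` are invented for this sketch; the point is that one `emit_*` body can feed either a CPU multiplicity table or, behind the same trait, a GPU-bound buffer:

```rust
use std::collections::HashMap;

// Illustrative trait mirroring the description; not the actual ceno API.
trait LkShardramSink {
    fn assert_ux(&mut self, bits: u32, value: u64);
}

// CPU-side sink: bumps lookup multiplicities directly, as the old
// inline assign_instance logic did.
struct CpuSink {
    lk_counts: HashMap<(u32, u64), u64>,
}

impl LkShardramSink for CpuSink {
    fn assert_ux(&mut self, bits: u32, value: u64) {
        *self.lk_counts.entry((bits, value)).or_insert(0) += 1;
    }
}

// A gadget's emit logic is written once against the trait, so a GPU
// buffering sink can be swapped in without touching the gadget.
fn emit_lk(sink: &mut impl LkShardramSink, limb: u64) {
    sink.assert_ux(16, limb);
}

fn main() {
    let mut sink = CpuSink { lk_counts: HashMap::new() };
    emit_lk(&mut sink, 0xBEEF);
    emit_lk(&mut sink, 0xBEEF);
    assert_eq!(sink.lk_counts[&(16, 0xBEEF)], 2);
    println!("sink ok");
}
```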
Intermediate configs (r_insn.rs, i_insn.rs, b_insn.rs, s_insn.rs, j_insn.rs, im_insn.rs)
Each gained corresponding emit_lk_and_shardram / emit_shardram methods that
compose their gadgets' methods + emit LkOp::Fetch.
4. Per-Chip Circuit Files — GPU Opt-in (+792 / -129 lines across ~20 files)
Each v2 circuit file (arith.rs, logic_circuit.rs, div_circuit_v2.rs, etc.) gained:
```rust
const GPU_LK_SHARDRAM: bool = true; // or conditional match

impl_collect_lk_and_shardram!(r_insn, |sink, step, _config, _ctx| {
    // chip-specific LK ops
});
impl_collect_shardram!(r_insn);
impl_gpu_assign!(dispatch::GpuWitgenKind::Add);
```

Impact: Additive — existing assign_instance and construct_circuit unchanged.
The #[cfg(feature = "gpu")] assign_instances override is only compiled with the
gpu feature flag.
5. ShardContext — New Fields and Methods
File: ceno_zkvm/src/e2e.rs (+616 / -199 lines)
New methods
| Method | Purpose |
|---|---|
| `new_empty_like()` | Clone shard metadata with empty record storage (for debug comparison) |
| `insert_read_record()` / `insert_write_record()` | Direct record insertion (GPU D2H path) |
| `push_addr_accessed()` | Direct addr insertion (GPU D2H path) |
Renamed method
send() → split into record_send_without_touch() (no addr_accessed tracking) and
send() (which calls record_send_without_touch + push_addr_accessed).
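The split preserves the original `send()` behavior as a composition of the two new pieces. A minimal sketch, with invented field and parameter types (the real `ShardContext` records are richer):

```rust
// Hypothetical, simplified ShardContext; only the send() split is shown.
struct ShardContext {
    records: Vec<u64>,
    addr_accessed: Vec<u64>,
}

impl ShardContext {
    // Record only, no addr_accessed tracking: the GPU D2H path fills
    // addr_accessed separately via push_addr_accessed().
    fn record_send_without_touch(&mut self, rec: u64) {
        self.records.push(rec);
    }

    fn push_addr_accessed(&mut self, addr: u64) {
        self.addr_accessed.push(addr);
    }

    // Original behavior preserved: record + addr tracking in one call.
    fn send(&mut self, rec: u64, addr: u64) {
        self.record_send_without_touch(rec);
        self.push_addr_accessed(addr);
    }
}

fn main() {
    let mut ctx = ShardContext { records: vec![], addr_accessed: vec![] };
    ctx.send(7, 0x1000);
    ctx.record_send_without_touch(8); // GPU path: addr comes later via D2H
    assert_eq!(ctx.records, vec![7, 8]);
    assert_eq!(ctx.addr_accessed, vec![0x1000]);
    println!("send split ok");
}
```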
Pipeline hooks (in generate_witness shard loop)
```rust
#[cfg(feature = "gpu")]
flush_shared_ec_buffers(&mut shard_ctx); // D2H shared GPU buffers
#[cfg(feature = "gpu")]
invalidate_shard_steps_cache(); // free GPU memory
```

Pipeline mode (in create_proofs_streaming)

New overlap pipeline (default when the GPU feature is enabled but CENO_GPU_ENABLE_WITGEN is unset):
CPU witgen on thread A, GPU prove on thread B, connected by a crossbeam::bounded(0) channel.
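A minimal sketch of that overlap, assuming simplified stand-ins for both stages. The real code uses `crossbeam::bounded(0)`; std's `sync_channel(0)` gives the same rendezvous semantics for this illustration:

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Run `n` shards through a witgen-thread / prove-thread rendezvous
// pipeline and return the order in which the "prover" received them.
fn run_pipeline(n: u32) -> Vec<u32> {
    // Capacity 0 = rendezvous: witgen blocks until the prover takes a
    // shard, bounding memory to at most one in-flight witness.
    let (tx, rx) = sync_channel::<u32>(0);

    // Thread A: CPU witness generation.
    let witgen = thread::spawn(move || {
        for shard in 0..n {
            tx.send(shard).unwrap(); // blocks until the prover is ready
        }
    });

    // Thread B (here: the calling thread): proving, one shard at a time.
    let proven: Vec<u32> = rx.into_iter().collect();
    witgen.join().unwrap();
    proven
}

fn main() {
    assert_eq!(run_pipeline(4), vec![0, 1, 2, 3]);
    println!("pipeline ok");
}
```

The zero-capacity channel is the design choice worth noting: it overlaps the two stages in time without ever buffering more than one shard's witness, which keeps peak memory flat.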
6. ZKVMWitnesses — GPU ShardRam Pipeline
File: ceno_zkvm/src/structs.rs (thin wrapper only)
assign_shared_circuit — GPU fast path
try_assign_shared_circuit_gpu() delegates to gpu/chips/shard_ram::try_gpu_assign_shared_circuit().
The full GPU pipeline logic and gpu_ec_records_to_shard_ram_inputs conversion have been
moved to instructions/gpu/ — structs.rs only contains the wrapper that inserts results
into self.witnesses.
Two helper methods made pub(crate) for GPU access: mem_addresses(), make_cross_shard_record().
7. ShardRamCircuit — GPU Witness Generation
File: ceno_zkvm/src/tables/shard_ram.rs (+491 / -14 lines)
New GPU functions
| Function | Purpose |
|---|---|
| `try_gpu_assign_instances()` | H2D path: CPU records → GPU kernel → D2H witness |
| `try_gpu_assign_instances_from_device()` | Device path: records already on GPU → kernel → D2H |
Both run a two-phase GPU pipeline:
- Per-row kernel: basic fields + Poseidon2 trace (344 witness columns)
- EC tree kernel: layer-by-layer binary tree EC summation
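The tree-summation shape can be illustrated with scalar addition standing in for septic-curve point addition (the associativity that makes the layer-by-layer reduction valid is the same):

```rust
// Illustrative layer-by-layer binary tree reduction, as the EC tree
// kernel performs per layer on the GPU. u64 `+` stands in for EC point
// addition; each pass halves the layer until one root remains.
fn tree_sum(mut layer: Vec<u64>) -> u64 {
    while layer.len() > 1 {
        layer = layer
            .chunks(2)
            // a lone trailing element carries up to the next layer unchanged
            .map(|pair| pair.iter().sum())
            .collect();
    }
    layer[0]
}

fn main() {
    let leaves: Vec<u64> = (1..=5).collect();
    assert_eq!(tree_sum(leaves), 15);
    println!("tree sum ok");
}
```

On the GPU each `while` iteration is one kernel launch over the current layer, so a layer of n points needs about log2(n) launches rather than a serial fold.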
Visibility change
ShardRamConfig fields changed from private to pub(crate) to allow
column map extraction in gpu/chips/shard_ram.rs.
8. SepticCurve — New Math Utilities
File: ceno_zkvm/src/scheme/septic_curve.rs (+307 lines)
New CPU-side math for EC point computation (mirrored in CUDA):
| Function | Purpose |
|---|---|
| `SepticExtension::frobenius()` | Frobenius endomorphism for norm computation |
| `SepticExtension::sqrt()` | Cipolla's algorithm for field square roots |
| `SepticPoint::from_x()` | Lift x-coordinate to curve point (used by nonce-finding loop) |
| `QuadraticExtension<F>` | Auxiliary type for Cipolla's algorithm |
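Cipolla's algorithm is the same technique regardless of field, so a toy version over a small prime field shows the mechanics that `SepticExtension::sqrt()` applies in the degree-7 extension. Everything below (the prime, helper names) is illustrative only:

```rust
const P: u64 = 10007; // small illustrative prime

fn pow_mod(mut b: u64, mut e: u64, m: u64) -> u64 {
    let mut r = 1;
    b %= m;
    while e > 0 {
        if e & 1 == 1 { r = r * b % m; }
        b = b * b % m;
        e >>= 1;
    }
    r
}

fn is_residue(n: u64) -> bool {
    // Euler's criterion
    n == 0 || pow_mod(n, (P - 1) / 2, P) == 1
}

/// Cipolla: find x with x^2 ≡ n (mod P), assuming n is a residue.
fn cipolla_sqrt(n: u64) -> u64 {
    if n == 0 { return 0; }
    // Find a such that a^2 - n is a non-residue; ω^2 = a^2 - n defines
    // the quadratic extension (the QuadraticExtension<F> role above).
    let (mut a, mut w2) = (0, 0);
    for cand in 0..P {
        let t = (cand * cand % P + P - n) % P;
        if !is_residue(t) { a = cand; w2 = t; break; }
    }
    // Compute (a + ω)^((P+1)/2) in F_P[ω]/(ω^2 - w2); the ω part
    // vanishes and the F_P part is the square root.
    let (mut x0, mut x1) = (1u64, 0u64); // accumulator = 1
    let (mut b0, mut b1) = (a, 1u64);    // base = a + ω
    let mut e = (P + 1) / 2;
    while e > 0 {
        if e & 1 == 1 {
            let t0 = (x0 * b0 + x1 * b1 % P * w2) % P;
            let t1 = (x0 * b1 + x1 * b0) % P;
            x0 = t0; x1 = t1;
        }
        let t0 = (b0 * b0 + b1 * b1 % P * w2) % P;
        let t1 = 2 * b0 * b1 % P;
        b0 = t0; b1 = t1;
        e >>= 1;
    }
    x0
}

fn main() {
    let n = 1234 % P;
    let sq = n * n % P;
    let r = cipolla_sqrt(sq);
    assert_eq!(r * r % P, sq);
    println!("sqrt ok");
}
```

The `from_x()` nonce-finding loop uses exactly this primitive: try x-candidates until x³ + ax + b is a square, then take its `sqrt()` to lift to a curve point.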
9. Minor Touches
| File | Change |
|---|---|
| Cargo.toml | gpu feature flag, crossbeam dependency |
| gkr_iop/src/gadgets/is_lt.rs | `AssertLtConfig.0.diff` field access (already pub) |
| gkr_iop/src/utils/lk_multiplicity.rs | Minor: `LkMultiplicity::increment` |
| ceno_zkvm/src/gadgets/signed_ext.rs | `pub(crate) fn msb()` accessor for GPU column map |
| ceno_zkvm/src/gadgets/poseidon2.rs | Column contiguity constants for GPU |
| ceno_zkvm/src/tables/*.rs | `pub(crate)` visibility on config fields for GPU column map access |
| ceno_zkvm/src/scheme/{cpu,gpu,prover,verifier} | Minor plumbing for GPU proving path |
| ceno_host/tests/test_elf.rs | E2E test adjustments |
Summary
| Category | Nature | Risk |
|---|---|---|
| `#[repr(C)]` on emulator types | Layout pinning | Low — additive, but field changes now need CUDA sync |
| `Instruction<E>` trait extensions | Additive (defaults provided) | None — existing chips unaffected |
| Gadget `emit_*` methods | Additive | None — existing `assign_instance` unchanged |
| `ShardContext` new methods | Additive | Low — existing methods unchanged |
| `send()` → `record_send_without_touch()` + `send()` | Rename + split | Low — `send()` still works identically |
| `ShardRamConfig` visibility | private → `pub(crate)` | None |
| Pipeline overlap mode | New default behavior | Medium — CPU witgen + GPU prove on separate threads |
| septic_curve.rs math | Additive | None — new functions, existing unchanged |
hero78119 left a comment:

A quick review regarding the tracer & SortedNextAccesses field
ceno_emul/src/tracer.rs
```rust
mmio_min_max_access: Option<BTreeMap<WordAddr, (WordAddr, WordAddr, WordAddr, WordAddr)>>,
latest_accesses: LatestAccesses,
next_accesses: NextCycleAccess,
next_accesses_vec: Vec<PackedNextAccessEntry>,
```
We can rebuild SortedNextAccesses from the self.next_accesses map, so we can avoid introducing this new vector field and PackedNextAccessEntry.
Rebuilding is slower than this method (roughly 50ms vs 200ms): CENO_NEXT_ACCESS_SOURCE=preflight vs CENO_NEXT_ACCESS_SOURCE=hashmap.
I believe the slowness comes from the sequential traversal of the hashmap; how about par_extend()?
```rust
info_span!("next_access_from_hashmap").in_scope(|| {
    let total_entries: usize =
        addr_future_accesses.values().map(|pairs| pairs.len()).sum();
    let mut entries = Vec::with_capacity(total_entries);
    entries.par_extend(addr_future_accesses.par_iter().flat_map_iter(
        |(cycle, pairs)| {
            pairs.iter().map(move |&(addr, next_cycle)| {
                PackedNextAccessEntry::new(*cycle, addr.0, next_cycle)
            })
        },
    ));
    entries
})
```

If this works, we can remove the vector version from the tracer and only rebuild here, gated by the gpu feature.
```
# using preflight-appended vec (9321311 entries)
┝━ [ceno] preflight-execute [ 7.85s | 6.11% ]
   ┝━ next_access_presort [ 52.6ms | 0.00% / 0.04% ]
   │  ┝━ i [info]: [next-access presort]
   │  ┝━ next_access_par_sort [ 52.5ms | 0.04% ] n: 9321311

# converting from HashMap - serial
┝━ [ceno] preflight-execute [ 7.85s | 6.17% ]
   ┝━ next_access_presort [ 110ms | 0.01% / 0.09% ]
   │  ┝━ next_access_from_hashmap [ 68.4ms | 0.05% ]
   │  ┝━ next_access_par_sort [ 33.1ms | 0.03% ] n: 9321311

# converting from HashMap - parallel
┝━ [ceno] preflight-execute [ 7.76s | 4.49% ]
   ┝━ next_access_presort [ 174ms | 0.00% / 0.10% ]
   │  ┝━ next_access_from_hashmap [ 138ms | 0.08% ]
   │  ┝━ next_access_par_sort [ 28.5ms | 0.02% ] n: 9321311
```
- Each element's work (PackedNextAccessEntry::new) is a few assignments and bit-shifts — too trivial for the parallelism gains to outweigh Rayon's scheduling/synchronization overhead.
- The cost is mainly memory allocation and HashMap random access, which are memory-bandwidth-bound and only get worse with multiple threads contending for cache.
Perhaps we could switch to the rebuild-from-hashmap approach first, without introducing interface changes, and address performance concerns later.
related: #1265
GPU Witness Generation
Accelerate witness generation by offloading computation from CPU to GPU.
This module (ceno_zkvm/src/instructions/gpu/) contains all GPU-side dispatch, caching, and utility code for the witness generation pipeline. The CUDA backend lives in the sibling repo ceno-gpu/ (cuda_hal/src/common/witgen/).

Architecture
Module Layout
Data Flow
Per-Shard Pipeline
Within generate_witness() (e2e.rs), each shard executes:
- Vec<StepRecord> (cached, shared across all chips)
- gpu_fill_witness matches GpuWitgenKind → 22 kernel variants
- ShardContext

GPU/CPU Decision (dispatch.rs)
Keccak Dispatch
Keccak has a dedicated GPU dispatch path (chips/keccak.rs::gpu_assign_keccak_instances), separate from try_gpu_assign_instances, because it uses:
- new_by_rotation
- packed_instances with syscall_witnesses

The LK/shardram collection logic is identical to the standard path.
Lk and Shardram Collection
After GPU computes the witness matrix, LK multiplicities and shard RAM records
are collected through one of several paths (priority order):
- cpu_collect_shardram
- cpu_collect_lk_and_shardram
- assign_instance
- assign_instance

Currently all non-Keccak kinds use Path A. Paths B-E are fallback/debug paths.
E2E Pipeline Modes (e2e.rs)
Environment Variables
- CENO_GPU_ENABLE_WITGEN
- CENO_GPU_DISABLE_WITGEN_KINDS — e.g. add,keccak,lw; falls back to CPU for those chips.
- CENO_GPU_DEBUG_COMPARE_WITGEN

CENO_GPU_DEBUG_COMPARE_WITGEN Coverage

When set, all failures are collected into a DebugCompareReport (thread-local). Detailed mismatches are logged via tracing::error! in real time; at pipeline end assert_debug_compare_report() prints a summary table and panics if any failures exist.

Per-chip (in dispatch.rs, for each opcode circuit):
- debug_compare_final_lk — GPU LK multiplicity vs CPU assign_instance baseline (all 8 lookup tables)
- debug_compare_witness — GPU witness matrix vs CPU witness (element-by-element)
- debug_compare_shardram — GPU shard records (read_records, write_records, addr_accessed) vs CPU
- debug_compare_shard_ec — GPU compact EC records vs CPU-computed EC points (nonce, x[7], y[7])

Per-chip, Keccak-specific (in chips/keccak.rs):
- debug_compare_keccak — Combined witness + LK + shard comparison for keccak's rotation-aware layout

ShardRamCircuit (in chips/shard_ram.rs):
- debug_compare_shard_ram_witness — GPU ShardRam witness vs CPU baseline (from ShardRamInput)
- debug_compare_shard_ram_witness_from_device — GPU ShardRam witness vs CPU baseline (D2H device buffer → convert → CPU assign)

Per-shard, E2E level (in e2e.rs, all chips combined):
- log_shard_ctx_diff — Aggregated addr_accessed comparison (write/read_records skipped when GPU witgen enabled)
- log_combined_lk_diff — Merged LK multiplicities after finalize_lk_multiplicities() (catches cross-chip merge issues)

Tests
79 tests total (`cargo test --features gpu,u16limb_circuit -p ceno_zkvm --lib -- "gpu"`):
- chips/*.rs — 31 via test_colmap! macro + 2 manual
- chips/*.rs — assign_instance (element-by-element witness comparison)
- gpu/mod.rs — collect_lk_and_shardram/collect_shardram vs assign_instance baseline
- utils/mod.rs — LkOp::encode_all() produces correct table/key pairs
- scheme/septic_curve.rs — to_ec_point
- scheme/septic_curve.rs
- scheme/septic_curve.rs — septic_point_from_x vs CPU

Running Tests
Per-Chip Boilerplate Macros
Three macros in instructions.rs reduce per-chip GPU integration to ~3 lines: