Shuffle V2: Coordinated disk-backed shuffle protocol (part one)#2702
jgraettinger wants to merge 8 commits into master from
Conversation
Implementation of the Shuffle V2 gRPC protocol. Also add README for legacy shuffle implementation.
Add various impls for Producer, Clock, and Flag such as custom Debug, Hash, and various helpers. Add sequence() which implements the core logic of sequencing Producer messages, to understand pending versus committed spans and tracked offsets.
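To illustrate the idea (this is a minimal sketch, not the crate's actual `sequence()` implementation, which additionally tracks pending versus committed spans and offsets), per-producer duplicate filtering by clock might look like the following, with `Producer` and `Clock` as hypothetical stand-in types:

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the crate's Producer and Clock types.
type Producer = u64;
type Clock = u64;

/// Track the highest clock seen per producer; a document whose clock does
/// not advance is a duplicate (e.g. from a conservative re-read) or a
/// rollback, which the real implementation detects and handles.
struct Sequencer {
    last_clock: HashMap<Producer, Clock>,
}

impl Sequencer {
    fn new() -> Self {
        Self { last_clock: HashMap::new() }
    }

    /// Returns true if the document should be processed, false if filtered.
    fn sequence(&mut self, producer: Producer, clock: Clock) -> bool {
        match self.last_clock.get(&producer) {
            Some(&last) if clock <= last => false, // duplicate or rollback
            _ => {
                self.last_clock.insert(producer, clock);
                true
            }
        }
    }
}
```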
`Binding` represents extracted shuffle configuration for a single binding of a task. Frontier is a delta of progress available for consumption. It's closely related to the existing Gazette checkpoint concept, but additionally represents "causal" hints that are tracked and resolved internal to the shuffle protocol.
The Session RPC is the entrypoint of a distributed shuffle. It spawns and coordinates subordinate Slice RPCs, assigns journals to slices for reading, and aggregates & reports available progress upwards to the coordinator client.
Slices are a scale-out RPC (one per member) which do the heavy lifting of monitoring journal listings and reading across assigned journals. They sequence documents across their assigned journals, track producer states, and route & enqueue documents to Queue RPCs. They initiate flush barriers which allow the Slice to report deltas of progress upwards to the Session. Each Slice RPC maintains a Queue RPC to each member queue (MxM fan-out).
Queue RPCs fan in across Slice RPCs, receiving Enqueue and Flush requests from each. They merge and order received Enqueues, write to an on-disk queue file (still TODO), and respond to Slice Flush barriers. Queues build out the on-disk queued state which will be dequeued from using the checkpoint presented to the shuffle coordinator by the Session RPC.
This subcommand is for development of the v2 distributed shuffle implementation. It's hidden and not intended as user-facing.
cceb032 to db0c52a
williamhbaker
left a comment
LGTM! Lots to digest here.
Since there is more to come on this and it isn't actually hooked up to anything production-related yet, none of these comments are blocking. The most substantive thing I found was using the flowctl raw shuffle on bindings with suspended journals does not work, so that will need to be taken care of either here or in a follow-up if it makes more sense.
```proto
// available. The client requests a next checkpoint at times of its choosing
// (e.g., after completing processing of the previous checkpoint).
message NextCheckpoint {}

NextCheckpoint next_checkpoint = 4;
```
Is skipping field 3 intentional here?
```rust
}

impl<'p> Verify<'p> {
    #[must_use]
```
Is this #[must_use] redundant for an anyhow::Result return type?
```rust
/// Pre-parsed schema for validation and inference.
pub schema: doc::Validator,
/// True if documents should be validated on read (derivation transforms only).
pub validate_on_read: bool,
```
Just for my own understanding, I wanted to ask about what this means. At a very superficial level my thought is that materializations do (schema) validation on reads as well, but clearly this must be referring to something a little different than that, and I wasn't able to deduce it.
```rust
/// Parse a collection schema bundle into a Validator and inferred Shape.
/// Prefers the read schema; falls back to the write schema.
fn build_schema(
```
nit: I'm noting that we do this pattern of "build schema, validator, and shape" in at least one other place in the code base, and elsewhere too like materialize-kafka. It would be nice to have a helper maybe like doc::validation::build_validator_and_shape(bundle) -> Result<(Validator, Shape)> for this repeated pattern.
```rust
} = added;

let binding = &self.bindings[binding as usize];
let journal_name = spec.as_ref().map(|s| s.name.as_str()).unwrap_or("");
```
Would it make sense to error here, instead of using an empty journal name? My understanding is that it shouldn't be possible to not have a name here.
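A hedged sketch of the suggested change, using a standalone stand-in for the journal spec type (the real code would use the broker's `JournalSpec` and the crate's error type):

```rust
// Hypothetical stand-in for the broker's JournalSpec.
struct JournalSpec {
    name: String,
}

/// Return the journal name, or an error when the spec is absent —
/// rather than silently substituting an empty name.
fn journal_name(spec: Option<&JournalSpec>) -> Result<&str, String> {
    spec.map(|s| s.name.as_str())
        .ok_or_else(|| "journal listing entry is missing its spec".to_string())
}
```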
```rust
block: false,
do_not_proxy: true,
header,
..Default::default()
```
Does setting the metadata_only flag here do any good? Looking at the broker code here I suppose the req.EndOffset != 0 && resp.Offset >= req.EndOffset would always end up being true with a request offset of -1 so perhaps it is redundant.
```rust
}

/// Probe the current write head of a journal via a non-blocking read at offset -1.
/// Returns `(write_head, header)`. JournalNotFound yields `(0, None)`.
```
The "JournalNotFound yields (0, None)" part from the comment doesn't seem to track, unless I'm missing something: Seems like it would hit the map_read_error and return an Err(<something>).
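One way the doc comment could be made to hold, sketched with stand-in types (the real code would match on `gazette::Error` rather than this hypothetical `ReadError`):

```rust
// Hypothetical stand-ins for the broker read result types.
#[derive(Debug, PartialEq)]
enum ReadError {
    JournalNotFound,
    Other(String),
}

#[derive(Debug, PartialEq)]
struct Header(u64);

/// Map a probe result so that JournalNotFound becomes a write head of 0
/// with no header, while other errors propagate to the caller.
fn map_probe(
    result: Result<(i64, Header), ReadError>,
) -> Result<(i64, Option<Header>), ReadError> {
    match result {
        Ok((offset, header)) => Ok((offset, Some(header))),
        Err(ReadError::JournalNotFound) => Ok((0, None)),
        Err(err) => Err(err),
    }
}
```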
```rust
/// Returns the completed frontier when all members have flushed.
pub fn on_flushed(&mut self, member_index: usize) -> Option<crate::Frontier> {
    let Some(in_flight) = self.in_flight.get_mut(member_index) else {
        return None;
```
Is this a normally possible condition to happen? I'm thinking it would take some kind of application error, for a flushed to be sent twice for the same member, and we might want to error instead of silently dropping it.
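A minimal sketch of the error-instead-of-drop variant the comment suggests, using a hypothetical barrier type (the real `on_flushed` returns an `Option<Frontier>` rather than this boolean):

```rust
/// Tracks which members still have a flush in flight; a duplicate or
/// out-of-range flush is surfaced as an error instead of silently ignored.
struct FlushBarrier {
    pending: Vec<bool>, // true while a member's flush is still in flight
}

impl FlushBarrier {
    fn new(members: usize) -> Self {
        Self { pending: vec![true; members] }
    }

    /// Ok(true) once every member has flushed exactly once.
    fn on_flushed(&mut self, member_index: usize) -> Result<bool, String> {
        match self.pending.get_mut(member_index) {
            None => Err(format!("member {member_index} is out of range")),
            Some(false) => Err(format!("member {member_index} flushed twice")),
            Some(p) => {
                *p = false;
                Ok(self.pending.iter().all(|in_flight| !*in_flight))
            }
        }
    }
}
```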
```rust
Ok(Binding {
    index: 0,
    filter_r_clocks: false,
    journal_read_suffix: String::new(), // No suffix for ad-hoc reads.
```
Discussed in VC and will be taken care of in a later PR, but just noting here:
For an ad-hoc collection read with flowctl raw shuffle --name=<collection_name>, I found this not to work with an error of Journal.metadata path segment: invalid length (0; expected 1 <= length <= 512). If I put a dummy suffix here though, it worked.
```rust
    attempt,
    inner: err,
}) => match err {
    gazette::Error::BrokerStatus(broker::Status::JournalNotFound) => {
```
I tested flowctl raw shuffle on a materialization task reading from collections with suspended journals, and got an error unexpected broker status: Suspended. I think there will need to be some additional handling for suspended journals.
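One possible shape of that handling, sketched with hypothetical stand-in types — whether a suspended journal should be treated like an absent one for probing purposes, or should instead trigger a resume, is a design decision for the follow-up:

```rust
// Hypothetical stand-ins for broker status classification.
#[derive(Debug, PartialEq)]
enum Status {
    JournalNotFound,
    Suspended,
    Other(String),
}

#[derive(Debug, PartialEq)]
enum ProbeOutcome {
    /// Journal has no readable content right now; treat the write head as 0.
    Empty,
    /// Unrecoverable; surface to the caller.
    Fail(String),
}

/// Classify Suspended like JournalNotFound for probing purposes, instead
/// of failing with "unexpected broker status: Suspended".
fn classify(status: Status) -> ProbeOutcome {
    match status {
        Status::JournalNotFound | Status::Suspended => ProbeOutcome::Empty,
        Status::Other(msg) => ProbeOutcome::Fail(msg),
    }
}
```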
Summary
- `crates/shuffle/`, a new Rust crate implementing the V2 shuffle protocol as a three-level RPC hierarchy (Session → Slice → Queue) that replaces the in-memory, per-shard `go/shuffle/` system
- `go/protocols/shuffle/shuffle.proto`, defining the wire protocol and generated Rust/gRPC bindings
- `flowctl raw shuffle`, a development client that runs the full pipeline locally against a live collection

Motivation
The legacy `go/shuffle/` system has scaling and correctness limitations that block further progress. The V2 design addresses them: documents are routed to on-disk queue files (eliminating memory pressure), a single Session coordinates a unified checkpoint frontier across all members, and M + M² streams replace M×N (110 streams for 10 members, regardless of journal count).
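As a quick check of the stream-count arithmetic behind the 110-stream figure: with M members, V2 opens M Slice streams plus an M×M Slice→Queue fan-out, independent of journal count, while the legacy design's stream count grows with the number of journals N.

```rust
/// Streams in the legacy design: every member reads every journal.
fn legacy_streams(members: u64, journals: u64) -> u64 {
    members * journals
}

/// Streams in V2: one Slice per member, plus one Queue stream from each
/// Slice to each member (M×M fan-out); journal count does not matter.
fn v2_streams(members: u64) -> u64 {
    members + members * members
}
```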
Architecture
Session — Coordinates the shuffle. Receives journal discoveries from Slices, assigns journal reads via range-overlap routing, pulls progress deltas, aggregates them into a four-stage checkpoint pipeline (progressed → unresolved → ready → consumed), and serves `NextCheckpoint` to the Coordinator.

Slice — Reads assigned journals, sequences documents with per-producer clock tracking (filtering duplicates from conservative reads and at-least-once delivery), orders them by priority then adjusted clock via a max-heap, routes each document to the owning Queue(s) by key hash and r-clock, and autonomously flushes Queues after observed commits.
Queue — Receives routed documents from all Slices, merges them into a single ordered stream via a priority heap (best-effort read leveling), and writes to disk. Responds to Flush once all preceding documents are durable.
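The ordering both the Slice and Queue heaps use — priority descending, then adjusted clock ascending — can be sketched with std's `BinaryHeap` (a max-heap) and a hypothetical entry type; field names here are illustrative, not the crate's actual types:

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

// Hypothetical document entry; field names are illustrative.
#[derive(Eq, PartialEq)]
struct Entry {
    priority: u32,
    adjusted_clock: u64,
}

impl Ord for Entry {
    fn cmp(&self, other: &Self) -> Ordering {
        // Higher priority first; within a priority, earlier (lower)
        // adjusted clock first. BinaryHeap pops the maximum, so the
        // clock comparison is reversed.
        self.priority
            .cmp(&other.priority)
            .then_with(|| other.adjusted_clock.cmp(&self.adjusted_clock))
    }
}

impl PartialOrd for Entry {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

/// Pop entries in shuffle order: priority DESC, adjusted clock ASC.
fn drain_order(entries: Vec<(u32, u64)>) -> Vec<(u32, u64)> {
    let mut heap: BinaryHeap<Entry> = entries
        .into_iter()
        .map(|(priority, adjusted_clock)| Entry { priority, adjusted_clock })
        .collect();
    let mut out = Vec::new();
    while let Some(e) = heap.pop() {
        out.push((e.priority, e.adjusted_clock));
    }
    out
}
```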
What's in this PR
Protocol (`shuffle.proto`)

- RPC services `Session`, `Slice`, and `Queue`
- Frontier messages (`FrontierChunk`, `JournalFrontier`, `ProducerFrontier`) with delta-encoded journal names
- `Enqueue` carries routed documents with priority, adjusted clock, packed key, and archived document bytes

`crates/shuffle/`

- `binding`: Extracts and validates per-binding shuffle configuration from task specs (derivation transforms, materialization bindings, ad-hoc reads). Assigns cohorts by unique `(priority, read_delay)` tuples.
- `frontier`: Sorted frontier data structure with `reduce` (sorted merge), `resolve_hints` (causal hint resolution for cross-journal transactions), `project_unresolved_hints` (recovery projection), and chunked `Drain` for streaming over gRPC.
- `session/`: Handler opens Slice RPCs and reads the resume checkpoint. Actor runs the `select!` loop dispatching Session and Slice messages. State implements `Topology` (routing journal reads to members) and `CheckpointPipeline` (four-stage state machine that gates on causal hint resolution).
- `slice/`: Handler opens Queue RPCs. Actor runs the `select!` loop over listing watches, journal probes, journal reads, and the ready-read heap. State implements `FlushState` (autonomous flush pipelining), `ProgressState` (pull-based progress reporting), and `sequence_document` (per-producer duplicate filtering with rollback detection). Sub-modules: `listing` (Gazette journal watch), `producer` (`ProducerState` with dual settled/pending maps), `routing` (r-clock rotation, key hash + range routing), `read` (`ReadState` with journal probing), `heap` (max-heap by priority DESC, adjusted clock ASC).
- `queue/`: Handler coordinates `QueueJoin` synchronization (defers actor spawn until all Slices connect). Actor runs the `select!` loop merging Enqueues from all Slices. Heap matches Slice ordering for best-effort read leveling.
- `service`: gRPC server entry point with peer channel caching, `spawn_{session,slice,queue}`, and HTTP/2 keep-alive tuning.

Changes to existing crates
- `proto-gazette/uuid`: `Ord`/`Hash` on `Producer` (for use as map keys with passthrough hashers), `Clock` arithmetic and accessors, custom `Debug` impls for readability
- `proto-flow`: Generated Rust bindings and serde impls for `shuffle.proto`
- `proto-grpc`: Generated gRPC client/server stubs for the three shuffle RPCs
- `labels/partition`: `decode_field_range` now takes explicit field names and validates coverage (needed for shuffle key partition field extraction)
- `doc/validation`: Custom `Debug` on `Validator` showing schema URI instead of full schema

Development tooling
- `flowctl raw shuffle`: Runs the full Session → Slice → Queue pipeline locally against a live production collection, useful for end-to-end validation

Not yet implemented
- `uses_lambda` is extracted, but actual lambda-based key extraction is not yet wired (and may be removed eventually? TBD).
Extensive new unit tests of business logic, which is generally extracted into pure functions and state machines wherever possible.
E2E test via `RUST_LOG=shuffle=info cargo run -p flowctl -- raw shuffle --name <collection> --members 3`, which runs the full pipeline against a live collection.

A future PR will introduce integration tests which exercise shuffled reads using real Etcd & Gazette, as we have with the legacy `go/shuffle/` package.