Fix Shard Iterator logic#33

Draft
colmsnowplow wants to merge 28 commits into master from fix-shiterator-logic

Conversation

@colmsnowplow

This PR fixes a critical issue with the shard iterator type configuration introduced in release/1.7.0.

The problem:

Previously, we handled the configuration of the iterator type in the shard consumer. What we didn't realise is that this simple implementation uses the configured iterator type for every new shard - even shards created during a scaling action.

So, any data for a new shard that arrived before we began consuming would be lost.

This problem is demonstrated by the test changes in the first three commits (TestSplit and TestShardsMerged). For iterator type TRIM_HORIZON they pass; for LATEST they lose data where they shouldn't.

The solution:

The original kinsumer implementation had a mechanism whereby we could use LATEST by amending the DDB table to have a sequenceNumber of "LATEST". I reverted the consumer logic back to this mechanism, and then implemented a "bootstrap" mechanism.

It works as follows:

  • If no shards are registered (in the metadata table), and LATEST is configured, we need to bootstrap.
  • We bootstrap by adding records to the checkpoints table for all shards found in kinesis.
  • If they're open, mark them with "LATEST"; if they're closed, mark them as finished (checkpoints table).
  • If it looks like another client is updating checkpoints, wait until it stops (if it stops before completing all shards, take over).
  • Once we have bootstrapped, register the shards (metadata table).
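The open/closed marking described above can be sketched as follows. This is a minimal illustration, not kinsumer's actual API: the `shard` and `checkpoint` types and the `bootstrapLatest` function are simplified stand-ins for the kinesis shard listing and the DDB checkpoints table.

```go
package main

import "fmt"

// Hypothetical stand-ins for the kinesis shard list and a checkpoints
// table row; field names here are illustrative, not kinsumer's schema.
type shard struct {
	ID   string
	Open bool // false once the shard has been closed by a split/merge
}

type checkpoint struct {
	SequenceNumber string
	Finished       bool
}

// bootstrapLatest seeds the checkpoints table for every shard found in
// kinesis: open shards get the sentinel "LATEST" sequence number, closed
// shards are marked finished. In the PR this only runs when no shards are
// registered yet and the configured iterator type is LATEST.
func bootstrapLatest(shards []shard, checkpoints map[string]checkpoint) {
	for _, s := range shards {
		if s.Open {
			checkpoints[s.ID] = checkpoint{SequenceNumber: "LATEST"}
		} else {
			checkpoints[s.ID] = checkpoint{Finished: true}
		}
	}
}

func main() {
	shards := []shard{
		{ID: "shardId-000", Open: false}, // closed by an earlier split
		{ID: "shardId-001", Open: true},
		{ID: "shardId-002", Open: true},
	}
	checkpoints := map[string]checkpoint{}
	bootstrapLatest(shards, checkpoints)

	for _, s := range shards {
		cp := checkpoints[s.ID]
		fmt.Printf("%s seq=%q finished=%v\n", s.ID, cp.SequenceNumber, cp.Finished)
	}
}
```

Because closed shards are marked finished up front, the leader never hands them out for consumption, and open shards start from LATEST as configured.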

Some details/considerations:

Clients use the metadata register to get the list of shards to allocate for consumption. This is normally managed by leader actions, which are now paused until we have registered the shards.

Checkpoints are used by the leader to determine when a shard is finished, and by clients to determine where in a shard to start.

I attempted to use batch writes to avoid spamming DDB when there are a lot of shards, but contention cannot be avoided (even by putting all the logic in leader actions), so I reverted to single conditional writes and added a mechanism to reduce the scope for contention.
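The single-conditional-write idea can be illustrated with an in-memory stand-in. This is a sketch of the semantics only - the `table` type below is a hypothetical analogue of a DDB conditional put guarded by something like attribute_not_exists, not kinsumer's actual DDB code.

```go
package main

import "fmt"

// table is an in-memory stand-in for the checkpoints table.
type table struct {
	items map[string]string
}

// conditionalPut mimics a DDB PutItem with a "not exists" condition: it
// returns false (a conditional check failure) when another client has
// already written the item, so each shard's bootstrap record is written
// exactly once even with many clients racing.
func (t *table) conditionalPut(key, value string) bool {
	if _, exists := t.items[key]; exists {
		return false
	}
	t.items[key] = value
	return true
}

func main() {
	t := &table{items: map[string]string{}}

	// Two clients race to bootstrap the same shard; only one write lands,
	// and the loser can simply move on to the next shard.
	first := t.conditionalPut("shardId-001", "LATEST")
	second := t.conditionalPut("shardId-001", "LATEST")
	fmt.Println(first, second) // true false
}
```

Unlike a batch write, a failed conditional put is cheap to skip past, which is what keeps the contention bounded when many shards exist.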

* Add iteratorType option

* Use ktypes enum for iterator type

* Make default latest, solidify safety

* Fix shard consumer tests

@oo-sleeper Mikhail (oo-sleeper) left a comment


got a few comments

Comment on lines +305 to +306
// - if shard iterator is TRIM_HORIZON: use the full list of shard IDs, and don't do the bootstrap
// - if shard iterator is LATEST: use only open shard IDs, and do the bootstrap


Should we expand the comment to explain each point further, given that we don't actually check the iterator type in this function?

Author


Actually this comment should've been removed - it was a TODO reminder that I missed 🤦

return nil, fmt.Errorf("error loading shard IDs from kinesis: %v", err)
}

if k.config.iteratorType != ktypes.ShardIteratorTypeLatest {


So for iteratorType TRIM_HORIZON and AT_TIMESTAMP we would return all shards?

Is that correct behaviour? Or is AT_TIMESTAMP no longer a valid option and would be rejected?

Author


Correct the first time: for both TRIM_HORIZON and AT_TIMESTAMP we don't bootstrap. We preserve the previous behaviour, which is to proceed to process all shards.
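The branch being discussed can be summed up in a small sketch. The constants below are stand-ins for the SDK's shard iterator type values (the quoted code uses ktypes.ShardIteratorTypeLatest); only the decision shape is taken from the PR.

```go
package main

import "fmt"

// Illustrative stand-ins for the SDK's shard iterator type enum values.
const (
	iteratorLatest      = "LATEST"
	iteratorTrimHorizon = "TRIM_HORIZON"
	iteratorAtTimestamp = "AT_TIMESTAMP"
)

// needsBootstrap mirrors the branch above: only LATEST triggers the
// bootstrap path; TRIM_HORIZON and AT_TIMESTAMP keep the previous
// behaviour of returning and processing all shards.
func needsBootstrap(iteratorType string) bool {
	return iteratorType == iteratorLatest
}

func main() {
	for _, it := range []string{iteratorLatest, iteratorTrimHorizon, iteratorAtTimestamp} {
		fmt.Printf("%s bootstrap=%v\n", it, needsBootstrap(it))
	}
}
```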

Comment on lines +618 to +619
assert.Greater(t, foundBefore2, 0)
assert.Equal(t, foundBefore2, numberOfEventsToTest)


I believe assert.Greater(t, foundBefore2, 0) can go; the expectation seems to be that foundBefore2 equals numberOfEventsToTest.

Comment on lines +863 to +883
// // DEBUG: Print checkpoints table contents after client initialization
// checkpoints, err := loadCheckpoints(d, clients[0].checkpointTableName)
// require.NoError(t, err, "Error loading checkpoints for debugging")
// t.Logf("=== CHECKPOINTS AFTER CLIENT INITIALIZATION ===")
// for shardID, checkpoint := range checkpoints {
// 	seqNum := "nil"
// 	if checkpoint.SequenceNumber != nil {
// 		seqNum = *checkpoint.SequenceNumber
// 	}
// 	finished := "nil"
// 	if checkpoint.Finished != nil {
// 		finished = fmt.Sprintf("%d", *checkpoint.Finished)
// 	}
// 	ownerID := "nil"
// 	if checkpoint.OwnerID != nil {
// 		ownerID = *checkpoint.OwnerID
// 	}
// 	t.Logf("Shard %s: SequenceNumber=%s, Finished=%s, OwnerID=%s",
// 		shardID, seqNum, finished, ownerID)
// }
// t.Logf("=== END CHECKPOINTS DEBUG ===")


believe we don't need it now

Author


We don't. But in this project there are a few places in tests where I've left commented-out debugging code, because when we need to figure out a problem it can be burdensome to come up with ways to explore the mechanics. This is one of those, but I should've added a comment to say so.

@colmsnowplow
Author

Converting to draft - as discussed, we'll take another approach in the short term and revisit this idea later.

@colmsnowplow colmsnowplow marked this pull request as draft September 10, 2025 10:09
Base automatically changed from refactor-consumer to release/1.7.0 September 10, 2025 13:01
Base automatically changed from release/1.7.0 to master September 12, 2025 09:53