<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Distributed Systems - Better Dev</title>
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700;800&display=swap" rel="stylesheet">
<link rel="stylesheet" href="style.css">
</head>
<body>
<header class="topbar">
<button class="sidebar-toggle" aria-label="Open navigation" aria-expanded="false">
<span class="hamburger-icon"></span>
</button>
<a href="index.html" class="logo">Better Dev</a>
</header>
<div class="sidebar-backdrop" aria-hidden="true"></div>
<aside class="sidebar" aria-label="Site navigation">
<div class="sidebar-header">
<span class="sidebar-title">Navigation</span>
<button class="sidebar-close" aria-label="Close navigation">×</button>
</div>
<div class="sidebar-search">
<input type="text" class="sidebar-search-input" placeholder="Search topics..." aria-label="Search topics">
<div class="sidebar-search-results"></div>
</div>
<nav class="sidebar-nav">
<div class="sidebar-group"><a href="index.html">Home</a></div>
<div class="sidebar-group">
<div class="sidebar-group-label">Mathematics</div>
<a href="pre-algebra.html">Pre-Algebra</a>
<a href="algebra.html">Algebra</a>
<a href="sequences-series.html">Sequences & Series</a>
<a href="geometry.html">Geometry</a>
<a href="calculus.html">Calculus</a>
<a href="discrete-math.html">Discrete Math</a>
<a href="linear-algebra.html">Linear Algebra</a>
<a href="probability.html">Probability & Statistics</a>
<a href="binary-systems.html">Binary & Number Systems</a>
<a href="number-theory.html">Number Theory for CP</a>
<a href="computational-geometry.html">Computational Geometry</a>
<a href="game-theory.html">Game Theory</a>
</div>
<div class="sidebar-group">
<div class="sidebar-group-label">Data Structures & Algorithms</div>
<a href="dsa-foundations.html">DSA Foundations</a>
<a href="arrays.html">Arrays & Strings</a>
<a href="stacks-queues.html">Stacks & Queues</a>
<a href="hashmaps.html">Hash Maps & Sets</a>
<a href="linked-lists.html">Linked Lists</a>
<a href="trees.html">Trees & BST</a>
<a href="graphs.html">Graphs</a>
<a href="sorting.html">Sorting & Searching</a>
<a href="patterns.html">LeetCode Patterns</a>
<a href="dp.html">Dynamic Programming</a>
<a href="advanced.html">Advanced Topics</a>
<a href="string-algorithms.html">String Algorithms</a>
<a href="advanced-graphs.html">Advanced Graphs</a>
<a href="advanced-dp.html">Advanced DP</a>
<a href="advanced-ds.html">Advanced Data Structures</a>
<a href="leetcode-650.html">The 650 Problems</a>
<a href="competitive-programming.html">CP Roadmap</a>
</div>
<div class="sidebar-group">
<div class="sidebar-group-label">Languages & Systems</div>
<a href="cpp.html">C++</a>
<a href="golang.html">Go</a>
<a href="javascript.html">JavaScript Deep Dive</a>
<a href="typescript.html">TypeScript</a>
<a href="nodejs.html">Node.js Internals</a>
<a href="os.html">Operating Systems</a>
<a href="linux.html">Linux</a>
<a href="git.html">Git</a>
<a href="backend.html">Backend</a>
<a href="system-design.html">System Design</a>
<a href="networking.html">Networking</a>
<a href="cloud.html">Cloud & Infrastructure</a>
<a href="docker.html">Docker & Compose</a>
<a href="kubernetes.html">Kubernetes</a>
<a href="message-queues.html">Queues & Pub/Sub</a>
<a href="selfhosting.html">VPS & Self-Hosting</a>
<a href="databases.html">PostgreSQL & MySQL</a>
<a href="stripe.html">Stripe & Payments</a>
<a href="distributed-systems.html">Distributed Systems</a>
<a href="backend-engineering.html">Backend Engineering</a>
</div>
<div class="sidebar-group">
<div class="sidebar-group-label">JS/TS Ecosystem</div>
<a href="js-tooling.html">Tooling & Bundlers</a>
<a href="js-testing.html">Testing</a>
<a href="ts-projects.html">Building with TS</a>
</div>
<div class="sidebar-group">
<div class="sidebar-group-label">More</div>
<a href="seans-brain.html">Sean's Brain</a>
</div>
</nav>
</aside>
<div class="container">
<div class="page-header">
<div class="breadcrumb"><a href="index.html">Home</a> / Distributed Systems</div>
<h1>Distributed Systems</h1>
<p>Everything you need to understand, build, and debug distributed systems -- from CAP theorem and consensus algorithms to distributed caching and observability. With real code in Go and Node.js, ASCII diagrams, and battle-tested patterns used at companies running thousands of nodes.</p>
</div>
<div class="toc">
<h4>Table of Contents</h4>
<a href="#why">1. Why Distributed Systems</a>
<a href="#fundamentals">2. Fundamentals -- Time & Ordering</a>
<a href="#consistency">3. Consistency Models</a>
<a href="#consensus">4. Consensus Algorithms</a>
<a href="#replication">5. Replication</a>
<a href="#partitioning">6. Partitioning & Sharding</a>
<a href="#distributed-tx">7. Distributed Transactions</a>
<a href="#gossip">8. Gossip Protocols</a>
<a href="#load-balancing">9. Load Balancing</a>
<a href="#caching">10. Distributed Caching</a>
<a href="#rate-limiting">11. Rate Limiting</a>
<a href="#idempotency">12. Idempotency</a>
<a href="#observability">13. Observability</a>
<a href="#practice">14. Practice & Resources</a>
</div>
<!-- ============================================================ -->
<!-- SECTION 1: WHY DISTRIBUTED SYSTEMS -->
<!-- ============================================================ -->
<section id="why">
<h2>1. Why Distributed Systems</h2>
<p>A distributed system is a collection of independent computers that appears to its users as a single coherent system. You need them when a single machine cannot handle the load, when you need fault tolerance, or when your users are spread across the globe and need low latency.</p>
<h3>The CAP Theorem</h3>
<p>Formulated by Eric Brewer in 2000 and proven by Gilbert and Lynch in 2002. In any distributed data store, you can only guarantee two of the following three properties simultaneously:</p>
<div class="formula-box">
<div class="label">CAP Theorem</div>
<strong>C</strong>onsistency -- Every read receives the most recent write or an error<br>
<strong>A</strong>vailability -- Every request receives a non-error response (no guarantee it's the most recent write)<br>
<strong>P</strong>artition Tolerance -- The system continues to operate despite arbitrary message loss between nodes
</div>
<p>In practice, network partitions <em>will</em> happen. So the real choice is between <strong>CP</strong> (consistency + partition tolerance) and <strong>AP</strong> (availability + partition tolerance).</p>
<pre><code><span class="lang-label">ASCII</span>
<span class="comment"> Consistency (C)</span>
<span class="comment"> /\</span>
<span class="comment"> / \</span>
<span class="comment"> / \</span>
<span class="comment"> / CP \</span>
<span class="comment"> / zone \</span>
<span class="comment"> /----------\</span>
<span class="comment"> / CAN'T \</span>
<span class="comment"> / HAVE ALL \</span>
<span class="comment"> / THREE \</span>
<span class="comment"> /___________________\</span>
<span class="comment"> Availability (A) ---------- Partition Tolerance (P)</span>
<span class="comment"> AP zone</span>
<span class="comment"></span>
<span class="comment"> CP systems: ZooKeeper, HBase, MongoDB (default), etcd, Consul</span>
<span class="comment"> AP systems: Cassandra, DynamoDB, CouchDB, Riak</span>
<span class="comment"> CA systems: Traditional RDBMS (single node -- no partition tolerance)</span></code></pre>
<div class="warning-box">
<div class="label">CAP is Nuanced</div>
<p>CAP is not a binary toggle. Real systems make trade-offs on a spectrum. Many databases let you tune consistency per-query. Cassandra, for example, lets you set consistency level to QUORUM for strong reads and ONE for fast reads. The PACELC theorem extends CAP: even when there is no Partition, you still trade off between Latency and Consistency.</p>
</div>
<h3>The 8 Fallacies of Distributed Computing</h3>
<p>Peter Deutsch drafted the first seven fallacies, and James Gosling added the eighth. They are assumptions that developers new to distributed systems mistakenly make:</p>
<div class="example-box">
<div class="label">The 8 Fallacies</div>
<p><strong>1. The network is reliable</strong> -- Packets get dropped, connections reset, cables get cut.</p>
<p><strong>2. Latency is zero</strong> -- A cross-continent round trip is ~150ms. That adds up fast.</p>
<p><strong>3. Bandwidth is infinite</strong> -- Serialization, large payloads, and chatty protocols kill throughput.</p>
<p><strong>4. The network is secure</strong> -- Every hop is an attack surface. mTLS exists for a reason.</p>
<p><strong>5. Topology doesn't change</strong> -- Nodes join, leave, and move. Auto-scaling is normal.</p>
<p><strong>6. There is one administrator</strong> -- Multiple teams, orgs, and cloud providers manage pieces.</p>
<p><strong>7. Transport cost is zero</strong> -- Serialization/deserialization, DNS lookups, TLS handshakes all cost time and CPU.</p>
<p><strong>8. The network is homogeneous</strong> -- Different hardware, OS versions, protocol versions everywhere.</p>
</div>
<p>Every design decision you make in a distributed system should account for these realities. If your code assumes any of these, it <em>will</em> break in production.</p>
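<p>Fallacy #1 alone justifies defensive client code: any remote call can fail and must be retried carefully. Below is a minimal sketch of retry with capped exponential backoff and full jitter; the function names and constants are illustrative, not a real library API.</p>

```javascript
// Sketch: retry a flaky remote call with capped exponential backoff + full jitter.
// Jitter spreads retries out so a fleet of clients doesn't retry in lockstep.
function backoffDelay(attempt, baseMs = 100, capMs = 10000) {
  // Full jitter: uniform random delay in [0, min(cap, base * 2^attempt))
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * ceiling);
}

async function withRetry(fn, maxAttempts = 5) {
  let lastErr;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      // Wait before the next attempt; delay grows exponentially, capped.
      await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt)));
    }
  }
  throw lastErr; // all attempts exhausted
}
```

<p>Real clients should also distinguish retryable errors (timeouts, 503s) from permanent ones (400s), and pair retries with idempotency keys -- covered in section 12.</p>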
</section>
<!-- ============================================================ -->
<!-- SECTION 2: FUNDAMENTALS -->
<!-- ============================================================ -->
<section id="fundamentals">
<h2>2. Fundamentals -- Time & Ordering</h2>
<p>In a distributed system, there is no global clock. Each node has its own clock that drifts independently. We need logical clocks to establish ordering of events.</p>
<h3>Happens-Before Relation</h3>
<p>Defined by Leslie Lamport in 1978. Event A <em>happens-before</em> event B (written A → B) if:</p>
<div class="formula-box">
<div class="label">Happens-Before Rules</div>
1. A and B are events in the same process, and A comes before B<br>
2. A is a send event and B is the corresponding receive event<br>
3. Transitivity: if A → B and B → C, then A → C<br><br>
If neither A → B nor B → A, the events are <strong>concurrent</strong> (A || B)
</div>
<h3>Lamport Timestamps</h3>
<p>Each process maintains a counter. On local event: increment. On send: increment and attach. On receive: set to max(local, received) + 1. If A → B then L(A) < L(B), but the converse is NOT true.</p>
<pre><code><span class="lang-label">Go</span>
<span class="keyword">package</span> main
<span class="keyword">import</span> (
<span class="string">"fmt"</span>
<span class="string">"sync"</span>
)
<span class="keyword">type</span> LamportClock <span class="keyword">struct</span> {
mu sync.Mutex
counter <span class="keyword">uint64</span>
}
<span class="keyword">func</span> <span class="function">NewLamportClock</span>() *LamportClock {
<span class="keyword">return</span> &LamportClock{counter: <span class="number">0</span>}
}
<span class="comment">// Tick increments the clock for a local event</span>
<span class="keyword">func</span> (lc *LamportClock) <span class="function">Tick</span>() <span class="keyword">uint64</span> {
lc.mu.Lock()
<span class="keyword">defer</span> lc.mu.Unlock()
lc.counter++
<span class="keyword">return</span> lc.counter
}
<span class="comment">// Send increments and returns timestamp to attach to message</span>
<span class="keyword">func</span> (lc *LamportClock) <span class="function">Send</span>() <span class="keyword">uint64</span> {
<span class="keyword">return</span> lc.<span class="function">Tick</span>()
}
<span class="comment">// Receive updates clock based on incoming message timestamp</span>
<span class="keyword">func</span> (lc *LamportClock) <span class="function">Receive</span>(msgTimestamp <span class="keyword">uint64</span>) <span class="keyword">uint64</span> {
lc.mu.Lock()
<span class="keyword">defer</span> lc.mu.Unlock()
<span class="keyword">if</span> msgTimestamp > lc.counter {
lc.counter = msgTimestamp
}
lc.counter++
<span class="keyword">return</span> lc.counter
}
<span class="keyword">func</span> <span class="function">main</span>() {
nodeA := <span class="function">NewLamportClock</span>()
nodeB := <span class="function">NewLamportClock</span>()
t1 := nodeA.<span class="function">Tick</span>() <span class="comment">// A: local event, t=1</span>
sendT := nodeA.<span class="function">Send</span>() <span class="comment">// A: send msg, t=2</span>
recvT := nodeB.<span class="function">Receive</span>(sendT) <span class="comment">// B: receive, t=max(0,2)+1=3</span>
fmt.<span class="function">Printf</span>(<span class="string">"A local: %d, A send: %d, B recv: %d\n"</span>, t1, sendT, recvT)
}</code></pre>
<h3>Vector Clocks</h3>
<p>Lamport timestamps cannot detect concurrency. Vector clocks solve this: each node maintains a vector of counters, one per node. Now V(A) < V(B) implies A → B, and incomparable vectors mean concurrent events.</p>
<pre><code><span class="lang-label">Go</span>
<span class="keyword">type</span> VectorClock <span class="keyword">struct</span> {
nodeID <span class="keyword">string</span>
clock <span class="keyword">map</span>[<span class="keyword">string</span>]<span class="keyword">uint64</span>
}
<span class="keyword">func</span> <span class="function">NewVectorClock</span>(nodeID <span class="keyword">string</span>) *VectorClock {
<span class="keyword">return</span> &VectorClock{
nodeID: nodeID,
clock: <span class="keyword">map</span>[<span class="keyword">string</span>]<span class="keyword">uint64</span>{nodeID: <span class="number">0</span>},
}
}
<span class="keyword">func</span> (vc *VectorClock) <span class="function">Tick</span>() {
vc.clock[vc.nodeID]++
}
<span class="keyword">func</span> (vc *VectorClock) <span class="function">Send</span>() <span class="keyword">map</span>[<span class="keyword">string</span>]<span class="keyword">uint64</span> {
vc.<span class="function">Tick</span>()
<span class="comment">// Return a copy</span>
cp := <span class="builtin">make</span>(<span class="keyword">map</span>[<span class="keyword">string</span>]<span class="keyword">uint64</span>)
<span class="keyword">for</span> k, v := <span class="keyword">range</span> vc.clock {
cp[k] = v
}
<span class="keyword">return</span> cp
}
<span class="keyword">func</span> (vc *VectorClock) <span class="function">Receive</span>(other <span class="keyword">map</span>[<span class="keyword">string</span>]<span class="keyword">uint64</span>) {
<span class="keyword">for</span> k, v := <span class="keyword">range</span> other {
<span class="keyword">if</span> v > vc.clock[k] {
vc.clock[k] = v
}
}
vc.clock[vc.nodeID]++
}
<span class="comment">// HappensBefore returns true if vc happened before other</span>
<span class="keyword">func</span> (vc *VectorClock) <span class="function">HappensBefore</span>(other *VectorClock) <span class="keyword">bool</span> {
atLeastOneLess := <span class="keyword">false</span>
<span class="keyword">for</span> k, v := <span class="keyword">range</span> vc.clock {
ov := other.clock[k]
<span class="keyword">if</span> v > ov {
<span class="keyword">return</span> <span class="keyword">false</span>
}
<span class="keyword">if</span> v < ov {
atLeastOneLess = <span class="keyword">true</span>
}
}
<span class="keyword">for</span> k, ov := <span class="keyword">range</span> other.clock {
<span class="keyword">if</span> _, exists := vc.clock[k]; !exists && ov > <span class="number">0</span> {
atLeastOneLess = <span class="keyword">true</span>
}
}
<span class="keyword">return</span> atLeastOneLess
}</code></pre>
<div class="tip-box">
<div class="label">When to Use Which</div>
<p>Use <strong>Lamport timestamps</strong> when you only need a total ordering (e.g., log sequencing). Use <strong>vector clocks</strong> when you need to detect concurrent writes (e.g., conflict detection in Dynamo-style databases). Vector clocks grow with the number of nodes -- in large clusters, consider <strong>dotted version vectors</strong> or <strong>interval tree clocks</strong> instead.</p>
</div>
</section>
<!-- ============================================================ -->
<!-- SECTION 3: CONSISTENCY MODELS -->
<!-- ============================================================ -->
<section id="consistency">
<h2>3. Consistency Models</h2>
<p>Consistency models define what guarantees a distributed system gives about the order and visibility of operations. From strongest to weakest:</p>
<pre><code><span class="lang-label">ASCII</span>
<span class="comment"> Strongest Weakest</span>
<span class="comment"> | |</span>
<span class="comment"> v v</span>
<span class="comment"> Linearizable > Sequential > Causal > PRAM > Eventual</span>
<span class="comment"></span>
<span class="comment"> More consistency = more coordination = higher latency</span>
<span class="comment"> Less consistency = less coordination = higher throughput</span></code></pre>
<h3>Linearizability (Strong Consistency)</h3>
<p>The gold standard. Every operation appears to take effect atomically at some point between its invocation and response. Acts as if there is a single copy of the data.</p>
<div class="formula-box">
<div class="label">Linearizability Guarantee</div>
For any two operations op1 and op2:<br>
If op1 completes before op2 starts → op1's effect is visible to op2<br>
Real-time ordering is preserved.
</div>
<h3>Sequential Consistency</h3>
<p>All processes see the same order of operations, but that order does not need to match real-time. Operations from a single process appear in program order. Weaker than linearizability because cross-process real-time ordering is not guaranteed.</p>
<h3>Causal Consistency</h3>
<p>If operation A causally precedes B (A → B), then all nodes see A before B. Concurrent operations can appear in any order. This captures the intuition of "if you saw it, your response must come after it."</p>
<h3>Eventual Consistency</h3>
<p>If no new writes are made, eventually all replicas converge to the same value. No ordering guarantees during updates. Used by DNS, Cassandra (with low consistency level), and most caching systems.</p>
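<p>Convergence under eventual consistency is typically driven by an anti-entropy process: replicas periodically exchange state and apply a deterministic merge function, so every replica ends up with the same record. A toy sketch using last-writer-wins timestamps (the names and record shape are illustrative; real systems must also handle timestamp ties and clock skew):</p>

```javascript
// Sketch: anti-entropy with a last-writer-wins merge.
// Each replica holds { value, ts }; the merge deterministically keeps
// the record with the higher timestamp, so merging is order-independent.
function merge(a, b) {
  return b.ts > a.ts ? { ...b } : { ...a };
}

// Simulate gossip rounds completing: fold all replica states together,
// then install the winning record everywhere.
function antiEntropy(replicas) {
  const latest = replicas.reduce(merge);
  return replicas.map(() => ({ ...latest }));
}
```
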
<div class="example-box">
<div class="label">Real-World Consistency Choices</div>
<p><strong>Banking (account balance)</strong>: Linearizable -- you cannot show a wrong balance.</p>
<p><strong>Social media (like count)</strong>: Eventual -- off by a few for a second is acceptable.</p>
<p><strong>Chat messages (ordering)</strong>: Causal -- replies must appear after the message they reply to.</p>
<p><strong>Shopping cart</strong>: Eventual with conflict resolution -- Amazon's Dynamo paper pioneered this.</p>
</div>
<pre><code><span class="lang-label">Node.js</span>
<span class="comment">// Demonstrating consistency issues with a naive distributed counter</span>
<span class="comment">// Two nodes both read, increment, write -- lost update problem</span>
<span class="keyword">class</span> <span class="function">DistributedCounter</span> {
<span class="function">constructor</span>(replicas) {
<span class="keyword">this</span>.replicas = replicas; <span class="comment">// Map of nodeId -> value</span>
}
<span class="comment">// Eventual consistency: read from any replica</span>
<span class="function">readEventual</span>(nodeId) {
<span class="keyword">return</span> <span class="keyword">this</span>.replicas.get(nodeId) || <span class="number">0</span>;
}
<span class="comment">// Strong consistency: read from quorum (majority)</span>
<span class="function">readStrong</span>() {
<span class="keyword">const</span> values = [...<span class="keyword">this</span>.replicas.values()];
<span class="keyword">const</span> majority = Math.<span class="function">floor</span>(values.length / <span class="number">2</span>) + <span class="number">1</span>;
<span class="comment">// In real systems, use read quorum + write quorum > N</span>
<span class="keyword">const</span> counts = <span class="keyword">new</span> <span class="function">Map</span>();
<span class="keyword">for</span> (<span class="keyword">const</span> v <span class="keyword">of</span> values) {
counts.<span class="function">set</span>(v, (counts.<span class="function">get</span>(v) || <span class="number">0</span>) + <span class="number">1</span>);
<span class="keyword">if</span> (counts.<span class="function">get</span>(v) >= majority) <span class="keyword">return</span> v;
}
<span class="keyword">return</span> values[<span class="number">0</span>]; <span class="comment">// fallback</span>
}
}
<span class="comment">// Quorum formula: R + W > N guarantees overlap</span>
<span class="comment">// N=3 replicas, W=2 (write quorum), R=2 (read quorum)</span>
<span class="comment">// 2 + 2 = 4 > 3 -- at least one node has the latest write</span></code></pre>
</section>
<!-- ============================================================ -->
<!-- SECTION 4: CONSENSUS -->
<!-- ============================================================ -->
<section id="consensus">
<h2>4. Consensus Algorithms</h2>
<p>Consensus is the problem of getting multiple nodes to agree on a value. It is the foundation of replicated state machines, leader election, and distributed locks.</p>
<h3>Raft -- Understandable Consensus</h3>
<p>Designed by Diego Ongaro and John Ousterhout in 2014 explicitly for understandability. Used in etcd, Consul, CockroachDB, and TiKV.</p>
<pre><code><span class="lang-label">ASCII</span>
<span class="comment"> Raft Node States:</span>
<span class="comment"></span>
<span class="comment"> +----------+ timeout +-----------+ wins election +--------+</span>
<span class="comment"> | Follower | ---------------> | Candidate | -----------------> | Leader |</span>
<span class="comment"> +----------+ +-----------+ +--------+</span>
<span class="comment"> ^ | |</span>
<span class="comment"> | | discovers higher term |</span>
<span class="comment"> | v |</span>
<span class="comment"> +-------- discovers higher term / loses election <-----------+</span>
<span class="comment"></span>
<span class="comment"> Term: monotonically increasing logical clock. Each term has at most one leader.</span></code></pre>
<h4>Leader Election</h4>
<p>Followers have a randomized election timeout (e.g., 150-300ms). If no heartbeat from leader within that timeout, the follower becomes a candidate, increments its term, votes for itself, and requests votes from all other nodes. A candidate wins if it gets votes from a majority.</p>
<div class="formula-box">
<div class="label">Raft Election Rules</div>
1. Each node votes for at most one candidate per term<br>
2. Candidate must have a log at least as up-to-date as the voter<br>
3. Majority (N/2 + 1) votes needed to win<br>
4. If split vote, timeout and try again with new term
</div>
<pre><code><span class="lang-label">Go</span>
<span class="keyword">type</span> NodeState <span class="keyword">int</span>
<span class="keyword">const</span> (
Follower NodeState = <span class="keyword">iota</span>
Candidate
Leader
)
<span class="keyword">type</span> RaftNode <span class="keyword">struct</span> {
mu sync.Mutex
id <span class="keyword">int</span>
state NodeState
currentTerm <span class="keyword">int</span>
votedFor <span class="keyword">int</span> <span class="comment">// -1 means no vote this term</span>
log []LogEntry
commitIndex <span class="keyword">int</span>
peers []<span class="keyword">int</span>
}
<span class="keyword">type</span> LogEntry <span class="keyword">struct</span> {
Term <span class="keyword">int</span>
Command <span class="keyword">interface</span>{}
}
<span class="keyword">type</span> RequestVoteArgs <span class="keyword">struct</span> {
Term <span class="keyword">int</span>
CandidateID <span class="keyword">int</span>
LastLogIndex <span class="keyword">int</span>
LastLogTerm <span class="keyword">int</span>
}
<span class="keyword">type</span> RequestVoteReply <span class="keyword">struct</span> {
Term <span class="keyword">int</span>
VoteGranted <span class="keyword">bool</span>
}
<span class="keyword">func</span> (rn *RaftNode) <span class="function">RequestVote</span>(args *RequestVoteArgs, reply *RequestVoteReply) {
rn.mu.Lock()
<span class="keyword">defer</span> rn.mu.Unlock()
reply.Term = rn.currentTerm
reply.VoteGranted = <span class="keyword">false</span>
<span class="comment">// Rule 1: reject if candidate's term is old</span>
<span class="keyword">if</span> args.Term < rn.currentTerm {
<span class="keyword">return</span>
}
<span class="comment">// Step down if we see a higher term</span>
<span class="keyword">if</span> args.Term > rn.currentTerm {
rn.currentTerm = args.Term
rn.state = Follower
rn.votedFor = -<span class="number">1</span>
reply.Term = rn.currentTerm <span class="comment">// report the updated term, not the stale one</span>
}
<span class="comment">// Rule 2: only vote if we haven't voted or already voted for this candidate</span>
<span class="keyword">if</span> rn.votedFor == -<span class="number">1</span> || rn.votedFor == args.CandidateID {
<span class="comment">// Rule 3: candidate's log must be at least as up-to-date</span>
lastIdx := <span class="builtin">len</span>(rn.log) - <span class="number">1</span>
lastTerm := <span class="number">0</span>
<span class="keyword">if</span> lastIdx >= <span class="number">0</span> {
lastTerm = rn.log[lastIdx].Term
}
<span class="keyword">if</span> args.LastLogTerm > lastTerm ||
(args.LastLogTerm == lastTerm && args.LastLogIndex >= lastIdx) {
reply.VoteGranted = <span class="keyword">true</span>
rn.votedFor = args.CandidateID
}
}
}</code></pre>
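<p>The randomized election timeout described above can be sketched as a small timer class: a follower resets it on every heartbeat, and if it ever fires, the follower starts an election. The class and constant names here are illustrative.</p>

```javascript
// Sketch: Raft-style randomized election timeout.
// Randomization makes split votes unlikely -- nodes rarely time out together.
class ElectionTimer {
  constructor(onTimeout, minMs = 150, maxMs = 300) {
    this.onTimeout = onTimeout; // called when no heartbeat arrives in time
    this.minMs = minMs;
    this.maxMs = maxMs;
    this.timer = null;
  }
  randomTimeout() {
    // Uniform in [minMs, maxMs)
    return this.minMs + Math.floor(Math.random() * (this.maxMs - this.minMs));
  }
  reset() {
    // Called on every AppendEntries heartbeat from the leader.
    clearTimeout(this.timer);
    this.timer = setTimeout(this.onTimeout, this.randomTimeout());
  }
  stop() {
    clearTimeout(this.timer);
  }
}
```
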
<h4>Log Replication</h4>
<p>Once elected, the leader accepts client requests, appends them to its log, and replicates them to followers via AppendEntries RPCs. An entry is committed once a majority of nodes have it. Committed entries are applied to the state machine.</p>
<pre><code><span class="lang-label">ASCII</span>
<span class="comment"> Log Replication Flow:</span>
<span class="comment"></span>
<span class="comment"> Client --> Leader: "SET x=5"</span>
<span class="comment"> |</span>
<span class="comment"> v</span>
<span class="comment"> Leader appends to local log: [term=3, SET x=5]</span>
<span class="comment"> |</span>
<span class="comment"> +-------+-------+</span>
<span class="comment"> | | |</span>
<span class="comment"> v v v</span>
<span class="comment"> Follower1 Follower2 Follower3 (AppendEntries RPC)</span>
<span class="comment"> | | |</span>
<span class="comment"> v v v</span>
<span class="comment"> ACK ACK ACK</span>
<span class="comment"> | | |</span>
<span class="comment"> +-------+-------+</span>
<span class="comment"> |</span>
<span class="comment"> v</span>
<span class="comment"> Leader: 3/4 nodes have entry (majority!) --> COMMIT</span>
<span class="comment"> Leader applies to state machine and responds to client</span>
<span class="comment"> Next heartbeat tells followers to commit too</span></code></pre>
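<p>The "commit on majority" step can be sketched as a pure function: given each follower's highest replicated index, the new commit index is the largest index present on a majority of the cluster, leader included. Names follow the Raft paper; a real implementation must additionally check that the entry at that index is from the current term before committing.</p>

```javascript
// Sketch: how a Raft leader advances commitIndex.
// matchIndex[i] = highest log index known to be replicated on follower i.
function computeCommitIndex(matchIndex, leaderLastIndex, clusterSize) {
  // Candidate indices: every follower's matchIndex plus the leader's own log.
  const indices = [...matchIndex, leaderLastIndex].sort((a, b) => b - a);
  const majority = Math.floor(clusterSize / 2) + 1;
  // In descending order, the (majority-1)-th entry is replicated on
  // at least `majority` nodes.
  // (Raft also requires log[commitIndex].term === currentTerm; omitted here.)
  return indices[majority - 1];
}
```

<p>With 4 nodes (leader + 3 followers, as in the diagram), an entry at index 5 commits as soon as 2 followers acknowledge it: 3 of 4 nodes is a majority.</p>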
<h3>Paxos -- Simplified</h3>
<p>Paxos is the original consensus algorithm, devised by Leslie Lamport in the late 1980s (the paper was not published until 1998). It is notoriously hard to understand. The basic idea has three roles: Proposers, Acceptors, and Learners.</p>
<div class="example-box">
<div class="label">Paxos in Two Phases</div>
<p><strong>Phase 1 (Prepare):</strong> Proposer picks a proposal number N. Sends Prepare(N) to a majority of acceptors. Acceptors promise not to accept proposals with number < N. If they already accepted a value, they return it.</p>
<p><strong>Phase 2 (Accept):</strong> If proposer gets promises from majority, it sends Accept(N, value) where value is either the highest-numbered previously accepted value, or the proposer's own value. Acceptors accept if they haven't promised a higher number.</p>
<p><strong>Result:</strong> Once a majority of acceptors accept a value, consensus is reached. Learners are notified.</p>
</div>
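<p>The acceptor side of the two phases above fits in a few lines of state-machine logic. This is a sketch of a single acceptor only (the message shapes and names are illustrative); a complete Paxos also needs the proposer's majority-counting logic and Multi-Paxos for a sequence of values.</p>

```javascript
// Sketch: one Paxos acceptor, implementing the two phases above.
class Acceptor {
  constructor() {
    this.promisedN = -1;     // highest proposal number promised
    this.acceptedN = -1;     // proposal number of the accepted value, if any
    this.acceptedValue = null;
  }
  // Phase 1: Prepare(N) -- promise to ignore proposals numbered below N.
  prepare(n) {
    if (n > this.promisedN) {
      this.promisedN = n;
      // Report any previously accepted value so the proposer
      // is forced to carry it forward instead of its own value.
      return { ok: true, acceptedN: this.acceptedN, acceptedValue: this.acceptedValue };
    }
    return { ok: false };
  }
  // Phase 2: Accept(N, value) -- accept unless a higher N was promised.
  accept(n, value) {
    if (n >= this.promisedN) {
      this.promisedN = n;
      this.acceptedN = n;
      this.acceptedValue = value;
      return { ok: true };
    }
    return { ok: false };
  }
}
```
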
<div class="tip-box">
<div class="label">Raft vs Paxos</div>
<p>Use Raft. It was designed to be understandable and produces equivalent results. Paxos is a family of protocols (Basic, Multi, Fast, Cheap, etc.) and implementing it correctly is extremely difficult. Every production Paxos implementation deviates from the paper significantly. Raft gives you a clear spec to implement against.</p>
</div>
</section>
<!-- ============================================================ -->
<!-- SECTION 5: REPLICATION -->
<!-- ============================================================ -->
<section id="replication">
<h2>5. Replication</h2>
<p>Replication keeps copies of data on multiple nodes for fault tolerance and read throughput. The fundamental challenge: keeping replicas consistent when writes happen.</p>
<h3>Single-Leader Replication</h3>
<p>One node is the leader (also called primary or master). All writes go to the leader, which streams changes to followers via a replication log. Reads can be served by any replica.</p>
<pre><code><span class="lang-label">ASCII</span>
<span class="comment"> Writes ----> [Leader] ----> replication log ----+----+----+</span>
<span class="comment"> | | | |</span>
<span class="comment"> v v v v</span>
<span class="comment"> [Follower 1] [F2] [F3] [F4]</span>
<span class="comment"> ^</span>
<span class="comment"> |</span>
<span class="comment"> Reads <---------+ (reads from any replica)</span>
<span class="comment"></span>
<span class="comment"> + Simple, no write conflicts</span>
<span class="comment"> - Single point of failure for writes</span>
<span class="comment"> - Replication lag = stale reads from followers</span></code></pre>
<h3>Multi-Leader Replication</h3>
<p>Multiple nodes accept writes. Used in multi-datacenter setups. Each datacenter has a local leader. The big problem: write conflicts when two leaders accept conflicting writes.</p>
<h3>Leaderless Replication</h3>
<p>Any replica can accept writes. Used by Dynamo, Cassandra, and Riak. Clients write to multiple replicas and read from multiple replicas, using quorums (W + R > N) so that every read overlaps at least one replica holding the latest write.</p>
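<p>The quorum rule reduces to a one-line check, plus a resolver that picks the newest version among the replies to a quorum read (the Versioned type here is an illustrative assumption):</p>

```go
// Quorum math for leaderless replication: with N replicas, a write
// acknowledged by W nodes and a read from R nodes share at least one
// up-to-date replica whenever W + R > N.
func quorumOverlaps(n, w, r int) bool {
	return w+r > n
}

// Versioned is an illustrative value carrying a per-key version number.
type Versioned struct {
	Value   string
	Version uint64
}

// resolveRead picks the newest value among R replica responses.
func resolveRead(responses []Versioned) Versioned {
	best := responses[0]
	for _, v := range responses[1:] {
		if v.Version > best.Version {
			best = v
		}
	}
	return best
}
```

<p>Typical settings are N=3, W=2, R=2: one replica can be down and both reads and writes still succeed.</p>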
<h3>Conflict Resolution</h3>
<pre><code><span class="lang-label">Node.js</span>
<span class="comment">// Last-Writer-Wins (LWW) -- simple but loses data</span>
<span class="keyword">function</span> <span class="function">resolveConflictLWW</span>(versions) {
<span class="keyword">return</span> versions.<span class="function">reduce</span>((latest, v) =>
v.timestamp > latest.timestamp ? v : latest
);
}
<span class="comment">// Merge function -- application-specific, preserves data</span>
<span class="keyword">function</span> <span class="function">resolveShoppingCart</span>(v1, v2) {
<span class="comment">// Union of items from both carts</span>
<span class="keyword">const</span> merged = <span class="keyword">new</span> <span class="function">Map</span>();
<span class="keyword">for</span> (<span class="keyword">const</span> item <span class="keyword">of</span> [...v1.items, ...v2.items]) {
<span class="keyword">const</span> existing = merged.<span class="function">get</span>(item.id);
<span class="keyword">if</span> (!existing || item.quantity > existing.quantity) {
merged.<span class="function">set</span>(item.id, item);
}
}
<span class="keyword">return</span> { items: [...merged.<span class="function">values</span>()] };
}</code></pre>
<h3>CRDTs -- Conflict-Free Replicated Data Types</h3>
<p>Data structures that can be replicated across nodes and updated independently without coordination. They mathematically guarantee convergence. Two types: state-based (CvRDT) and operation-based (CmRDT).</p>
<pre><code><span class="lang-label">Go</span>
<span class="comment">// G-Counter: a grow-only counter CRDT</span>
<span class="comment">// Each node increments its own slot. Total = sum of all slots.</span>
<span class="keyword">type</span> GCounter <span class="keyword">struct</span> {
counts <span class="keyword">map</span>[<span class="keyword">string</span>]<span class="keyword">uint64</span> <span class="comment">// nodeID -> count</span>
nodeID <span class="keyword">string</span>
}
<span class="keyword">func</span> <span class="function">NewGCounter</span>(nodeID <span class="keyword">string</span>) *GCounter {
<span class="keyword">return</span> &GCounter{
counts: <span class="keyword">map</span>[<span class="keyword">string</span>]<span class="keyword">uint64</span>{nodeID: <span class="number">0</span>},
nodeID: nodeID,
}
}
<span class="keyword">func</span> (gc *GCounter) <span class="function">Increment</span>() {
gc.counts[gc.nodeID]++
}
<span class="keyword">func</span> (gc *GCounter) <span class="function">Value</span>() <span class="keyword">uint64</span> {
<span class="keyword">var</span> total <span class="keyword">uint64</span>
<span class="keyword">for</span> _, v := <span class="keyword">range</span> gc.counts {
total += v
}
<span class="keyword">return</span> total
}
<span class="comment">// Merge takes the max of each node's counter</span>
<span class="keyword">func</span> (gc *GCounter) <span class="function">Merge</span>(other *GCounter) {
<span class="keyword">for</span> k, v := <span class="keyword">range</span> other.counts {
<span class="keyword">if</span> v > gc.counts[k] {
gc.counts[k] = v
}
}
}
<span class="comment">// PN-Counter: supports both increment and decrement</span>
<span class="comment">// Uses two G-Counters: one for increments, one for decrements</span>
<span class="keyword">type</span> PNCounter <span class="keyword">struct</span> {
pos *GCounter
neg *GCounter
}
<span class="keyword">func</span> (pn *PNCounter) <span class="function">Increment</span>() { pn.pos.<span class="function">Increment</span>() }
<span class="keyword">func</span> (pn *PNCounter) <span class="function">Decrement</span>() { pn.neg.<span class="function">Increment</span>() }
<span class="keyword">func</span> (pn *PNCounter) <span class="function">Value</span>() <span class="keyword">int64</span> {
<span class="keyword">return</span> <span class="keyword">int64</span>(pn.pos.<span class="function">Value</span>()) - <span class="keyword">int64</span>(pn.neg.<span class="function">Value</span>())
}</code></pre>
<div class="tip-box">
<div class="label">CRDTs in Production</div>
<p>Redis Enterprise offers CRDT-based active-active geo-replication. Riak uses CRDTs for maps, sets, counters, and flags. Automerge and Yjs are CRDT libraries for collaborative editing (Google Docs-style). CRDTs trade space (metadata grows over time) for coordination-freedom.</p>
</div>
</section>
<!-- ============================================================ -->
<!-- SECTION 6: PARTITIONING -->
<!-- ============================================================ -->
<section id="partitioning">
<h2>6. Partitioning & Sharding</h2>
<p>When data is too large for one node, you split it across multiple nodes. Each piece is a <strong>partition</strong> (or shard). The key question: which data goes to which node?</p>
<h3>Range Partitioning</h3>
<p>Assign contiguous ranges of keys to each partition. Good for range queries but risks hot spots (e.g., all recent timestamps hit the same partition).</p>
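<p>A lookup sketch: with sorted split points, the owning partition is found by binary search (the boundary values are illustrative):</p>

```go
import "sort"

// Range partitioning sketch: split points divide the key space into
// numbered partitions. Partition i holds keys from split point i-1
// (inclusive) up to split point i (exclusive).
func rangePartition(key string, boundaries []string) int {
	// The partition index equals the number of split points <= key.
	return sort.Search(len(boundaries), func(i int) bool {
		return boundaries[i] > key
	})
}
```

<p>Range queries then touch only the partitions whose ranges intersect the query, which is the main advantage over hashing.</p>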
<h3>Hash Partitioning</h3>
<p>Hash the key and assign based on hash value. Distributes data evenly but range queries require scatter-gather.</p>
<pre><code><span class="lang-label">Go</span>
<span class="keyword">import</span> (
<span class="string">"crypto/sha256"</span>
<span class="string">"encoding/binary"</span>
<span class="string">"fmt"</span>
)
<span class="keyword">func</span> <span class="function">hashPartition</span>(key <span class="keyword">string</span>, numPartitions <span class="keyword">int</span>) <span class="keyword">int</span> {
h := sha256.<span class="function">Sum256</span>([]<span class="keyword">byte</span>(key))
num := binary.BigEndian.<span class="function">Uint64</span>(h[:8])
<span class="keyword">return</span> <span class="keyword">int</span>(num % <span class="keyword">uint64</span>(numPartitions))
}
<span class="keyword">func</span> <span class="function">main</span>() {
keys := []<span class="keyword">string</span>{<span class="string">"user:1001"</span>, <span class="string">"user:1002"</span>, <span class="string">"user:1003"</span>, <span class="string">"order:5001"</span>}
<span class="keyword">for</span> _, k := <span class="keyword">range</span> keys {
fmt.<span class="function">Printf</span>(<span class="string">"Key %s -> Partition %d\n"</span>, k, <span class="function">hashPartition</span>(k, <span class="number">4</span>))
}
}</code></pre>
<h3>Consistent Hashing</h3>
<p>The problem with simple hash partitioning: when you add or remove a node, almost ALL keys get reassigned. Consistent hashing fixes this -- only K/N keys move on average (K = total keys, N = number of nodes).</p>
<pre><code><span class="lang-label">ASCII</span>
<span class="comment"> Consistent Hashing Ring:</span>
<span class="comment"></span>
<span class="comment"> 0 / 2^32</span>
<span class="comment"> |</span>
<span class="comment"> Node C | Node A</span>
<span class="comment"> \ | /</span>
<span class="comment"> \ | /</span>
<span class="comment"> *----+----*</span>
<span class="comment"> / \</span>
<span class="comment"> / \</span>
<span class="comment"> ------ ------</span>
<span class="comment"> | | | |</span>
<span class="comment"> | | | |</span>
<span class="comment"> ------ ------</span>
<span class="comment"> \ /</span>
<span class="comment"> \ /</span>
<span class="comment"> *---------*</span>
<span class="comment"> / | \</span>
<span class="comment"> / | \</span>
<span class="comment"> Node B | Node D</span>
<span class="comment"> |</span>
<span class="comment"> 2^31</span>
<span class="comment"></span>
<span class="comment"> Each key is hashed onto the ring.</span>
<span class="comment"> Walk clockwise to find the first node -- that's the owner.</span>
<span class="comment"> Virtual nodes: each physical node gets multiple positions on the ring</span>
<span class="comment"> for better balance.</span></code></pre>
<pre><code><span class="lang-label">Go</span>
<span class="keyword">import</span> (
<span class="string">"crypto/sha256"</span>
<span class="string">"encoding/binary"</span>
<span class="string">"fmt"</span>
<span class="string">"sort"</span>
)
<span class="keyword">type</span> ConsistentHash <span class="keyword">struct</span> {
ring []<span class="keyword">uint64</span>
nodeMap <span class="keyword">map</span>[<span class="keyword">uint64</span>]<span class="keyword">string</span>
vnodes <span class="keyword">int</span> <span class="comment">// virtual nodes per physical node</span>
}
<span class="keyword">func</span> <span class="function">NewConsistentHash</span>(vnodes <span class="keyword">int</span>) *ConsistentHash {
<span class="keyword">return</span> &ConsistentHash{
nodeMap: <span class="builtin">make</span>(<span class="keyword">map</span>[<span class="keyword">uint64</span>]<span class="keyword">string</span>),
vnodes: vnodes,
}
}
<span class="keyword">func</span> (ch *ConsistentHash) <span class="function">hashKey</span>(key <span class="keyword">string</span>) <span class="keyword">uint64</span> {
h := sha256.<span class="function">Sum256</span>([]<span class="keyword">byte</span>(key))
<span class="keyword">return</span> binary.BigEndian.<span class="function">Uint64</span>(h[:8])
}
<span class="keyword">func</span> (ch *ConsistentHash) <span class="function">AddNode</span>(node <span class="keyword">string</span>) {
<span class="keyword">for</span> i := <span class="number">0</span>; i < ch.vnodes; i++ {
vkey := fmt.<span class="function">Sprintf</span>(<span class="string">"%s-vnode-%d"</span>, node, i)
hash := ch.<span class="function">hashKey</span>(vkey)
ch.ring = <span class="builtin">append</span>(ch.ring, hash)
ch.nodeMap[hash] = node
}
sort.<span class="function">Slice</span>(ch.ring, <span class="keyword">func</span>(i, j <span class="keyword">int</span>) <span class="keyword">bool</span> {
<span class="keyword">return</span> ch.ring[i] < ch.ring[j]
})
}
<span class="keyword">func</span> (ch *ConsistentHash) <span class="function">GetNode</span>(key <span class="keyword">string</span>) <span class="keyword">string</span> {
hash := ch.<span class="function">hashKey</span>(key)
idx := sort.<span class="function">Search</span>(<span class="builtin">len</span>(ch.ring), <span class="keyword">func</span>(i <span class="keyword">int</span>) <span class="keyword">bool</span> {
<span class="keyword">return</span> ch.ring[i] >= hash
})
<span class="keyword">if</span> idx == <span class="builtin">len</span>(ch.ring) {
idx = <span class="number">0</span> <span class="comment">// wrap around the ring</span>
}
<span class="keyword">return</span> ch.nodeMap[ch.ring[idx]]
}</code></pre>
<div class="warning-box">
<div class="label">Rebalancing is Expensive</div>
<p>Even with consistent hashing, moving data between nodes takes time and bandwidth. Plan for it: use partition counts larger than your node count so you can move whole partitions. Kafka uses this approach -- partitions are the unit of rebalancing, not individual keys.</p>
</div>
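<p>The fixed-partition idea can be sketched in a few lines -- the partition count is chosen up front and only the partition-to-node map changes when nodes join or leave (the count and node names are illustrative):</p>

```go
import "hash/fnv"

// Sketch of the fixed-partition approach: hash keys into a partition
// count chosen up front (far larger than the node count in practice),
// then map whole partitions to nodes. Rebalancing changes only the
// partition-to-node map; a key's partition never changes.
const numPartitions = 12 // fixed for the cluster's lifetime (illustrative)

func partitionOf(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % numPartitions)
}

// assignPartitions spreads partitions round-robin across the current nodes.
func assignPartitions(nodes []string) map[int]string {
	owner := make(map[int]string, numPartitions)
	for p := 0; p < numPartitions; p++ {
		owner[p] = nodes[p%len(nodes)]
	}
	return owner
}
```

<p>Adding a node means recomputing the map and shipping the reassigned partitions wholesale, which is far simpler to operate than per-key migration.</p>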
</section>
<!-- ============================================================ -->
<!-- SECTION 7: DISTRIBUTED TRANSACTIONS -->
<!-- ============================================================ -->
<section id="distributed-tx">
<h2>7. Distributed Transactions</h2>
<p>How do you ensure atomicity when a transaction spans multiple services or partitions? Single-node ACID does not help you here.</p>
<h3>Two-Phase Commit (2PC)</h3>
<pre><code><span class="lang-label">ASCII</span>
<span class="comment"> Phase 1: PREPARE</span>
<span class="comment"> Coordinator ----Prepare----> Participant A --> "Yes, I can commit"</span>
<span class="comment"> ----Prepare----> Participant B --> "Yes, I can commit"</span>
<span class="comment"> ----Prepare----> Participant C --> "Yes, I can commit"</span>
<span class="comment"></span>
<span class="comment"> Phase 2: COMMIT (if all said yes)</span>
<span class="comment"> Coordinator ----Commit-----> Participant A --> ACK</span>
<span class="comment"> ----Commit-----> Participant B --> ACK</span>
<span class="comment"> ----Commit-----> Participant C --> ACK</span>
<span class="comment"></span>
<span class="comment"> If ANY participant says "No" in Phase 1:</span>
<span class="comment"> Coordinator ----Abort------> All Participants</span>
<span class="comment"></span>
<span class="comment"> Problem: If coordinator crashes after Phase 1, participants are STUCK</span>
<span class="comment"> waiting (blocking protocol). This is why 2PC is rarely used across services.</span></code></pre>
<h3>Sagas -- Compensating Transactions</h3>
<p>Instead of a distributed transaction, break it into a sequence of local transactions. Each step has a compensating action that undoes it if a later step fails.</p>
<pre><code><span class="lang-label">Node.js</span>
<span class="comment">// Saga: Order Processing</span>
<span class="comment">// Step 1: Reserve inventory</span>
<span class="comment">// Step 2: Charge payment</span>
<span class="comment">// Step 3: Ship order</span>
<span class="comment">// If step 3 fails, compensate: refund payment, release inventory</span>
<span class="keyword">class</span> <span class="function">SagaOrchestrator</span> {
<span class="function">constructor</span>() {
<span class="keyword">this</span>.steps = [];
<span class="keyword">this</span>.completedSteps = [];
}
<span class="function">addStep</span>(name, execute, compensate) {
<span class="keyword">this</span>.steps.<span class="function">push</span>({ name, execute, compensate });
}
<span class="keyword">async</span> <span class="function">run</span>(context) {
<span class="keyword">for</span> (<span class="keyword">const</span> step <span class="keyword">of</span> <span class="keyword">this</span>.steps) {
<span class="keyword">try</span> {
console.<span class="function">log</span>(<span class="string">`Executing: ${step.name}`</span>);
<span class="keyword">const</span> result = <span class="keyword">await</span> step.<span class="function">execute</span>(context);
context[step.name] = result;
<span class="keyword">this</span>.completedSteps.<span class="function">push</span>(step);
} <span class="keyword">catch</span> (err) {
console.<span class="function">log</span>(<span class="string">`Failed: ${step.name} -- ${err.message}`</span>);
<span class="keyword">await</span> <span class="keyword">this</span>.<span class="function">compensate</span>(context);
<span class="keyword">throw</span> err;
}
}
<span class="keyword">return</span> context;
}
<span class="keyword">async</span> <span class="function">compensate</span>(context) {
<span class="comment">// Compensate in reverse order</span>
<span class="keyword">for</span> (<span class="keyword">const</span> step <span class="keyword">of</span> <span class="keyword">this</span>.completedSteps.<span class="function">reverse</span>()) {
<span class="keyword">try</span> {
console.<span class="function">log</span>(<span class="string">`Compensating: ${step.name}`</span>);
<span class="keyword">await</span> step.<span class="function">compensate</span>(context);
} <span class="keyword">catch</span> (err) {
console.<span class="function">error</span>(<span class="string">`Compensation failed for ${step.name}: ${err.message}`</span>);
<span class="comment">// Log for manual intervention -- compensations MUST eventually succeed</span>
}
}
}
}
<span class="comment">// Usage</span>
<span class="keyword">const</span> saga = <span class="keyword">new</span> <span class="function">SagaOrchestrator</span>();
saga.<span class="function">addStep</span>(
<span class="string">'reserveInventory'</span>,
<span class="keyword">async</span> (ctx) => { <span class="comment">/* call inventory service */</span> <span class="keyword">return</span> { reservationId: <span class="string">'abc'</span> }; },
<span class="keyword">async</span> (ctx) => { <span class="comment">/* release reservation ctx.reserveInventory.reservationId */</span> }
);
saga.<span class="function">addStep</span>(
<span class="string">'chargePayment'</span>,
<span class="keyword">async</span> (ctx) => { <span class="comment">/* call payment service */</span> <span class="keyword">return</span> { chargeId: <span class="string">'ch_123'</span> }; },
<span class="keyword">async</span> (ctx) => { <span class="comment">/* refund ctx.chargePayment.chargeId */</span> }
);
saga.<span class="function">addStep</span>(
<span class="string">'shipOrder'</span>,
<span class="keyword">async</span> (ctx) => { <span class="comment">/* call shipping service */</span> <span class="keyword">return</span> { trackingId: <span class="string">'trk_789'</span> }; },
<span class="keyword">async</span> (ctx) => { <span class="comment">/* cancel shipment */</span> }
);</code></pre>
<h3>Transactional Outbox Pattern</h3>
<p>When you need to update a database AND publish an event atomically. Write both to the same database in one transaction. A separate process reads the outbox table and publishes events.</p>
<pre><code><span class="lang-label">Go</span>
<span class="comment">// Transactional Outbox: write order + event in one DB transaction</span>
<span class="keyword">func</span> <span class="function">CreateOrderWithOutbox</span>(db *sql.DB, order Order) <span class="keyword">error</span> {
tx, err := db.<span class="function">Begin</span>()
<span class="keyword">if</span> err != <span class="keyword">nil</span> {
<span class="keyword">return</span> err
}
<span class="keyword">defer</span> tx.<span class="function">Rollback</span>()
<span class="comment">// 1. Insert the order</span>
_, err = tx.<span class="function">Exec</span>(
<span class="string">"INSERT INTO orders (id, user_id, total) VALUES ($1, $2, $3)"</span>,
order.ID, order.UserID, order.Total,
)
<span class="keyword">if</span> err != <span class="keyword">nil</span> {
<span class="keyword">return</span> err
}
<span class="comment">// 2. Insert event into outbox table (same transaction!)</span>
eventPayload, _ := json.<span class="function">Marshal</span>(order)
_, err = tx.<span class="function">Exec</span>(
<span class="string">"INSERT INTO outbox (id, event_type, payload, created_at) VALUES ($1, $2, $3, NOW())"</span>,
uuid.<span class="function">New</span>(), <span class="string">"order.created"</span>, eventPayload,
)
<span class="keyword">if</span> err != <span class="keyword">nil</span> {
<span class="keyword">return</span> err
}
<span class="keyword">return</span> tx.<span class="function">Commit</span>()
}
<span class="comment">// Separate poller reads outbox and publishes to message broker</span>
<span class="keyword">func</span> <span class="function">OutboxPoller</span>(db *sql.DB, publisher MessagePublisher) {
<span class="keyword">for</span> {
rows, _ := db.<span class="function">Query</span>(
<span class="string">"SELECT id, event_type, payload FROM outbox WHERE published = false ORDER BY created_at LIMIT 100"</span>,
)
<span class="keyword">for</span> rows.<span class="function">Next</span>() {
<span class="keyword">var</span> id, eventType, payload <span class="keyword">string</span>
rows.<span class="function">Scan</span>(&id, &eventType, &payload)
publisher.<span class="function">Publish</span>(eventType, payload)
db.<span class="function">Exec</span>(<span class="string">"UPDATE outbox SET published = true WHERE id = $1"</span>, id)
}
time.<span class="function">Sleep</span>(<span class="number">500</span> * time.Millisecond)
}
}</code></pre>
<div class="tip-box">
<div class="label">CDC Instead of Polling</div>
<p>Instead of polling the outbox table, use <strong>Change Data Capture (CDC)</strong> with tools like Debezium. It reads the database's write-ahead log (WAL) and streams changes to Kafka. No polling delay, no extra load on the database. This is the production-grade approach.</p>
</div>
</section>
<!-- ============================================================ -->
<!-- SECTION 8: GOSSIP PROTOCOLS -->
<!-- ============================================================ -->
<section id="gossip">
<h2>8. Gossip Protocols</h2>
<p>How do nodes in a large cluster discover each other, detect failures, and propagate state? Gossip protocols spread information like a rumor -- each node periodically tells a random peer what it knows.</p>
<pre><code><span class="lang-label">ASCII</span>
<span class="comment"> Gossip Round 1: Round 2: Round 3:</span>
<span class="comment"> Node A knows X A tells B B tells D</span>
<span class="comment"> A tells C C tells E</span>
<span class="comment"> [A*] [B] [C] [A*] [B*] [C*] [A*] [B*] [C*]</span>
<span class="comment"> [D] [E] [F] [D] [E] [F] [D*] [E*] [F]</span>
<span class="comment"></span>
<span class="comment"> Information spreads in O(log N) rounds to all N nodes.</span>
<span class="comment"> Even if some messages are lost, redundancy ensures delivery.</span></code></pre>
<h3>Failure Detection with Gossip</h3>
<p>Nodes periodically ping random peers. If a node does not respond, don't immediately declare it dead. Use a <strong>suspicion mechanism</strong>: mark it as suspected, gossip the suspicion, and only declare dead after a timeout.</p>
<pre><code><span class="lang-label">Go</span>
<span class="keyword">type</span> NodeStatus <span class="keyword">int</span>
<span class="keyword">const</span> (
Alive NodeStatus = <span class="keyword">iota</span>
Suspected
Dead
)
<span class="keyword">type</span> Member <span class="keyword">struct</span> {
ID <span class="keyword">string</span>
Addr <span class="keyword">string</span>
Status NodeStatus
Heartbeat <span class="keyword">uint64</span>
Timestamp time.Time
}
<span class="keyword">type</span> GossipNode <span class="keyword">struct</span> {
mu sync.RWMutex
self Member
members <span class="keyword">map</span>[<span class="keyword">string</span>]*Member
suspectTTL time.Duration
}
<span class="keyword">func</span> (g *GossipNode) <span class="function">GossipRound</span>() {
g.mu.RLock()
peers := <span class="builtin">make</span>([]<span class="keyword">string</span>, <span class="number">0</span>, <span class="builtin">len</span>(g.members))
<span class="keyword">for</span> id := <span class="keyword">range</span> g.members {
<span class="keyword">if</span> id != g.self.ID {
peers = <span class="builtin">append</span>(peers, id)
}
}
g.mu.RUnlock()
<span class="keyword">if</span> <span class="builtin">len</span>(peers) == <span class="number">0</span> {
<span class="keyword">return</span>
}