Don't let your Kubernetes cluster go wild: Ensuring etcd reliability

Watch talk on YouTube

Fair warning: This talk was very technical and pretty interesting - but don’t even try to understand it if you’re tired (or if it’s the third-to-last session on the last day of a long conference).

Baseline

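Standard example: Write and read key-value data, put(A, 2) -> get(A) returns 2
Problem: Concurrency (once requests run in parallel, it is no longer obvious which value a read should return)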
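
To make the problem concrete, here is a minimal sketch with a toy in-memory store. `KVStore` and `final_values` are made-up names for illustration and have nothing to do with the real etcd client API. It only shows that two writes arriving "at the same time" can leave the key in either state, depending on which one is applied last.

```python
from itertools import permutations


class KVStore:
    """Toy in-memory key-value store (hypothetical, not the etcd API)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


# Sequential case: the read simply returns the value written before it.
store = KVStore()
store.put("A", 2)
sequential_result = store.get("A")


def final_values(writes, key):
    """Apply the writes in every possible order and collect what a later read sees."""
    results = set()
    for order in permutations(writes):
        s = KVStore()
        for k, v in order:
            s.put(k, v)
        results.add(s.get(key))
    return results


# Concurrent case: two clients write to the same key, the order is not fixed.
possible = final_values([("A", 1), ("A", 2)], "A")

print(sequential_result)
print(sorted(possible))
```

With a single client there is one answer; with two concurrent writers both 1 and 2 are acceptable results, which is why "correct" needs a proper definition.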

TODO: Steal image from intuition of correctness

Correctness

Correctness: Hard to pin down once time is involved, because parallel requests overlap and have no obvious order
Fix: Define a serialization that executes parallel requests one after another to bring them into an order
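
A rough sketch of what "finding a serialization" can look like for a single key. This is a brute-force check that I wrote for illustration, not the checker the etcd project uses; `Op`, `find_serialization` and the timestamps are assumptions. The idea: a recorded history is fine if some one-after-another order exists that (a) never reorders two requests that did not overlap in time and (b) makes every read return the latest written value. This property is usually called linearizability.

```python
from itertools import permutations
from typing import NamedTuple, Optional


class Op(NamedTuple):
    kind: str              # "put" or "get"
    value: Optional[int]   # value written, or value the read returned
    start: float           # time the client sent the request
    end: float             # time the client received the response


def respects_real_time(order):
    """If b finished before a even started, b must not be placed after a."""
    for i, a in enumerate(order):
        for b in order[i + 1:]:
            if b.end < a.start:
                return False
    return True


def is_valid_sequential_run(order):
    """Replay the operations one by one against a single-value register."""
    current = None
    for op in order:
        if op.kind == "put":
            current = op.value
        elif op.value != current:
            return False
    return True


def find_serialization(history):
    """Return one valid order, or None if no order explains the history."""
    for order in permutations(history):
        if respects_real_time(order) and is_valid_sequential_run(order):
            return list(order)
    return None


# The two puts overlap, so reading 1 afterwards is fine (put 2 "happened first").
ok_history = [Op("put", 1, 0, 3), Op("put", 2, 1, 4), Op("get", 1, 5, 6)]

# Here put 2 finished strictly after put 1, so reading 1 afterwards is stale.
stale_history = [Op("put", 1, 0, 1), Op("put", 2, 2, 3), Op("get", 1, 4, 5)]

print(find_serialization(ok_history) is not None)
print(find_serialization(stale_history) is not None)
```

Trying all permutations is factorial in the number of operations, so this only works for tiny histories; it is meant to show the definition, not a practical algorithm.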

Failures

What happens if connections between etcd nodes go down -> Serving stale data
What happens if data gets corrupted -> If enough members are online, it can repair itself
And many more failures that can happen at random times -> Hard to test
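
"Enough members" in a Raft-based system like etcd means a majority (quorum). The arithmetic is small enough to sketch; the function names are my own, and the partition example is a simplified assumption rather than something shown in the talk.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority remains."""
    return members - quorum(members)


def has_quorum(members: int, reachable: int) -> bool:
    """True if the reachable part of the cluster can still agree on writes."""
    return reachable >= quorum(members)


# Assumed scenario: a 5-member cluster is split 3 | 2 by a network failure.
majority_side = has_quorum(5, 3)
minority_side = has_quorum(5, 2)

print(quorum(5), tolerated_failures(5))
print(majority_side, minority_side)
```

In this scenario the side with two members cannot accept new writes, so anything it still answers from its local state may be out of date, which is one way the stale-data case above can come about.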

TODO: Steal “in a concurrent world”

Robustness framework

Automates tests for failures
Includes reliable reproductions of past (seemingly random) errors
Currently a mixture of existing Go debugging tools
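
One general way to turn "random" failures into reproducible ones is to derive the whole fault schedule from a seed. The sketch below shows only that generic idea; it is not how the etcd robustness framework is implemented, and the fault names and `fault_schedule` are hypothetical.

```python
import random

# Hypothetical fault names, for illustration only.
FAULTS = ["kill_member", "drop_network", "delay_disk"]


def fault_schedule(seed: int, steps: int, probability: float = 0.3):
    """Decide per step whether to inject a fault, driven only by the seed."""
    rng = random.Random(seed)  # private generator, independent of global state
    schedule = []
    for step in range(steps):
        if rng.random() < probability:
            schedule.append((step, rng.choice(FAULTS)))
    return schedule


first_run = fault_schedule(seed=42, steps=20)
replay = fault_schedule(seed=42, steps=20)

# Same seed, same faults at the same steps: a failing run can be replayed.
print(first_run == replay)
```

If a test run fails, logging the seed is enough to get the same sequence of injected faults again, as long as the system under test is driven deterministically too.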

Future

Reproduce more bugs consistently
Run additional consistency checks