How We Ensure Distributed System Consistency
When a database-backed app needs to deliver messages to another app, we use the sending app’s database-backed job queue as the sole mechanism for ensuring at-least-once delivery of messages. We already need the queue for local work, and message delivery should be the responsibility of the producer, not the receiver, so there’s no need for an intermediary.
When an app needs to eventually receive the latest state of a fact owned by an upstream application, without a requirement to see all interim states, we use the flyweight events pattern because it avoids A-B-A bugs when messages are delivered out of order.
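The sketch below contrasts the two approaches, under the assumption that a flyweight event carries only the identity of the fact that changed and that the receiver reads the current state from upstream when it handles the event. The dictionaries, `fetch_latest`, and both handlers are hypothetical and exist only to show the A-B-A hazard.

```python
# Source of truth owned by the upstream application. The fact moved A -> B -> A.
upstream = {"acct-1": "A"}


def fetch_latest(fact_id):
    # Stand-in for a GET against the upstream application.
    return upstream[fact_id]


def handle_fat_event(replica, event):
    # A "fat" event carries the state itself, so a late event overwrites newer state.
    replica[event["id"]] = event["state"]


def handle_flyweight_event(replica, event):
    # A flyweight event only says "this fact changed"; the state is read at handling time.
    replica[event["id"]] = fetch_latest(event["id"])


# Three changes were emitted in order A, B, A but arrive as A, A, B.
fat_events = [
    {"id": "acct-1", "state": "A"},
    {"id": "acct-1", "state": "A"},
    {"id": "acct-1", "state": "B"},  # stale event delivered last
]
flyweight_events = [{"id": "acct-1"}, {"id": "acct-1"}, {"id": "acct-1"}]

fat_replica = {}
for event in fat_events:
    handle_fat_event(fat_replica, event)

flyweight_replica = {}
for event in flyweight_events:
    handle_flyweight_event(flyweight_replica, event)
```

In this run the fat-event replica is left holding the stale value, while the flyweight replica matches upstream regardless of arrival order. The cost is one extra read per event.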
When an app needs to replicate a fact owned by an upstream application as above and store it as a record in its own database, we use the Event Driven Sync framework because implementing the flyweight events pattern for data sync is verbose, nuanced, and relatively common when we build domain-oriented services.
When an app needs to effect change on a remote application with no local side effects, we just use REST because that’s all there is to it.
When an app needs to provide an idempotent, mutative endpoint, we use a cotransactional idempotency filter built into the receiving controller (Idempotent Receiver) because it can capture the result of the request and replay it for the consumer after a network partition without adding complexity to the business domain.
When an app needs to effect change with both local and remote side effects but the remote side has no right to reject the change, we commit locally and background an idempotent REST request because the downstream will reliably receive the change eventually.
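A small in-memory simulation of this flow follows. `FakeRemote`, `record_credit`, and `drain` are hypothetical names, and the list standing in for the job queue would be the database-backed queue in practice. The sketch assumes the idempotency key is minted once, at local commit time, and reused on every retry.

```python
import uuid


class FakeRemote:
    """Stand-in for the downstream app's idempotent endpoint."""

    def __init__(self):
        self.applied = {}  # idempotency key -> captured response
        self.total = 0
        self.drop_next_response = True  # simulate one lost response

    def put_credit(self, key, amount):
        if key not in self.applied:
            self.total += amount
            self.applied[key] = {"total": self.total}
        if self.drop_next_response:
            self.drop_next_response = False
            raise TimeoutError("response lost after the change was applied")
        return self.applied[key]


local_ledger = []
job_queue = []


def record_credit(amount):
    # Stands in for a single local transaction: the local change and the
    # background job (with its idempotency key) are recorded together.
    key = str(uuid.uuid4())
    local_ledger.append(amount)
    job_queue.append({"key": key, "amount": amount})


def drain(remote):
    # Attempt every queued request once; keep any job whose request failed.
    for job in list(job_queue):
        try:
            remote.put_credit(job["key"], job["amount"])
        except TimeoutError:
            continue
        job_queue.remove(job)


remote = FakeRemote()
record_credit(25)
drain(remote)  # remote applies the change, but the sender sees a timeout
jobs_after_first_attempt = len(job_queue)
drain(remote)  # retry with the same key; remote replays instead of re-applying
```

The retry is safe only because the downstream endpoint deduplicates on the key, which is why the background request has to be idempotent.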
When an app needs to effect change with both local and remote side effects but the remote side has a say in the validity of the operation, we perform a validation preflight request and, upon success, optimistically proceed as above despite the potential for race conditions. This is because all eventually consistent systems “create garbage”: they require domain-aware conflict resolution mechanisms, which we handle as follows:
Given a conflict caused by the above race condition, we handle unacceptable requests by giving the receiving side the ability to accept “invalid” states and by building workflows that advance now-invalid requests to a failed state in both systems. This iterative strategy allows API provider teams to make informed tradeoffs between operational cost and product velocity while providing an avenue to ~zero operational load.
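The preflight race and its resolution can be sketched with a toy inventory example. Everything here is hypothetical (the `sender` and `receiver` dictionaries, the state names, and the `fail_invalid_reservations` workflow), and the background delivery is collapsed into direct function calls to keep the sketch short.

```python
receiver = {"stock": 1, "reservations": {}}
sender = {"orders": {}}


def preflight(quantity):
    # Validation-only request: nothing is reserved yet, so it can go stale.
    return receiver["stock"] >= quantity


def place_order(order_id, quantity):
    # Sender side: preflight, then commit locally and proceed optimistically.
    if not preflight(quantity):
        return False
    sender["orders"][order_id] = "placed"
    return True


def receive_reservation(order_id, quantity):
    # Receiver side: never rejects. A request that is no longer valid is
    # recorded in an "invalid" state instead of being refused.
    if receiver["stock"] >= quantity:
        receiver["stock"] -= quantity
        receiver["reservations"][order_id] = "reserved"
    else:
        receiver["reservations"][order_id] = "invalid"


def fail_invalid_reservations():
    # Workflow that advances now-invalid requests to "failed" in both systems.
    for order_id, state in list(receiver["reservations"].items()):
        if state == "invalid":
            receiver["reservations"][order_id] = "failed"
            sender["orders"][order_id] = "failed"


# Both orders pass preflight before either reservation is delivered: the race.
first_ok = place_order("o1", 1)
second_ok = place_order("o2", 1)
receive_reservation("o1", 1)
receive_reservation("o2", 1)  # stock is gone, so this lands as "invalid"
fail_invalid_reservations()
```

The workflow step could start as a manual process and be automated later, which is one reading of the iterative tradeoff between operational cost and product velocity described above.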