The Ghost Writes Twice: Concurrent Updates and Silent Data Corruption


concurrency race-conditions optimistic-locking distributed-systems data-consistency

The Ghost Writes Twice

Two services. One record. Zero errors. Just wrong data, discovered three weeks later.

⏱ 14 min readπŸ“ IntermediateπŸ”’ Data Consistency

The Day the Numbers Lied

It’s a Tuesday afternoon. Alex, a senior engineer at a fast-growing fintech startup, is staring at a support ticket. A user named Priya insists she earned 650 loyalty points over the weekend. The system shows 500.

No exception in the logs. No failed transaction. No error email. Every service returned 200 OK. The application behaved perfectly, from its own perspective. It wrote data. The database accepted it. The numbers are just… wrong.

Three weeks earlier, two microservices had a collision neither of them knew about.

The Order Service processed Priya’s purchase and prepared to add 150 loyalty points. At the exact same millisecond, the Referral Service activated a referral bonus and prepared to add 400 points. Both services read her current balance of 100 points at the same moment. Both did their math. Both wrote their result. One won. The other’s update was silently overwritten, taking 150 points from Priya without a trace.

This is the lost update problem. It is not a bug in the traditional sense. It is a design gap. And it surfaces in production long after the code ships.


The Failure Chain

Here is how the collision played out at the database level:

Timeline (milliseconds)

t=0ms    Priya's balance: 100 pts

t=1ms    Order Service:    SELECT balance FROM users WHERE id=42
         Referral Service: SELECT balance FROM users WHERE id=42
         Both read:        balance = 100

t=5ms    Order Service:    calculates new_balance = 100 + 150 = 250
t=6ms    Referral Service: calculates new_balance = 100 + 400 = 500

t=10ms   Order Service:    UPDATE users SET balance = 250 WHERE id=42
         Result:           balance is now 250 βœ“

t=12ms   Referral Service: UPDATE users SET balance = 500 WHERE id=42
         Result:           balance is now 500 ← OVERWRITES 250

Expected: 100 + 150 + 400 = 650
Actual:   500
Lost:     150 points, silently

Both services succeeded. Both returned success codes. The database accepted both writes without complaint. This is Last-Write-Wins behavior - and it silently deleted 150 points from Priya’s account.

The cruel part: you will not catch this in unit tests. You might not catch it in integration tests. You will catch it when an angry customer calls.

The core issue: A read-then-write operation across two separate services is not atomic. There is a window between the read and the write where another process can change the value underneath you.


Why β€œJust Use Transactions” Does Not Scale

Your first instinct might be: wrap everything in a database transaction. The problem is that transactions work within a single database connection, not across services. When the Order Service and Referral Service are separate processes talking to the same database, they have separate connections and separate transaction contexts.

Even with transactions, two concurrent reads at the same isolation level will see the same snapshot of data. You need an explicit mechanism to detect or prevent concurrent modification.

The concurrent read problem at READ COMMITTED isolation:

  Connection A                  Connection B
  ─────────────                 ─────────────
  BEGIN;
  SELECT balance = 100          BEGIN;
                                SELECT balance = 100
  UPDATE SET balance = 250;
  COMMIT;
                                UPDATE SET balance = 500;
                                COMMIT;
                                ← still overwrites 250!

Both transactions committed cleanly. Both are β€œcorrect” in isolation. Together they produce wrong data.


Solution 1: Optimistic Locking (Version Numbers)

Optimistic locking is the most common fix and the right default for most systems. Add a version column to the row. Every update increments the version. Before writing, assert that the version you read is still the current version. If it is not, someone else modified the row - retry with fresh data.

Schema addition:
  users: id, balance, version

Read:
  SELECT balance, version FROM users WHERE id=42
  Returns: balance=100, version=7

Write (with version assertion):
  UPDATE users
  SET balance=250, version=8
  WHERE id=42 AND version=7
  ← the AND version=7 is the entire safety net

If version changed: UPDATE affects 0 rows β†’ conflict detected
If version matches: UPDATE succeeds β†’ safe write

Here is what the timeline looks like with optimistic locking active:

t=0ms   balance=100, version=7

t=1ms   Order Service reads:    (balance=100, version=7)
        Referral Service reads: (balance=100, version=7)

t=10ms  Order Service writes:
        UPDATE ... WHERE version=7
        Sets balance=250, version=8
        Result: 1 row updated βœ“

t=12ms  Referral Service writes:
        UPDATE ... WHERE version=7
        Version is now 8 - WHERE clause matches 0 rows!
        Result: CONFLICT DETECTED

        Referral Service re-reads: (balance=250, version=8)
        Recalculates:              250 + 400 = 650
        Writes:                    (balance=650, version=9)
        Result: 1 row updated βœ“

Final balance: 650. Correct.

Here is a minimal implementation in Python:

def credit_points(user_id, amount, max_retries=3):
    for attempt in range(max_retries):
        user = db.query(
            "SELECT balance, version FROM users WHERE id=%s", user_id
        )
        new_balance = user.balance + amount
        rows_updated = db.execute(
            "UPDATE users SET balance=%s, version=%s WHERE id=%s AND version=%s",
            new_balance, user.version + 1, user_id, user.version
        )
        if rows_updated == 1:
            return new_balance
        # Conflict: another writer beat us, retry with fresh read
    raise ConcurrencyError("Max retries exceeded - high contention detected")

Optimistic locking works best when conflicts are rare. It adds no database locks, so concurrent reads stay fast. Retries are cheap. If your system has many services frequently updating the same row, though, you will get a storm of retries - which is a signal to switch strategies.


Solution 2: Pessimistic Locking

If conflicts are frequent, or if a retry is expensive (for example, the update involves an external payment API call), pessimistic locking is a better fit. You acquire an exclusive lock on the row before reading it. Other processes block until you release it.

Order Service:
  BEGIN;
  SELECT * FROM users WHERE id=42 FOR UPDATE;
  ← Referral Service blocks here, waiting

  UPDATE users SET balance=250 WHERE id=42;
  COMMIT; ← lock released

Referral Service (now unblocked):
  BEGIN;
  SELECT * FROM users WHERE id=42 FOR UPDATE;
  ← reads balance=250 (the updated, correct value)

  UPDATE users SET balance=650 WHERE id=42;
  COMMIT;

SELECT ... FOR UPDATE tells the database: I am about to write this row. Hold it until my transaction commits.

Without lock:    β”Œβ”€β”€β”€ Read ─────────────── Write ───┐
                 └─── Read ─────────────── Write β”€β”€β”€β”˜
                          ↑ Race window here

With lock:       β”Œβ”€β”€β”€ Read ── Write ─┐
                                      └─── Read ── Write ─┐
                 Serial access, no window

The tradeoff is throughput. Locks serialize access, which reduces concurrency under load. Never hold a pessimistic lock across a slow operation like an HTTP call or a third-party API request. Lock, do the database work, commit immediately.


Solution 3: Atomic Operations via Redis

When the operation is simple (increment, decrement), you can bypass the database read-modify-write cycle entirely using Redis atomic commands.

Redis key:  user:42:balance

Order Service:    INCRBY user:42:balance 150
Referral Service: INCRBY user:42:balance 400

Redis guarantees these are atomic and serial.
No reads needed. No version checks. No conflicts possible.

Final Redis value: 100 + 150 + 400 = 650. Always correct.

INCRBY in Redis is a server-side atomic operation. Redis processes commands sequentially in a single-threaded event loop - two clients issuing INCRBY concurrently will be serialized automatically. No two increments can interleave.

The caveat: Redis is not your primary database. You need a strategy to sync the authoritative balance back to your relational DB. Common approaches: periodic background sync, event-driven sync on a threshold, or dual-write with Redis as the fast path and the DB as the durable store.


The Full Picture

Here is how these three strategies sit relative to each other:

Conflict Frequency  Low                          High
                    β”‚                             β”‚
Strategy:           Optimistic Locking  ──────►  Pessimistic Locking
                    (version column)             (SELECT FOR UPDATE)

Operation Type      Complex multi-field          Simple counter
                    β”‚                             β”‚
Strategy:           Optimistic / Pessimistic ───► Redis Atomic Ops
                    (in relational DB)           (INCRBY / DECR)

Choosing the Right Strategy

StrategyBest ForThroughputComplexity
Optimistic LockingLow conflict rate, any writeHighLow
Pessimistic LockingHigh conflict, expensive retriesMediumLow
Redis Atomic OpsSimple counters, high frequencyVery HighMedium
Database sequencesAuto-incrementing IDs onlyHighNone

Key Takeaways

  • Last-Write-Wins is a silent data killer. It returns success codes even when it corrupts data.
  • Optimistic locking is your default. Version numbers cost almost nothing and catch most race conditions.
  • Pessimistic locking serializes access. Use it when retries are expensive or contention is high.
  • Redis atomic operations are the fastest option for simple increment and decrement patterns.
  • Always test concurrency explicitly. Fire two concurrent updates against the same row and assert the mathematically correct final state.

Frequently Asked Questions

Q: Does optimistic locking work across microservices?

Yes, but each service must thread the version number through the entire call chain. If Service A reads version=7 and delegates the write to Service B via an API call, Service B must include version=7 in its database update clause. This requires deliberate API design from the start.

Q: What if the retry loop in optimistic locking never converges?

Cap your retries - typically 3 to 5 attempts. If a row is contended that heavily, optimistic locking is the wrong tool. Switch to pessimistic locking or route all writes through a single serialized queue (one writer at a time).

Q: Is SELECT FOR UPDATE dangerous in a high-traffic system?

Lock contention under high concurrency causes significant slowdowns. The rule: never hold a pessimistic lock across a slow operation (HTTP call, third-party API, heavy computation). Lock, execute the database work, commit immediately. Keep lock duration under 50ms as a guideline.

Q: Can database constraints prevent this?

For invariants like non-negative balances, CHECK constraints are a great safety net. But they do not solve the lost update problem - they catch some corrupted values after the fact. Constraints and locking solve different problems and should be used together.

Q: What is MVCC and does it help here?

Multi-Version Concurrency Control is how PostgreSQL gives each transaction a consistent snapshot without blocking readers. It does not prevent lost updates at the READ COMMITTED isolation level. You still need optimistic or pessimistic concurrency control on top of MVCC.


Interview Questions

Q: Explain the lost update problem with an example.

Walk through two concurrent read-then-write operations on the same row. Show how Last-Write-Wins produces an incorrect final value. Then explain how optimistic locking with a version column catches and handles the conflict. Be specific about what β€œ0 rows updated” means and what the retry strategy looks like.

Q: When would you choose pessimistic locking over optimistic locking?

When conflicts are frequent (many writers competing on the same row), when retries are expensive (the update involves an external API call or heavy computation), or when the business operation requires strict ordering - for example, a financial ledger that must process debits before credits.

Q: How does Redis handle concurrent increments safely?

Redis is single-threaded in command execution. INCRBY is an atomic server-side operation: it reads and writes in a single step from Redis’s perspective. Two clients issuing INCRBY concurrently are serialized by the Redis event loop - there is no window for interleaving.

Q: How do you test for race conditions?

Use concurrent test harnesses: fire multiple goroutines or threads simultaneously, all targeting the same resource. Assert the final state against the mathematically correct answer. For databases, custom load tests with high parallelism (using pgbench, k6, or language-native concurrency) can reliably surface race conditions in staging before they appear in production.

Q: What is a Compare-And-Swap operation and how does it relate to optimistic locking?

Compare-And-Swap is a CPU-level primitive: atomically compare a memory location to an expected value, and write a new value only if they match. Optimistic locking is the database-level equivalent: compare the version column to the expected version in the WHERE clause, and apply the update only if they match. The mechanics are different; the invariant is identical.

This article is premium

One-time payment Β· Lifetime access to all premium content

Get Premium Access

Already have access? Sign in