Concurrency Bugs You Can’t See (Until Your System Fails in Production)

Writing reliable concurrent software is a subtle and often unforgiving craft. At first glance, the promise of concurrency is appealing: faster execution, improved responsiveness, and the ability to scale workloads across multiple threads or cores. But lurking beneath that promise is a set of complex challenges: issues that don’t always manifest under test conditions but can quietly wreak havoc in production. Among the most notorious are race conditions, violations of atomicity, deadlocks, livelocks, and starvation. These aren’t just academic concerns; they represent real risks that every systems programmer, backend engineer, and microservices architect may confront.

Imagine you’re building a backend service for a fintech application. One of its key responsibilities is managing user accounts, particularly balance checks and withdrawals. The logic might look like this:

type Account struct {
    Balance int
}

func (a *Account) Withdraw(amount int) error {
    if a.Balance < amount {
        return ErrInsufficient
    }
    a.Balance -= amount
    return nil
}

Looks safe in a single-threaded world. But what happens when two withdrawal requests arrive at nearly the same time?

account := Account{Balance: 20}

go func() { account.Withdraw(12) }() // Request A
go func() { account.Withdraw(11) }() // Request B

Depending on timing, the final balance can become negative, something that should never happen.

Race Conditions in Action

A race condition occurs when the correctness of a program depends on the relative timing of threads. It’s unpredictable, subtle, and difficult to reproduce. In the above example, multiple interleavings are possible.

Scenario: Both read before either subtracts

  • Request A: checks balance (20 ≥ 12) → OK
  • Request B: checks balance (20 ≥ 11) → OK
  • Both subtract → Final balance: -3

Scenario: One finishes before the other starts

  • Request A completes → balance becomes 8
  • Request B checks balance (8 ≥ 11) → false → rejected

Only the second scenario behaves correctly. The first is logically invalid, yet both are perfectly valid from the CPU scheduler’s perspective. That’s what makes concurrency so difficult: even correct-looking code can break without synchronization.

What we saw above is more than just a logic error; it’s a data race. A data race occurs when:

  • Two or more threads access the same memory concurrently
  • At least one of the accesses is a write
  • No synchronization is used to coordinate access

Why are data races dangerous?

  • The outcome is non-deterministic
  • Compilers and CPUs may reorder instructions for performance
  • Threads may see stale or inconsistent memory values

Illusion of Atomicity

In some interleavings, operations appear atomic simply because they don’t overlap. But in concurrent systems, we cannot rely on this. Without explicit synchronization, atomicity is just an illusion.

For instance:

  • Request B completes and updates the balance to 9
  • Request A, running on a different core, still sees the balance as 20
  • A’s check passes and it deducts 12 → Final balance: -3

This happens due to memory reordering, cache delays, or a lack of visibility guarantees between threads. These bugs are especially dangerous because:

  • They don’t always trigger during testing
  • They can’t always be caught by data race detectors
  • They can occur even when no lines of code visibly overlap

To make operations safe and atomic, we must introduce critical sections: code regions that only one thread may access at a time.

In Go, this is achieved using sync.Mutex:

type Account struct {
    sync.Mutex
    ID      string
    Balance int
}

func (a *Account) Withdraw(amount int) error {
    a.Lock()
    defer a.Unlock()

    if a.Balance < amount {
        return ErrInsufficient
    }

    a.Balance -= amount
    return nil
}

This simple change ensures:

  • Only one goroutine can access or modify the balance at any given time
  • Other goroutines must wait until the lock is released
  • All changes made within the lock become visible before the next one proceeds

One of the most common and dangerous mistakes in concurrent programming is inconsistent locking: a shared variable is properly protected by a lock in some parts of the code but accessed elsewhere without acquiring that same lock. This breaks the synchronization guarantees and reintroduces the risk of race conditions.

Golden Rule:

Every read or write to shared mutable state must be protected by the same lock.

Deadlocks in Action

Despite our best efforts to enforce atomicity with mutexes, concurrency hazards don’t end there. Introducing synchronization primitives like locks can prevent race conditions, but they also open the door to an equally serious class of problems: deadlocks. These occur when two or more operations wait for each other indefinitely, freezing part of the system. And what’s even more deceptive is that everything may seem to be working correctly, until it doesn’t.

Let’s continue with our fintech example. Suppose we now want to support transfers between two accounts. Conceptually, transferring money from one account to another requires locking both accounts at once to ensure consistency. Here’s a naive first implementation:

func Transfer(from, to *Account, amount int) error {
    from.Lock()
    defer from.Unlock()

    to.Lock()
    defer to.Unlock()

    if from.Balance < amount {
        return ErrInsufficientFunds
    }

    from.Balance -= amount
    to.Balance += amount
    return nil
}

Looks fine? Not quite. Now consider this concurrent scenario:

go func() { Transfer(account1, account2, 5) }()
go func() { Transfer(account2, account1, 10) }()

Two goroutines are attempting opposite transfers between the same two accounts. Here’s a valid (and dangerous) interleaving:

  1. Goroutine A calls Transfer(account1, account2)

    a. Acquires the lock on account1

    b. Attempts to lock account2 → blocked

  2. Goroutine B calls Transfer(account2, account1)

    a. Acquires the lock on account2

    b. Attempts to lock account1 → blocked

Now, both goroutines are holding one lock and waiting for the other. Neither can proceed, and neither will ever release their lock. The system is deadlocked.

This situation satisfies all four of the classic Coffman conditions, each of which must hold for a deadlock to occur:

  1. Mutual Exclusion — at least one resource is held exclusively
  2. Hold and Wait — threads hold some resources while waiting for others
  3. No Preemption — resources cannot be forcibly taken away
  4. Circular Wait — a cycle of dependencies exists among the threads

When all four are present, a deadlock can occur; remove any one of them, and deadlock becomes impossible. Fortunately, there’s a simple and effective strategy to prevent deadlocks when dealing with multiple locks: always acquire locks in a consistent global order.

Every account has a unique ID, so we can sort them before locking:

func Transfer(from, to *Account, amount int) error {
    // Lock the account with the smaller ID first
    // (assumes from and to are distinct accounts).
    if from.ID < to.ID {
        from.Lock()
        defer from.Unlock()

        to.Lock()
        defer to.Unlock()
    } else {
        to.Lock()
        defer to.Unlock()

        from.Lock()
        defer from.Unlock()
    }

    if from.Balance < amount {
        return ErrInsufficientFunds
    }

    from.Balance -= amount
    to.Balance += amount
    return nil
}

By locking accounts in the same order across all goroutines, we eliminate circular waits, and with them, deadlocks. While this strategy works in simple cases, it’s not a universal fix. It becomes harder to apply when:

  • The set of resources to lock is dynamic
  • Locks are acquired conditionally
  • Resources are nested across layers or modules

In such cases, you might need to use:

  • Timeout-based locks (e.g., TryLock)
  • Queues or orchestrators to coordinate access
  • Higher-level abstractions like transactions or actor models

Interestingly, the Go runtime has its own concept of what constitutes a deadlock. If every goroutine is blocked and none can make progress, the program crashes with “fatal error: all goroutines are asleep - deadlock!”:

func main() {
    c := make(chan int)
    <-c // no sender, program deadlocks immediately
}

However, Go cannot detect partial deadlocks, when only some goroutines are stuck, but others continue running. These are especially dangerous in production services: the system appears “alive,” but part of its functionality is silently frozen.

Starvation in Action

Closely related to deadlocks is starvation: a condition in which some threads are consistently denied access to a resource, while others proceed. This usually stems from flawed scheduling or lock contention patterns.

Consider this situation:

var mutex sync.Mutex

// Goroutine A: holds the lock for a long time
go func() {
    for {
        mutex.Lock()
        time.Sleep(500 * time.Millisecond)
        mutex.Unlock()
        time.Sleep(10 * time.Millisecond)
    }
}()

// Goroutine B: tries to acquire the same lock
go func() {
    for {
        mutex.Lock()
        time.Sleep(10 * time.Millisecond)
        mutex.Unlock()
        time.Sleep(500 * time.Millisecond)
    }
}()

Here, Goroutine A frequently grabs the lock and holds it for a long time, while Goroutine B, despite needing it only briefly, is rarely fast enough to acquire it. Over time, B may starve. To combat starvation, Go’s runtime employs queue-based fairness for mutexes: waiting goroutines are serviced roughly in order of arrival, which helps prevent long-held locks from monopolizing execution. But remember: this only applies to low-level primitives like sync.Mutex. You are still responsible for fairness when building custom schedulers, task queues, or coordination logic.

Livelocks in Action

Now let’s flip the scenario. In a deadlock, everything stops. But in a livelock, everything runs and yet nothing progresses.

A classic example of a livelock is two people trying to pass each other in a narrow hallway, such as Čertovka Alley, the famously tight passageway in Prague. One steps to the left to let the other pass, only to be mirrored; then both step to the right, again mirroring each other, repeating the dance endlessly. They’re not idle or unresponsive; each is actively trying to resolve the situation. Yet despite constant motion, neither makes any progress.

Here’s a simplified example using TryLock (available on sync.Mutex since Go 1.18), which returns immediately instead of blocking:

var mutex1, mutex2 sync.Mutex

go func() {
    for {
        if mutex1.TryLock() {
            if mutex2.TryLock() {
                // Critical section
                mutex2.Unlock()
            }
            mutex1.Unlock()
        }
    }
}()

go func() {
    for {
        if mutex2.TryLock() {
            if mutex1.TryLock() {
                // Critical section
                mutex1.Unlock()
            }
            mutex2.Unlock()
        }
    }
}()

Both goroutines repeatedly acquire one lock, fail to acquire the second, then release and retry, over and over. The CPU is busy, but progress is zero. Because threads in a livelock are not blocked, these issues are harder to detect. But you can mitigate them by introducing:

  • Backoff strategies (e.g., time.Sleep(randomDuration))
  • Retry limits
  • Timeouts
  • Progress tracking (e.g., queue length, throughput metrics)

These strategies reduce contention and increase the odds of success by breaking symmetry.

Even when deadlocks and livelocks are avoided, starvation deserves a closer look. Unlike deadlocks, where threads are frozen waiting for each other, starvation occurs when some threads are perpetually delayed or denied access to shared resources, even though progress is technically possible. The result is uneven, unfair scheduling: some parts of the system thrive, while others are silently left behind.

Let’s revisit the sync.Mutex example from earlier, but introduce asymmetric workloads:

var mutex sync.Mutex

// Goroutine A: holds the lock for a long time
go func() {
    for {
        mutex.Lock()
        time.Sleep(500 * time.Millisecond) // long critical section
        mutex.Unlock()

        time.Sleep(10 * time.Millisecond)
    }
}()

// Goroutine B: quick and frequent
go func() {
    for {
        mutex.Lock()
        time.Sleep(10 * time.Millisecond) // short critical section
        mutex.Unlock()

        time.Sleep(500 * time.Millisecond)
    }
}()

At first glance, this looks harmless. But in practice:

  • Goroutine A is almost always ready to grab the lock.
  • Goroutine B, despite needing the lock for only a short time, is usually outside its contention window.
  • The result? A wins the race nearly every time, and B is starved.

Even with a fair or randomized scheduler, faster or more persistent threads often win. Over time, starvation introduces latency spikes, throughput drops, or outright degradation of time-sensitive components. Modern versions of the Go runtime (since Go 1.9) include fairness mechanisms in the sync.Mutex implementation. Specifically:

  • Goroutines waiting on a mutex are placed in a FIFO queue
  • The oldest waiter is given preference when the lock becomes available

This mitigates starvation in most simple cases, but it only applies to low-level synchronization. If you implement your own queues, semaphores, or schedulers (e.g., for task distribution), it’s up to you to enforce fairness. And starvation doesn’t just affect internal logic; in some cases, it can be deliberately exploited as a form of attack.

Concurrency bugs aren’t just reliability issues. They’re also potential security flaws. A well-crafted request or payload can lead to resource exhaustion, effectively denying service to legitimate users. This is what makes starvation a DoS vector.

Consider the infamous “Billion Laughs” XML bomb (shown here with three levels of nesting; the full bomb continues up to &lol9;):

<?xml version="1.0"?>
<!DOCTYPE lolz [
  <!ENTITY lol "lol">
  <!ENTITY lol1 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
  <!ENTITY lol2 "&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;">
  <!ENTITY lol3 "&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;">
]>
<lolz>&lol3;</lolz>

Parsing this input triggers exponential expansion. The result?

  • Memory gets saturated
  • CPU usage spikes
  • The parser starves all other work
  • The system grinds to a halt

To protect systems from accidental or malicious starvation, apply these principles early and often:

  • Limit input size: Cap the maximum length of payloads, arrays, recursion depth, etc.
  • Validate early: Reject malformed or unusually complex input as soon as possible
  • Avoid unbounded work: Never loop or expand based on user input without a safeguard
  • Use timeouts: Wrap operations in time limits, especially network or disk-bound ones
  • Rate-limit and throttle: Don’t allow any client to consume more than their fair share
  • Quarantine expensive tasks: Move risky processing to isolated queues or worker pools

Never trust client input. Assume every request is potentially malicious.

Another subtle concurrency failure occurs when goroutines don’t crash, don’t deadlock, and don’t starve, but still never complete. This is especially problematic in web servers and long-running services. From the outside, everything looks fine; internally, resources are leaked one by one.

Take this HTTP handler example:

func Producer() <-chan string {
    c := make(chan string)

    go func() {
        defer func() {
            if r := recover(); r != nil {
                log.Println("Recovered:", r)
            }
            // Channel is never closed if a panic happens earlier!
        }()

        for i := 0; i < 10; i++ {
            c <- produceValue() // ⚠️ Might panic
        }

        close(c)
    }()

    return c
}

func httpHandler(w http.ResponseWriter, req *http.Request) {
    c := Producer()

    for val := range c {
        w.Write([]byte(val))
    }
}

What can go wrong here?

  • Suppose produceValue() panics on iteration 3
  • The goroutine exits before calling close(c)
  • The HTTP handler keeps waiting for values from a channel that will never close
  • The result? The response never finishes, and the handler goroutine is leaked

Now imagine this happening hundreds of times per second, triggered by user input. Over time, the server runs out of memory and crashes. Go’s runtime doesn’t see a deadlock or a panic in the handler; it just sees an idle goroutine stuck in a read. The damage accumulates quietly.

Always guarantee that your channels are closed even if the producer panics or exits early:

func Producer() <-chan string {
    c := make(chan string)

    go func() {
        defer func() {
            if r := recover(); r != nil {
                log.Println("Recovered:", r)
            }
            close(c) // ✅ Always close
        }()

        for i := 0; i < 10; i++ {
            c <- produceValue()
        }
    }()

    return c
}

And in HTTP handlers, consider using:

  • context.Context with timeouts
  • Buffered channels with fallbacks
  • Select statements that include ctx.Done()

In concurrent systems, especially those exposed to the outside world, you must always assume that something will go wrong. To build resilient services:

  • Catch and recover from panics
  • Limit lifetime of goroutines
  • Track and monitor open goroutines using tools like pprof
  • Use metrics and alerts to detect abnormal latency, memory, or CPU usage

This gradual resource exhaustion from silent goroutine leaks is a prime example of a liveness failure. The system hasn’t crashed: no panics, no explicit errors. And yet it is dying, slowly, invisibly, and irreversibly. It illustrates a crucial point about concurrent systems: problems don’t always explode; sometimes they erode your service one stuck handler, one orphaned channel, or one retry loop at a time.

To better reason about such behavior, it helps to distinguish between two fundamental categories of correctness: safety and liveness. Safety ensures that “nothing bad ever happens”. For instance, no two threads simultaneously entering a critical section, or no data structure ending up in an invalid state. Liveness, on the other hand, guarantees that “something good eventually happens”. A request gets processed, a message gets delivered, or a lock is eventually acquired.

Violating safety often results in loud, visible failures: crashes, corruptions, panics. Liveness violations, however, are quieter, slower, and harder to detect. A system may appear healthy while gradually slowing down, growing unresponsive, or silently dropping work. That’s why monitoring only for safety isn’t enough: a system that runs isn’t necessarily a system that works.

In practice, liveness issues like livelocks, starvation, and goroutine leaks demand a different mindset. While correctness at the code level is important, resilience at the system level is critical. Modern infrastructure embraces this by shifting focus from strict prevention to graceful recovery. You can’t predict or eliminate every failure mode, but you can make sure your system survives and recovers from them.

This philosophy drives the design of self-healing systems: detect anomalies early, fail fast when needed, and restart cleanly. A stuck service? Replace it. A thread that never finishes? Kill and restart it. A misbehaving node? Evict and reschedule it. Tools like Kubernetes make this pattern operational by continuously reconciling the system toward a desired state, even in the face of unpredictable failure. Ultimately, building concurrent software isn’t just about locks, threads, and shared memory. It’s about designing systems that remain correct, observable, and adaptable, even when things go wrong.
