First Journey

by 🧑‍💻 Wang Tech on Thu Nov 27 2025


Recently, our system faced a strange issue during a rolling deployment.

When Customer A books a ride, our system tries to assign a driver (Driver B).
We allow up to 60 seconds to find a driver.
If no driver is found within 60 seconds, we return “Driver Not Found”.

Normally, everything works perfectly.

But during deployment, something went wrong…
Some trips stayed in WAITING status for 120–180 seconds,
even though the timeout limit is only 60 seconds. 😳


🤔 Why does this happen?

The system uses a virtual thread to handle each trip's timeout.

  • When a trip is created, a timer thread starts (e.g., 60s).
  • If a driver is found → we interrupt the timer (happy path).
  • If no driver is found → the timer triggers the driver-not-found callback.
  • This mechanism works well on all pods… until a deployment happens.
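
The mechanism above can be sketched roughly as follows. This is a minimal illustration, assuming Java 21+ virtual threads; the class `TripTimeouts`, its methods, the status strings and the in-memory `status` map are hypothetical stand-ins, not the actual service code.

```java
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of a per-trip timeout backed by a virtual thread (Java 21+). */
class TripTimeouts {
    // Trip status is kept in a map for this sketch; a real service would use a database.
    final Map<String, String> status = new ConcurrentHashMap<>();
    // In-memory only: each timer thread exists on exactly one pod.
    private final Map<String, Thread> timers = new ConcurrentHashMap<>();
    private final Duration timeout;

    TripTimeouts(Duration timeout) {
        this.timeout = timeout;
    }

    /** Creates the trip in WAITING and starts its countdown. */
    void createTrip(String tripId) {
        status.put(tripId, "WAITING");
        Thread timer = Thread.ofVirtual().unstarted(() -> {
            try {
                Thread.sleep(timeout);
                handleNotFound(tripId); // timeout elapsed without a driver
            } catch (InterruptedException e) {
                // Happy path: a driver was found and the timer was interrupted.
            }
        });
        timers.put(tripId, timer);
        timer.start();
    }

    /** Happy path: assign the driver and cancel the countdown. */
    void driverFound(String tripId) {
        if (status.replace(tripId, "WAITING", "ASSIGNED")) {
            Thread timer = timers.remove(tripId);
            if (timer != null) {
                timer.interrupt();
            }
        }
    }

    /** Timeout path: only fires if the trip is still WAITING. */
    private void handleNotFound(String tripId) {
        timers.remove(tripId);
        status.replace(tripId, "WAITING", "DRIVER_NOT_FOUND");
    }
}
```

Note that both the countdown and the `timers` map live only in the memory of the pod that created the trip, which is the weak point the rest of this post is about.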

🚖 Let’s imagine a system like Grab

🧪 Booking scenario

You are User A, booking a ride; the system offers the trip to Driver B.

               ┌──────────────────────────────┐
               │    User A books a vehicle    │
               └───────────────┬──────────────┘
                               │
                               ▼
              ┌──────────────────────────────────┐
              │          SERVICE POD             │
              │----------------------------------│
              │ 1. Create trip → WAITING         │
              │ 2. Start countdown timer:        │
              │    .sleep(60s) -> handleNotFound │
              └────────────────┬─────────────────┘
                               │
                               ▼
              (timeout counting...) 59 -> 58 -> 57 -> ... -> 0
                               │
                               ▼
      ┌────────────────────────────────────────────────────┐
      │              ROLLING DEPLOYMENT STARTS             │
      │       • Kubernetes sends SIGTERM                   │
      │       • JVM shuts down                             │
      │       • ALL threads killed instantly               │
      └────────────────────────┬───────────────────────────┘
                               │
                               ▼
                 ❌ **Timeout Thread Disappears**
                 ❌ No recovery logic
                 ❌ No callback fired
                               │
                               ▼
       ┌────────────────────────────────────────────────────┐
       │   TRIP STUCK IN WAITING FOREVER                    │
       │   Client shows "Driver Not Found",                 │
       │   but the server still waits for Driver B.         │
       │   If Driver B accepts, a conflict occurs.          │
       └────────────────────────────────────────────────────┘


⚠️ Why this is dangerous?

Client View             | Backend Reality
------------------------|------------------------------
Shows "Trip failed"     | Trip still WAITING
User creates new trip   | Older trip still listens
Driver accepts late     | Conflict: trip already closed

This causes:

  • Ghost bookings 👻
  • Double driver assignment 🚗🚗
  • Broken user trust 😩
  • Hard-to-debug production issues 🧯
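
To make the double-assignment case concrete, here is a small in-memory replay of the incident. It is only a sketch under assumptions: `StaleTripDemo`, the `Status` enum, the trip IDs and the idea that the client runs its own local countdown are all hypothetical, chosen to mirror the table above rather than taken from the real service.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical simulation of the stale-trip conflict; not the real service code. */
class StaleTripDemo {
    enum Status { WAITING, ASSIGNED, DRIVER_NOT_FOUND }

    // Server-side trip table (stand-in for the database).
    final Map<String, Status> trips = new LinkedHashMap<>();

    void create(String tripId) {
        trips.put(tripId, Status.WAITING);
    }

    /** A driver can accept any trip the server still considers WAITING. */
    boolean accept(String tripId) {
        if (trips.get(tripId) != Status.WAITING) {
            return false;
        }
        trips.put(tripId, Status.ASSIGNED);
        return true;
    }

    /** Replays the incident and returns how many drivers end up assigned to User A. */
    static long replayIncident() {
        StaleTripDemo server = new StaleTripDemo();
        server.create("trip-1");  // User A books; the timer starts on one pod
        // The pod is replaced: the timer is lost, so trip-1 never leaves WAITING.
        // The client's own countdown ends and it shows "Driver Not Found".
        server.create("trip-2");  // User A books again
        server.accept("trip-2");  // another driver accepts the new trip
        server.accept("trip-1");  // Driver B accepts the old trip late: still allowed
        return server.trips.values().stream()
                .filter(s -> s == Status.ASSIGNED)
                .count();
    }
}
```

Because nothing ever moved `trip-1` out of WAITING, the late acceptance succeeds and `replayIncident()` counts two assigned trips for the same rider.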

💡 Root cause

We relied entirely on an in-memory timeout thread.
During deployment, the JVM was terminated, and the thread was killed before it fired.
No backup, no persistence, no recovery.
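
Part of why the timer vanishes so quietly: in Java, virtual threads are always daemon threads, so the JVM does not wait for them when it exits. The sketch below illustrates this, assuming Java 21+; `TimerLostOnExit`, `startTimer` and `pendingAndDaemon` are hypothetical names used only for illustration.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch: a pending virtual-thread timer does not keep the JVM alive. */
class TimerLostOnExit {
    /** Starts a countdown on a virtual thread; onTimeout runs only if the JVM lives long enough. */
    static Thread startTimer(Duration delay, Runnable onTimeout) {
        return Thread.ofVirtual().start(() -> {
            try {
                Thread.sleep(delay);
                onTimeout.run();
            } catch (InterruptedException e) {
                // Timer cancelled; nothing to do.
            }
        });
    }

    /** True when the timer is still counting down and cannot block JVM exit. */
    static boolean pendingAndDaemon(Thread timer, AtomicBoolean fired) {
        return timer.isAlive() && timer.isDaemon() && !fired.get();
    }
}
```

If a program starts such a timer with a 60-second delay and its `main` method then returns, the process exits right away and `onTimeout` never runs. The same thing happens to every pending trip timer when the pod's JVM stops after SIGTERM.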


🚀 Next: how we fixed it (v2)

(Coming in the next article: using a Redis ZSET + Recovery Worker to ensure the timeout survives pod death.)

Tagged: microservices, spring boot, K8S
