First Journey
by 🧑‍💻 Wang Tech on Thu Nov 27 2025
Recently, our system faced a strange issue during a rolling deployment.
When Customer A books a ride, our system tries to assign a driver, say Driver B.
We allow up to 60 seconds to find a driver.
If no driver is found within 60 seconds, we return “Driver Not Found”.
Normally, everything works perfectly.
But during deployment, something went wrong…
Some trips stayed in WAITING status for 120–180 seconds,
even though the timeout limit is only 60 seconds. 😳
🤔 Why does this happen?
The system uses a virtual thread to handle each trip's timeout.
- When a trip is created, a timer thread starts and sleeps for the timeout (e.g., 60s).
- If a driver is found → we interrupt the timer (happy path).
- If no driver is found → the timer wakes up and triggers the driver-not-found callback.
This mechanism works well on all pods… until a deployment happens.
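The timer flow described above can be sketched roughly as follows. This is a minimal illustration, not the production code: the class `InMemoryTripTimeouts` and its method names are hypothetical, and the thread factory is injected so the same logic can run on virtual threads (for example `Thread.ofVirtual().factory()` on Java 21+) or on platform threads.

```java
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadFactory;
import java.util.function.Consumer;

/** Hypothetical in-memory timeout registry: one sleeping timer thread per trip. */
class InMemoryTripTimeouts {
    private final Map<String, Thread> timers = new ConcurrentHashMap<>();
    private final ThreadFactory threadFactory;
    private final Duration timeout;
    private final Consumer<String> onDriverNotFound;

    InMemoryTripTimeouts(ThreadFactory threadFactory, Duration timeout,
                         Consumer<String> onDriverNotFound) {
        this.threadFactory = threadFactory;
        this.timeout = timeout;
        this.onDriverNotFound = onDriverNotFound;
    }

    /** Called when a trip is created: start the countdown. */
    void start(String tripId) {
        Thread timer = threadFactory.newThread(() -> {
            try {
                Thread.sleep(timeout.toMillis());
                // Fire only if this timer is still the registered one for the trip.
                if (timers.remove(tripId, Thread.currentThread())) {
                    onDriverNotFound.accept(tripId);
                }
            } catch (InterruptedException e) {
                // Happy path: a driver was found and the timer was cancelled.
            }
        });
        timers.put(tripId, timer);
        timer.start();
    }

    /** Called when a driver accepts: interrupt the timer. Returns false if none is pending. */
    boolean driverFound(String tripId) {
        Thread timer = timers.remove(tripId);
        if (timer == null) {
            return false;
        }
        timer.interrupt();
        return true;
    }
}
```

In this sketch the only record of a pending timeout is the sleeping thread itself, so nothing outside the JVM knows that the deadline exists.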
Let’s imagine a system like Grab
🧪 Booking scenario
You are User A, booking a ride that will be offered to Driver B.
┌──────────────────────────────┐
│     User A books a ride      │
└───────────────┬──────────────┘
                │
                ▼
┌──────────────────────────────────┐
│           SERVICE POD            │
│----------------------------------│
│ 1. Create trip → WAITING         │
│ 2. Start countdown timer:        │
│    .sleep(60s) -> handleNotFound │
└───────────────┬──────────────────┘
                │
                ▼
(timeout counting...) 59 -> 58 -> 57 -> ... (interrupted before 0)
                │
                ▼
┌────────────────────────────────────────────────────┐
│             ROLLING DEPLOYMENT STARTS              │
│ • Kubernetes sends SIGTERM                         │
│ • JVM shuts down                                   │
│ • ALL threads killed instantly                     │
└───────────────┬────────────────────────────────────┘
                │
                ▼
        ❌ **Timeout Thread Disappears**
        ❌ No recovery logic
        ❌ No callback fired
                │
                ▼
┌────────────────────────────────────────────────────┐
│           TRIP STUCK IN WAITING FOREVER            │
│ Client shows “Driver Not Found”                    │
│ But server still waits for Driver B to confirm.    │
│ If the driver accepts the trip, a conflict occurs. │
└────────────────────────────────────────────────────┘
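The failure in the diagram can be reproduced in miniature. The sketch below is an assumption-laden simulation, not the real system: `PodDeathSimulation` and its names are hypothetical, a map stands in for the shared trip database, and `shutdownNow()` on an executor stands in for the pod's JVM exiting. A real JVM exit simply stops the thread rather than interrupting it, but the effect on the pending callback is the same.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class PodDeathSimulation {
    enum Status { WAITING, DRIVER_NOT_FOUND }

    /** Returns the trip status observed after the original deadline has passed. */
    static Status simulate(long timeoutMillis, long podDiesAfterMillis) {
        // Stand-in for the shared database, which outlives any single pod.
        Map<String, Status> tripStore = new ConcurrentHashMap<>();
        // Stand-in for the pod: its timer thread exists only while the pod is alive.
        ExecutorService pod = Executors.newSingleThreadExecutor();
        try {
            tripStore.put("trip-1", Status.WAITING);
            pod.execute(() -> {
                try {
                    Thread.sleep(timeoutMillis);
                    tripStore.put("trip-1", Status.DRIVER_NOT_FOUND);
                } catch (InterruptedException e) {
                    // The pod is gone; the pending timeout is lost with it.
                }
            });
            Thread.sleep(podDiesAfterMillis);
            pod.shutdownNow();                          // simulated pod termination
            pod.awaitTermination(1, TimeUnit.SECONDS);
            Thread.sleep(timeoutMillis);                // wait well past the deadline
            return tripStore.get("trip-1");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pod.shutdownNow();
        }
    }
}
```

Calling `PodDeathSimulation.simulate(300, 50)` models a pod that dies 50 ms into a 300 ms countdown: the trip is still WAITING afterwards because nothing else knows about the deadline.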
⚠️ Why is this dangerous?
| Client View | Backend Reality |
|---|---|
| Shows “Driver Not Found” | Trip is still WAITING |
| User creates a new trip | Old trip still waits for a driver |
| Driver accepts the old trip late | Conflict: the client already treats that trip as closed |
This causes:
- Ghost bookings 👻
- Double driver assignment
- Broken user trust
- Hard-to-debug production issues 🧯
💡 Root cause
We relied entirely on an in-memory timeout thread.
During deployment, the JVM was terminated, and the thread was killed before it fired.
No backup, no persistence, no recovery.
Next: how we fixed it (v2)
(Coming in the next article: using a Redis ZSET + Recovery Worker to ensure the timeout survives pod death)
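Until that article is available, the general idea can be illustrated with a rough sketch. This is not the author's implementation: `DeadlineStore` and `RecoveryWorker` are hypothetical names, and an in-memory map stands in for a Redis sorted set whose score would be the trip's deadline in epoch milliseconds (the corresponding Redis commands would be `ZADD`, `ZRANGEBYSCORE`, and `ZREM`).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** In-memory stand-in for a sorted set of (tripId, deadline) kept outside the pod. */
class DeadlineStore {
    private final Map<String, Long> deadlines = new ConcurrentHashMap<>();

    /** Roughly ZADD: remember when the trip must time out. */
    void schedule(String tripId, long deadlineEpochMillis) {
        deadlines.put(tripId, deadlineEpochMillis);
    }

    /** Roughly ZREM: a driver was found, so the deadline no longer matters. */
    boolean cancel(String tripId) {
        return deadlines.remove(tripId) != null;
    }

    /** Roughly ZRANGEBYSCORE up to now, then ZREM: claim every overdue trip once. */
    List<String> pollExpired(long nowEpochMillis) {
        Map<String, Long> snapshot = new HashMap<>(deadlines);
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Long> entry : snapshot.entrySet()) {
            // remove(key, value) succeeds for only one caller, so each trip is claimed once.
            if (entry.getValue() <= nowEpochMillis
                    && deadlines.remove(entry.getKey(), entry.getValue())) {
                expired.add(entry.getKey());
            }
        }
        // Earliest deadline first, mirroring sorted-set ordering by score.
        expired.sort(Comparator.comparingLong((String id) -> snapshot.get(id)));
        return expired;
    }
}

/** Any live pod can run this periodically; it does not depend on the pod that created the trip. */
class RecoveryWorker {
    static int runOnce(DeadlineStore store, long nowEpochMillis, Consumer<String> onDriverNotFound) {
        List<String> expired = store.pollExpired(nowEpochMillis);
        expired.forEach(onDriverNotFound);
        return expired.size();
    }
}
```

Because the deadline lives outside any single JVM in this design, a trip whose original pod was terminated can still be timed out by whichever pod runs the worker next. The sketch ignores the case where a worker dies between claiming a trip and invoking the callback.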
Tagged: microservices, spring boot, K8S