First Journey

by 🧑‍💻 Wang Tech on Thu Nov 27 2025

Recently, our system faced a strange issue during a rolling deployment.

When Customer A books a ride, our system tries to assign Driver B.
We allow up to 60 seconds to find a driver.
If no driver is found within 60 seconds, we return “Driver Not Found”.

Normally, everything works perfectly.

But during deployment, something went wrong…
Some trips stayed in WAITING status for 120–180 seconds,
even though the timeout limit is only 60 seconds. 😳

🤔 Why does this happen?

The system uses a Virtual Thread to handle timeouts.

When a trip is created, a timer thread starts (e.g., 60s).
If a driver is found → we interrupt the timer (happy path).
If no driver is found → the timer fires the driver-not-found callback.
This mechanism works well on all pods… until deployment happens.
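
To make the mechanism concrete, here is a minimal sketch of such a timer, assuming Java 21+ virtual threads. The class TripTimeouts and its method names are hypothetical and only illustrate the idea; this is not our actual service code.

```java
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical sketch (Java 21+). Class and method names are illustrative only.
final class TripTimeouts {
    // tripId -> timer thread. This map lives only in the memory of one JVM.
    private final Map<String, Thread> timers = new ConcurrentHashMap<>();

    /** Starts a virtual-thread timer that calls onDriverNotFound unless it is cancelled first. */
    void start(String tripId, Duration timeout, Consumer<String> onDriverNotFound) {
        Thread timer = Thread.ofVirtual().unstarted(() -> {
            try {
                Thread.sleep(timeout);
            } catch (InterruptedException e) {
                return; // happy path: a driver was found and the timer was interrupted
            }
            // Fire only if this timer is still registered, i.e. no driver claimed the trip.
            if (timers.remove(tripId, Thread.currentThread())) {
                onDriverNotFound.accept(tripId);
            }
        });
        timers.put(tripId, timer); // register before starting so the timer can find itself
        timer.start();
    }

    /** Cancels the timer. Returns false if the timeout already fired or the trip is unknown. */
    boolean driverFound(String tripId) {
        Thread timer = timers.remove(tripId);
        if (timer == null) {
            return false;
        }
        timer.interrupt();
        return true;
    }
}
```

Notice that the only record of the pending timeout is the sleeping thread and the map entry. Both exist only inside the pod that created the trip.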

🚖 Let’s imagine the system works like Grab

🧪 Booking scenario

You are User A, booking a ride; the system tries to assign Driver B.

               ┌──────────────────────────────┐
               │    User A books a vehicle    │
               └───────────────┬──────────────┘
                               │
                               ▼
              ┌──────────────────────────────────┐
              │          SERVICE POD             │
              │----------------------------------│
              │ 1. Create trip → WAITING         │
              │ 2. Start countdown timer:        │
              │    .sleep(60s) -> handleNotFound │
              └───────────────┬──────────────────┘
                              │
                              ▼
              (timeout counting...) 59 -> 58 -> 57 -> ... -> 0
                              │
                              ▼
      ┌────────────────────────────────────────────────────┐
      │              ROLLING DEPLOYMENT STARTS             │
      │       • Kubernetes sends SIGTERM                   │
      │       • JVM shuts down                             │
      │       • ALL threads killed instantly               │
      └───────────────────────┬────────────────────────────┘
                              │
                              ▼
                 ❌ **Timeout Thread Disappears**
                 ❌ No recovery logic
                 ❌ No callback fired
                              │
                              ▼
       ┌────────────────────────────────────────────────────┐
       │   TRIP STUCK IN WAITING FOREVER                    │
       │   Client shows "Driver Not Found"                  │
       │   Server still waits for Driver B to confirm.      │
       │   If Driver B accepts late, a conflict occurs.     │
       └────────────────────────────────────────────────────┘

⚠️ Why is this dangerous?

Client View              Backend Reality
Shows trip failed        Trip still WAITING
User creates new trip    Older trip still listens
Driver accepts late      Conflict: trip already closed

This causes:

Ghost bookings 👻
Double driver assignment 🚗🚗
Broken user trust 😩
Hard-to-debug production issues 🧯

💡 Root cause

We relied entirely on an in-memory timeout thread.
During deployment, the JVM was terminated, and the thread was killed before it fired.
No backup, no persistence, no recovery.

🚀 Next: how we fixed it (v2)

(Coming in the next article: using a Redis ZSET + Recovery Worker to ensure the timeout survives pod death)
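
Until then, here is a rough sketch of the general idea, not the actual v2 implementation: store each trip's deadline outside the timer thread, and let a worker claim whatever is overdue. DeadlineIndex and its methods are hypothetical names, and a plain in-memory map stands in for the Redis sorted set so the example runs on its own. A real version would keep the data in Redis, so that it outlives the pod.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch. An in-memory map stands in for a Redis sorted set
// whose members are trip IDs and whose scores are deadlines (epoch millis).
final class DeadlineIndex {
    private final Map<String, Long> deadlines = new ConcurrentHashMap<>();

    /** Records the deadline when the trip is created (ZADD in Redis terms). */
    void register(String tripId, Instant deadline) {
        deadlines.put(tripId, deadline.toEpochMilli());
    }

    /** Removes the deadline when a driver is found (ZREM in Redis terms). */
    void complete(String tripId) {
        deadlines.remove(tripId);
    }

    /** Claims every trip whose deadline has passed; each trip goes to exactly one caller. */
    List<String> claimExpired(Instant now) {
        long cutoff = now.toEpochMilli();
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Long> e : deadlines.entrySet()) {
            // The two-argument remove succeeds for only one caller, so two workers
            // sweeping at the same time cannot both expire the same trip.
            if (e.getValue() <= cutoff && deadlines.remove(e.getKey(), e.getValue())) {
                expired.add(e.getKey());
            }
        }
        return expired;
    }
}
```

A recovery worker on any pod could call claimExpired(Instant.now()) every few seconds and run the driver-not-found callback for each returned trip. Because the deadline is data rather than a sleeping thread, a restarted pod can pick up where the old one stopped.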

Tagged: microservices spring boot K8S
