Database connection pool was bleeding connections on every retry. Not leaking. Bleeding. Each failed attempt created a new socket that never got cleaned up, and the failure only got logged as a warning, not an error. So nobody saw it.
Started tracking what was actually happening. Connection in, transaction fails, connection stays open, next transaction tries to create a new one. Repeat. After about six hours the whole thing just stops responding.
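Rough sketch of that cycle, not the real code. `FakeConn`, `FakePool`, `failing_transaction`, and `leaky_retry` are all made-up names, and the pool here only counts what it hands out. The assumption is that the original loop closed the connection on the success path only.

```python
class FakeConn:
    """Stand-in for a socket-backed connection (hypothetical)."""

    def __init__(self):
        self.open = True

    def close(self):
        self.open = False


class FakePool:
    """Hypothetical pool that only tracks the connections it handed out."""

    def __init__(self):
        self.handed_out = []

    def connect(self):
        conn = FakeConn()
        self.handed_out.append(conn)
        return conn

    def open_count(self):
        return sum(1 for conn in self.handed_out if conn.open)


def failing_transaction(conn):
    """Always fails, to simulate the bad path."""
    raise RuntimeError("transaction failed")


def leaky_retry(pool, attempts):
    """The broken pattern: a fresh connection per attempt, closed only on success."""
    for _ in range(attempts):
        conn = pool.connect()
        try:
            failing_transaction(conn)
            conn.close()  # only reached when the transaction succeeds
            return True
        except RuntimeError:
            continue  # the failed connection is never closed
    return False
```

Run `leaky_retry(FakePool(), 5)` against a pool and `open_count()` comes back as 5. One stranded connection per failed attempt.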
Turned the retry logic inside out. Now the cost of ignoring a single failed connection is actually visible. The money goes in real clean now, the money comes out the same way.
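One way that could look, as a sketch under assumptions and not the actual change. `Conn` and `retry_with_cleanup` are hypothetical names. The idea shown: every attempt closes its connection no matter how it ends, and every failure is logged at error level and counted.

```python
import logging
from contextlib import closing

log = logging.getLogger("pool")


class Conn:
    """Stand-in for a real connection (hypothetical)."""

    def __init__(self):
        self.open = True

    def close(self):
        self.open = False


def retry_with_cleanup(connect, transaction, attempts):
    """Retry `transaction`, closing the connection after every attempt.

    Returns (result, failures) on success; raises after the last failed attempt.
    """
    failures = 0
    for attempt in range(1, attempts + 1):
        # closing() calls conn.close() on the way out, success or failure.
        with closing(connect()) as conn:
            try:
                return transaction(conn), failures
            except Exception:
                failures += 1
                # Error, not warning, so a failed attempt shows up.
                log.error("attempt %d failed; connection closed", attempt, exc_info=True)
    raise RuntimeError(f"gave up after {failures} failed attempts")
```

The failure count comes back with the result, so a caller can see what the retries cost instead of finding out six hours later.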
Found something weird in the logs at 4am.