'just delete the bad rows and move on' -- yeah, sure, if only it were that simple.
We've got about 40k records in our agent platform that have garbage timestamps, missing user IDs, or both. Not catastrophic, but they're festering in our database and slowing down every analytics query. I'm trying to figure out the least painful way to actually fix this without taking down the service for an hour.
I looked at a few options. Claude 4.7 can parse the messy records and spit out a corrected JSON payload pretty reliably, but then I still have to validate and upsert. GPT-5 is overkill for this. Gemini 2.5 is faster and cheaper but hallucinates more on edge cases where a timestamp is ambiguous.
Then there's just writing a Python script with pandas, which is the boring answer but probably the right one. Validate, transform, write to a staging table, flip a switch. Takes maybe two hours to ship. I keep coming back to it because it's deterministic and I can actually audit what changed.
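Here's roughly the shape I have in mind, as a minimal sketch and not the final script. It uses only the standard library (sqlite3 in place of our real database, plain dicts in place of a pandas DataFrame) so it runs anywhere. The table names (events, events_staging, events_old), the column names, and the rule that naive timestamps are treated as UTC are all assumptions for illustration. Rows that can't be fixed deterministically are set aside with a reason instead of being guessed at.

```python
import sqlite3
from datetime import datetime, timezone


def parse_ts(raw):
    """Return an ISO-8601 UTC string, or None if raw is not a parseable ISO timestamp."""
    if not isinstance(raw, str):
        return None
    try:
        ts = datetime.fromisoformat(raw.strip())
    except ValueError:
        return None
    if ts.tzinfo is None:
        # Assumption: timestamps without an offset were recorded in UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc).isoformat()


def split_records(rows):
    """Validate and transform rows; return (clean, rejected) so every decision is auditable."""
    clean, rejected = [], []
    for row in rows:
        ts = parse_ts(row.get("created_at"))
        user_id = row.get("user_id")
        reasons = []
        if ts is None:
            reasons.append("bad_timestamp")
        if not user_id:
            reasons.append("missing_user_id")
        if reasons:
            rejected.append({**row, "reason": ",".join(reasons)})
        else:
            clean.append({"id": row["id"], "user_id": user_id, "created_at": ts})
    return clean, rejected


def stage_and_swap(conn, clean):
    """Load clean rows into a staging table, then swap it in with two renames."""
    conn.execute("BEGIN")  # explicit transaction so the DDL and inserts commit together
    with conn:  # commits on success, rolls back on any exception
        conn.execute("DROP TABLE IF EXISTS events_staging")
        conn.execute(
            "CREATE TABLE events_staging ("
            "id INTEGER PRIMARY KEY, user_id TEXT NOT NULL, created_at TEXT NOT NULL)"
        )
        conn.executemany(
            "INSERT INTO events_staging VALUES (:id, :user_id, :created_at)", clean
        )
        # The "switch": the old table is kept as events_old for audit and rollback.
        conn.execute("ALTER TABLE events RENAME TO events_old")
        conn.execute("ALTER TABLE events_staging RENAME TO events")
```

The rejected list is the audit trail: it can be dumped to a file and reviewed by hand before anything is deleted. Note the swap will fail if events_old already exists from a previous run, which is probably what I want, since it forces me to deal with the last backup before making a new one.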
The viral tool everyone's tweeting about today is some GPT wrapper for data cleaning, which is fine, but it doesn't change the fact that I need to own the output and know exactly what moved. Maybe that's old-school thinking in the AI age, but we're running a service where data integrity actually matters.
I'm just going to write the script.