Saw someone's tool blow up on X today. Probably fine. But it got me thinking about something I've been circling around with the sensor data work.
Chunking strategies are way more fragile than people admit. You can have perfect embeddings, perfect reranking, but if your chunks don't respect the domain structure of your data, you're just averaging noise.
Actually wait, that's not quite it. It's more that the chunk boundary problem is invisible until it breaks your retrieval at scale. You test on 10k documents with nice clean structure. Ship. Then production gets messy.
Been nursing this takeout and realizing: most benchmarks for RAG systems don't test chunk decay, meaning how retrieval degrades once chunk boundaries stop lining up with the real structure of the data. They test retrieval on clean splits. Which is fine for papers. But in the actual work, chunks drift. Sensor readings span multiple logical units. Time series chunk boundaries look nothing like semantic ones.
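A rough sketch of the boundary mismatch, not anyone's production chunker. It assumes readings are (timestamp, value) pairs and that a long gap between timestamps marks a new logical unit. The names chunk_fixed, chunk_by_gap, and max_gap are made up for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical reading shape: (timestamp_seconds, value)
Reading = Tuple[float, float]


def chunk_fixed(readings: Sequence[Reading], size: int) -> List[List[Reading]]:
    """Fixed-count chunks: a boundary falls wherever the counter says."""
    return [list(readings[i:i + size]) for i in range(0, len(readings), size)]


def chunk_by_gap(readings: Sequence[Reading], max_gap: float) -> List[List[Reading]]:
    """Start a new chunk when consecutive timestamps are more than max_gap apart."""
    chunks: List[List[Reading]] = []
    for reading in readings:
        if chunks and reading[0] - chunks[-1][-1][0] <= max_gap:
            chunks[-1].append(reading)
        else:
            chunks.append([reading])
    return chunks


# Two bursts of readings separated by a long quiet period.
readings = [(0.0, 1.0), (1.0, 1.1), (2.0, 1.2), (60.0, 5.0), (61.0, 5.1), (62.0, 5.2)]

fixed = chunk_fixed(readings, size=4)          # first chunk straddles both bursts
gapped = chunk_by_gap(readings, max_gap=5.0)   # one chunk per burst
```

With size=4 the first fixed chunk swallows the opening reading of the second burst, which is exactly the kind of split that looks fine on clean test data and falls apart on real sensor streams. The gap rule is only one possible heuristic; the right boundary signal depends on the domain.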
I've been using Claude 4.7 for query expansion on this, which helps sometimes. But it's a band-aid. The real problem is earlier.
Not sure I have a point here. Just thinking that the retrieval layer gets less oxygen than it should. Everyone wants to talk about reranking and embeddings. Nobody wants to admit their chunking strategy is held together with string.
Memory explodes in pandas when you load too many overlapping chunks anyway. Config for reference: chunk_size: 512, chunk_overlap: 128
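Back-of-envelope on those numbers, as a sketch. With chunk_size 512 and chunk_overlap 128 the stride is 384, so in steady state every item gets stored about 512 / 384 ≈ 1.33 times. The helper below (iter_chunks is a hypothetical name, and the 10,000-item input is made up) yields chunks one at a time so nothing forces all of them into memory together. It uses plain sequences rather than pandas.

```python
from typing import Iterator, Sequence


def iter_chunks(
    items: Sequence[int], chunk_size: int = 512, chunk_overlap: int = 128
) -> Iterator[Sequence[int]]:
    """Yield overlapping windows lazily instead of building a full list of chunks."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # 512 - 128 = 384
    for start in range(0, len(items), stride):
        yield items[start:start + chunk_size]
        if start + chunk_size >= len(items):
            break  # the last window already reached the end of the data


items = list(range(10_000))  # made-up input size
chunk_lengths = [len(chunk) for chunk in iter_chunks(items)]
stored = sum(chunk_lengths)          # total items held if every chunk were kept
duplication = stored / len(items)    # how many times the data is effectively stored
```

For this input that works out to 26 chunks and 13,200 stored items, a 1.32x blow-up. The overlap alone is modest, so if memory really explodes the likelier culprit is materializing every chunk at once (plus per-chunk overhead), which is what the generator avoids as long as the consumer doesn't collect everything into a list.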