5 posts tagged with "data-pipelines"

The Wiki Edit Mid-Flight When Your RAG Pipeline Read It

· 11 min read
Tian Pan
Software Engineer

A tech writer on your platform team is moving a paragraph. Not metaphorically — literally cutting a section from the onboarding page, pasting it into the runbook, deleting a stub draft on a third page, and rewording a deprecated warning on a fourth. The whole edit takes her about eleven minutes. Your RAG ingest job runs every fifteen. It happens to fire at minute six.

For the next fifteen minutes, your retrieval index contains a state of the wiki that did not exist at any single moment in her mind. The onboarding page still has the section. The runbook still doesn't. The stub draft is captured halfway through being deleted, with a placeholder sentence she never intended to publish. The old deprecated warning is still indexed. When an engineer asks the agent "how do we handle credential rotation in this service," the model retrieves contradictory chunks from the same source and confidently synthesizes whichever was ranked higher. The answer is wrong in a shape no one wrote.

This is a failure mode most teams ship without noticing: the source-of-truth is transactional, the ingest is a poll, and the gap between them is where dirty reads live.
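One cheap mitigation is a quiet-period gate: skip the ingest run if anything in the wiki changed within a recent window, on the theory that an edit session is probably still in progress. The sketch below is illustrative only. `PageRevision`, `should_ingest`, and the 900-second default are hypothetical names and values, not part of any particular wiki or RAG framework, and a quiet period is a heuristic that narrows the dirty-read window rather than a transactional snapshot.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PageRevision:
    """Hypothetical record of a page's latest edit (assumed shape)."""
    page_id: str
    edited_at: float  # seconds since epoch of the last edit


def should_ingest(
    revisions: Iterable[PageRevision],
    now: float,
    quiet_seconds: float = 900.0,  # assumed window, tune per wiki
) -> bool:
    """Return True only if no page was edited within the quiet window."""
    return all(now - r.edited_at >= quiet_seconds for r in revisions)


revisions = [
    PageRevision("onboarding", edited_at=1000.0),
    PageRevision("runbook", edited_at=1300.0),
]

# The runbook was edited 100 seconds ago, so this run is deferred.
deferred = not should_ingest(revisions, now=1400.0)

# 1000 seconds after the last edit, the wiki counts as settled.
settled = should_ingest(revisions, now=2300.0)
```

The trade is explicit: the index lags a little more on busy days in exchange for not capturing a state that never existed in the author's head.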

Knowledge Graph Staleness Has a Different SLA Than Vector Staleness

· 10 min read
Tian Pan
Software Engineer

The vector index is wrong by approximately ten percent and nobody panics. The knowledge graph is wrong by one missing edge and somebody ships a wrong answer to a regulator. The two failure modes look identical from the data engineering org chart — both are "the index is stale" — and they sit behind the same change-data-capture pipeline with the same lag tolerance. The pipeline was sized for the vector workload because that was the louder consumer. The graph silently inherited those defaults, and the silence is the bug.

Vector retrieval and graph retrieval fail differently under staleness, and treating them as the same kind of lag problem is how you end up with a system that scores well on RAG benchmarks and is silently wrong on multi-hop queries — the silently-wrong case being, of course, the one users notice last. The fix is not faster pipelines. The fix is recognizing that "stale" means two different things, designing freshness tiers per edge class, and building the eval that catches the difference before a regulator does.
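To make "freshness tiers per edge class" concrete, here is a minimal sketch that gives each class of graph edge its own lag budget instead of one pipeline-wide default. The edge class names, the budgets, and the `stale_edge_classes` helper are all assumptions chosen for illustration; real tiers would come from how costly a wrong answer is for each edge type.

```python
from typing import Dict, List

# Hypothetical lag budgets in seconds, one per edge class.
MAX_LAG_SECONDS: Dict[str, float] = {
    "reports_to": 60.0,      # assumed: compliance-relevant, near-real-time
    "owns_service": 300.0,   # assumed: operationally important
    "mentions": 86_400.0,    # assumed: soft signal, daily is fine
}
DEFAULT_MAX_LAG = 3_600.0    # assumed fallback for unclassified edges


def stale_edge_classes(last_synced: Dict[str, float], now: float) -> List[str]:
    """Return the edge classes whose last sync exceeds their own budget."""
    return sorted(
        edge_class
        for edge_class, synced_at in last_synced.items()
        if now - synced_at > MAX_LAG_SECONDS.get(edge_class, DEFAULT_MAX_LAG)
    )


# Every class was last synced 10 minutes (600 s) ago.
now = 10_000.0
last_synced = {"reports_to": 9_400.0, "owns_service": 9_400.0, "mentions": 9_400.0}
stale = stale_edge_classes(last_synced, now)
```

With a single shared tolerance, ten minutes of lag is either fine for everything or a breach for everything. With tiers, the same lag flags `owns_service` and `reports_to` while leaving `mentions` alone.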

Agentic Data Pipelines: Offline Enrichment and Classification at Scale

· 9 min read
Tian Pan
Software Engineer

You have a batch job that classifies 10 million customer support tickets overnight. You swap the regex classifier for an LLM and the accuracy jumps from 61% to 89%. Then you ship it and discover: the job now costs 40x more, runs 12x slower, silently skips 3% of records when the model returns unparseable output, and your downstream analytics team is filing bugs because the label schema drifted without anyone noticing.
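The silent-skip and schema-drift failures share a fix: treat every model response as untrusted input, validate it against a fixed label set, and route anything that fails to a dead-letter list that gets counted. The sketch below assumes the model is asked to return JSON of the form `{"label": "..."}`. The label set and the `parse_label` and `classify_batch` helpers are hypothetical, and no real LLM client is called here.

```python
import json
from typing import Dict, List, Optional, Tuple

# Hypothetical fixed label schema; anything outside it is rejected.
ALLOWED_LABELS = {"billing", "bug", "how_to", "other"}


def parse_label(raw: str) -> Optional[str]:
    """Return a valid label, or None if the output is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    label = data.get("label")
    if isinstance(label, str) and label in ALLOWED_LABELS:
        return label
    return None


def classify_batch(outputs: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Split raw model outputs into labeled records and a dead-letter list."""
    labeled: Dict[str, str] = {}
    dead_letter: List[str] = []
    for ticket_id, raw in outputs.items():
        label = parse_label(raw)
        if label is None:
            dead_letter.append(ticket_id)  # counted and retried, never dropped
        else:
            labeled[ticket_id] = label
    return labeled, dead_letter


raw_outputs = {
    "t1": '{"label": "bug"}',
    "t2": "Sure! The label is bug.",   # unparseable prose
    "t3": '{"label": "refund"}',       # label outside the schema
}
labeled, dead_letter = classify_batch(raw_outputs)
```

The invariant worth alerting on is that `len(labeled) + len(dead_letter)` equals the input count, so a 3% loss shows up as a number on a dashboard instead of a gap in the warehouse.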

Agentic data pipelines break in ways that ETL engineers haven't seen before, and the fixes require a different mental model than either traditional batch processing or real-time LLM serving.

LLMs as ETL Primitives: AI in the Data Pipeline, Not Just the Product

· 9 min read
Tian Pan
Software Engineer

The typical AI narrative goes like this: you build a product, you add an AI feature, and users get smarter outputs. That framing is correct, but incomplete. The more durable advantage isn't in the product layer at all — it's in the data pipeline running underneath it.

A growing number of engineering teams have quietly replaced the regex rules, custom classifiers, and hand-coded parsers in their ETL pipelines with LLM calls. The result: pipelines that handle unstructured input, adapt to schema drift, and classify records across thousands of categories, all without retraining a model for every new edge case. Teams running this pattern at scale are building data assets that compound. Teams still treating LLMs purely as product features are not.

The Batch LLM Pipeline Blind Spot: Queue Design, Checkpointing, and Cost Attribution for Offline AI

· 12 min read
Tian Pan
Software Engineer

Most production AI engineering advice assumes you're building a chatbot. The architecture discussions center on time-to-first-token, streaming partial responses, and sub-second latency budgets. But a growing share of real LLM workloads look nothing like a chat interface. They look like nightly data enrichment jobs, weekly document classification runs, and monthly compliance reviews over millions of records.

These batch pipelines are where teams quietly burn the most money, lose the most data to silent failures, and carry the most technical debt — precisely because the latency-first mental model from real-time serving doesn't apply, and nobody has replaced it with something better.