Skip to main content

861 posts tagged with "insider"

View all tags

Document Extraction Is Your RAG System's Hidden Ceiling

· 10 min read
Tian Pan
Software Engineer

A compliance contractor builds a RAG system to answer questions against a 400-page policy document. The system passes internal QA. It retrieves correctly against single-topic queries. Then it goes live and starts returning confident, well-structured, wrong answers on anything involving exception clauses.

The debugging loop looks familiar: swap the embedding model, tune similarity thresholds, experiment with chunk sizes, add a reranker. Weeks pass. The improvement is marginal. The real problem is that a key exception clause was split across two chunks at a paragraph boundary — not because of chunking strategy, but because the PDF extractor silently broke the paragraph in two when it misread the layout. Neither chunk, in isolation, is retrievable or interpretable. The system cannot hallucinate its way to a correct answer because the correct information never entered the index cleanly.

This is the extraction ceiling: the point beyond which no downstream optimization can compensate for corrupted or missing input data.

Earned Autonomy: How to Graduate AI Agents from Supervised to Independent Operation

· 10 min read
Tian Pan
Software Engineer

Most teams treat AI autonomy as a binary switch: the agent is either supervised or it isn't. That framing is why 80% of organizations report unintended agent actions, and why Gartner projects that more than 40% of agentic AI projects will be abandoned by end of 2027 due to inadequate risk controls. The problem isn't that AI agents are inherently untrustworthy—it's that teams promote them to independence before earning it.

Autonomy should be something an agent accumulates through demonstrated reliability, not a property you assign at deployment. The same way a new engineer starts by reviewing PRs before getting production access, an AI agent should operate with progressively expanding scope as it builds a track record. This isn't just philosophical—it changes the specific architectural decisions you make, the metrics you track, and how you design your rollback mechanisms.

The Edge Inference Decision Framework: When to Run AI Models Locally Instead of in the Cloud

· 12 min read
Tian Pan
Software Engineer

Most teams make the cloud-vs-edge decision by gut instinct: cloud is easier, so they default to cloud. Then a HIPAA audit hits, or the latency SLO slips by 400ms, or the monthly invoice arrives. Only then do they ask whether some of that inference should have been local all along.

The answer is almost never "all cloud" or "all edge." The teams running production AI at scale have settled on a tiered architecture: an on-device or on-premise model handles the majority of requests, and a cloud frontier model catches what the smaller model can't. Getting that routing right is an engineering decision, not an intuition.

This is the decision framework for making it rigorously.

The Enterprise AI Capability Discovery Problem

· 10 min read
Tian Pan
Software Engineer

You shipped the AI feature. You put it in the product. You wrote the help doc. And still, six months later, your most sophisticated enterprise users are copy-pasting text into ChatGPT to do the same thing your feature already does natively. This is not a training problem. It is a discoverability problem, and it is one of the most consistent sources of wasted AI investment in enterprise software today.

The pattern is well-documented: 49% of workers report they never use AI in their role, and 74% of companies struggle to scale value from AI deployments. But the interesting failure mode is not the late-adopters who explicitly resist. It is the engaged users who open your product every day, never knowing that the AI capability they would have paid for is sitting one click away from where their cursor already is.

Why Your Document Extractor Breaks on the Contracts That Matter Most

· 13 min read
Tian Pan
Software Engineer

Your invoice parser probably works fine. Feed it a clean, digital PDF from a Fortune 500 vendor — structured rows, consistent column widths, machine-generated text — and it will extract line items with near-perfect accuracy. Then someone uploads a multi-page contract from a regional supplier, a scanned form with handwritten amendments, or a financial statement where the table header lives on page 3 and the rows continue through page 6. The extractor fails silently, returns partial data, or confidently produces structured output that is wrong in ways no downstream validation catches.

This is the central problem with enterprise document intelligence: the documents that break your system are not the edge cases. They are the ones with the highest business value.

Enterprise RAG Governance: The Org Chart Behind Your Retrieval Pipeline

· 11 min read
Tian Pan
Software Engineer

Forty to sixty percent of enterprise RAG deployments fail to reach production. The culprit is almost never the retrieval algorithm—HNSW indexing works fine, embeddings are reasonably good, and vector similarity search is a solved problem. The breakdown happens upstream and downstream: no document ownership, no access controls enforced at query time, PII sitting unprotected in vector indexes, and a retrieval corpus that diverges from reality within weeks of launch. These are governance failures, and most engineering teams treat them as someone else's problem right up until a compliance team, a security audit, or a user who received another tenant's data makes it their problem.

This is the organizational and technical anatomy of a governed RAG knowledge base—written for engineers who own the pipeline, not executives who approved the budget.

Event-Driven Agent Scheduling: Why Cron + REST Calls Fail for Recurring AI Workloads

· 11 min read
Tian Pan
Software Engineer

The most common way teams schedule recurring AI agent jobs is also the most dangerous: a cron entry that fires a REST call every N minutes, which kicks off an LLM workflow, which either finishes or silently doesn't. This pattern feels fine in staging. In production, it creates a class of failures that are uniquely hard to detect, recover from, and reason about.

Cron was designed in 1975 for sysadmin scripts. The assumptions it encodes—short runtime, stateless execution, fire-and-forget outcomes—are wrong for LLM workloads in every dimension. Recurring AI agent jobs are long-running, stateful, expensive, and fail in ways that compound across retries. Using cron to schedule them is not just a reliability risk. It's a visibility risk. When things go wrong, you often won't know.

GraphRAG vs. Vector RAG: When Knowledge Graphs Beat Embeddings

· 9 min read
Tian Pan
Software Engineer

Most teams reach for vector embeddings when building RAG pipelines. It's the obvious default: embed documents, embed queries, find the nearest neighbors, feed results to the LLM. It works well enough on the demos. Then they deploy to a compliance team or a scientific literature corpus, and accuracy falls off a cliff. Not gradually — abruptly. On queries involving five or more entities, vector RAG accuracy in enterprise analytics benchmarks drops to zero. Not 50%. Not 20%. Zero.

This isn't a configuration problem. It's an architectural mismatch. Vector retrieval treats documents as points in semantic space. Knowledge graphs treat them as nodes in a relational structure. When your queries require traversing relationships — not just finding similar content — the topology of your retrieval architecture is what determines whether you get the right answer.

Where to Put the Human: Placement Theory for AI Approval Gates

· 12 min read
Tian Pan
Software Engineer

Most teams add human-in-the-loop review as an afterthought: the agent finishes its chain of work, the result lands in a review queue, and a human clicks approve or reject. This feels like safety. It is mostly theater.

By the time a multi-step agent reaches end-of-chain review, it has already sent the API requests, mutated the database rows, drafted the customer email, and scheduled the follow-up. The "review" is approving a done deal. Declining it means explaining to the agent — and often to the user — why nothing that happened for the past 10 minutes will stick.

The damage from misplaced approval gates isn't always dramatic. Often it's subtler: reviewers who approve everything because the real decisions have already been made, engineers who add more checkpoints after incidents and watch trust in the product crater, and organizations that oscillate between "too much friction" and "not enough oversight" without ever solving the underlying placement problem.

The Insider Threat You Created When You Deployed Enterprise AI

· 10 min read
Tian Pan
Software Engineer

Most enterprise security teams have a reasonably well-developed model for insider threats: a disgruntled employee downloads files to a USB drive, emails a spreadsheet to a personal account, or walks out with credentials. The detection playbook is known — DLP rules, egress monitoring, UEBA baselines. What those playbooks don't account for is the scenario where you handed every one of your employees a tool that can plan, execute, and cover multi-stage operations at machine speed. That's what deploying AI coding assistants and RAG-based document agents actually does.

The problem isn't that these tools are insecure in isolation. It's that they dramatically amplify what a compromised or malicious insider can accomplish in a single session. The average cost of an insider incident has reached $17.4 million per organization annually, and 83% of organizations experienced at least one insider attack in the past year. AI tools don't introduce a new threat category — they multiply the capability of every threat category that already exists.

The Instruction Complexity Cliff: Why LLMs Follow 5 Rules Reliably but Not 15

· 10 min read
Tian Pan
Software Engineer

There's a pattern that shows up in almost every production AI system: the team starts with a focused system prompt, ships the feature, and then iterates. A new edge case surfaces, so they add a rule. Another ticket comes in, another rule. Six months later the system prompt has grown to 2,000 tokens and covers 20 distinct behavioral requirements. The AI still sounds coherent on most requests. But subtle compliance failures have been creeping in for weeks — formatting ignored here, a tone requirement skipped there, an escalation rule quietly bypassed. Nobody flagged it because no individual failure was dramatic enough to page anyone.

This isn't a model quality problem. It's a fundamental architectural characteristic of how transformer-based language models process instructions, and there's a substantial body of empirical research that makes the failure modes predictable. Understanding it changes how you should write system prompts.

The Jagged Frontier: Why AI Fails at Easy Things and What It Means for Your Product

· 10 min read
Tian Pan
Software Engineer

A common assumption in AI product development goes something like this: if a model can handle a hard task, it can definitely handle an easier one nearby. This assumption is wrong, and it's responsible for a category of production failures that no amount of benchmark reading prepares you for.

The research term for the underlying phenomenon is the "jagged frontier" — AI's capability boundary isn't a smooth line that hard tasks sit outside of and easy tasks sit inside. It's a ragged, unpredictable shape. AI systems can write production-grade database query optimizers and still miscalculate whether two line segments on a diagram intersect. They can pass PhD-level science exams and fail children's riddle questions that involve spatial relationships. They can synthesize 50-page documents and then confidently hallucinate a summary of a paragraph they just read.