
The Model EOL Clock: Treating Provider LLMs as External Dependencies

11 min read
Tian Pan
Software Engineer

In January 2026, OpenAI retired several GPT models from ChatGPT with two weeks' notice — weeks after its CEO had publicly promised "plenty of notice" following an earlier backlash. For teams that had built workflows around those models, the announcement arrived like a pager alert on a Friday afternoon. The API remained unaffected that time. But it won't always.

Every model you're currently calling has a deprecation date. Some of those dates are already listed on your provider's documentation page. Others haven't been announced yet. The operational question isn't whether your production model will be retired — it's whether you'll find out in time to handle it gracefully, or scramble to migrate after users start seeing failures.

This is a solved problem in software engineering. Libraries have EOL dates. Operating systems have support windows. The industry built tooling, processes, and cultural norms around dependency lifecycle management decades ago. AI engineering hasn't caught up yet, but the underlying pattern is identical: treat the model version as an external dependency you don't control, and build accordingly.

The Quarterly Deprecation Treadmill

Provider model deprecation now runs on a roughly quarterly cadence. Anthropic retired 5–6 distinct model versions in 2024 alone, including Claude 1.x, Claude Instant 1.x, and several Sonnet snapshots. OpenAI's deprecations page lists dozens of retired models since 2023, with notice periods ranging from 14 days (for ChatGPT-only models) to about 12 months for flagship API models. Google Vertex AI targets 6 months for standard notice, but has given as little as one month for third-party models hosted on the platform.

The typical lifespan of a major frontier model is 12–18 months before retirement. If your application uses more than one LLM-powered feature, you should expect at least one active deprecation migration at any given time. At three features, it's effectively continuous.

In practice, the notice periods across providers come out to:

  • OpenAI (API, GA models): 60 days minimum notice; in practice, major models get 6–12 months.
  • Anthropic: 60–90 days in practice; some major models received 6 months.
  • Azure OpenAI: 60 days minimum for GA models; 30 days for preview.
  • Google Vertex AI: 6 months standard; preview models shorter.
  • ChatGPT UI (not API): As low as 14 days.

Notice that none of these are long enough for an unprepared engineering team to do a quality migration under normal sprint velocity. The teams that handle deprecations without incident built their migration infrastructure before they needed it.

The "Frozen Version" Illusion

Before getting to the runbook, it's worth confronting a dangerous assumption: that pinning to a dated snapshot — gpt-4o-2024-08-06, claude-3-5-sonnet-20241022 — gives you a stable behavioral guarantee.

It doesn't, at least not always.

In 2023, Stanford and UC Berkeley researchers documented that GPT-4's accuracy on prime number identification dropped from 84% to 51% between March and June of that year — a 33-point collapse — with no API version change announced. The same model identifier, different behavior. Chain-of-thought prompting compliance changed. Code generation errors increased. OpenAI initially maintained nothing had changed.

In early 2025, developers documented gpt-4o-2024-08-06 changing behavior: JSON parsing failures and classifier breakage, neither of which threw API errors. The application appeared to work; it was silently wrong.

Behavioral drift without a version bump is rare, but it does happen. The implication is that your regression suite needs to run continuously against the live endpoint — not just at migration time — to catch silent changes. This is exactly how you'd instrument a third-party API you don't control.
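
A minimal sketch of that kind of scheduled check, assuming a small file of golden prompt/expectation pairs and the openai Python client; the model ID, file format, and pass threshold here are illustrative assumptions, not a standard:

```python
# drift_check.py
# Scheduled job (cron / CI) that replays golden examples against the live,
# pinned endpoint and alerts on regressions.
import json

from openai import OpenAI

MODEL = "gpt-4o-2024-08-06"   # pinned snapshot, never an alias
PASS_THRESHOLD = 0.95         # alert if more than 5% of golden cases regress

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def run_drift_check(path: str = "golden_examples.json") -> float:
    """Return the fraction of golden examples that still pass."""
    with open(path) as f:
        examples = json.load(f)   # [{"prompt": ..., "must_contain": ...}, ...]
    passed = 0
    for ex in examples:
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": ex["prompt"]}],
            temperature=0,   # as deterministic as the endpoint allows
        )
        output = resp.choices[0].message.content or ""
        # Cheap substring assertion; swap in exact-match, JSON-schema, or
        # LLM-judge scoring depending on what the feature actually needs.
        if ex["must_contain"].lower() in output.lower():
            passed += 1
    return passed / len(examples)

if __name__ == "__main__":
    score = run_drift_check()
    print(f"live endpoint pass rate: {score:.0%}")
    if score < PASS_THRESHOLD:
        # Hook this into whatever pages you: Slack webhook, PagerDuty, etc.
        raise SystemExit(f"possible behavioral drift: pass rate {score:.0%}")
```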

The Model Inventory: Before You Can Manage It, You Have to See It

The foundation of any deprecation management system is a current inventory of every model your system calls. This should be a first-class artifact, not reconstructed by grepping the codebase when a deprecation notice arrives.

Each entry needs:

  • Exact model identifier (the pinned snapshot version, not an alias)
  • Which services and features consume it
  • The announced EOL date (check the provider's deprecations page)
  • The recommended replacement

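Concretely, the inventory can be a small checked-in module (or YAML file) plus a script that warns when an EOL date is approaching. The entries, dates, and ninety-day warning window in this sketch are illustrative:

```python
# model_inventory.py
# Checked into the repo and reviewed whenever a new model call is added.
from dataclasses import dataclass
from datetime import date

@dataclass
class ModelEntry:
    model_id: str               # exact pinned snapshot, never an alias
    consumers: list[str]        # services/features that call it
    eol_date: date | None       # announced retirement date, if any
    replacement: str | None     # provider-recommended successor, if any

INVENTORY = [
    ModelEntry(
        model_id="gpt-4o-2024-08-06",
        consumers=["ticket-summarizer", "search-reranker"],
        eol_date=date(2026, 8, 6),           # hypothetical date
        replacement="gpt-4.1-2025-04-14",
    ),
    ModelEntry(
        model_id="claude-3-5-sonnet-20241022",
        consumers=["support-chat"],
        eol_date=None,                        # not yet announced
        replacement=None,
    ),
]

def models_needing_migration(warn_days: int = 90) -> list[ModelEntry]:
    """Entries whose announced EOL falls inside the warning window."""
    today = date.today()
    return [
        e for e in INVENTORY
        if e.eol_date is not None and (e.eol_date - today).days <= warn_days
    ]

if __name__ == "__main__":
    for entry in models_needing_migration():
        print(f"MIGRATE SOON: {entry.model_id} -> {entry.replacement} "
              f"(used by {', '.join(entry.consumers)})")
```
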
The llm-model-deprecation Python library provides a scan command that walks your codebase looking for hardcoded model strings and flags anything against its weekly-refreshed deprecation registry. Running this in CI ensures deprecated model names don't survive code review. deprecations.info provides RSS/JSON feeds aggregating retirement announcements across OpenAI, Anthropic, Google AI, Vertex AI, AWS Bedrock, and others — wire this into a Slack channel and you get early warning before the email lands.
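
That library's actual interface isn't reproduced here; as a rough stand-in for what such a CI gate does, a hand-rolled version might scan source files against a deny-list (the deny-list entries, pattern, and file glob below are illustrative):

```python
# ci_model_scan.py
# Fails the build if a known-deprecated model ID appears anywhere in source.
# A generic stand-in, not the llm-model-deprecation library's interface.
import pathlib
import re
import sys

DEPRECATED_IDS = {
    "claude-instant-1.2",
    "gpt-3.5-turbo-0301",
}

MODEL_STRING = re.compile(r"""["']((?:gpt|claude|gemini)[\w.\-]+)["']""")

def scan(root: str = "src") -> list[tuple[str, str]]:
    hits = []
    for path in pathlib.Path(root).rglob("*.py"):
        text = path.read_text(errors="ignore")
        for match in MODEL_STRING.finditer(text):
            if match.group(1) in DEPRECATED_IDS:
                hits.append((str(path), match.group(1)))
    return hits

if __name__ == "__main__":
    findings = scan()
    for path, model_id in findings:
        print(f"{path}: references deprecated model '{model_id}'")
    sys.exit(1 if findings else 0)
```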

One operational detail: never use model alias versions in production. gpt-4.1 (no date suffix) silently resolves to whatever OpenAI designates as latest. gemini-1.5-pro (no version suffix) started serving gemini-1.5-pro-002 traffic the day that version launched. Aliases are your provider's equivalent of npm's ^ operator — you're accepting automatic upgrades you haven't tested. Pin to the date-stamped snapshot.
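
Both strings are accepted by the same parameter, which is exactly why the discipline matters. A small sketch with the openai client (the snapshot name is shown for illustration):

```python
from openai import OpenAI

client = OpenAI()

ALIAS = "gpt-4.1"               # floats to whatever the provider calls "latest"
PINNED = "gpt-4.1-2025-04-14"   # date-stamped snapshot (illustrative)

resp = client.chat.completions.create(
    model=PINNED,   # production config should only ever hold the snapshot
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```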

What a Behavioral Regression Suite Actually Looks Like

When a deprecation notice arrives, the migration decision isn't "does the new model score higher on MMLU?" It's "does the new model produce acceptable outputs for the exact tasks my application performs?" Those are different questions with different answers, and only the second is one you can answer yourself.

The benchmark trap is real. Teams that trusted provider-supplied benchmark scores on new models found that the older model significantly outperformed the replacement on their specific tasks — customer feedback summarization, domain-specific classification, instruction-following chains with 10+ steps. Benchmarks measure what the benchmark measures. Your regression suite measures what your application does.
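
In practice that means running the same golden set against both the incumbent and the candidate model and gating the migration on the delta. A sketch along the same lines as the drift check above; the model IDs, file format, and acceptance bar are illustrative:

```python
# migration_compare.py
# Same golden examples, two models, side-by-side pass rates.
import json

from openai import OpenAI

client = OpenAI()

def pass_rate(model: str, path: str = "golden_examples.json") -> float:
    with open(path) as f:
        examples = json.load(f)   # [{"prompt": ..., "must_contain": ...}, ...]
    passed = 0
    for ex in examples:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": ex["prompt"]}],
            temperature=0,
        )
        output = resp.choices[0].message.content or ""
        if ex["must_contain"].lower() in output.lower():
            passed += 1
    return passed / len(examples)

if __name__ == "__main__":
    incumbent = pass_rate("gpt-4o-2024-08-06")     # model slated for retirement
    candidate = pass_rate("gpt-4.1-2025-04-14")    # recommended replacement
    print(f"incumbent: {incumbent:.0%}   candidate: {candidate:.0%}")
    # Decide on your own tasks, not on the provider's benchmark delta.
    if candidate < incumbent - 0.02:
        raise SystemExit("candidate regresses on the golden set; hold the migration")
```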

The principle that emerges from documented migration experiences is that 50–100 carefully chosen golden examples outperform thousands of synthetic ones. These golden examples should come from:
