Production AI Institute — vendor-neutral certification for AI practitioners

PSF Domain 8: Vendor Resilience

The AI vendor landscape will look substantially different in two years. Models deprecate. Providers get acquired. Startups shut down. Framework maintainers burn out. This guide maps the lock-in taxonomy, the deprecation response playbook, and the multi-vendor architecture patterns that keep production AI recoverable when a dependency fails.

14 min read · Updated April 2026 · PSF Domain 8

This is the final deep dive in the PSF D1–D8 series. With D8, the complete Production Safety Framework implementation guide is now available across all eight domains.

The D8 gap: vendor resilience requires deliberate architectural and contractual decisions that frameworks cannot make on your behalf. No framework can prevent model deprecation, provider outage, or vendor acquisition. The controls are entirely in your design and operating procedures.

Why Vendor Resilience Fails Silently

D8 failures are slow-motion compared to the other PSF domains. A prompt injection attack happens in milliseconds. A vendor deprecation unfolds over months. This is why D8 gets deprioritised — it rarely triggers an incident until it is already a crisis.

The typical D8 failure pattern: a team builds a production AI system tightly coupled to a specific model and framework. The model performs well. Eighteen months later, the provider announces deprecation with a six-month window. The prompt tuning accumulated over eighteen months does not transfer to the replacement model. The framework has accrued unmaintained dependencies. Six months is not enough time to replace both the model and the framework. Production is disrupted.

The preventive work takes hours when done proactively. It takes weeks under deadline pressure, with production at risk, when done reactively.

Lock-In Risk Taxonomy

AI vendor lock-in operates across five distinct layers. Most teams are aware of one or two. The full taxonomy determines the true migration cost.

Model API Lock-in (High risk)

Prompts tuned for GPT-4 do not transfer to Claude or Gemini without significant rework. Model-specific behaviours — context window handling, instruction following, function calling syntax — create deep coupling.

Mitigation:

Design prompts against a common instruction format. Maintain eval suites that run across multiple providers. Abstract model calls behind a unified interface.
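Abstracting model calls behind a unified interface can be as small as a registry of thin provider adapters. A minimal sketch, with hypothetical provider names and stub adapters standing in for real vendor SDK calls:

```python
from typing import Callable, Dict

# Every adapter maps a plain (system, user) prompt pair to completion text.
ProviderFn = Callable[[str, str], str]

_PROVIDERS: Dict[str, ProviderFn] = {}

def register_provider(name: str, fn: ProviderFn) -> None:
    """Register one thin adapter per vendor."""
    _PROVIDERS[name] = fn

def complete(provider: str, system: str, user: str) -> str:
    """Application code calls this and never imports a vendor SDK directly."""
    return _PROVIDERS[provider](system, user)

# Stub adapters; in production each would wrap a vendor SDK call.
register_provider("vendor_a", lambda system, user: f"[vendor_a] {user}")
register_provider("vendor_b", lambda system, user: f"[vendor_b] {user}")
```

Switching providers then means registering a new adapter, not rewriting call sites.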

Framework Lock-in (Medium risk)

Agent logic written in LangChain does not run in AutoGen. Framework-specific abstractions (chains, runnables, crews, panels) are not portable. A framework deprecation or acquisition can orphan production systems.

Mitigation:

Separate business logic from orchestration logic. Keep framework-specific code in thin adapters. Evaluate the maintenance trajectory of any framework before committing to it for production.
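The business-logic/orchestration split can be sketched as two functions: one pure and framework-free, one thin adapter that is the only code a framework migration would touch. The ticket-triage example is hypothetical:

```python
# Pure business logic: framework-free and unit-testable.
def triage_ticket(summary: str, model_reply: str) -> dict:
    """Survives any framework migration unchanged."""
    priority = "high" if "outage" in model_reply.lower() else "normal"
    return {"summary": summary, "priority": priority}

# Thin adapter: call_model stands in for a LangChain/AutoGen invocation.
def triage_via_framework(summary: str, call_model) -> dict:
    """Swapping frameworks means rewriting only this function."""
    reply = call_model(f"Classify severity: {summary}")
    return triage_ticket(summary, reply)
```

The smaller the adapter, the smaller the blast radius when a framework is deprecated or acquired.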

Vector Database Lock-in (Medium risk)

Embeddings generated by OpenAI text-embedding-3-large are not interchangeable with Cohere embed-v3. Switching embedding providers requires re-indexing the entire corpus. Vector DB proprietary features (hybrid search implementations, metadata filtering syntax) add further coupling.

Mitigation:

Track embedding model as a versioned dependency. Design retrieval logic against standard interfaces. Periodically benchmark alternative embedding providers against your actual retrieval corpus.
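Tracking the embedding model as a versioned dependency can be as simple as a manifest stored alongside the index, checked before every query. A sketch (the manifest structure is hypothetical):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IndexManifest:
    """Metadata persisted next to the vector index it describes."""
    embedding_model: str   # e.g. "text-embedding-3-large"
    embedding_dim: int
    corpus_version: str

def query_compatible(manifest: IndexManifest, query_model: str, query_dim: int) -> bool:
    """Refuse to mix query embeddings with an index built by a different model."""
    return (manifest.embedding_model == query_model
            and manifest.embedding_dim == query_dim)
```

A failed check is the signal that a provider switch requires re-indexing, caught at deploy time rather than as silent retrieval degradation.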

Managed Service Lock-in (Medium risk)

Azure AI Studio, AWS Bedrock, and Google Vertex AI provide managed AI infrastructure. Their proprietary features — guardrails, fine-tuning pipelines, deployment managers — create coupling that makes migration expensive even if the underlying model is vendor-neutral.

Mitigation:

Evaluate which managed service features are load-bearing vs. convenience. Prefer open standards where available. Maintain a periodic migration feasibility assessment.

Data Lock-in (High risk)

Training data uploaded for fine-tuning, conversations stored in a provider's platform, and evaluation datasets stored in managed tooling can be difficult or impossible to export. This creates dependency even when the model is replaceable.

Mitigation:

Maintain independent copies of all training data, evaluation datasets, and conversation data. Contract for explicit data portability rights. Export and verify recoverability on a quarterly basis.
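The quarterly export-and-verify step can be automated: serialise to a provider-neutral format, re-parse the export, and record a checksum with the export date. A sketch under the assumption of JSONL as the neutral format:

```python
import hashlib
import json

def export_dataset(records: list) -> str:
    """Serialise records to a provider-neutral format (JSONL)."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)

def verify_export(blob: str, expected_count: int) -> str:
    """Re-parse the export and checksum it; store the digest and date
    so the next quarterly check has something to compare against."""
    rows = [json.loads(line) for line in blob.splitlines()]
    if len(rows) != expected_count:
        raise ValueError("export is incomplete")
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
```

An export that has never been re-parsed is not a verified export; this is the difference between having a backup and having recoverability.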

Model Deprecation Response Playbook

Model deprecation is a predictable event that can be planned for. Every major provider has deprecated models: GPT-4 variants, Claude 2, Anthropic's older Instant models, Google's PaLM family. The providers that handled deprecation best gave 12+ months notice. The ones that handled it worst gave 30-60 days. Assume the worst and plan for it.

1
Monitor

Track provider model lifecycle communications. Subscribe to provider changelogs, status pages, and developer email lists. Set a 6-month deprecation lead time as a minimum trigger for planning.
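The 6-month trigger is easy to encode so the monitoring step produces an actionable state rather than a date to remember. A minimal sketch (status names are illustrative):

```python
from datetime import date

PLANNING_TRIGGER_DAYS = 180  # the 6-month minimum lead time above

def deprecation_status(today: date, end_of_life: date) -> str:
    """Classify a tracked model's lifecycle position."""
    days_left = (end_of_life - today).days
    if days_left < 0:
        return "deprecated"
    if days_left <= PLANNING_TRIGGER_DAYS:
        return "plan-migration-now"
    return "monitor"
```

Run this over the model registry on a schedule and alert on any transition into "plan-migration-now".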

2
Benchmark

When a deprecation notice arrives, run the replacement candidate(s) through your existing eval suite immediately. Identify performance gaps. Prioritise prompt adaptation work by severity of regression.

3
Adapt

Adapt prompts for the replacement model. Run parallel evaluation: old model vs. new model on production sample inputs. Document regressions and compensating prompt changes.

4
Canary

Canary the replacement model at 5-10% of production traffic. Compare output quality metrics, latency, cost, and validation failure rates to the deprecated model. Run for minimum 48 hours.
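One way to implement the 5-10% split is deterministic bucketing by request or user id, so a given caller hits the same model for the whole canary window and comparisons stay stable. A sketch (routing labels are illustrative):

```python
import hashlib

def route_model(request_id: str, canary_fraction: float = 0.05) -> str:
    """Deterministically bucket requests into 100 buckets by hashed id;
    the lowest buckets go to the replacement model."""
    bucket = int(hashlib.sha256(request_id.encode("utf-8")).hexdigest(), 16) % 100
    return "replacement" if bucket < canary_fraction * 100 else "deprecated"
```

Deterministic routing also makes canary incidents reproducible: the same id always routes the same way at a given fraction.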

5
Promote

Promote the replacement model to full traffic before the deprecation date — never on the last day. Update model registry, documentation, and monitoring baselines.

Multi-Vendor Architecture Patterns

Multi-vendor resilience does not require running multiple models simultaneously at all times. It requires that switching is achievable within an acceptable time window without architectural rework.

Pattern 1: Abstraction Layer

All model API calls go through a unified interface that can be re-implemented for a different provider without changing application code. LiteLLM is the most widely used open-source implementation — it provides a unified interface to 100+ model providers with the OpenAI API shape. The abstraction layer should also handle retry logic, rate limit backoff, and provider failover.
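The retry and backoff responsibilities of the abstraction layer can be sketched as a small wrapper. `RateLimited` is a hypothetical exception; in practice you would map your SDK's own rate-limit error onto it:

```python
import time

class RateLimited(Exception):
    """Hypothetical stand-in for a provider SDK's rate-limit error."""

def call_with_backoff(fn, max_retries: int = 4, base_delay: float = 0.5,
                      sleep=time.sleep):
    """Retry with exponential backoff; the injectable sleep makes it testable."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimited:
            if attempt == max_retries - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

Keeping this in the abstraction layer means no application code ever reimplements retry policy per vendor.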

Pattern 2: Provider Fallback

For latency-sensitive or availability-critical workloads, configure a secondary model provider that activates automatically when the primary is unavailable or rate-limited. This requires pre-validation that the fallback model produces acceptable output quality for the use case — you cannot discover the quality gap during an incident.
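The fallback pattern itself is a few lines; the hard part is the pre-validation the paragraph describes. A sketch, with `ProviderDown` as a hypothetical stand-in for timeouts, 5xx responses, or hard rate limits:

```python
class ProviderDown(Exception):
    """Hypothetical stand-in for timeout / 5xx / hard rate-limit errors."""

def complete_with_fallback(prompt: str, primary, secondary) -> str:
    """Fail over to the pre-validated secondary provider. The secondary's
    output quality must be validated before, never during, an incident."""
    try:
        return primary(prompt)
    except ProviderDown:
        return secondary(prompt)
```

In production this belongs inside the abstraction layer from Pattern 1, with the failover event logged and alerted on.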

Pattern 3: Eval-Driven Migration

Maintain a portable evaluation suite that runs against any model and produces comparable metrics. When migration is needed, run the suite against candidates, identify the best replacement, and document the adaptation work required. This converts "can we migrate" from an unknown to a measured cost. Teams that have done this once can execute a model migration in days rather than weeks.
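A portable suite only needs two ingredients: cases that pair inputs with checkers, and a runner that accepts any provider wrapped as text in, text out. A minimal sketch (the cases are illustrative; real suites use production-derived inputs):

```python
def run_suite(model_fn, cases) -> float:
    """Run the same cases against any provider wrapped as prompt -> text.
    Returns the pass rate, comparable across providers."""
    passed = sum(1 for prompt, check in cases if check(model_fn(prompt)))
    return passed / len(cases)

CASES = [
    ("What is 2+2?", lambda out: "4" in out),
    ("Name the capital of France.", lambda out: "paris" in out.lower()),
]
```

Because `model_fn` is just a callable, the same suite scores the incumbent, every migration candidate, and every canary with identical metrics.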

Pattern 4: Prompt Portability

Write prompts against the smallest common denominator of instruction-following capability. Avoid model-specific tricks, special tokens, or undocumented behaviours. The upside of a model-specific optimisation is usually marginal. The cost when that optimisation becomes a migration blocker is substantial.

SLA Benchmarking

Vendor SLAs for AI services are typically worse than comparable SLAs for databases, compute, or storage. This is partly because AI inference involves more failure modes and partly because the market is still maturing. Before committing a workload to a provider, benchmark their SLA against your reliability requirements.

SLA Dimension | Minimum Acceptable | Recommended for Critical Path
Uptime SLA | 99.9% (8.7h downtime/yr) | 99.95%+ for critical path
Response latency p99 | < 5s for synchronous user-facing | < 30s for background / async
Rate limits | Document limits, implement backoff | Negotiate higher limits before launch
Deprecation notice period | Minimum 6 months | Prefer 12+ months for mission-critical
Data portability | All training/eval data exportable | Contractual right to export on termination
Sub-processor transparency | Published list of sub-processors | Notification of sub-processor changes
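The downtime figures in the table follow from simple arithmetic; a small helper makes it easy to translate any quoted SLA percentage into an annual downtime budget:

```python
HOURS_PER_YEAR = 365 * 24  # 8760; averaging in leap years adds ~6h

def max_downtime_hours(uptime_pct: float) -> float:
    """Annual downtime budget implied by an uptime percentage."""
    return HOURS_PER_YEAR * (1 - uptime_pct / 100)

# 99.9%  -> ~8.76 h/yr (the table's "8.7h")
# 99.95% -> ~4.38 h/yr
```

Comparing this number against your own reliability requirement is the quickest first filter on a vendor SLA.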

Exit Strategy Documentation

An exit strategy is not a plan you execute — it is a document you keep current so you know whether migration is feasible before you need to decide. A useful exit strategy document covers: the current vendor dependency map (all providers, services, and data locations), the estimated migration effort by dependency type, the identified replacement candidates for each critical dependency, and the data export procedure and last-verified export date.


Framework D8 Status

Most frameworks are Partial on D8 — they provide model abstraction layers but no operational resilience tooling or lifecycle management.

LangChain / LangGraph
Partial

The LangChain abstraction layer theoretically supports multiple model providers. In practice, model-specific prompt tuning and output parser coupling reduce portability. No deprecation tooling provided.

CrewAI
Gap

Agent definitions include model configuration but no multi-model failover. No deprecation management. Heavy coupling to LangChain internals in some versions adds framework-within-framework lock-in risk.

AutoGen
Partial

Model configuration is per-agent and supports multiple backends. The abstraction is cleaner than LangChain for multi-vendor use. Still no deprecation lifecycle management.

Semantic Kernel
Partial

Strong multi-model support by design — the kernel can be configured with multiple AI services. Azure-backed deployments benefit from Azure's managed model lifecycle. No deprecation tooling.

Pydantic AI
Partial

Model abstraction is clean and swappable per-agent. The lightweight design means vendor-specific coupling is minimal. No operational resilience tooling — that remains the operator's responsibility.

Haystack
Partial

Pipeline architecture supports model component substitution. Good multi-provider support. Haystack Cloud adds managed deployment but introduces managed service dependency.

The Complete PSF Deep Dive Series

With D8, every PSF domain now has a complete implementation guide. The full series:

D1 Input Governance · D2 Output Validation · D3 Data Protection · D4 Observability · D5 Deployment Safety · D6 Human Oversight · D7 Security · D8 Vendor Resilience

Related Guides

PSF D5: Deployment Safety — Model Versioning and Rollback
Vendor resilience depends on clean deployment practices — these are complementary
PSF-Compliant Stack Recipes
Multi-vendor stack patterns that satisfy D8 requirements
Pinecone vs Weaviate vs Chroma — Vector DB Comparison
Evaluating vector DB alternatives before you need to migrate
LangSmith vs Langfuse vs Arize Phoenix
Maintaining portability in observability tooling
Legal & Government AI Deployment Playbook
The D8 requirements in the highest-accountability deployment context