PSF Domain 8: Vendor Resilience
The AI vendor landscape will look substantially different in two years. Models deprecate. Providers get acquired. Startups shut down. Framework maintainers burn out. This guide maps the lock-in taxonomy, the deprecation response playbook, and the multi-vendor architecture patterns that keep production AI recoverable when a dependency fails.
This is the final deep dive in the PSF D1–D8 series. With D8, the complete Production Safety Framework implementation guide is now available across all eight domains.
The D8 gap: vendor resilience requires deliberate architectural and contractual decisions that frameworks cannot make on your behalf. No framework can prevent model deprecation, provider outage, or vendor acquisition. The controls are entirely in your design and operating procedures.
Why Vendor Resilience Fails Silently
D8 failures are slow-motion compared to the other PSF domains. A prompt injection attack happens in milliseconds. A vendor deprecation unfolds over months. This is why D8 gets deprioritised — it rarely triggers an incident until it is already a crisis.
The typical D8 failure pattern: a team builds a production AI system tightly coupled to a specific model and framework. The model performs well. Eighteen months later, the provider announces deprecation with a six-month window. The prompt tuning accumulated over eighteen months does not transfer to the replacement model. The framework has accrued unmaintained dependencies. Six months is not enough time to replace both the model and the framework. Production is disrupted.
The preventive work takes hours when done proactively. It takes weeks under deadline pressure, with production at risk, when done reactively.
Lock-In Risk Taxonomy
AI vendor lock-in operates across five distinct layers. Most teams are aware of one or two. The full taxonomy determines the true migration cost.
Layer 1: Model and Prompt Lock-In
Prompts tuned for GPT-4 do not transfer to Claude or Gemini without significant rework. Model-specific behaviours — context window handling, instruction following, function calling syntax — create deep coupling.
- Prompts reference model-specific features or behaviour
- Evaluation results are only available for one provider
- No abstraction layer between application code and model API
Design prompts against a common instruction format. Maintain eval suites that run across multiple providers. Abstract model calls behind a unified interface.
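A minimal sketch of that unified interface in Python, assuming the openai>=1.x SDK call shape for the adapter; the class names and model identifier are illustrative:

```python
from typing import Protocol

class ChatModel(Protocol):
    """Provider-agnostic interface; application code depends only on this."""
    def complete(self, system: str, user: str) -> str: ...

class OpenAIChatModel:
    """Thin adapter: the only place provider-specific request shapes live."""
    def __init__(self, client, model: str = "gpt-4o"):
        self._client = client      # an openai.OpenAI() client
        self._model = model

    def complete(self, system: str, user: str) -> str:
        resp = self._client.chat.completions.create(
            model=self._model,
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": user},
            ],
        )
        return resp.choices[0].message.content
```

Application code calls `ChatModel.complete()`; switching providers means writing one more thin adapter, not touching call sites.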
Layer 2: Framework Lock-In
Agent logic written in LangChain does not run in AutoGen. Framework-specific abstractions (chains, runnables, crews, panels) are not portable. A framework deprecation or acquisition can orphan production systems.
- Agent logic is tightly coupled to framework primitives
- No framework-agnostic business logic layer
- Framework version pinned and unmaintained
Separate business logic from orchestration logic. Keep framework-specific code in thin adapters. Evaluate the maintenance trajectory of any framework before committing to it for production.
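One way to keep that separation, sketched under the assumption of LangChain's `invoke()`/`.content` call shape; the function names are illustrative:

```python
# Business logic layer: framework-agnostic, testable with any prompt -> text callable.
def triage_ticket(summary: str, model_complete) -> str:
    """Route a support ticket to a queue; `model_complete` is any str -> str model call."""
    prompt = (
        "Classify this support ticket as billing, technical, or account. "
        "Reply with one word.\n\n" + summary
    )
    return model_complete(prompt).strip().lower()

# Thin adapter layer: the only module that imports the framework.
def langchain_adapter(llm):
    """Wrap a LangChain chat model behind the plain prompt -> text contract."""
    return lambda prompt: llm.invoke(prompt).content
```

If the framework is deprecated, only the adapter module changes; the business logic and its tests survive the migration untouched.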
Layer 3: Embedding and Vector Store Lock-In
Embeddings generated by OpenAI text-embedding-3-large are not interchangeable with Cohere embed-v3. Switching embedding providers requires re-indexing the entire corpus. Vector DB proprietary features (hybrid search implementations, metadata filtering syntax) add further coupling.
- No embedding model abstraction layer
- Proprietary vector DB features used in production queries
- Re-embedding cost and time not documented
Track embedding model as a versioned dependency. Design retrieval logic against standard interfaces. Periodically benchmark alternative embedding providers against your actual retrieval corpus.
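A sketch of treating the embedding model as a versioned dependency; the fields and guard are illustrative rather than any specific vector DB's API:

```python
from dataclasses import dataclass

# The index records which embedding model built it, so a provider switch forces
# a re-index instead of silently mixing incompatible vector spaces.
@dataclass(frozen=True)
class EmbeddingConfig:
    provider: str        # e.g. "openai" or "cohere"
    model: str           # e.g. "text-embedding-3-large"
    dimensions: int
    index_version: str   # bump whenever the embedding model changes

def guard_index_version(index_metadata: dict, config: EmbeddingConfig) -> None:
    """Refuse to query an index built with a different embedding model."""
    if index_metadata.get("index_version") != config.index_version:
        raise RuntimeError(
            f"Index was built with {index_metadata.get('index_version')!r}, "
            f"application expects {config.index_version!r}; re-embed before querying."
        )
```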
Layer 4: Cloud Platform Lock-In
Azure AI Studio, AWS Bedrock, and Google Vertex AI provide managed AI infrastructure. Their proprietary features — guardrails, fine-tuning pipelines, deployment managers — create coupling that makes migration expensive even if the underlying model is vendor-neutral.
- Cloud-provider-specific features used in production inference path
- No documented migration path for managed services
- Observability data locked in provider-specific tooling
Evaluate which managed service features are load-bearing vs. convenience. Prefer open standards where available. Maintain a periodic migration feasibility assessment.
Layer 5: Data Lock-In
Training data uploaded for fine-tuning, conversations stored in a provider's platform, and evaluation datasets stored in managed tooling can be difficult or impossible to export. This creates dependency even when the model is replaceable.
- Fine-tuning datasets stored exclusively in provider platform
- Conversation history or evaluation data not independently exported
- Provider contract does not guarantee data portability
Maintain independent copies of all training data, evaluation datasets, and conversation data. Contract for explicit data portability rights. Export and verify recoverability on a quarterly basis.
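A hypothetical freshness check for that quarterly export routine, assuming a manifest file that records a timezone-aware ISO timestamp per dataset:

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

# Policy window: fail CI (or page someone) if a provider-held dataset has not
# had a verified export within roughly one quarter.
MAX_EXPORT_AGE = timedelta(days=92)

def stale_exports(manifest_path: Path) -> list[str]:
    """Return the datasets whose last verified export is older than the policy window."""
    manifest = json.loads(manifest_path.read_text())
    now = datetime.now(timezone.utc)
    return [
        entry["name"]
        for entry in manifest["datasets"]
        if now - datetime.fromisoformat(entry["last_verified_export"]) > MAX_EXPORT_AGE
    ]
```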
Model Deprecation Response Playbook
Model deprecation is a predictable event that can be planned for. Every major provider has deprecated models: GPT-4 variants, Claude 2, Anthropic's older Instant models, Google's PaLM family. The providers that handled deprecation best gave 12+ months notice. The ones that handled it worst gave 30-60 days. Assume the worst and plan for it.
Track provider model lifecycle communications. Subscribe to provider changelogs, status pages, and developer email lists. Set a 6-month deprecation lead time as a minimum trigger for planning.
When a deprecation notice arrives, run the replacement candidate(s) through your existing eval suite immediately. Identify performance gaps. Prioritise prompt adaptation work by severity of regression.
Adapt prompts for the replacement model. Run parallel evaluation: old model vs. new model on production sample inputs. Document regressions and compensating prompt changes.
Canary the replacement model at 5-10% of production traffic. Compare output quality metrics, latency, cost, and validation failure rates to the deprecated model. Run for a minimum of 48 hours.
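A minimal sketch of a deterministic canary split; the fraction and model identifiers are illustrative:

```python
import hashlib

# A stable fraction of traffic, keyed on a request or user id, goes to the
# replacement model, so comparisons stay consistent across retries.
CANARY_FRACTION = 0.10

def pick_model(request_id: str, current: str, replacement: str) -> str:
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return replacement if bucket < CANARY_FRACTION * 100 else current
```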
Promote the replacement model to full traffic before the deprecation date — never on the last day. Update model registry, documentation, and monitoring baselines.
Multi-Vendor Architecture Patterns
Multi-vendor resilience does not require running multiple models simultaneously at all times. It requires that switching is achievable within an acceptable time window without architectural rework.
Pattern 1: Abstraction Layer
All model API calls go through a unified interface that can be re-implemented for a different provider without changing application code. LiteLLM is the most widely used open-source implementation — it provides a unified interface to 100+ model providers with the OpenAI API shape. The abstraction layer should also handle retry logic, rate limit backoff, and provider failover.
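A minimal usage sketch, assuming LiteLLM's `completion()` call; the model identifiers are illustrative, and provider prefixes should be checked against LiteLLM's documentation:

```python
from litellm import completion

def ask(model: str, prompt: str) -> str:
    """Same call shape for every provider; only the model string changes."""
    response = completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Switching providers becomes a configuration change, not a code change:
# ask("gpt-4o", "...") vs ask("anthropic/claude-3-5-sonnet-20240620", "...")
```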
Pattern 2: Provider Fallback
For latency-sensitive or availability-critical workloads, configure a secondary model provider that activates automatically when the primary is unavailable or rate-limited. This requires pre-validation that the fallback model produces acceptable output quality for the use case — you cannot discover the quality gap during an incident.
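A sketch of the fallback chain, assuming the provider-agnostic call from Pattern 1; the provider list and broad exception handling are illustrative:

```python
# Both models must already have passed the eval suite for this workload.
PROVIDERS = ["primary-model", "fallback-model"]

def complete_with_fallback(prompt: str, call_model) -> str:
    """`call_model(model, prompt)` is the provider-agnostic call from Pattern 1."""
    last_error = None
    for model in PROVIDERS:
        try:
            return call_model(model, prompt)
        except Exception as exc:  # in production, catch rate-limit/availability errors only
            last_error = exc
    raise RuntimeError("All configured providers failed") from last_error
```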
Pattern 3: Eval-Driven Migration
Maintain a portable evaluation suite that runs against any model and produces comparable metrics. When migration is needed, run the suite against candidates, identify the best replacement, and document the adaptation work required. This converts "can we migrate" from an unknown to a measured cost. Teams that have done this once can execute a model migration in days rather than weeks.
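A minimal sketch of such a portable harness; the cases and the substring check stand in for real, production-derived graders:

```python
EVAL_CASES = [
    {"input": "Customer asks for a refund after 45 days.", "must_contain": "policy"},
    # ... production-derived cases
]

def pass_rate(model: str, call_model) -> float:
    """Return the pass rate for `model`; `call_model(model, prompt)` is provider-agnostic."""
    passed = sum(
        1 for case in EVAL_CASES
        if case["must_contain"].lower() in call_model(model, case["input"]).lower()
    )
    return passed / len(EVAL_CASES)
```

Running the same harness against each candidate turns "which replacement is best" into a comparison of numbers rather than opinions.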
Pattern 4: Prompt Portability
Write prompts against the smallest common denominator of instruction-following capability. Avoid model-specific tricks, special tokens, or undocumented behaviours. The upside of a model-specific optimisation is usually marginal. The cost when that optimisation becomes a migration blocker is substantial.
SLA Benchmarking
Vendor SLAs for AI services are typically worse than comparable SLAs for databases, compute, or storage. This is partly because AI inference involves more failure modes and partly because the market is still maturing. Before committing a workload to a provider, benchmark their SLA against your reliability requirements.
| SLA Dimension | Minimum Acceptable | Recommended for Critical Path |
|---|---|---|
| Uptime SLA | 99.9% (8.7h downtime/yr) | 99.95%+ for critical path |
| Response latency p99 | < 30s for background / async | < 5s for synchronous user-facing |
| Rate limits | Document limits, implement backoff | Negotiate higher limits before launch |
| Deprecation notice period | Minimum 6 months | Prefer 12+ months for mission-critical |
| Data portability | All training/eval data exportable | Contractual right to export on termination |
| Sub-processor transparency | Published list of sub-processors | Notification of sub-processor changes |
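The downtime arithmetic behind the uptime row, for translating an SLA percentage into a yearly budget:

```python
# Convert an uptime SLA percentage into allowed downtime per year.
def downtime_hours_per_year(uptime_percent: float) -> float:
    return (1 - uptime_percent / 100) * 365 * 24

print(downtime_hours_per_year(99.9))   # 8.76 hours
print(downtime_hours_per_year(99.95))  # 4.38 hours
```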
Exit Strategy Documentation
An exit strategy is not a plan you execute — it is a document you keep current so you know whether migration is feasible before you need to decide. The minimum contents are listed below.
Exit strategy document — minimum contents:
- Vendor dependency map: every provider and service with classification (model API, vector DB, managed orchestration, observability, auth)
- Lock-in assessment per dependency: migration effort estimate (hours/days/weeks), identified replacement candidates
- Data inventory: where training data, evaluation datasets, and conversation logs are stored; export procedure; last export date
- Replacement candidate evaluations: eval results for at least one alternative to each critical dependency, with date run
- Contractual exit rights: data portability provisions, termination notice requirements, any exclusivity clauses
- Migration rehearsal: at least one dependency migration (even in a dev environment) to validate that the process is executable
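A hypothetical machine-readable form of the dependency map, which lets the exit strategy document be checked for staleness instead of relying on manual review:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Dependency:
    name: str                          # e.g. "primary chat model"
    category: str                      # model API, vector DB, orchestration, observability, auth
    provider: str
    migration_effort: str              # "hours", "days", or "weeks"
    replacement_candidates: list[str] = field(default_factory=list)
    last_alternative_eval: Optional[str] = None   # ISO date of most recent candidate eval
```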
Framework D8 Status
Most frameworks are Partial on D8 — they provide model abstraction layers but no operational resilience tooling or lifecycle management.
The LangChain abstraction layer theoretically supports multiple model providers. In practice, model-specific prompt tuning and output parser coupling reduce portability. No deprecation tooling provided.
Agent definitions include model configuration but no multi-model failover. No deprecation management. Heavy coupling to LangChain internals in some versions adds framework-within-framework lock-in risk.
Model configuration is per-agent and supports multiple backends. The abstraction is cleaner than LangChain for multi-vendor use. Still no deprecation lifecycle management.
Strong multi-model support by design — the kernel can be configured with multiple AI services. Azure-backed deployments benefit from Azure's managed model lifecycle. No deprecation tooling.
Model abstraction is clean and swappable per-agent. The lightweight design means vendor-specific coupling is minimal. No operational resilience tooling — that remains the operator's responsibility.
Pipeline architecture supports model component substitution. Good multi-provider support. Haystack Cloud adds managed deployment but introduces managed service dependency.
The Complete PSF Deep Dive Series
With D8, every PSF domain now has a complete implementation guide. The full series:
You understand the gaps.
Get the credential that proves it.
The AIDA examination tests applied PSF knowledge across all eight domains — exactly the gaps and strengths covered across this series. 15 minutes. No charge. Ever.