PSF Domain 8: Vendor Resilience
The AI vendor landscape will look substantially different in two years. Models deprecate. Providers get acquired. Startups shut down. Framework maintainers burn out. This guide maps the lock-in taxonomy, the deprecation response playbook, and the multi-vendor architecture patterns that keep production AI recoverable when a dependency fails.
This is the final deep dive in the PSF D1–D8 series. With D8, the complete Production Safety Framework implementation guide is now available across all eight domains.
The D8 gap: vendor resilience requires deliberate architectural and contractual decisions that frameworks cannot make on your behalf. No framework can prevent model deprecation, provider outage, or vendor acquisition. The controls are entirely in your design and operating procedures.
Why Vendor Resilience Fails Silently
D8 failures are slow-motion compared to the other PSF domains. A prompt injection attack happens in milliseconds. A vendor deprecation unfolds over months. This is why D8 gets deprioritised — it rarely triggers an incident until it is already a crisis.
The typical D8 failure pattern: a team builds a production AI system tightly coupled to a specific model and framework. The model performs well. Eighteen months later, the provider announces deprecation with a six-month window. The prompt tuning accumulated over eighteen months does not transfer to the replacement model. The framework has accrued unmaintained dependencies. Six months is not enough time to replace both the model and the framework. Production is disrupted.
The preventive work takes hours when done proactively. It takes weeks under deadline pressure, with production at risk, when done reactively.
Lock-In Risk Taxonomy
AI vendor lock-in operates across five distinct layers. Most teams are aware of one or two. The full taxonomy determines the true migration cost.
Layer 1: Model and Prompt Lock-In
Prompts tuned for GPT-4 do not transfer to Claude or Gemini without significant rework. Model-specific behaviours — context window handling, instruction following, function calling syntax — create deep coupling.
- Prompts reference model-specific features or behaviour
- Evaluation results are only available for one provider
- No abstraction layer between application code and model API
Design prompts against a common instruction format. Maintain eval suites that run across multiple providers. Abstract model calls behind a unified interface.
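A minimal sketch of that unified interface in Python, assuming the openai>=1.x SDK call shape for the adapter; the class names and model identifier are illustrative:

```python
from typing import Protocol

class ChatModel(Protocol):
    """Provider-agnostic interface; application code depends only on this."""
    def complete(self, system: str, user: str) -> str: ...

class OpenAIChatModel:
    """Thin adapter: the only place provider-specific request shapes live."""
    def __init__(self, client, model: str = "gpt-4o"):
        self._client = client      # an openai.OpenAI() client
        self._model = model

    def complete(self, system: str, user: str) -> str:
        resp = self._client.chat.completions.create(
            model=self._model,
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": user},
            ],
        )
        return resp.choices[0].message.content
```

Application code calls `ChatModel.complete()`; switching providers means writing one more thin adapter, not touching call sites.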
Layer 2: Framework Lock-In
Agent logic written in LangChain does not run in AutoGen. Framework-specific abstractions (chains, runnables, crews, panels) are not portable. A framework deprecation or acquisition can orphan production systems.
- Agent logic is tightly coupled to framework primitives
- No framework-agnostic business logic layer
- Framework version pinned and unmaintained
Separate business logic from orchestration logic. Keep framework-specific code in thin adapters. Evaluate the maintenance trajectory of any framework before committing to it for production.
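One way to keep that separation, sketched under the assumption of LangChain's `invoke()`/`.content` call shape; the function names are illustrative:

```python
# Business logic layer: framework-agnostic, testable with any prompt -> text callable.
def triage_ticket(summary: str, model_complete) -> str:
    """Route a support ticket to a queue; `model_complete` is any str -> str model call."""
    prompt = (
        "Classify this support ticket as billing, technical, or account. "
        "Reply with one word.\n\n" + summary
    )
    return model_complete(prompt).strip().lower()

# Thin adapter layer: the only module that imports the framework.
def langchain_adapter(llm):
    """Wrap a LangChain chat model behind the plain prompt -> text contract."""
    return lambda prompt: llm.invoke(prompt).content
```

If the framework is deprecated, only the adapter module changes; the business logic and its tests survive the migration untouched.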
Layer 3: Embedding and Vector Store Lock-In
Embeddings generated by OpenAI text-embedding-3-large are not interchangeable with Cohere embed-v3. Switching embedding providers requires re-indexing the entire corpus. Vector DB proprietary features (hybrid search implementations, metadata filtering syntax) add further coupling.
- No embedding model abstraction layer
- Proprietary vector DB features used in production queries
- Re-embedding cost and time not documented
Track embedding model as a versioned dependency. Design retrieval logic against standard interfaces. Periodically benchmark alternative embedding providers against your actual retrieval corpus.
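A sketch of treating the embedding model as a versioned dependency; the fields and guard are illustrative rather than any specific vector DB's API:

```python
from dataclasses import dataclass

# The index records which embedding model built it, so a provider switch forces
# a re-index instead of silently mixing incompatible vector spaces.
@dataclass(frozen=True)
class EmbeddingConfig:
    provider: str        # e.g. "openai" or "cohere"
    model: str           # e.g. "text-embedding-3-large"
    dimensions: int
    index_version: str   # bump whenever the embedding model changes

def guard_index_version(index_metadata: dict, config: EmbeddingConfig) -> None:
    """Refuse to query an index built with a different embedding model."""
    if index_metadata.get("index_version") != config.index_version:
        raise RuntimeError(
            f"Index was built with {index_metadata.get('index_version')!r}, "
            f"application expects {config.index_version!r}; re-embed before querying."
        )
```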
Layer 4: Cloud Platform Lock-In
Azure AI Studio, AWS Bedrock, and Google Vertex AI provide managed AI infrastructure. Their proprietary features — guardrails, fine-tuning pipelines, deployment managers — create coupling that makes migration expensive even if the underlying model is vendor-neutral.
- Cloud-provider-specific features used in production inference path
- No documented migration path for managed services
- Observability data locked in provider-specific tooling
Evaluate which managed service features are load-bearing vs. convenience. Prefer open standards where available. Maintain a periodic migration feasibility assessment.
Layer 5: Data Lock-In
Training data uploaded for fine-tuning, conversations stored in a provider's platform, and evaluation datasets stored in managed tooling can be difficult or impossible to export. This creates dependency even when the model is replaceable.
- Fine-tuning datasets stored exclusively in provider platform
- Conversation history or evaluation data not independently exported
- Provider contract does not guarantee data portability
Maintain independent copies of all training data, evaluation datasets, and conversation data. Contract for explicit data portability rights. Export and verify recoverability on a quarterly basis.
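A hypothetical freshness check for that quarterly export routine, assuming a manifest file that records a timezone-aware ISO timestamp per dataset:

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

# Policy window: fail CI (or page someone) if a provider-held dataset has not
# had a verified export within roughly one quarter.
MAX_EXPORT_AGE = timedelta(days=92)

def stale_exports(manifest_path: Path) -> list[str]:
    """Return the datasets whose last verified export is older than the policy window."""
    manifest = json.loads(manifest_path.read_text())
    now = datetime.now(timezone.utc)
    return [
        entry["name"]
        for entry in manifest["datasets"]
        if now - datetime.fromisoformat(entry["last_verified_export"]) > MAX_EXPORT_AGE
    ]
```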
Model Deprecation Response Playbook
Model deprecation is a predictable event that can be planned for. Every major provider has deprecated models: GPT-4 variants, Claude 2, Anthropic's older Instant models, Google's PaLM family. The providers that handled deprecation best gave 12+ months notice. The ones that handled it worst gave 30-60 days. Assume the worst and plan for it.
Track provider model lifecycle communications. Subscribe to provider changelogs, status pages, and developer email lists. Set a 6-month deprecation lead time as a minimum trigger for planning.
When a deprecation notice arrives, run the replacement candidate(s) through your existing eval suite immediately. Identify performance gaps. Prioritise prompt adaptation work by severity of regression.
Adapt prompts for the replacement model. Run parallel evaluation: old model vs. new model on production sample inputs. Document regressions and compensating prompt changes.
Canary the replacement model at 5-10% of production traffic. Compare output quality metrics, latency, cost, and validation failure rates to the deprecated model. Run for a minimum of 48 hours.
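A minimal sketch of a deterministic canary split; the fraction and model identifiers are illustrative:

```python
import hashlib

# A stable fraction of traffic, keyed on a request or user id, goes to the
# replacement model, so comparisons stay consistent across retries.
CANARY_FRACTION = 0.10

def pick_model(request_id: str, current: str, replacement: str) -> str:
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return replacement if bucket < CANARY_FRACTION * 100 else current
```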
Promote the replacement model to full traffic before the deprecation date — never on the last day. Update model registry, documentation, and monitoring baselines.
Multi-Vendor Architecture Patterns
Multi-vendor resilience does not require running multiple models simultaneously at all times. It requires that switching is achievable within an acceptable time window without architectural rework.
Pattern 1: Abstraction Layer
All model API calls go through a unified interface that can be re-implemented for a different provider without changing application code. LiteLLM is the most widely used open-source implementation — it provides a unified interface to 100+ model providers with the OpenAI API shape. The abstraction layer should also handle retry logic, rate limit backoff, and provider failover.
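A minimal usage sketch, assuming LiteLLM's `completion()` call; the model identifiers are illustrative, and provider prefixes should be checked against LiteLLM's documentation:

```python
from litellm import completion

def ask(model: str, prompt: str) -> str:
    """Same call shape for every provider; only the model string changes."""
    response = completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Switching providers becomes a configuration change, not a code change:
# ask("gpt-4o", "...") vs ask("anthropic/claude-3-5-sonnet-20240620", "...")
```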
Pattern 2: Provider Fallback
For latency-sensitive or availability-critical workloads, configure a secondary model provider that activates automatically when the primary is unavailable or rate-limited. This requires pre-validation that the fallback model produces acceptable output quality for the use case — you cannot discover the quality gap during an incident.
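A sketch of the fallback chain, assuming the provider-agnostic call from Pattern 1; the provider list and broad exception handling are illustrative:

```python
# Both models must already have passed the eval suite for this workload.
PROVIDERS = ["primary-model", "fallback-model"]

def complete_with_fallback(prompt: str, call_model) -> str:
    """`call_model(model, prompt)` is the provider-agnostic call from Pattern 1."""
    last_error = None
    for model in PROVIDERS:
        try:
            return call_model(model, prompt)
        except Exception as exc:  # in production, catch rate-limit/availability errors only
            last_error = exc
    raise RuntimeError("All configured providers failed") from last_error
```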
Pattern 3: Eval-Driven Migration
Maintain a portable evaluation suite that runs against any model and produces comparable metrics. When migration is needed, run the suite against candidates, identify the best replacement, and document the adaptation work required. This converts "can we migrate" from an unknown to a measured cost. Teams that have done this once can execute a model migration in days rather than weeks.
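A minimal sketch of such a portable harness; the cases and the substring check stand in for real, production-derived graders:

```python
EVAL_CASES = [
    {"input": "Customer asks for a refund after 45 days.", "must_contain": "policy"},
    # ... production-derived cases
]

def pass_rate(model: str, call_model) -> float:
    """Return the pass rate for `model`; `call_model(model, prompt)` is provider-agnostic."""
    passed = sum(
        1 for case in EVAL_CASES
        if case["must_contain"].lower() in call_model(model, case["input"]).lower()
    )
    return passed / len(EVAL_CASES)
```

Running the same harness against each candidate turns "which replacement is best" into a comparison of numbers rather than opinions.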
Pattern 4: Prompt Portability
Write prompts against the smallest common denominator of instruction-following capability. Avoid model-specific tricks, special tokens, or undocumented behaviours. The upside of a model-specific optimisation is usually marginal. The cost when that optimisation becomes a migration blocker is substantial.
SLA Benchmarking
Vendor SLAs for AI services are typically worse than comparable SLAs for databases, compute, or storage. This is partly because AI inference involves more failure modes and partly because the market is still maturing. Before committing a workload to a provider, benchmark their SLA against your reliability requirements.
| SLA Dimension | Minimum Acceptable | Recommended for Critical Path |
|---|---|---|
| Uptime SLA | 99.9% (8.7h downtime/yr) | 99.95%+ for critical path |
| Response latency p99 | < 30s for background / async | < 5s for synchronous user-facing |
| Rate limits | Document limits, implement backoff | Negotiate higher limits before launch |
| Deprecation notice period | Minimum 6 months | Prefer 12+ months for mission-critical |
| Data portability | All training/eval data exportable | Contractual right to export on termination |
| Sub-processor transparency | Published list of sub-processors | Notification of sub-processor changes |
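The downtime arithmetic behind the uptime row, for translating an SLA percentage into a yearly budget:

```python
# Convert an uptime SLA percentage into allowed downtime per year.
def downtime_hours_per_year(uptime_percent: float) -> float:
    return (1 - uptime_percent / 100) * 365 * 24

print(downtime_hours_per_year(99.9))   # 8.76 hours
print(downtime_hours_per_year(99.95))  # 4.38 hours
```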
Exit Strategy Documentation
An exit strategy is not a plan you execute — it is a document you keep current so you know whether migration is feasible before you need to decide. The minimum contents are listed below.
Exit strategy document — minimum contents:
- Vendor dependency map: every provider and service with classification (model API, vector DB, managed orchestration, observability, auth)
- Lock-in assessment per dependency: migration effort estimate (hours/days/weeks), identified replacement candidates
- Data inventory: where training data, evaluation datasets, and conversation logs are stored; export procedure; last export date
- Replacement candidate evaluations: eval results for at least one alternative to each critical dependency, with date run
- Contractual exit rights: data portability provisions, termination notice requirements, any exclusivity clauses
- Migration rehearsal: at least one dependency migration (even in a dev environment) to validate that the process is executable
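A hypothetical machine-readable form of the dependency map, which lets the exit strategy document be checked for staleness instead of relying on manual review:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Dependency:
    name: str                          # e.g. "primary chat model"
    category: str                      # model API, vector DB, orchestration, observability, auth
    provider: str
    migration_effort: str              # "hours", "days", or "weeks"
    replacement_candidates: list[str] = field(default_factory=list)
    last_alternative_eval: Optional[str] = None   # ISO date of most recent candidate eval
```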
Framework D8 Status
Most frameworks are Partial on D8 — they provide model abstraction layers but no operational resilience tooling or lifecycle management.
The LangChain abstraction layer theoretically supports multiple model providers. In practice, model-specific prompt tuning and output parser coupling reduce portability. No deprecation tooling provided.
Agent definitions include model configuration but no multi-model failover. No deprecation management. Heavy coupling to LangChain internals in some versions adds framework-within-framework lock-in risk.
Model configuration is per-agent and supports multiple backends. The abstraction is cleaner than LangChain for multi-vendor use. Still no deprecation lifecycle management.
Strong multi-model support by design — the kernel can be configured with multiple AI services. Azure-backed deployments benefit from Azure's managed model lifecycle. No deprecation tooling.
Model abstraction is clean and swappable per-agent. The lightweight design means vendor-specific coupling is minimal. No operational resilience tooling — that remains the operator's responsibility.
Pipeline architecture supports model component substitution. Good multi-provider support. Haystack Cloud adds managed deployment but introduces managed service dependency.
The Complete PSF Deep Dive Series
With D8, every PSF domain now has a complete implementation guide. The full series:
You understand the gaps.
Get the credential that proves it.
The AIDA examination tests applied PSF knowledge across all eight domains — exactly the gaps and strengths covered across this series. 15 minutes. No charge. Ever.