LlamaIndex in Production: A PSF Domain Assessment
LlamaIndex has become the leading data framework for RAG applications and increasingly for agentic workflows over structured data. Its modularity, broad integration support, and RAG-specific evaluation tools make it popular for enterprise knowledge base applications. This assessment maps LlamaIndex against all eight PSF domains, with particular attention to the data access control challenges that create silent compliance risks in most production deployments.
Input Governance
Query pipelines can include pre-processing steps for input validation. No native prompt injection detection or semantic classification of user queries.
LlamaIndex's query pipeline architecture allows practitioners to insert pre-processing nodes before a query reaches the retrieval or synthesis step. In principle, this enables input governance: a classification node that evaluates the query's intent before proceeding. In practice, LlamaIndex's documentation and examples emphasise retrieval and synthesis — input governance steps are not prominent in the framework's narrative, and most production deployments skip them. The specific PSF Domain 1 risk for LlamaIndex is RAG-specific: a user who understands that the system is a RAG application over internal documents can craft queries specifically designed to extract sensitive information that the document set contains but that the application's intended use case does not permit surfacing. Standard input filtering (checking for harmful content or off-topic queries) does not address this data exfiltration vector — it requires intent classification specific to the data access context.
Implement a query intent classifier before retrieval: a lightweight LLM call that classifies the query as within-scope, out-of-scope, or a potential data extraction attempt. For sensitive document sets, implement a data access policy layer that restricts which documents can be retrieved in response to which query categories. Log all query classifications for governance audit.
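The pre-retrieval gate described above can be sketched in plain Python. The three-way `Intent` taxonomy and the keyword stub are illustrative assumptions; in production the classification step would be a lightweight LLM call, but the control flow (classify, decide, log for audit) is the pattern:

```python
from dataclasses import dataclass
from enum import Enum

class Intent(Enum):
    IN_SCOPE = "in_scope"
    OUT_OF_SCOPE = "out_of_scope"
    EXTRACTION_ATTEMPT = "extraction_attempt"

# Stand-in for the LLM classification call; a real deployment would
# prompt a small model with the query and the application's scope.
EXTRACTION_MARKERS = ("list all", "dump", "every document", "verbatim")

def classify_query(query: str) -> Intent:
    q = query.lower()
    if any(marker in q for marker in EXTRACTION_MARKERS):
        return Intent.EXTRACTION_ATTEMPT
    return Intent.IN_SCOPE

@dataclass
class GateDecision:
    intent: Intent
    allowed: bool

def gate(query: str, audit_log: list) -> GateDecision:
    intent = classify_query(query)
    decision = GateDecision(intent=intent, allowed=intent is Intent.IN_SCOPE)
    audit_log.append((query, intent.value))  # governance audit trail
    return decision
```

Only queries classified as within-scope proceed to retrieval; everything else is blocked, and every classification lands in the audit log regardless of outcome.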
Output Validation
Response synthesisers can be customised with structured output prompts. Faithfulness evaluation via LlamaIndex Evaluate is available. No automatic output guardrails.
LlamaIndex's evaluators — including FaithfulnessEvaluator, RelevancyEvaluator, and CorrectnessEvaluator — are its strongest output validation contribution. FaithfulnessEvaluator checks whether a response's claims are grounded in the retrieved source documents, which is the core output quality challenge for RAG systems: a model can generate a confident, fluent response that is not supported by the retrieved context. RelevancyEvaluator checks whether the retrieved context actually supports the query. These evaluators are production-useful tools for identifying hallucination and context-answer mismatch. The limitation is that they are not run automatically — they must be explicitly invoked as part of a validation pipeline. Most production deployments use them in offline evaluation (testing against a question set) rather than as real-time output gates. Building a real-time validation step that intercepts and rejects low-faithfulness responses adds latency and cost that must be justified by the stakes of the application.
Run FaithfulnessEvaluator offline against a representative sample of production queries weekly to detect model or retrieval quality degradation. For high-stakes RAG applications (legal, medical, financial), implement real-time faithfulness checking as an output gate — route low-faithfulness responses to a human review queue rather than returning them to the user. Always cite retrieved sources in responses to make faithfulness auditable.
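The routing half of a real-time gate is small. The sketch below assumes a faithfulness score has already been produced (for instance by FaithfulnessEvaluator run against the query, response, and retrieved context) and shows only the gating step the framework does not provide; the 0.8 threshold is an assumed policy value:

```python
from typing import NamedTuple

class RoutedResponse(NamedTuple):
    destination: str  # "user" or "review_queue"
    response: str
    score: float

def route_response(response: str, faithfulness_score: float,
                   threshold: float = 0.8) -> RoutedResponse:
    """Gate a synthesised response on its faithfulness score."""
    if faithfulness_score >= threshold:
        return RoutedResponse("user", response, faithfulness_score)
    # Low-faithfulness answers go to human review, not the end user.
    return RoutedResponse("review_queue", response, faithfulness_score)
```

The latency cost of the gate is the evaluator call itself; the routing shown here is effectively free.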
Data Protection
No native PII detection, data classification, or access control over document retrieval. All retrieved context and generated responses are passed to the configured LLM provider.
Data protection is LlamaIndex's most significant PSF gap, and the one with the highest potential for unexpected compliance violations. The framework's core function is retrieving relevant content from a document store and synthesising it into a response — but it applies no data classification to the documents it indexes or retrieves. If a document set contains mixed-sensitivity data (public documents alongside documents containing PII, financial data, or confidential information), LlamaIndex will retrieve and surface content from all categories with equal facility. There is no native mechanism to tag documents with sensitivity classification, restrict retrieval based on the caller's access rights, or prevent PII from appearing in retrieved context that is passed to a third-party LLM API. For regulated environments, this means an unmodified LlamaIndex deployment will almost certainly cause data protection violations — not through any failure of the framework, but because data classification and access control are not part of its design scope.
Implement document-level metadata with classification tags (PUBLIC, INTERNAL, CONFIDENTIAL, RESTRICTED) and enforce retrieval filters that match query context to permitted classification levels. For multi-tenant RAG systems, namespace indexes by tenant and enforce namespace isolation in all retrieval calls. Never index regulated data (HIPAA, PCI, personal data under GDPR) into a shared index used by a third-party LLM deployment without explicit legal review.
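A minimal sketch of the classification-to-filter mapping, assuming the four-level hierarchy above and a generic metadata-filter shape; the exact filter object depends on your vector store integration:

```python
CLASSIFICATION_ORDER = ["PUBLIC", "INTERNAL", "CONFIDENTIAL", "RESTRICTED"]

def permitted_levels(clearance: str) -> list:
    """Everything at or below the caller's clearance level."""
    return CLASSIFICATION_ORDER[: CLASSIFICATION_ORDER.index(clearance) + 1]

def retrieval_filter(clearance: str) -> dict:
    # Generic "in" filter over a `classification` metadata key; translate
    # this into your vector store integration's filter object before
    # passing it to the retriever.
    return {
        "key": "classification",
        "operator": "in",
        "value": permitted_levels(clearance),
    }
```

The important property is that the filter is derived from the query context on every call, never hard-coded per endpoint, so a new document classification cannot leak through a forgotten code path.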
Observability
LlamaTrace (hosted on LlamaCloud) and Arize Phoenix integration provide trace-level observability for queries and agent runs. Local callback system enables custom logging.
LlamaIndex's callback system is the primary native observability mechanism: every retrieval, LLM call, and synthesis step fires events that practitioners can hook into for logging. LlamaTrace (LlamaIndex's hosted observability product on LlamaCloud) captures full query traces — the query, retrieved nodes, synthesis prompt, and response — in a UI designed for RAG-specific analysis. Arize Phoenix is the recommended open-source alternative for self-hosted deployments. The gaps: LlamaIndex does not emit structured metrics (query counts, latency distributions, error rates) in a format directly consumable by standard observability platforms without a custom callback. Teams wanting Datadog or Grafana integration need to implement the bridging. LlamaTrace's hosted model creates the same data residency considerations as any third-party tracing product — traces contain retrieved document content and generated responses, which may include sensitive data.
Implement the LlamaIndex callback system to emit query metadata (query hash, retrieval count, latency, evaluation scores) to your observability platform. Use a self-hosted Phoenix instance for trace-level debugging if LlamaTrace's data residency profile is incompatible with your document sensitivity. Alert on retrieval count outliers — queries that retrieve significantly more or fewer documents than average may indicate index quality issues.
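A sketch of the metric-collection side, assuming you wire its `record` call into a custom callback handler; the class and field names are illustrative, not LlamaIndex APIs:

```python
import hashlib
from typing import Optional

class QueryMetricsLogger:
    """Per-query metadata collector; hook record() into a callback
    handler so every query emits exactly one record."""

    def __init__(self):
        self.records = []

    def record(self, query: str, retrieval_count: int, latency_ms: float,
               faithfulness: Optional[float] = None) -> None:
        self.records.append({
            # Hash rather than raw text, so metrics exports do not
            # carry query content into the observability platform.
            "query_hash": hashlib.sha256(query.encode()).hexdigest()[:16],
            "retrieval_count": retrieval_count,
            "latency_ms": latency_ms,
            "faithfulness": faithfulness,
        })

    def retrieval_outliers(self, factor: float = 3.0) -> list:
        """Queries retrieving far more documents than average, or none
        at all; both can indicate index quality issues."""
        counts = [r["retrieval_count"] for r in self.records]
        mean = sum(counts) / len(counts)
        return [r for r in self.records
                if r["retrieval_count"] > factor * mean
                or r["retrieval_count"] == 0]
```

Exporting these records to Datadog or Grafana is then a conventional metrics-shipping problem rather than a framework problem.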
Deployment Safety
Index versioning is managed externally (vector store versioning, S3 object versions). No native agent versioning or canary deployment. Model version is a configuration parameter.
LlamaIndex itself has no deployment safety primitives — it is a framework, not a deployment platform, and versioning and rollout control are entirely the practitioner's concern. The practical deployment safety challenge for LlamaIndex is the index: when you rebuild or update the knowledge base index (re-ingesting updated documents, changing chunk sizes, updating embedding models), you are changing a critical component of the application's behaviour in ways that may not be visible from the code. An index rebuild that changes retrieval quality can silently degrade application behaviour without triggering a deployment alarm. PSF Domain 5 for LlamaIndex deployments requires versioning the index as a first-class artifact: maintaining the current and previous index versions, having a rollback procedure for index regression, and testing retrieval quality on a golden question set before promoting a new index to production.
Version your knowledge base index as a production artifact: use vector store namespaces or separate collections to maintain PROD and STAGING indexes. Define a retrieval quality test suite (golden questions with expected top-k documents) and run it against every new index before promotion. Implement a one-command index rollback procedure and test it before you need it.
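The golden-question gate can be a few lines. In the sketch below, `retrieve` is a stand-in for your retriever against the candidate index, and the pass-rate threshold is an assumed policy value:

```python
def retrieval_regression(golden: dict, retrieve, k: int = 5,
                         min_pass_rate: float = 0.8) -> bool:
    """golden maps question -> set of doc ids that must appear in the
    top-k results; retrieve(question, k) returns a list of doc ids
    from the candidate index."""
    passed = sum(
        1 for question, expected in golden.items()
        if expected <= set(retrieve(question, k))
    )
    return passed / len(golden) >= min_pass_rate
```

Run this against the STAGING index in CI; promote to PROD only on a pass, and keep the previous index alive until the new one has passed in production traffic as well.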
Human Oversight
No native human-in-the-loop mechanism. Agent workflows execute autonomously. Human oversight requires application-layer implementation.
LlamaIndex has no framework-level human oversight primitives. Agent workflows — using AgentRunner or ReActAgent — execute their full reasoning-tool-observation loops without a native mechanism to pause and surface decisions to a human. This is a meaningful PSF Domain 6 gap for any LlamaIndex application that takes consequential actions (sending emails, updating databases, making API calls) rather than purely retrieving and synthesising information. For RAG-only applications (question answering, document search, summarisation), human oversight is less critical at the execution level — the risk is in the quality of the output, addressed by D2 evaluation, rather than in autonomous actions. For LlamaIndex agent deployments that use tool-calling to take real-world actions, practitioners must implement human oversight entirely at the application layer.
For LlamaIndex agent deployments with consequential tool use, implement an async execution pattern: agent tool calls are queued as pending actions and a human reviewer approves before execution proceeds. Store the agent's planned actions in a database, present them in a review interface, and resume execution only after approval. Log every oversight decision as an auditable event.
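A minimal in-memory sketch of the pending-action pattern, with hypothetical names; persistence, the review interface, and resuming the agent loop are omitted:

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class PendingAction:
    tool: str
    args: dict
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    status: str = "pending"

class ApprovalQueue:
    """In-memory sketch; production would persist actions to a database
    and surface them in a review interface."""

    def __init__(self):
        self.actions = {}
        self.audit = []  # every oversight decision, as an auditable event

    def submit(self, tool: str, args: dict) -> str:
        action = PendingAction(tool=tool, args=args)
        self.actions[action.id] = action
        return action.id

    def decide(self, action_id: str, approved: bool,
               reviewer: str) -> PendingAction:
        action = self.actions[action_id]
        action.status = "approved" if approved else "rejected"
        self.audit.append((action_id, action.status, reviewer))
        return action
```

The agent's tool executor checks the action's status and only runs tools whose actions have been approved; rejected actions are fed back to the agent as observations.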
Security
No native access control on document retrieval. API keys for LLM providers and vector stores are managed via environment variables. No supply chain scanning for community integrations.
LlamaIndex's security posture has two distinct layers: credential security (managing API keys and connection strings) and data access security (controlling which users can retrieve which documents). Credential security is adequate but manual — API keys are typically passed via environment variables with no native rotation mechanism. Data access security is the more significant PSF Domain 7 concern. LlamaIndex has no built-in user identity concept: all queries to a LlamaIndex application are treated as having equal retrieval rights. For multi-user applications where different users should have access to different document subsets (standard in enterprise deployments), access control must be implemented entirely by the practitioner — typically via metadata filters on retrieval queries that enforce per-user or per-role document access. Forgetting to implement these filters is a common source of data access violations in LlamaIndex deployments.
For every LlamaIndex deployment, document the intended access model (who should be able to retrieve what) and implement explicit metadata filters on all retrieval calls to enforce it. Never rely on the assumption that users won't craft queries that surface documents outside their intended scope. Treat retrieval access control as a security requirement, not an optional feature.
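One way to make the filter impossible to forget is a retrieval wrapper that refuses to run without an access context. The role model and filter shape below are assumptions; the point is the hard failure, not the specific schema:

```python
def retrieve_for_user(user_roles: set, retriever, query: str):
    """Fail closed: no access context, no retrieval."""
    if not user_roles:
        raise PermissionError(
            "retrieval without an access context is forbidden")
    # Illustrative filter shape: only documents tagged with at least one
    # of the caller's roles; map this onto your vector store's metadata
    # filter mechanism.
    filters = {"key": "allowed_roles", "operator": "any",
               "value": sorted(user_roles)}
    return retriever(query, filters)
```

If this wrapper is the only path to the retriever in your codebase, an unfiltered retrieval becomes a code-review-visible violation rather than a silent one.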
Vendor Resilience
Modular architecture supports swapping LLM providers, embedding models, and vector stores with configuration changes. Over 160 LLM integrations; 40+ vector store integrations.
Vendor resilience is LlamaIndex's strongest PSF domain. The framework's design philosophy is explicit modularity: LLM providers, embedding models, vector stores, document loaders, and retrievers are all swappable via a consistent interface. Switching from OpenAI embeddings to Cohere embeddings requires a configuration change, not a code rewrite. Switching from Pinecone to pgvector requires updating the vector store configuration and re-indexing. The framework supports over 160 LLM provider integrations and over 40 vector store integrations — this breadth is unmatched in the ecosystem. For PSF Domain 8, LlamaIndex applications can be designed to be genuinely portable across model providers and infrastructure layers. The practical caveat is that index migration (moving from one vector store to another) requires re-embedding all documents, which is a non-trivial operation for large knowledge bases. Documenting the migration procedure and having it tested before a vendor incident is the practitioner's responsibility.
Design your LlamaIndex deployment with provider configuration externalised to environment variables from day one. Document which providers each component uses and maintain a tested migration procedure for switching the most critical dependency (typically the LLM provider). For large knowledge bases, maintain the source documents and ingestion pipeline in a way that allows re-indexing against a new vector store within an acceptable recovery time objective.
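Externalising provider selection can be as simple as reading configuration once at startup; the environment variable names and defaults below are illustrative:

```python
import os

def provider_config() -> dict:
    """Provider selection read from the environment at startup, so a
    vendor switch is a redeploy with new variables, not a code change."""
    return {
        "llm_provider": os.environ.get("RAG_LLM_PROVIDER", "openai"),
        "embed_provider": os.environ.get("RAG_EMBED_PROVIDER", "openai"),
        "vector_store": os.environ.get("RAG_VECTOR_STORE", "pgvector"),
    }
```

The values feed the corresponding LlamaIndex component constructors; keeping all three in one place also gives you the dependency inventory the migration procedure needs.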
Overall Assessment
LlamaIndex's vendor resilience (D8) is the strongest in the ecosystem — its modular architecture makes provider switching genuinely straightforward. Its RAG-specific evaluation tools (D2) are production-valuable for catching hallucination. The critical gap is data access control (D3, D7): LlamaIndex has no native access model, and the absence of retrieval access control is the most common source of silent compliance violations in enterprise RAG deployments.
For pure knowledge retrieval applications on non-sensitive document sets, LlamaIndex is production-appropriate with standard PSF companion tooling. For multi-tenant or regulated-data deployments, access control must be a primary design concern from the first architecture decision — retrofitting it into an existing LlamaIndex deployment is significantly harder than building it in from the start.