PSF Deep Dive · Domain 6 · April 2026

PSF Domain 6:
Human Oversight

Human-in-the-loop is not a binary feature you toggle on. It is a design discipline that determines when human judgment is required, how humans are presented with decisions, and how oversight scales without becoming a bottleneck. Most production AI systems either have no human oversight (dangerous) or have checkbox compliance oversight (useless). This guide documents the middle path.

Read time: 15 min · PSF version: v1.1 · CC BY 4.0 · Citable

The compliance theatre problem

Human-in-the-loop became a checkbox requirement as AI deployment concerns grew. The result is a generation of AI systems with nominal human oversight that provides no real safety guarantees. The signatures are familiar: an approval button that appears at 2am when the reviewer is asleep, a review queue that processes 400 items per hour, a confirmation screen that 94% of users click through without reading.

Compliance theatre is not just ineffective — it creates a false sense of safety that may be more dangerous than no oversight at all. A team that believes their human review process is working has less incentive to invest in the upstream controls (D1, D2, D4) that would actually catch problems.

PSF Domain 6 requires human oversight that is effective, not merely present. The distinction is between oversight that could plausibly catch problems and oversight that exists to satisfy an audit requirement.

The PSF autonomy level framework

The first design decision in D6 is: what level of autonomy is appropriate for this system? The PSF defines five levels. Most production deployments should target L1 or L2. L3 is appropriate only with robust D4 monitoring. L4 is not appropriate for any customer-facing or regulated system.

L0 · Full human control

AI generates a draft or suggestion. Human reviews and approves before any action is taken. All consequential actions require explicit human authorisation.

Examples: Contract drafting, medical documentation, legal advice
L1 · Human in the loop

AI acts autonomously on low-risk operations. Human review is required before high-risk operations. Risk classification is explicit and documented.

Examples: Email triage (auto-categorise, human sends), data analysis with human-reviewed summary
L2 · Human on the loop

AI acts autonomously. Human monitors activity and can intervene. Automatic escalation when confidence drops or anomalies are detected.

Examples: Automated customer support with supervisor dashboard, monitoring agents
L3 · Human by exception

AI acts autonomously on virtually all operations. Humans are involved only when the AI explicitly requests escalation or when audit sampling triggers review.

Examples: Infrastructure automation, document processing at scale, data pipelines
L4 · Full autonomy

No human oversight during operation. Not appropriate for any system processing personal data, making consequential decisions, or operating in regulated contexts.

Examples: Internal compute optimisation only; not appropriate for customer-facing systems
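One lightweight way to make the chosen level a first-class, auditable property of a deployment rather than a line in a design document is an explicit enum. A minimal sketch; the class and names below are illustrative, not part of any PSF library:

```python
from enum import IntEnum

class AutonomyLevel(IntEnum):
    """PSF D6 autonomy levels, ordered from most to least human control."""
    L0_FULL_HUMAN_CONTROL = 0  # human approves every consequential action
    L1_HUMAN_IN_THE_LOOP = 1   # human review gates high-risk operations
    L2_HUMAN_ON_THE_LOOP = 2   # human monitors and can intervene
    L3_HUMAN_BY_EXCEPTION = 3  # human involved only on escalation or sampling
    L4_FULL_AUTONOMY = 4       # no oversight; unsuitable for regulated systems

# Declared once per deployment; the rest of the oversight machinery keys off it.
DEPLOYMENT_LEVEL = AutonomyLevel.L1_HUMAN_IN_THE_LOOP
```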

When human oversight is required

Deciding when to require human review is the core D6 design question. The answer depends on the consequence severity of the action, the confidence of the model output, and the regulatory context. A practical framework:

Condition | Recommended level | Override trigger
Action affects a human being (sends email, makes payment, changes access) | L0–L1 | Never fully automate without review
Output confidence below defined threshold | L1 | Confidence scoring via D4 observability
Novel or anomalous input outside training distribution | L1 | Anomaly detection in D4 pipeline
High financial or legal consequence action | L0 | Amount/risk threshold in action schema
Action in a regulated domain (finance, healthcare, legal) | L0–L1 | Regulatory requirement, not just preference
Multi-step autonomous task involving tool use | L2 | Anomaly or tool call volume threshold
Internal data processing with no direct human impact | L2–L3 | Audit sampling rate defined
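The table can be encoded directly as a routing function, so the risk classification lives in reviewable code rather than a wiki page. A minimal sketch, reusing the AutonomyLevel enum above; the ActionContext fields and the 0.85 threshold are assumptions to be replaced with your own action schema and calibrated D4 thresholds:

```python
from dataclasses import dataclass

@dataclass
class ActionContext:
    """Attributes an oversight router might inspect (illustrative)."""
    affects_human: bool     # sends email, makes payment, changes access
    confidence: float       # model confidence from the D4 pipeline
    is_anomalous: bool      # D4 anomaly detection flag
    high_consequence: bool  # above the financial/legal risk threshold
    regulated_domain: bool  # finance, healthcare, legal
    uses_tools: bool        # multi-step autonomous tool use

CONFIDENCE_THRESHOLD = 0.85  # assumed; calibrate against your own D4 data

def required_level(ctx: ActionContext) -> AutonomyLevel:
    """Return the most restrictive level any matching condition demands."""
    if ctx.high_consequence or ctx.regulated_domain:
        return AutonomyLevel.L0_FULL_HUMAN_CONTROL
    if ctx.affects_human or ctx.is_anomalous or ctx.confidence < CONFIDENCE_THRESHOLD:
        return AutonomyLevel.L1_HUMAN_IN_THE_LOOP
    if ctx.uses_tools:
        return AutonomyLevel.L2_HUMAN_ON_THE_LOOP
    return AutonomyLevel.L3_HUMAN_BY_EXCEPTION
```

An action then requires human review whenever required_level(ctx) is stricter (lower) than the deployment's declared level.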

Designing effective human review

Once you have decided that human review is required, the design of the review interface determines whether oversight is effective. Four principles:

1. Present the decision, not the output

Most HITL implementations show the reviewer the model output and ask them to approve or reject it. This is backwards. The reviewer should be presented with the decision that needs to be made — the action that will be taken if they approve — not just the text the model generated. "Approve sending this email to John Smith declining his application" is a decision. "Here is a draft rejection email" is an output.

2. Make the cost of approval visible

Reviewers approve items faster when the cost of approval is invisible. A review interface that shows "Approve / Reject" with no context about what approval means will produce rubber-stamp oversight. Make the downstream consequence explicit in every review request.
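A review payload that satisfies both principles carries the decision and its consequence as first-class fields, with the raw output demoted to supporting material. A sketch; all field names and values are illustrative:

```python
from dataclasses import dataclass

@dataclass
class ReviewRequest:
    decision: str           # the action approval will trigger, stated as a decision
    consequence: str        # what happens downstream if approved: the visible cost
    model_output: str       # the raw output, available but secondary
    escalation_reason: str  # why this item reached a human at all

req = ReviewRequest(
    decision="Send this email to John Smith declining his application",
    consequence="Applicant receives a final rejection; no further review occurs",
    model_output="Dear Mr Smith, thank you for your application...",
    escalation_reason="Action affects a human being (L1 trigger)",
)
```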

3. Design for rejection, not just approval

If your review interface has a one-click approve and a multi-step reject, your reviewers will approve more than they should. The friction to reject should be the same as the friction to approve. And rejection should trigger a feedback loop that improves the model — not just block the action.
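A sketch of a handler with symmetric friction, where both verdicts go through the same call and require a reason; execute_action, block_action, and record_feedback are placeholders for your own action and feedback plumbing:

```python
def handle_review(req: ReviewRequest, verdict: str, reason: str) -> None:
    """Approve and reject carry identical friction: one call, one required reason."""
    if not reason:
        raise ValueError("A reason is required for approval and rejection alike")
    if verdict == "approve":
        execute_action(req)                      # placeholder: perform the gated action
        record_feedback(req, "approve", reason)  # approvals are training signal too
    elif verdict == "reject":
        block_action(req)                        # placeholder: stop the action
        record_feedback(req, "reject", reason)   # placeholder: feed model improvement
    else:
        raise ValueError(f"Unknown verdict: {verdict!r}")
```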

4. Blind review sampling

For L2/L3 deployments where most actions are automated, implement blind review sampling: randomly select a percentage of automated actions and present them to a reviewer as if they required approval, but do not block execution. This measures whether your automated actions are ones a reviewer would have approved. If the sample approval rate drops, you have a signal to investigate.

AutoGen's UserProxyAgent implements the closest native approximation to this with its human_input_mode="TERMINATE" pattern, but this is a conversation-end trigger rather than a sampling mechanism. For true blind sampling, you need to implement this at the application layer.
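At the application layer, blind sampling amounts to a few lines: execute immediately, shadow-copy a random sample into the review queue, and track the sampled approval rate. A sketch; enqueue_for_blind_review is a placeholder and the 5% rate is an assumed starting point:

```python
import random

BLIND_SAMPLE_RATE = 0.05  # assumed 5%; tune per deployment

def dispatch(action) -> None:
    """Execute automated actions immediately; shadow a random sample to review."""
    execute_action(action)  # never blocked: that is what makes the review blind
    if random.random() < BLIND_SAMPLE_RATE:
        # Presented to the reviewer exactly as if approval were required.
        enqueue_for_blind_review(action)  # placeholder for your review queue

def blind_sample_health(verdicts: list[str]) -> float:
    """Approval rate over blind-sampled items; a drop is a signal to investigate."""
    return verdicts.count("approve") / len(verdicts) if verdicts else 1.0
```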

Skill maintenance and automation complacency

There is a documented phenomenon in aviation automation: as systems become more reliable, operators reduce their active engagement, and their ability to take over when the automation fails degrades. The same risk applies to AI oversight. If human reviewers approve 99% of what the AI produces, they are not developing the judgment to catch the 1%.

PSF D6 requires that human oversight capability be maintained over time, not just present at deployment. Practically this means: periodic exercises where reviewers encounter synthetic failures designed to test their judgment, rotation of review responsibilities to prevent rubber-stamp patterns, and monitoring of reviewer decision latency and consistency as proxies for engagement quality.
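A sketch of the synthetic-failure exercise: plant known-bad items in the review queue at a low rate and score the fraction reviewers reject. Names and the 2% rate are illustrative:

```python
import random

INJECTION_RATE = 0.02  # assumed: roughly 2% of review items are planted failures

def next_review_item(queue, synthetic_failures):
    """Occasionally serve a known-bad item so reviewer vigilance is measurable."""
    if synthetic_failures and random.random() < INJECTION_RATE:
        item = random.choice(synthetic_failures)
        item.is_synthetic = True  # hidden from the reviewer, logged for scoring
        return item
    return queue.get()

def catch_rate(results) -> float | None:
    """Fraction of planted failures that reviewers actually rejected."""
    planted = [r for r in results if r.item.is_synthetic]
    if not planted:
        return None
    return sum(r.verdict == "reject" for r in planted) / len(planted)
```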

Framework D6 implementation notes

AutoGen / AG2

Best native D6 support of any framework assessed. UserProxyAgent with human_input_mode='ALWAYS' or 'TERMINATE' provides genuine oversight points. The challenge is that these are chat-based interactions — for production, wrap UserProxyAgent in an application that presents the decision context to the right reviewer.
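A minimal sketch using the classic AutoGen (pyautogen/AG2) API; the model name and message are illustrative, and a production deployment would wrap this console prompt in the review interface described above:

```python
import os
from autogen import AssistantAgent, UserProxyAgent

assistant = AssistantAgent(
    name="assistant",
    llm_config={"config_list": [
        {"model": "gpt-4o", "api_key": os.environ["OPENAI_API_KEY"]}
    ]},
)
reviewer = UserProxyAgent(
    name="reviewer",
    human_input_mode="ALWAYS",    # prompt a human before every reply
    code_execution_config=False,  # review only; no local code execution
)
reviewer.initiate_chat(assistant, message="Draft a reply declining this refund request")
```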

LangGraph

Graph interrupt() nodes are the native primitive. Define interrupt conditions at edges — trigger human review when the graph transitions to high-consequence nodes. LangSmith provides the review interface. This is the cleanest HITL architecture in the LangChain ecosystem.
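A sketch assuming a recent langgraph release with the interrupt API: the node pauses with the decision context (not just the output), and a reviewer's verdict resumes the graph via Command(resume=...):

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver
from langgraph.types import Command, interrupt

class State(TypedDict):
    draft: str
    approved: bool

def send_email(state: State) -> dict:
    # interrupt() pauses the graph, surfacing the decision to whatever
    # review surface consumes it (e.g. LangSmith).
    verdict = interrupt({
        "decision": f"Approve sending this email: {state['draft']}",
        "consequence": "Recipient receives the email immediately",
    })
    return {"approved": verdict == "approve"}

builder = StateGraph(State)
builder.add_node("send_email", send_email)
builder.add_edge(START, "send_email")
builder.add_edge("send_email", END)
graph = builder.compile(checkpointer=MemorySaver())  # checkpointer required to pause

config = {"configurable": {"thread_id": "review-001"}}
graph.invoke({"draft": "Dear Mr Smith...", "approved": False}, config)  # pauses here
graph.invoke(Command(resume="approve"), config)  # reviewer verdict resumes the graph
```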

Semantic Kernel

Step approval patterns can be implemented via kernel filters. For Azure deployments, Azure Logic Apps can serve as the human review routing layer with full audit trail. No native blind sampling — implement at application layer.
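A sketch of a function-invocation filter as an approval gate, assuming the filter API in Semantic Kernel Python 1.x; request_human_approval is a placeholder for your routing layer (Azure Logic Apps in the Azure pattern above), and the plugin names are assumed:

```python
from semantic_kernel import Kernel
from semantic_kernel.filters import FilterTypes, FunctionInvocationContext

kernel = Kernel()
HIGH_RISK_PLUGINS = {"email", "payments"}  # assumed risk classification

@kernel.filter(FilterTypes.FUNCTION_INVOCATION)
async def approval_gate(context: FunctionInvocationContext, next):
    # Gate only the function calls classified as high consequence.
    if context.function.plugin_name in HIGH_RISK_PLUGINS:
        if not await request_human_approval(context):  # placeholder routing layer
            raise PermissionError("Human reviewer rejected this function call")
    await next(context)
```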

CrewAI

Human oversight is the most significant D6 gap in CrewAI. The multi-agent architecture means a human approval at the crew level may not catch individual agent actions. Implement approval at the task level, not just the crew kickoff — each task that takes a real-world action should have an approval gate.
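A sketch of a task-level gate using CrewAI's human_input flag; the agent and task definitions are illustrative:

```python
from crewai import Agent, Crew, Task

support_agent = Agent(
    role="Support agent",
    goal="Resolve customer refund requests under the documented policy",
    backstory="Handles refunds and customer-facing messages",
)

# The gate sits on the individual task, not on crew kickoff, so the
# real-world action is what gets approved.
refund_task = Task(
    description="Decide whether to refund this order and draft the reply",
    expected_output="A refund decision plus the customer-facing message",
    agent=support_agent,
    human_input=True,  # approval gate at the task level
)

crew = Crew(agents=[support_agent], tasks=[refund_task])
result = crew.kickoff()
```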

D6 pre-deployment checklist

Autonomy level for this system is documented and justified (Required)
Risk classification for all agent actions is complete, distinguishing what requires human review from what is automated (Required)
Review interface presents the decision and its consequences, not just the output (Required)
Rejection path has equal UX friction to approval path (Required)
Review SLA is defined and monitored (actions do not queue indefinitely) (Required)
Blind review sampling rate is configured for L2/L3 deployments
Reviewer skill maintenance exercises are scheduled
Reviewer approval rate and decision latency are monitored as leading indicators
Escalation path exists when reviewers are unavailable (degraded mode vs. block)
EU AI Act high-risk classification has been assessed — if applicable, oversight meets Article 14 requirements

Related guides

HITL: When, Why, and How to Design Oversight
AutoGen PSF Assessment (best D6 framework)
PSF D4: Observability — feeds D6 triggers
Framework comparison matrix