PSF Deep Dive · Domain 6 · April 2026

PSF Domain 6:
Human Oversight

Human-in-the-loop is not a binary feature you toggle on. It is a design discipline that determines when human judgment is required, how humans are presented with decisions, and how oversight scales without becoming a bottleneck. Most production AI systems either have no human oversight (dangerous) or have checkbox compliance oversight (useless). This guide documents the middle path.

Read time: 15 min · PSF version: v1.1 · CC BY 4.0 · Citable

The compliance theatre problem

Human-in-the-loop became a checkbox requirement as AI deployment concerns grew. The result is a generation of AI systems with nominal human oversight that provides no real safety guarantees. The signatures are familiar: an approval button that appears at 2am when the reviewer is asleep, a review queue that processes 400 items per hour, a confirmation screen that 94% of users click through without reading.

Compliance theatre is not just ineffective — it creates a false sense of safety that may be more dangerous than no oversight at all. A team that believes their human review process is working has less incentive to invest in the upstream controls (D1, D2, D4) that would actually catch problems.

PSF Domain 6 requires human oversight that is effective, not merely present. The distinction is between oversight that could plausibly catch problems and oversight that exists to satisfy an audit requirement.

The PSF autonomy level framework

The first design decision in D6 is: what level of autonomy is appropriate for this system? The PSF defines five levels. Most production deployments should target L1 or L2. L3 is appropriate only with robust D4 monitoring. L4 is not appropriate for any customer-facing or regulated system.

L0 · Full human control

AI generates a draft or suggestion. Human reviews and approves before any action is taken. All consequential actions require explicit human authorisation.

Examples: Contract drafting, medical documentation, legal advice
L1 · Human in the loop

AI acts autonomously on low-risk operations. Human review is required before high-risk operations. Risk classification is explicit and documented.

Examples: Email triage (auto-categorise, human sends), data analysis with human-reviewed summary
L2 · Human on the loop

AI acts autonomously. Human monitors activity and can intervene. Automatic escalation when confidence drops or anomalies are detected.

Examples: Automated customer support with supervisor dashboard, monitoring agents
L3 · Human by exception

AI acts autonomously on virtually all operations. Humans are involved only when the AI explicitly requests escalation or when audit sampling triggers review.

Examples: Infrastructure automation, document processing at scale, data pipelines
L4 · Full autonomy

No human oversight during operation. Not appropriate for any system processing personal data, making consequential decisions, or operating in regulated contexts.

Examples: Internal compute optimisation only; not appropriate for customer-facing systems
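One lightweight way to make the chosen level a first-class, auditable property of a deployment rather than a line in a design document is an explicit enum. A minimal sketch; the class and names below are illustrative, not part of any PSF library:

```python
from enum import IntEnum

class AutonomyLevel(IntEnum):
    """PSF D6 autonomy levels, ordered from most to least human control."""
    L0_FULL_HUMAN_CONTROL = 0  # human approves every consequential action
    L1_HUMAN_IN_THE_LOOP = 1   # human review gates high-risk operations
    L2_HUMAN_ON_THE_LOOP = 2   # human monitors and can intervene
    L3_HUMAN_BY_EXCEPTION = 3  # human involved only on escalation or sampling
    L4_FULL_AUTONOMY = 4       # no oversight; unsuitable for regulated systems

# Declared once per deployment; the rest of the oversight machinery keys off it.
DEPLOYMENT_LEVEL = AutonomyLevel.L1_HUMAN_IN_THE_LOOP
```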

When human oversight is required

Deciding when to require human review is the core D6 design question. The answer depends on the consequence severity of the action, the confidence of the model output, and the regulatory context. A practical framework:

Condition | Recommended level | Override trigger
Action affects a human being (sends email, makes payment, changes access) | L0–L1 | Never fully automate without review
Output confidence below defined threshold | L1 | Confidence scoring via D4 observability
Novel or anomalous input outside training distribution | L1 | Anomaly detection in D4 pipeline
High financial or legal consequence action | L0 | Amount/risk threshold in action schema
Action in a regulated domain (finance, healthcare, legal) | L0–L1 | Regulatory requirement, not just preference
Multi-step autonomous task involving tool use | L2 | Anomaly or tool call volume threshold
Internal data processing with no direct human impact | L2–L3 | Audit sampling rate defined
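The table can be encoded directly as a routing function, so the risk classification lives in reviewable code rather than a wiki page. A minimal sketch, reusing the AutonomyLevel enum above; the ActionContext fields and the 0.85 threshold are assumptions to be replaced with your own action schema and calibrated D4 thresholds:

```python
from dataclasses import dataclass

@dataclass
class ActionContext:
    """Attributes an oversight router might inspect (illustrative)."""
    affects_human: bool     # sends email, makes payment, changes access
    confidence: float       # model confidence from the D4 pipeline
    is_anomalous: bool      # D4 anomaly detection flag
    high_consequence: bool  # above the financial/legal risk threshold
    regulated_domain: bool  # finance, healthcare, legal
    uses_tools: bool        # multi-step autonomous tool use

CONFIDENCE_THRESHOLD = 0.85  # assumed; calibrate against your own D4 data

def required_level(ctx: ActionContext) -> AutonomyLevel:
    """Return the most restrictive level any matching condition demands."""
    if ctx.high_consequence or ctx.regulated_domain:
        return AutonomyLevel.L0_FULL_HUMAN_CONTROL
    if ctx.affects_human or ctx.is_anomalous or ctx.confidence < CONFIDENCE_THRESHOLD:
        return AutonomyLevel.L1_HUMAN_IN_THE_LOOP
    if ctx.uses_tools:
        return AutonomyLevel.L2_HUMAN_ON_THE_LOOP
    return AutonomyLevel.L3_HUMAN_BY_EXCEPTION
```

An action then requires human review whenever required_level(ctx) is stricter (lower) than the deployment's declared level.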

Designing effective human review

Once you have decided that human review is required, the design of the review interface determines whether oversight is effective. Four principles:

1. Present the decision, not the output

Most HITL implementations show the reviewer the model output and ask them to approve or reject it. This is backwards. The reviewer should be presented with the decision that needs to be made — the action that will be taken if they approve — not just the text the model generated. "Approve sending this email to John Smith declining his application" is a decision. "Here is a draft rejection email" is an output.

2. Make the cost of approval visible

Reviewers approve items faster when the cost of approval is invisible. A review interface that shows "Approve / Reject" with no context about what approval means will produce rubber-stamp oversight. Make the downstream consequence explicit in every review request.
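A review payload that satisfies both principles carries the decision and its consequence as first-class fields, with the raw output demoted to supporting material. A sketch; all field names and values are illustrative:

```python
from dataclasses import dataclass

@dataclass
class ReviewRequest:
    decision: str           # the action approval will trigger, stated as a decision
    consequence: str        # what happens downstream if approved: the visible cost
    model_output: str       # the raw output, available but secondary
    escalation_reason: str  # why this item reached a human at all

req = ReviewRequest(
    decision="Send this email to John Smith declining his application",
    consequence="Applicant receives a final rejection; no further review occurs",
    model_output="Dear Mr Smith, thank you for your application...",
    escalation_reason="Action affects a human being (L1 trigger)",
)
```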

3. Design for rejection, not just approval

If your review interface has a one-click approve and a multi-step reject, your reviewers will approve more than they should. The friction to reject should be the same as the friction to approve. And rejection should trigger a feedback loop that improves the model — not just block the action.
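A sketch of a handler with symmetric friction, where both verdicts go through the same call and require a reason; execute_action, block_action, and record_feedback are placeholders for your own action and feedback plumbing:

```python
def handle_review(req: ReviewRequest, verdict: str, reason: str) -> None:
    """Approve and reject carry identical friction: one call, one required reason."""
    if not reason:
        raise ValueError("A reason is required for approval and rejection alike")
    if verdict == "approve":
        execute_action(req)                      # placeholder: perform the gated action
        record_feedback(req, "approve", reason)  # approvals are training signal too
    elif verdict == "reject":
        block_action(req)                        # placeholder: stop the action
        record_feedback(req, "reject", reason)   # placeholder: feed model improvement
    else:
        raise ValueError(f"Unknown verdict: {verdict!r}")
```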

4. Blind review sampling

For L2/L3 deployments where most actions are automated, implement blind review sampling: randomly select a percentage of automated actions and present them to a reviewer as if they required approval, but do not block execution. This measures whether your automated actions are ones a reviewer would have approved. If the sample approval rate drops, you have a signal to investigate.

AutoGen's UserProxyAgent implements the closest native approximation to this with its human_input_mode="TERMINATE" pattern, but this is a conversation-end trigger rather than a sampling mechanism. For true blind sampling, you need to implement this at the application layer.
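At the application layer, blind sampling amounts to a few lines: execute immediately, shadow-copy a random sample into the review queue, and track the sampled approval rate. A sketch; enqueue_for_blind_review is a placeholder and the 5% rate is an assumed starting point:

```python
import random

BLIND_SAMPLE_RATE = 0.05  # assumed 5%; tune per deployment

def dispatch(action) -> None:
    """Execute automated actions immediately; shadow a random sample to review."""
    execute_action(action)  # never blocked: that is what makes the review blind
    if random.random() < BLIND_SAMPLE_RATE:
        # Presented to the reviewer exactly as if approval were required.
        enqueue_for_blind_review(action)  # placeholder for your review queue

def blind_sample_health(verdicts: list[str]) -> float:
    """Approval rate over blind-sampled items; a drop is a signal to investigate."""
    return verdicts.count("approve") / len(verdicts) if verdicts else 1.0
```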

Skill maintenance and automation complacency

There is a documented phenomenon in aviation automation: as systems become more reliable, operators reduce their active engagement, and their ability to take over when the automation fails degrades. The same risk applies to AI oversight. If human reviewers approve 99% of what the AI produces, they are not developing the judgment to catch the 1%.

PSF D6 requires that human oversight capability be maintained over time, not just present at deployment. Practically this means: periodic exercises where reviewers encounter synthetic failures designed to test their judgment, rotation of review responsibilities to prevent rubber-stamp patterns, and monitoring of reviewer decision latency and consistency as proxies for engagement quality.
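A sketch of the synthetic-failure exercise: plant known-bad items in the review queue at a low rate and score the fraction reviewers reject. Names and the 2% rate are illustrative:

```python
import random

INJECTION_RATE = 0.02  # assumed: roughly 2% of review items are planted failures

def next_review_item(queue, synthetic_failures):
    """Occasionally serve a known-bad item so reviewer vigilance is measurable."""
    if synthetic_failures and random.random() < INJECTION_RATE:
        item = random.choice(synthetic_failures)
        item.is_synthetic = True  # hidden from the reviewer, logged for scoring
        return item
    return queue.get()

def catch_rate(results) -> float | None:
    """Fraction of planted failures that reviewers actually rejected."""
    planted = [r for r in results if r.item.is_synthetic]
    if not planted:
        return None
    return sum(r.verdict == "reject" for r in planted) / len(planted)
```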

Framework D6 implementation notes

AutoGen / AG2

Best native D6 support of any framework assessed. UserProxyAgent with human_input_mode='ALWAYS' or 'TERMINATE' provides genuine oversight points. The challenge is that these are chat-based interactions — for production, wrap UserProxyAgent in an application that presents the decision context to the right reviewer.
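A minimal sketch using the classic AutoGen (pyautogen/AG2) API; the model name and message are illustrative, and a production deployment would wrap this console prompt in the review interface described above:

```python
import os
from autogen import AssistantAgent, UserProxyAgent

assistant = AssistantAgent(
    name="assistant",
    llm_config={"config_list": [
        {"model": "gpt-4o", "api_key": os.environ["OPENAI_API_KEY"]}
    ]},
)
reviewer = UserProxyAgent(
    name="reviewer",
    human_input_mode="ALWAYS",    # prompt a human before every reply
    code_execution_config=False,  # review only; no local code execution
)
reviewer.initiate_chat(assistant, message="Draft a reply declining this refund request")
```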

LangGraph

Graph interrupt() nodes are the native primitive. Define interrupt conditions at edges — trigger human review when the graph transitions to high-consequence nodes. LangSmith provides the review interface. This is the cleanest HITL architecture in the LangChain ecosystem.
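A sketch assuming a recent langgraph release with the interrupt API: the node pauses with the decision context (not just the output), and a reviewer's verdict resumes the graph via Command(resume=...):

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver
from langgraph.types import Command, interrupt

class State(TypedDict):
    draft: str
    approved: bool

def send_email(state: State) -> dict:
    # interrupt() pauses the graph, surfacing the decision to whatever
    # review surface consumes it (e.g. LangSmith).
    verdict = interrupt({
        "decision": f"Approve sending this email: {state['draft']}",
        "consequence": "Recipient receives the email immediately",
    })
    return {"approved": verdict == "approve"}

builder = StateGraph(State)
builder.add_node("send_email", send_email)
builder.add_edge(START, "send_email")
builder.add_edge("send_email", END)
graph = builder.compile(checkpointer=MemorySaver())  # checkpointer required to pause

config = {"configurable": {"thread_id": "review-001"}}
graph.invoke({"draft": "Dear Mr Smith...", "approved": False}, config)  # pauses here
graph.invoke(Command(resume="approve"), config)  # reviewer verdict resumes the graph
```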

Semantic Kernel

Step approval patterns can be implemented via kernel filters. For Azure deployments, Azure Logic Apps can serve as the human review routing layer with full audit trail. No native blind sampling — implement at application layer.
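A sketch of a function-invocation filter as an approval gate, assuming the filter API in Semantic Kernel Python 1.x; request_human_approval is a placeholder for your routing layer (Azure Logic Apps in the Azure pattern above), and the plugin names are assumed:

```python
from semantic_kernel import Kernel
from semantic_kernel.filters import FilterTypes, FunctionInvocationContext

kernel = Kernel()
HIGH_RISK_PLUGINS = {"email", "payments"}  # assumed risk classification

@kernel.filter(FilterTypes.FUNCTION_INVOCATION)
async def approval_gate(context: FunctionInvocationContext, next):
    # Gate only the function calls classified as high consequence.
    if context.function.plugin_name in HIGH_RISK_PLUGINS:
        if not await request_human_approval(context):  # placeholder routing layer
            raise PermissionError("Human reviewer rejected this function call")
    await next(context)
```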

CrewAI

Human oversight is the most significant D6 gap in CrewAI. The multi-agent architecture means a human approval at the crew level may not catch individual agent actions. Implement approval at the task level, not just the crew kickoff — each task that takes a real-world action should have an approval gate.
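A sketch of a task-level gate using CrewAI's human_input flag; the agent and task definitions are illustrative:

```python
from crewai import Agent, Crew, Task

support_agent = Agent(
    role="Support agent",
    goal="Resolve customer refund requests under the documented policy",
    backstory="Handles refunds and customer-facing messages",
)

# The gate sits on the individual task, not on crew kickoff, so the
# real-world action is what gets approved.
refund_task = Task(
    description="Decide whether to refund this order and draft the reply",
    expected_output="A refund decision plus the customer-facing message",
    agent=support_agent,
    human_input=True,  # approval gate at the task level
)

crew = Crew(agents=[support_agent], tasks=[refund_task])
result = crew.kickoff()
```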

D6 pre-deployment checklist

Autonomy level for this system is documented and justified (Required)
Risk classification for all agent actions is complete, distinguishing what requires human review from what is automated (Required)
Review interface presents the decision and its consequences, not just the output (Required)
Rejection path has equal UX friction to approval path (Required)
Review SLA is defined and monitored (actions do not queue indefinitely) (Required)
Blind review sampling rate is configured for L2/L3 deployments
Reviewer skill maintenance exercises are scheduled
Reviewer approval rate and decision latency are monitored as leading indicators
Escalation path exists when reviewers are unavailable (degraded mode vs. block)
EU AI Act high-risk classification has been assessed — if applicable, oversight meets Article 14 requirements

Related guides

HITL: When, Why, and How to Design Oversight
AutoGen PSF Assessment (best D6 framework)
PSF D4: Observability — feeds D6 triggers
Framework comparison matrix