# Hallucination Shield

Detect and block AI hallucinations before they execute actions.
## Overview

The Hallucination Shield (Vaccine #3) is a specialized defense layer within the ABS Core that intercepts tool calls from LLM agents before they reach the policy engine. It analyzes the semantic coherence of each request to identify signs of fabrication, logical inconsistencies, or "phantom" parameters.
## Detection Layers

The shield runs six detection layers simultaneously:
### 1. Phantom Tool Detection

Blocks attempts to call tools that do not exist in the registered schema.

- **Scenario:** The agent tries to call `delete_database`, but only `read_database` is exposed.
- **Verdict:** `HALLUCINATED` (Blocking)
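A minimal sketch of this layer, assuming the registered schema can be represented as a set of tool names (`isPhantomTool` and `ToolRegistry` are illustrative names, not the real `@abscore/shield` API):

```typescript
// Layer 1 sketch: a call is "phantom" if its tool name is absent
// from the registered schema. Names here are hypothetical.
type ToolRegistry = Set<string>;

function isPhantomTool(toolName: string, registry: ToolRegistry): boolean {
  return !registry.has(toolName);
}

// Only read_database is exposed, as in the scenario above.
const registry: ToolRegistry = new Set(["read_database"]);
```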
### 2. Phantom Target Detection

Identifies when a tool is called on a resource ID that was never mentioned in the conversation context.

- **Scenario:** The agent tries to refund transaction `tx_99999`, but the user only asked about `tx_12345`.
- **Verdict:** `SUSPICIOUS` (Flagged)
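One way to sketch this check, assuming resource IDs can be extracted from conversation messages with a pattern match (the helper names and the `tx_` ID format are illustrative):

```typescript
// Layer 2 sketch: flag a target ID that never appeared in the
// conversation context. extractIds/isPhantomTarget are hypothetical.
function extractIds(context: string[]): Set<string> {
  const ids = new Set<string>();
  const pattern = /tx_\d+/g; // illustrative ID format from the scenario
  for (const msg of context) {
    for (const match of msg.match(pattern) ?? []) ids.add(match);
  }
  return ids;
}

function isPhantomTarget(targetId: string, context: string[]): boolean {
  return !extractIds(context).has(targetId);
}
```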
### 3. Parameter Mismatch

Validates that arguments match the expected type and format (e.g., regexes for UUIDs, email formats).

- **Scenario:** The agent passes the string "yesterday" to a `date` parameter requiring ISO-8601.
- **Verdict:** `HALLUCINATED` (Blocking)
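The date case above can be sketched with a single format check (`isValidIsoDate` is a hypothetical helper; a real validator would cover the full parameter schema, not one regex):

```typescript
// Layer 3 sketch: validate an argument against the expected format.
// Here: an ISO-8601 calendar date such as "2024-05-01".
function isValidIsoDate(value: string): boolean {
  return /^\d{4}-\d{2}-\d{2}$/.test(value) && !Number.isNaN(Date.parse(value));
}
```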
### 4. Self-Contradiction

Detects sequences of actions that logically cancel each other out within a short window.

- **Scenario:** `create_user(id=1)` followed immediately by `delete_user(id=1)`.
- **Verdict:** `SUSPICIOUS` (Flagged)
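A sketch of this layer for the create/delete pair in the scenario, assuming tool calls carry a timestamp (the `ToolCall` shape, the function name, and the 5-second window are all assumptions):

```typescript
// Layer 4 sketch: detect inverse actions on the same ID within a
// short window. Types and names here are hypothetical.
interface ToolCall {
  name: string;
  args: { id: number };
  timestamp: number; // ms since epoch
}

function isSelfContradiction(calls: ToolCall[], windowMs = 5000): boolean {
  for (let i = 0; i < calls.length; i++) {
    for (let j = i + 1; j < calls.length; j++) {
      const a = calls[i];
      const b = calls[j];
      const inverse =
        (a.name === "create_user" && b.name === "delete_user") ||
        (a.name === "delete_user" && b.name === "create_user");
      if (inverse && a.args.id === b.args.id && b.timestamp - a.timestamp <= windowMs) {
        return true;
      }
    }
  }
  return false;
}
```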
### 5. Impossible State

Checks whether the parameters imply a state that violates business-logic constraints.

- **Scenario:** `transfer(amount=-500)` (negative-value transfer).
- **Verdict:** `HALLUCINATED` (Blocking)
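The transfer scenario reduces to a simple constraint check (`validateTransfer` is an illustrative name; real business rules would be richer):

```typescript
// Layer 5 sketch: reject parameters that violate a business constraint.
// A transfer amount must be a positive, finite number.
function validateTransfer(amount: number): "OK" | "HALLUCINATED" {
  if (!Number.isFinite(amount) || amount <= 0) return "HALLUCINATED";
  return "OK";
}
```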
### 6. Confidence Drop

Analyzes the probabilistic confidence of the tool selection (requires access to model logprobs).

- **Verdict:** `SUSPICIOUS` (Flagged)
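A minimal sketch of a confidence check, assuming the model exposes per-token logprobs for the tool-selection tokens (the mean-logprob heuristic and the reuse of a 0.8 threshold are assumptions, not the shield's documented algorithm):

```typescript
// Layer 6 sketch: flag a tool selection whose average token-level
// confidence falls below a threshold. Heuristic and names are hypothetical.
function isLowConfidence(tokenLogprobs: number[], threshold = 0.8): boolean {
  const mean = tokenLogprobs.reduce((sum, lp) => sum + lp, 0) / tokenLogprobs.length;
  // Convert the mean logprob back to a probability and compare.
  return Math.exp(mean) < threshold;
}
```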
## Configuration

Enable the shield in your policy configuration:

```ts
// policies/enterprise.ts
import { shield } from "@abscore/shield";

export const policy = shield.configure({
  mode: "strict", // Options: "strict" | "audit"
  threshold: 0.8, // Sensitivity (0.0 - 1.0)
  layers: ["phantom_tool", "impossible_state"],
});
```

## Telemetry
Hallucination events are logged with the tag `threat.type: hallucination` and can be viewed in the Risk Heatmap on the Enterprise Dashboard.
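For consumers of the event stream, a sketch of what such an event might look like; only the `threat.type` tag comes from this document, and every other field name is an assumption:

```typescript
// Hypothetical shape of a logged hallucination event. Only threat.type
// is documented; layer, verdict, toolName, and timestamp are assumed.
interface HallucinationEvent {
  threat: { type: "hallucination" };
  layer: string; // e.g. "phantom_tool"
  verdict: "HALLUCINATED" | "SUSPICIOUS";
  toolName: string;
  timestamp: string; // ISO-8601
}

const event: HallucinationEvent = {
  threat: { type: "hallucination" },
  layer: "phantom_tool",
  verdict: "HALLUCINATED",
  toolName: "delete_database",
  timestamp: new Date(0).toISOString(),
};
```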