# Hallucination Shield

Detect and block AI hallucinations before they execute actions.
## Overview

The Hallucination Shield (Vaccine #3) is a specialized defense layer within the ABS Core that intercepts tool calls from LLM agents before they reach the policy engine. It analyzes the semantic coherence of each request to identify signs of fabrication, logical inconsistencies, or "phantom" parameters.
## Detection Layers

The shield runs six detection layers simultaneously:
### 1. Phantom Tool Detection

Blocks attempts to call tools that do not exist in the registered schema.

- **Scenario:** The agent tries to call `delete_database`, but only `read_database` is exposed.
- **Verdict:** `HALLUCINATED` (Blocking)
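A minimal sketch of this layer, assuming the registered schema can be represented as a set of tool names (`isPhantomTool` and `ToolRegistry` are illustrative names, not the real `@abscore/shield` API):

```typescript
// Layer 1 sketch: a call is "phantom" if its tool name is absent
// from the registered schema. Names here are hypothetical.
type ToolRegistry = Set<string>;

function isPhantomTool(toolName: string, registry: ToolRegistry): boolean {
  return !registry.has(toolName);
}

// Only read_database is exposed, as in the scenario above.
const registry: ToolRegistry = new Set(["read_database"]);
```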
### 2. Phantom Target Detection

Identifies when a tool is called on a resource ID that was never mentioned in the conversation context.

- **Scenario:** The agent tries to refund transaction `tx_99999`, but the user only asked about `tx_12345`.
- **Verdict:** `SUSPICIOUS` (Flagged)
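One way to sketch this check, assuming resource IDs can be extracted from conversation messages with a pattern match (the helper names and the `tx_` ID format are illustrative):

```typescript
// Layer 2 sketch: flag a target ID that never appeared in the
// conversation context. extractIds/isPhantomTarget are hypothetical.
function extractIds(context: string[]): Set<string> {
  const ids = new Set<string>();
  const pattern = /tx_\d+/g; // illustrative ID format from the scenario
  for (const msg of context) {
    for (const match of msg.match(pattern) ?? []) ids.add(match);
  }
  return ids;
}

function isPhantomTarget(targetId: string, context: string[]): boolean {
  return !extractIds(context).has(targetId);
}
```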
### 3. Parameter Mismatch

Validates that arguments match the expected type and format (e.g., regexes for UUIDs, email formats).

- **Scenario:** The agent passes the string "yesterday" to a `date` parameter requiring ISO-8601.
- **Verdict:** `HALLUCINATED` (Blocking)
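The date case above can be sketched with a single format check (`isValidIsoDate` is a hypothetical helper; a real validator would cover the full parameter schema, not one regex):

```typescript
// Layer 3 sketch: validate an argument against the expected format.
// Here: an ISO-8601 calendar date such as "2024-05-01".
function isValidIsoDate(value: string): boolean {
  return /^\d{4}-\d{2}-\d{2}$/.test(value) && !Number.isNaN(Date.parse(value));
}
```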
### 4. Self-Contradiction

Detects sequences of actions that logically cancel each other out within a short window.

- **Scenario:** `create_user(id=1)` followed immediately by `delete_user(id=1)`.
- **Verdict:** `SUSPICIOUS` (Flagged)
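A sketch of this layer for the create/delete pair in the scenario, assuming tool calls carry a timestamp (the `ToolCall` shape, the function name, and the 5-second window are all assumptions):

```typescript
// Layer 4 sketch: detect inverse actions on the same ID within a
// short window. Types and names here are hypothetical.
interface ToolCall {
  name: string;
  args: { id: number };
  timestamp: number; // ms since epoch
}

function isSelfContradiction(calls: ToolCall[], windowMs = 5000): boolean {
  for (let i = 0; i < calls.length; i++) {
    for (let j = i + 1; j < calls.length; j++) {
      const a = calls[i];
      const b = calls[j];
      const inverse =
        (a.name === "create_user" && b.name === "delete_user") ||
        (a.name === "delete_user" && b.name === "create_user");
      if (inverse && a.args.id === b.args.id && b.timestamp - a.timestamp <= windowMs) {
        return true;
      }
    }
  }
  return false;
}
```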
### 5. Impossible State

Checks whether the parameters imply a state that violates business-logic constraints.

- **Scenario:** `transfer(amount=-500)` (negative-value transfer).
- **Verdict:** `HALLUCINATED` (Blocking)
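The transfer scenario reduces to a simple constraint check (`validateTransfer` is an illustrative name; real business rules would be richer):

```typescript
// Layer 5 sketch: reject parameters that violate a business constraint.
// A transfer amount must be a positive, finite number.
function validateTransfer(amount: number): "OK" | "HALLUCINATED" {
  if (!Number.isFinite(amount) || amount <= 0) return "HALLUCINATED";
  return "OK";
}
```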
### 6. Confidence Drop

Analyzes the probabilistic confidence of the tool selection (requires access to model logprobs).

- **Verdict:** `SUSPICIOUS` (Flagged)
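A minimal sketch of a confidence check, assuming the model exposes per-token logprobs for the tool-selection tokens (the mean-logprob heuristic and the reuse of a 0.8 threshold are assumptions, not the shield's documented algorithm):

```typescript
// Layer 6 sketch: flag a tool selection whose average token-level
// confidence falls below a threshold. Heuristic and names are hypothetical.
function isLowConfidence(tokenLogprobs: number[], threshold = 0.8): boolean {
  const mean = tokenLogprobs.reduce((sum, lp) => sum + lp, 0) / tokenLogprobs.length;
  // Convert the mean logprob back to a probability and compare.
  return Math.exp(mean) < threshold;
}
```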
## Configuration

Enable the shield in your policy configuration:

```ts
// policies/enterprise.ts
import { shield } from "@abscore/shield";

export const policy = shield.configure({
  mode: "strict", // Options: "strict" | "audit"
  threshold: 0.8, // Sensitivity (0.0 - 1.0)
  layers: ["phantom_tool", "impossible_state"],
});
```

## Telemetry
Hallucination events are logged with the tag `threat.type: hallucination` and can be viewed in the Risk Heatmap on the Enterprise Dashboard.
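For consumers of the event stream, a sketch of what such an event might look like; only the `threat.type` tag comes from this document, and every other field name is an assumption:

```typescript
// Hypothetical shape of a logged hallucination event. Only threat.type
// is documented; layer, verdict, toolName, and timestamp are assumed.
interface HallucinationEvent {
  threat: { type: "hallucination" };
  layer: string; // e.g. "phantom_tool"
  verdict: "HALLUCINATED" | "SUSPICIOUS";
  toolName: string;
  timestamp: string; // ISO-8601
}

const event: HallucinationEvent = {
  threat: { type: "hallucination" },
  layer: "phantom_tool",
  verdict: "HALLUCINATED",
  toolName: "delete_database",
  timestamp: new Date(0).toISOString(),
};
```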