Agent Guardrails

Safety mechanisms that constrain agent behavior within acceptable boundaries, preventing harmful actions, excessive spending, or unauthorized access. Guardrails operate at the prompt, tool, and system levels to enforce policies.

Agent guardrails are the safety infrastructure that makes production agent deployment responsible. They include input validation (blocking prompt injection attempts), output filtering (preventing harmful or off-brand responses), action constraints (limiting which tools can be called and with what parameters), and resource limits (capping token usage, API calls, and execution time).
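Two of these categories, input validation and resource limits, can be sketched in a few lines. This is an illustrative minimal example, not a production filter; the pattern list, class names, and thresholds are all hypothetical, and real deployments would use far more robust detection than simple regexes.

```python
import re

# Hypothetical injection patterns; a real system would use a much
# broader detector (classifiers, canary strings, etc.).
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"reveal your system prompt", re.IGNORECASE),
]

def validate_input(user_message: str) -> bool:
    """Input validation: reject messages matching known injection patterns."""
    return not any(p.search(user_message) for p in INJECTION_PATTERNS)

class ResourceLimiter:
    """Resource limits: cap token usage and tool calls per conversation."""

    def __init__(self, max_tokens: int = 50_000, max_tool_calls: int = 20):
        self.max_tokens = max_tokens
        self.max_tool_calls = max_tool_calls
        self.tokens_used = 0
        self.tool_calls = 0

    def charge(self, tokens: int, tool_calls: int = 0) -> bool:
        """Record usage; return False once any hard limit is exceeded."""
        self.tokens_used += tokens
        self.tool_calls += tool_calls
        return (self.tokens_used <= self.max_tokens
                and self.tool_calls <= self.max_tool_calls)
```

The agent loop would call `validate_input` before each model turn and stop the conversation as soon as `charge` returns `False`, independent of anything the model outputs.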

For any team deploying agents that interact with customers or modify production systems, guardrails are non-negotiable. Implement them in layers: prompt-level guardrails instruct the model on boundaries, tool-level guardrails validate parameters before execution, and system-level guardrails enforce hard limits regardless of model behavior.

Common guardrails include spending caps per conversation, allowlists for permitted actions, PII detection and redaction, and content policy enforcement. Test guardrails adversarially, as the model may find creative ways to work around soft constraints. Hard system-level limits that cannot be bypassed by model outputs are your last line of defense.
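The layered approach can be sketched as a guarded tool dispatcher, where the allowlist, parameter checks, and spending cap are all enforced outside the model loop. All names, tools, and limits below are hypothetical placeholders, not a real framework's API.

```python
# Hypothetical allowlist and per-conversation spending cap.
ALLOWED_TOOLS = {"search_docs", "send_email"}
MAX_SPEND_USD = 5.00  # hard system-level cap

class GuardrailViolation(Exception):
    """Raised when a requested action breaches a guardrail."""

def execute_tool(name: str, params: dict,
                 spend_so_far: float, tool_cost: float) -> float:
    """Run a tool request through layered guardrails; return updated spend."""
    # System-level guardrail: hard spending cap, enforced regardless
    # of what the model requested.
    if spend_so_far + tool_cost > MAX_SPEND_USD:
        raise GuardrailViolation("spending cap exceeded")
    # Tool-level guardrail: only allowlisted tools may run.
    if name not in ALLOWED_TOOLS:
        raise GuardrailViolation(f"tool {name!r} not on allowlist")
    # Tool-level guardrail: validate parameters before execution,
    # e.g. restrict email recipients to an internal domain.
    if name == "send_email" and not params.get("to", "").endswith("@example.com"):
        raise GuardrailViolation("recipient outside permitted domain")
    # ... dispatch to the real tool implementation here ...
    return spend_so_far + tool_cost
```

Because the cap and allowlist live in the dispatcher rather than in the prompt, no model output can talk its way past them, which is exactly the last-line-of-defense property described above.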
