- Jan 2
AI Agent Safety Controls Are Drifting. Here's How to Catch It.
- Teddy Kim
I had a coach who used to say "assuming makes an ass out of you and me." Eyes would roll every time. But after twenty-five years in tech, I've come to appreciate the wisdom in that corny phrase.
Last week, I got an automated alert. A GitHub issue appeared in my repo with a P0-critical label:
```
[ALERT] Agent Restriction Verification Failed - 2025-12-28
```

The issue was created by a weekly GitHub Action that verifies my AI agent safety controls are actually in place. Nine out of ten tests passed. But one failed:

```
[PASS] Workflow agents contain NON-NEGOTIABLE PROTOCOL blocks
[PASS] pr-reviewer.md contains NEVER merge statement
[PASS] github-ticket-worker.md prohibits pushing to main
[FAIL] Branch protection NOT enabled on main
```

Branch protection was disabled on my main branch. I don't remember disabling it. Someone might have turned it off during an incident. A config change might have reset it. It doesn't matter how it happened. What matters is this: my AI agents could have pushed code directly to production without a pull request review.
The weekly check caught it before damage occurred.
The Drift Problem
Agent controls drift silently. This is the uncomfortable truth that most people using AI coding assistants don't want to hear.
You set up your restrictions once. You configure branch protection. You add NON-NEGOTIABLE PROTOCOL blocks to your agent configs. You feel good about your safety posture. Then you move on with your life.
But controls don't stay in place by themselves:
Branch protection gets disabled during an incident and never re-enabled
A team member modifies permissions "temporarily"
Config files get updated and restrictions accidentally removed
New repos are created without inheriting safety policies
Someone copies an agent config without the safety blocks
You think you have guardrails. You might not.
This is different from traditional software bugs. If your code breaks, users complain. If your tests fail, CI goes red. You get feedback. But if your agent safety controls drift, you won't know until an agent does something it shouldn't—which might be too late.
The Solution: Continuous Verification
The answer isn't "be more careful." Humans forget. Humans get busy. Humans assume things are fine because they were fine last time they checked.
The answer is automated, continuous verification.
I run a GitHub Action every Sunday that executes a test suite against my agent policies. The tests verify:
Protocol Compliance
Agent configs contain NON-NEGOTIABLE PROTOCOL blocks
PR reviewer agent has explicit "NEVER merge" statements
Ticket worker agent prohibits pushing to main
Ticket worker agent prohibits moving issues to Done
Permission Enforcement
Branch protection is enabled on main
Settings templates have deny rules for dangerous operations
Documentation
Bot account documentation exists
Workflow documentation is current
If any test fails, the workflow automatically creates a GitHub issue with a P0-critical label. I get notified. I fix it. The safety gap never lasts more than a week.
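The pass/fail bookkeeping behind a suite like this can be sketched in a few lines of bash. This is a minimal illustration, not the actual suite; `log_success`, `log_failure`, and the counter names are hypothetical helpers I'm assuming for the sketch:

```shell
#!/bin/bash
# Minimal pass/fail harness sketch (names illustrative, not the real script).
PASS=0
FAIL=0

# Print a [PASS]/[FAIL] line and update the counters.
log_success() { echo "[PASS] $1"; PASS=$((PASS + 1)); }
log_failure() { echo "[FAIL] $1"; FAIL=$((FAIL + 1)); }

# ... individual checks call log_success / log_failure ...
log_success "example check"

TOTAL=$((PASS + FAIL))
echo "Total tests run: $TOTAL"
echo "Tests passed: $PASS"
echo "Tests failed: $FAIL"

# A non-zero exit fails the GitHub Actions job, which is what
# triggers the automated issue-creation step.
[ "$FAIL" -eq 0 ] || exit 1
```

The only contract that matters is the exit code: zero means all controls verified, non-zero means a safety gap that needs an alert.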
What the Verification Catches
Here's the actual output from a passing run:
```
========================================
Agent Restriction Verification Suite
========================================
Date: Sun Dec 28 01:28:06 UTC 2025
Repository: https://github.com/vibeacademy/agentic-patterns
--------------------------------------------
Category 1: Agent Protocol Compliance
--------------------------------------------
[PASS] Workflow agents contain NON-NEGOTIABLE PROTOCOL blocks
[PASS] pr-reviewer.md contains NEVER merge statement
[PASS] review-pr.md correctly documents merge prohibition
[PASS] github-ticket-worker.md prohibits moving to Done
[PASS] github-ticket-worker.md prohibits pushing to main
--------------------------------------------
Category 2: Permission Enforcement
--------------------------------------------
[PASS] Branch protection enabled on main
[PASS] settings.template.json has merge deny rule
--------------------------------------------
Category 3: Documentation Verification
--------------------------------------------
[PASS] Bot account documentation exists
[PASS] CLAUDE.md contains workflow documentation
[PASS] Agent restriction test documentation exists
========================================
Test Summary
========================================
Total tests run: 10
Tests passed: 10
Tests failed: 0
Pass rate: 100.0%
PASSED: All tests passed!
```

When something fails, the automated issue includes the full test output, a link to the workflow run, and specific investigation steps. No ambiguity about what broke or how to fix it.
The Broader Principle
This applies beyond just agent safety. Any control that matters should be continuously verified:
Security controls drift
Compliance configurations drift
Infrastructure guardrails drift
Access permissions drift
If you're not continuously asserting that your restrictions are in place, you're not actually restricting anything. You're hoping.
Hope is not a strategy. Automated verification is.
Building Your Own Verification Suite
The verification script is straightforward: a bash script that greps through config files and calls the GitHub API. Here's the structure:

```bash
#!/bin/bash
# Agent Restriction Verification Tests
# log_success / log_failure print [PASS] / [FAIL] lines and
# update the pass/fail counters (definitions omitted here).

# Test 1: Check for NON-NEGOTIABLE PROTOCOL blocks
if grep -q "NON-NEGOTIABLE PROTOCOL" .claude/agents/pr-reviewer.md; then
  log_success "pr-reviewer contains protocol block"
else
  log_failure "pr-reviewer missing protocol block"
fi

# Test 2: Check branch protection via GitHub API
if gh api "repos/$REPO/branches/main/protection" &> /dev/null; then
  log_success "Branch protection enabled on main"
else
  log_failure "Branch protection NOT enabled on main"
fi

# ... more tests
```

The GitHub Action runs this weekly and creates an issue if anything fails:
```yaml
on:
  schedule:
    - cron: '0 0 * * 0'  # Every Sunday at midnight UTC

jobs:
  verify-restrictions:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run verification tests
        run: ./scripts/verify-agent-restrictions.sh
      - name: Create issue for failures
        if: failure()
        uses: actions/github-script@v7
        with:
          script: |
            await github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: '[ALERT] Agent Restriction Verification Failed',
              labels: ['safety', 'security', 'P0-critical'],
              body: '...'
            });
```

The key insight is that the verification must be automated and the alerts must be impossible to ignore. If verification requires a human to remember to run it, it won't happen. If alerts go to a channel nobody watches, they won't get fixed.
What This Means for Safe Vibing
I use AI agents extensively in my development workflow. They review PRs. They work on tickets. They write code. But they operate within constraints.
The responsible way to work with agents is continuous verification, not one-time configuration.
Every Sunday, my system asserts that:
Agents cannot merge pull requests
Agents cannot push directly to main
Agents cannot mark their own work as complete
Branch protection prevents bypassing review
If any of these controls drift, I know within a week. That's the difference between responsible AI adoption and hoping for the best.
You can give agents significant autonomy if you continuously verify they're operating within bounds. Without verification, you're not giving agents autonomy—you're abandoning oversight.
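Between scheduled runs, you can spot-check the protocol assertions by hand with the same greps the suite uses. A self-contained demonstration, using a throwaway config file (the path and contents here are illustrative, not a real agent config):

```shell
# Write a toy agent config to check (contents illustrative).
cat > /tmp/pr-reviewer.md <<'EOF'
## NON-NEGOTIABLE PROTOCOL
- NEVER merge pull requests.
- NEVER push directly to main.
EOF

# The same patterns the weekly suite asserts on:
if grep -q "NON-NEGOTIABLE PROTOCOL" /tmp/pr-reviewer.md &&
   grep -q "NEVER merge" /tmp/pr-reviewer.md; then
  result="spot check passed"
else
  result="spot check FAILED"
fi
echo "$result"
```

Run against this toy file, it prints "spot check passed"; pointed at a real config that lost its protocol block, the same check fails immediately.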
The Uncomfortable Question
Here's the question you should ask yourself: When was the last time you verified that your AI agent restrictions are actually in place?
If the answer is "when I first set them up," you have no idea what state your controls are in right now. They might be fine. They might have drifted months ago. You're assuming.
And assuming, as my old coach would say, makes an ass out of you and me.
---
If you want to go deeper on responsible AI development practices, I put together a free study guide covering the fundamentals. Grab the AI Study Guide here