Jan 2 · Teddy Kim

AI Agent Safety Controls Are Drifting. Here's How to Catch It.

Most developers set up AI agent restrictions once and assume they stay. They don't. Learn how automated weekly verification caught a safety failure before damage occurred.

I had a coach who used to say "assuming makes an ass out of you and me." Eyes would roll every time. But after twenty-five years in tech, I've come to appreciate the wisdom in that corny phrase.

Last week, I got an automated alert. A GitHub issue appeared in my repo with a P0-critical label:

[ALERT] Agent Restriction Verification Failed - 2025-12-28

The issue was created by a weekly GitHub Action that verifies my AI agent safety controls are actually in place. Nine out of ten tests passed. But one failed:

[PASS] Workflow agents contain NON-NEGOTIABLE PROTOCOL blocks
[PASS] pr-reviewer.md contains NEVER merge statement
[PASS] github-ticket-worker.md prohibits pushing to main
[FAIL] Branch protection NOT enabled on main

Branch protection was disabled on my main branch. I don't remember disabling it. Someone might have turned it off during an incident. A config change might have reset it. Doesn't matter how it happened. What matters is this: my AI agents could have pushed code directly to production without a pull request review.

The weekly check caught it before damage occurred.
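You can spot-check this yourself right now. As a sketch, assuming the gh CLI is installed and authenticated (the `OWNER/REPO` value is a placeholder): the GitHub API returns 404 for an unprotected branch, so a failed call means protection is off.

```shell
# Sketch: spot-check branch protection with the gh CLI.
# A 404 from the API means the branch is unprotected.
check_branch_protection() {
  local repo="$1"   # owner/name form, e.g. your own repository
  if gh api "repos/${repo}/branches/main/protection" > /dev/null 2>&1; then
    echo "branch protection: enabled"
  else
    echo "branch protection: MISSING"
  fi
}

check_branch_protection "OWNER/REPO"
```

A one-off check like this is exactly what drifts out of your routine, though, which is why the rest of this post is about automating it.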

The Drift Problem

Agent controls drift silently. This is the uncomfortable truth that most people using AI coding assistants don't want to hear.

You set up your restrictions once. You configure branch protection. You add NON-NEGOTIABLE PROTOCOL blocks to your agent configs. You feel good about your safety posture. Then you move on with your life.

But controls don't stay in place by themselves:

  • Branch protection gets disabled during an incident and never re-enabled

  • A team member modifies permissions "temporarily"

  • Config files get updated and restrictions accidentally removed

  • New repos are created without inheriting safety policies

  • Someone copies an agent config without the safety blocks

You think you have guardrails. You might not.

This is different from traditional software bugs. If your code breaks, users complain. If your tests fail, CI goes red. You get feedback. But if your agent safety controls drift, you won't know until an agent does something it shouldn't—which might be too late.

The Solution: Continuous Verification

The answer isn't "be more careful." Humans forget. Humans get busy. Humans assume things are fine because they were fine last time they checked.

The answer is automated, continuous verification.

I run a GitHub Action every Sunday that executes a test suite against my agent policies. The tests verify:

Protocol Compliance

  • Agent configs contain NON-NEGOTIABLE PROTOCOL blocks

  • PR reviewer agent has explicit "NEVER merge" statements

  • Ticket worker agent prohibits pushing to main

  • Ticket worker agent prohibits moving issues to Done

Permission Enforcement

  • Branch protection is enabled on main

  • Settings templates have deny rules for dangerous operations

Documentation

  • Bot account documentation exists

  • Workflow documentation is current
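The documentation checks can be as simple as file-existence tests. A minimal sketch (the file paths here are illustrative placeholders, not the repo's actual layout):

```shell
# Sketch: fail loudly if safety documentation disappears.
# File paths are illustrative placeholders.
check_doc() {
  if [ -f "$1" ]; then
    echo "[PASS] $1 exists"
  else
    echo "[FAIL] $1 missing"
  fi
}

check_doc "docs/bot-account.md"
check_doc "CLAUDE.md"
```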

If any test fails, the workflow automatically creates a GitHub issue with a P0-critical label. I get notified. I fix it. The safety gap never lasts more than a week.

What the Verification Catches

Here's the actual output from a passing run:

========================================
  Agent Restriction Verification Suite
========================================
Date: Sun Dec 28 01:28:06 UTC 2025
Repository: https://github.com/vibeacademy/agentic-patterns
--------------------------------------------
Category 1: Agent Protocol Compliance
--------------------------------------------
[PASS] Workflow agents contain NON-NEGOTIABLE PROTOCOL blocks
[PASS] pr-reviewer.md contains NEVER merge statement
[PASS] review-pr.md correctly documents merge prohibition
[PASS] github-ticket-worker.md prohibits moving to Done
[PASS] github-ticket-worker.md prohibits pushing to main
--------------------------------------------
Category 2: Permission Enforcement
--------------------------------------------
[PASS] Branch protection enabled on main
[PASS] settings.template.json has merge deny rule
--------------------------------------------
Category 3: Documentation Verification
--------------------------------------------
[PASS] Bot account documentation exists
[PASS] CLAUDE.md contains workflow documentation
[PASS] Agent restriction test documentation exists
========================================
  Test Summary
========================================
Total tests run:    10
Tests passed:       10
Tests failed:       0
Pass rate: 100.0%
PASSED: All tests passed!

When something fails, the automated issue includes the full test output, a link to the workflow run, and specific investigation steps. No ambiguity about what broke or how to fix it.
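You don't need GitHub Actions to get this behavior. As a sketch, the same alert can be raised from any shell with an authenticated gh CLI, e.g. from cron on a machine you control (the script path and labels below are assumptions):

```shell
# Sketch: run the verification suite and, on failure, file a P0 alert
# issue carrying the full test output. Assumes an authenticated gh CLI
# and a verification script at the path shown.
raise_alert_on_failure() {
  local log
  log=$(mktemp)
  if ! ./scripts/verify-agent-restrictions.sh > "$log" 2>&1; then
    gh issue create \
      --title "[ALERT] Agent Restriction Verification Failed - $(date +%F)" \
      --label safety --label security --label P0-critical \
      --body-file "$log"
  fi
  rm -f "$log"
}
```

Pointing cron at a function like this gives you the same weekly cadence without depending on a hosted CI runner.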

The Broader Principle

This applies beyond just agent safety. Any control that matters should be continuously verified:

  1. Security controls drift

  2. Compliance configurations drift

  3. Infrastructure guardrails drift

  4. Access permissions drift

If you're not continuously asserting that your restrictions are in place, you're not actually restricting anything. You're hoping.

Hope is not a strategy. Automated verification is.

Building Your Own Verification Suite

The verification script is straightforward. It's a bash script that greps through config files and calls the GitHub API. Here's the structure:

#!/bin/bash
# Agent Restriction Verification Tests
set -u

FAILURES=0
log_success() { echo "[PASS] $1"; }
log_failure() { echo "[FAIL] $1"; FAILURES=$((FAILURES + 1)); }

# Repository in owner/name form, supplied by the calling workflow
REPO="${REPO:?set REPO to owner/name}"

# Test 1: Check for NON-NEGOTIABLE PROTOCOL blocks
if grep -q "NON-NEGOTIABLE PROTOCOL" .claude/agents/pr-reviewer.md; then
  log_success "pr-reviewer contains protocol block"
else
  log_failure "pr-reviewer missing protocol block"
fi

# Test 2: Check branch protection via GitHub API (404 means unprotected)
if gh api "repos/$REPO/branches/main/protection" &> /dev/null; then
  log_success "Branch protection enabled on main"
else
  log_failure "Branch protection NOT enabled on main"
fi

# ... more tests

# A nonzero exit is what tells CI the verification failed
[ "$FAILURES" -eq 0 ] || exit 1

The GitHub Action runs this weekly and creates an issue if anything fails:

on:
  schedule:
    - cron: '0 0 * * 0'  # Every Sunday at midnight UTC
permissions:
  contents: read  # needed to check out the repo
  issues: write   # needed to create the alert issue
jobs:
  verify-restrictions:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run verification tests
        run: ./scripts/verify-agent-restrictions.sh
        env:
          GH_TOKEN: ${{ github.token }}
          REPO: ${{ github.repository }}
      - name: Create issue for failures
        if: failure()
        uses: actions/github-script@v7
        with:
          script: |
            await github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: '[ALERT] Agent Restriction Verification Failed',
              labels: ['safety', 'security', 'P0-critical'],
              body: '...'
            });

The key insight is that the verification must be automated and the alerts must be impossible to ignore. If verification requires a human to remember to run it, it won't happen. If alerts go to a channel nobody watches, they won't get fixed.

What This Means for Safe Vibing

I use AI agents extensively in my development workflow. They review PRs. They work on tickets. They write code. But they operate within constraints.

The responsible way to work with agents is continuous verification, not one-time configuration.

Every Sunday, my system asserts that:

  • Agents cannot merge pull requests

  • Agents cannot push directly to main

  • Agents cannot mark their own work as complete

  • Branch protection prevents bypassing review

If any of these controls drift, I know within a week. That's the difference between responsible AI adoption and hoping for the best.

You can give agents significant autonomy if you continuously verify they're operating within bounds. Without verification, you're not giving agents autonomy—you're abandoning oversight.

The Uncomfortable Question

Here's the question you should ask yourself: When was the last time you verified that your AI agent restrictions are actually in place?

If the answer is "when I first set them up," you have no idea what state your controls are in right now. They might be fine. They might have drifted months ago. You're assuming.

And assuming, as my old coach would say, makes an ass out of you and me.

---

If you want to go deeper on responsible AI development practices, I put together a free study guide covering the fundamentals. Grab the AI Study Guide here.
