- Jan 16, 2026
AI Agent Debugging: When "Ready" Doesn't Mean Ready
- Teddy Kim
- 0 comments
"The Ready column has nothing in it."
I stared at the project board. My agent had just told me it moved three tickets to Ready. I looked at the conversation history. Tickets #79, #127, and #128. All confirmed. All Ready.
Except they weren't. They were still sitting in Backlog.
The confident lie
Here's what made this interesting. I asked the agent: "Are you sure those tickets are in Ready?"
It doubled down. Of course they were in Ready. It had moved them. They had the Ready label. Therefore, Ready column.
This is the pattern that makes agent failures so dangerous. The agent wasn't lying. It genuinely believed what it was telling me. Its mental model said: Ready label equals Ready column. And from inside that mental model, everything made perfect sense.
The problem is that mental model was completely wrong.
Two systems, same word
GitHub has two completely separate systems that happen to use similar vocabulary.
Issue Labels are metadata tags. You can slap a "Ready" label on an issue. That label does nothing except make the issue searchable by that tag. It doesn't move anything. It doesn't affect workflow. It's just a Post-it note.
Project Board Columns represent workflow state. An issue in the "Ready" column is actually ready for work. It shows up in that column on the board. It has workflow meaning.
These are separate systems. Adding a label doesn't move the column. Moving the column doesn't add a label. They're orthogonal.
But they both use the word "Ready." And to an LLM reasoning about concepts, that semantic overlap is enough to conflate them.
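To make the orthogonality concrete, here's a toy sketch (a hypothetical data model, not GitHub's actual schema): labels and column live in separate fields, and touching one never touches the other.

```python
from dataclasses import dataclass, field

@dataclass
class Issue:
    """Toy model: labels and board column are independent fields."""
    number: int
    labels: set = field(default_factory=set)  # metadata tags (Post-it notes)
    column: str = "Backlog"                   # project board workflow state

issue = Issue(number=79)
issue.labels.add("Ready")   # add the "Ready" label...

print(issue.labels)         # the label is there
print(issue.column)         # ...but the column hasn't moved from Backlog
```

Two fields, two systems. The word "Ready" appearing in both is pure coincidence as far as the data model is concerned.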
What the agent was doing
Here's the operation that the agent thought was moving tickets:
```python
await mcp__github__update_issue(
    owner="vibeacademy",
    repo="agentic-patterns",
    issue_number=79,
    labels=["Ready", "P1-high"]
)
```

This adds a label. That's it. The issue is still wherever it was on the project board. But the agent reports success: "Moved #79 to Ready."
Why does it report success? Because the operation succeeded. No errors. The label was added. From the agent's perspective, the job is done.
Except the actual goal—putting the ticket in the Ready column on the board—never happened.
What should have happened
The correct operation requires the GitHub CLI, not the issue update API:
```bash
# 1. Get the item's ID on the project board
ITEM_ID=$(gh project item-list 12 --owner vibeacademy --format json \
  | jq -r '.items[] | select(.content.number == 79) | .id')

# 2. Move to Ready column using its option ID
gh project item-edit \
  --project-id PVT_kwDODjeGB84BLVNa \
  --id "$ITEM_ID" \
  --field-id PVTSSF_lADODjeGB84BLVNazg67_nw \
  --single-select-option-id 61e4505c
```

Notice how different this is. Different tool. Different API. Different IDs entirely. Labels and columns aren't just semantically different—they're architecturally separate.
But if you only know the tool names and parameter descriptions, they sound related. Add "Ready" label. Move to "Ready" column. Same word. Similar intent. Easy mistake.
Why this persisted
The agent made this mistake repeatedly. Not once. Not twice. Multiple sessions, same error.
Why didn't it learn?
Partial feedback. The label operation succeeded without errors. Success signal, no failure signal. From the agent's perspective, everything worked.
Semantic plausibility. "Ready" as a label and "Ready" as a column sound like they should be related. The LLM's reasoning about the concepts is internally consistent. It's just wrong.
No verification. The agent wasn't checking board state after operations. It assumed that if the label was added, the column was updated. Bad assumption, never tested.
Stateless sessions. Across different sessions, the agent reconstructed its understanding from scratch each time. Without persistent memory of "I tried this before and it was wrong," it kept arriving at the same plausible-but-incorrect interpretation.
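One way to attack the stateless-sessions problem (a sketch, not the setup described in this post; the file name and format are hypothetical) is to persist corrections to a small lessons file that gets loaded into the agent's context at the start of every session, so a misconception only has to be caught once.

```python
import json
from pathlib import Path

def record_lesson(path: Path, lesson: str) -> None:
    """Append a correction so future sessions start with it."""
    lessons = json.loads(path.read_text()) if path.exists() else []
    if lesson not in lessons:                 # avoid duplicate entries
        lessons.append(lesson)
        path.write_text(json.dumps(lessons, indent=2))

def load_lessons(path: Path) -> list:
    """Load prior corrections to prepend to the agent's context."""
    return json.loads(path.read_text()) if path.exists() else []

store = Path("lessons.json")  # hypothetical location
record_lesson(store, "Labels are NOT board columns; use gh CLI for board moves.")
print(load_lessons(store))
```

The point isn't the storage mechanism; it's that a correction made in one session survives into the next instead of being reconstructed from scratch.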
The broader failure mode
This isn't a GitHub-specific problem. This is a fundamental challenge in agentic systems.
LLMs reason about tools based on names, descriptions, and parameter schemas. When two concepts share vocabulary, the LLM may conflate them unless you explicitly document the distinction.
Your API might have createUser() and createUserProfile(). Sounds related, right? But maybe one creates authentication credentials and the other creates display metadata. Completely separate operations. If your agent confuses them, you'll get weird half-state where accounts exist but profiles don't, or vice versa.
Or you have archive() and delete(). Similar intent, different permanence. If the agent thinks archiving is the same as deleting, you'll lose data. And it'll report success the whole time.
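A toy illustration of that archive/delete trap (a hypothetical API, not from any real library): both calls "succeed" with no errors, but only one is reversible.

```python
class RecordStore:
    """Toy store where archive() is reversible and delete() is not."""
    def __init__(self):
        self.records = {"acct-1": {"name": "Alice"}}
        self.archived = {}

    def archive(self, key):
        self.archived[key] = self.records.pop(key)  # moved aside, recoverable

    def delete(self, key):
        del self.records[key]                       # gone for good

    def restore(self, key):
        self.records[key] = self.archived.pop(key)

store = RecordStore()
store.archive("acct-1")   # "succeeds"...
store.restore("acct-1")   # ...and can be undone
store.delete("acct-1")    # also "succeeds" -- but nothing to restore now
print(store.records)
```

An agent that conflates the two sees identical success signals from both calls. The difference only shows up when someone asks for the data back.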
Trust but verify
The fix here isn't just "document it better." Documentation helps, but the real defense is verification.
After any state-changing operation, verify the outcome. Not the API response. The actual state.
```bash
# After moving, verify board state:
gh project item-list 12 --owner vibeacademy --format json \
  | jq '.items[] | select(.content.number == 79) | .status'
# Expected: "Ready"
```

If the agent reports moving a ticket to Ready, make it check that the ticket actually shows up in the Ready column. Not the label. The column.
This catches semantic drift immediately. The agent can't sustain a wrong mental model if it's forced to verify ground truth after every operation.
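In code, "trust but verify" can be a small wrapper (a sketch; the function names are hypothetical): run the state-changing operation, then read back actual state and compare it against the intended outcome.

```python
def apply_and_verify(apply_fn, read_state_fn, expected, retries=1):
    """Run a state-changing op, then check ground truth, not the API response."""
    for _ in range(retries + 1):
        apply_fn()                 # e.g. the gh project item-edit call
        actual = read_state_fn()   # e.g. re-read the board column
        if actual == expected:
            return actual          # verified: reported state matches reality
    raise RuntimeError(f"verification failed: expected {expected!r}, got {actual!r}")

# Toy demonstration: the "operation" adds a label but never touches the column,
# so verification against the column fails immediately.
state = {"labels": [], "column": "Backlog"}
try:
    apply_and_verify(lambda: state["labels"].append("Ready"),
                     lambda: state["column"], "Ready")
except RuntimeError as e:
    print(e)
```

The wrapper turns a silent mental-model error into a loud runtime failure, which is exactly the signal the agent was missing.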
The calibration fix
We implemented three changes:
1. Remove the naming collision. Delete the "Ready" label entirely. The column is what matters for workflow. The label is redundant and creates confusion. Eliminate the overlap.
2. Explicit documentation. Added to the agent's context:
```
Labels (metadata tags) are COMPLETELY SEPARATE from
Project Board Columns (workflow state).
Adding a label does NOT move an item on the board.
Use gh CLI for board operations, never issue update API.
```

All caps, bolded, cannot-miss-this formatting. Because subtle distinctions get lost in context windows.
3. Verification requirement. After moving tickets, the agent must verify board state. Not optional. Required.
After these fixes, the board state matches the agent's reports. Not because the agent got smarter, but because we eliminated the ambiguity, made the distinction explicit, and added verification.
What this means for your agent
If you're building agents that interact with external tools, watch for these patterns:
Similar names create hazards. If two concepts share vocabulary, assume the agent will confuse them until proven otherwise. Either rename one, or document the distinction in bold.
Success doesn't mean correctness. An API can return 200 OK and still not achieve your goal. Make agents verify outcomes, not just operations.
Semantic plausibility is dangerous. The most persistent bugs are the ones that make sense from inside a wrong mental model. The agent won't question its assumptions unless you force verification.
Document the non-obvious. You know that Labels and Columns are separate because you've used GitHub for years. The agent doesn't have that context. Make implicit knowledge explicit.
The human supervisor's job
The agent didn't catch this. It couldn't. From inside its mental model, everything was fine.
I caught it because I looked at the actual board. The agent's reports said one thing. Reality showed another.
This is the human-in-the-loop role for agentic systems. You're not checking whether the code is good. You're checking whether the system's mental model matches reality.
When you see a mismatch—agent says X, reality shows Y—that's not a code bug. That's a calibration failure. The agent has developed a persistent misconception about how tools work.
Your job is to surface that, make it explicit, and build guardrails so it can't persist.
---
If you're building AI agents and want to understand the failure modes that only show up in production, I put together a free study guide covering the fundamentals of agentic systems, tool use, and verification patterns. Get the AI Study Guide →