AXME Cloud March 29, 2026 · 2 min read

Your Agent Has Been Stuck for 3 Hours and Nobody Knows

API call hangs. Agent blocks. No timeout fires. No one gets paged. The default behavior in most agent frameworks is silence. Here's how to fix that.

Your agent calls an external API. The API hangs. Not a timeout - it just stops responding. TCP connection is open, but nothing comes back.

In most frameworks, your agent is now stuck. No error. No timeout. No alert. Just… waiting.

How Agents Handle Stuck Tasks Today

LangGraph:  Agent blocks on tool call. No built-in timeout.
CrewAI:     max_execution_time exists. But no notification. No escalation.
Raw Python: requests.get(url, timeout=30)? Great. Now what?
            You caught the timeout. The agent retried. Still failing.
            Who gets notified? Nobody. What happens next? Nothing.

The problem isn’t detecting the timeout. timeout=30 is one line. The problem is everything after:

Retry - try again, maybe with different parameters
Timeout on retries - don’t retry forever
Notify - tell someone the agent is stuck
Remind - if that person doesn’t respond in 10 minutes, remind them
Escalate - if they still don’t respond, notify someone else
Audit - record what happened, when, who responded

That’s not a timeout. That’s a workflow.

What You Actually Build

# timeout_watcher.py - runs as a separate process
def check_stuck_tasks():
    stuck = db.query(
        "SELECT * FROM agent_tasks WHERE status = 'running' "
        "AND started_at < NOW() - INTERVAL '5 minutes'"
    )
    for task in stuck:
        if task.retry_count < 3:
            retry_task(task)
        else:
            send_slack_alert(task.owner, f"Agent stuck: {task.name}")
            db.update(task.id, status="needs_human")

# escalation_cron.py - runs every 10 minutes
def check_unacknowledged():
    unacked = db.query(
        "SELECT * FROM agent_tasks WHERE status = 'needs_human' "
        "AND notified_at < NOW() - INTERVAL '30 minutes'"
    )
    for task in unacked:
        if task.escalation_level == 0:
            send_slack_alert(task.backup_owner, f"Escalation: {task.name}")
            db.update(task.id, escalation_level=1)
        else:
            page_oncall(f"Agent stuck, nobody responding: {task.name}")

Two cron jobs. Two database queries. Slack integration. PagerDuty integration. Schema for escalation levels. And this is the simplified version.

What This Should Look Like

from axme import AxmeClient, AxmeClientConfig

client = AxmeClient(AxmeClientConfig(api_key=os.environ["AXME_API_KEY"]))

intent_id = client.send_intent({
    "intent_type": "intent.ops.stuck_task.v1",
    "to_agent": "agent://myorg/production/task-runner",
    "payload": {
        "task": "sync_inventory",
        "api_endpoint": "https://api.warehouse.internal/sync",
        "error": "Connection timeout after 30s",
    },
})
result = client.wait_for(intent_id)

The scenario definition handles the rest:

Agent gets 60 seconds to complete the task
If it fails, platform retries (up to 3 attempts)
After all retries fail, escalation fires: human approval gate opens
Human gets notified immediately
Reminder at 3 minutes if no response
Repeat reminder every 30 minutes
30-minute total timeout on human response
Full audit trail of every state transition

No cron jobs. No escalation table. No Slack webhook code.

Timeline: Without vs With

Without AXME:

00:00  Agent calls API
00:30  Timeout. Agent retries.
01:00  Timeout again. Agent retries.
01:30  Timeout again. Agent gives up. Logs error.
...
03:00  Someone notices the dashboard is red.
03:15  Engineer checks logs, finds the stuck task.
03:30  Engineer manually restarts.

With AXME:

00:00  Agent calls API via intent
00:30  Timeout. Platform retries.
01:00  Timeout again. Platform retries.
01:30  Third failure. Platform escalates to human.
01:30  On-call engineer gets notification.
01:33  Reminder (no response yet).
01:35  Engineer approves manual override from phone.
01:35  Agent resumes with override instruction.

The difference: 3 hours of silence vs 1.5 minutes to human response.

Try It

Working example - agent detects stuck task, automatic retry, escalation to human with reminders:

github.com/AxmeAI/agent-timeout-and-escalation

Built with AXME - timeout, escalation, and human approval built into the lifecycle. Alpha - feedback welcome.

Your Agent Has Been Stuck for 3 Hours and Nobody Knows

How Agents Handle Stuck Tasks Today

What You Actually Build

What This Should Look Like

Timeline: Without vs With

Try It

More on AXME Cloud

Your AI Agent Crashed at Step 47. Why Isn't Crash Recovery the Default?

How to Add Human Approval to AI Agent Workflows Without Building It Yourself

Temporal Alternative Without the Cluster and Determinism Constraints