Your Agent Has Been Stuck for 3 Hours and Nobody Knows
API call hangs. Agent blocks. No timeout fires. No one gets paged. The default behavior in most agent frameworks is silence. Here's how to fix that.
Your agent calls an external API. The API hangs. Not a timeout - it just stops responding. TCP connection is open, but nothing comes back.
In most frameworks, your agent is now stuck. No error. No timeout. No alert. Just… waiting.
How Agents Handle Stuck Tasks Today
LangGraph: Agent blocks on tool call. No built-in timeout.
CrewAI: max_execution_time exists. But no notification. No escalation.
Raw Python: requests.get(url, timeout=30)? Great. Now what?
You caught the timeout. The agent retried. Still failing.
Who gets notified? Nobody. What happens next? Nothing.
The problem isn’t detecting the timeout. timeout=30 is one line. The problem is everything after:
- Retry - try again, maybe with different parameters
- Timeout on retries - don’t retry forever
- Notify - tell someone the agent is stuck
- Remind - if that person doesn’t respond in 10 minutes, remind them
- Escalate - if they still don’t respond, notify someone else
- Audit - record what happened, when, who responded
That’s not a timeout. That’s a workflow.
What You Actually Build
# timeout_watcher.py - runs as a separate process
def check_stuck_tasks():
stuck = db.query(
"SELECT * FROM agent_tasks WHERE status = 'running' "
"AND started_at < NOW() - INTERVAL '5 minutes'"
)
for task in stuck:
if task.retry_count < 3:
retry_task(task)
else:
send_slack_alert(task.owner, f"Agent stuck: {task.name}")
db.update(task.id, status="needs_human")
# escalation_cron.py - runs every 10 minutes
def check_unacknowledged():
unacked = db.query(
"SELECT * FROM agent_tasks WHERE status = 'needs_human' "
"AND notified_at < NOW() - INTERVAL '30 minutes'"
)
for task in unacked:
if task.escalation_level == 0:
send_slack_alert(task.backup_owner, f"Escalation: {task.name}")
db.update(task.id, escalation_level=1)
else:
page_oncall(f"Agent stuck, nobody responding: {task.name}")
Two cron jobs. Two database queries. Slack integration. PagerDuty integration. Schema for escalation levels. And this is the simplified version.
What This Should Look Like
from axme import AxmeClient, AxmeClientConfig
client = AxmeClient(AxmeClientConfig(api_key=os.environ["AXME_API_KEY"]))
intent_id = client.send_intent({
"intent_type": "intent.ops.stuck_task.v1",
"to_agent": "agent://myorg/production/task-runner",
"payload": {
"task": "sync_inventory",
"api_endpoint": "https://api.warehouse.internal/sync",
"error": "Connection timeout after 30s",
},
})
result = client.wait_for(intent_id)
The scenario definition handles the rest:
- Agent gets 60 seconds to complete the task
- If it fails, platform retries (up to 3 attempts)
- After all retries fail, escalation fires: human approval gate opens
- Human gets notified immediately
- Reminder at 3 minutes if no response
- Repeat reminder every 30 minutes
- 30-minute total timeout on human response
- Full audit trail of every state transition
No cron jobs. No escalation table. No Slack webhook code.
Timeline: Without vs With
Without AXME:
00:00 Agent calls API
00:30 Timeout. Agent retries.
01:00 Timeout again. Agent retries.
01:30 Timeout again. Agent gives up. Logs error.
...
03:00 Someone notices the dashboard is red.
03:15 Engineer checks logs, finds the stuck task.
03:30 Engineer manually restarts.
With AXME:
00:00 Agent calls API via intent
00:30 Timeout. Platform retries.
01:00 Timeout again. Platform retries.
01:30 Third failure. Platform escalates to human.
01:30 On-call engineer gets notification.
01:33 Reminder (no response yet).
01:35 Engineer approves manual override from phone.
01:35 Agent resumes with override instruction.
The difference: 3 hours of silence vs 1.5 minutes to human response.
Try It
Working example - agent detects stuck task, automatic retry, escalation to human with reminders:
github.com/AxmeAI/agent-timeout-and-escalation
Built with AXME - timeout, escalation, and human approval built into the lifecycle. Alpha - feedback welcome.