Blog Post #35: Graceful Failure: Handling Errors and Exceptions Within Your Agent’s Tools

Your shiny new weather agent, connected to a live API, works perfectly… until it doesn’t. A user asks for the weather in “Narnia.” The external API, quite correctly, returns a 404 Not Found error. Your Python script, not expecting this, raises an unhandled exception, and your entire agent crashes in a blaze of red error text.

A brilliant agent that shatters at the first sign of trouble isn’t very intelligent. In the real world, things go wrong: APIs go down, users provide nonsensical input, network connections fail, and API keys expire. A production-grade agent must be resilient. It must be designed to fail gracefully.

Graceful failure is the practice of anticipating errors within your tools, catching them, and returning a clear, descriptive message back to the agent as an Observation. This is a superpower. A well-crafted error message isn’t a dead end; it’s information that allows the agent to reason about the failure and self-correct.


The Anatomy of a Brittle Tool

Let’s look at a “naive” version of our weather tool that lacks proper error handling.

# A brittle, unsafe function - DO NOT USE
import requests
from langchain_core.tools import tool

@tool
def get_live_weather_brittle(city: str) -> str:
    """Retrieves the current weather."""
    api_key = "..."
    url = f"https://api.openweathermap.org/data/2.5/weather?q={city}&appid={api_key}&units=metric"
    
    response = requests.get(url)
    # This line will throw an exception and CRASH the agent if the API returns a 4xx or 5xx error.
    response.raise_for_status() 
    
    data = response.json()
    return f"The weather in {city} is {data['main']['temp']}°C."

This tool is a ticking time bomb. It will crash if:

  • The user provides a fake city name like “Narnia” (404 error).
  • The API key is invalid (401 error).
  • The API service is temporarily down (503 error).
  • The server’s internet connection fails (ConnectionError).

Building a Resilient Tool with try...except

The solution is to wrap our risky operations—anything that interacts with the outside world—in a try...except block. This is our safety net. It allows us to “catch” exceptions and handle them elegantly instead of crashing.

Let’s refactor our tool from the last post to be truly robust.

# tools.py - A robust, production-ready tool
import os
import requests
from langchain_core.tools import tool
from typing import Literal

WEATHER_API_KEY = os.getenv("OPENWEATHERMAP_API_KEY")

@tool
def get_live_weather(city: str, units: Literal["metric", "imperial"] = "metric") -> str:
    """
    Retrieves the current, real-time weather for a specified city from OpenWeatherMap.
    """
    # ... (Docstring from previous post) ...

    # --- Start of our safety net ---
    try:
        base_url = "https://api.openweathermap.org/data/2.5/weather"
        params = {"q": city, "appid": WEATHER_API_KEY, "units": units}
        
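        # A timeout keeps a stalled connection from hanging the agent; 10 seconds is an illustrative value.
        response = requests.get(base_url, params=params, timeout=10)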
        
        # Check for HTTP errors and raise them to be caught by our except block
        response.raise_for_status()
        
        data = response.json()
        
        description = data['weather'][0]['description']
        temperature = data['main']['temp']
        unit_symbol = "°C" if units == "metric" else "°F"
        
        return f"The current weather in {city} is {temperature}{unit_symbol} with {description}."

    # --- Catching specific, expected errors first ---
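    except requests.exceptions.HTTPError as http_err:
        # Read the status from the exception itself, and avoid echoing http_err:
        # its message contains the full request URL, including the appid key.
        status_code = http_err.response.status_code
        if status_code == 404:
            return f"Error: The city '{city}' was not found. Please ask the user to check the spelling."
        elif status_code == 401:
            return "Error: Invalid API key provided. Please check the server configuration."
        else:
            return f"Error: The weather service returned HTTP status {status_code}."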
            
    # --- Catching broader network errors ---
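    except requests.exceptions.RequestException as req_err:
        # Report only the exception type; the full message can include the request URL and API key.
        return f"Error: A network error occurred ({type(req_err).__name__}). Please try again later."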
        
    # --- A final, general catch-all for any other unexpected problem ---
    except Exception as e:
        return f"Error: An unexpected error occurred: {e}"
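
The status-code branches in the tool above are easier to unit test if they live in a small pure function. The sketch below is one possible way to do that, not part of the original tool: describe_http_error is a hypothetical helper name, and the wording of the 5xx and fallback messages is an assumption.

```python
def describe_http_error(status_code: int, city: str) -> str:
    """Map an HTTP status code to a message the agent can act on.

    Hypothetical helper: the name and the 5xx/fallback wording are illustrative.
    """
    if status_code == 404:
        return f"Error: The city '{city}' was not found. Please ask the user to check the spelling."
    if status_code == 401:
        return "Error: Invalid API key provided. Please check the server configuration."
    if 500 <= status_code < 600:
        # Server-side failures such as 503 are usually temporary.
        return f"Error: The weather service is temporarily unavailable (HTTP {status_code}). Try again later."
    return f"Error: The weather service returned HTTP {status_code}."


print(describe_http_error(404, "Narnia"))
print(describe_http_error(503, "Paris"))
```

Inside the HTTPError branch, the tool could then return describe_http_error(http_err.response.status_code, city), and the mapping can be tested without making a single network call.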

The Agent That Learns from Mistakes

Now, let’s see why this is so powerful. When the agent uses our robust tool with bad input, it doesn’t crash. It gets a useful observation and can decide what to do next.

User Query: "What's the weather in the lovely land of Narnia?"

Turn 1: Action and Observation

  • Thought: The user is asking for the weather in a place called “Narnia.” I have a tool, get_live_weather, that can find this. I will call it with the city parameter set to “Narnia”.
  • Action: get_live_weather(city='Narnia')
  • Observation: (Our Python function runs, hits the 404 error, and returns the message) Error: The city 'Narnia' was not found. Please ask the user to check the spelling.

Turn 2: Self-Correction

The agent now has this error message as its observation. It feeds this back into the LLM for its next thought.

  • Thought: My attempt to find the weather for “Narnia” failed because the tool reported that the city was not found. Narnia is likely a fictional place. I should inform the user of this and ask for a real city.
  • Action: finish("I'm sorry, I couldn't find any weather information for 'Narnia' as it seems to be a fictional place. Could you please provide the name of a real city?")

This is the magic of graceful failure. The error became part of the solution. The agent was able to reason about its own failure and formulate a helpful, intelligent response.
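
The two turns above can be simulated in a few lines. This is only a toy sketch: get_live_weather_stub, scripted_policy, and run_turn are hypothetical names, the stub replaces the real API call, and a simple string check stands in for the LLM's reasoning, which in a real agent would read the observation text rather than test a prefix.

```python
from typing import Callable


def get_live_weather_stub(city: str) -> str:
    """Stand-in for the real tool: knows one city, reports an error for the rest."""
    known = {"London": "The current weather in London is 15°C with light rain."}
    if city not in known:
        return f"Error: The city '{city}' was not found. Please ask the user to check the spelling."
    return known[city]


def scripted_policy(city: str, observation: str) -> str:
    """Stand-in for the LLM's next thought: branch on the observation text."""
    if observation.startswith("Error:"):
        return (
            f"I'm sorry, I couldn't find weather information for '{city}'. "
            "Could you provide the name of a real city?"
        )
    return observation


def run_turn(city: str, tool: Callable[[str], str]) -> str:
    """Run one Action -> Observation -> answer cycle."""
    observation = tool(city)  # the tool never raises, it always returns a string
    return scripted_policy(city, observation)


print(run_turn("Narnia", get_live_weather_stub))
print(run_turn("London", get_live_weather_stub))
```

The point of the simulation is that the loop has no try...except of its own: because the tool always returns a string, the failure arrives as ordinary input to the next step.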

Conclusion

Robustness is not an optional feature; it’s a core requirement for any serious application. When building tools for your agents, always design for failure, not just for the happy path.

  1. Wrap all risky operations (API calls, file I/O, database queries) in try...except blocks.
  2. Catch specific exceptions first, then more general ones.
  3. Return clear, descriptive, human-readable error messages as strings.
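
These three rules can be packaged once and reused across tools. The decorator below is a minimal sketch under stated assumptions: fail_gracefully and read_note are hypothetical names, and file I/O is used instead of an API call so the example runs without any third-party packages.

```python
import functools
from typing import Callable


def fail_gracefully(func: Callable[..., str]) -> Callable[..., str]:
    """Turn exceptions raised by a tool into 'Error: ...' strings (hypothetical helper)."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs) -> str:
        try:
            return func(*args, **kwargs)
        # Specific first: FileNotFoundError is a subclass of OSError, so this
        # clause would never run if the OSError clause came before it.
        except FileNotFoundError as err:
            return f"Error: File not found: {err.filename}"
        except OSError as err:
            return f"Error: An I/O error occurred ({type(err).__name__})."
        except Exception as err:
            return f"Error: An unexpected error occurred ({type(err).__name__})."

    return wrapper


@fail_gracefully
def read_note(path: str) -> str:
    """Example tool: return the contents of a text file."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


print(read_note("/no/such/file.txt"))
```

Because functools.wraps preserves the wrapped function's name and docstring, a decorator like this should be able to sit underneath a tool decorator, though that combination is an assumption to verify against the framework version you use.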

An agent’s intelligence isn’t just measured by how it succeeds when everything works, but by how it understands and recovers when things go wrong. By building tools that fail gracefully, you are directly enhancing your agent’s ability to reason, self-correct, and navigate the messy, unpredictable real world.

Author

Debjeet Bhowmik

Experienced Cloud & DevOps Engineer with hands-on experience in AWS, GCP, Terraform, Ansible, ELK, Docker, Git, GitLab, Python, PowerShell, Shell, and theoretical knowledge on Azure, Kubernetes & Jenkins. In my free time, I write blogs on ckdbtech.com
