You’ve built a complex, multi-step agent. Remember the Plan-and-Execute agent from Post #36? It had a planner, an executor, and multiple tools to solve a logic puzzle. When you ran it with verbose=True, the output was a long, linear scroll of text. Now imagine that for one step, the calculator tool was passed the wrong numbers. How would you pinpoint the exact input it received? Scrolling through that wall of text is like trying to debug a jumbo jet by looking at its exhaust fumes.
There has to be a better way.
As LLM-powered applications have matured and LLMOps practices have taken shape, a new class of LLM observability platforms has emerged. These tools give you a clear, interactive “X-ray” into your agent’s mind. The most prominent in the LangChain ecosystem is LangSmith. This post will show you how to take the exact agent you’ve already built and use LangSmith to transform your ability to debug, evaluate, and monitor it.
Part 1: Getting Started – Instrumenting Your Plan-and-Execute Agent
One of the most powerful features of LangSmith is its effortless integration. You don’t need to rewrite your code.
Step 1: Sign Up and Get Your API Key
- Go to the LangSmith website and create a free account.
- Navigate to your settings and create a new API key.
Step 2: Set Your Environment Variables
Open the .env file in the project folder for your Plan-and-Execute agent (from Post #36) and add the following lines:
# .env file
# Add these three lines for LangSmith Tracing
LANGCHAIN_TRACING_V2="true"
LANGCHAIN_API_KEY="ls__your_langsmith_api_key_goes_here"
LANGCHAIN_PROJECT="Plan and Execute Agent" # This groups runs into a project
Step 3: Run Your Existing Agent Code
That’s it. Take your main.py file from the Plan-and-Execute agent tutorial and run it again. LangSmith’s integration will automatically capture every step of the execution. Now, head over to your LangSmith project dashboard to see the magic.
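If your main.py loads the .env file with python-dotenv, nothing else changes. Here is a minimal sketch; the agent construction is assumed to live in your Post #36 code, and my_agent is a hypothetical module name:

```python
# main.py — the agent code itself is unchanged; only the .env entries are new.
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # picks up LANGCHAIN_TRACING_V2, LANGCHAIN_API_KEY, LANGCHAIN_PROJECT

from my_agent import agent  # hypothetical module exposing the Post #36 Plan-and-Execute agent

result = agent.invoke({
    "input": (
        "Find the current year, and then use it to determine the age of a person "
        "born in 1997. Finally, tell me if that age is a prime number."
    )
})
print(result)
```

Every run executed with these variables set is traced automatically; there is no explicit LangSmith import anywhere in the agent code.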
Pillar 1: Tracing – The Interactive Debugger for Your Plan
The flat verbose=True output now appears in LangSmith as a structured, hierarchical “trace”.
For our Plan-and-Execute agent, the trace is a game-changer. You will see:
- Root Run: The top-level PlanAndExecute chain.
- Level 1 Children: The planner chain and the executor chain, showing you exactly what plan was created.
- Level 2 Children: Under the executor, you will see each distinct Tool Call: get_current_year, calculator, and is_prime (rough definitions of these tools are sketched below).
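For context, the three tools from Post #36 look roughly like the sketch below; treat the exact names and signatures as an illustrative reconstruction rather than the original code:

```python
import datetime

from langchain_core.tools import tool


@tool
def get_current_year() -> int:
    """Return the current calendar year."""
    return datetime.date.today().year


@tool
def calculator(expression: str) -> str:
    """Evaluate a simple arithmetic expression such as '2025 - 1997'."""
    return str(eval(expression))  # fine for a toy demo; avoid eval with untrusted input


@tool
def is_prime(number: int) -> str:
    """State whether the given number is prime."""
    if number < 2:
        return f"{number} is not a prime number."
    for divisor in range(2, int(number ** 0.5) + 1):
        if number % divisor == 0:
            return f"{number} is not a prime number."
    return f"{number} is a prime number."
```

Each tool-decorated function shows up as its own node in the trace, with the parsed input it received and the string it returned.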
This is the “aha!” moment. Remember how the agent had to calculate the age (28) before checking if it was prime? With LangSmith, you can click specifically on the is_prime tool call in the trace and see its exact input ({'number': 28}) and its output ('28 is not a prime number.'). This level of isolation and clarity is impossible to get from a flat text log. It turns debugging from a guessing game into a precise investigation.
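The same data is also available outside the UI: the langsmith SDK can pull runs from a project. A minimal sketch, assuming the project name from the .env file above:

```python
from langsmith import Client

client = Client()  # reads LANGCHAIN_API_KEY from the environment

# Look through recent tool-level runs in the project for the is_prime call.
for run in client.list_runs(project_name="Plan and Execute Agent", run_type="tool"):
    if run.name == "is_prime":
        print(run.inputs)   # e.g. {'number': 28}
        print(run.outputs)  # e.g. the "... is not a prime number." result
        break
```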
Pillar 2: Evaluation – Proving Your Agent is Good
Our agent correctly solved the prime number problem for a person born in 1997. But how do we ensure it works for other years? How do we know if a change to our planner’s prompt made it better or worse? We need systematic evaluation.
Step 1: Create a Dataset from Your Trace
In LangSmith, find the successful trace of your agent’s run. With a few clicks, you can “Add to Dataset.” This saves the run as a test case.
- Input: "Find the current year, and then use it to determine the age of a person born in 1997. Finally, tell me if that age is a prime number."
- Reference Output (Ground Truth): “The person’s age is 28, which is not a prime number.”

You can now manually add other examples to this dataset, like a case for someone born in 2002, or build the same dataset programmatically with the SDK, as sketched below.
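A sketch of that programmatic route, assuming a current year of 2025 (so the 1997 case gives 28 and the 2002 case gives 23) and a made-up dataset name:

```python
from langsmith import Client

client = Client()

# Create the dataset and add two examples: the original trace plus a 2002 case.
dataset = client.create_dataset(dataset_name="plan-and-execute-eval")
client.create_examples(
    inputs=[
        {"input": "Find the current year, and then use it to determine the age of a "
                  "person born in 1997. Finally, tell me if that age is a prime number."},
        {"input": "Find the current year, and then use it to determine the age of a "
                  "person born in 2002. Finally, tell me if that age is a prime number."},
    ],
    outputs=[
        {"output": "The person's age is 28, which is not a prime number."},
        {"output": "The person's age is 23, which is a prime number."},
    ],
    dataset_id=dataset.id,
)
```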
Step 2: Run an Evaluator
LangSmith allows you to run your PlanAndExecute agent over this entire dataset. For each row, an “evaluator” can score the result. For example, you can use an LLM-as-judge to check for “Correctness” by comparing the agent’s output to your reference output.
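As a rough sketch of what an evaluation run looks like in code, here is the dataset above scored with a simple custom correctness check rather than a full LLM-as-judge; the my_agent import and the wrapper function are hypothetical:

```python
from langsmith.evaluation import evaluate  # langsmith >= 0.1; the API may shift between versions

from my_agent import agent  # hypothetical module exposing the Post #36 agent


def correctness(run, example) -> dict:
    """Crude check: does the prediction contain the reference verdict about primality?"""
    prediction = str(run.outputs.get("output", ""))
    reference = example.outputs["output"]
    verdict = "not a prime" if "not a prime" in reference else "is a prime"
    return {"key": "correctness", "score": int(verdict in prediction)}


def run_agent(inputs: dict) -> dict:
    return {"output": agent.invoke({"input": inputs["input"]})["output"]}


results = evaluate(
    run_agent,
    data="plan-and-execute-eval",        # the dataset created earlier
    evaluators=[correctness],
    experiment_prefix="baseline-planner",
)
```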
This is transformative for agent development. You can now A/B test a new planner prompt. Run the agent with prompt_A on the dataset and get your scores. Then, run it with prompt_B. LangSmith gives you a data-driven way to prove your changes are actual improvements.
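A sketch of that comparison, reusing the evaluate call and correctness evaluator from above; build_agent, prompt_A, and prompt_B are hypothetical stand-ins for however your Post #36 code constructs the planner:

```python
# Run the same dataset once per planner prompt; LangSmith shows both experiments side by side.
for label, planner_prompt in [("planner-prompt-A", prompt_A), ("planner-prompt-B", prompt_B)]:
    variant = build_agent(planner_prompt=planner_prompt)  # hypothetical factory function
    evaluate(
        lambda inputs, a=variant: {"output": a.invoke({"input": inputs["input"]})["output"]},
        data="plan-and-execute-eval",
        evaluators=[correctness],
        experiment_prefix=label,
    )
```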
Pillar 3: Monitoring – Your Agent in the Wild
Once deployed, LangSmith continues to collect traces, populating a monitoring dashboard that gives you a high-level view of your agent’s health.
Imagine we deploy our Plan-and-Execute agent as a “Fun Facts” API. The monitoring dashboard would let us see:
- Latency & Cost: What is the average token cost per query? Is the get_current_year tool slowing down the whole chain?
- Error Rates: Is the calculator tool failing for certain types of inputs?
- User Feedback: You can attach user feedback (e.g., a thumbs up/down) to a specific trace (a minimal SDK sketch follows this list). When a user complains that the agent gave the wrong age for someone born in 2010, you can filter for their session, find the exact trace, and see precisely what went wrong. You can debug a live user issue in minutes.
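Attaching that thumbs up/down programmatically is one SDK call. A minimal sketch, where run_id is assumed to have been captured by your API layer when the trace was created:

```python
from langsmith import Client

client = Client()

# run_id is hypothetical here: your serving code would record it per request.
client.create_feedback(
    run_id,
    key="user_score",
    score=0,  # 0 = thumbs down, 1 = thumbs up
    comment="Wrong age reported for someone born in 2010.",
)
```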
Conclusion
We took our familiar Plan-and-Execute agent, and without changing a single line of code, we gained a powerful debugging UI, a systematic evaluation framework, and a production monitoring dashboard. This is the leap from building a script to engineering a reliable application.
Observability isn’t a luxury; it’s a necessity for building high-quality AI systems. Tools like LangSmith provide the “eyes and ears” you need to truly understand, debug, and improve your agents, turning a black box into a glass box.
Author

Experienced Cloud & DevOps Engineer with hands-on experience in AWS, GCP, Terraform, Ansible, ELK, Docker, Git, GitLab, Python, PowerShell, and Shell, plus theoretical knowledge of Azure, Kubernetes & Jenkins. In my free time, I write blogs on ckdbtech.com