How to Track Agent Behavior Using Pylar Evals

by Hoshang Mehta

You've built your MCP tools and connected them to agents. Everything seems to be working—agents are calling your tools, getting results, and answering questions. But here's what I've learned: you can't improve what you don't measure. Without visibility into how agents are actually using your tools, you're flying blind.

Pylar Evals gives you that visibility. It's your window into production agent behavior—showing you exactly what queries agents are running, which ones succeed, which ones fail, and what patterns emerge over time. This tutorial will walk you through setting up evals, understanding the metrics, and using insights to improve your tools. By the end, you'll know how to track query patterns, identify anomalies, and continuously improve agent performance.


Why This Approach Works

Before we dive into the dashboard, let me explain why monitoring agent behavior matters and how Evals fits into Pylar's approach.

The Problem: Building Tools Without Feedback

Most teams I've worked with build tools, publish them, and hope they work. They don't have visibility into:

  • Which tools agents use most frequently
  • What queries agents actually run
  • Why some queries fail
  • How query patterns change over time
  • Whether optimizations actually help

The result? Teams build tools based on assumptions, not real usage. They optimize the wrong things, miss critical errors, and can't improve because they don't know what needs improving.

Why Evals Solve This

Evals provide comprehensive insights into how AI agents interact with your MCP tools. They give you:

  • Performance metrics: Success rates, error rates, usage volume
  • Query patterns: What queries agents run most frequently
  • Error analysis: Why queries fail and how often
  • Time-series trends: How performance changes over time
  • Raw execution logs: Every tool call with full context

This visibility enables a feedback loop:

  1. Monitor: See how agents use your tools
  2. Identify: Find errors, slow queries, or patterns
  3. Refine: Update tools or views based on insights
  4. Verify: Check Evals again to confirm improvements
  5. Repeat: Continuously improve based on real usage

Think of Evals as your observability layer for AI agents. Just like you monitor application performance, you need to monitor agent behavior.

How Evals Fit into the Bigger Picture

Evals are the final piece of Pylar's "Data → Views → Tools → Agents" flow:

  1. Data: Connect your data sources
  2. Views: Build governed SQL views
  3. Tools: Create MCP tools that wrap your views
  4. Agents: Agents call your tools
  5. Evals: Monitor how agents use your tools and optimize

The key insight: Evals close the feedback loop. They show you how agents actually use your tools, so you can refine them to better match real needs.


Step-by-Step Guide

Let's walk through using Evals to track agent behavior. I'll show you how to access the dashboard, understand the metrics, and use insights to improve your tools.

Prerequisites

Before we start, make sure you have:

  • ✅ MCP tools created and published (if you don't have any, follow the previous tutorials)
  • ✅ Agents connected and using your tools (Evals only show data when tools are actually being used)
  • ✅ A Pylar project with tools deployed

If you haven't published tools yet, do that first. Evals need actual usage data to be useful.

Step 1: Access the Evals Dashboard

The Evals dashboard is your command center for monitoring agent behavior.

  1. Navigate to Your Project: In Pylar, go to the project that contains your MCP tools.

  2. Click "Eval": In the top-right corner of the screen, you'll see an "Eval" button. Click it.

  3. Dashboard Opens: The Evaluation Dashboard opens, showing comprehensive metrics for all your MCP tools.

Evals dashboard showing tool performance metrics

What you'll see: The dashboard is organized into several sections:

  • Filters: Select which MCP tool to review
  • Summary Metrics: High-level performance indicators
  • Visual Insights: Time-series graphs showing trends
  • Error Analysis: Detailed error breakdown
  • Raw Logs: Complete records of all tool calls

Step 2: Understand Summary Metrics

The summary metrics give you a quick overview of tool performance. Let's break down what each one means.

Total Count

What it is: The total number of times the selected MCP tool was invoked.

What it tells you: Overall usage volume—how frequently agents are using this tool.

Example: If Total Count is 1,000, agents have called this tool 1,000 times.

What to look for:

  • High count = tool is being used frequently (good sign)
  • Low count = tool might not be useful or agents can't find it
  • Sudden drops = something might be broken

Success Count

What it is: How many invocations returned a valid result.

What it tells you: The absolute number of successful tool calls. Higher is better.

Example: If Success Count is 950, 950 out of 1,000 calls succeeded.

Error Count

What it is: How many invocations failed to return a result.

What it tells you: The absolute number of failed tool calls. Lower is better.

Example: If Error Count is 50, 50 out of 1,000 calls failed.

Success Rate

Calculation:

Success Rate = (Success Count ÷ Total Count) × 100

What it is: Percentage of successful tool invocations.

What it tells you:

  • High success rate (90%+): Tool is working well
  • Medium success rate (70-90%): Some issues, needs attention
  • Low success rate (less than 70%): Significant problems, needs immediate attention

Example: If Success Count is 950 and Total Count is 1,000, Success Rate is 95%—excellent.

Target: Aim for success rates above 90%. If your success rate is below this threshold, investigate errors to understand what's going wrong.

Error Rate

Calculation:

Error Rate = (Error Count ÷ Total Count) × 100

What it is: Percentage of failed tool invocations.

What it tells you:

  • Low error rate (less than 10%): Tool is reliable
  • Medium error rate (10-30%): Some reliability issues
  • High error rate (greater than 30%): Major problems, needs fixing

Example: If Error Count is 50 and Total Count is 1,000, Error Rate is 5%—acceptable.

Warning: High error rates indicate problems that are affecting agent performance. Address these issues promptly to improve agent experience.
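
If you ever want to sanity-check these numbers yourself, the arithmetic is easy to reproduce in SQL. This is a minimal sketch, assuming you've copied raw log rows (see Step 6) into a hypothetical tool_logs table with tool_name and status columns; Pylar computes these metrics for you, so the query is purely illustrative:

-- Recompute total, success, and error counts plus both rates for one tool
SELECT
  COUNT(*) AS total_count,
  SUM(CASE WHEN status = 'Success' THEN 1 ELSE 0 END) AS success_count,
  SUM(CASE WHEN status <> 'Success' THEN 1 ELSE 0 END) AS error_count,
  ROUND(100.0 * SUM(CASE WHEN status = 'Success' THEN 1 ELSE 0 END) / COUNT(*), 1) AS success_rate_pct,
  ROUND(100.0 * SUM(CASE WHEN status <> 'Success' THEN 1 ELSE 0 END) / COUNT(*), 1) AS error_rate_pct
FROM tool_logs
WHERE tool_name = 'fetch_engagement_scores_by_event_type';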

Step 3: Analyze Visual Insights

The dashboard includes time-series graphs that show how metrics change over time. These are crucial for understanding trends and patterns.

Calls/Success/Errors Graph

What it shows: A time-series plot displaying:

  • Total Calls: How many times the tool was invoked over time
  • Successes: Successful invocations over time
  • Errors: Failed invocations over time

What it tells you:

  • Usage trends: Are agents using the tool more or less over time?
  • Performance trends: Is success rate improving or declining?
  • Error patterns: When do errors occur most frequently?

How to use it:

  • Look for trends: Are errors increasing or decreasing?
  • Identify patterns: Do errors spike at certain times?
  • Compare periods: How does current performance compare to past performance?

Example: If you see errors spiking every Monday morning, that might indicate a weekly data refresh issue.

Success/Error Rate (%) Graph

What it shows: Success and error percentages over time, as a time-series trend.

What it tells you:

  • Performance stability: Is your tool's performance consistent over time?
  • Improvement or degradation: Is the tool getting better or worse?
  • Correlation with changes: Did a recent change improve or hurt performance?

How to use it:

  • Monitor trends: Is success rate improving?
  • Spot anomalies: Are there sudden drops in performance?
  • Track improvements: Did your changes improve performance?

Example: If you see a sudden drop in success rate after deploying a tool update, that update likely introduced a bug.

Pro tip: Use these graphs to understand not just current performance, but also trends and patterns. This helps you identify issues before they become critical.
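
If you keep a copy of the raw log data somewhere you can query, a rough version of these trend lines is a simple daily roll-up. A sketch, again assuming a hypothetical tool_logs table, here with a called_at timestamp column:

-- Daily totals, successes, errors, and success rate for one tool
SELECT
  CAST(called_at AS DATE) AS day,
  COUNT(*) AS total_calls,
  SUM(CASE WHEN status = 'Success' THEN 1 ELSE 0 END) AS successes,
  SUM(CASE WHEN status <> 'Success' THEN 1 ELSE 0 END) AS errors,
  ROUND(100.0 * SUM(CASE WHEN status = 'Success' THEN 1 ELSE 0 END) / COUNT(*), 1) AS success_rate_pct
FROM tool_logs
WHERE tool_name = 'fetch_engagement_scores_by_event_type'
GROUP BY CAST(called_at AS DATE)
ORDER BY day;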

Step 4: Analyze Errors

The Error Analysis section helps you understand what's going wrong with your tools.

Error Explorer

The Error Explorer lists all error codes that occurred and how often each one happened.

Common error codes:

  • 400: Bad Request - Invalid parameters or query syntax
  • 404: Not Found - Resource doesn't exist
  • 500: Internal Server Error - Server-side issues
  • Timeout: Query execution exceeded time limit
  • Permission: Access denied or insufficient permissions

Example Error Explorer:

Error Code    Frequency
400           3
500           1
Timeout       2

This shows:

  • Error code 400 occurred 3 times
  • Error code 500 occurred 1 time
  • Timeout errors occurred 2 times

What to do: Focus on fixing high-frequency errors first, as they have the biggest impact.
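
The same breakdown is easy to reproduce if you keep raw log data somewhere queryable. A minimal sketch, assuming a hypothetical tool_logs table with status and error_code columns:

-- Count failures by error code for the selected tool, most frequent first
SELECT error_code, COUNT(*) AS frequency
FROM tool_logs
WHERE tool_name = 'fetch_engagement_scores_by_event_type'
  AND status <> 'Success'
GROUP BY error_code
ORDER BY frequency DESC;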

Understanding Error Types

Error Code 400: Bad Request

What it means: The request was invalid.

Common causes:

  • Invalid parameter values
  • SQL syntax errors in query
  • Parameter type mismatches
  • Missing required parameters

How to fix:

  1. Check parameter definitions match query placeholders
  2. Verify parameter types are correct
  3. Test with valid parameter values
  4. Review SQL query syntax
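
As a concrete illustration of fixes 1-3, two common 400 causes I run into are text parameters injected without quotes and numeric parameters that arrive as strings. A sketch using the {parameter} placeholder style used elsewhere in this tutorial (the column names are illustrative):

-- Fragile: breaks when the agent sends unquoted text or a number as a string
WHERE event_type = {event_type} AND customer_id = {customer_id}

-- Safer: quote text parameters and cast numeric ones explicitly
WHERE event_type = '{event_type}'
  AND customer_id = CAST('{customer_id}' AS INTEGER)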

Error Code 500: Internal Server Error

What it means: Server-side error occurred.

Common causes:

  • Database connection issues
  • Query execution failures
  • View definition problems
  • Data type mismatches

How to fix:

  1. Verify view queries are correct
  2. Check database connections are active
  3. Test queries manually in SQL IDE
  4. Review error messages in raw logs

Timeout Errors

What it means: Query took too long to execute.

Common causes:

  • Large result sets
  • Complex joins
  • Missing indexes
  • Inefficient queries

How to fix:

  1. Add LIMIT to restrict result size
  2. Optimize query performance
  3. Add indexes to source views
  4. Refine WHERE conditions
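
Here's what fixes 1, 2, and 4 can look like in practice, using a query like the raw-log example later in this tutorial as the starting point (the event_date column and the 30-day window are illustrative assumptions):

-- Before: unbounded scan with a leading-wildcard LIKE, which can time out
SELECT engagement_score FROM table0 WHERE event_type LIKE '%login%' ORDER BY engagement_score DESC

-- After: exact match, a bounded date window, and a cap on the result size
SELECT engagement_score
FROM table0
WHERE event_type = 'login'
  AND event_date >= CURRENT_DATE - INTERVAL '30 days'
ORDER BY engagement_score DESC
LIMIT 100;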

Permission Errors

What it means: Access denied.

Common causes:

  • Insufficient database permissions
  • View access restrictions
  • Token authentication issues

How to fix:

  1. Verify database connection permissions
  2. Check view access controls
  3. Regenerate authentication tokens if needed

Step 5: Understand Query Shapes

Query shapes represent patterns in the types of queries executed by your MCP tools. They show you how agents are interacting with your data.

Accessing Query Shapes

  1. Select the tool you want to analyze (using filters)
  2. Scroll to the Error Analysis section
  3. Find the Query Shape section
  4. Review the query patterns displayed

Common Query Patterns

Query shapes show patterns like:

  • Simple Filters: Queries with single WHERE conditions
  • Date Ranges: Queries filtering by date ranges
  • Multiple Filters: Queries with multiple WHERE conditions
  • Aggregations: Queries with GROUP BY or aggregations
  • Sorting Patterns: How results are ordered

Example Query Shapes:

Pattern 1: Single Filter

Most common: WHERE event_type = '{value}'
Frequency: 45% of queries

Pattern 2: Date Range

Most common: WHERE date >= '{start}' AND date <= '{end}'
Frequency: 30% of queries

Pattern 3: Multiple Filters

Most common: WHERE type = '{type}' AND status = '{status}'
Frequency: 15% of queries

Using Query Shapes to Improve Tools

Scenario 1: Common Filter Pattern

Observation: 60% of queries filter by event_type

Action:

  • Ensure event_type filtering is optimized
  • Consider making event_type a required parameter
  • Add tool description emphasizing event_type filtering

Scenario 2: Date Range Pattern

Observation: Many queries use date ranges

Action:

  • Create a dedicated tool for date range queries
  • Add date range parameters to existing tools
  • Optimize date filtering in queries

Scenario 3: Multiple Parameter Combinations

Observation: Agents frequently combine specific parameters

Action:

  • Create tools optimized for common combinations
  • Add default values for frequently used parameters
  • Update tool descriptions to highlight common use cases

Pro tip: If a query pattern appears frequently (30%+ of queries), consider creating a dedicated tool optimized for that pattern. This can improve performance and agent experience.
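
For example, if the date-range pattern above keeps dominating, the dedicated tool's query could be as simple as this sketch ({start_date} and {end_date} are the new tool's parameters; the column names are illustrative):

-- Summarize events inside an agent-supplied date range
SELECT event_type, COUNT(*) AS event_count
FROM table0
WHERE event_date >= '{start_date}'
  AND event_date <= '{end_date}'
GROUP BY event_type
ORDER BY event_count DESC
LIMIT 100;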

Step 6: Review Raw Logs

Raw Logs provide detailed records of every MCP tool call. They show you exactly what happened—which tools were called, what queries were executed, whether they succeeded, and any errors that occurred.

Accessing Raw Logs

  1. Scroll to the bottom of the Evaluation Dashboard
  2. Find the Raw Logs section
  3. Review the detailed records

Understanding Log Fields

Each log entry contains:

  • Tool Name: Which MCP tool was invoked
  • Query Executed: The actual SQL query that ran (with parameter values injected)
  • Invocation Status: Success or failure
  • Timestamp: When the invocation occurred
  • Error Message: Error details (empty when successful)

Example Log Entry:

Tool: fetch_engagement_scores_by_event_type
Query: SELECT engagement_score FROM table0 WHERE event_type LIKE '%login%' ORDER BY engagement_score DESC
Status: Success
Timestamp: 2024-11-05 14:23:45
Error Message: (empty)

What it tells you:

  • Which tool agents are using
  • Exact query that ran
  • Parameter values that were used
  • Whether it succeeded or failed
  • When it happened

Using Logs for Debugging

Step 1: Identify the Problem

  1. Find failed invocations in logs
  2. Review error messages
  3. Note which tool failed
  4. Check what parameters were used

Step 2: Understand the Context

  1. Look at the query that was executed
  2. Review parameter values
  3. Check timestamp (when did it fail?)
  4. Compare with successful invocations

Step 3: Reproduce the Issue

  1. Copy the query from the log
  2. Test it manually in SQL IDE
  3. Use the same parameter values
  4. See if you can reproduce the error
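
Using the log entry from the example above, reproducing the call is just pasting the Query Executed value into the SQL IDE, optionally with a LIMIT added so a large result set doesn't get in the way:

-- Copied from the Raw Logs "Query Executed" field; LIMIT added for convenience
SELECT engagement_score
FROM table0
WHERE event_type LIKE '%login%'
ORDER BY engagement_score DESC
LIMIT 50;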

Step 4: Fix the Issue

  1. Identify the root cause
  2. Fix the tool or query
  3. Test the fix
  4. Monitor logs to confirm it's resolved

Pro tip: Error messages in raw logs are your best source of information for understanding what went wrong. Always review them when investigating errors.


Real-World Examples

Let me show you how different teams use Evals to improve their tools.

Example 1: Identifying High-Error Tools

A support team has published multiple tools and wants to identify which ones need attention.

Setup:

  1. Open Evals dashboard
  2. Review summary metrics for each tool
  3. Identify tools with error rates above 10%

Finding: One tool (get_customer_tickets) has a 25% error rate.

Investigation:

  1. Filter to get_customer_tickets in Evals
  2. Review Error Explorer: Most errors are code 400 (Bad Request)
  3. Check Raw Logs: Parameter customer_id is being passed as a string, but query expects integer

Fix: Update tool to cast parameter to integer:

WHERE customer_id = CAST('{customer_id}' AS INTEGER)

Result: Error rate drops from 25% to 2% after the fix.

Value: Evals helped identify a specific issue and verify the fix worked.

Example 2: Optimizing Query Patterns

A product team notices their get_user_events tool is slow. They want to optimize it.

Setup:

  1. Open Evals dashboard
  2. Filter to get_user_events
  3. Review Query Shapes section

Finding: 70% of queries filter by event_type and a date range.

Investigation:

  1. Check Raw Logs: Most queries use the same date range (last 30 days)
  2. Review query execution: Queries scan entire table without date filter optimization

Fix:

  1. Add default date range parameter (last 30 days)
  2. Optimize query to filter by date first
  3. Add index on event_type and date columns
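
A sketch of what the optimized tool query might look like after these changes, assuming the view exposes event_type and date columns and the tool gains an {event_type} parameter (names are illustrative):

-- Filter by the default 30-day window first, then by event type, and cap the result
SELECT *
FROM table0
WHERE date >= CURRENT_DATE - INTERVAL '30 days'
  AND event_type = '{event_type}'
ORDER BY date DESC
LIMIT 500;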

Result: Query execution time drops by 60%, and agents get faster responses.

Value: Query shapes revealed the most common pattern, enabling targeted optimization.

Example 3: Tracking Performance Improvements

A sales team updates their get_opportunity_summary tool and wants to verify the update improved performance.

Setup:

  1. Note current success rate before update: 85%
  2. Deploy tool update
  3. Monitor Evals dashboard over next few days

Finding: Success rate improves to 95% after update.

Verification:

  1. Review Success/Error Rate (%) graph: Shows steady improvement
  2. Check Error Explorer: Fewer error codes appearing
  3. Review Raw Logs: Fewer failed invocations

Result: Confirmed the update improved tool reliability.

Value: Evals provided objective metrics to verify improvements, not just assumptions.

Notice how each example uses the same pattern: monitor, identify, fix, verify. Evals enable this feedback loop.


Common Pitfalls & Tips

I've seen teams make these mistakes when using Evals. Here's how to avoid them.

Pitfall 1: Not Monitoring Regularly

Don't set up Evals and forget about them. Regular monitoring is essential.

Why this matters: Issues compound over time. A 5% error rate today can become a 20% error rate next week if you don't catch it early.

How to avoid it:

  • Check Evals at least weekly
  • Set up alerts for high error rates (if available)
  • Review trends, not just current metrics
  • Investigate spikes immediately

Pitfall 2: Ignoring Low-Frequency Errors

Don't dismiss errors that occur occasionally. They often indicate edge cases or boundary conditions.

Why this matters: Low-frequency errors can become high-frequency errors as usage grows. Also, they might indicate data quality issues.

How to avoid it:

  • Investigate all errors, even if they're rare
  • Look for patterns in low-frequency errors
  • Document edge cases and handle them
  • Monitor if frequency increases over time

Pitfall 3: Not Using Query Shapes

Don't just look at success/error rates. Query shapes reveal how agents actually use your tools.

Why this matters: Understanding query patterns helps you optimize the right things. Without them, you might spend time optimizing for a pattern that rarely occurs.

How to avoid it:

  • Review Query Shapes section regularly
  • Identify high-frequency patterns (30%+ of queries)
  • Create specialized tools for common patterns
  • Optimize queries based on actual usage, not assumptions

Pitfall 4: Not Reviewing Raw Logs

Don't rely only on summary metrics. Raw logs provide the details you need to fix issues.

Why this matters: Summary metrics tell you something is wrong, but raw logs tell you why. You can't fix what you don't understand.

How to avoid it:

  • Always review raw logs when investigating errors
  • Look for patterns in error messages
  • Compare failed queries with successful ones
  • Use logs to reproduce and fix issues

Pitfall 5: Not Tracking Improvements

Don't make changes without verifying they helped. Track improvements to confirm your fixes worked.

Why this matters: You might think a change helped, but metrics might show otherwise. Or a change might have unintended side effects.

How to avoid it:

  • Note metrics before making changes
  • Monitor Evals after deploying fixes
  • Use time-series graphs to track trends
  • Verify specific error codes disappear

Best Practices Summary

Here's a quick checklist:

  • Monitor regularly: Check Evals at least weekly
  • Focus on high-frequency errors: Fix the biggest problems first
  • Review query shapes: Understand how agents use your tools
  • Use raw logs: Get details needed to fix issues
  • Track improvements: Verify fixes actually work
  • Set thresholds: Define acceptable error rates (e.g., <10%)
  • Document patterns: Note common issues and fixes
  • Iterate continuously: Use Evals to continuously improve

Next Steps

You've learned how to track agent behavior using Pylar Evals. That's the foundation for continuous improvement. Now you can:

  1. Monitor regularly: Set up a routine to check Evals weekly or daily, depending on your usage volume.

  2. Identify issues: Use summary metrics, error analysis, and query shapes to find problems before they become critical.

  3. Fix and verify: Make improvements based on insights, then verify they worked by monitoring Evals again.

  4. Optimize continuously: Use query patterns to optimize tools for how agents actually use them, not how you think they should use them.

The key is to make Evals part of your regular workflow. Don't just check them when something breaks—monitor proactively to catch issues early and continuously improve.

If you want to keep going, the next step is using Evals insights to improve your tools and views. That's where you'll see the real value—turning observations into optimizations that make agents more reliable and faster.


Frequently Asked Questions

How often should I check Evals?

At least weekly. If your usage volume is high, or you've just shipped a tool or view change, check daily until the metrics settle.

What's a good success rate to aim for?

Aim for 90% or higher. Between 70% and 90% means the tool needs attention; below 70% means it needs immediate attention.

Can I see Evals for multiple tools at once?

The Evaluation Dashboard covers all the MCP tools in your project. Use the Filters section to drill into a single tool when you want to investigate it in detail.

How do I know if a query pattern is worth optimizing?

A good rule of thumb from this tutorial: if a pattern accounts for 30% or more of queries, it's worth optimizing for, possibly with a dedicated tool.

What if I see errors but don't know how to fix them?

Start with the Raw Logs. The error messages there are your best source of information: copy the failing query into the SQL IDE, reproduce the error with the same parameter values, and work back to the root cause (see Step 6).

Can I export Evals data?

How long does Evals data persist?

What if my tools have zero usage?

Evals only show data when agents actually call your tools. Make sure the tools are published, your agents are connected and configured to use them, and the tool descriptions make it clear when agents should reach for them.
