How to Track Agent Behavior Using Pylar Evals

by Hoshang Mehta

You've built your MCP tools and connected them to agents. Everything seems to be working—agents are calling your tools, getting results, and answering questions. But here's what I've learned: you can't improve what you don't measure. Without visibility into how agents are actually using your tools, you're flying blind.

Pylar Evals gives you that visibility. It's your window into production agent behavior—showing you exactly what queries agents are running, which ones succeed, which ones fail, and what patterns emerge over time. This tutorial will walk you through setting up evals, understanding the metrics, and using insights to improve your tools. By the end, you'll know how to track query patterns, identify anomalies, and continuously improve agent performance.


Why This Approach Works

Before we dive into the dashboard, let me explain why monitoring agent behavior matters and how Evals fits into Pylar's approach.

The Problem: Building Tools Without Feedback

Most teams I've worked with build tools, publish them, and hope they work. They don't have visibility into:

  • Which tools agents use most frequently
  • What queries agents actually run
  • Why some queries fail
  • How query patterns change over time
  • Whether optimizations actually help

The result? Teams build tools based on assumptions, not real usage. They optimize the wrong things, miss critical errors, and can't improve because they don't know what needs improving.

Why Evals Solve This

Evals provide comprehensive insights into how AI agents interact with your MCP tools. They give you:

  • Performance metrics: Success rates, error rates, usage volume
  • Query patterns: What queries agents run most frequently
  • Error analysis: Why queries fail and how often
  • Time-series trends: How performance changes over time
  • Raw execution logs: Every tool call with full context

This visibility enables a feedback loop:

  1. Monitor: See how agents use your tools
  2. Identify: Find errors, slow queries, or patterns
  3. Refine: Update tools or views based on insights
  4. Verify: Check Evals again to confirm improvements
  5. Repeat: Continuously improve based on real usage

Think of Evals as your observability layer for AI agents. Just like you monitor application performance, you need to monitor agent behavior.

How Evals Fit into the Bigger Picture

Evals are the final piece of Pylar's "Data → Views → Tools → Agents" flow:

  1. Data: Connect your data sources
  2. Views: Build governed SQL views
  3. Tools: Create MCP tools that wrap your views
  4. Agents: Agents call your tools
  5. Evals: Monitor how agents use your tools and optimize

The key insight: Evals close the feedback loop. They show you how agents actually use your tools, so you can refine them to better match real needs.


Step-by-Step Guide

Let's walk through using Evals to track agent behavior. I'll show you how to access the dashboard, understand the metrics, and use insights to improve your tools.

Prerequisites

Before we start, make sure you have:

  • ✅ MCP tools created and published (if you don't have any, follow the previous tutorials)
  • ✅ Agents connected and using your tools (Evals only show data when tools are actually being used)
  • ✅ A Pylar project with tools deployed

If you haven't published tools yet, do that first. Evals need actual usage data to be useful.

Step 1: Access the Evals Dashboard

The Evals dashboard is your command center for monitoring agent behavior.

  1. Navigate to Your Project: In Pylar, go to the project that contains your MCP tools.

  2. Click "Eval": In the top-right corner of the screen, you'll see an "Eval" button. Click it.

  3. Dashboard Opens: The Evaluation Dashboard opens, showing comprehensive metrics for all your MCP tools.

Evals dashboard showing tool performance metrics

What you'll see: The dashboard is organized into several sections:

  • Filters: Select which MCP tool to review
  • Summary Metrics: High-level performance indicators
  • Visual Insights: Time-series graphs showing trends
  • Error Analysis: Detailed error breakdown
  • Raw Logs: Complete records of all tool calls

Step 2: Understand Summary Metrics

The summary metrics give you a quick overview of tool performance. Let's break down what each one means.

Total Count

What it is: The total number of times the selected MCP tool was invoked.

What it tells you: Overall usage volume—how frequently agents are using this tool.

Example: If Total Count is 1,000, agents have called this tool 1,000 times.

What to look for:

  • High count = tool is being used frequently (good sign)
  • Low count = tool might not be useful or agents can't find it
  • Sudden drops = something might be broken

Success Count

What it is: How many invocations returned a valid result.

What it tells you: The absolute number of successful tool calls. Higher is better.

Example: If Success Count is 950, 950 out of 1,000 calls succeeded.

Error Count

What it is: How many invocations failed to return a result.

What it tells you: The absolute number of failed tool calls. Lower is better.

Example: If Error Count is 50, 50 out of 1,000 calls failed.

Success Rate

Calculation:

Success Rate = (Success Count ÷ Total Count) × 100

What it is: Percentage of successful tool invocations.

What it tells you:

  • High success rate (90%+): Tool is working well
  • Medium success rate (70-90%): Some issues, needs attention
  • Low success rate (less than 70%): Significant problems, needs immediate attention

Example: If Success Count is 950 and Total Count is 1,000, Success Rate is 95%—excellent.

Target: Aim for success rates above 90%. If your success rate is below this threshold, investigate errors to understand what's going wrong.

Error Rate

Calculation:

Error Rate = (Error Count ÷ Total Count) × 100

What it is: Percentage of failed tool invocations.

What it tells you:

  • Low error rate (less than 10%): Tool is reliable
  • Medium error rate (10-30%): Some reliability issues
  • High error rate (greater than 30%): Major problems, needs fixing

Example: If Error Count is 50 and Total Count is 1,000, Error Rate is 5%—acceptable.

Warning: High error rates indicate problems that are affecting agent performance. Address these issues promptly to improve agent experience.
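
If you ever want to sanity-check these numbers yourself, the arithmetic is easy to reproduce in SQL. This is a minimal sketch, assuming you've copied raw log rows (see Step 6) into a hypothetical tool_logs table with tool_name and status columns; Pylar computes these metrics for you, so the query is purely illustrative:

-- Recompute total, success, and error counts plus both rates for one tool
SELECT
  COUNT(*) AS total_count,
  SUM(CASE WHEN status = 'Success' THEN 1 ELSE 0 END) AS success_count,
  SUM(CASE WHEN status <> 'Success' THEN 1 ELSE 0 END) AS error_count,
  ROUND(100.0 * SUM(CASE WHEN status = 'Success' THEN 1 ELSE 0 END) / COUNT(*), 1) AS success_rate_pct,
  ROUND(100.0 * SUM(CASE WHEN status <> 'Success' THEN 1 ELSE 0 END) / COUNT(*), 1) AS error_rate_pct
FROM tool_logs
WHERE tool_name = 'fetch_engagement_scores_by_event_type';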

Step 3: Analyze Visual Insights

The dashboard includes time-series graphs that show how metrics change over time. These are crucial for understanding trends and patterns.

Calls/Success/Errors Graph

What it shows: A time-series plot displaying:

  • Total Calls: How many times the tool was invoked over time
  • Successes: Successful invocations over time
  • Errors: Failed invocations over time

What it tells you:

  • Usage trends: Are agents using the tool more or less over time?
  • Performance trends: Is success rate improving or declining?
  • Error patterns: When do errors occur most frequently?

How to use it:

  • Look for trends: Are errors increasing or decreasing?
  • Identify patterns: Do errors spike at certain times?
  • Compare periods: How does current performance compare to past performance?

Example: If you see errors spiking every Monday morning, that might indicate a weekly data refresh issue.

Success/Error Rate (%) Graph

What it shows: Success and error percentages over time, as a time-series trend.

What it tells you:

  • Performance stability: Is your tool's performance consistent over time?
  • Improvement or degradation: Is the tool getting better or worse?
  • Correlation with changes: Did a recent change improve or hurt performance?

How to use it:

  • Monitor trends: Is success rate improving?
  • Spot anomalies: Are there sudden drops in performance?
  • Track improvements: Did your changes improve performance?

Example: If you see a sudden drop in success rate after deploying a tool update, that update likely introduced a bug.

Pro tip: Use these graphs to understand not just current performance, but also trends and patterns. This helps you identify issues before they become critical.
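
If you keep a copy of the raw log data somewhere you can query, a rough version of these trend lines is a simple daily roll-up. A sketch, again assuming a hypothetical tool_logs table, here with a called_at timestamp column:

-- Daily totals, successes, errors, and success rate for one tool
SELECT
  CAST(called_at AS DATE) AS day,
  COUNT(*) AS total_calls,
  SUM(CASE WHEN status = 'Success' THEN 1 ELSE 0 END) AS successes,
  SUM(CASE WHEN status <> 'Success' THEN 1 ELSE 0 END) AS errors,
  ROUND(100.0 * SUM(CASE WHEN status = 'Success' THEN 1 ELSE 0 END) / COUNT(*), 1) AS success_rate_pct
FROM tool_logs
WHERE tool_name = 'fetch_engagement_scores_by_event_type'
GROUP BY CAST(called_at AS DATE)
ORDER BY day;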

Step 4: Analyze Errors

The Error Analysis section helps you understand what's going wrong with your tools.

Error Explorer

The Error Explorer lists all error codes that occurred and how often each one happened.

Common error codes:

  • 400: Bad Request - Invalid parameters or query syntax
  • 404: Not Found - Resource doesn't exist
  • 500: Internal Server Error - Server-side issues
  • Timeout: Query execution exceeded time limit
  • Permission: Access denied or insufficient permissions

Example Error Explorer:

Error Code    Frequency
400           3
500           1
Timeout       2

This shows:

  • Error code 400 occurred 3 times
  • Error code 500 occurred 1 time
  • Timeout errors occurred 2 times

What to do: Focus on fixing high-frequency errors first, as they have the biggest impact.
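
The same breakdown is easy to reproduce if you keep raw log data somewhere queryable. A minimal sketch, assuming a hypothetical tool_logs table with status and error_code columns:

-- Count failures by error code for the selected tool, most frequent first
SELECT error_code, COUNT(*) AS frequency
FROM tool_logs
WHERE tool_name = 'fetch_engagement_scores_by_event_type'
  AND status <> 'Success'
GROUP BY error_code
ORDER BY frequency DESC;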

Understanding Error Types

Error Code 400: Bad Request

What it means: The request was invalid.

Common causes:

  • Invalid parameter values
  • SQL syntax errors in query
  • Parameter type mismatches
  • Missing required parameters

How to fix:

  1. Check parameter definitions match query placeholders
  2. Verify parameter types are correct
  3. Test with valid parameter values
  4. Review SQL query syntax
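
As a concrete illustration of fixes 1-3, two common 400 causes I run into are text parameters injected without quotes and numeric parameters that arrive as strings. A sketch using the {parameter} placeholder style used elsewhere in this tutorial (the column names are illustrative):

-- Fragile: breaks when the agent sends unquoted text or a number as a string
WHERE event_type = {event_type} AND customer_id = {customer_id}

-- Safer: quote text parameters and cast numeric ones explicitly
WHERE event_type = '{event_type}'
  AND customer_id = CAST('{customer_id}' AS INTEGER)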

Error Code 500: Internal Server Error

What it means: Server-side error occurred.

Common causes:

  • Database connection issues
  • Query execution failures
  • View definition problems
  • Data type mismatches

How to fix:

  1. Verify view queries are correct
  2. Check database connections are active
  3. Test queries manually in SQL IDE
  4. Review error messages in raw logs

Timeout Errors

What it means: Query took too long to execute.

Common causes:

  • Large result sets
  • Complex joins
  • Missing indexes
  • Inefficient queries

How to fix:

  1. Add LIMIT to restrict result size
  2. Optimize query performance
  3. Add indexes to source views
  4. Refine WHERE conditions
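
Here's what fixes 1, 2, and 4 can look like in practice, using a query like the raw-log example later in this tutorial as the starting point (the event_date column and the 30-day window are illustrative assumptions):

-- Before: unbounded scan with a leading-wildcard LIKE, which can time out
SELECT engagement_score FROM table0 WHERE event_type LIKE '%login%' ORDER BY engagement_score DESC

-- After: exact match, a bounded date window, and a cap on the result size
SELECT engagement_score
FROM table0
WHERE event_type = 'login'
  AND event_date >= CURRENT_DATE - INTERVAL '30 days'
ORDER BY engagement_score DESC
LIMIT 100;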

Permission Errors

What it means: Access denied.

Common causes:

  • Insufficient database permissions
  • View access restrictions
  • Token authentication issues

How to fix:

  1. Verify database connection permissions
  2. Check view access controls
  3. Regenerate authentication tokens if needed

Step 5: Understand Query Shapes

Query shapes represent patterns in the types of queries executed by your MCP tools. They show you how agents are interacting with your data.

Accessing Query Shapes

  1. Select the tool you want to analyze (using filters)
  2. Scroll to the Error Analysis section
  3. Find the Query Shape section
  4. Review the query patterns displayed

Common Query Patterns

Query shapes show patterns like:

  • Simple Filters: Queries with single WHERE conditions
  • Date Ranges: Queries filtering by date ranges
  • Multiple Filters: Queries with multiple WHERE conditions
  • Aggregations: Queries with GROUP BY or aggregations
  • Sorting Patterns: How results are ordered

Example Query Shapes:

Pattern 1: Single Filter

Most common: WHERE event_type = '{value}'
Frequency: 45% of queries

Pattern 2: Date Range

Most common: WHERE date >= '{start}' AND date <= '{end}'
Frequency: 30% of queries

Pattern 3: Multiple Filters

Most common: WHERE type = '{type}' AND status = '{status}'
Frequency: 15% of queries

Using Query Shapes to Improve Tools

Scenario 1: Common Filter Pattern

Observation: 60% of queries filter by event_type

Action:

  • Ensure event_type filtering is optimized
  • Consider making event_type a required parameter
  • Add tool description emphasizing event_type filtering

Scenario 2: Date Range Pattern

Observation: Many queries use date ranges

Action:

  • Create a dedicated tool for date range queries
  • Add date range parameters to existing tools
  • Optimize date filtering in queries

Scenario 3: Multiple Parameter Combinations

Observation: Agents frequently combine specific parameters

Action:

  • Create tools optimized for common combinations
  • Add default values for frequently used parameters
  • Update tool descriptions to highlight common use cases

Pro tip: If a query pattern appears frequently (30%+ of queries), consider creating a dedicated tool optimized for that pattern. This can improve performance and agent experience.
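
For example, if the date-range pattern above keeps dominating, the dedicated tool's query could be as simple as this sketch ({start_date} and {end_date} are the new tool's parameters; the column names are illustrative):

-- Summarize events inside an agent-supplied date range
SELECT event_type, COUNT(*) AS event_count
FROM table0
WHERE event_date >= '{start_date}'
  AND event_date <= '{end_date}'
GROUP BY event_type
ORDER BY event_count DESC
LIMIT 100;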

Step 6: Review Raw Logs

Raw Logs provide detailed records of every MCP tool call. They show you exactly what happened—which tools were called, what queries were executed, whether they succeeded, and any errors that occurred.

Accessing Raw Logs

  1. Scroll to the bottom of the Evaluation Dashboard
  2. Find the Raw Logs section
  3. Review the detailed records

Understanding Log Fields

Each log entry contains:

  • Tool Name: Which MCP tool was invoked
  • Query Executed: The actual SQL query that ran (with parameter values injected)
  • Invocation Status: Success or failure
  • Timestamp: When the invocation occurred
  • Error Message: Error details (empty when successful)

Example Log Entry:

Tool: fetch_engagement_scores_by_event_type
Query: SELECT engagement_score FROM table0 WHERE event_type LIKE '%login%' ORDER BY engagement_score DESC
Status: Success
Timestamp: 2024-11-05 14:23:45
Error Message: (empty)

What it tells you:

  • Which tool agents are using
  • Exact query that ran
  • Parameter values that were used
  • Whether it succeeded or failed
  • When it happened

Using Logs for Debugging

Step 1: Identify the Problem

  1. Find failed invocations in logs
  2. Review error messages
  3. Note which tool failed
  4. Check what parameters were used

Step 2: Understand the Context

  1. Look at the query that was executed
  2. Review parameter values
  3. Check timestamp (when did it fail?)
  4. Compare with successful invocations

Step 3: Reproduce the Issue

  1. Copy the query from the log
  2. Test it manually in SQL IDE
  3. Use the same parameter values
  4. See if you can reproduce the error
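
Using the log entry from the example above, reproducing the call is just pasting the Query Executed value into the SQL IDE, optionally with a LIMIT added so a large result set doesn't get in the way:

-- Copied from the Raw Logs "Query Executed" field; LIMIT added for convenience
SELECT engagement_score
FROM table0
WHERE event_type LIKE '%login%'
ORDER BY engagement_score DESC
LIMIT 50;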

Step 4: Fix the Issue

  1. Identify the root cause
  2. Fix the tool or query
  3. Test the fix
  4. Monitor logs to confirm it's resolved

Pro tip: Error messages in raw logs are your best source of information for understanding what went wrong. Always review them when investigating errors.


Real-World Examples

Let me show you how different teams use Evals to improve their tools.

Example 1: Identifying High-Error Tools

A support team has published multiple tools and wants to identify which ones need attention.

Setup:

  1. Open Evals dashboard
  2. Review summary metrics for each tool
  3. Identify tools with error rates above 10%

Finding: One tool (get_customer_tickets) has a 25% error rate.

Investigation:

  1. Filter to get_customer_tickets in Evals
  2. Review Error Explorer: Most errors are code 400 (Bad Request)
  3. Check Raw Logs: Parameter customer_id is being passed as a string, but query expects integer

Fix: Update tool to cast parameter to integer:

WHERE customer_id = CAST('{customer_id}' AS INTEGER)

Result: Error rate drops from 25% to 2% after the fix.

Value: Evals helped identify a specific issue and verify the fix worked.

Example 2: Optimizing Query Patterns

A product team notices their get_user_events tool is slow. They want to optimize it.

Setup:

  1. Open Evals dashboard
  2. Filter to get_user_events
  3. Review Query Shapes section

Finding: 70% of queries filter by event_type and a date range.

Investigation:

  1. Check Raw Logs: Most queries use the same date range (last 30 days)
  2. Review query execution: Queries scan entire table without date filter optimization

Fix:

  1. Add default date range parameter (last 30 days)
  2. Optimize query to filter by date first
  3. Add index on event_type and date columns
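
A sketch of what the optimized tool query might look like after these changes, assuming the view exposes event_type and date columns and the tool gains an {event_type} parameter (names are illustrative):

-- Filter by the default 30-day window first, then by event type, and cap the result
SELECT *
FROM table0
WHERE date >= CURRENT_DATE - INTERVAL '30 days'
  AND event_type = '{event_type}'
ORDER BY date DESC
LIMIT 500;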

Result: Query execution time drops by 60%, and agents get faster responses.

Value: Query shapes revealed the most common pattern, enabling targeted optimization.

Example 3: Tracking Performance Improvements

A sales team updates their get_opportunity_summary tool and wants to verify the update improved performance.

Setup:

  1. Note current success rate before update: 85%
  2. Deploy tool update
  3. Monitor Evals dashboard over next few days

Finding: Success rate improves to 95% after update.

Verification:

  1. Review Success/Error Rate (%) graph: Shows steady improvement
  2. Check Error Explorer: Fewer error codes appearing
  3. Review Raw Logs: Fewer failed invocations

Result: Confirmed the update improved tool reliability.

Value: Evals provided objective metrics to verify improvements, not just assumptions.

Notice how each example uses the same pattern: monitor, identify, fix, verify. Evals enable this feedback loop.


Common Pitfalls & Tips

I've seen teams make these mistakes when using Evals. Here's how to avoid them.

Pitfall 1: Not Monitoring Regularly

Don't set up Evals and forget about them. Regular monitoring is essential.

Why this matters: Issues compound over time. A 5% error rate today can become a 20% error rate next week if you don't catch it early.

How to avoid it:

  • Check Evals at least weekly
  • Set up alerts for high error rates (if available)
  • Review trends, not just current metrics
  • Investigate spikes immediately

Pitfall 2: Ignoring Low-Frequency Errors

Don't dismiss errors that occur occasionally. They often indicate edge cases or boundary conditions.

Why this matters: Low-frequency errors can become high-frequency errors as usage grows. Also, they might indicate data quality issues.

How to avoid it:

  • Investigate all errors, even if they're rare
  • Look for patterns in low-frequency errors
  • Document edge cases and handle them
  • Monitor if frequency increases over time

Pitfall 3: Not Using Query Shapes

Don't just look at success/error rates. Query shapes reveal how agents actually use your tools.

Why this matters: Understanding query patterns helps you optimize the right things. Without them, you might spend time optimizing for a pattern that rarely occurs.

How to avoid it:

  • Review Query Shapes section regularly
  • Identify high-frequency patterns (30%+ of queries)
  • Create specialized tools for common patterns
  • Optimize queries based on actual usage, not assumptions

Pitfall 4: Not Reviewing Raw Logs

Don't rely only on summary metrics. Raw logs provide the details you need to fix issues.

Why this matters: Summary metrics tell you something is wrong, but raw logs tell you why. You can't fix what you don't understand.

How to avoid it:

  • Always review raw logs when investigating errors
  • Look for patterns in error messages
  • Compare failed queries with successful ones
  • Use logs to reproduce and fix issues

Pitfall 5: Not Tracking Improvements

Don't make changes without verifying they helped. Track improvements to confirm your fixes worked.

Why this matters: You might think a change helped, but metrics might show otherwise. Or a change might have unintended side effects.

How to avoid it:

  • Note metrics before making changes
  • Monitor Evals after deploying fixes
  • Use time-series graphs to track trends
  • Verify specific error codes disappear

Best Practices Summary

Here's a quick checklist:

  • Monitor regularly: Check Evals at least weekly
  • Focus on high-frequency errors: Fix the biggest problems first
  • Review query shapes: Understand how agents use your tools
  • Use raw logs: Get details needed to fix issues
  • Track improvements: Verify fixes actually work
  • Set thresholds: Define acceptable error rates (e.g., <10%)
  • Document patterns: Note common issues and fixes
  • Iterate continuously: Use Evals to continuously improve

Next Steps

You've learned how to track agent behavior using Pylar Evals. That's the foundation for continuous improvement. Now you can:

  1. Monitor regularly: Set up a routine to check Evals weekly or daily, depending on your usage volume.

  2. Identify issues: Use summary metrics, error analysis, and query shapes to find problems before they become critical.

  3. Fix and verify: Make improvements based on insights, then verify they worked by monitoring Evals again.

  4. Optimize continuously: Use query patterns to optimize tools for how agents actually use them, not how you think they should use them.

The key is to make Evals part of your regular workflow. Don't just check them when something breaks—monitor proactively to catch issues early and continuously improve.

If you want to keep going, the next step is using Evals insights to improve your tools and views. That's where you'll see the real value—turning observations into optimizations that make agents more reliable and faster.


Frequently Asked Questions

How often should I check Evals?

At least weekly. If your usage volume is high, or you've just shipped a tool or view change, check daily until the metrics settle.

What's a good success rate to aim for?

Aim for 90% or higher. Between 70% and 90% means the tool needs attention; below 70% means it needs immediate attention.

Can I see Evals for multiple tools at once?

The Evaluation Dashboard covers all the MCP tools in your project. Use the Filters section to drill into a single tool when you want to investigate it in detail.

How do I know if a query pattern is worth optimizing?

A good rule of thumb from this tutorial: if a pattern accounts for 30% or more of queries, it's worth optimizing for, possibly with a dedicated tool.

What if I see errors but don't know how to fix them?

Start with the Raw Logs. The error messages there are your best source of information: copy the failing query into the SQL IDE, reproduce the error with the same parameter values, and work back to the root cause (see Step 6).

Can I export Evals data?

How long does Evals data persist?

What if my tools have zero usage?

Evals only show data when agents actually call your tools. Make sure the tools are published, your agents are connected and configured to use them, and the tool descriptions make it clear when agents should reach for them.
