As AI agents take on critical roles in business processes, reliable, repeatable testing becomes essential. In the past, agents were tested manually: typing in questions, hoping for the right answers, and troubleshooting inconsistencies case by case. That approach is time-consuming, unscalable, and inconsistent, relying on intuition instead of structured testing, and it doesn't work for enterprise-grade agent deployment. To deploy agents with confidence, enterprise makers need testing that is built in, automated, and at scale.
Today, we are announcing the public preview of Agent Evaluation in Microsoft Copilot Studio, bringing rigor directly into the agent-building tool you already use, backed by Microsoft’s end-to-end approach.

Introducing Agent Evaluation
Agent Evaluation enables structured, automated testing directly in Copilot Studio. It gives makers a direct, seamless way to create evaluation sets, choose test methods, define success measures for the agent, and then run the tests. It also maximizes the model choice that Copilot Studio offers by evaluating agent performance across multiple agent-level models.

Create evaluation sets
Makers can now upload predefined test sets, reuse recent Test Pane interactions, and add test questions manually. We are also enabling AI-powered generation of test queries from the agent's metadata, knowledge sources, and more, giving makers quick visibility into agent quality without the manual work of writing expected answers. This allows for early testing, while additional Q&A sets can be added manually for deeper evaluation.
Makers can also mix AI-generated queries with manual or imported test sets to expand coverage, helping to evaluate both breadth (common scenarios auto-generated by AI) and depth (organization-specific queries) of agent behavior.
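Copilot Studio manages evaluation sets inside the product, but conceptually a mixed set is just AI-generated and hand-written test cases merged with duplicates removed. The sketch below illustrates that idea in Python; the field names (`query`, `expected_answer`, `source`) and the helper `build_evaluation_set` are hypothetical, not Copilot Studio's actual schema or API.

```python
# Illustrative evaluation-set structure; field names are hypothetical,
# not Copilot Studio's actual schema.
ai_generated = [
    {"query": "What are your support hours?", "expected_answer": None, "source": "ai"},
    {"query": "How do I reset my password?", "expected_answer": None, "source": "ai"},
]

manual = [
    {"query": "What is the refund window for enterprise plans?",
     "expected_answer": "Enterprise plans can be refunded within 30 days.",
     "source": "manual"},
]

def build_evaluation_set(*batches):
    """Merge batches into one evaluation set, dropping duplicate queries."""
    seen, merged = set(), []
    for batch in batches:
        for case in batch:
            key = case["query"].strip().lower()
            if key not in seen:
                seen.add(key)
                merged.append(case)
    return merged

eval_set = build_evaluation_set(ai_generated, manual)
print(len(eval_set))  # 3 unique test cases
```

Note that the AI-generated cases carry no expected answer, matching the early-testing scenario above, while the manual case supplies one for deeper evaluation.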

Choose flexible test methods
Makers can choose from a wide range of test methods, from exact or partial matches and advanced similarity metrics to intent recognition and relevance and completeness, based on the type of agent they are deploying. This lets makers mimic how different users judge the agent, from strict checklist compliance to overall helpfulness, giving a comprehensive view of performance.
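To make the distinction between these methods concrete, here is a minimal Python sketch of what exact, partial, and similarity-based matching can mean. It is a conceptual illustration, not Copilot Studio's implementation: the function names are hypothetical, and the standard-library `difflib` ratio stands in for the more advanced similarity metrics the product provides.

```python
import difflib

def exact_match(answer: str, expected: str) -> bool:
    # Strictest method: the whole answer must equal the expected text.
    return answer.strip().lower() == expected.strip().lower()

def partial_match(answer: str, expected: str) -> bool:
    # Looser: the expected text just has to appear somewhere in the answer.
    return expected.strip().lower() in answer.strip().lower()

def similarity(answer: str, expected: str) -> float:
    # A score in [0, 1]; real similarity metrics would use embeddings
    # rather than character-level comparison.
    return difflib.SequenceMatcher(None, answer.lower(), expected.lower()).ratio()

# Exact match fails on extra wording, while partial match still passes.
print(exact_match("Refunds take 30 days to process.", "30 days"))    # False
print(partial_match("Refunds take 30 days to process.", "30 days"))  # True
```

Choosing between these is the "strict checklist compliance versus overall helpfulness" trade-off: exact matching suits compliance-critical answers, while similarity scoring tolerates paraphrasing.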

Define measures of agent success
Agent Evaluation allows you to define what constitutes success for your business, whether that is strict keyword matches (lexical alignment) or conceptual, meaning-based matches (semantic alignment). You can also set custom thresholds to ensure your agent meets your organization's unique standards for accuracy and relevance.

Execute evaluations
Once the dataset is prepared, test methods are chosen, and thresholds are configured, evaluations are executed with a single click. Results are displayed with clear pass or fail indicators, numeric scores on answer quality, and details around the knowledge sources used by the agent. No more guessing as to why an answer failed.
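Under stated assumptions, the one-click flow above reduces to: score each case, compare the score to the configured threshold, and report pass or fail. The sketch below models that loop in Python; `run_evaluation` and its field names are hypothetical, and the standard-library `difflib` ratio is only a stand-in for the product's scoring.

```python
import difflib

def run_evaluation(cases, threshold=0.8):
    """Score each test case and mark pass/fail against a configurable threshold.

    The similarity ratio here is a placeholder for whichever test method
    the maker selected (exact match, semantic alignment, etc.).
    """
    results = []
    for case in cases:
        score = difflib.SequenceMatcher(
            None, case["answer"].lower(), case["expected"].lower()
        ).ratio()
        results.append({
            "query": case["query"],
            "score": round(score, 2),
            "passed": score >= threshold,
        })
    return results

cases = [
    {"query": "Refund window?",
     "answer": "Enterprise plans can be refunded within 30 days.",
     "expected": "Enterprise plans: 30-day refund window."},
]
for r in run_evaluation(cases, threshold=0.5):
    print(r["query"], r["score"], "PASS" if r["passed"] else "FAIL")
```

Raising or lowering `threshold` is the lever described in the previous section: the same answers can pass a lenient semantic bar yet fail a strict one.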

Transforming agent quality: From build to continuous improvement
Agent Evaluation transforms agent development into a full lifecycle of build, test, and improve. We want makers to have the same rigorous and streamlined quality process for agents as they do for traditional software. By launching evaluations in Copilot Studio, we're ensuring that every agent can be tested and continuously improved before being deployed across the organization. Makers can also test agents with different agent-level models for orchestration, to find the model that best suits the business process being transformed. You can go from building an agent to testing it in the same interface, confident in Microsoft's enterprise-grade permission controls, compliance, and governance capabilities.
Next steps
To learn how to get started, visit Agent Evaluation in Copilot Studio.
Check out all the updates as we ship them, along with new features coming in the next few months: What's new in Microsoft Copilot Studio.
To learn more about Copilot Studio and how it can transform your organization’s productivity, visit the Copilot Studio website or sign up for our free trial today.
We look forward to sharing more about Agent Evaluation at the Power Platform Community Conference 2025.
