
Build smarter, test smarter: Agent Evaluation in Microsoft Copilot Studio

Automated agent testing is now built into Copilot Studio—evaluate performance, improve quality, and scale confidently with Agent Evaluation.

As AI agents take on critical roles in business processes, reliable, repeatable testing becomes essential. In the past, agents have been tested manually: typing in questions, hoping for the right answers, and troubleshooting inconsistencies case by case. That approach is time-consuming, unscalable, and inconsistent, relying on intuition instead of structured testing, and it doesn't hold up for enterprise-grade agent deployment. Enterprise makers need testing that is built in, automated, and at scale before they deploy agents.

Today, we are announcing the public preview of Agent Evaluation in Microsoft Copilot Studio, bringing rigor directly into the agent-building tool you already use, backed by Microsoft’s end-to-end approach.

A helpdesk agent in Copilot Studio showing 32 test cases and an evaluation summary of 94%.

Introducing Agent Evaluation

Agent Evaluation enables structured, automated testing directly in Copilot Studio. It gives makers a direct, seamless way to create evaluation sets, choose test methods, define success measures for the agent, and then run the tests. It also maximizes the model choice that Copilot Studio offers by letting makers evaluate agent performance across multiple agent-level models.

A helpdesk agent in Copilot Studio, asking the user to start by importing a file

Create evaluation sets

Makers can now upload predefined test sets, reuse recent Test Pane interactions, and add test questions manually. We are also enabling AI-powered generation of test queries from the agent's metadata, knowledge sources, and more, giving makers quick visibility into agent quality without manually authoring expected answers. This allows for early testing, while additional Q&A sets can be added manually for deeper evaluation.

Makers can also mix AI-generated queries with manual or imported test sets to expand coverage, helping to evaluate both breadth (common scenarios auto-generated by AI) and depth (organization-specific queries) of agent behavior.
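To make the idea concrete, here is a rough sketch in Python of what an evaluation set can look like conceptually: imported or AI-generated cases for breadth, merged with manually authored cases for depth. The file name, column names, and structure are assumptions for illustration, not Copilot Studio's actual import format.

```python
import csv

# Hypothetical evaluation set: each test case pairs a question with an
# optional expected answer. The file name and column names are illustrative,
# not Copilot Studio's actual import format.
def load_eval_set(path: str) -> list[dict]:
    with open(path, newline="", encoding="utf-8") as f:
        return [
            {"question": row["question"], "expected": row.get("expected", "")}
            for row in csv.DictReader(f)
        ]

# Manually authored, organization-specific cases add depth.
manual_cases = [
    {
        "question": "How do I request a replacement badge?",
        "expected": "Submit a badge request in the facilities portal.",
    },
]

# Imported or AI-generated cases add breadth; merge them into one set.
eval_set = load_eval_set("helpdesk_eval_set.csv") + manual_cases
print(f"{len(eval_set)} test cases ready for evaluation")
```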

A helpdesk agent in Copilot Studio where the user is configuring test sets

Choose flexible test methods

Makers can choose from a wide range of test methods, whether exact or partial matches, advanced similarity metrics, intent recognition, or relevance and completeness, based on the type of agent they are deploying. This allows makers to mimic how different users judge the agent, from strict checklist compliance to overall helpfulness, giving a comprehensive view of performance.
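As a conceptual illustration only (not how Copilot Studio implements these methods), the sketch below contrasts an exact match, a partial match, and a simple similarity score on the same answer; the functions and sample strings are assumptions for the example.

```python
from difflib import SequenceMatcher

def exact_match(answer: str, expected: str) -> bool:
    # Strictest method: the answer must equal the expected text.
    return answer.strip().lower() == expected.strip().lower()

def partial_match(answer: str, expected: str) -> bool:
    # Looser method: the expected text must appear somewhere in the answer.
    return expected.strip().lower() in answer.lower()

def similarity(answer: str, expected: str) -> float:
    # Simple character-level similarity, standing in for more advanced
    # similarity or relevance metrics.
    return SequenceMatcher(None, answer.lower(), expected.lower()).ratio()

expected = "Restart the router, then reconnect to the corporate Wi-Fi."
answer = "Please restart the router and reconnect to the corporate Wi-Fi."

print(exact_match(answer, expected))           # False: wording differs
print(partial_match(answer, expected))         # False: not an exact substring
print(round(similarity(answer, expected), 2))  # high score (close to 0.9)
```

The point of offering several methods is visible even in this toy version: the same answer fails a strict check but scores well on a similarity metric, so the right method depends on how the agent's users judge a good response.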

A helpdesk agent in Copilot Studio where the user is reviewing test cases

Define measures of agent success  

Agent Evaluation allows you to define what constitutes success for your business, whether that is strict keyword matches (lexical alignment) or conceptual, meaning-based matches (semantic alignment). You can also set custom thresholds to ensure your agent meets your organization's unique standards for accuracy and relevance.
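Here is a minimal sketch of how maker-defined criteria and thresholds might combine, assuming a required-keyword check for lexical alignment and a generic similarity score standing in for semantic alignment; the names and threshold values are illustrative, not product settings.

```python
from difflib import SequenceMatcher

# Maker-defined success criteria; the keyword list and threshold are
# illustrative values, not product defaults.
REQUIRED_KEYWORDS = ["password", "self-service portal"]  # lexical alignment
SEMANTIC_THRESHOLD = 0.8                                 # semantic alignment

def semantic_score(answer: str, expected: str) -> float:
    # Placeholder for a meaning-based metric; a real evaluator would use
    # something richer than character overlap.
    return SequenceMatcher(None, answer.lower(), expected.lower()).ratio()

def passes(answer: str, expected: str) -> bool:
    lexical_ok = all(kw in answer.lower() for kw in REQUIRED_KEYWORDS)
    semantic_ok = semantic_score(answer, expected) >= SEMANTIC_THRESHOLD
    return lexical_ok and semantic_ok

expected = "Reset your password from the self-service portal."
print(passes("You can reset your password in the self-service portal.", expected))  # True
print(passes("Contact your manager for help.", expected))                           # False
```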

A helpdesk agent in Copilot Studio where the user is editing a test case to define the passing score

Execute evaluations

Once the dataset is prepared, test methods are chosen, and thresholds are configured, evaluations are executed with a single click. Results are displayed with clear pass or fail indicators, numeric scores on answer quality, and details around the knowledge sources used by the agent. No more guessing as to why an answer failed.
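Conceptually, a batch run boils down to something like the sketch below, which assumes a hypothetical ask_agent call and a passes check like the one above; none of these names are Copilot Studio APIs, and in the product the run itself is a single click.

```python
def run_evaluation(eval_set, ask_agent, passes):
    """Run every test case once and report per-case results plus an overall pass rate."""
    results = []
    for case in eval_set:
        answer = ask_agent(case["question"])  # hypothetical agent call
        results.append({"question": case["question"],
                        "passed": passes(answer, case["expected"])})
    pass_rate = sum(r["passed"] for r in results) / len(results)
    print(f"{len(results)} test cases, {pass_rate:.0%} passed")
    return results

# Example with a stubbed agent and a strict equality check.
eval_set = [
    {"question": "How do I reset my password?",
     "expected": "Reset your password from the self-service portal."},
    {"question": "How do I request a replacement badge?",
     "expected": "Submit a badge request in the facilities portal."},
]
run_evaluation(
    eval_set,
    ask_agent=lambda q: "Reset your password from the self-service portal.",
    passes=lambda a, e: a.strip().lower() == e.strip().lower(),
)
```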

A helpdesk agent in Copilot Studio where the user is reviewing the results of a test case

Transforming agent quality: From build to continuous improvement 

Agent Evaluation transforms agent development into a full lifecycle of build, test, and improve. We want makers to have the same rigorous and streamlined quality process for agents as they do for traditional software. By launching evaluations in Copilot Studio, we're ensuring that every agent can be tested and continuously improved, leading to well-tested agents deployed across the organization. This also enables makers to test agents using different agent-level models for agent orchestration and find the model that best suits the business process being transformed. You can go from building an agent to testing it in the same interface, all while being confident in Microsoft's enterprise-grade permission controls, compliance, and governance capabilities.
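To illustrate the model-comparison idea in the same hedged spirit, the sketch below runs one evaluation set against two placeholder model configurations and compares pass rates; the model names and the ask_agent_with_model function are hypothetical, not real identifiers or APIs.

```python
def compare_models(eval_set, model_names, ask_agent_with_model, passes):
    """Score the same evaluation set under each candidate model configuration."""
    return {
        model: sum(
            passes(ask_agent_with_model(model, case["question"]), case["expected"])
            for case in eval_set
        ) / len(eval_set)
        for model in model_names
    }

# Placeholder model identifiers and a stubbed agent call, for illustration only.
scores = compare_models(
    eval_set=[{"question": "How do I reset my password?",
               "expected": "Use the self-service portal."}],
    model_names=["model-a", "model-b"],
    ask_agent_with_model=lambda model, question: "Use the self-service portal.",
    passes=lambda a, e: a.strip().lower() == e.strip().lower(),
)
print(scores)  # {'model-a': 1.0, 'model-b': 1.0}
```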

Next steps 

To learn how to get started, visit Agent Evaluation in Copilot Studio.

Check out all the updates live as we ship them, as well as new features coming in the next few months, here: What's new in Microsoft Copilot Studio.

To learn more about Copilot Studio and how it can transform your organization’s productivity, visit the Copilot Studio website or sign up for our free trial today.

We look forward to sharing more about Agent Evaluation at the Power Platform Community Conference 2025.

Efrat Gilboa

Principal Product Manager, Microsoft Copilot Studio
Efrat Gilboa is a Principal Product Manager at Microsoft, leading the vision and execution for Agent Evaluation in Copilot Studio. She partners closely with enterprise customers and engineering teams to deliver innovative, AI-powered evaluation capabilities, including automated testing, test method frameworks, and advanced evaluation workflows, that empower makers to confidently test, refine, and deploy agents at scale.