
Microsoft Launches ASSERT to Simplify AI Model Testing for Product-Specific Behavior



By admin | Jun 02, 2026 | 4 min read



AI researchers and labs have made substantial progress in evaluating models along broad dimensions such as safety, compliance, sycophancy, and alignment. Companies and developers, however, face a narrower problem: ensuring that an AI system behaves as intended within their particular product or service. To streamline that evaluation, Microsoft unveiled ASSERT (Adaptive Spec-driven Scoring for Evaluation and Regression Testing) on Tuesday. According to Microsoft, the open-source framework uses AI to turn high-level, natural-language descriptions of goals, policies, or intended behaviors into detailed, scored tests that developers can inspect.

ASSERT takes plain-language descriptions of an AI system's expected behavior and policies, converts them into a structured set of acceptable and unacceptable actions, generates problem scenarios and test cases, runs them against the target system, and scores the outcomes. It can also record the path the AI system follows, including intermediate steps and tool calls, so developers can pinpoint where a failure occurs. Developers can supply system context, tools, and constraints to further customize the evaluations. For instance, a developer could specify that a document-research AI agent must not send emails to people outside the company, must restrict confidential information to C-level executives, and should deliver concise summaries that take prior context into account. ASSERT then uses these rules to create test cases that repeatedly verify whether the system adheres to them.
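To make the workflow concrete, here is a minimal Python sketch of the general idea: rules expressed as checks over a recorded trace of tool calls, a suite of test prompts run against a target system, and a score per test case. It is not ASSERT's API. Every name in it (`Rule`, `score_trace`, `toy_agent`, the `@corp.example` domain, the tool names) is hypothetical, and the rules and prompts are written by hand here, whereas ASSERT reportedly generates them with AI from a natural-language spec.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# One recorded action in a trace, e.g. {"tool": "send_email", "to": "a@corp.example"}.
Step = Dict[str, str]


@dataclass
class Rule:
    """A hypothetical policy rule: a name plus a check for a violating step."""
    name: str
    violated_by: Callable[[Step], bool]


def first_violation(rule: Rule, trace: List[Step]) -> Optional[int]:
    """Return the index of the first step that breaks the rule, or None."""
    for i, step in enumerate(trace):
        if rule.violated_by(step):
            return i
    return None


def score_trace(rules: List[Rule], trace: List[Step]) -> Dict[str, object]:
    """Score a trace as the fraction of rules it satisfies, noting where each failure occurred."""
    failures = {}
    for rule in rules:
        idx = first_violation(rule, trace)
        if idx is not None:
            failures[rule.name] = idx
    return {"score": (len(rules) - len(failures)) / len(rules), "failures": failures}


# Hand-written stand-ins for two of the policies in the article's example.
RULES = [
    Rule(
        "no_external_email",
        lambda s: s.get("tool") == "send_email"
        and not s.get("to", "").endswith("@corp.example"),
    ),
    Rule(
        "confidential_to_c_level_only",
        lambda s: s.get("tool") == "share_document"
        and s.get("label") == "confidential"
        and s.get("recipient_role") != "c_level",
    ),
]


def toy_agent(prompt: str) -> List[Step]:
    """Toy target system: returns the trace of tool calls it 'made' for a prompt."""
    trace = [{"tool": "search_docs", "query": prompt}]
    recipient = "partner@other.example" if "outside" in prompt else "colleague@corp.example"
    trace.append({"tool": "send_email", "to": recipient})
    return trace


def run_suite(agent, rules, prompts):
    """Run every test prompt against the agent and score the resulting trace."""
    return {p: score_trace(rules, agent(p)) for p in prompts}


PROMPTS = [
    "Summarize the Q3 report for my team",
    "Send the Q3 report to our outside vendor",
]
results = run_suite(toy_agent, RULES, PROMPTS)
for prompt, result in results.items():
    print(prompt, "->", result)
```

In this toy run, the second prompt triggers an email to an external address, so the `no_external_email` rule fails at step index 1 of the trace. That index is the kind of "where did it go wrong" signal the article describes. Re-running the same suite after each change to the agent gives a simple regression check.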

Image Credits: Microsoft

Microsoft says ASSERT fills a gap left by broader, general-purpose evaluations, which fall short when AI models must behave according to an application's or product's specific context, policies, and tools. "One of the things we’ve learned is that evaluations are absolutely critical to making good decisions," said Sarah Bird, chief product officer of Responsible AI at Microsoft. "Because if you don’t understand the behavior of the AI system, it’s really hard to know if it’s meeting your organization’s bar [...] What we found is that if you really want to have a trustworthy system, you should evaluate many more dimensions that are application-specific."

Bird noted that ASSERT can be used to evaluate systems during development, after deployment, and for continuous monitoring. The release coincides with a gradual but significant shift in the AI industry: as models become more capable, researchers are placing more emphasis on repeatable testing and regression checks. Initiatives such as Stanford's HELM and MLCommons' AILuminate, along with evaluation groups like METR, are rolling out benchmarks to measure how models perform under varying conditions.



