Testing and benchmarking LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring, in a setting where even top agents score below 50% on real-world benchmarks. Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.