Pull request overview
Adds tool-call tracing to the Claude Code CLI evaluation so the test can score both response correctness and whether the expected tool(s) were used.
Changes:
- Switch Claude CLI invocation to `--output-format stream-json` and parse tool call events into `ToolCall` objects.
- Add DeepEval `ToolCorrectnessMetric` and populate `tools_called`/`expected_tools` on the `LLMTestCase`.
- Extend allowed tools to include `Skill`.
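For readers following the first change, here is a minimal sketch of what the stream parsing could look like. It is not the code from this PR. It assumes the CLI emits one JSON object per line and that tool invocations appear as `tool_use` blocks inside `assistant` messages; `TracedToolCall`, `extract_tool_calls`, and the sample events are hypothetical names and data used only for illustration.

```python
import json
from dataclasses import dataclass, field


@dataclass
class TracedToolCall:
    """Hypothetical stand-in for DeepEval's ToolCall: a tool name plus its inputs."""
    name: str
    input_parameters: dict = field(default_factory=dict)


def extract_tool_calls(stream_output: str) -> list[TracedToolCall]:
    """Collect tool_use blocks from newline-delimited JSON events.

    Assumed event shape (not confirmed by this PR):
    {"type": "assistant", "message": {"content": [{"type": "tool_use", ...}]}}
    """
    calls: list[TracedToolCall] = []
    for line in stream_output.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore any non-JSON noise mixed into stdout
        if not isinstance(event, dict) or event.get("type") != "assistant":
            continue
        content = (event.get("message") or {}).get("content")
        if not isinstance(content, list):
            continue
        for block in content:
            if isinstance(block, dict) and block.get("type") == "tool_use":
                calls.append(
                    TracedToolCall(block.get("name", ""), block.get("input") or {})
                )
    return calls


# Made-up sample events, for illustration only.
sample_stream = "\n".join([
    json.dumps({"type": "system", "subtype": "init"}),
    json.dumps({"type": "assistant", "message": {"content": [
        {"type": "text", "text": "Checking authentication."},
        {"type": "tool_use", "name": "Skill", "input": {"skill": "check-upsun-auth"}},
    ]}}),
    "not json",
    json.dumps({"type": "result", "result": "done"}),
])

traced = extract_tool_calls(sample_stream)
```

Skipping unparseable lines rather than raising keeps the eval from failing on incidental stdout noise; whether that tolerance is wanted here is a design choice for the PR author.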
Pull request overview
Adds tool-call evaluation to the existing Upsun login eval and introduces a dedicated skill for checking Upsun authentication, while updating the CI workflow to install the Upsun plugin via Claude’s plugin system.
Changes:
- Added a new `check-upsun-auth` skill documenting how to check/login/logout with the Upsun CLI.
- Enhanced `evals/test_login.py` to capture Claude tool-use traces (stream JSON) and evaluate them with `ToolCorrectnessMetric`.
- Updated the eval workflow to install the Upsun plugin instead of manually copying skills.
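To make the idea of scoring tool usage concrete, the sketch below shows a simplified, name-only check. It is a stand-in for illustration and does not reproduce DeepEval's `ToolCorrectnessMetric` scoring; the function name `expected_tool_coverage` and the tool lists are assumptions.

```python
def expected_tool_coverage(called: list[str], expected: list[str]) -> float:
    """Return the fraction of expected tool names that were actually called.

    Simplified illustration only: compares names, ignores inputs and ordering.
    """
    expected_names = set(expected)
    if not expected_names:
        return 1.0  # nothing was expected, so nothing is missing
    called_names = set(called)
    hits = sum(1 for name in expected_names if name in called_names)
    return hits / len(expected_names)


# Hypothetical example: the run called Bash and Skill, and Skill was expected.
coverage = expected_tool_coverage(["Bash", "Skill"], ["Skill"])
```

A test could then assert that the coverage meets a threshold alongside the response-correctness check.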
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| plugins/upsun/skills/check-upsun-auth/SKILL.md | New skill docs for checking Upsun auth and related commands. |
| evals/test_login.py | Parse Claude stream-json output to extract tool calls; add tool correctness metric expectations. |
| .github/workflows/run-evals.yml | Switch CI setup from manual skills copy to plugin marketplace/install flow. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Eval Results
1 test, 1 ✅, 19s ⏱️
Results for commit 1bf30b6. ♻️ This comment has been updated with latest results.
@pjcdawkins @romainneutron This is the sample PR I have created.
I will remove the newly added skill, as it was just for testing, and make sure that the existing …