
add tool correctness metric #13

Open
ganeshdipdumbare wants to merge 15 commits into main from add-tool-correctness-metric

Conversation

@ganeshdipdumbare
Contributor

@ganeshdipdumbare ganeshdipdumbare commented Mar 16, 2026

  • add tool correctness metrics

@ganeshdipdumbare ganeshdipdumbare marked this pull request as ready for review March 16, 2026 11:14
Copilot AI review requested due to automatic review settings March 16, 2026 11:14
Contributor

Copilot AI left a comment


Pull request overview

Adds tool-call tracing to the Claude Code CLI evaluation so the test can score both response correctness and whether the expected tool(s) were used.

Changes:

  • Switch Claude CLI invocation to --output-format stream-json and parse tool call events into ToolCall objects.
  • Add DeepEval ToolCorrectnessMetric and populate tools_called / expected_tools on the LLMTestCase.
  • Extend allowed tools to include Skill.
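The tool-call extraction step described above can be sketched roughly as follows. The exact event shape (assistant events whose `message.content` includes `tool_use` blocks with a `name` field) is an assumption about the CLI's stream-json format, and the helper name is hypothetical:

```python
import json

def extract_tool_calls(stream_output: str) -> list[str]:
    """Collect tool names from Claude CLI --output-format stream-json output.

    Assumes one JSON event per line; "assistant" events carry a message
    whose content blocks of type "tool_use" name the tool that was called.
    """
    names = []
    for line in stream_output.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate any non-JSON noise in the stream
        if event.get("type") != "assistant":
            continue
        for block in event.get("message", {}).get("content", []):
            if isinstance(block, dict) and block.get("type") == "tool_use":
                names.append(block["name"])
    return names
```

The resulting names could then be wrapped as DeepEval `ToolCall` objects to populate `tools_called` and `expected_tools` on the `LLMTestCase` for scoring with `ToolCorrectnessMetric`.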


Contributor

Copilot AI left a comment


Pull request overview

Adds tool-call evaluation to the existing Upsun login eval and introduces a dedicated skill for checking Upsun authentication, while updating the CI workflow to install the Upsun plugin via Claude’s plugin system.

Changes:

  • Added a new check-upsun-auth skill documenting how to check/login/logout with the Upsun CLI.
  • Enhanced evals/test_login.py to capture Claude tool-use traces (stream JSON) and evaluate them with ToolCorrectnessMetric.
  • Updated the eval workflow to install the Upsun plugin instead of manually copying skills.
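The workflow change could look roughly like the step below. The `claude plugin` subcommands, marketplace path, and plugin name shown here are assumptions for illustration, not taken from the actual diff:

```yaml
# Hypothetical CI step; marketplace repo and plugin id are placeholders.
- name: Install Upsun plugin
  run: |
    claude plugin marketplace add <owner>/<marketplace-repo>
    claude plugin install upsun@<marketplace-name>
```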

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
plugins/upsun/skills/check-upsun-auth/SKILL.md New skill docs for checking Upsun auth and related commands.
evals/test_login.py Parse Claude stream-json output to extract tool calls; add tool correctness metric expectations.
.github/workflows/run-evals.yml Switch CI setup from manual skills copy to plugin marketplace/install flow.


ganeshdipdumbare and others added 2 commits March 16, 2026 15:53
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@github-actions

github-actions bot commented Mar 16, 2026

Eval Results

1 tests   1 ✅  19s ⏱️
1 suites  0 💤
1 files    0 ❌

Results for commit 1bf30b6.

♻️ This comment has been updated with latest results.

@ganeshdipdumbare
Contributor Author

@pjcdawkins @romainneutron This is the sample PR I have created.
The changes are:

  1. Installed the plugin instead of copy-pasting the skills
  2. Added the tool correctness metric to check whether the skill was actually called
  3. Made sure Claude Code is not failing silently
  4. Added the eval results to the PR

I will remove the newly added skill, since it was only for testing, and make sure the existing upsun skill is used; currently it is not, because MCP takes over before it.
