
add tool correctness metric #13

Open
ganeshdipdumbare wants to merge 15 commits into main from add-tool-correctness-metric

Conversation

@ganeshdipdumbare
Contributor

@ganeshdipdumbare ganeshdipdumbare commented Mar 16, 2026

  • add tool correctness metrics

@ganeshdipdumbare ganeshdipdumbare marked this pull request as ready for review March 16, 2026 11:14
Copilot AI review requested due to automatic review settings March 16, 2026 11:14
Contributor

Copilot AI left a comment


Pull request overview

Adds tool-call tracing to the Claude Code CLI evaluation so the test can score both response correctness and whether the expected tool(s) were used.

Changes:

  • Switch Claude CLI invocation to --output-format stream-json and parse tool call events into ToolCall objects.
  • Add DeepEval ToolCorrectnessMetric and populate tools_called / expected_tools on the LLMTestCase.
  • Extend allowed tools to include Skill.
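The tool-call extraction step described above can be sketched roughly as follows. The exact event shape (assistant events whose `message.content` includes `tool_use` blocks with a `name` field) is an assumption about the CLI's stream-json format, and the helper name is hypothetical:

```python
import json

def extract_tool_calls(stream_output: str) -> list[str]:
    """Collect tool names from Claude CLI --output-format stream-json output.

    Assumes one JSON event per line; "assistant" events carry a message
    whose content blocks of type "tool_use" name the tool that was called.
    """
    names = []
    for line in stream_output.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate any non-JSON noise in the stream
        if event.get("type") != "assistant":
            continue
        for block in event.get("message", {}).get("content", []):
            if isinstance(block, dict) and block.get("type") == "tool_use":
                names.append(block["name"])
    return names
```

The resulting names could then be wrapped as DeepEval `ToolCall` objects to populate `tools_called` and `expected_tools` on the `LLMTestCase` for scoring with `ToolCorrectnessMetric`.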


Contributor

Copilot AI left a comment


Pull request overview

Adds tool-call evaluation to the existing Upsun login eval and introduces a dedicated skill for checking Upsun authentication, while updating the CI workflow to install the Upsun plugin via Claude’s plugin system.

Changes:

  • Added a new check-upsun-auth skill documenting how to check/login/logout with the Upsun CLI.
  • Enhanced evals/test_login.py to capture Claude tool-use traces (stream JSON) and evaluate them with ToolCorrectnessMetric.
  • Updated the eval workflow to install the Upsun plugin instead of manually copying skills.
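The workflow change could look roughly like the step below. The `claude plugin` subcommands, marketplace path, and plugin name shown here are assumptions for illustration, not taken from the actual diff:

```yaml
# Hypothetical CI step; marketplace repo and plugin id are placeholders.
- name: Install Upsun plugin
  run: |
    claude plugin marketplace add <owner>/<marketplace-repo>
    claude plugin install upsun@<marketplace-name>
```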

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
plugins/upsun/skills/check-upsun-auth/SKILL.md New skill docs for checking Upsun auth and related commands.
evals/test_login.py Parse Claude stream-json output to extract tool calls; add tool correctness metric expectations.
.github/workflows/run-evals.yml Switch CI setup from manual skills copy to plugin marketplace/install flow.


ganeshdipdumbare and others added 2 commits March 16, 2026 15:53
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@github-actions

github-actions bot commented Mar 16, 2026

Eval Results

1 tests   1 ✅  19s ⏱️
1 suites  0 💤
1 files    0 ❌

Results for commit 1bf30b6.

♻️ This comment has been updated with latest results.

@ganeshdipdumbare
Contributor Author

@pjcdawkins @romainneutron This is the sample PR I have created.
The changes are:

  1. Installed the plugin instead of copy-pasting the skills
  2. Added the tool correctness metric to check whether the skill was actually called
  3. Made sure Claude Code is not failing silently
  4. Added the eval results to the PR

I will remove the newly added skill, since it was only for testing, and make sure the existing upsun skill is used; currently it is not, because MCP takes over before it.
