Expand evals to 20 and improve SKILL.md diagnostic coverage #50

CybotTM merged 1 commit into main from feature/evals-and-improvements on Apr 1, 2026

CybotTM (Member) commented Apr 1, 2026

Summary

  • Expanded evals from 2 to 20, covering all major skill domains: auto-merge, branch protection, security compliance, merge strategy, CodeQL, review threads, merge queues, Copilot race conditions, and workflow file limitations
  • Improved SKILL.md with new "Security & Compliance Quick Checks" and "Merge Strategy Issues" sections plus expanded "When to Use" triggers
  • SKILL.md stays under 500 words (487 total)

A/B Test Results

| Eval | Name | A (Original) | B (Improved) |
|---|---|---|---|
| 1 | setup_branch_protection | PARTIAL | PASS |
| 2 | fix_blocked_pr_merge | PASS | PASS |
| 3 | setup_auto_merge_workflow | PASS | PASS |
| 4 | diagnose_auto_merge_failure | PASS | PASS |
| 5 | solo_maintainer_pr_stuck | PASS | PASS |
| 6 | setup_codeowners | PARTIAL | PARTIAL |
| 7 | fix_github_actions_failure | PASS | PASS |
| 8 | migrate_master_to_main | FAIL | FAIL |
| 9 | setup_dependabot | FAIL | FAIL |
| 10 | codeql_default_setup_conflict | FAIL | PASS |
| 11 | signed_commits_merge_failure | FAIL | PASS |
| 12 | pr_too_many_commits | FAIL | PARTIAL |
| 13 | enforce_admins_audit | FAIL | PASS |
| 14 | resolve_review_threads | FAIL | PASS |
| 15 | openssf_scorecard_improvement | FAIL | PARTIAL |
| 16 | workflow_permissions_least_privilege | FAIL | PARTIAL |
| 17 | setup_release_labeling | FAIL | PARTIAL |
| 18 | merge_queue_troubleshooting | FAIL | PARTIAL |
| 19 | copilot_reviewer_race_condition | FAIL | PASS |
| 20 | workflow_file_pr_cannot_merge | FAIL | PASS |

Score: A = 12/40, B = 29/40 (+142%)
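
The +142% figure is the relative change between the two totals. A minimal sketch of that arithmetic follows; the PR does not state how PASS/PARTIAL/FAIL map to points, so the example takes the two totals as given, and the function name is hypothetical.

```python
def relative_change_percent(before: float, after: float) -> float:
    """Relative change from `before` to `after`, expressed in percent."""
    return (after - before) / before * 100.0


# Totals as reported in the PR description (out of 40).
score_a, score_b = 12, 29
print(f"{relative_change_percent(score_a, score_b):+.0f}%")  # prints +142%
```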

Remaining FAILs (8, 9) are low-priority setup tasks adequately covered by reference file links.

Test plan

  • Verify evals.json is valid JSON
  • Verify SKILL.md word count is under 500
  • Spot-check that new eval assertions match SKILL.md and reference content
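
The first two test-plan items could be automated along these lines. This is a sketch, not the project's actual check: the helper names are hypothetical, and counting whitespace-separated tokens is an assumption about how the 487-word figure was measured.

```python
import json
from pathlib import Path


def is_valid_json(path: Path) -> bool:
    """Return True if the file exists and parses as JSON."""
    try:
        json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):  # JSONDecodeError is a ValueError subclass
        return False
    return True


def word_count(path: Path) -> int:
    """Count whitespace-separated tokens, similar in spirit to `wc -w`."""
    return len(path.read_text(encoding="utf-8").split())
```

With the paths listed in the Copilot review, usage might look like `is_valid_json(Path("skills/github-project/evals/evals.json"))` and `word_count(Path("skills/github-project/SKILL.md")) < 500`.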

Add 18 new evals covering auto-merge setup, solo maintainer workflow,
CodeQL conflicts, signed commit merge failures, enforce_admins audit,
review thread resolution, OpenSSF Scorecard, merge queue issues,
Copilot reviewer race conditions, and workflow file merge limitations.

Improve SKILL.md with expanded "When to Use" triggers, new "Security &
Compliance Quick Checks" section with inline gh commands, and "Merge
Strategy Issues" section. Keeps SKILL.md under 500 words (487).

A/B test shows version B scores 29/40 vs original 12/40 (+142%).
Copilot AI review requested due to automatic review settings April 1, 2026 08:30

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request updates the GitHub Project skill documentation to include more detailed troubleshooting scenarios, such as security compliance checks, merge strategy issues, and OpenSSF Scorecard improvements. It also significantly expands the evaluation test suite to cover these new scenarios. Feedback highlights a contradiction in the advice regarding squash merges for signed commits and suggests improving the readability of a GraphQL query output using a JQ filter.

Verify repository configuration against best practices:
### Merge Strategy Issues

Rebase merge fails with signed commits: enable squash or auto-detect strategy. Workflow file PRs need manual merge (GITHUB_TOKEN lacks `workflows` scope). Copilot reviewer race conditions: re-run auto-approve workflow. See `references/auto-merge-guide.md`.

high

The recommendation to 'enable squash' for signed commit failures contradicts `references/merge-strategy.md`, which explicitly states that squash merges are incompatible with signed commits (line 115) and recommends using `--merge` instead (line 165). Conversely, `references/auto-merge-guide.md` (line 161) claims squash is preferred and compatible. Please reconcile these reference files so that the skill gives consistent and correct advice.

Comment on lines +75 to +79
gh api graphql -f query='query($owner:String!,$repo:String!,$pr:Int!){
repository(owner:$owner,name:$repo){pullRequest(number:$pr){
reviewThreads(first:50){nodes{id isResolved}}
}}
}' -f owner=OWNER -f repo=REPO -F pr=NUMBER

medium

This GraphQL query is redundant, as the 'PR Won't Merge' section (lines 35-40) already covers `reviewThreads`. If you choose to keep it here for the 'Security & Compliance' context, please add a `--jq` filter to make the output readable and consistent with other examples in this file.

Suggested change
- gh api graphql -f query='query($owner:String!,$repo:String!,$pr:Int!){
- repository(owner:$owner,name:$repo){pullRequest(number:$pr){
- reviewThreads(first:50){nodes{id isResolved}}
- }}
- }' -f owner=OWNER -f repo=REPO -F pr=NUMBER
+ gh api graphql -f query='query($owner:String!,$repo:String!,$pr:Int!){
+ repository(owner:$owner,name:$repo){pullRequest(number:$pr){
+ reviewThreads(first:50){nodes{id isResolved}}
+ }}
+ }' -f owner=OWNER -f repo=REPO -F pr=NUMBER --jq '.data.repository.pullRequest.reviewThreads.nodes'
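
To act on the query result rather than just read it, the response can be filtered for unresolved threads. The sketch below assumes the response has the shape implied by the `--jq` path above; the thread IDs in the sample payload are made up for illustration, and the function name is hypothetical.

```python
import json


def unresolved_thread_ids(response_text: str) -> list[str]:
    """Return the IDs of review threads whose isResolved flag is false."""
    data = json.loads(response_text)
    nodes = data["data"]["repository"]["pullRequest"]["reviewThreads"]["nodes"]
    return [node["id"] for node in nodes if not node["isResolved"]]


# Hypothetical payload mirroring the fields requested by the query.
sample = json.dumps({
    "data": {"repository": {"pullRequest": {"reviewThreads": {"nodes": [
        {"id": "THREAD_A", "isResolved": True},
        {"id": "THREAD_B", "isResolved": False},
    ]}}}}
})
print(unresolved_thread_ids(sample))  # prints ['THREAD_B']
```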

Copilot AI left a comment

Pull request overview

Expands the GitHub Project skill eval suite and updates SKILL.md to add more diagnostic guidance across branch protection, auto-merge, security/compliance, and merge strategy troubleshooting.

Changes:

  • Expanded evals.json from a small set to 20 evals covering major GitHub repo management/troubleshooting scenarios.
  • Updated SKILL.md “When to Use” triggers and added new diagnostic sections for security/compliance and merge strategy issues.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| skills/github-project/evals/evals.json | Adds many new eval prompts/assertions and broadens existing assertion patterns. |
| skills/github-project/SKILL.md | Refreshes usage triggers and adds new diagnostic/troubleshooting sections. |

},
{
"type": "content",
"pattern": "(contents: read|pull-requests: write|least.privilege)"

Copilot AI Apr 1, 2026

In a regex, `.` matches any character. If you intended to assert the literal phrase `least privilege` (or `least-privilege`), this pattern will also match unintended strings like `leastXprivilege`. Consider escaping the dot (`least\\.privilege`) or rewriting the alternation to match the literal wording you expect.

Suggested change
- "pattern": "(contents: read|pull-requests: write|least.privilege)"
+ "pattern": "(contents: read|pull-requests: write|least[ -]privilege)"

Copilot uses AI. Check for mistakes.
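
The difference between the two patterns can be checked directly. This is a small sketch using Python's `re.search`; how the eval harness actually applies the pattern (search vs. full match, case sensitivity) is not stated in the PR and is an assumption here.

```python
import re

# Patterns copied from the diff and from the suggested change.
ORIGINAL = r"(contents: read|pull-requests: write|least.privilege)"
SUGGESTED = r"(contents: read|pull-requests: write|least[ -]privilege)"

samples = ["least privilege", "least-privilege", "leastXprivilege"]
for text in samples:
    original_hit = bool(re.search(ORIGINAL, text))
    suggested_hit = bool(re.search(SUGGESTED, text))
    print(f"{text!r}: original={original_hit} suggested={suggested_hit}")
```

Both patterns accept the two intended spellings; only the original also accepts `leastXprivilege`.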
},
{
"type": "content",
"pattern": "(force.push|re-queue|resolveReviewThread|auto-approve)"

Copilot AI Apr 1, 2026

The `force.push` token uses `.`, which is a wildcard in regex. If the intent is to match `force-push`/`force push` wording, the pattern matches those but also over-matches any other character in that position. Escape the dot (`force\\.push`) if you truly want a literal `force.push`, or update the pattern to explicitly match the expected phrase (e.g., `force[- ]push`).

Suggested change
- "pattern": "(force.push|re-queue|resolveReviewThread|auto-approve)"
+ "pattern": "(force[- ]push|re-queue|resolveReviewThread|auto-approve)"

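
The doubled backslash in `force\\.push` is JSON string escaping, not regex syntax: the JSON decoder turns `\\` into a single backslash before the regex engine sees it. A minimal sketch, assuming the eval file is loaded with a standard JSON parser (the one-key object is a trimmed-down stand-in for a real assertion entry):

```python
import json
import re

# Raw string so the text is exactly what would appear in the JSON file.
json_text = r'{"type": "content", "pattern": "force\\.push"}'
pattern = json.loads(json_text)["pattern"]

print(pattern)                                  # prints force\.push
print(bool(re.search(pattern, "force.push")))   # prints True
print(bool(re.search(pattern, "force-push")))   # prints False

# A single backslash before the dot is not a valid JSON escape sequence.
try:
    json.loads(r'{"pattern": "force\.push"}')
except json.JSONDecodeError:
    print("single backslash rejected by the JSON parser")
```

The suggested `force[- ]push` form avoids the escaping question entirely, since it contains no backslashes.
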
Comment on lines +73 to +74
gh api repos/OWNER/REPO/branches/main/protection --jq '.enforce_admins.enabled'
gh api repos/OWNER/REPO/code-scanning/default-setup --jq '.state'

Copilot AI Apr 1, 2026

This hard-codes `main` for the branch protection check. Since one of the evals explicitly covers migrating `master` → `main`, and repos can have non-`main` default branches, it’s easy for readers to run this command against the wrong branch. Suggest using a placeholder like `DEFAULT_BRANCH` (or documenting how to obtain it via `gh repo view`) instead of `main`.
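
One way to follow this suggestion is to look up the default branch first and build the check from it. The sketch below is illustrative only: the function names are hypothetical, `default_branch()` needs an authenticated `gh` and a checked-out repository, and the endpoint and `--jq` filter are the ones from the diff.

```python
import subprocess


def default_branch() -> str:
    """Return the default branch that `gh repo view` reports for the current repo."""
    result = subprocess.run(
        ["gh", "repo", "view", "--json", "defaultBranchRef",
         "--jq", ".defaultBranchRef.name"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def enforce_admins_command(owner: str, repo: str, branch: str) -> list[str]:
    """Build the branch protection check from the diff for an arbitrary branch."""
    endpoint = f"repos/{owner}/{repo}/branches/{branch}/protection"
    return ["gh", "api", endpoint, "--jq", ".enforce_admins.enabled"]


# Placeholder values; this only builds the command and makes no network call.
print(" ".join(enforce_admins_command("OWNER", "REPO", "master")))
```

In a real run, `enforce_admins_command(owner, repo, default_branch())` would target whatever branch the repository actually uses.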

@@ -12,26 +12,26 @@ allowed-tools: Bash(gh:*) Bash(git:*) Bash(grep:*) Read Write

# GitHub Project Skill


Copilot AI Apr 1, 2026

The previous `## Overview` section heading was removed, leaving an unheaded overview sentence. If other skills/docs rely on consistent section headings for navigation or automated extraction, consider restoring `## Overview` and placing the sentence under it for consistency.

Suggested change
+ ## Overview

github-actions bot commented Apr 1, 2026

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Snapshot Warnings

⚠️: No snapshots were found for the head SHA c7d8ff0.
Ensure that dependencies are being submitted on PR branches and consider enabling retry-on-snapshot-warnings. See the documentation for more information and troubleshooting advice.

Scanned Files

None

@CybotTM CybotTM merged commit 8911bc3 into main Apr 1, 2026
11 checks passed
@CybotTM CybotTM deleted the feature/evals-and-improvements branch April 1, 2026 09:06