Bug fix of specs/implementations in benchmark by barabbs · Pull Request #67 · trishullab/clever

barabbs · 2026-03-05T15:58:02Z

Bug fixes addressing errors listed in issue #65

In my tests, correct proofs were found for the following entries after fixing, so I am confident that the issues have been

sample_examples

problem_0.lean

human_eval

The human_eval entries problem_19.lean and problem_25.lean had no successful proof, but no system I tested was able to find counterexamples or proof of the negations after the fix, so the original issues should have been addressed at least.

I can provide more details for each of the fixes, if needed.

… 1, 6, 9, 13, 18, 20, 31, 32, 36, 73, 78, 82, 101, 108)

amit9oct · 2026-03-05T16:05:30Z

Bug fixes addressing errors listed in issue #65

In my tests, correct proofs were found for the following entries after fixing, so I am confident that the issues have been
..........

Thank you so much for your evaluations and possible fixes. I will go through your PR and will get the changes merged in a few days.

Since you mentioned finding some proofs, let me know if you are interested in being added to our leaderboard here: https://trishullab.github.io/clever/leaderboard.html

Thank you so much for putting in the hard work. I will go through your PR ASAP.

Copilot

Pull request overview

Fixes multiple benchmark entries where the written specs didn’t match the intended behavior / reference implementations (per issue #65), improving solvability and correctness of the Lean4 benchmark problems.

Changes:

Adjusts several specs to align with intended semantics (e.g., prime checks require 2 ≤ n, palindrome-change spec, closest-pair duplicate handling, empty-substring case).
Fixes/updates a few implementations to match the corrected specs (notably rolling max initialization, factorization loop, polynomial root finding).
Updates minor API usages and indexing forms (e.g., Nat.digits, list indexing notation).

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
src/lean4/sample_examples/problem_0.lean	Updates paren-balance implementation to detect excess closing parens via `Option Nat`.
src/lean4/human_eval/problem_1.lean	Adds precondition requiring filtered input parentheses to be balanced before specifying behavior.
src/lean4/human_eval/problem_6.lean	Aligns result type with `count_max_paren_depth` by switching from `List Int` to `List Nat`.
src/lean4/human_eval/problem_9.lean	Fixes rolling-max initialization for non-empty lists.
src/lean4/human_eval/problem_13.lean	Tightens GCD-like spec (nonnegative result + divisibility characterization).
src/lean4/human_eval/problem_18.lean	Refines substring-occurrence spec (including empty-substring case) and occurrence definition.
src/lean4/human_eval/problem_19.lean	Fixes sortedness helper initial flag; minor list-head access change.
src/lean4/human_eval/problem_20.lean	Refines “closest elements” spec for duplicates and distinct-pair minimization; indexing cleanup.
src/lean4/human_eval/problem_25.lean	Fixes factorization loop bounds/state and appends remaining prime factor.
src/lean4/human_eval/problem_31.lean	Aligns primality spec with `Nat.Prime` convention (`2 ≤ n ∧ ...`).
src/lean4/human_eval/problem_32.lean	Replaces non-robust Newton iteration with bounded bisection approach.
src/lean4/human_eval/problem_36.lean	Corrects off-by-one references in spec (`n-1` vs `n`).
src/lean4/human_eval/problem_73.lean	Rewrites spec to characterize minimum changes needed to reach a palindrome.
src/lean4/human_eval/problem_78.lean	Adjusts spec to count prime hex digits (1 per digit) and constrains to valid hex range.
src/lean4/human_eval/problem_82.lean	Aligns prime-length spec with `Nat.Prime` convention (`2 ≤ n ∧ ...`).
src/lean4/human_eval/problem_101.lean	Fixes word-splitting implementation (comma normalization) and strengthens spec invariants.
src/lean4/human_eval/problem_108.lean	Updates digit-sum helper to use `Nat.digits`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

src/lean4/human_eval/problem_9.lean

src/lean4/human_eval/problem_25.lean

src/lean4/human_eval/problem_18.lean

src/lean4/human_eval/problem_32.lean

src/lean4/sample_examples/problem_0.lean

amit9oct

I have reviewed some problems (problem_{1,101, 108,13,18,19,20}.lean in human_eval). Will review more problems in a few days. Once again, thank you for raising this PR, there are some good catches, but there are some changes that don't lead to correct specification (I have provided details in my comments).

src/lean4/human_eval/problem_1.lean

src/lean4/human_eval/problem_101.lean

src/lean4/human_eval/problem_108.lean

src/lean4/human_eval/problem_13.lean

amit9oct

@barabbs Thank you very much for the contribution! I’ve reviewed the specifications, and most of the changes look correct. The only issue I noticed is in src/lean4/human_eval/problem_73.lean.

We really appreciate the effort from the community in helping maintain and improve the benchmark. If you have run any evaluations with your approach or models, please feel free to share them—we would be happy to update the leaderboard with your results.

src/lean4/human_eval/problem_18.lean

src/lean4/human_eval/problem_32.lean

src/lean4/human_eval/problem_36.lean

src/lean4/human_eval/problem_6.lean

src/lean4/human_eval/problem_73.lean

src/lean4/sample_examples/problem_0.lean

src/lean4/human_eval/problem_108.lean

src/lean4/human_eval/problem_101.lean

amit9oct

@barabbs thank you so much for your contribution and interest in our benchmark. Most of your specification changes are correct except for problem 101. Please do those minor changes as discussed, and I will approve the PR.

Again, thank you so much for your fixes. You mentioned finding some hard proofs. Please DM me your results. I would love to update our leaderboard here: https://trishullab.github.io/clever/leaderboard.html

src/lean4/human_eval/problem_32.lean

src/lean4/human_eval/problem_73.lean

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

amit9oct

@barabbs : Thank you so much for all the good work. I have approved your PR. Feel free to merge it.

barabbs · 2026-03-18T15:27:32Z

@amit9oct It says that there's still 1 workflow awaiting approval, dosen't allow me to merge yet

amit9oct · 2026-03-18T15:38:54Z

@amit9oct It says that there's still 1 workflow awaiting approval, dosen't allow me to merge yet

Approved the workflow. Once this gets merged I will update the PyPi package version and do a new release.

amit9oct · 2026-03-19T07:02:54Z

@amit9oct It says that there's still 1 workflow awaiting approval, dosen't allow me to merge yet

Approved the workflow. Once this gets merged I will update the PyPi package version and do a new release.

@barabbs The build is failing for problem 32, see the error message:

error: human_eval/problem_32.lean:93:0: ===================
Found a counter-example!
issue: (-1048575 : Rat)/2097152 = (-1 : Rat)/2 does not hold
(0 shrinks)
-------------------
error: human_eval/problem_32.lean:94:0: ===================
Found a counter-example!
issue: 3 = 1 does not hold
(0 shrinks)
-------------------
warning: human_eval/problem_32.lean:98:8: declaration uses 'sorry'
error: Lean exited with code 1

^ Ran this on my machine; somehow, the logs from the GitHub action are not available, so I built on my machine.
So the real problem is that there are multiple possible answers here. So we need to change the test case to actually evaluate the polynomial and see if it is close to zero. We can write a helper function in scr/lean4/Imports/AllImports.lean which takes in the list and evaluates the polynomial on the given value. For now, you can just comment on the test cases. I will add the fix before the next release.

barabbs · 2026-03-19T10:05:21Z

@amit9oct I'll look into these issues today/tomorrow and see if I can propose a fix

barabbs · 2026-03-20T17:58:18Z

@amit9oct For problem 32 I suggest fixing it with

-- start_def test_cases
#test |implementation [1, 2] - (-1/2)| ≤ 1/1000000
#test |implementation [-6, 11, -6, 1] - 3| ≤ 1/1000000
-- end_def test_cases

The other issue is with problem 82, the error being on line 89 in the correctness proof. The proof is incomplete anyway, a simple fix could be adding refine ⟨h_is_prime.1, ?_⟩ just before line 89.

If this sounds good I will commit these edits

amit9oct · 2026-03-24T00:21:47Z

@amit9oct For problem 32 I suggest fixing it with
-- start_def test_cases
#test |implementation [1, 2] - (-1/2)| ≤ 1/1000000
#test |implementation [-6, 11, -6, 1] - 3| ≤ 1/1000000
-- end_def test_cases
The other issue is with problem 82, the error being on line 89 in the correctness proof. The proof is incomplete anyway, a simple fix could be adding refine ⟨h_is_prime.1, ?_⟩ just before line 89.

If this sounds good I will commit these edits
@barabbs
For now this is fine, although we should allow multiple roots. I will fix that later.

barabbs · 2026-03-24T11:56:47Z

@amit9oct I have implemented the fixes. For problem 32 I have quickly added support for multiple roots with

#test |implementation [-6, 11, -6, 1] - 1| ≤ 1/1000000 ∨ |implementation [-6, 11, -6, 1] - 2| ≤ 1/1000000 ∨ |implementation [-6, 11, -6, 1] - 3| ≤ 1/1000000

This may work as placeholder for now. Other ways to implement this may be:

#test [root_1, root_2, ...].any fun r => |implementation [...] - r| ≤ 1/1000000 more compact
#test |evalPoly coeffs (implementation coeffs)| ≤ eps with an helper evalPoly, possibly not a good way to test as the exact roots are dynamically computed rather than provided

In any case, everything now compiles correctly on my machine. If this sounds good I would ask you to approve the workflow again, and I'll then merge :)

…lementation

amit9oct · 2026-03-24T12:18:03Z

@amit9oct I have implemented the fixes. For problem 32 I have quickly added support for multiple roots with
#test |implementation [-6, 11, -6, 1] - 1| ≤ 1/1000000 ∨ |implementation [-6, 11, -6, 1] - 2| ≤ 1/1000000 ∨ |implementation [-6, 11, -6, 1] - 3| ≤ 1/1000000
This may work as placeholder for now. Other ways to implement this may be:

#test [root_1, root_2, ...].any fun r => |implementation [...] - r| ≤ 1/1000000 more compact

#test |evalPoly coeffs (implementation coeffs)| ≤ eps with an helper evalPoly, possibly not a good way to test as the exact roots are dynamically computed rather than provided

In any case, everything now compiles correctly on my machine. If this sounds good I would ask you to approve the workflow again, and I'll then merge :)

I have approved. You can merge as soon as the build succeeds.

barabbs · 2026-03-24T12:38:13Z

@amit9oct all checks passed, but I cannot merge the PR myself (as I believe I don't have the permissions)

amit9oct · 2026-03-24T12:46:05Z

@amit9oct all checks passed, but I cannot merge the PR myself (as I believe I don't have the permissions)

@barabbs, I have merged your PR, thank you so much for your contribution. I will do a new release of the Python package.

barabbs · 2026-03-24T12:46:30Z

@amit9oct Thank you!

Bumping the version for the new release after fixes are done in #67

barabbs added 2 commits March 5, 2026 16:49

fixes of bugs in issue trishullab#65 (sample_examples: 0, human_eval:…

c462ac1

… 1, 6, 9, 13, 18, 20, 31, 32, 36, 73, 78, 82, 101, 108)

fixes of bugs in issue trishullab#65 (human_eval: 19, 25)

1085385

Copilot AI review requested due to automatic review settings March 5, 2026 15:58

Copilot started reviewing on behalf of barabbs March 5, 2026 15:58 View session

Copilot AI reviewed Mar 5, 2026

View reviewed changes

amit9oct reviewed Mar 7, 2026

View reviewed changes

src/lean4/human_eval/problem_1.lean Outdated Show resolved Hide resolved

src/lean4/human_eval/problem_101.lean Outdated Show resolved Hide resolved

src/lean4/human_eval/problem_108.lean Show resolved Hide resolved

src/lean4/human_eval/problem_13.lean Show resolved Hide resolved

amit9oct reviewed Mar 13, 2026

View reviewed changes

amit9oct reviewed Mar 16, 2026

View reviewed changes

src/lean4/human_eval/problem_32.lean Show resolved Hide resolved

src/lean4/human_eval/problem_32.lean Show resolved Hide resolved

src/lean4/human_eval/problem_73.lean Show resolved Hide resolved

barabbs and others added 4 commits March 17, 2026 11:19

fix typo

8b20e9d

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

separated proof of termination in human_eval/problem_1

077f218

human_eval/problem_6 fixed Int to Nat in generated_spec

3b07ea1

fixed wrong spec in human_eval/problem_101

bb12aea

amit9oct approved these changes Mar 18, 2026

View reviewed changes

fixed errors in problem 32 and 82

16e7f98

added multiple option tests in problem_32.lean

98ec4fd

updated test case in problem_details to reflect output of current imp…

0652a09

…lementation

amit9oct merged commit 5a178ba into trishullab:main Mar 24, 2026
1 check passed

amit9oct added a commit that referenced this pull request Mar 24, 2026

Bump version from 1.4.0 to 1.5.0

b2c19a5

Bumping the version for the new release after fixes are done in #67

amit9oct mentioned this pull request Mar 25, 2026

Potential errors in some benchmark entries #65

Closed

Conversation

barabbs commented Mar 5, 2026

Uh oh!

amit9oct commented Mar 5, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

amit9oct left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

amit9oct left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

amit9oct left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

amit9oct left a comment

Choose a reason for hiding this comment

Uh oh!

barabbs commented Mar 18, 2026

Uh oh!

amit9oct commented Mar 18, 2026

Uh oh!

amit9oct commented Mar 19, 2026

Uh oh!

barabbs commented Mar 19, 2026

Uh oh!

barabbs commented Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

amit9oct commented Mar 24, 2026

Uh oh!

barabbs commented Mar 24, 2026

Uh oh!

amit9oct commented Mar 24, 2026

Uh oh!

barabbs commented Mar 24, 2026

Uh oh!

Uh oh!

amit9oct commented Mar 24, 2026

Uh oh!

barabbs commented Mar 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

amit9oct left a comment •

edited

Loading

barabbs commented Mar 20, 2026 •

edited

Loading