Bug fix of specs/implementations in benchmark#67
Conversation
… 1, 6, 9, 13, 18, 20, 31, 32, 36, 73, 78, 82, 101, 108)
Thank you so much for your evaluations and possible fixes. I will go through your PR and will get the changes merged in a few days. Since you mentioned finding some proofs, let me know if you are interested in being added to our leaderboard here: https://trishullab.github.io/clever/leaderboard.html Thank you so much for putting in the hard work. I will go through your PR ASAP. |
There was a problem hiding this comment.
Pull request overview
Fixes multiple benchmark entries where the written specs didn’t match the intended behavior / reference implementations (per issue #65), improving solvability and correctness of the Lean4 benchmark problems.
Changes:
- Adjusts several specs to align with intended semantics (e.g., prime checks require
2 ≤ n, palindrome-change spec, closest-pair duplicate handling, empty-substring case). - Fixes/updates a few implementations to match the corrected specs (notably rolling max initialization, factorization loop, polynomial root finding).
- Updates minor API usages and indexing forms (e.g.,
Nat.digits, list indexing notation).
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| src/lean4/sample_examples/problem_0.lean | Updates paren-balance implementation to detect excess closing parens via Option Nat. |
| src/lean4/human_eval/problem_1.lean | Adds precondition requiring filtered input parentheses to be balanced before specifying behavior. |
| src/lean4/human_eval/problem_6.lean | Aligns result type with count_max_paren_depth by switching from List Int to List Nat. |
| src/lean4/human_eval/problem_9.lean | Fixes rolling-max initialization for non-empty lists. |
| src/lean4/human_eval/problem_13.lean | Tightens GCD-like spec (nonnegative result + divisibility characterization). |
| src/lean4/human_eval/problem_18.lean | Refines substring-occurrence spec (including empty-substring case) and occurrence definition. |
| src/lean4/human_eval/problem_19.lean | Fixes sortedness helper initial flag; minor list-head access change. |
| src/lean4/human_eval/problem_20.lean | Refines “closest elements” spec for duplicates and distinct-pair minimization; indexing cleanup. |
| src/lean4/human_eval/problem_25.lean | Fixes factorization loop bounds/state and appends remaining prime factor. |
| src/lean4/human_eval/problem_31.lean | Aligns primality spec with Nat.Prime convention (2 ≤ n ∧ ...). |
| src/lean4/human_eval/problem_32.lean | Replaces non-robust Newton iteration with bounded bisection approach. |
| src/lean4/human_eval/problem_36.lean | Corrects off-by-one references in spec (n-1 vs n). |
| src/lean4/human_eval/problem_73.lean | Rewrites spec to characterize minimum changes needed to reach a palindrome. |
| src/lean4/human_eval/problem_78.lean | Adjusts spec to count prime hex digits (1 per digit) and constrains to valid hex range. |
| src/lean4/human_eval/problem_82.lean | Aligns prime-length spec with Nat.Prime convention (2 ≤ n ∧ ...). |
| src/lean4/human_eval/problem_101.lean | Fixes word-splitting implementation (comma normalization) and strengthens spec invariants. |
| src/lean4/human_eval/problem_108.lean | Updates digit-sum helper to use Nat.digits. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
amit9oct
left a comment
There was a problem hiding this comment.
I have reviewed some problems (problem_{1,101, 108,13,18,19,20}.lean in human_eval). Will review more problems in a few days. Once again, thank you for raising this PR, there are some good catches, but there are some changes that don't lead to correct specification (I have provided details in my comments).
There was a problem hiding this comment.
@barabbs Thank you very much for the contribution! I’ve reviewed the specifications, and most of the changes look correct. The only issue I noticed is in src/lean4/human_eval/problem_73.lean.
We really appreciate the effort from the community in helping maintain and improve the benchmark. If you have run any evaluations with your approach or models, please feel free to share them—we would be happy to update the leaderboard with your results.
amit9oct
left a comment
There was a problem hiding this comment.
@barabbs thank you so much for your contribution and interest in our benchmark. Most of your specification changes are correct except for problem 101. Please do those minor changes as discussed, and I will approve the PR.
Again, thank you so much for your fixes. You mentioned finding some hard proofs. Please DM me your results. I would love to update our leaderboard here: https://trishullab.github.io/clever/leaderboard.html
|
@amit9oct It says that there's still 1 workflow awaiting approval, dosen't allow me to merge yet |
Approved the workflow. Once this gets merged I will update the PyPi package version and do a new release. |
@barabbs The build is failing for problem 32, see the error message: ^ Ran this on my machine; somehow, the logs from the GitHub action are not available, so I built on my machine. |
|
@amit9oct I'll look into these issues today/tomorrow and see if I can propose a fix |
|
@amit9oct For problem 32 I suggest fixing it with -- start_def test_cases
#test |implementation [1, 2] - (-1/2)| ≤ 1/1000000
#test |implementation [-6, 11, -6, 1] - 3| ≤ 1/1000000
-- end_def test_casesThe other issue is with problem 82, the error being on line 89 in the correctness proof. The proof is incomplete anyway, a simple fix could be adding If this sounds good I will commit these edits |
|
|
@amit9oct I have implemented the fixes. For problem 32 I have quickly added support for multiple roots with #test |implementation [-6, 11, -6, 1] - 1| ≤ 1/1000000 ∨ |implementation [-6, 11, -6, 1] - 2| ≤ 1/1000000 ∨ |implementation [-6, 11, -6, 1] - 3| ≤ 1/1000000This may work as placeholder for now. Other ways to implement this may be:
In any case, everything now compiles correctly on my machine. If this sounds good I would ask you to approve the workflow again, and I'll then merge :) |
I have approved. You can merge as soon as the build succeeds. |
|
@amit9oct all checks passed, but I cannot merge the PR myself (as I believe I don't have the permissions) |
|
@amit9oct Thank you! |
Bumping the version for the new release after fixes are done in #67
Bug fixes addressing errors listed in issue #65
In my tests, correct proofs were found for the following entries after fixing, so I am confident that the issues have been
sample_examples
human_eval
The human_eval entries problem_19.lean and problem_25.lean had no successful proof, but no system I tested was able to find counterexamples or proof of the negations after the fix, so the original issues should have been addressed at least.
I can provide more details for each of the fixes, if needed.