fix: software value failing for large repos [CM-1029] #3947
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.

        return int(result.stdout.split()[0])
    except Exception:
        pass
    return 0

Blocking synchronous subprocess call in async method
Medium Severity
_get_repo_size_bytes uses synchronous subprocess.run to execute du -sb, which blocks the asyncio event loop when called from the async def run method. Running du -sb on a large repository can take significant time (up to the 120-second timeout). The async run_shell_command utility is already imported in this file and used elsewhere in the codebase for the same purpose (e.g., _get_repo_size_mb in clone_service.py).
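
One way to avoid blocking the event loop is to spawn `du` through asyncio's own subprocess API. The sketch below is an illustration only: the function name `get_repo_size_bytes_async` and its `timeout` parameter are hypothetical, and it uses the standard library rather than the project's `run_shell_command` helper, whose signature is not shown in this PR.

```python
import asyncio


async def get_repo_size_bytes_async(repo_path: str, timeout: float = 120.0) -> int:
    """Return the byte count reported by `du -sb`, or 0 if it cannot be determined.

    Hypothetical helper: name, signature, and the 0 fallback are assumptions.
    """
    try:
        proc = await asyncio.create_subprocess_exec(
            "du", "-sb", repo_path,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
    except OSError:
        # `du` is missing or could not be executed.
        return 0

    try:
        # Waiting here yields to the event loop instead of blocking it.
        stdout, _ = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        try:
            proc.kill()
        except ProcessLookupError:
            pass  # The process exited between the timeout and the kill.
        await proc.wait()
        return 0

    if proc.returncode != 0:
        return 0
    try:
        return int(stdout.decode().split()[0])
    except (IndexError, ValueError):
        return 0
```

Logging of the failure paths is left out here to keep the sketch short.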
Pull request overview
This PR improves reliability of the software value calculation for very large repositories by adding an option to skip very large files during SCC analysis, and enabling that option automatically for large repos in the Python worker.
Changes:
- Added a `--no-large` CLI flag to the Go `software-value` binary and propagated it through SCC invocations.
- Updated Python `SoftwareValueService` to compute repository disk usage and automatically add `--no-large` for repos ≥ 10GB.
- Minor robustness/clarity improvements (usage text, error message cleanup).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| services/apps/git_integration/src/crowdgit/services/software_value/software_value_service.py | Adds repo-size detection and conditionally appends --no-large to the binary invocation. |
| services/apps/git_integration/src/crowdgit/services/software_value/main.go | Introduces --no-large flag and passes it through to SCC execution (including large-file threshold args). |

    except Exception:
        pass
    return 0

_get_repo_size_bytes() silently swallows all exceptions and returns 0. If du fails (e.g., not present, permission issues, timeout), the service will skip enabling --no-large without any signal. Please log a warning (and consider catching narrower exceptions / including stderr) so operators can tell when the size check is not working.
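
A minimal sketch of what this suggestion could look like, assuming a module-level `logger` and a hypothetical `get_repo_size_bytes` function (the real helper and logger in the service may differ). It catches narrower exceptions and includes `stderr` in the warning:

```python
import logging
import subprocess

logger = logging.getLogger(__name__)  # assumed logger; the service may use its own


def get_repo_size_bytes(repo_path: str, timeout: float = 120.0) -> int:
    """Return the size reported by `du -sb`, or 0 (with a warning) on failure."""
    try:
        result = subprocess.run(
            ["du", "-sb", repo_path],
            capture_output=True, text=True, timeout=timeout, check=True,
        )
        return int(result.stdout.split()[0])
    except subprocess.CalledProcessError as exc:
        # Non-zero exit status, e.g. missing path or permission problems.
        logger.warning(
            "du failed for %s (exit %s): %s",
            repo_path, exc.returncode, (exc.stderr or "").strip(),
        )
    except subprocess.TimeoutExpired:
        logger.warning("du timed out after %ss for %s", timeout, repo_path)
    except (OSError, IndexError, ValueError) as exc:
        # `du` not installed, or its output could not be parsed.
        logger.warning("could not determine size of %s: %s", repo_path, exc)
    return 0
```

Returning 0 keeps the original fallback behaviour, but the warning lets operators see that the size check did not run.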

    repo_size = _get_repo_size_bytes(repo_path)
    if repo_size >= _LARGE_REPO_THRESHOLD_BYTES:
        self.logger.info(
            f"Repo size {repo_size / (1024**3):.1f} GB exceeds threshold — "
            "running scc with no-large (skipping files >100MB)"
        )
        cmd += ["--no-large"]

New behavior adds conditional --no-large based on repo size, but there are no tests asserting the command built for large vs small repos. Since this is reliability-critical logic, please add unit coverage (e.g., mock _get_repo_size_bytes / run_shell_command and assert --no-large is included only when expected).
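
As a rough illustration of such coverage, the sketch below pulls the command construction into a pure function and asserts on it. `build_command`, its argument order, and the threshold value `10 * 1024**3` are assumptions made for the example, not the service's actual code.

```python
# Assumed value for the 10GB threshold; the real constant may be defined differently.
LARGE_REPO_THRESHOLD_BYTES = 10 * 1024**3


def build_command(binary_path: str, repo_path: str, repo_size_bytes: int) -> list:
    """Hypothetical helper: build the software-value invocation for a repo."""
    cmd = [binary_path, repo_path]
    if repo_size_bytes >= LARGE_REPO_THRESHOLD_BYTES:
        cmd.append("--no-large")
    return cmd


def test_small_repo_omits_no_large():
    cmd = build_command("software-value", "/repos/small", 1024)
    assert "--no-large" not in cmd


def test_repo_at_threshold_adds_no_large():
    cmd = build_command("software-value", "/repos/big", LARGE_REPO_THRESHOLD_BYTES)
    assert cmd[-1] == "--no-large"


def test_repo_just_below_threshold_omits_no_large():
    cmd = build_command("software-value", "/repos/edge", LARGE_REPO_THRESHOLD_BYTES - 1)
    assert "--no-large" not in cmd
```

Keeping the decision in a pure function means the tests need no mocking of `du` or of the shell runner; only the boundary values matter.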

    if noLarge {
        cmdArgs = append(cmdArgs, "--no-large", "--large-byte-count", "100000000")
    }

runSCC hard-codes the large-file threshold as the string literal "100000000". To reduce magic numbers and keep the threshold consistent across Go/Python (and with the flag help text), consider extracting this into a named constant (and potentially expressing it as `100*1000*1000` or `100*1024*1024` explicitly).
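
The two spellings are not the same number, which is part of why a named constant helps. The arithmetic is shown in Python for brevity (the actual change would live in `main.go`); the constant and function names below are hypothetical.

```python
# Decimal megabytes: this is the value the literal "100000000" encodes.
LARGE_FILE_BYTES_DECIMAL = 100 * 1000 * 1000

# Binary mebibytes: about 4.9% larger than the decimal value.
LARGE_FILE_BYTES_BINARY = 100 * 1024 * 1024


def scc_large_file_args(threshold_bytes: int = LARGE_FILE_BYTES_DECIMAL) -> list:
    """Build the scc arguments from the named constant instead of a string literal."""
    return ["--no-large", "--large-byte-count", str(threshold_bytes)]
```

With the default, `scc_large_file_args()` produces the same three arguments that `runSCC` appends today.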

    For repos larger than 10 GB, scc is run with minimum parallelism (1 worker)
    to avoid OOM; results are identical.

The docstring says that for repos >10GB “scc is run with minimum parallelism (1 worker)”, but the actual behavior added below is enabling the --no-large flag (skip files >100MB). Please update the docstring to reflect the real behavior (or implement the parallelism change if that’s what was intended).

    - For repos larger than 10 GB, scc is run with minimum parallelism (1 worker)
    - to avoid OOM; results are identical.
    + For repos larger than 10 GB, scc is run with the --no-large flag (skipping files >100MB)
    + to reduce memory usage and avoid OOM errors.


        result = subprocess.run(
            ["du", "-sb", repo_path], capture_output=True, text=True, timeout=120
        )

_get_repo_size_bytes() uses subprocess.run(...) inside an async service; this is a blocking call and can stall the event loop (and other concurrent repo processing). Consider using the existing async run_shell_command() helper (or asyncio.to_thread) to run du without blocking.
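
The `asyncio.to_thread` option can be sketched as a thin wrapper that leaves the blocking code unchanged and moves it to a worker thread. The function names here are hypothetical, `asyncio.to_thread` requires Python 3.9 or later, and failure logging is omitted to keep the example short.

```python
import asyncio
import subprocess


def du_size_bytes_blocking(repo_path: str, timeout: float = 120.0) -> int:
    """Blocking helper: size reported by `du -sb`, or 0 on any failure."""
    try:
        result = subprocess.run(
            ["du", "-sb", repo_path],
            capture_output=True, text=True, timeout=timeout, check=True,
        )
        return int(result.stdout.split()[0])
    except (subprocess.SubprocessError, OSError, IndexError, ValueError):
        return 0


async def du_size_bytes(repo_path: str) -> int:
    """Run the blocking helper in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(du_size_bytes_blocking, repo_path)
```

Inside the async `run` method the call would then be awaited, for example `repo_size = await du_size_bytes(repo_path)`.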


This pull request adds support for handling very large repositories in the software value calculation service. The main change is the introduction of a `--no-large` flag that, when enabled, skips files larger than 100MB to prevent out-of-memory errors during analysis. The Python service now automatically enables this flag for repositories larger than 10GB, improving reliability for large codebases. Several functions in the Go codebase are updated to propagate and handle this flag.

Large repository handling:
- Added a `--no-large` command-line flag to the Go binary (`main.go`) to skip files larger than 100MB when analyzing repositories, preventing OOM errors on large repos. This flag is propagated through all relevant functions and passed to the `scc` tool.
- In the Python service (`software_value_service.py`), added logic to check the repository size before running the Go binary. If the repo is larger than 10GB, the `--no-large` flag is automatically added to the command invocation.

Code robustness and clarity:
- […] `du -sb`.

Note

Medium Risk
Adds a new execution path that changes how `scc` is invoked for large repositories (skipping >100MB files), which could slightly alter metrics for repos containing large generated/binary files; otherwise changes are localized to this service wrapper.

Overview
Prevents `software-value` from failing on very large repos by adding a `--no-large` flag to the Go binary and threading it through `processRepository` → `getSCCReport` → `runSCC`, which invokes `scc` with `--no-large` and a 100MB cutoff.

Updates `SoftwareValueService` to measure repo disk usage via `du -sb` and automatically append `--no-large` when the repo is ≥10GB, plus minor usage/error-message fixes.

Written by Cursor Bugbot for commit 76e7d30.