
fix: software value failing for large repos [CM-1029]#3947

Open
mbani01 wants to merge 8 commits into main from fix/software_value_failure
Conversation

Contributor

@mbani01 mbani01 commented Mar 23, 2026

This pull request adds support for handling very large repositories in the software value calculation service. The main change is the introduction of a --no-large flag that, when enabled, skips files larger than 100MB to prevent out-of-memory errors during analysis. The Python service now automatically enables this flag for repositories larger than 10GB, improving reliability for large codebases. Several functions in the Go codebase are updated to propagate and handle this flag.

Large repository handling:

  • Added a --no-large command-line flag to the Go binary (main.go) to skip files larger than 100MB when analyzing repositories, preventing OOM errors on large repos. This flag is propagated through all relevant functions and passed to the scc tool.
  • In the Python service (software_value_service.py), added logic to check the repository size before running the Go binary. If the repo is larger than 10GB, the --no-large flag is automatically added to the command invocation.

Code robustness and clarity:

  • Improved error handling and messaging in several places, including fixing a typo in an error message and updating usage instructions to reflect the new flag.
  • Added a helper function in Python to determine repository size using du -sb.
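The size-check helper described above can be sketched as a small Python function. This is a reconstruction based on the description and the quoted review snippets, not the PR's literal code; the function name is an assumption.

```python
import subprocess


def get_repo_size_bytes(repo_path: str) -> int:
    """Return the repository's disk usage in bytes via `du -sb`, or 0 on failure."""
    try:
        result = subprocess.run(
            ["du", "-sb", repo_path],
            capture_output=True,
            text=True,
            timeout=120,
        )
        # `du -sb` prints "<bytes>\t<path>"; the first field is the size.
        return int(result.stdout.split()[0])
    except Exception:
        # The PR's helper falls back to 0 when the size cannot be determined
        # (a behavior the reviewers flag below).
        return 0
```

Note that `du -sb` is a GNU coreutils invocation, so this helper assumes a Linux environment.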

Note

Medium Risk
Adds new execution path that changes how scc is invoked for large repositories (skipping >100MB files), which could slightly alter metrics for repos containing large generated/binary files; otherwise changes are localized to this service wrapper.

Overview
Prevents software-value from failing on very large repos by adding a --no-large flag to the Go binary and threading it through processRepository → getSCCReport → runSCC, which invokes scc with --no-large and a 100MB cutoff.

Updates SoftwareValueService to measure repo disk usage via du -sb and automatically append --no-large when the repo is ≥10GB, plus minor usage/error-message fixes.
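The threshold logic above can be sketched as follows, assuming a 10 GB binary-prefix threshold and hypothetical names (the PR's actual constants and function layout may differ):

```python
LARGE_REPO_THRESHOLD_BYTES = 10 * 1024**3  # 10 GB; assumed value of the PR's threshold


def build_binary_cmd(binary_path: str, repo_path: str, repo_size_bytes: int) -> list[str]:
    """Build the Go binary invocation, appending --no-large for big repos."""
    cmd = [binary_path, repo_path]
    if repo_size_bytes >= LARGE_REPO_THRESHOLD_BYTES:
        # The Go binary then runs scc with --no-large, skipping files >100MB.
        cmd.append("--no-large")
    return cmd
```

Under this sketch, a repo measured at 12 GB gets the flag and a 2 GB repo does not.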

Written by Cursor Bugbot for commit 76e7d30. This will update automatically on new commits.

Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

mbani01 added 6 commits March 23, 2026 16:30
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
@mbani01 mbani01 changed the title chore: allow dry-run for non-prod envs fix: software value failing for large repos [CM-1029] Mar 24, 2026
@mbani01 mbani01 marked this pull request as ready for review March 24, 2026 17:26
Copilot AI review requested due to automatic review settings March 24, 2026 17:26

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


return int(result.stdout.split()[0])
except Exception:
pass
return 0

Blocking synchronous subprocess call in async method

Medium Severity

_get_repo_size_bytes uses synchronous subprocess.run to execute du -sb, which blocks the asyncio event loop when called from the async def run method. Running du -sb on a large repository can take significant time (up to the 120-second timeout). The async run_shell_command utility is already imported in this file and used elsewhere in the codebase for the same purpose (e.g., _get_repo_size_mb in clone_service.py).
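One way to apply the suggested fix, sketched with the stdlib `asyncio.to_thread`. The in-repo alternative would be the `run_shell_command` helper the reviewer mentions, but its signature is not shown here, so this sketch sticks to the standard library; names are assumptions.

```python
import asyncio
import subprocess


def _du_size_bytes(repo_path: str) -> int:
    # Blocking `du -sb` call; safe to run in a worker thread.
    try:
        out = subprocess.run(
            ["du", "-sb", repo_path], capture_output=True, text=True, timeout=120
        ).stdout
        return int(out.split()[0])
    except Exception:
        return 0


async def get_repo_size_bytes(repo_path: str) -> int:
    # Off-load the blocking subprocess call so the event loop stays responsive.
    return await asyncio.to_thread(_du_size_bytes, repo_path)
```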

Additional Locations (1)

Contributor

Copilot AI left a comment


Pull request overview

This PR improves reliability of the software value calculation for very large repositories by adding an option to skip very large files during SCC analysis, and enabling that option automatically for large repos in the Python worker.

Changes:

  • Added a --no-large CLI flag to the Go software-value binary and propagated it through SCC invocations.
  • Updated Python SoftwareValueService to compute repository disk usage and automatically add --no-large for repos ≥ 10GB.
  • Minor robustness/clarity improvements (usage text, error message cleanup).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
services/apps/git_integration/src/crowdgit/services/software_value/software_value_service.py Adds repo-size detection and conditionally appends --no-large to the binary invocation.
services/apps/git_integration/src/crowdgit/services/software_value/main.go Introduces --no-large flag and passes it through to SCC execution (including large-file threshold args).


Comment on lines +23 to +25
except Exception:
pass
return 0

Copilot AI Mar 24, 2026


_get_repo_size_bytes() silently swallows all exceptions and returns 0. If du fails (e.g., not present, permission issues, timeout), the service will skip enabling --no-large without any signal. Please log a warning (and consider catching narrower exceptions / including stderr) so operators can tell when the size check is not working.
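A sketch of what this reviewer is asking for, with narrower exceptions and a warning log that includes stderr; the function name and log wording are hypothetical stand-ins:

```python
import logging
import subprocess

logger = logging.getLogger(__name__)


def get_repo_size_bytes(repo_path: str) -> int:
    """Size via `du -sb`; logs a warning instead of failing silently."""
    try:
        result = subprocess.run(
            ["du", "-sb", repo_path], capture_output=True, text=True, timeout=120
        )
        if result.returncode != 0:
            # Surface du failures (missing path, permissions) to operators.
            logger.warning("du -sb failed for %s: %s", repo_path, result.stderr.strip())
            return 0
        return int(result.stdout.split()[0])
    except (OSError, subprocess.TimeoutExpired, ValueError, IndexError) as exc:
        logger.warning("repo size check failed for %s: %s", repo_path, exc)
        return 0
```

With this shape, a failed size check still falls back to 0 (so --no-large is simply not added), but the failure is visible in the logs.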

Copilot uses AI. Check for mistakes.
Comment on lines +51 to +57
repo_size = _get_repo_size_bytes(repo_path)
if repo_size >= _LARGE_REPO_THRESHOLD_BYTES:
self.logger.info(
f"Repo size {repo_size / (1024**3):.1f} GB exceeds threshold — "
"running scc with no-large (skipping files >100MB)"
)
cmd += ["--no-large"]

Copilot AI Mar 24, 2026


New behavior adds conditional --no-large based on repo size, but there are no tests asserting the command built for large vs small repos. Since this is reliability-critical logic, please add unit coverage (e.g., mock _get_repo_size_bytes / run_shell_command and assert --no-large is included only when expected).
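The requested coverage could look roughly like this. The function under test is a hypothetical stand-in for the service's command construction, since the PR's real module layout is not shown:

```python
import unittest
from unittest import mock

LARGE_REPO_THRESHOLD_BYTES = 10 * 1024**3  # assumed 10 GB threshold


def build_scc_cmd(repo_path: str, size_fn) -> list[str]:
    # Stand-in for the service's command construction; size_fn is injected
    # so tests can avoid touching the filesystem.
    cmd = ["software-value", repo_path]
    if size_fn(repo_path) >= LARGE_REPO_THRESHOLD_BYTES:
        cmd.append("--no-large")
    return cmd


class NoLargeFlagTest(unittest.TestCase):
    def test_flag_added_for_large_repo(self):
        size_fn = mock.Mock(return_value=11 * 1024**3)
        self.assertIn("--no-large", build_scc_cmd("/repo", size_fn))

    def test_flag_omitted_for_small_repo(self):
        size_fn = mock.Mock(return_value=500 * 1024**2)
        self.assertNotIn("--no-large", build_scc_cmd("/repo", size_fn))


if __name__ == "__main__":
    unittest.main()
```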

Comment on lines +218 to +220
if noLarge {
cmdArgs = append(cmdArgs, "--no-large", "--large-byte-count", "100000000")
}

Copilot AI Mar 24, 2026


runSCC hard-codes the large-file threshold as the string literal "100000000". To reduce magic numbers and keep the threshold consistent across Go/Python (and with the flag help text), consider extracting this into a named constant (and potentially expressing it as 100*1000*1000 or 100*1024*1024 explicitly).

Comment on lines +40 to +41
For repos larger than 10 GB, scc is run with minimum parallelism (1 worker)
to avoid OOM; results are identical.

Copilot AI Mar 24, 2026


The docstring says that for repos >10GB “scc is run with minimum parallelism (1 worker)”, but the actual behavior added below is enabling the --no-large flag (skip files >100MB). Please update the docstring to reflect the real behavior (or implement the parallelism change if that’s what was intended).

Suggested change
For repos larger than 10 GB, scc is run with minimum parallelism (1 worker)
to avoid OOM; results are identical.
For repos larger than 10 GB, scc is run with the --no-large flag (skipping files >100MB)
to reduce memory usage and avoid OOM errors.

Comment on lines +18 to +20
result = subprocess.run(
["du", "-sb", repo_path], capture_output=True, text=True, timeout=120
)

Copilot AI Mar 24, 2026


_get_repo_size_bytes() uses subprocess.run(...) inside an async service; this is a blocking call and can stall the event loop (and other concurrent repo processing). Consider using the existing async run_shell_command() helper (or asyncio.to_thread) to run du without blocking.
