Conversation
* update doc * bump gpgr version
ec2700b to
8db35b3
Compare
54672c1 to
9c613d4
Compare
9c613d4 to
8223ef9
Compare
There was a problem hiding this comment.
Pull Request Overview
This PR upgrades PCGR from v1.4.1 to v2.2.1 and adds handling for hypermutated samples. The upgrade requires updating field names and annotations to match the new PCGR version's output format, while new functionality filters variants in hypermutated samples to stay within PCGR's variant count limits.
Key changes:
- Updated PCGR/CPSR field names to match v2.2.1 output (e.g.,
PCGR_TIER→PCGR_ACTIONABILITY_TIER) - Added hypermutation handling with variant filtering logic for samples exceeding 450,000 variants
- Implemented chunked PCGR processing for large variant sets with parallel execution
Reviewed Changes
Copilot reviewed 19 out of 20 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| pyproject.toml | Version bump to 0.2.17 |
| docker/Dockerfile.pcgr | Updated PCGR conda environment from v1.4.1 to v2.2.1 |
| docker/Dockerfile.gpgr | Updated r-gpgr from 2.1.3 to 2.2.11 |
| conda/env/bolt_env.yml | Added ncurses dependency |
| bolt/workflows/sv_somatic/annotate.py | Added logging infrastructure |
| bolt/workflows/smlv_somatic/rescue.py | Added logging infrastructure |
| bolt/workflows/smlv_somatic/report.py | Added hypermutation variant filtering and updated PCGR integration |
| bolt/workflows/smlv_somatic/filter.py | Updated field names for PCGR v2.2.1 and added helper function to handle missing values |
| bolt/workflows/smlv_somatic/annotate.py | Refactored PCGR processing to support chunking and updated field names |
| bolt/workflows/smlv_germline/report.py | Added logging and vep_dir parameter |
| bolt/workflows/other/cancer_report.py | Made dragen_hrd_fp optional and added new signature analysis parameters |
| bolt/util.py | Enhanced command execution with logging, added VCF splitting/merging utilities |
| bolt/logging_config.py | New logging configuration module |
| bolt/common/pcgr.py | Major refactor for PCGR v2.2.1 compatibility, chunked processing, and updated field mappings |
| bolt/common/constants.py | Updated field names and added hypermutation filtering constants |
| tests/test_smlv_somatic_filter.py | Updated test data to match new PCGR field names |
| README.md | Updated version references to 0.2.17 |
| CHANGELOG.md | Added changelog entries for recent PRs |
| .bumpversion.cfg | Updated version to 0.2.17 |
Comments suppressed due to low confidence (1)
bolt/common/pcgr.py:1
- Missing spaces around the
==operator. Should beif value == '' or value == '.':
import collections
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
* update "HIGH_INF_POS" to "HIGH_I_I_POS" in VCF header for latest pcgr version * update constant for pcgr 2.2.5
* change sage vcf header to match 2024 version * move check function * add check in rescue for sage vcf * change for consistencency with pcgr
…arallelisation because OOM hypermutated samples
24be1fb to
c42d725
Compare
a4beb4d to
03bba1d
Compare
…'t use more core and so more memory
…e_command. Fixes #26 - Commands now fail immediately instead of continuing silently
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 20 out of 21 changed files in this pull request and generated 26 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| pos='1', | ||
| vid='.', | ||
| ref='.', | ||
| alt='.', | ||
| ref='A', | ||
| alt='T', |
There was a problem hiding this comment.
Test data changed from default values to specific values. The change from pos='.' to pos='1', ref='.' to ref='A', and alt='.' to alt='T' looks correct for creating valid test records, but ensure all tests that rely on this function still work correctly with these new default values.
|
|
||
| if (variant_count - filter_sum) <= constants.MAX_SOMATIC_VARIANTS: | ||
| break | ||
|
|
||
| filter_sum += len(variants_sorted.get(key, [])) |
There was a problem hiding this comment.
Logic error in hypermutation filtering: Line 339 checks if (variant_count - filter_sum) <= constants.MAX_SOMATIC_VARIANTS, but filter_sum represents variants to be filtered out, while the condition should check if remaining variants are under the limit. The correct logic should be checking if we've accumulated enough variants to filter to bring the total under the threshold. This could result in incorrect filtering behavior.
| if (variant_count - filter_sum) <= constants.MAX_SOMATIC_VARIANTS: | |
| break | |
| filter_sum += len(variants_sorted.get(key, [])) | |
| cat_count = len(variants_sorted.get(key, [])) | |
| # Check if filtering the variants in this category as well would bring | |
| # the total number of remaining variants under the PCGR limit | |
| if (variant_count - (filter_sum + cat_count)) <= constants.MAX_SOMATIC_VARIANTS: | |
| filter_categories.append(key) | |
| break | |
| filter_sum += cat_count |
| chunk_size = kwargs.get('pcgr_variant_chunk_size') | ||
| if chunk_size is not None and chunk_size <= 0: | ||
| raise click.BadParameter('must be a positive integer', param_hint='--pcgr_variant_chunk_size') | ||
| chunk_size = chunk_size or constants.MAX_SOMATIC_VARIANTS |
There was a problem hiding this comment.
Missing validation: Lines 104-105 check if pcgr_variant_chunk_size is negative or zero, but line 106 then allows the value to be None by using chunk_size = chunk_size or constants.MAX_SOMATIC_VARIANTS. However, if a user passes 0, the validation would raise an error. The logic should clarify whether 0 is acceptable (and treated as None) or not. Current implementation is inconsistent.
| chunk_size = chunk_size or constants.MAX_SOMATIC_VARIANTS | |
| if chunk_size is None: | |
| chunk_size = constants.MAX_SOMATIC_VARIANTS |
| --solver libmamba \ | ||
| --name pcgr \ | ||
| --file https://raw.githubusercontent.com/sigven/pcgr/v1.4.1/conda/env/lock/pcgr-linux-64.lock | ||
| --file https://raw.githubusercontent.com/sigven/pcgr/refs/tags/v2.2.5/conda/env/lock/pcgr-linux-64.lock |
There was a problem hiding this comment.
Version mismatch: The PR description states "pcgrr v1.4.1 -> pcgr v2.2.1", but the Dockerfile is being updated to v2.2.5. The version should be consistent with what's documented in the PR description or the description should be updated to reflect v2.2.5.
|
|
||
| # Set filter category for each variant | ||
| variants_sorted = collections.defaultdict(list) | ||
| for variant_count, variant in enumerate(cyvcf2.VCF(fp_annotated_out), 1): |
There was a problem hiding this comment.
Variable naming issue: Line 321 uses for variant_count, variant in enumerate(cyvcf2.VCF(fp_annotated_out), 1), which shadows the outer scope where variant_count might be used later (line 339). This can lead to confusion. Consider using a different variable name like idx or i for the enumeration.
|
|
||
|
|
||
| import click | ||
| import logging |
There was a problem hiding this comment.
Import of 'logging' is not used.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 20 out of 21 changed files in this pull request and generated 11 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| conda create \ | ||
| --solver libmamba \ | ||
| --name pcgr \ | ||
| --file https://raw.githubusercontent.com/sigven/pcgr/v1.4.1/conda/env/lock/pcgr-linux-64.lock | ||
| --file https://raw.githubusercontent.com/sigven/pcgr/refs/tags/v2.2.5/conda/env/lock/pcgr-linux-64.lock | ||
|
|
||
| RUN \ | ||
| conda create \ | ||
| --solver libmamba \ | ||
| --name pcgrr \ | ||
| --file https://raw.githubusercontent.com/sigven/pcgr/v1.4.1/conda/env/lock/pcgrr-linux-64.lock | ||
| --file https://raw.githubusercontent.com/sigven/pcgr/refs/tags/v2.2.5/conda/env/lock/pcgrr-linux-64.lock |
There was a problem hiding this comment.
The PR description says PCGR is upgraded to v2.2.1, but this Dockerfile pulls lock files from tag v2.2.5. Either update the PR description/release notes to match, or change the Dockerfile to the intended PCGR version so the built image matches what’s documented.
| | Name | Docker image URI | Commands | | ||
| | --- | --- | --- | | ||
| | pcgr | ghcr.io/scwatts/bolt:0.2.14-pcgr | • `bolt smlv_germline report`<br />• `bolt smlv_somatic annotate`<br />• `bolt smlv_somatic report`<br /> | | ||
| | gpgr | ghcr.io/scwatts/bolt:0.2.14-gpgr | • `bolt other cancer_report` | | ||
| | snpeff | ghcr.io/scwatts/bolt:0.2.14-snpeff | • `bolt sv_somatic annotate` | | ||
| | circos | ghcr.io/scwatts/bolt:0.2.14-circos | • `bolt other purple_baf_plot` | | ||
| | multiqc | ghcr.io/scwatts/bolt:0.2.14-multiqc | • `bolt other multiqc_report` | | ||
| | base | ghcr.io/scwatts/bolt:0.2.14 | • `bolt smlv_germline prepare`<br />• `bolt smlv_somatic rescue`<br />• `bolt smlv_somatic filter`<br />• `bolt sv_somatic prioritise`<br /> | | ||
| | pcgr | ghcr.io/scwatts/bolt:0.2.17-pcgr | • `bolt smlv_germline report`<br />• `bolt smlv_somatic annotate`<br />• `bolt smlv_somatic report`<br /> | | ||
| | gpgr | ghcr.io/scwatts/bolt:0.2.17-gpgr | • `bolt other cancer_report` | | ||
| | snpeff | ghcr.io/scwatts/bolt:0.2.17-snpeff | • `bolt sv_somatic annotate` | | ||
| | circos | ghcr.io/scwatts/bolt:0.2.17-circos | • `bolt other purple_baf_plot` | | ||
| | multiqc | ghcr.io/scwatts/bolt:0.2.17-multiqc | • `bolt other multiqc_report` | | ||
| | base | ghcr.io/scwatts/bolt:0.2.17 | • `bolt smlv_germline prepare`<br />• `bolt smlv_somatic rescue`<br />• `bolt smlv_somatic filter`<br />• `bolt sv_somatic prioritise`<br /> | | ||
|
|
||
| ## Usage | ||
|
|
||
| Given the nature of software dependencies required, it is strongly recommended to run `bolt` commands via the existing | ||
| [Docker images](#docker-images): | ||
|
|
||
| ```bash | ||
| docker run -ti -v $(pwd):$(pwd) -w $(pwd) ghcr.io/scwatts/bolt:0.2.14 \ | ||
| docker run -ti -v $(pwd):$(pwd) -w $(pwd) ghcr.io/scwatts/bolt:0.2.17 \ | ||
| bolt smlv_somatic filter \ |
There was a problem hiding this comment.
README Docker image tags are set to 0.2.17, but the project version in pyproject.toml/.bumpversion.cfg is 0.3.0. Update the README image URIs and example docker run command to reference the same release version to avoid publishing/usage confusion.
CHANGELOG.md
Outdated
| - [3](https://github.com/scwatts/bolt/pull/3) - Improve PCGR / CPSR argument handling | ||
|
|
||
| - [6](https://github.com/umccr/bolt/pull/6) - Change oncoanalyser v2.0.0 uptade, with switch sv caller from GRIPSS to eSVee No newline at end of file | ||
| - [6](https://github.com/umccr/bolt/pull/6) - Change oncoanalyser v2.0.0 uptade, with switch sv caller from GRIPSS to eSVee |
There was a problem hiding this comment.
Typo in changelog entry: "uptade" should be "update".
| - [6](https://github.com/umccr/bolt/pull/6) - Change oncoanalyser v2.0.0 uptade, with switch sv caller from GRIPSS to eSVee | |
| - [6](https://github.com/umccr/bolt/pull/6) - Change oncoanalyser v2.0.0 update, with switch sv caller from GRIPSS to eSVee |
| filter_category = (data['tier'], *variant_filter) | ||
| variant_repr = pcgr.get_variant_repr(variant) | ||
| variants_sorted[filter_category].append(variant_repr) |
There was a problem hiding this comment.
filter_category prefixes with data['tier'] from PCGR_ACTIONABILITY_TIER (values like '1'..'4'/'N'), but pcgr.get_ordering() iterates tiers from constants.PCGR_TIERS_FILTERING (currently TIER_1…/NONCODING). With this mismatch, variants_sorted.get(key, []) won’t match and hypermutation down-selection won’t actually reduce the VCF below MAX_SOMATIC_VARIANTS. Align tier values (constants and/or mapping) so ordering keys match data['tier'].
| # - PCGR ACMG TIER [INFO/PCGR_TIER] | ||
| # - VEP consequence [INFO/PCR_CSQ] | ||
| # - Known mutation hotspot [INFO/PCGR_MUTATION_HOTSPOT] | ||
| # - ClinVar clinical significant [INFO/PCGR_CLINVAR_CLNSIG] | ||
| # - Hits in COSMIC [INFO/PCGR_COSMIC_COUNT] | ||
| # - ClinVar clinical significant [INFO/PCGR_CLNSIG] | ||
| # - Hits in TCGA [INFO/PCGR_TCGA_PANCANCER_COUNT] |
There was a problem hiding this comment.
These comments still reference legacy PCGR output fields (PCGR_TIER, PCGR_CLINVAR_CLNSIG, COSMIC/PCAWG counts), but the code/constants have moved to PCGR_ACTIONABILITY_TIER and PCGR_CLINVAR_CLASSIFICATION and no longer transfer COSMIC/PCAWG counts. Update the comment block to match the current annotations being transferred.
| PCGR_TIERS_FILTERING = ( | ||
| 'TIER_1', | ||
| 'TIER_2', | ||
| } | ||
| 'TIER_3', | ||
| 'TIER_4', | ||
| 'NONCODING', | ||
| ) |
There was a problem hiding this comment.
PCGR_TIERS_FILTERING is defined as TIER_1…NONCODING, but the code normalizes/uses actionability tiers as '1'..'4' / 'N' (see bolt/common/pcgr.py where PCGR_ACTIONABILITY_TIER is set). This mismatch breaks tier-based ordering/filtering (e.g., hypermutation down-selection). Align the constants with the actual tier values used in VCF/TSV, or add a mapping layer.
| ''') | ||
|
|
||
| # Set filter category for each variant | ||
| variants_sorted = collections.defaultdict(list) |
There was a problem hiding this comment.
variant_count is introduced by the enumerate(..., 1) loop, but it’s later used after the loop to decide whether enough variants have been filtered out. If the annotated VCF has zero records, variant_count will be undefined and raise UnboundLocalError. Initialize variant_count = 0 before the loop (or compute total count explicitly).
| variants_sorted = collections.defaultdict(list) | |
| variants_sorted = collections.defaultdict(list) | |
| variant_count = 0 |
| --file https://raw.githubusercontent.com/sigven/pcgr/refs/tags/v2.2.5/conda/env/lock/pcgr-linux-64.lock | ||
|
|
||
| RUN \ | ||
| conda create \ | ||
| --solver libmamba \ | ||
| --name pcgrr \ | ||
| --file https://raw.githubusercontent.com/sigven/pcgr/v1.4.1/conda/env/lock/pcgrr-linux-64.lock | ||
| --file https://raw.githubusercontent.com/sigven/pcgr/refs/tags/v2.2.5/conda/env/lock/pcgrr-linux-64.lock |
There was a problem hiding this comment.
The Docker build downloads conda lockfiles for pcgr and pcgrr directly from raw.githubusercontent.com using only a tag reference, which can change over time. If the upstream repository or tag is compromised, an attacker could modify the environment definition to pull and execute malicious packages during image build or when the container runs, leading to a supply-chain compromise. To reduce this risk, pin these URLs to immutable commit SHAs or vendor the lockfiles into this repository and verify their integrity before use.
Add
Upgrade
Removed