
First commit of the library; structure, purpose, etc. is defined here #1

Open
ypriverol wants to merge 33 commits into main from dev

Conversation


@ypriverol ypriverol commented Mar 11, 2026

This pull request introduces the initial implementation of the pyprideap Python library for handling affinity proteomics datasets from the PRIDE Affinity Archive (PAD). It includes core data structures, file readers for multiple platforms and formats, validation logic, statistics computation, and CI setup. The changes establish the foundation for reading, validating, and analyzing Olink and SomaScan datasets, and provide development tooling for linting, testing, and type checking.

Core library structure and data model:

  • Introduced core data classes (AffinityDataset, Platform, Level, ValidationResult) in src/pyprideap/core.py to represent datasets and validation results.
  • Added main package initializer src/pyprideap/__init__.py to expose top-level API including reading, validation, statistics, and PRIDE client functionality.

File readers for supported platforms and formats:

  • Implemented file readers for Olink and SomaScan platforms: CSV, Parquet, XLSX, ADAT, and SomaScan CSV, each producing an AffinityDataset.
  • Added registry logic in src/pyprideap/readers/registry.py for automatic format detection and unified dataset reading.
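The registry pattern described above (automatic format detection plus a unified read dispatcher) can be sketched as a stand-alone toy. This is an illustrative sketch only, not the library's code: the marker columns and format labels mirror the PR text, but the decorator, the `register`/`read` helpers, and the toy readers (which just return labels) are assumptions for demonstration.

```python
# Illustrative sketch of a reader registry with auto-detection.
# Format names and marker columns follow the PR description; the
# implementation itself is a stand-alone toy, not pyprideap's code.
import csv
from pathlib import Path
from typing import Callable

_OLINK_MARKER_COLS = {"SampleID", "OlinkID", "NPX"}

# Registry mapping a format label to its reader callable.
_READERS: dict[str, Callable[[Path], str]] = {}


def register(fmt: str):
    """Decorator that records a reader under a format label."""
    def deco(fn):
        _READERS[fmt] = fn
        return fn
    return deco


@register("olink_csv")
def read_olink_csv(path: Path) -> str:
    return f"olink dataset from {path.name}"


@register("somascan_csv")
def read_somascan_csv(path: Path) -> str:
    return f"somascan dataset from {path.name}"


def detect_format(path: Path) -> str:
    """Detect the format from the file suffix plus header columns."""
    if path.suffix == ".csv":
        with open(path, newline="") as fh:
            cols = set(next(csv.reader(fh)))
        if any(c.startswith("SeqId.") for c in cols):
            return "somascan_csv"
        # Require *all* Olink marker columns (issubset), so a lone
        # shared column like SampleID cannot cause mis-routing.
        if _OLINK_MARKER_COLS.issubset(cols):
            return "olink_csv"
    raise ValueError(f"cannot detect format of {path}")


def read(path: Path) -> str:
    """Unified entry point: detect, then dispatch to the reader."""
    return _READERS[detect_format(path)](path)
```

Note the `issubset` check: as the Qodo review below points out, requiring the full marker-column set is stricter and safer than accepting any single overlapping column.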

Validation and statistics:

  • Added platform-specific validators and a unified validate function to check datasets against PRIDE guidelines.
  • Implemented dataset statistics computation in src/pyprideap/stats.py with summary metrics for samples, features, detection rate, QC, and value distribution.
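The kind of summary statistics described above (sample/feature counts, detection rate, value distribution) can be sketched in a few lines. The dataclass fields and function name echo the PR's `DatasetStats`/`compute_stats`, but the signature and computation here are simplified assumptions, not the actual implementation in src/pyprideap/stats.py.

```python
# Hypothetical sketch of dataset summary statistics; field names echo
# the PR's DatasetStats but the computation here is illustrative.
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class DatasetStats:
    n_samples: int
    n_features: int
    detection_rate: float  # fraction of non-missing measurements
    mean_value: float


def compute_stats(expression: list[list[float | None]]) -> DatasetStats:
    """Summarize a samples-by-features expression matrix."""
    n_samples = len(expression)
    n_features = len(expression[0]) if expression else 0
    # Treat None/NaN cells as "not detected".
    values = [
        v
        for row in expression
        for v in row
        if v is not None and not math.isnan(v)
    ]
    total = n_samples * n_features
    return DatasetStats(
        n_samples=n_samples,
        n_features=n_features,
        detection_rate=len(values) / total if total else 0.0,
        mean_value=sum(values) / len(values) if values else 0.0,
    )
```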

PRIDE Archive integration and development tooling:

  • Introduced PrideClient class in src/pyprideap/pride.py for fetching project metadata and files from the PRIDE Archive API.
  • Added CI workflow (.github/workflows/ci.yml) for linting, testing, and type checking, and defined project metadata and dev dependencies in pyproject.toml.

Documentation:

  • Overhauled README.md to document installation, usage, supported formats, validation rules, data model, and development instructions.

Summary by CodeRabbit

  • New Features

    • Multi-format readers for Olink and SomaScan; unified AffinityDataset top-level API and version metadata
    • Platform-aware validation, dataset statistics, LOD analysis, filtering utilities, PRIDE archive client
    • QC toolkit: compute/render functions and a self-contained HTML QC report generator
  • Documentation

    • Expanded README with quick start, examples, supported formats, validation rules, and developer guidance
  • Chores

    • Project packaging, CI workflow, .gitignore updates, comprehensive tests, fixtures, and test helpers

ypriverol and others added 21 commits March 10, 2026 20:47
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix ruff lint errors (unused imports, import ordering)
- Apply ruff formatting across all source and test files
- Fix mypy type errors in validators, stats, and pride modules
- Add types-requests to dev dependencies

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests the full read -> validate -> stats pipeline for all four
supported formats (Olink CSV, Parquet, SomaScan ADAT, SomaScan CSV).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@qodo-code-review

Review Summary by Qodo

Initial release of pyprideap library for PRIDE Affinity Proteomics data

✨ Enhancement


Walkthroughs

Description
• Initial library release with complete project scaffolding
• Multi-format readers for Olink (CSV, Parquet, XLSX) and SomaScan (ADAT, CSV)
• Comprehensive validation framework with platform-specific validators
• Statistics computation and PRIDE API client integration
• Full test suite with CI/CD pipeline and documentation
Diagram
flowchart LR
  A["Input Files<br/>CSV/Parquet/XLSX/ADAT"] -->|Auto-detect| B["Registry<br/>Format Detection"]
  B -->|Route| C["Platform Readers<br/>Olink/SomaScan"]
  C -->|Parse| D["AffinityDataset<br/>Core Data Model"]
  D -->|Validate| E["Validators<br/>Schema/QC/Range"]
  D -->|Analyze| F["Statistics<br/>Detection Rate/QC"]
  D -->|Fetch| G["PRIDE Client<br/>Project Metadata"]
  E -->|Results| H["ValidationResult<br/>Error/Warning/Info"]
  F -->|Output| I["DatasetStats<br/>Summary"]


File Changes

1. pyproject.toml (⚙️ Configuration changes, +43/-0): Project configuration with dependencies and tools
2. README.md (📝 Documentation, +113/-2): Comprehensive documentation with usage examples
3. .github/workflows/ci.yml (⚙️ Configuration changes, +42/-0): GitHub Actions CI pipeline for lint, test, typecheck
4. src/pyprideap/__init__.py (✨ Enhancement, +24/-0): Public API exports and version management
5. src/pyprideap/core.py (✨ Enhancement, +36/-0): Core data structures and enums for library
6. src/pyprideap/pride.py (✨ Enhancement, +41/-0): PRIDE Archive API client for project metadata
7. src/pyprideap/readers/__init__.py (✨ Enhancement, +13/-0): Reader module exports and public interface
8. src/pyprideap/readers/olink_csv.py (✨ Enhancement, +42/-0): Olink NPX CSV long-format data parser
9. src/pyprideap/readers/olink_parquet.py (✨ Enhancement, +40/-0): Olink Explore HT Parquet format reader
10. src/pyprideap/readers/olink_xlsx.py (✨ Enhancement, +35/-0): Olink XLSX spreadsheet format reader
11. src/pyprideap/readers/somascan_adat.py (✨ Enhancement, +81/-0): SomaScan ADAT three-section format parser
12. src/pyprideap/readers/somascan_csv.py (✨ Enhancement, +32/-0): SomaScan wide-format CSV data reader
13. src/pyprideap/readers/registry.py (✨ Enhancement, +64/-0): Auto-detection registry for format routing
14. src/pyprideap/stats.py (✨ Enhancement, +78/-0): Dataset statistics computation and summary
15. src/pyprideap/validators/__init__.py (✨ Enhancement, +27/-0): Validator dispatch and public interface
16. src/pyprideap/validators/base.py (✨ Enhancement, +9/-0): BaseValidator protocol definition
17. src/pyprideap/validators/olink_explore.py (✨ Enhancement, +124/-0): Olink Explore schema and QC validation rules
18. src/pyprideap/validators/olink_target.py (✨ Enhancement, +20/-0): Olink Target validator extending Explore
19. src/pyprideap/validators/somascan.py (✨ Enhancement, +87/-0): SomaScan RFU and metadata validation rules
20. tests/conftest.py (🧪 Tests, +30/-0): Pytest fixtures for test data paths
21. tests/create_fixtures.py (🧪 Tests, +51/-0): Fixture generation script for test data
22. tests/data/olink_sample.parquet (🧪 Tests, +0/-0): Binary Parquet test fixture for Olink
23. tests/data/olink_sample.xlsx (🧪 Tests, +0/-0): Binary XLSX test fixture for Olink
24. tests/data/somascan_sample.adat (🧪 Tests, +16/-0): Text ADAT test fixture for SomaScan
25. tests/test_core.py (🧪 Tests, +42/-0): Unit tests for core data structures
26. tests/test_integration.py (🧪 Tests, +39/-0): End-to-end integration tests for API
27. tests/test_pride.py (🧪 Tests, +63/-0): Unit tests for PRIDE API client
28. tests/test_readers.py (🧪 Tests, +151/-0): Unit tests for all file format readers
29. tests/test_registry.py (🧪 Tests, +52/-0): Tests for format detection and routing
30. tests/test_stats.py (🧪 Tests, +83/-0): Unit tests for statistics computation
31. tests/test_validators.py (🧪 Tests, +240/-0): Unit tests for all validators
32. tests/__init__.py (Additional files, +0/-0): ...


@qodo-code-review

qodo-code-review bot commented Mar 11, 2026

Code Review by Qodo

🐞 Bugs (4) 📘 Rule violations (0) 📎 Requirement gaps (0)



Action required

1. SomaScan CSV always invalid (🐞 Bug, ✓ Correctness)
Description
read_somascan_csv() populates ds.features with only a SeqId column, but SomaScanValidator requires
UniProt/Target/Dilution as well, so validate() will always emit missing_feature_column errors for
any SomaScan CSV read through this API.
Code

src/pyprideap/readers/somascan_csv.py[R17-25]

+    seq_cols = [c for c in df.columns if c.startswith("SeqId.")]
+    meta_cols = [c for c in df.columns if c not in seq_cols]
+
+    samples = df[meta_cols].reset_index(drop=True)
+    expression = df[seq_cols].astype(float).reset_index(drop=True)
+
+    seq_ids = [c.replace("SeqId.", "") for c in seq_cols]
+    features = pd.DataFrame({"SeqId": seq_ids})
+
Evidence
The SomaScan CSV reader constructs a features DataFrame containing only SeqId, while the validator
enforces additional required feature columns (UniProt, Target, Dilution). The provided SomaScan CSV
fixture also lacks those columns, demonstrating that the current read→validate workflow is
guaranteed to fail schema validation for this format.

src/pyprideap/readers/somascan_csv.py[17-25]
src/pyprideap/validators/somascan.py[8-10]
src/pyprideap/validators/somascan.py[34-43]
tests/data/somascan_sample.csv[1-4]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`read_somascan_csv()` produces `AffinityDataset.features` with only `SeqId`, but `SomaScanValidator` requires `UniProt`, `Target`, and `Dilution`. This makes `pyprideap.read()` + `pyprideap.validate()` unusable for SomaScan CSV inputs.
## Issue Context
SomaScan ADAT parsing provides rich feature metadata, but the SomaScan CSV format in fixtures contains only `SeqId.*` expression columns and sample metadata. The validator is currently keyed only by `Platform.SOMASCAN`, so both ADAT and CSV go through the same strict schema checks.
## Fix Focus Areas
- src/pyprideap/readers/somascan_csv.py[17-32]
- src/pyprideap/validators/somascan.py[8-45]
- src/pyprideap/validators/__init__.py[7-19]
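One remediation shape consistent with the Fix Focus Areas above is to key the required feature columns on the source format, so a SomaScan CSV (which carries only SeqId) is not held to the full ADAT schema. This is a hypothetical sketch: the column sets mirror the review text, but the function name and structure are illustrative, not the library's code.

```python
# Hypothetical sketch: per-format required feature columns, so the
# SomaScan CSV reader's SeqId-only features pass schema validation
# while the richer ADAT metadata is still enforced. The column sets
# come from the review text; the function itself is illustrative.
REQUIRED_FEATURE_COLS: dict[str, set[str]] = {
    "somascan_adat": {"SeqId", "UniProt", "Target", "Dilution"},
    "somascan_csv": {"SeqId"},
}


def missing_feature_columns(source_format: str, feature_cols: set[str]) -> set[str]:
    """Return the required feature columns absent for this source format."""
    required = REQUIRED_FEATURE_COLS.get(source_format, set())
    return required - feature_cols
```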



2. CSV/XLSX mis-detection (🐞 Bug, ✓ Correctness)
Description
detect_format() uses set intersection for CSV/XLSX Olink detection, so any file containing just one
of {SampleID, OlinkID, NPX} can be misclassified as Olink and routed to the wrong reader, leading to
crashes or incorrect parsing.
Code

src/pyprideap/readers/registry.py[R37-50]

+    if suffix == ".csv":
+        df_head = pd.read_csv(path, nrows=1)
+        cols = set(df_head.columns)
+        has_seqid_cols = any(c.startswith("SeqId.") for c in cols)
+        if has_seqid_cols or _SOMASCAN_MARKER_COLS.intersection(cols):
+            return "somascan_csv"
+        if _OLINK_MARKER_COLS.intersection(cols):
+            return "olink_csv"
+
+    if suffix == ".xlsx":
+        df_head = pd.read_excel(path, nrows=1)
+        cols = set(df_head.columns)
+        if _OLINK_MARKER_COLS.intersection(cols):
+            return "olink_xlsx"
Evidence
Parquet detection requires all Olink marker columns via issubset, but CSV/XLSX detection only checks
for any overlap with marker columns. Because intersection is truthy for a single shared column
(e.g., 'SampleID'), non-Olink inputs can be misrouted to Olink readers.

src/pyprideap/readers/registry.py[27-33]
src/pyprideap/readers/registry.py[37-45]
src/pyprideap/readers/registry.py[46-50]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`detect_format()` uses `_OLINK_MARKER_COLS.intersection(cols)` for CSV/XLSX detection, which is true if *any* one marker column is present. This can mis-route non-Olink files (e.g., any dataset with `SampleID`) to an Olink reader.
## Issue Context
Parquet detection already uses `_OLINK_MARKER_COLS.issubset(cols)`, so CSV/XLSX behavior is inconsistent and weaker.
## Fix Focus Areas
- src/pyprideap/readers/registry.py[15-52]



3. Olink schema KeyErrors (🐞 Bug, ⛯ Reliability)
Description
The Olink readers hard-reference SampleID/OlinkID/NPX in drop_duplicates() and pivot_table() without
checking those columns exist, so malformed inputs fail with opaque pandas KeyErrors rather than a
clear, actionable schema error.
Code

src/pyprideap/readers/olink_csv.py[R20-32]

+    sample_cols = [c for c in df.columns if c in _SAMPLE_COLS]
+    samples = df[sample_cols].drop_duplicates(subset=["SampleID"]).reset_index(drop=True)
+
+    feature_cols = [c for c in df.columns if c in _FEATURE_COLS]
+    features = df[feature_cols].drop_duplicates(subset=["OlinkID"]).reset_index(drop=True)
+
+    expression = df.pivot_table(
+        index="SampleID",
+        columns="OlinkID",
+        values="NPX",
+        aggfunc="first",
+    )
+    expression = expression.reindex(samples["SampleID"].values).reset_index(drop=True)
Evidence
Each reader first builds sample_cols/feature_cols by filtering known columns, but then
unconditionally uses subset=["SampleID"], subset=["OlinkID"], and
values="NPX"/index="SampleID"/columns="OlinkID". If any of those required columns are missing,
pandas will raise at runtime.

src/pyprideap/readers/olink_csv.py[20-31]
src/pyprideap/readers/olink_parquet.py[20-33]
src/pyprideap/readers/olink_xlsx.py[20-27]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The Olink readers assume `SampleID`, `OlinkID`, and `NPX` exist and will throw pandas `KeyError` when inputs are malformed. This produces low-quality error messages and makes debugging/user experience poor.
## Issue Context
Each reader filters columns into `sample_cols`/`feature_cols`, but still hard-references required columns later.
## Fix Focus Areas
- src/pyprideap/readers/olink_csv.py[18-33]
- src/pyprideap/readers/olink_parquet.py[18-33]
- src/pyprideap/readers/olink_xlsx.py[18-28]




Remediation recommended

4. Version import crashes (🐞 Bug, ⛯ Reliability)
Description
Importing pyprideap from a source checkout can crash because __init__.py evaluates
importlib.metadata.version("pyprideap") at import time, which raises PackageNotFoundError when the
distribution metadata is unavailable.
Code

src/pyprideap/__init__.py[R3-12]

+from importlib.metadata import version
+
+from pyprideap.core import AffinityDataset, Level, Platform, ValidationResult
+from pyprideap.pride import PrideClient
+from pyprideap.readers.registry import read
+from pyprideap.stats import DatasetStats, compute_stats
+from pyprideap.validators import validate
+
+__version__ = version("pyprideap")
+
Evidence
The package computes __version__ via importlib.metadata.version() unconditionally during import,
which depends on installed distribution metadata rather than the source tree being on PYTHONPATH.

src/pyprideap/__init__.py[3-12]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`__version__ = version("pyprideap")` can raise `PackageNotFoundError` when importing from a source checkout without an installed distribution.
## Issue Context
CI installs the package (`pip install -e`), but developers/users may import directly from a checkout.
## Fix Focus Areas
- src/pyprideap/__init__.py[3-12]




@coderabbitai

coderabbitai bot commented Mar 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a new pyprideap package surface: core datamodels, many format readers and registry, LOD analysis, filtering, stats, validators, QC compute/render/report, a PRIDE API client, top-level exports, CI config, tests, and test fixtures.

Changes

Cohort / File(s): Summary

  • Project config & docs (.github/workflows/ci.yml, .gitignore, pyproject.toml, README.md): Adds CI jobs (lint/test/typecheck), expands README, adds pyproject.toml with dependencies/tooling, and updates .gitignore entries.
  • Top-level API & core models (src/pyprideap/__init__.py, src/pyprideap/core.py): Exports version and many re-exports via __all__; adds Platform/Level enums, ValidationResult and AffinityDataset dataclasses.
  • PRIDE client (src/pyprideap/pride.py): New PrideClient with session management, project fetch, file listing, and download-URL extraction.
  • Readers & registry (src/pyprideap/readers/..., src/pyprideap/readers/registry.py): Adds readers for Olink (CSV, Parquet, XLSX) and SomaScan (ADAT, CSV) plus detect_format and read dispatcher.
  • Filtering & LOD analysis (src/pyprideap/filtering.py, src/pyprideap/lod.py): Adds filter_controls/filter_qc utilities and an extensive LOD API (NCLOD/FIXED/REPORTED), loaders, compute_lod_stats, get_valid_proteins, LodStats and LodMethod.
  • Statistics (src/pyprideap/stats.py): Adds DatasetStats dataclass and compute_stats() for dataset summaries and distributions.
  • Validators (src/pyprideap/validators/...): Introduces validator framework and platform validators (OlinkExplore, OlinkTarget, SomaScan) with a validate() dispatcher.
  • QC compute/render/report (src/pyprideap/qc/...): Adds QC compute dataclasses/functions, Plotly renderers, and qc_report HTML generator with integrated plots and metadata.
  • Tests, fixtures & data (tests/*, tests/data/*): Adds extensive unit and integration tests, fixture generator script, and sample data files covering readers, registry, validators, LOD, filtering, stats, PRIDE client, and QC code.

Sequence Diagram(s)

sequenceDiagram
    rect rgba(200,200,255,0.5)
    participant User
    participant Registry as readers.registry.read()
    participant Detect as readers.registry.detect_format()
    participant Reader as Reader (CSV/Parquet/ADAT/XLSX)
    end
    rect rgba(200,255,200,0.5)
    participant Dataset as AffinityDataset
    participant Validator as validators.validate()
    participant LOD as lod.compute_lod_stats()
    participant Stats as stats.compute_stats()
    end

    User->>Registry: read(path)
    Registry->>Detect: detect_format(path)
    Detect-->>Registry: format_type
    Registry->>Reader: call reader_func(path)
    Reader->>Reader: parse file, assemble samples/features/expression
    Reader-->>Dataset: return AffinityDataset
    Registry-->>User: dataset

    User->>Validator: validate(dataset)
    Validator->>Validator: run platform checks
    Validator-->>User: list[ValidationResult]

    User->>LOD: compute_lod_stats(dataset)
    LOD->>LOD: derive LODs & per-sample rates
    LOD-->>User: LodStats

    User->>Stats: compute_stats(dataset)
    Stats->>Stats: aggregate metrics
    Stats-->>User: DatasetStats

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through files with nimble paws,

Readers and LODs and validators' laws,
Stats counted carrots, QC sang along,
CI kept the tests marching strong,
A cheerful rabbit nibbles—code well done.

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 11.16%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check (❓ Inconclusive): The title is vague and contains a typo ('structure;l'); it does not clearly convey the main purpose of the changeset despite being an initial commit. Resolution: revise the title to be clear and specific, e.g. 'Initial library implementation with core readers, validators, and PRIDE client' or 'Set up pyprideap: add readers, validators, statistics, and PRIDE integration'.

✅ Passed checks (1 passed)

  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.


@ypriverol ypriverol requested a review from chakrabandla March 11, 2026 10:29

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (12)
pyproject.toml (1)

5-18: Consider adding project metadata for PyPI publication.

The core configuration looks solid. For eventual PyPI publication, consider adding author information, classifiers, and project URLs:

authors = [
    {name = "Your Name", email = "email@example.com"}
]
classifiers = [
    "Development Status :: 3 - Alpha",
    "Programming Language :: Python :: 3.10",
    "Programming Language :: Python :: 3.11",
    "Programming Language :: Python :: 3.12",
]
[project.urls]
Homepage = "https://github.com/PRIDE-Archive/pyprideap"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 5 - 18, Add PyPI metadata under the existing
[project] table: populate authors (e.g., authors = [{name = "Your Name", email =
"you@example.com"}]), add a classifiers array (include Development Status and
supported Python versions like "Programming Language :: Python ::
3.10"/3.11/3.12), and add a [project.urls] section with useful links (Homepage,
Source, Documentation) so the package metadata for pyprideap is complete for
PyPI publication; update the version/description if needed but keep the existing
dependencies intact.
src/pyprideap/pride.py (1)

11-15: Consider adding context manager support for proper session cleanup.

The requests.Session is created but never explicitly closed, which could lead to resource leaks in long-running applications.

💡 Optional: Add context manager support
 class PrideClient:
     def __init__(self, base_url: str = _BASE_URL, timeout: int = _DEFAULT_TIMEOUT):
         self.base_url = base_url
         self.timeout = timeout
         self._session = requests.Session()
+
+    def close(self) -> None:
+        self._session.close()
+
+    def __enter__(self) -> "PrideClient":
+        return self
+
+    def __exit__(self, *args: object) -> None:
+        self.close()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/pride.py` around lines 11 - 15, The PrideClient currently
creates a requests.Session in __init__ and never closes it; add explicit session
cleanup by implementing context manager support (define __enter__ to return self
and __exit__ to call self.close()) plus a close() method that calls
self._session.close(); ensure existing callers can also call close() directly
and consider adding a __del__ that calls close() as a fallback. Reference the
PrideClient class, its __init__, the self._session attribute, and add methods
__enter__, __exit__, and close to manage lifecycle.
.github/workflows/ci.yml (2)

10-19: Consider adding pip caching to speed up CI.

Adding pip caching can significantly reduce CI time:

💡 Optional: Add caching to workflow
       - uses: actions/setup-python@v5
         with:
           python-version: "3.12"
+          cache: 'pip'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/ci.yml around lines 10 - 19, Add pip caching to the "lint"
job by caching pip's wheel/cache directory so subsequent runs skip
re-downloading ruff; modify the steps around uses: actions/setup-python@v5 and
the run: pip install ruff command to include an actions/cache step (e.g., cache
key based on runner, python-version and a static ruff-version or requirements
hash) that caches ~/.cache/pip (or the platform-appropriate pip cache path)
before installing, and restore that cache so the existing run: pip install ruff
becomes fast and CI time is reduced.

23-32: Consider adding fail-fast: false to ensure all Python versions complete their tests.

This allows each version to run independently, so you can see test results across all Python versions even if one fails:

Example change
     strategy:
+      fail-fast: false
       matrix:
         python-version: ["3.10", "3.11", "3.12", "3.13"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/ci.yml around lines 23 - 32, Update the GitHub Actions job
strategy to disable fast-fail so all matrix entries run: under the existing
`strategy` -> `matrix` configuration referencing `python-version`, add
`fail-fast: false` (i.e., ensure `strategy` contains `matrix:` and `fail-fast:
false`) so tests for "python-version" 3.10–3.13 all complete even if one fails.
src/pyprideap/validators/olink_target.py (1)

7-20: Consider simplifying rule prefix handling.

The _rule_prefix class attribute is only used within validate(). Since the replacement logic is tightly coupled to this single method, it could be simplified.

However, the current approach works correctly and allows for potential subclass customization. The implementation properly rewrites all rule strings from the parent validator.

♻️ Optional: Simplify if subclass customization is not needed
 class OlinkTargetValidator(OlinkExploreValidator):
-    _rule_prefix = "olink_target"
-
     def validate(self, dataset: AffinityDataset) -> list[ValidationResult]:
         results = super().validate(dataset)
         return [
             ValidationResult(
                 level=r.level,
-                rule=r.rule.replace("olink.", f"{self._rule_prefix}.", 1),
+                rule=r.rule.replace("olink.", "olink_target.", 1),
                 message=r.message,
                 details=r.details,
             )
             for r in results
         ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/validators/olink_target.py` around lines 7 - 20, The
class-level attribute _rule_prefix is only used inside
OlinkTargetValidator.validate; simplify by removing the _rule_prefix attribute
and use a local variable or inline string in validate instead: update
OlinkTargetValidator to delete the _rule_prefix declaration and change the
replace call in validate to r.rule.replace("olink.", "olink_target.", 1) (or
assign prefix = "olink_target" at the top of validate and use f"{prefix}.")
while keeping the existing mapping logic that constructs ValidationResult from
super().validate(dataset).
tests/test_validators.py (1)

147-176: Consider adding a test to verify rule prefix rewriting.

The tests validate that OlinkTargetValidator inherits validation behavior from OlinkExploreValidator, but they don't explicitly verify that rule names are rewritten from olink. to olink_target. prefix.

💚 Optional: Add test for rule prefix
def test_rule_prefix_rewriting(self):
    ds = _make_olink_target_dataset(
        features=pd.DataFrame({"OlinkID": ["OID1"]}),
    )
    results = OlinkTargetValidator().validate(ds)
    errors = [r for r in results if r.level == Level.ERROR]
    # Verify rules use olink_target prefix, not olink
    assert all(r.rule.startswith("olink_target.") for r in errors)
    assert not any(r.rule.startswith("olink.") for r in errors)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_validators.py` around lines 147 - 176, Add a unit test that
ensures OlinkTargetValidator rewrites inherited rule names from the olink.
prefix to olink_target.; create a new test method (e.g.,
test_rule_prefix_rewriting) inside TestOlinkTargetValidator that builds a
minimal dataset via _make_olink_target_dataset, runs
OlinkTargetValidator().validate(ds), filters error-level results using Level,
and asserts that every error.rule starts with "olink_target." and that none
start with "olink."; this verifies the prefix rewriting behavior for inherited
rules.
tests/test_registry.py (1)

35-52: Consider adding a test for reading SomaScan CSV files.

TestDetectFormat includes test_somascan_csv_detected, but TestAutoRead lacks a corresponding test_read_somascan_csv. For consistency and completeness, consider adding this test case to verify the auto-read functionality for SomaScan CSV format.

💡 Suggested test addition
     def test_read_xlsx(self, olink_xlsx_path):
         ds = read(olink_xlsx_path)
         assert isinstance(ds, AffinityDataset)
         assert ds.platform == Platform.OLINK_EXPLORE
+
+    def test_read_somascan_csv(self, somascan_csv_path):
+        ds = read(somascan_csv_path)
+        assert isinstance(ds, AffinityDataset)
+        assert ds.platform == Platform.SOMASCAN
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_registry.py` around lines 35 - 52, Add a new test method
test_read_somascan_csv to the TestAutoRead class that calls
read(somascan_csv_path), asserts the returned object is the expected dataset
type (e.g., AffinityDataset if applicable) and that ds.platform ==
Platform.SOMASCAN; reference the existing read() call pattern and the
somascan_csv_path fixture used in TestDetectFormat to mirror other tests (e.g.,
test_read_somascan_adat) for consistency.
tests/test_pride.py (1)

36-43: Consider using a more specific exception type in the mock.

The test mocks raise_for_status.side_effect with a generic Exception("404 Not Found"). The requests library raises requests.exceptions.HTTPError for failed status codes. Using the correct exception type would make the test more realistic.

💡 Suggested improvement
+from requests.exceptions import HTTPError
+
     def test_get_project_not_found(self):
         mock_response = MagicMock()
         mock_response.status_code = 404
-        mock_response.raise_for_status.side_effect = Exception("404 Not Found")
+        mock_response.raise_for_status.side_effect = HTTPError("404 Not Found")
         with patch("pyprideap.pride.requests.Session.get", return_value=mock_response):
             client = PrideClient()
-            with pytest.raises(Exception, match="404"):
+            with pytest.raises(HTTPError, match="404"):
                 client.get_project("PAD999999")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_pride.py` around lines 36 - 43, The test
test_get_project_not_found should mock the same exception type Requests would
raise: replace the generic Exception thrown from
mock_response.raise_for_status.side_effect with requests.exceptions.HTTPError so
the behavior matches real requests behavior; update the test to import requests
(or reference requests.exceptions.HTTPError) and set
mock_response.raise_for_status.side_effect = requests.exceptions.HTTPError("404
Not Found"), keeping the patch on pyprideap.pride.requests.Session.get and the
existing pytest.raises assertion targeting the HTTPError message.
src/pyprideap/readers/olink_csv.py (2)

18-32: Add validation for required columns before processing.

The code assumes columns SampleID, OlinkID, and NPX exist without validation. If any are missing, pivot_table or drop_duplicates will raise cryptic errors. Adding upfront validation would provide clearer error messages.

💡 Suggested validation
     df = pd.read_csv(path)
 
+    required_cols = {"SampleID", "OlinkID", "NPX"}
+    missing = required_cols - set(df.columns)
+    if missing:
+        raise ValueError(f"Missing required columns: {missing}")
+
     sample_cols = [c for c in df.columns if c in _SAMPLE_COLS]
     samples = df[sample_cols].drop_duplicates(subset=["SampleID"]).reset_index(drop=True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/readers/olink_csv.py` around lines 18 - 32, After reading the
CSV into df, validate that the required columns "SampleID", "OlinkID", and "NPX"
are present before any downstream operations (before computing
sample_cols/feature_cols or calling df.pivot_table). If any are missing, raise a
clear ValueError listing the missing columns so callers get an actionable
message; place this check immediately after df = pd.read_csv(path) and before
usage of sample_cols, feature_cols, samples, features, or expression.

26-31: Consider logging or warning when aggfunc="first" silently resolves duplicates.

Using aggfunc="first" in pivot_table silently picks the first value when there are duplicate (SampleID, OlinkID) combinations. This could mask data quality issues. Consider either validating uniqueness beforehand or logging when duplicates are encountered.
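A sketch of what that check could look like. The toy data and the mean-based deduplication are illustrative choices, not the library's current behavior:

```python
import warnings

import pandas as pd

# Toy long-format table (hypothetical values) with one duplicated
# (SampleID, OlinkID) pair.
df = pd.DataFrame({
    "SampleID": ["S1", "S1", "S2"],
    "OlinkID": ["OID1", "OID1", "OID1"],
    "NPX": [1.0, 2.0, 3.0],
})

# Surface duplicates before pivoting instead of letting aggfunc="first" hide them.
dup_mask = df.duplicated(subset=["SampleID", "OlinkID"], keep=False)
if dup_mask.any():
    warnings.warn(
        f"{dup_mask.sum()} duplicate (SampleID, OlinkID) rows; "
        f"sample:\n{df[dup_mask].head()}"
    )

# Deduplicate explicitly (here: mean) so the resolution policy is visible in code.
expression = df.pivot_table(
    index="SampleID", columns="OlinkID", values="NPX", aggfunc="mean"
)
```

Whether to warn, raise, or aggregate is a policy decision for the reader API; the point is that the choice is made explicitly rather than implied by `aggfunc="first"`.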

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/readers/olink_csv.py` around lines 26 - 31, The pivot uses
df.pivot_table(..., aggfunc="first") which silently drops duplicate (SampleID,
OlinkID) rows; before calling df.pivot_table (the expression variable), detect
duplicates with df.duplicated(subset=["SampleID","OlinkID"], keep=False), and if
any are found emit a warning/log (include count and a small sample of offending
rows) or optionally raise; then either deduplicate explicitly (e.g., groupby +
aggregate) or proceed to pivot—update the code around df.pivot_table to perform
this check and logging so duplicates are visible instead of silently ignored.
tests/test_integration.py (1)

6-12: Consider adding assertions on validation results content.

The test checks that results is a list but doesn't verify its contents. Integration tests would be more valuable if they asserted on expected validation outcomes (e.g., no errors for valid test data, or specific expected warnings).

💡 Example enhancement
     results = pyprideap.validate(dataset)
     stats = pyprideap.compute_stats(dataset)
     assert stats.n_samples > 0
     assert stats.n_features > 0
     assert isinstance(results, list)
+    # Verify no critical validation errors for valid test data
+    from pyprideap.core import Level
+    critical_errors = [r for r in results if r.level == Level.ERROR]
+    assert len(critical_errors) == 0, f"Unexpected validation errors: {critical_errors}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_integration.py` around lines 6 - 12, The test currently only
asserts that results is a list; enhance it to verify expected validation
outcomes by checking that results contains no errors for this valid fixture:
after calling pyprideap.validate(dataset) assert either that results is empty or
more robustly assert that no entry in results has severity 'error' (e.g., assert
not any(r.get('severity') == 'error' for r in results)). Also consider adding an
assertion on results structure (e.g., each item has keys like 'message' or
'severity') to ensure the validation output format from pyprideap.validate
remains stable.
src/pyprideap/readers/registry.py (1)

37-44: Potential fall-through issue: unrecognized CSV files silently fail.

If a .csv file doesn't match SomaScan or Olink marker columns, the function falls through to the .xlsx check and eventually raises ValueError. However, this path is unclear from the code structure. Consider adding an explicit else or comment to clarify this intentional fall-through behavior.

💡 Suggested clarification
     if suffix == ".csv":
         df_head = pd.read_csv(path, nrows=1)
         cols = set(df_head.columns)
         has_seqid_cols = any(c.startswith("SeqId.") for c in cols)
         if has_seqid_cols or _SOMASCAN_MARKER_COLS.intersection(cols):
             return "somascan_csv"
         if _OLINK_MARKER_COLS.intersection(cols):
             return "olink_csv"
+        # Fall through to raise ValueError if no markers matched
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/readers/registry.py` around lines 37 - 44, The CSV branch
currently inspects df_head and checks _SOMASCAN_MARKER_COLS and
_OLINK_MARKER_COLS but then silently falls through if no match; make this
explicit by either adding an else to return a sentinel (e.g., None) or, to
preserve current behavior, add a clear comment stating the fall-through is
intentional so later .xlsx handling/ValueError will apply; update the block
around suffix, pd.read_csv, df_head, has_seqid_cols, _SOMASCAN_MARKER_COLS and
_OLINK_MARKER_COLS to include that explicit else or comment.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/pyprideap/__init__.py`:
- Around line 3-11: Wrap the importlib.metadata.version call in a try/except to
avoid raising PackageNotFoundError when the package metadata is missing: import
PackageNotFoundError from importlib.metadata and replace the direct assignment
to __version__ with a guarded assignment that sets __version__ =
version("pyprideap") inside the try block and sets a safe fallback (e.g.,
"0.0.0" or None) in the except PackageNotFoundError block; update the top-level
symbols (__version__, version, PackageNotFoundError) so importing pyprideap from
a source tree doesn't fail.

In `@src/pyprideap/pride.py`:
- Around line 37-40: The loop that sets urls[name] uses loc["value"] without
checking the key, which can raise KeyError; update the block that iterates over
locations (the for loc in locations loop) to safely access the value by either
checking "value" in loc or using loc.get("value") and only assign urls[name]
when a non-None value is present (and keep the existing break behavior after
assignment).

In `@src/pyprideap/readers/olink_parquet.py`:
- Around line 23-32: The expression matrix created by pivot_table is left with
columns in pivot order, causing mismatch with features; after building features
= df[feature_cols].drop_duplicates(subset=["OlinkID"]).reset_index(drop=True),
reindex the expression's columns to match features["OlinkID"] order (e.g. use
expression.reindex(columns=features["OlinkID"].values) to align columns with
the features DataFrame) and then reset_index(drop=True) as done for the sample
axis so that expression columns and features["OlinkID"] have the same order and
indexing.
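A minimal sketch of that realignment, using a toy frame whose first-seen OlinkID order differs from lexicographic order (IDs and values are hypothetical):

```python
import pandas as pd

# Toy long-format table; OlinkIDs first appear in the order OID9, OID1.
df = pd.DataFrame({
    "SampleID": ["S1", "S1", "S2", "S2"],
    "OlinkID": ["OID9", "OID1", "OID9", "OID1"],
    "NPX": [1.0, 2.0, 3.0, 4.0],
})

features = df[["OlinkID"]].drop_duplicates(subset=["OlinkID"]).reset_index(drop=True)

# pivot_table sorts columns lexicographically (OID1, OID9), which no longer
# matches the first-seen order captured in features.
expression = df.pivot_table(
    index="SampleID", columns="OlinkID", values="NPX", aggfunc="first"
)

# Realign columns to the feature order, then reset the sample axis as is
# already done for rows.
expression = expression.reindex(columns=features["OlinkID"].values).reset_index(drop=True)
```

After the `reindex`, positional access like `expression.iloc[:, i]` and `features.iloc[i]` refer to the same analyte.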

In `@src/pyprideap/readers/olink_xlsx.py`:
- Around line 23-27: The pivot_table call sorts OlinkID columns and can misalign
with the first-seen feature order in feature_cols/features; to fix, after
building expression (pivot_table and reindexing rows) explicitly reindex its
columns to match the feature order from features["OlinkID"] (or the
feature_cols-derived OlinkID order) so that expression.columns align with
features rows (use expression.reindex(columns=features["OlinkID"].values) or
equivalent) before returning/resetting index.

In `@src/pyprideap/readers/somascan_csv.py`:
- Around line 17-24: The features DataFrame only contains "SeqId" which breaks
SomaScanValidator expecting {"SeqId","UniProt","Target","Dilution"} and the
expression columns keep the "SeqId." prefix while seq_ids are stripped; update
the logic around seq_cols/seq_ids/features to (1) populate features with the
required columns by extracting UniProt/Target/Dilution from the original df if
present or else adding these columns with NaN/None defaults, and (2) make
expression column names and features.SeqId consistent by either stripping the
"SeqId." prefix from expression column names or keeping the prefix in
features.SeqId so that seq_cols, seq_ids, expression, and features all refer to
the same identifiers (see symbols: seq_cols, seq_ids, expression, features,
SomaScanValidator).
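One way to reconcile the identifiers and pad the feature table could look like the following. The toy frame, the NaN defaults, and the `removeprefix` approach are assumptions for illustration, not the current reader's code:

```python
import numpy as np
import pandas as pd

# Toy wide-format SomaScan-style frame: one metadata column plus
# "SeqId."-prefixed measurement columns (hypothetical values).
df = pd.DataFrame({
    "SampleId": ["S1", "S2"],
    "SeqId.1234-5": [10.0, 11.0],
    "SeqId.6789-1": [20.0, 21.0],
})

seq_cols = [c for c in df.columns if c.startswith("SeqId.")]
seq_ids = [c.removeprefix("SeqId.") for c in seq_cols]

# Strip the prefix on the expression side so columns and features["SeqId"]
# use the same identifiers.
expression = df[seq_cols].copy()
expression.columns = seq_ids

# Carry every column the validator expects, defaulting to NaN when the
# source file carries no annotation for it.
features = pd.DataFrame({"SeqId": seq_ids})
for col in ("UniProt", "Target", "Dilution"):
    features[col] = df[col] if col in df.columns else np.nan
```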

In `@tests/create_fixtures.py`:
- Around line 30-46: The test fixture function create_olink_xlsx is missing the
PlateQC column required by the olink xlsx reader's _SAMPLE_COLS; update the
DataFrame constructed in create_olink_xlsx to include a "PlateQC" column (e.g.,
values like "PASS" or appropriate per-row QC statuses) so the produced
tests/data/olink_sample.xlsx contains PlateQC for each row and matches the
reader expectation in src/pyprideap/readers/olink_xlsx.py.

---

Nitpick comments:
In @.github/workflows/ci.yml:
- Around line 10-19: Add pip caching to the "lint" job by caching pip's
wheel/cache directory so subsequent runs skip re-downloading ruff; modify the
steps around uses: actions/setup-python@v5 and the run: pip install ruff command
to include an actions/cache step (e.g., cache key based on runner,
python-version and a static ruff-version or requirements hash) that caches
~/.cache/pip (or the platform-appropriate pip cache path) before installing, and
restore that cache so the existing run: pip install ruff becomes fast and CI
time is reduced.
- Around line 23-32: Update the GitHub Actions job strategy to disable fast-fail
so all matrix entries run: under the existing `strategy` -> `matrix`
configuration referencing `python-version`, add `fail-fast: false` (i.e., ensure
`strategy` contains `matrix:` and `fail-fast: false`) so tests for
"python-version" 3.10–3.13 all complete even if one fails.

In `@pyproject.toml`:
- Around line 5-18: Add PyPI metadata under the existing [project] table:
populate authors (e.g., authors = [{name = "Your Name", email =
"you@example.com"}]), add a classifiers array (include Development Status and
supported Python versions like "Programming Language :: Python ::
3.10"/3.11/3.12), and add a [project.urls] section with useful links (Homepage,
Source, Documentation) so the package metadata for pyprideap is complete for
PyPI publication; update the version/description if needed but keep the existing
dependencies intact.

In `@src/pyprideap/pride.py`:
- Around line 11-15: The PrideClient currently creates a requests.Session in
__init__ and never closes it; add explicit session cleanup by implementing
context manager support (define __enter__ to return self and __exit__ to call
self.close()) plus a close() method that calls self._session.close(); ensure
existing callers can also call close() directly and consider adding a __del__
that calls close() as a fallback. Reference the PrideClient class, its __init__,
the self._session attribute, and add methods __enter__, __exit__, and close to
manage lifecycle.
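A sketch of that lifecycle, using a stand-in session object so the snippet runs without the requests dependency (the real class would wrap requests.Session):

```python
class _FakeSession:
    """Stand-in for requests.Session so the sketch has no external deps."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class PrideClient:
    """Lifecycle methods only; HTTP methods are omitted in this sketch."""

    def __init__(self, session_factory=_FakeSession):
        self._session = session_factory()

    def close(self):
        self._session.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()


# Context-manager usage closes the session automatically on exit.
with PrideClient() as client:
    session = client._session
```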

In `@src/pyprideap/readers/olink_csv.py`:
- Around line 18-32: After reading the CSV into df, validate that the required
columns "SampleID", "OlinkID", and "NPX" are present before any downstream
operations (before computing sample_cols/feature_cols or calling
df.pivot_table). If any are missing, raise a clear ValueError listing the
missing columns so callers get an actionable message; place this check
immediately after df = pd.read_csv(path) and before usage of sample_cols,
feature_cols, samples, features, or expression.
- Around line 26-31: The pivot uses df.pivot_table(..., aggfunc="first") which
silently drops duplicate (SampleID, OlinkID) rows; before calling df.pivot_table
(the expression variable), detect duplicates with
df.duplicated(subset=["SampleID","OlinkID"], keep=False), and if any are found
emit a warning/log (include count and a small sample of offending rows) or
optionally raise; then either deduplicate explicitly (e.g., groupby + aggregate)
or proceed to pivot—update the code around df.pivot_table to perform this check
and logging so duplicates are visible instead of silently ignored.

In `@src/pyprideap/readers/registry.py`:
- Around line 37-44: The CSV branch currently inspects df_head and checks
_SOMASCAN_MARKER_COLS and _OLINK_MARKER_COLS but then silently falls through if
no match; make this explicit by either adding an else to return a sentinel
(e.g., None) or, to preserve current behavior, add a clear comment stating the
fall-through is intentional so later .xlsx handling/ValueError will apply;
update the block around suffix, pd.read_csv, df_head, has_seqid_cols,
_SOMASCAN_MARKER_COLS and _OLINK_MARKER_COLS to include that explicit else or
comment.

In `@src/pyprideap/validators/olink_target.py`:
- Around line 7-20: The class-level attribute _rule_prefix is only used inside
OlinkTargetValidator.validate; simplify by removing the _rule_prefix attribute
and use a local variable or inline string in validate instead: update
OlinkTargetValidator to delete the _rule_prefix declaration and change the
replace call in validate to r.rule.replace("olink.", "olink_target.", 1) (or
assign prefix = "olink_target" at the top of validate and use f"{prefix}.")
while keeping the existing mapping logic that constructs ValidationResult from
super().validate(dataset).
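The `count=1` argument is what keeps any later occurrence of the substring intact; a quick illustration with hypothetical rule names:

```python
# Hypothetical inherited rule names produced by a base Olink validator.
rules = ["olink.samples.missing", "olink.npx.olink.range"]

# count=1 rewrites only the leading prefix, leaving any nested "olink."
# substring untouched.
rewritten = [r.replace("olink.", "olink_target.", 1) for r in rules]
```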

In `@tests/test_integration.py`:
- Around line 6-12: The test currently only asserts that results is a list;
enhance it to verify expected validation outcomes by checking that results
contains no errors for this valid fixture: after calling
pyprideap.validate(dataset) assert either that results is empty or more robustly
assert that no entry in results has severity 'error' (e.g., assert not
any(r.get('severity') == 'error' for r in results)). Also consider adding an
assertion on results structure (e.g., each item has keys like 'message' or
'severity') to ensure the validation output format from pyprideap.validate
remains stable.

In `@tests/test_pride.py`:
- Around line 36-43: The test test_get_project_not_found should mock the same
exception type Requests would raise: replace the generic Exception thrown from
mock_response.raise_for_status.side_effect with requests.exceptions.HTTPError so
the behavior matches real requests behavior; update the test to import requests
(or reference requests.exceptions.HTTPError) and set
mock_response.raise_for_status.side_effect = requests.exceptions.HTTPError("404
Not Found"), keeping the patch on pyprideap.pride.requests.Session.get and the
existing pytest.raises assertion targeting the HTTPError message.

In `@tests/test_registry.py`:
- Around line 35-52: Add a new test method test_read_somascan_csv to the
TestAutoRead class that calls read(somascan_csv_path), asserts the returned
object is the expected dataset type (e.g., AffinityDataset if applicable) and
that ds.platform == Platform.SOMASCAN; reference the existing read() call
pattern and the somascan_csv_path fixture used in TestDetectFormat to mirror
other tests (e.g., test_read_somascan_adat) for consistency.

In `@tests/test_validators.py`:
- Around line 147-176: Add a unit test that ensures OlinkTargetValidator
rewrites inherited rule names from the olink. prefix to olink_target.; create a
new test method (e.g., test_rule_prefix_rewriting) inside
TestOlinkTargetValidator that builds a minimal dataset via
_make_olink_target_dataset, runs OlinkTargetValidator().validate(ds), filters
error-level results using Level, and asserts that every error.rule starts with
"olink_target." and that none start with "olink."; this verifies the prefix
rewriting behavior for inherited rules.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 348a20d9-ebc1-4e1a-a5c1-1d90108116cd

📥 Commits

Reviewing files that changed from the base of the PR and between 46b0ee9 and 910ac2c.

⛔ Files ignored due to path filters (4)
  • tests/data/olink_sample.npx.csv is excluded by !**/*.csv
  • tests/data/olink_sample.parquet is excluded by !**/*.parquet
  • tests/data/olink_sample.xlsx is excluded by !**/*.xlsx
  • tests/data/somascan_sample.csv is excluded by !**/*.csv
📒 Files selected for processing (31)
  • .github/workflows/ci.yml
  • .gitignore
  • README.md
  • pyproject.toml
  • src/pyprideap/__init__.py
  • src/pyprideap/core.py
  • src/pyprideap/pride.py
  • src/pyprideap/readers/__init__.py
  • src/pyprideap/readers/olink_csv.py
  • src/pyprideap/readers/olink_parquet.py
  • src/pyprideap/readers/olink_xlsx.py
  • src/pyprideap/readers/registry.py
  • src/pyprideap/readers/somascan_adat.py
  • src/pyprideap/readers/somascan_csv.py
  • src/pyprideap/stats.py
  • src/pyprideap/validators/__init__.py
  • src/pyprideap/validators/base.py
  • src/pyprideap/validators/olink_explore.py
  • src/pyprideap/validators/olink_target.py
  • src/pyprideap/validators/somascan.py
  • tests/__init__.py
  • tests/conftest.py
  • tests/create_fixtures.py
  • tests/data/somascan_sample.adat
  • tests/test_core.py
  • tests/test_integration.py
  • tests/test_pride.py
  • tests/test_readers.py
  • tests/test_registry.py
  • tests/test_stats.py
  • tests/test_validators.py

Comment on lines +37 to +40
for loc in locations:
    if loc.get("name") == "FTP Protocol":
        urls[name] = loc["value"]
        break

⚠️ Potential issue | 🟡 Minor

Guard against missing value key to prevent KeyError.

Line 39 accesses loc["value"] without verifying the key exists. If the API response lacks a value field for an FTP location, this will raise a KeyError.

🛡️ Proposed defensive fix
             for loc in locations:
                 if loc.get("name") == "FTP Protocol":
-                    urls[name] = loc["value"]
+                    if "value" in loc:
+                        urls[name] = loc["value"]
                     break
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/pride.py` around lines 37 - 40, The loop that sets urls[name]
uses loc["value"] without checking the key, which can raise KeyError; update the
block that iterates over locations (the for loc in locations loop) to safely
access the value by either checking "value" in loc or using loc.get("value") and
only assign urls[name] when a non-None value is present (and keep the existing
break behavior after assignment).

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
src/pyprideap/__init__.py (1)

3-13: ⚠️ Potential issue | 🟠 Major

Guard __version__ for source-tree imports.

version("pyprideap") raises PackageNotFoundError when the package is imported without installed metadata, so import pyprideap can fail in local-dev and test setups.

Suggested fix
-from importlib.metadata import version
+from importlib.metadata import PackageNotFoundError, version
@@
-__version__ = version("pyprideap")
+try:
+    __version__ = version("pyprideap")
+except PackageNotFoundError:
+    __version__ = "0+unknown"
#!/bin/bash
python - <<'PY'
from pathlib import Path

text = Path("src/pyprideap/__init__.py").read_text()
print("has PackageNotFoundError guard:", "PackageNotFoundError" in text)
print('has direct version() assignment:', '__version__ = version("pyprideap")' in text)
PY

python - <<'PY'
from importlib.metadata import PackageNotFoundError, version

try:
    version("definitely-not-installed-package-123456")
except PackageNotFoundError as exc:
    print(type(exc).__name__)
PY
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/__init__.py` around lines 3 - 13, The module-level assignment
__version__ = version("pyprideap") can raise PackageNotFoundError when running
from the source tree; wrap the call to importlib.metadata.version in a
try/except that catches importlib.metadata.PackageNotFoundError (or the imported
PackageNotFoundError symbol) and set a safe fallback (e.g. __version__ = "0.0.0"
or None) in the except branch so importing pyprideap in local-dev/tests
won't fail; update the top-level import/assignment involving version and
__version__ accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/pyprideap/filtering.py`:
- Around line 43-45: The code applies a boolean Series (keep_mask) from
dataset.samples directly to dataset.expression which risks label-aligned
selection if indexes diverge; change those selections to positional indexing
using .iloc with the boolean array (e.g., replace dataset.expression[keep_mask]
with dataset.expression.iloc[keep_mask.to_numpy()] and do the same for any other
occurrence such as the selection at lines referencing keep_mask/samples) and
then call .reset_index(drop=True) as before.

In `@src/pyprideap/lod.py`:
- Around line 56-69: The code currently falls back to any control types when no
negative controls exist; change this to require negative controls only by
setting control_mask = is_negative (remove the fallback to is_control) and then
raise a ValueError if not is_negative.any() or if control_mask.sum() <
_MIN_CONTROLS_FOR_LOD; update the check around
compute_lod_stats()/get_valid_proteins to rely solely on control_mask (negative
samples) and keep the existing error message using _MIN_CONTROLS_FOR_LOD and
control_mask.sum() to report insufficient negatives.

In `@src/pyprideap/validators/olink_explore.py`:
- Around line 65-74: The _check_qc_consistency method currently skips sample
rows when there is no corresponding expression row, allowing misaligned datasets
to pass; update _check_qc_consistency (in class handling AffinityDataset) to
detect a mismatch between ds.samples and ds.expression lengths up front and fail
fast by returning a ValidationResult (or raising a ValidationError) indicating a
fatal row-count mismatch instead of silently continuing; specifically, add a
check comparing len(ds.samples) and len(ds.expression) at the start of
_check_qc_consistency and emit/return a clear error ValidationResult referencing
the dataset and the mismatch so downstream code stops processing misaligned
tables.

---

Duplicate comments:
In `@src/pyprideap/__init__.py`:
- Around line 3-13: The module-level assignment __version__ =
version("pyprideap") can raise PackageNotFoundError when running from the source
tree; wrap the call to importlib.metadata.version in a try/except that catches
importlib.metadata.PackageNotFoundError (or the imported PackageNotFoundError
symbol) and set a safe fallback (e.g. __version__ = "0.0.0" or None) in the
except branch so importing pyprideap in local-dev/tests won't fail; update the
top-level import/assignment involving version and __version__ accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8f31f39c-4945-440a-94b8-52ffaadf5309

📥 Commits

Reviewing files that changed from the base of the PR and between 910ac2c and bc32e82.

📒 Files selected for processing (6)
  • src/pyprideap/__init__.py
  • src/pyprideap/filtering.py
  • src/pyprideap/lod.py
  • src/pyprideap/validators/olink_explore.py
  • tests/test_filtering.py
  • tests/test_lod.py

Comment on lines +43 to +45
keep_mask = ~is_control
samples = dataset.samples[keep_mask].reset_index(drop=True)
expression = dataset.expression[keep_mask].reset_index(drop=True)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python - <<'PY'
import pandas as pd

samples = pd.DataFrame({"SampleType": ["Sample", "Control"]}, index=[100, 101])
expression = pd.DataFrame({"OID0": [1.0, 2.0]}, index=[0, 1])
mask = samples["SampleType"].eq("Sample")

print("samples.index   =", list(samples.index))
print("expression.index=", list(expression.index))

try:
    expression[mask]
except Exception as exc:
    print(type(exc).__name__, str(exc))

print("positional fix:")
print(expression.loc[mask.to_numpy()])
PY

Repository: PRIDE-Archive/pyprideap

Length of output: 185


🏁 Script executed:

cat -n src/pyprideap/filtering.py | head -80



🏁 Script executed:

cat -n src/pyprideap/core.py | grep -A 20 "class AffinityDataset"



🏁 Script executed:

rg "def samples|def expression" src/pyprideap/core.py -A 5



🏁 Script executed:

rg "AffinityDataset(" src/pyprideap/ -B 2 -A 5 | head -100



🏁 Script executed:

find src/pyprideap -name "*.py" -exec grep -l "samples.*expression" {} \;



🏁 Script executed:

grep -n "\.index\|reset_index\|set_index" src/pyprideap/filtering.py



🏁 Script executed:

rg "read_csv|read_parquet|pd\.DataFrame" src/pyprideap/ -B 1 -A 1 | head -80



🏁 Script executed:

find src/pyprideap -name "*.py" -exec grep -l "def __init__\|class.*:" {} \; | xargs ls -la



🏁 Script executed:

cat -n src/pyprideap/core.py | head -100



🏁 Script executed:

cat -n src/pyprideap/readers/somascan_csv.py



🏁 Script executed:

cat -n src/pyprideap/readers/somascan_adat.py



🏁 Script executed:

cat -n src/pyprideap/readers/olink_csv.py



🏁 Script executed:

cat -n src/pyprideap/readers/olink_parquet.py



Use positional masking when subsetting expression.

Lines 45 and 74 apply a boolean Series from dataset.samples directly to dataset.expression. While this currently works because readers ensure matching RangeIndex(0, n) for both DataFrames, the code relies on an undocumented invariant. If the indexes ever diverge, pandas' label-aligned indexing will select wrong rows or raise errors.

Suggested fix
-    expression = dataset.expression[keep_mask].reset_index(drop=True)
+    expression = dataset.expression.iloc[keep_mask.to_numpy()].reset_index(drop=True)

Also applies to: 73-74

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/filtering.py` around lines 43 - 45, The code applies a boolean
Series (keep_mask) from dataset.samples directly to dataset.expression which
risks label-aligned selection if indexes diverge; change those selections to
positional indexing using .iloc with the boolean array (e.g., replace
dataset.expression[keep_mask] with dataset.expression.iloc[keep_mask.to_numpy()]
and do the same for any other occurrence such as the selection at lines
referencing keep_mask/samples) and then call .reset_index(drop=True) as before.

Comment on lines +56 to +69
# Further narrow to negative controls only
is_negative = (
    dataset.samples["SampleType"]
    .astype(str)
    .str.lower()
    .str.strip()
    .isin({"negative", "negative control", "negative_control", "neg"})
)

control_mask = is_negative if is_negative.any() else is_control
if control_mask.sum() < _MIN_CONTROLS_FOR_LOD:
    raise ValueError(
        f"Need at least {_MIN_CONTROLS_FOR_LOD} control samples for LOD calculation, found {control_mask.sum()}"
    )
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Require negative controls instead of falling back to any control.

The current fallback computes LOD from positive/QC/reference controls when no negative controls exist. That contradicts the docstring and can materially distort thresholds returned by compute_lod_stats() and get_valid_proteins().

Suggested fix
-    control_mask = is_negative if is_negative.any() else is_control
-    if control_mask.sum() < _MIN_CONTROLS_FOR_LOD:
+    if not is_negative.any():
+        raise ValueError("Negative control samples required for LOD calculation")
+
+    control_mask = is_negative
+    if control_mask.sum() < _MIN_CONTROLS_FOR_LOD:
         raise ValueError(
             f"Need at least {_MIN_CONTROLS_FOR_LOD} control samples for LOD calculation, found {control_mask.sum()}"
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/lod.py` around lines 56 - 69, The code currently falls back to
any control types when no negative controls exist; change this to require
negative controls only by setting control_mask = is_negative (remove the
fallback to is_control) and then raise a ValueError if not is_negative.any() or
if control_mask.sum() < _MIN_CONTROLS_FOR_LOD; update the check around
compute_lod_stats()/get_valid_proteins to rely solely on control_mask (negative
samples) and keep the existing error message using _MIN_CONTROLS_FOR_LOD and
control_mask.sum() to report insufficient negatives.

Comment on lines +65 to +74
def _check_qc_consistency(self, ds: AffinityDataset) -> list[ValidationResult]:
    if "SampleQC" not in ds.samples.columns:
        return []
    results = []
    for idx in range(len(ds.samples)):
        row = ds.samples.iloc[idx]
        qc = row.get("SampleQC")
        if idx >= len(ds.expression):
            continue
        expr_row = ds.expression.iloc[idx]

⚠️ Potential issue | 🟠 Major

Fail fast on sample/expression row-count mismatches.

Line 72 silently skips samples that have no corresponding expression row. That turns a corrupted dataset into a partially validated one and leaves downstream filtering/stats code operating on misaligned tables.

Suggested fix
     def _check_qc_consistency(self, ds: AffinityDataset) -> list[ValidationResult]:
         if "SampleQC" not in ds.samples.columns:
             return []
+        if len(ds.samples) != len(ds.expression):
+            return [
+                ValidationResult(
+                    level=Level.ERROR,
+                    rule="olink.expression.sample_count_mismatch",
+                    message="Sample table and expression matrix must have the same number of rows",
+                    details={
+                        "n_samples": len(ds.samples),
+                        "n_expression_rows": len(ds.expression),
+                    },
+                )
+            ]
         results = []
         for idx in range(len(ds.samples)):
             row = ds.samples.iloc[idx]
             qc = row.get("SampleQC")
-            if idx >= len(ds.expression):
-                continue
             expr_row = ds.expression.iloc[idx]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/validators/olink_explore.py` around lines 65 - 74, The
_check_qc_consistency method currently skips sample rows when there is no
corresponding expression row, allowing misaligned datasets to pass; update
_check_qc_consistency (in class handling AffinityDataset) to detect a mismatch
between ds.samples and ds.expression lengths up front and fail fast by returning
a ValidationResult (or raising a ValidationError) indicating a fatal row-count
mismatch instead of silently continuing; specifically, add a check comparing
len(ds.samples) and len(ds.expression) at the start of _check_qc_consistency and
emit/return a clear error ValidationResult referencing the dataset and the
mismatch so downstream code stops processing misaligned tables.
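The fail-fast pattern suggested above can be sketched in isolation. This is a minimal, runnable illustration using hypothetical stand-ins (`MiniDataset`, `check_row_alignment`) rather than the library's actual `AffinityDataset`/`ValidationResult` types:

```python
from dataclasses import dataclass

import pandas as pd

# Hypothetical stand-in for AffinityDataset, just to show the shape of the check.
@dataclass
class MiniDataset:
    samples: pd.DataFrame
    expression: pd.DataFrame

def check_row_alignment(ds: MiniDataset) -> list[str]:
    """Return one fatal error instead of silently skipping unmatched rows."""
    if len(ds.samples) != len(ds.expression):
        return [
            f"row-count mismatch: {len(ds.samples)} samples vs "
            f"{len(ds.expression)} expression rows"
        ]
    return []

ds = MiniDataset(
    samples=pd.DataFrame({"SampleID": ["s1", "s2", "s3"]}),
    expression=pd.DataFrame({"OID1": [1.0, 2.0]}),  # one row short
)
errors = check_row_alignment(ds)
print(errors)
```

Reporting a single fatal mismatch up front keeps per-row validation from ever indexing past the shorter table.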

ypriverol and others added 3 commits March 11, 2026 15:28
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nalysis

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…s, compute_cv_distribution, compute_detection_rate

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (3)
src/pyprideap/readers/somascan_csv.py (1)

24-30: ⚠️ Potential issue | 🟠 Major

Still unresolved: use the same SeqId values in features and expression.columns.

features["SeqId"] strips the SeqId. prefix, but the expression matrix keeps the prefixed names. Any downstream lookup keyed by SeqId will therefore fail to match, even though both structures describe the same assays.

Suggested fix
-    expression = df[seq_cols].astype(float).reset_index(drop=True)
-
     seq_ids = [c.replace("SeqId.", "") for c in seq_cols]
+    expression = df[seq_cols].astype(float).reset_index(drop=True)
+    expression.columns = seq_ids
     features = pd.DataFrame(
         {
             "SeqId": seq_ids,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/readers/somascan_csv.py` around lines 24 - 30, The features
DataFrame is created with SeqId values that have the "SeqId." prefix stripped
(seq_ids) while the expression DataFrame columns remain as the original prefixed
names (expression.columns from seq_cols), causing mismatches; fix by making the
two representations identical — either keep the prefix when building features
(set features["SeqId"] to seq_cols or to the same strings used in
expression.columns) or normalize expression.columns to seq_ids so that
expression.columns and features["SeqId"] match (update where seq_ids, seq_cols,
features, and expression are handled).
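The mismatch is easy to reproduce with toy data. A minimal sketch (using made-up SeqId values) showing why the stripped IDs never match the prefixed columns, and how normalizing the expression columns fixes it:

```python
import pandas as pd

# Toy wide-format input: expression keeps the "SeqId." prefix,
# while features strips it — the two ID spaces no longer intersect.
seq_cols = ["SeqId.10000-28", "SeqId.10001-7"]
expression = pd.DataFrame([[1.2, 3.4]], columns=seq_cols)
features = pd.DataFrame({"SeqId": [c.replace("SeqId.", "") for c in seq_cols]})

# Every lookup keyed on the stripped SeqId misses:
missing = [s for s in features["SeqId"] if s not in expression.columns]
print(missing)  # both IDs miss

# Normalizing expression.columns to the same stripped IDs restores the link:
expression.columns = features["SeqId"].tolist()
assert all(s in expression.columns for s in features["SeqId"])
```

Either representation (prefixed or stripped) works, as long as both structures use the same one.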
src/pyprideap/readers/olink_parquet.py (1)

27-36: ⚠️ Potential issue | 🟠 Major

Still unresolved: reindex the feature axis before returning the dataset.

This has the same silent misalignment risk as the XLSX reader: features keeps first-seen OlinkID order, while the pivoted matrix can use a different column order. Because AffinityDataset does not validate alignment, callers can get the wrong metadata for each expression column.

Suggested fix
     feature_cols = [c for c in df.columns if c in _FEATURE_COLS]
     features = df[feature_cols].drop_duplicates(subset=["OlinkID"]).reset_index(drop=True)
 
     expression = df.pivot_table(
         index="SampleID",
         columns="OlinkID",
         values="NPX",
         aggfunc="first",
     )
-    expression = expression.reindex(samples["SampleID"].values).reset_index(drop=True)
+    expression = expression.reindex(
+        index=samples["SampleID"].values,
+        columns=features["OlinkID"].values,
+    ).reset_index(drop=True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/readers/olink_parquet.py` around lines 27 - 36, features and
the pivoted expression can be misaligned because features is built from
first-seen OlinkID order while expression.columns may be different; after
creating expression (the pivot_table) reorder/reindex the features DataFrame to
match expression.columns (use expression.columns / OlinkID values) so that the
features rows align with expression columns before returning or passing into
AffinityDataset (e.g., set features =
features.set_index("OlinkID").reindex(expression.columns).reset_index()).
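The underlying hazard — `pivot_table()` sorting its column axis while `drop_duplicates()` preserves first-seen order — can be demonstrated with a small long-format frame. A sketch with toy OlinkIDs, applying the column-reindex fix proposed above:

```python
import pandas as pd

# Long-format toy input where the first-seen OlinkID order is not alphabetical.
df = pd.DataFrame({
    "SampleID": ["s1", "s1", "s2", "s2"],
    "OlinkID": ["OID2", "OID1", "OID2", "OID1"],
    "NPX": [1.0, 2.0, 3.0, 4.0],
})

features = df[["OlinkID"]].drop_duplicates().reset_index(drop=True)
expression = df.pivot_table(index="SampleID", columns="OlinkID",
                            values="NPX", aggfunc="first")

# pivot_table sorts its columns, so positions no longer match `features`:
print(features["OlinkID"].tolist())  # first-seen order: OID2, OID1
print(list(expression.columns))      # sorted order: OID1, OID2

# Reindexing the columns to the features order restores positional pairing:
expression = expression.reindex(columns=features["OlinkID"].to_numpy())
```

After the reindex, `features.iloc[i]` again describes `expression.iloc[:, i]` for every `i`.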
src/pyprideap/readers/olink_xlsx.py (1)

27-31: ⚠️ Potential issue | 🟠 Major

Still unresolved: keep the feature axis aligned with features.

features preserves first-seen OlinkID order, but pivot_table() can return columns in a different order. On unsorted XLSX input, features.iloc[i] no longer describes expression.iloc[:, i], and AffinityDataset will pass that mismatch through unchecked.

Suggested fix
     feature_cols = [c for c in df.columns if c in _FEATURE_COLS]
     features = df[feature_cols].drop_duplicates(subset=["OlinkID"]).reset_index(drop=True)
 
     expression = df.pivot_table(index="SampleID", columns="OlinkID", values="NPX", aggfunc="first")
-    expression = expression.reindex(samples["SampleID"].values).reset_index(drop=True)
+    expression = expression.reindex(
+        index=samples["SampleID"].values,
+        columns=features["OlinkID"].values,
+    ).reset_index(drop=True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/readers/olink_xlsx.py` around lines 27 - 31, The feature axis
of expression can be out of order versus features because pivot_table may sort
columns differently; after creating features (feature_cols and features =
df[feature_cols]...), reindex the columns of expression to the OlinkID order in
features (use features["OlinkID"] as the column index ordering) so that
expression.iloc[:, i] corresponds to features.iloc[i]; perform this reindexing
on expression.columns (before any reset_index that drops column alignment) and
handle missing columns by allowing NaNs or filling as appropriate.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pyproject.toml`:
- Around line 55-56: Add a pytest pythonpath entry under the existing
[tool.pytest.ini_options] so pytest can locate the package in a src/ layout;
specifically, update the [tool.pytest.ini_options] block (the same section that
contains testpaths) to include a pythonpath setting that points to "src" so
tests won't raise ModuleNotFoundError when running pytest directly.

In `@src/pyprideap/readers/olink_csv.py`:
- Around line 27-36: The expression DataFrame columns can be alphabetically
sorted by pivot_table(), causing misalignment with features (built via
feature_cols and features = df[feature_cols].drop_duplicates(subset=["OlinkID"])
), so after creating expression (the pivot_table and reindex by SampleID)
reindex or reorder expression.columns to match the features["OlinkID"] order;
update the olink_csv.py (and analogously olink_xlsx.py and olink_parquet.py) to
set expression = expression.reindex(columns=features["OlinkID"].values) or
otherwise reorder expression.columns to features' OlinkID sequence so positional
pairing is preserved.

In `@src/pyprideap/readers/registry.py`:
- Around line 40-42: detect_format currently classifies a file as "somascan_csv"
when _SOMASCAN_MARKER_COLS is a subset of cols, but read_somascan_csv expects
wide-format files with SeqId.* columns and will fail for long-format inputs;
change the detection logic in detect_format() to only return "somascan_csv" when
has_seqid_cols (i.e., any column startswith "SeqId.") is true, removing the
_SOMASCAN_MARKER_COLS.issubset(cols) branch so only wide-format files are routed
to read_somascan_csv(); if long-format SomaScan CSV support is needed later, add
a separate reader instead of reusing read_somascan_csv().


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9dc01a85-4667-4b9b-a88d-7ae79f74a129

📥 Commits

Reviewing files that changed from the base of the PR and between bc32e82 and e39bda9.

📒 Files selected for processing (12)
  • .github/workflows/ci.yml
  • pyproject.toml
  • src/pyprideap/__init__.py
  • src/pyprideap/pride.py
  • src/pyprideap/readers/olink_csv.py
  • src/pyprideap/readers/olink_parquet.py
  • src/pyprideap/readers/olink_xlsx.py
  • src/pyprideap/readers/registry.py
  • src/pyprideap/readers/somascan_csv.py
  • tests/test_integration.py
  • tests/test_registry.py
  • tests/test_validators.py
🚧 Files skipped from review as they are similar to previous changes (5)
  • tests/test_validators.py
  • .github/workflows/ci.yml
  • tests/test_integration.py
  • tests/test_registry.py
  • src/pyprideap/pride.py

Comment on lines +55 to +56
[tool.pytest.ini_options]
testpaths = ["tests"]
⚠️ Potential issue | 🟡 Minor

Add pythonpath for src-layout compatibility.

With a src/ layout, pytest won't find the pyprideap package without either an editable install or explicit path configuration. This causes ModuleNotFoundError when running pytest directly.

🔧 Proposed fix
 [tool.pytest.ini_options]
 testpaths = ["tests"]
+pythonpath = ["src"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 55 - 56, Add a pytest pythonpath entry under the
existing [tool.pytest.ini_options] so pytest can locate the package in a src/
layout; specifically, update the [tool.pytest.ini_options] block (the same
section that contains testpaths) to include a pythonpath setting that points to
"src" so tests won't raise ModuleNotFoundError when running pytest directly.

Comment on lines +40 to +42
        has_seqid_cols = any(c.startswith("SeqId.") for c in cols)
        if has_seqid_cols or _SOMASCAN_MARKER_COLS.issubset(cols):
            return "somascan_csv"
⚠️ Potential issue | 🟠 Major

Make SomaScan CSV detection match the reader’s actual input contract.

detect_format() accepts any CSV with SeqId and SomaId, but read_somascan_csv() only supports wide files with SeqId.* expression columns. That means read() can classify a file as somascan_csv and then fail immediately with "No SeqId.* columns found".

Suggested fix
-        if has_seqid_cols or _SOMASCAN_MARKER_COLS.issubset(cols):
+        if has_seqid_cols:
             return "somascan_csv"

If long-format SomaScan CSV support is intended later, it should get its own reader instead of being routed to the current wide-format implementation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/readers/registry.py` around lines 40 - 42, detect_format
currently classifies a file as "somascan_csv" when _SOMASCAN_MARKER_COLS is a
subset of cols, but read_somascan_csv expects wide-format files with SeqId.*
columns and will fail for long-format inputs; change the detection logic in
detect_format() to only return "somascan_csv" when has_seqid_cols (i.e., any
column startswith "SeqId.") is true, removing the
_SOMASCAN_MARKER_COLS.issubset(cols) branch so only wide-format files are routed
to read_somascan_csv(); if long-format SomaScan CSV support is needed later, add
a separate reader instead of reusing read_somascan_csv().

ypriverol and others added 6 commits March 11, 2026 15:37
…gator

Merged PCA, correlation, missing values, CV distribution, and detection
rate into compute.py alongside distribution, QC summary, and LOD analysis.
Added compute_all() that returns all applicable plot data for a dataset.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… public API

- qc_report() assembles all computed plot data into a standalone HTML file
  with Plotly CDN, table of contents, and platform summary header
- compute_qc() (alias for compute_all) and qc_report() added to __all__
- Supports both Olink and SomaScan platform reports

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tion,

missing frequency histogram, QC×LOD stacked bar, PCA with sample labels

- Distribution: overlaid per-sample histograms instead of single combined
- Missing Frequency: new plot — per-assay missing freq with 30% threshold
- QC Summary: stacked bar crossing QC status with above/below LOD
- PCA: use SampleQC for color when all same SampleType, show labels on points
- Add MissingFrequencyData and QcLodSummaryData dataclasses

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- PCA: set explicit height=500px on all figures to prevent div collapse
  when using to_html(include_plotlyjs=False)
- Missing Frequency: use MissingFreq column from Olink features metadata
  (per-assay metric from Olink) instead of computing NaN rate from
  pivoted expression matrix which is always 0%
- olink_csv reader: add MissingFreq to extracted feature columns

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@qodo-code-review

qodo-code-review bot commented Mar 12, 2026

CI Feedback 🧐

(Feedback updated until commit 5cb8d56)

A test triggered by this PR failed. Here is an AI-generated analysis of the failure:

Action: typecheck

Failed stage: Run mypy src/pyprideap/ --ignore-missing-imports [❌]

Failed test name: ""

Failure summary:

The GitHub Action failed during the mypy src/pyprideap/ --ignore-missing-imports step because mypy
reported 7 type-checking errors (exit code 1).
Errors were reported in:

- src/pyprideap/lod.py:
  - Line 127: returning Any from a function declared to return Series[Any] ([no-any-return])
  - Lines 197 and 412: incompatible argument type passed to numpy.tile (ndarray[...] | ExtensionArray not matching expected types) ([arg-type])
  - Line 202: returning Any from a function declared to return DataFrame | Series[Any] ([no-any-return])
- src/pyprideap/qc/compute.py:
  - Line 149: list(...) called with an object that does not match any list constructor overload ([call-overload])
- src/pyprideap/qc/report.py:
  - Line 252: attempting to call a function of unknown type ([operator])
  - Line 256: accessing .title on an object that does not define that attribute ([attr-defined])

Relevant error logs:
1:  ##[group]Runner Image Provisioner
2:  Hosted Compute Agent
...

273:  Stored in directory: /tmp/pip-ephem-wheel-cache-3jlgn90r/wheels/c1/f8/d8/2821f504d00caaf76261bd56e020104d1119cf37a51a94cf2d
274:  Successfully built pyprideap
275:  Installing collected packages: urllib3, typing_extensions, threadpoolctl, six, ruff, pygments, pyarrow, pluggy, pathspec, packaging, numpy, narwhals, mypy_extensions, librt, joblib, iniconfig, idna, et-xmlfile, charset_normalizer, certifi, types-requests, scipy, requests, python-dateutil, pytest, plotly, pandas-stubs, openpyxl, mypy, scikit-learn, pandas, pyprideap
276:  Successfully installed certifi-2026.2.25 charset_normalizer-3.4.5 et-xmlfile-2.0.0 idna-3.11 iniconfig-2.3.0 joblib-1.5.3 librt-0.8.1 mypy-1.19.1 mypy_extensions-1.1.0 narwhals-2.18.0 numpy-2.4.3 openpyxl-3.1.5 packaging-26.0 pandas-3.0.1 pandas-stubs-3.0.0.260204 pathspec-1.0.4 plotly-6.6.0 pluggy-1.6.0 pyarrow-23.0.1 pygments-2.19.2 pyprideap-0.1.0 pytest-9.0.2 python-dateutil-2.9.0.post0 requests-2.32.5 ruff-0.15.5 scikit-learn-1.8.0 scipy-1.17.1 six-1.17.0 threadpoolctl-3.6.0 types-requests-2.32.4.20260107 typing_extensions-4.15.0 urllib3-2.6.3
277:  ##[group]Run mypy src/pyprideap/ --ignore-missing-imports
278:  �[36;1mmypy src/pyprideap/ --ignore-missing-imports�[0m
279:  shell: /usr/bin/bash -e {0}
280:  env:
281:  pythonLocation: /opt/hostedtoolcache/Python/3.12.12/x64
282:  PKG_CONFIG_PATH: /opt/hostedtoolcache/Python/3.12.12/x64/lib/pkgconfig
283:  Python_ROOT_DIR: /opt/hostedtoolcache/Python/3.12.12/x64
284:  Python2_ROOT_DIR: /opt/hostedtoolcache/Python/3.12.12/x64
285:  Python3_ROOT_DIR: /opt/hostedtoolcache/Python/3.12.12/x64
286:  LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.12.12/x64/lib
287:  ##[endgroup]
288:  src/pyprideap/lod.py:127: error: Returning Any from function declared to return "Series[Any]"  [no-any-return]
289:  src/pyprideap/lod.py:197: error: Argument 1 to "tile" has incompatible type "ndarray[tuple[int], dtype[Any]] | ExtensionArray"; expected "_SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]]"  [arg-type]
290:  src/pyprideap/lod.py:202: error: Returning Any from function declared to return "DataFrame | Series[Any]"  [no-any-return]
291:  src/pyprideap/lod.py:412: error: Argument 1 to "tile" has incompatible type "ndarray[tuple[int], dtype[Any]] | ExtensionArray"; expected "_SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]]"  [arg-type]
292:  src/pyprideap/qc/compute.py:149: error: No overload variant of "list" matches argument type "object"  [call-overload]
293:  src/pyprideap/qc/compute.py:149: note: Possible overload variants:
294:  src/pyprideap/qc/compute.py:149: note:     def [_T] __init__(self) -> list[_T]
295:  src/pyprideap/qc/compute.py:149: note:     def [_T] __init__(self, Iterable[_T], /) -> list[_T]
296:  src/pyprideap/qc/report.py:252: error: Cannot call function of unknown type  [operator]
297:  src/pyprideap/qc/report.py:256: error: "object" has no attribute "title"  [attr-defined]
298:  Found 7 errors in 3 files (checked 22 source files)
299:  ##[error]Process completed with exit code 1.
300:  Post job cleanup.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 10

♻️ Duplicate comments (1)
src/pyprideap/filtering.py (1)

43-45: ⚠️ Potential issue | 🟠 Major

Use positional row filtering for expression and lod_matrix.

Lines 45, 53, 82, and 90 apply a boolean Series built from dataset.samples directly to other DataFrames. Pandas aligns that mask by index label, so any future index drift between these frames will mis-filter rows or raise. Convert the mask once with keep_mask.to_numpy() and use .iloc[...] for the row-aligned matrices.

🐛 Suggested fix
     keep_mask = ~is_control
-    samples = dataset.samples[keep_mask].reset_index(drop=True)
-    expression = dataset.expression[keep_mask].reset_index(drop=True)
+    row_mask = keep_mask.to_numpy()
+    samples = dataset.samples.loc[keep_mask].reset_index(drop=True)
+    expression = dataset.expression.iloc[row_mask].reset_index(drop=True)
@@
         lod_df = metadata["lod_matrix"]
         if isinstance(lod_df, pd.DataFrame):
-            metadata["lod_matrix"] = lod_df[keep_mask].reset_index(drop=True)
+            metadata["lod_matrix"] = lod_df.iloc[row_mask].reset_index(drop=True)
@@
-    samples = dataset.samples[keep_mask].reset_index(drop=True)
-    expression = dataset.expression[keep_mask].reset_index(drop=True)
+    row_mask = keep_mask.to_numpy()
+    samples = dataset.samples.loc[keep_mask].reset_index(drop=True)
+    expression = dataset.expression.iloc[row_mask].reset_index(drop=True)
@@
         lod_df = metadata["lod_matrix"]
         if isinstance(lod_df, pd.DataFrame):
-            metadata["lod_matrix"] = lod_df[keep_mask].reset_index(drop=True)
+            metadata["lod_matrix"] = lod_df.iloc[row_mask].reset_index(drop=True)

Also applies to: 51-53, 81-82, 88-90

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/filtering.py` around lines 43 - 45, The boolean mask built from
dataset.samples (keep_mask = ~is_control) is being applied directly to other
DataFrames which causes label-based alignment bugs; convert the mask to a
positional boolean array (e.g., keep_pos = keep_mask.to_numpy()) and use
.iloc[...] to select rows for dataset.samples, dataset.expression and
dataset.lod_matrix (and any other uses) — replace expressions like samples =
dataset.samples[keep_mask].reset_index(...) and expression =
dataset.expression[keep_mask].reset_index(...) with samples =
dataset.samples.iloc[keep_pos].reset_index(drop=True), expression =
dataset.expression.iloc[keep_pos].reset_index(drop=True), and similarly use
.iloc[keep_pos] when slicing lod_matrix.
🧹 Nitpick comments (1)
src/pyprideap/qc/render.py (1)

45-60: Normalize the histogram traces if this chart is supposed to compare distributions.

The docstring says density/KDE-like output, but Lines 54-60 use histnorm="", which plots raw counts. Samples with more non-missing measurements will therefore look "larger" even when the underlying shape is identical. If the intent is distribution comparison, switch to histnorm="probability density"/"probability" or update the copy to match the current behavior.

💡 Suggested fix
-                histnorm="",
+                histnorm="probability density",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/qc/render.py` around lines 45 - 60, The histogram traces in
render_distribution currently use histnorm="" which shows raw counts and biases
samples with more measurements; update the go.Histogram call in
render_distribution to use an appropriate normalization such as
histnorm="probability density" (or "probability") so the per-sample overlaid
curves compare distributions rather than counts, and ensure the docstring still
states density/KDE-like output (or adjust the docstring if you intentionally
keep raw counts).
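Why raw-count histograms mislead when per-sample measurement counts differ can be shown without Plotly. A NumPy sketch with synthetic data: two samples drawn from the same distribution, one ten times larger, compared first by raw counts and then with density normalization (which is what Plotly's histnorm="probability density" applies):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(0.0, 1.0, 100)    # small sample
b = rng.uniform(0.0, 1.0, 1000)   # same distribution, 10x the points

bins = np.linspace(0.0, 1.0, 11)
counts_a, _ = np.histogram(a, bins=bins)
counts_b, _ = np.histogram(b, bins=bins)
# Raw counts: sample b towers over sample a despite an identical shape.
print(counts_b.sum() / counts_a.sum())  # 10.0

# Density normalization puts both on the same scale: each integrates to 1.
dens_a, _ = np.histogram(a, bins=bins, density=True)
dens_b, _ = np.histogram(b, bins=bins, density=True)
width = bins[1] - bins[0]
print(dens_a.sum() * width, dens_b.sum() * width)  # 1.0 1.0
```

The same reasoning applies per overlaid trace in the report: with histnorm="", a sample with more non-missing measurements always looks "larger".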
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/pyprideap/__init__.py`:
- Around line 3-13: The import block is too long and triggers Ruff I001/E501;
reflow the pyprideap.lod import into a parenthesized, multi-line import and let
Ruff/isort reorder groups so line lengths and ordering meet style rules—e.g.
replace the single long line that imports LodMethod, LodStats,
compute_lod_from_controls, compute_nclod, compute_lod_stats,
get_bundled_fixed_lod_path, get_reported_lod, get_valid_proteins, load_fixed_lod
with a parenthesized import list (one name per line) and then run ruff/isort to
fix ordering and spacing.

In `@src/pyprideap/lod.py`:
- Around line 486-488: The comprehension creating panel_cols is too long and
exceeds the 120-char limit; split it across multiple lines to satisfy the linter
by breaking after the opening bracket and aligning the conditional onto its own
line. Update the expression that builds panel_cols (the list comprehension
referencing dataset.expression.columns, panel_map, panel_name, and
has_lod_per_col) so each logical part is on a separate line (e.g., start the
list on one line, put the for clause on the next line, and the if condition on
another) to keep line length <=120.
- Line 247: The hardcoded _CONFIGS_DIR path will break after pip installation
and causes get_bundled_fixed_lod_path() to return None; replace filesystem-based
resolution with importlib.resources (or importlib.resources.files/open_text) to
load the bundled CSVs from the package resources and update pyproject.toml to
include those CSVs as package data (or configure hatchling's include section) so
they are distributed; modify get_bundled_fixed_lod_path() (and any callers) to
use importlib.resources APIs to locate/read the CSVs rather than
Path(__file__).resolve().parent... and remove reliance on _CONFIGS_DIR.

In `@src/pyprideap/qc/compute.py`:
- Around line 98-99: The helper _sample_id_col currently returns "SampleId" even
when neither "SampleID" nor "SampleId" exists; change _sample_id_col to return
None (or raise a ValueError with a clear message) when no valid column is found,
and update callers such as compute_correlation to check the return value before
indexing dataset.samples (raise a clear error or handle the missing ID column
case) so a KeyError is avoided and errors are informative.
- Around line 148-149: The error arises because dataset.metadata is typed
dict[str, object] so dataset.metadata["MissingFreq"] is object and list(...)
fails; fix by casting that value to the expected iterable type before
converting/assigning to freq. In compute.py, import typing.cast and change the
assignment in the branch with "MissingFreq" to use cast(List[str] or
Iterable[str], dataset.metadata["MissingFreq"]) (or cast to the concrete type
you expect) so freq has the proper type; reference the variables/members
dataset.metadata, "MissingFreq", and freq when making this change.
- Around line 218-231: The loop over numeric.columns uses the index i to grab
assay IDs and panels from dataset.features which can misalign; instead, map by
column name (col) to the feature row: look up dataset.features rows where
dataset.features[id_col] == col and use that row's id and panel values (fall
back to col or empty string if no match). Update the assignments that populate
assay_ids and panels (references: numeric.columns loop, dataset.features,
id_col, panel_col, assay_ids, panels) to perform a name-based lookup (or an
indexed mapping built once outside the loop) rather than using i-based
positional indexing.

In `@src/pyprideap/qc/report.py`:
- Around line 26-30: The help text for the "missing_frequency" entry contains
escaped percent signs ("30%%") that render as "30%%" in the HTML; update the
string literal under the "missing_frequency" key in src/pyprideap/qc/report.py
to use single percent signs ("30%") instead of "30%%" so the HTML shows "30%".
- Line 335: The HTML file is written with output.write_text(html) using
platform-default encoding while the content declares UTF-8 and contains
non-ASCII characters; change the write to explicitly use UTF-8 by calling
output.write_text(html, encoding="utf-8") and update any tests that read this
file with Path.read_text() to call read_text(encoding="utf-8") so bytes and
declared charset always match (look for the output.write_text(html) call in
report generation and corresponding read_text() uses in tests).
- Around line 220-230: Narrow the type of plot_data entries before calling the
renderer by checking the runtime type: after retrieving data =
plot_data.get(key) and ensuring data is not None, use isinstance(data, _dtype)
(where _dtype and renderer are the two elements from the _RENDERERS tuple you
unpack) or cast to the dataclass so that renderer(data) and access like
data.title are type-safe; also fix the help string literals by replacing the
double-percent occurrences ("30%% threshold", "more than 30%% of") with single
percent signs ("30% threshold", "more than 30% of"); finally, write the HTML
with explicit UTF-8 encoding by changing output.write_text(html) to
output.write_text(html, encoding="utf-8").

In `@src/pyprideap/readers/olink_xlsx.py`:
- Line 10: The XLSX reader currently drops the MissingFreq column because
_FEATURE_COLS in src/pyprideap/readers/olink_xlsx.py lacks "MissingFreq"; update
the _FEATURE_COLS set to include "MissingFreq" (matching the entry in
src/pyprideap/readers/olink_csv.py) so feature metadata extraction in the Olink
XLSX reader (referencing _FEATURE_COLS) preserves that field and yields
consistent metadata across CSV and XLSX loaders.

---

Duplicate comments:
In `@src/pyprideap/filtering.py`:
- Around line 43-45: The boolean mask built from dataset.samples (keep_mask =
~is_control) is being applied directly to other DataFrames which causes
label-based alignment bugs; convert the mask to a positional boolean array
(e.g., keep_pos = keep_mask.to_numpy()) and use .iloc[...] to select rows for
dataset.samples, dataset.expression and dataset.lod_matrix (and any other uses)
— replace expressions like samples = dataset.samples[keep_mask].reset_index(...)
and expression = dataset.expression[keep_mask].reset_index(...) with samples =
dataset.samples.iloc[keep_pos].reset_index(drop=True), expression =
dataset.expression.iloc[keep_pos].reset_index(drop=True), and similarly use
.iloc[keep_pos] when slicing lod_matrix.

---

Nitpick comments:
In `@src/pyprideap/qc/render.py`:
- Around line 45-60: The histogram traces in render_distribution currently use
histnorm="" which shows raw counts and biases the comparison toward samples with
more measurements; update the go.Histogram call in render_distribution to use an
appropriate normalization such as histnorm="probability density" (or
"probability") so the overlaid per-sample curves compare distributions rather
than counts, and ensure the docstring still describes density/KDE-like output
(or adjust the docstring if you intentionally keep raw counts).
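The effect of normalization can be seen without Plotly; this NumPy sketch shows raw counts growing with sample size while density-normalized histograms stay directly comparable (the sample sizes and bin count are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
small = rng.normal(size=100)
large = rng.normal(size=10_000)

# Raw counts scale with the number of measurements per sample
counts_small, edges = np.histogram(small, bins=20)
counts_large, _ = np.histogram(large, bins=edges)

# Density normalization makes each histogram integrate to 1,
# mirroring Plotly's histnorm="probability density"
dens_small, _ = np.histogram(small, bins=edges, density=True)
dens_large, _ = np.histogram(large, bins=edges, density=True)
widths = np.diff(edges)
```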

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 790cb379-de63-4432-b5f5-73c0b0ac749e

📥 Commits

Reviewing files that changed from the base of the PR and between e39bda9 and 5cb8d56.

⛔ Files ignored due to path filters (3)
  • configs/Explore-3072-Fixed-LOD-2024-12-19.csv is excluded by !**/*.csv
  • configs/Explore-HT-Fixed-LOD.csv is excluded by !**/*.csv
  • configs/Reveal-Fixed-LOD.csv is excluded by !**/*.csv
📒 Files selected for processing (15)
  • .gitignore
  • pyproject.toml
  • src/pyprideap/__init__.py
  • src/pyprideap/filtering.py
  • src/pyprideap/lod.py
  • src/pyprideap/qc/__init__.py
  • src/pyprideap/qc/compute.py
  • src/pyprideap/qc/render.py
  • src/pyprideap/qc/report.py
  • src/pyprideap/readers/olink_csv.py
  • src/pyprideap/readers/olink_xlsx.py
  • tests/test_lod.py
  • tests/test_qc_compute.py
  • tests/test_qc_render.py
  • tests/test_qc_report.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • pyproject.toml
  • tests/test_lod.py
  • .gitignore

Comment on lines +3 to +13
from importlib.metadata import PackageNotFoundError, version

from pyprideap.core import AffinityDataset, Level, Platform, ValidationResult
from pyprideap.filtering import filter_controls, filter_qc
from pyprideap.lod import LodMethod, LodStats, compute_lod_from_controls, compute_nclod, compute_lod_stats, get_bundled_fixed_lod_path, get_reported_lod, get_valid_proteins, load_fixed_lod
from pyprideap.pride import PrideClient
from pyprideap.qc.compute import compute_all as compute_qc
from pyprideap.qc.report import qc_report
from pyprideap.readers.registry import read
from pyprideap.stats import DatasetStats, compute_stats
from pyprideap.validators import validate

⚠️ Potential issue | 🟡 Minor

Reflow the import block so Ruff passes.

CI is already failing on I001/E501 for this block. Breaking the long pyprideap.lod import into a parenthesized block and letting Ruff/isort reorder the group should clear it.

💡 Suggested fix
-from pyprideap.lod import LodMethod, LodStats, compute_lod_from_controls, compute_nclod, compute_lod_stats, get_bundled_fixed_lod_path, get_reported_lod, get_valid_proteins, load_fixed_lod
+from pyprideap.lod import (
+    LodMethod,
+    LodStats,
+    compute_lod_from_controls,
+    compute_lod_stats,
+    compute_nclod,
+    get_bundled_fixed_lod_path,
+    get_reported_lod,
+    get_valid_proteins,
+    load_fixed_lod,
+)
🧰 Tools
🪛 GitHub Actions: CI

[error] 3-13: I001 Import block is un-sorted or un-formatted.


[error] 7-7: E501 Line too long (188 > 120)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/__init__.py` around lines 3 - 13, The import block is too long
and triggers Ruff I001/E501; reflow the pyprideap.lod import into a
parenthesized, multi-line import and let Ruff/isort reorder groups so line
lengths and ordering meet style rules—e.g. replace the single long line that
imports LodMethod, LodStats, compute_lod_from_controls, compute_nclod,
compute_lod_stats, get_bundled_fixed_lod_path, get_reported_lod,
get_valid_proteins, load_fixed_lod with a parenthesized import list (one name
per line) and then run ruff/isort to fix ordering and spacing.

return pd.Series(lod_series.values, index=dataset.expression.columns[: len(lod_series)])


_CONFIGS_DIR = Path(__file__).resolve().parent.parent.parent / "configs"

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if configs directory is included in package data or exists at expected location
fd -t d "configs" --max-depth 4
# Check pyproject.toml for package data configuration
cat pyproject.toml | grep -A 10 "package-data\|include\|data-files" || echo "No package data config found"

Repository: PRIDE-Archive/pyprideap

Length of output: 104


🏁 Script executed:

# Check for setup.py and MANIFEST.in
fd -t f "setup.py|setup.cfg|MANIFEST.in" --max-depth 2

# Get the full pyproject.toml to understand distribution setup
cat pyproject.toml

Repository: PRIDE-Archive/pyprideap

Length of output: 1669


🏁 Script executed:

# Find usage of _CONFIGS_DIR
rg "_CONFIGS_DIR" -n

# Check what files are in configs directory
fd -t f . configs/

# See how get_bundled_fixed_lod_path handles missing files
sed -n '250,270p' src/pyprideap/lod.py

Repository: PRIDE-Archive/pyprideap

Length of output: 1023


🏁 Script executed:

# Check for tests related to bundled LOD
rg "get_bundled_fixed_lod_path|_BUNDLED_FIXED_LOD" tests/ --type py

# Check README for any mention of configs or bundled data
grep -i "config\|bundle\|fixed.lod" README.md

Repository: PRIDE-Archive/pyprideap

Length of output: 49


Path resolution will fail after pip installation; bundled LOD data will not be accessible.

The _CONFIGS_DIR path uses Path(__file__).resolve().parent.parent.parent / "configs", which works in development but breaks in distributed installations. The configs/ directory is not included in the package distribution (pyproject.toml has no package-data configuration and hatchling only includes Python packages by default). After pip install, this path resolution will point to the wrong location, causing get_bundled_fixed_lod_path() to silently return None for all platforms.

Use importlib.resources to access the bundled CSV files instead, ensuring they are included in the package distribution via pyproject.toml configuration.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/lod.py` at line 247, The hardcoded _CONFIGS_DIR path will break
after pip installation and causes get_bundled_fixed_lod_path() to return None;
replace filesystem-based resolution with importlib.resources (or
importlib.resources.files/open_text) to load the bundled CSVs from the package
resources and update pyproject.toml to include those CSVs as package data (or
configure hatchling's include section) so they are distributed; modify
get_bundled_fixed_lod_path() (and any callers) to use importlib.resources APIs
to locate/read the CSVs rather than Path(__file__).resolve().parent... and
remove reliance on _CONFIGS_DIR.
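A minimal sketch of the importlib.resources approach suggested above; the helper name and the pyprideap layout in the usage note are hypothetical, and the real fix also requires pyproject.toml to ship the CSVs as package data:

```python
from importlib.resources import files


def resolve_bundled(package: str, filename: str):
    """Locate a bundled data file inside an installed package, or None.

    Unlike Path(__file__)-relative lookups, this resolves correctly for
    wheels, editable installs, and zipped distributions alike.
    """
    resource = files(package).joinpath(filename)
    return resource if resource.is_file() else None
```

For pyprideap this would look like `files("pyprideap").joinpath("configs", "Reveal-Fixed-LOD.csv")`, assuming the configs directory is moved under the package and declared in hatchling's include configuration.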

Comment on lines +486 to +488
panel_cols = [
c for c in dataset.expression.columns if panel_map.get(c) == panel_name and has_lod_per_col.get(c, False)
]

⚠️ Potential issue | 🟡 Minor

Fix line length to resolve pipeline failure.

Line 487 exceeds 120 characters (121 > 120), causing the linter to fail.

🔧 Proposed fix
             panel_cols = [
-                c for c in dataset.expression.columns if panel_map.get(c) == panel_name and has_lod_per_col.get(c, False)
+                c for c in dataset.expression.columns
+                if panel_map.get(c) == panel_name and has_lod_per_col.get(c, False)
             ]
🧰 Tools
🪛 GitHub Actions: CI

[error] 487-487: E501 Line too long (121 > 120)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/lod.py` around lines 486 - 488, The comprehension creating
panel_cols is too long and exceeds the 120-char limit; split it across multiple
lines to satisfy the linter by breaking after the opening bracket and aligning
the conditional onto its own line. Update the expression that builds panel_cols
(the list comprehension referencing dataset.expression.columns, panel_map,
panel_name, and has_lod_per_col) so each logical part is on a separate line
(e.g., start the list on one line, put the for clause on the next line, and the
if condition on another) to keep line length <=120.

Comment on lines +98 to +99
def _sample_id_col(dataset: AffinityDataset) -> str:
return "SampleID" if "SampleID" in dataset.samples.columns else "SampleId"

⚠️ Potential issue | 🟡 Minor

_sample_id_col returns "SampleId" even when neither column exists.

When dataset.samples contains neither "SampleID" nor "SampleId" (common for SomaScan ADAT files per the reader implementation), this function returns "SampleId" unconditionally. Callers like compute_correlation (line 288) use this to index into dataset.samples, which would raise a KeyError.

Consider returning None or raising a clearer error when no valid column is found.

🛡️ Proposed fix
 def _sample_id_col(dataset: AffinityDataset) -> str | None:
-    return "SampleID" if "SampleID" in dataset.samples.columns else "SampleId"
+    if "SampleID" in dataset.samples.columns:
+        return "SampleID"
+    if "SampleId" in dataset.samples.columns:
+        return "SampleId"
+    return None

Then update callers to check for None before indexing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/qc/compute.py` around lines 98 - 99, The helper _sample_id_col
currently returns "SampleId" even when neither "SampleID" nor "SampleId" exists;
change _sample_id_col to return None (or raise a ValueError with a clear
message) when no valid column is found, and update callers such as
compute_correlation to check the return value before indexing dataset.samples
(raise a clear error or handle the missing ID column case) so a KeyError is
avoided and errors are informative.

Comment on lines +148 to +149
elif "MissingFreq" in dataset.metadata:
freq = list(dataset.metadata["MissingFreq"])

⚠️ Potential issue | 🟡 Minor

Fix type error causing pipeline failure.

The pipeline fails with: No overload variant of "list" matches argument type "object". Since dataset.metadata is typed as dict[str, object], the value returned by dataset.metadata["MissingFreq"] has type object, which list() cannot accept without a cast.

🔧 Proposed fix
     elif "MissingFreq" in dataset.metadata:
-        freq = list(dataset.metadata["MissingFreq"])
+        mf = dataset.metadata["MissingFreq"]
+        freq = list(mf) if hasattr(mf, "__iter__") else []  # type: ignore[arg-type]

Or use an explicit cast:

+from typing import cast, Iterable
 ...
     elif "MissingFreq" in dataset.metadata:
-        freq = list(dataset.metadata["MissingFreq"])
+        freq = list(cast(Iterable[float], dataset.metadata["MissingFreq"]))
🧰 Tools
🪛 GitHub Actions: CI

[error] 149-149: No overload variant of "list" matches argument type "object" [call-overload]

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/qc/compute.py` around lines 148 - 149, The error arises because
dataset.metadata is typed dict[str, object], so dataset.metadata["MissingFreq"]
has static type object and list(...) rejects it; fix by casting the value to the
expected iterable type before converting. In compute.py, import cast and
Iterable from typing and change the assignment in the "MissingFreq" branch to
freq = list(cast(Iterable[float], dataset.metadata["MissingFreq"])) (or cast to
whatever concrete type you expect) so freq is properly typed; reference the
variables/members dataset.metadata, "MissingFreq", and freq when making this
change.

Comment on lines +26 to +30
"missing_frequency": (
"Displays how many assays fall into each missing-frequency bin. The x-axis is the fraction of samples "
"where a given assay has no measured value; the y-axis (log scale) is the count of assays. The red "
"dashed line marks the 30%% threshold — assays above this line are missing in more than 30%% of "
"samples and may need to be excluded from downstream analysis."

⚠️ Potential issue | 🟡 Minor

Fix the doubled percent signs in the help copy.

These strings are emitted verbatim into the HTML, so the report currently shows 30%% instead of 30%.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/qc/report.py` around lines 26 - 30, The help text for the
"missing_frequency" entry contains escaped percent signs ("30%%") that render as
"30%%" in the HTML; update the string literal under the "missing_frequency" key
in src/pyprideap/qc/report.py to use single percent signs ("30%") instead of
"30%%" so the HTML shows "30%".

Comment on lines +220 to +230
_RENDERERS = {
"distribution": (DistributionData, R.render_distribution),
"missing_frequency": (MissingFrequencyData, R.render_missing_frequency),
"qc_summary": (QcLodSummaryData, R.render_qc_summary),
"lod_analysis": (LodAnalysisData, R.render_lod_analysis),
"pca": (PcaData, R.render_pca),
"correlation": (CorrelationData, R.render_correlation),
"missing_values": (MissingValuesData, R.render_missing_values),
"cv_distribution": (CvDistributionData, R.render_cv_distribution),
"detection_rate": (DetectionRateData, R.render_detection_rate),
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n src/pyprideap/qc/report.py | sed -n '220,270p'

Repository: PRIDE-Archive/pyprideap

Length of output: 2467


🏁 Script executed:

cat -n src/pyprideap/qc/report.py | sed -n '1,250p' | tail -n 50

Repository: PRIDE-Archive/pyprideap

Length of output: 2300


🏁 Script executed:

cat -n src/pyprideap/qc/compute.py | sed -n '340,365p'

Repository: PRIDE-Archive/pyprideap

Length of output: 1246


🏁 Script executed:

rg "30%%" src/pyprideap/qc/report.py

Repository: PRIDE-Archive/pyprideap

Length of output: 172


🏁 Script executed:

rg "write_text" src/pyprideap/qc/report.py

Repository: PRIDE-Archive/pyprideap

Length of output: 94


🏁 Script executed:

cat -n src/pyprideap/qc/report.py | grep -A 5 -B 5 "write_text"

Repository: PRIDE-Archive/pyprideap

Length of output: 315


🏁 Script executed:

cat -n src/pyprideap/qc/report.py | grep -B 10 "30%%" | head -n 20

Repository: PRIDE-Archive/pyprideap

Length of output: 931


🏁 Script executed:

cat -n src/pyprideap/qc/report.py | sed -n '19,35p'

Repository: PRIDE-Archive/pyprideap

Length of output: 1394


🏁 Script executed:

cat -n src/pyprideap/qc/report.py | sed -n '300,336p'

Repository: PRIDE-Archive/pyprideap

Length of output: 1902


🏁 Script executed:

python3 << 'EOF'
# Test what %% in a regular string literal produces
test_str = "This is 30%% of samples"
print("String value:", repr(test_str))
print("Rendered:", test_str)

# Compare with what should happen in format context
format_str = "This is 30%% of {}"
print("Format string rendered:", format_str.format("samples"))
EOF

Repository: PRIDE-Archive/pyprideap

Length of output: 188


🏁 Script executed:

rg "encoding" src/pyprideap/qc/report.py

Repository: PRIDE-Archive/pyprideap

Length of output: 49


🏁 Script executed:

python3 << 'EOF'
# Simulate what mypy sees
from typing import Dict, Any

# This is what compute_all returns
plot_data: Dict[str, object] = {"key": "value"}

# At line 249
data = plot_data.get("key")  # type: object | None

# After the None check at line 250-251
if data is not None:
    # data is still typed as object, not narrowed to the specific dataclass type
    # because dict.get() with dict[str, object] returns object | None
    # Accessing data.title would fail mypy type check
    print(type(data))  # Would need isinstance check to narrow
EOF

Repository: PRIDE-Archive/pyprideap

Length of output: 80


Narrow plot_data before invoking the renderer registry, fix the %% escaping in help text, and specify UTF-8 encoding for HTML output.

compute_all() returns dict[str, object] (compute.py:347), so at line 249 data = plot_data.get(key) is typed as object | None. The if data is None: continue check at lines 250–251 narrows only to object, not to the specific dataclass type. Line 252 calls renderer(data) and line 256 accesses data.title—both fail mypy type checking because object has no title attribute and renderer expects a concrete dataclass type. Add an isinstance(data, _dtype) guard or cast after the None check to properly narrow the type.

The help text at lines 29–30 contains literal "30%% threshold" and "more than 30%% of". In a regular string literal (not f-string or format context), %% is not escaped and will appear as 30%% in the rendered HTML instead of 30%.

Line 335 calls output.write_text(html) without specifying an encoding. The HTML contains non-ASCII characters (em dash \u2014 at line 319), and the default platform encoding may vary. Use output.write_text(html, encoding="utf-8") for consistency and portability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/qc/report.py` around lines 220 - 230, Narrow the type of
plot_data entries before calling the renderer by checking the runtime type:
after retrieving data = plot_data.get(key) and ensuring data is not None, use
isinstance(data, _dtype) (where _dtype and renderer are the two elements from
the _RENDERERS tuple you unpack) or cast to the dataclass so that renderer(data)
and access like data.title are type-safe; also fix the help string literals by
replacing the double-percent occurrences ("30%% threshold", "more than 30%% of")
with single percent signs ("30% threshold", "more than 30% of"); finally, write
the HTML with explicit UTF-8 encoding by changing output.write_text(html) to
output.write_text(html, encoding="utf-8").

"</body>\n</html>"
)

output.write_text(html)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

wc -l src/pyprideap/qc/report.py

Repository: PRIDE-Archive/pyprideap

Length of output: 97


🏁 Script executed:

cat -n src/pyprideap/qc/report.py | sed -n '330,340p'

Repository: PRIDE-Archive/pyprideap

Length of output: 315


🏁 Script executed:

# Search for where the HTML is generated and what special characters it contains
rg "—|≤|−" src/pyprideap/qc/report.py -B 3 -A 3

Repository: PRIDE-Archive/pyprideap

Length of output: 1438


🏁 Script executed:

# Check for test files related to this module
find . -name "*test*report*" -o -name "*report*test*" | head -20

Repository: PRIDE-Archive/pyprideap

Length of output: 92


🏁 Script executed:

# Check the test file to see how it handles encoding
cat -n tests/test_qc_report.py

Repository: PRIDE-Archive/pyprideap

Length of output: 3279


🏁 Script executed:

# Check Python's write_text() behavior - look at what encoding is used by default
python3 <<'EOF'
from pathlib import Path
import inspect
# Check write_text signature
sig = inspect.signature(Path.write_text)
print(f"write_text signature: {sig}")
# Check default encoding
print(f"\nDefault encoding for write_text: {inspect.signature(Path.write_text).parameters}")
EOF

Repository: PRIDE-Archive/pyprideap

Length of output: 374


🏁 Script executed:

# Search for any other places where output might be read back
rg "read_text|open.*encoding" src/pyprideap/qc/report.py

Repository: PRIDE-Archive/pyprideap

Length of output: 49


🏁 Script executed:

# Check the context of where output is defined - where the Path is created
cat -n src/pyprideap/qc/report.py | sed -n '300,340p'

Repository: PRIDE-Archive/pyprideap

Length of output: 1902


🏁 Script executed:

# Check the qc_report function signature to understand output parameter type
cat -n src/pyprideap/qc/report.py | sed -n '260,280p'

Repository: PRIDE-Archive/pyprideap

Length of output: 1052


🏁 Script executed:

# Check if there are any other write_text calls in the file that might already use encoding
rg "write_text|write_bytes" src/pyprideap/qc/report.py

Repository: PRIDE-Archive/pyprideap

Length of output: 94


🏁 Script executed:

# Get the full function signature
cat -n src/pyprideap/qc/report.py | sed -n '200,220p'

Repository: PRIDE-Archive/pyprideap

Length of output: 829


🏁 Script executed:

# Check if the output is a Path object
rg "def qc_report" src/pyprideap/qc/report.py -A 5

Repository: PRIDE-Archive/pyprideap

Length of output: 385


Specify UTF-8 encoding when writing HTML.

The generated HTML declares <meta charset="utf-8"> but contains non-ASCII characters (—, ≤, −) and is written with write_text() using the platform-default encoding. On non-UTF-8 locales, this causes a mismatch: the bytes won't be UTF-8 even though the charset is declared as UTF-8. Use output.write_text(html, encoding="utf-8") to ensure the file is always written in UTF-8. Additionally, the tests that read this file back should also specify encoding="utf-8" in their read_text() calls to maintain consistency.

💡 Suggested fix
-    output.write_text(html)
+    output.write_text(html, encoding="utf-8")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/qc/report.py` at line 335, The HTML file is written with
output.write_text(html) using platform-default encoding while the content
declares UTF-8 and contains non-ASCII characters; change the write to explicitly
use UTF-8 by calling output.write_text(html, encoding="utf-8") and update any
tests that read this file with Path.read_text() to call
read_text(encoding="utf-8") so bytes and declared charset always match (look for
the output.write_text(html) call in report generation and corresponding
read_text() uses in tests).

from pyprideap.core import AffinityDataset, Platform

_SAMPLE_COLS = {"SampleID", "PlateID", "WellID", "SampleType", "SampleQC", "PlateQC"}
_FEATURE_COLS = {"OlinkID", "UniProt", "Assay", "Panel", "LOD"}

⚠️ Potential issue | 🟠 Major

Preserve MissingFreq in XLSX feature metadata.

src/pyprideap/readers/olink_csv.py:9-11 includes MissingFreq in _FEATURE_COLS, but this reader drops it entirely. The same Olink dataset will therefore expose different feature metadata depending on whether it is loaded from CSV or XLSX, which makes downstream QC/statistics format-dependent.

Proposed fix
-_FEATURE_COLS = {"OlinkID", "UniProt", "Assay", "Panel", "LOD"}
+_FEATURE_COLS = {"OlinkID", "UniProt", "Assay", "Panel", "LOD", "MissingFreq"}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/readers/olink_xlsx.py` at line 10, The XLSX reader currently
drops the MissingFreq column because _FEATURE_COLS in
src/pyprideap/readers/olink_xlsx.py lacks "MissingFreq"; update the
_FEATURE_COLS set to include "MissingFreq" (matching the entry in
src/pyprideap/readers/olink_csv.py) so feature metadata extraction in the Olink
XLSX reader (referencing _FEATURE_COLS) preserves that field and yields
consistent metadata across CSV and XLSX loaders.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

♻️ Duplicate comments (4)
src/pyprideap/readers/olink_xlsx.py (1)

10-10: ⚠️ Potential issue | 🟠 Major

Add MissingFreq to the XLSX feature schema.

src/pyprideap/readers/olink_csv.py preserves MissingFreq, but this reader drops it entirely. The same Olink export will therefore expose different features metadata depending on whether it is loaded from CSV or XLSX.

Suggested fix
-_FEATURE_COLS = {"OlinkID", "UniProt", "Assay", "Panel", "LOD"}
+_FEATURE_COLS = {"OlinkID", "UniProt", "Assay", "Panel", "LOD", "MissingFreq"}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/readers/olink_xlsx.py` at line 10, The XLSX reader's feature
schema currently omits the MissingFreq column causing inconsistent features
metadata vs the CSV reader; update the _FEATURE_COLS set in
src/pyprideap/readers/olink_xlsx.py to include "MissingFreq" (same symbol
_FEATURE_COLS) so the XLSX parser preserves that column, and ensure any
downstream usage in the reader logic that iterates over _FEATURE_COLS (e.g.,
feature extraction or validation code in the olink_xlsx reader) continues to
handle MissingFreq the same way as the CSV reader.
src/pyprideap/lod.py (3)

96-103: ⚠️ Potential issue | 🟠 Major

Require negative controls for NCLOD.

This still falls back to positive/QC/reference controls when no negatives exist, which changes the NCLOD threshold instead of failing fast.

Suggested fix
     is_negative = st.isin(_NEGATIVE_CONTROL_TYPES)
     if is_negative.any():
         return is_negative
-
-    # Fallback: use any control-like sample
-    is_control = st.isin(_CONTROL_SAMPLE_TYPES)
-    if is_control.any():
-        return is_control
 
     raise ValueError("No negative control samples found in SampleType column")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/lod.py` around lines 96 - 103, The current logic uses
_CONTROL_SAMPLE_TYPES fallback when no negatives are present which alters NCLOD;
change the behavior so NCLOD requires negative controls only: in the function
that computes is_negative (look for variables is_negative and
_NEGATIVE_CONTROL_TYPES in src/pyprideap/lod.py) remove or disable the fallback
branch that checks _CONTROL_SAMPLE_TYPES/is_control and instead raise a clear
error (or return a failing status) when is_negative.any() is false so the caller
fails fast if no negative controls are provided.

488-490: ⚠️ Potential issue | 🟡 Minor

Wrap panel_cols to unblock the linter.

CI is still failing on Line 489 with E501.

Suggested fix
             panel_cols = [
-                c for c in dataset.expression.columns if panel_map.get(c) == panel_name and has_lod_per_col.get(c, False)
+                c
+                for c in dataset.expression.columns
+                if panel_map.get(c) == panel_name
+                and has_lod_per_col.get(c, False)
             ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/lod.py` around lines 488 - 490, Line is too long (E501): break
the comprehension for panel_cols into multiple lines to satisfy the linter. Edit
the expression that builds panel_cols (the list comprehension using
dataset.expression.columns, panel_map, panel_name, and has_lod_per_col) so that
the opening bracket, the iterator expression, and the conditional are split
across lines (or assign intermediate variables) to keep line length under limit
and preserve the same logic.

249-269: ⚠️ Potential issue | 🟠 Major

Bundled FixedLOD resolution will break outside a source checkout.

Path(__file__).resolve().parent.parent.parent / "configs" depends on the repo layout. After installation, get_bundled_fixed_lod_path() can no longer reliably find the bundled CSVs.

#!/bin/bash
fd -t d '^configs$'
printf '\n--- pyproject packaging hints ---\n'
rg -n "package-data|include|artifacts|only-include" pyproject.toml
printf '\n--- current bundled LOD path logic ---\n'
sed -n '249,317p' src/pyprideap/lod.py
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/lod.py` around lines 249 - 269, The current _CONFIGS_DIR
constant and get_bundled_fixed_lod_path() rely on a repo-relative path that
breaks after installation; instead package the CSVs as package data and resolve
them at runtime via package resources. Update packaging (pyproject.toml) to
include the configs CSVs, remove/stop relying on _CONFIGS_DIR, and change
get_bundled_fixed_lod_path(platform) to use importlib.resources (or
importlib_resources for py<3.9) to locate the file inside the installed
pyprideap package (e.g.
importlib.resources.files("pyprideap").joinpath("configs", filename)) and return
None if the resource is missing; keep the _BUNDLED_FIXED_LOD mapping and
filename lookup logic the same, only replace the filesystem resolution with
package-resource resolution.
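The package-resource approach the prompt describes can be sketched as follows. This is a minimal illustration, not the project's actual code: the `_BUNDLED_FIXED_LOD` filenames are placeholders, and the real module may organize its lookup differently.

```python
# Hedged sketch: resolve a bundled CSV via importlib.resources rather than a
# repo-relative path, so it survives installation (including zip installs).
from importlib import resources
from pathlib import Path
from typing import Optional

# Hypothetical platform -> filename mapping (placeholder names)
_BUNDLED_FIXED_LOD = {
    "olink": "olink_fixed_lod.csv",
    "somascan": "somascan_fixed_lod.csv",
}


def get_bundled_fixed_lod_path(platform: str) -> Optional[Path]:
    filename = _BUNDLED_FIXED_LOD.get(platform)
    if filename is None:
        return None
    # files() resolves inside the installed package, wherever pip put it
    resource = resources.files("pyprideap").joinpath("configs", filename)
    # is_file() works for both filesystem- and zip-backed packages
    return Path(str(resource)) if resource.is_file() else None
```

For this to work, the CSVs must also be declared as package data (e.g. under `[tool.hatch.build]` or `[tool.setuptools.package-data]` in `pyproject.toml`, depending on the build backend).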
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/pyprideap/lod.py`:
- Around line 125-127: In _nclod_base(), avoid using np.maximum on a pandas
Series (numeric_expr) because np.maximum returns Any; instead compute medians =
numeric_expr.median() and stds = numeric_expr.std() as before but replace the
np.maximum call with stds.multiply(3).clip(lower=_MIN_STD_FLOOR) (or equivalent
Series.clip usage) and return medians + that clipped Series so the function
retains a pd.Series return type and satisfies mypy; update references to
medians, stds, and _MIN_STD_FLOOR accordingly.
- Around line 198-204: The DataFrame construction and addition degrade type
precision because base.values and adj.values are typed as ndarray |
ExtensionArray; update the lod_matrix creation to use base.to_numpy() when
tiling and convert adj to a pandas.Series (or ensure adj is a Series) before
adding so mypy can infer a concrete DataFrame return type: use
np.tile(base.to_numpy(), (len(dataset.samples), 1)) for the data and ensure the
addition uses a Series (e.g., pd.Series(adj) or adj if already a Series) with
axis=0 when calling lod_matrix.add; keep the same column=index and
index=dataset.expression.index and return lod_matrix.

In `@src/pyprideap/qc/render.py`:
- Around line 110-123: Remove the dead variable cumulative in the loop inside
the render function (where cumulative = 0.0 is set and cumulative += pct is
applied); delete both the initialization and the increment since cumulative is
never read, leaving the loop that builds the go.Bar traces using pct,
data.categories, data.counts, and colors intact.
- Around line 31-42: The _QC_LOD_COLORS mapping assigns alarming red ("#e74c3c")
to "WARN & NPX ≤ LOD" while various "FAIL" keys use gray, which flips visual
severity; update the _QC_LOD_COLORS dict so FAIL-related keys (e.g., "FAIL",
"FAIL & NPX > LOD", "FAIL & NPX ≤ LOD") use a more alarming color (e.g., red)
and WARN-related keys (e.g., "WARN", "WARN & NPX > LOD", "WARN & NPX ≤ LOD") use
a less severe color (e.g., orange), or document the intended semantics if the
current mapping is deliberate; locate and change the color hex values in the
_QC_LOD_COLORS constant in render.py accordingly.

In `@src/pyprideap/readers/olink_csv.py`:
- Around line 47-56: The LOD pivot can drop assay columns with all-NA values
causing metadata["lod_matrix"] to not match expression.columns; after creating
lod_matrix with df.pivot_table (and before resetting index), reindex its columns
to the same assay order used for expression/features (the same reference used at
line where features are aligned) and fill missing columns with NaN so
lod_matrix.columns == expression.columns; apply the identical fix in
olink_xlsx.py where lod_matrix is built, ensuring you reference the same
sample_order and assay ordering variable (expression.columns or features) when
reindexing.
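The reindex the prompt asks for can be shown on toy data. The variable names below are illustrative stand-ins for the reader's `expression` and `lod_matrix`, not the actual objects in `olink_csv.py`.

```python
# Minimal sketch: after a pivot drops an all-NA assay column, reindex the
# LOD matrix to the expression matrix's columns so the two stay aligned.
import numpy as np
import pandas as pd

expression = pd.DataFrame(
    [[1.0, 2.0, 3.0]], index=["s1"], columns=["a1", "a2", "a3"]
)
# Pretend the pivot_table dropped 'a3' because all its LOD values were NA
lod_matrix = pd.DataFrame([[0.1, 0.2]], index=["s1"], columns=["a1", "a2"])

# Reindex to the expression column order; missing assays become NaN
lod_matrix = lod_matrix.reindex(columns=expression.columns)

assert list(lod_matrix.columns) == list(expression.columns)
assert np.isnan(lod_matrix.loc["s1", "a3"])
```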

---

Duplicate comments:
In `@src/pyprideap/lod.py`:
- Around line 96-103: The current logic uses _CONTROL_SAMPLE_TYPES fallback when
no negatives are present which alters NCLOD; change the behavior so NCLOD
requires negative controls only: in the function that computes is_negative (look
for variables is_negative and _NEGATIVE_CONTROL_TYPES in src/pyprideap/lod.py)
remove or disable the fallback branch that checks
_CONTROL_SAMPLE_TYPES/is_control and instead raise a clear error (or return a
failing status) when is_negative.any() is false so the caller fails fast if no
negative controls are provided.
- Around line 488-490: Line is too long (E501): break the comprehension for
panel_cols into multiple lines to satisfy the linter. Edit the expression that
builds panel_cols (the list comprehension using dataset.expression.columns,
panel_map, panel_name, and has_lod_per_col) so that the opening bracket, the
iterator expression, and the conditional are split across lines (or assign
intermediate variables) to keep line length under limit and preserve the same
logic.
- Around line 249-269: The current _CONFIGS_DIR constant and
get_bundled_fixed_lod_path() rely on a repo-relative path that breaks after
installation; instead package the CSVs as package data and resolve them at
runtime via package resources. Update packaging (pyproject.toml) to include the
configs CSVs, remove/stop relying on _CONFIGS_DIR, and change
get_bundled_fixed_lod_path(platform) to use importlib.resources (or
importlib_resources for py<3.9) to locate the file inside the installed
pyprideap package (e.g.
importlib.resources.files("pyprideap").joinpath("configs", filename)) and return
None if the resource is missing; keep the _BUNDLED_FIXED_LOD mapping and
filename lookup logic the same, only replace the filesystem resolution with
package-resource resolution.

In `@src/pyprideap/readers/olink_xlsx.py`:
- Line 10: The XLSX reader's feature schema currently omits the MissingFreq
column causing inconsistent features metadata vs the CSV reader; update the
_FEATURE_COLS set in src/pyprideap/readers/olink_xlsx.py to include
"MissingFreq" (same symbol _FEATURE_COLS) so the XLSX parser preserves that
column, and ensure any downstream usage in the reader logic that iterates over
_FEATURE_COLS (e.g., feature extraction or validation code in the olink_xlsx
reader) continues to handle MissingFreq the same way as the CSV reader.
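The fail-fast behavior requested in the NCLOD duplicate comment above can be sketched like this. The names `_NEGATIVE_CONTROL_TYPES` and the `SampleType` column mirror the review text and are assumptions about the actual `lod.py` module.

```python
# Sketch: NCLOD requires negative controls; fail fast instead of silently
# falling back to generic control samples (which would alter the NCLOD).
import pandas as pd

_NEGATIVE_CONTROL_TYPES = {"NEGATIVE_CONTROL"}  # assumed label set


def negative_control_mask(samples: pd.DataFrame) -> pd.Series:
    is_negative = samples["SampleType"].isin(_NEGATIVE_CONTROL_TYPES)
    if not is_negative.any():
        # No fallback branch: without negatives, NCLOD is undefined
        raise ValueError("NCLOD requires negative control samples; none found")
    return is_negative
```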

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: eb03d159-5d1d-4635-bd95-7a0843b5735b

📥 Commits

Reviewing files that changed from the base of the PR and between 5cb8d56 and 7198f8d.

📒 Files selected for processing (7)
  • src/pyprideap/lod.py
  • src/pyprideap/qc/compute.py
  • src/pyprideap/qc/render.py
  • src/pyprideap/qc/report.py
  • src/pyprideap/readers/olink_csv.py
  • src/pyprideap/readers/olink_parquet.py
  • src/pyprideap/readers/olink_xlsx.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/pyprideap/readers/olink_parquet.py
  • src/pyprideap/qc/report.py
  • src/pyprideap/qc/compute.py

Comment on lines +125 to +127
medians = numeric_expr.median()
stds = numeric_expr.std()
return medians + np.maximum(_MIN_STD_FLOOR, 3 * stds)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

cat -n src/pyprideap/lod.py | head -150

Repository: PRIDE-Archive/pyprideap

Length of output: 6256


🏁 Script executed:

# Check for mypy configuration and type checking setup
fd -t f "pyproject.toml|setup.cfg|mypy.ini|py.typed" .

Repository: PRIDE-Archive/pyprideap

Length of output: 83


🏁 Script executed:

cat pyproject.toml

Repository: PRIDE-Archive/pyprideap

Length of output: 1669


🏁 Script executed:

# Check for GitHub Actions or CI configuration
fd -t f "\.github.*|\.travis|\.gitlab|circle|jenkins" .

Repository: PRIDE-Archive/pyprideap

Length of output: 49


🏁 Script executed:

ls -la .github/ 2>/dev/null || echo "No .github directory"

Repository: PRIDE-Archive/pyprideap

Length of output: 233


🏁 Script executed:

ls -la .github/workflows/

Repository: PRIDE-Archive/pyprideap

Length of output: 233


🏁 Script executed:

cat .github/workflows/ci.yml

Repository: PRIDE-Archive/pyprideap

Length of output: 1143


🏁 Script executed:

# Let's check if there are any type checking hints or test cases for this
rg "np.maximum|\.clip" src/

Repository: PRIDE-Archive/pyprideap

Length of output: 145


🏁 Script executed:

# Let's create a simple test to understand type behavior
python3 << 'EOF'
import numpy as np
import pandas as pd
from typing import get_type_hints

# Simulate the scenario
numeric_expr = pd.DataFrame({
    'col1': [1.0, 2.0, 3.0],
    'col2': [4.0, 5.0, 6.0]
})

medians = numeric_expr.median()
stds = numeric_expr.std()

print("medians type:", type(medians))
print("stds type:", type(stds))
print("3 * stds type:", type(3 * stds))

# Test np.maximum
result1 = np.maximum(0.2, 3 * stds)
print("np.maximum result type:", type(result1))
print("np.maximum result:", result1)

# Test clip
result2 = (3 * stds).clip(lower=0.2)
print("clip result type:", type(result2))
print("clip result:", result2)

# Test final return
final1 = medians + np.maximum(0.2, 3 * stds)
print("medians + np.maximum type:", type(final1))

final2 = medians + (3 * stds).clip(lower=0.2)
print("medians + clip type:", type(final2))

# Check if they're equal
print("Are results equal?", np.allclose(final1, final2))
EOF

Repository: PRIDE-Archive/pyprideap

Length of output: 184


🏁 Script executed:

# Let me check the pandas type stubs for Series operations to confirm
# Look for any type annotations in the codebase
rg "@.*return|->.*Series" src/pyprideap/lod.py -A 2

Repository: PRIDE-Archive/pyprideap

Length of output: 1022


🏁 Script executed:

# Let's check if there are any tests or type annotations related to this function
rg "_nclod_base|nclod" tests/ --type py

Repository: PRIDE-Archive/pyprideap

Length of output: 178


🏁 Script executed:

cat tests/test_lod.py

Repository: PRIDE-Archive/pyprideap

Length of output: 12453


Use Series.clip() to preserve type safety in _nclod_base()

The current code returns medians + np.maximum(_MIN_STD_FLOOR, 3 * stds), which widens the return type to Any because np.maximum() doesn't have proper type stubs for pandas Series. This triggers a mypy failure with warn_return_any = true. The Series.clip() method preserves the declared pd.Series return type and is mathematically equivalent:

Suggested fix
-    return medians + np.maximum(_MIN_STD_FLOOR, 3 * stds)
+    return medians + (3 * stds).clip(lower=_MIN_STD_FLOOR)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 medians = numeric_expr.median()
 stds = numeric_expr.std()
-return medians + np.maximum(_MIN_STD_FLOOR, 3 * stds)
+return medians + (3 * stds).clip(lower=_MIN_STD_FLOOR)
🧰 Tools
🪛 GitHub Actions: CI

[error] 127-127: mypy: Returning Any from function declared to return "Series[Any]". (no-any-return)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/lod.py` around lines 125 - 127, In _nclod_base(), avoid using
np.maximum on a pandas Series (numeric_expr) because np.maximum returns Any;
instead compute medians = numeric_expr.median() and stds = numeric_expr.std() as
before but replace the np.maximum call with
stds.multiply(3).clip(lower=_MIN_STD_FLOOR) (or equivalent Series.clip usage)
and return medians + that clipped Series so the function retains a pd.Series
return type and satisfies mypy; update references to medians, stds, and
_MIN_STD_FLOOR accordingly.

Comment on lines +198 to +204
lod_matrix = pd.DataFrame(
np.tile(base.values, (len(dataset.samples), 1)),
columns=base.index,
index=dataset.expression.index,
)
lod_matrix = lod_matrix.add(adj.values, axis=0)
return lod_matrix

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's explore the repository structure
git ls-files | head -20

Repository: PRIDE-Archive/pyprideap

Length of output: 581


🏁 Script executed:

# Find the file in question
fd "lod.py" --type f

Repository: PRIDE-Archive/pyprideap

Length of output: 105


🏁 Script executed:

# Read the relevant section of the file
cat -n src/pyprideap/lod.py | sed -n '190,210p'

Repository: PRIDE-Archive/pyprideap

Length of output: 856


🏁 Script executed:

# Get the full function context to understand the types involved
cat -n src/pyprideap/lod.py | sed -n '180,210p'

Repository: PRIDE-Archive/pyprideap

Length of output: 1273


🏁 Script executed:

# Check the function signature to understand parameter types
cat -n src/pyprideap/lod.py | grep -B 20 "lod_matrix = pd.DataFrame" | head -25

Repository: PRIDE-Archive/pyprideap

Length of output: 924


🏁 Script executed:

# Get the full function definition that contains line 198-204
cat -n src/pyprideap/lod.py | sed -n '170,180p'

Repository: PRIDE-Archive/pyprideap

Length of output: 580


🏁 Script executed:

# Search for the function signature that contains this code
rg -B 30 "np.tile\(base.values" src/pyprideap/lod.py | head -40

Repository: PRIDE-Archive/pyprideap

Length of output: 1092


🏁 Script executed:

# Find _nclod_base function
rg -A 10 "def _nclod_base" src/pyprideap/lod.py

Repository: PRIDE-Archive/pyprideap

Length of output: 497


🏁 Script executed:

# Find _intensity_adjustment function
rg -A 10 "def _intensity_adjustment" src/pyprideap/lod.py

Repository: PRIDE-Archive/pyprideap

Length of output: 554


🏁 Script executed:

# Check the imports and pandas type hints
rg "import pandas|from pandas" src/pyprideap/lod.py

Repository: PRIDE-Archive/pyprideap

Length of output: 86


🏁 Script executed:

# Check for mypy configuration
fd -e toml -e ini | xargs grep -l "mypy" 2>/dev/null

Repository: PRIDE-Archive/pyprideap

Length of output: 81


🏁 Script executed:

# Check pyproject.toml for type checking config
cat pyproject.toml | grep -A 10 mypy

Repository: PRIDE-Archive/pyprideap

Length of output: 344


🏁 Script executed:

# Get the complete definition of _nclod_base to see the exact return type
cat -n src/pyprideap/lod.py | sed -n '110,150p'

Repository: PRIDE-Archive/pyprideap

Length of output: 1825


🏁 Script executed:

# Check what numpy.tile expects for its first argument and what it returns
python3 << 'EOF'
import inspect
import numpy as np

# Check np.tile signature
sig = inspect.signature(np.tile)
print("np.tile signature:", sig)

# Check return annotation
print("\nnp.tile return annotation:", sig.return_annotation)
EOF

Repository: PRIDE-Archive/pyprideap

Length of output: 184


🏁 Script executed:

# Look for any mypy test or type checking in the CI/tests
cat .github/workflows/ci.yml

Repository: PRIDE-Archive/pyprideap

Length of output: 1143


🏁 Script executed:

# Check if there are any type stubs or annotations for Series.values
rg "Series.values" src/pyprideap/lod.py

Repository: PRIDE-Archive/pyprideap

Length of output: 49


🏁 Script executed:

# Let's look at the full context around the function to understand the types better
cat -n src/pyprideap/lod.py | sed -n '1,30p'

Repository: PRIDE-Archive/pyprideap

Length of output: 1585


🏁 Script executed:

# Check the pandas imports and version specifications
rg "pandas" pyproject.toml

Repository: PRIDE-Archive/pyprideap

Length of output: 110


🏁 Script executed:

# Let's check if there are any tests for this function
cat -n tests/test_lod.py | head -80

Repository: PRIDE-Archive/pyprideap

Length of output: 2956


🏁 Script executed:

# Look at the actual return type by checking what pd.Series.values returns according to pandas stubs
# and also check numpy.tile's behavior with Series.values
python3 << 'EOF'
# Since we can't import numpy/pandas directly in sandbox, let's check the types from pandas documentation
# pandas.Series.values returns ndarray | ExtensionArray in strict typing
# numpy.tile(arr, reps) expects array_like, which accepts both ndarray and ExtensionArray
# But the return type of np.tile is always ndarray

# Let's check if there's a type issue with pd.DataFrame.add when receiving values from another Series

print("Analysis of type chain:")
print("1. base: pd.Series (return from _nclod_base)")
print("2. base.values: ndarray | ExtensionArray (pandas type stubs definition)")
print("3. np.tile(base.values, ...): always returns ndarray")
print("4. pd.DataFrame(..., data=ndarray): creates DataFrame[Any] or DataFrame[Any] based on dtype")
print("5. lod_matrix.add(adj.values, axis=0): when adj.values is ndarray | ExtensionArray, result could be Any")
print("")
print("The issue is that when intermediate types are union types, mypy may lose track of the final type.")
EOF

Repository: PRIDE-Archive/pyprideap

Length of output: 555


🏁 Script executed:

# Check for type annotations or comments in the file
cat -n src/pyprideap/lod.py | sed -n '1,50p'

Repository: PRIDE-Archive/pyprideap

Length of output: 2150


🏁 Script executed:

# Look for the exact types that pandas-stubs uses for Series
rg "def values" src/pyprideap/ || echo "No explicit values definition in source"

Repository: PRIDE-Archive/pyprideap

Length of output: 106


🏁 Script executed:

# Let's examine the entire _nclod_base function to verify its return type
cat -n src/pyprideap/lod.py | sed -n '108,128p'

Repository: PRIDE-Archive/pyprideap

Length of output: 1013


🏁 Script executed:

# Let's verify if there are actual mypy errors by checking if mypy is run in CI
# and looking at any mypy output or error logs
grep -r "mypy" .github/workflows/

Repository: PRIDE-Archive/pyprideap

Length of output: 149


🏁 Script executed:

# Let's also check the full _intensity_adjustment function to see its complete return type
cat -n src/pyprideap/lod.py | sed -n '130,170p'

Repository: PRIDE-Archive/pyprideap

Length of output: 1946


🏁 Script executed:

# Check if there's a setup.py or any build logs that might show mypy errors
fd "setup.py|setup.cfg|*.log" --type f

Repository: PRIDE-Archive/pyprideap

Length of output: 389


🏁 Script executed:

# Let's verify what the expression index length should be vs samples length
cat -n tests/test_lod.py | sed -n '70,150p'

Repository: PRIDE-Archive/pyprideap

Length of output: 3466


🏁 Script executed:

# Check the AffinityDataset definition to understand expression vs samples
rg "class AffinityDataset" src/pyprideap/ -A 15

Repository: PRIDE-Archive/pyprideap

Length of output: 388


🏁 Script executed:

# Let's check the actual comparison between samples and expression indices
cat -n tests/test_lod.py | sed -n '23,80p'

Repository: PRIDE-Archive/pyprideap

Length of output: 2359


🏁 Script executed:

# Let's verify that dataset.samples and dataset.expression have the same length
python3 << 'EOF'
# From test setup, we can see:
# - samples DataFrame is created with n_samples rows
# - expression DataFrame is created with n_samples rows
# - They should have the same number of rows (samples)

print("Dataset structure verification:")
print("- samples: rows = sample count, columns = sample metadata")
print("- expression: rows = samples (same as samples), columns = assays")
print("- base (from _nclod_base): Series indexed by assay names, one value per assay")
print("- adj (from _intensity_adjustment): Series indexed by sample index, one value per sample")
print("")
print("Code at line 199:")
print("  np.tile(base.values, (len(dataset.samples), 1))")
print("  - Tiles base (assays) across all samples")
print("  - Result shape: (num_samples, num_assays)")
print("")
print("Code at line 200-201:")
print("  columns=base.index  -> one column per assay")
print("  index=dataset.expression.index  -> one row per sample")
print("")
print("The suggested fix changes to:")
print("  np.tile(base_values, (len(dataset.expression.index), 1))")
print("  - Same result since len(dataset.expression.index) == len(dataset.samples)")
print("  - But more semantically explicit")
EOF

Repository: PRIDE-Archive/pyprideap

Length of output: 870


🏁 Script executed:

# Let's check pandas-stubs to understand the actual type issue better
# Search for any type: ignore comments that might exist
rg "type: ignore" src/pyprideap/lod.py

Repository: PRIDE-Archive/pyprideap

Length of output: 49


🏁 Script executed:

# Let's also look at what the actual code does with Series.add
cat -n src/pyprideap/lod.py | sed -n '195,210p'

Repository: PRIDE-Archive/pyprideap

Length of output: 712


🏁 Script executed:

# Check if there are any existing type hints or casts in the file
rg "cast\|TYPE_CHECKING" src/pyprideap/lod.py

Repository: PRIDE-Archive/pyprideap

Length of output: 49


🏁 Script executed:

# Let's look at whether the return type truly degrades to Any by examining the full function signature
cat -n src/pyprideap/lod.py | sed -n '170,175p'

Repository: PRIDE-Archive/pyprideap

Length of output: 295


🏁 Script executed:

# Check if there's a way to see actual mypy output or check for pandas-stubs behavior
# Let's verify the pandas-stubs typing for DataFrame.add with array-like arguments
python3 << 'EOF'
# According to pandas-stubs and pandas typing, when you call:
# DataFrame.add(other: ArrayLike, axis=0)
# where ArrayLike = ndarray | ExtensionArray | Index | Series
# The return type depends on the union narrowing

# The chain:
# 1. adj: pd.Series | None (checked against None, so pd.Series in this context)
# 2. adj.values: ndarray | ExtensionArray (per pandas-stubs)
# 3. lod_matrix.add(adj.values, axis=0): DataFrame

# When the argument is a union type (ndarray | ExtensionArray),
# mypy may not be able to narrow the return type and could conservatively
# assume Any or a less precise type.

# The suggested fix:
# - base_values: ndarray (explicit conversion)
# - adjustment: pd.Series (explicit wrapping)
# Both have clear, non-union types that mypy can track through the chain

print("Type narrowing analysis:")
print("Original code has union types (ndarray | ExtensionArray)")
print("at multiple points in the chain, causing potential type loss.")
print("")
print("Suggested fix converts these to concrete types:")
print("- to_numpy() -> ndarray")
print("- Explicit pd.Series wrapping -> Series")
print("")
print("This should preserve type information through the chain.")
EOF

Repository: PRIDE-Archive/pyprideap

Length of output: 381


🏁 Script executed:

# Let's verify the behavior difference between the two approaches
# by checking if there are any tests that might fail with the suggested changes
grep -n "compute_nclod\|plate_adjusted" tests/test_lod.py | head -20

Repository: PRIDE-Archive/pyprideap

Length of output: 49


🏁 Script executed:

# Check for plate-adjusted LOD tests
rg -A 20 "plate_adjusted.*True" tests/test_lod.py

Repository: PRIDE-Archive/pyprideap

Length of output: 49


🏁 Script executed:

# Let's examine the entire flow more carefully and check if there are any issues with the suggested fix
cat -n src/pyprideap/lod.py | sed -n '107,130p'

Repository: PRIDE-Archive/pyprideap

Length of output: 1063


🏁 Script executed:

# Check what happens when Series are passed to np.tile and DataFrame constructor
python3 << 'EOF'
# Trace through the suggested fix vs original:

print("=== ORIGINAL CODE ===")
print("base: pd.Series -> base.values: ndarray | ExtensionArray")
print("np.tile(base.values, ...) -> ndarray")
print("pd.DataFrame(ndarray, ...) -> DataFrame")
print("adj: pd.Series | None -> adj.values: ndarray | ExtensionArray")
print("DataFrame.add(ndarray | ExtensionArray, axis=0) -> potentially loses type info with union")
print("return lod_matrix -> return type: DataFrame[Any] or imprecise type")
print("")
print("=== SUGGESTED FIX ===")
print("base_values = base.to_numpy(dtype=float, na_value=np.nan) -> ndarray[float]")
print("np.tile(base_values, ...) -> ndarray[float]")
print("pd.DataFrame(ndarray[float], ...) -> DataFrame[float]")
print("adjustment = pd.Series(adj.to_numpy(dtype=float), index=...) -> Series[float]")
print("DataFrame.add(Series[float], axis=0) -> DataFrame[float]")
print("return lod_matrix -> return type: DataFrame[float]")
print("")
print("Type preservation: ✓ Fixes union type issue")
print("Semantic correctness: ✓ Converts to float, handles NA values")
print("Index alignment: ✓ Explicitly sets index to dataset.expression.index")
EOF

Repository: PRIDE-Archive/pyprideap

Length of output: 999


🏁 Script executed:

# Let's verify there are no tests that would break with these changes
grep -B 5 -A 10 "plate_adjusted" tests/test_lod.py | grep -A 10 "def test"

Repository: PRIDE-Archive/pyprideap

Length of output: 49


🏁 Script executed:

# Check if there are tests specifically for plate-adjusted NCLOD
fd "test.*lod" --type f -x cat {} \; | grep -B 5 -A 10 "compute_nclod.*plate"

Repository: PRIDE-Archive/pyprideap

Length of output: 49


🏁 Script executed:

# Let's look for any tests that verify the return type or structure
rg "def test.*nclod" tests/test_lod.py -A 20 | head -80

Repository: PRIDE-Archive/pyprideap

Length of output: 897


🏁 Script executed:

# Let's verify the exact behavior of to_numpy and ensure the fix is sound
python3 << 'EOF'
# Check pandas Series.to_numpy signature and behavior
print("Verification of suggested fix:")
print("")
print("1. base.to_numpy(dtype=float, na_value=np.nan)")
print("   - Returns ndarray with float dtype")
print("   - Handles NA values properly")
print("   - base is pd.Series from _nclod_base, numeric data")
print("")
print("2. pd.Series(adj.to_numpy(dtype=float), index=dataset.expression.index)")
print("   - Creates Series from ndarray with explicit index")
print("   - Ensures index alignment")
print("   - adj is pd.Series from _intensity_adjustment, indexed by samples")
print("")
print("3. np.tile(base_values, (len(dataset.expression.index), 1))")
print("   - base_values is ndarray[float] (no union type)")
print("   - Clear type throughout")
print("")
print("4. lod_matrix.add(adjustment, axis=0)")
print("   - Both operands have clear types")
print("   - DataFrame.add(Series, axis=0) preserves DataFrame type")
print("")
print("Result: Type chain is fully typed, no Any inference needed")
EOF

Repository: PRIDE-Archive/pyprideap

Length of output: 827


🏁 Script executed:

# Let's look for any compute_nclod tests that test plate_adjusted behavior
cat -n tests/test_lod.py | grep -B 5 -A 15 "def test" | grep -B 5 -A 15 "compute_nclod\|compute_lod_from_controls\|plate"

Repository: PRIDE-Archive/pyprideap

Length of output: 2700


🏁 Script executed:

# Look for integration tests that actually call compute_nclod with plate_adjusted
rg "compute_nclod\|compute_lod_from_controls" tests/test_lod.py -B 2 -A 5

Repository: PRIDE-Archive/pyprideap

Length of output: 49


🏁 Script executed:

# Check the complete test file to see if plate_adjusted scenarios are tested
wc -l tests/test_lod.py && cat -n tests/test_lod.py | tail -100

Repository: PRIDE-Archive/pyprideap

Length of output: 4676


🏁 Script executed:

# Let's search for any tests that specifically test the plate_adjusted=True path
rg "compute_nclod" tests/test_lod.py -B 2 -A 8

Repository: PRIDE-Archive/pyprideap

Length of output: 49


🏁 Script executed:

# Check if there's a test that sets up PlateID to test the plate_adjusted path
rg "PlateID\|plate_adjusted" tests/test_lod.py -B 3 -A 3

Repository: PRIDE-Archive/pyprideap

Length of output: 49


🏁 Script executed:

# Let's verify if the suggested fix maintains the correct behavior by checking the indices
python3 << 'EOF'
# Verify the index alignment in the suggested fix

print("Index alignment verification:")
print("")
print("Original code:")
print("  index=dataset.expression.index")
print("")
print("Suggested fix:")
print("  adjustment = pd.Series(adj.to_numpy(dtype=float), index=dataset.expression.index)")
print("")
print("Key observation:")
print("- _intensity_adjustment returns Series indexed by dataset.samples.index")
print("- In the fix, adjustment is re-indexed to dataset.expression.index")
print("- This assumes dataset.samples.index == dataset.expression.index")
print("- From AffinityDataset definition, both have same row count")
print("")
print("Potential issue:")
print("- If indices don't match in order, re-indexing could cause NaN values")
print("- The original code uses adj.values which doesn't check index alignment")
print("- The suggested fix explicitly re-indexes, which is safer but might change behavior")
EOF

Repository: PRIDE-Archive/pyprideap

Length of output: 771


🏁 Script executed:

# Let's check if dataset.samples.index and dataset.expression.index are always aligned
grep -B 5 -A 10 "class AffinityDataset" src/pyprideap/core.py

Repository: PRIDE-Archive/pyprideap

Length of output: 331


🏁 Script executed:

# Let's look at how the datasets are created in tests to see if indices are aligned
cat -n tests/test_lod.py | sed -n '39,80p'

Repository: PRIDE-Archive/pyprideap

Length of output: 1789


🏁 Script executed:

# Let's verify one more important thing - whether the indices actually match or if there's a behavioral change risk
python3 << 'EOF'
# Check the implications of the behavioral change

print("Behavioral analysis of the suggested fix:")
print("")
print("Original code:")
print("  lod_matrix.add(adj.values, axis=0)")
print("  - adj.values is ndarray, treated as positional broadcast")
print("  - Assumes values are in the same order as lod_matrix rows")
print("")
print("Suggested fix:")
print("  adjustment = pd.Series(adj.to_numpy(dtype=float), index=dataset.expression.index)")
print("  lod_matrix.add(adjustment, axis=0)")
print("  - adjustment is Series with explicit index")
print("  - Pandas will align by index when adding Series")
print("")
print("Risk assessment:")
print("- If indices match by value and order: Behavior is identical")
print("- If indices don't match: Suggested fix could cause NaN due to index misalignment")
print("")
print("From code inspection:")
print("- Both samples and expression use default RangeIndex in tests")
print("- _intensity_adjustment returns Series(index=dataset.samples.index)")
print("- Default index is RangeIndex(0, n_samples)")
print("- Both should be identical in practice")
print("")
print("Conclusion: Suggested fix is safer (explicit index) and equivalent in typical cases")
EOF

Repository: PRIDE-Archive/pyprideap

Length of output: 994


🏁 Script executed:

# Check if there are any mypy configuration details that would affect type narrowing
cat pyproject.toml | grep -A 5 "mypy"

Repository: PRIDE-Archive/pyprideap

Length of output: 287


🏁 Script executed:

# Look at the complete function to understand all type implications
cat -n src/pyprideap/lod.py | sed -n '170,205p'

Repository: PRIDE-Archive/pyprideap

Length of output: 1492


Improve type narrowing for DataFrame return type.

base.values and adj.values are typed as ndarray | ExtensionArray by pandas-stubs, which prevents mypy from tracking a precise return type through the DataFrame construction and addition operations. Converting to explicit concrete types (ndarray via to_numpy() and Series via explicit construction) ensures the return type remains DataFrame rather than degrading to Any under warn_return_any mode.

Suggested fix
+    base_values = base.to_numpy(dtype=float, na_value=np.nan)
+    adjustment = pd.Series(adj.to_numpy(dtype=float), index=dataset.expression.index)
     lod_matrix = pd.DataFrame(
-        np.tile(base.values, (len(dataset.samples), 1)),
+        np.tile(base_values, (len(dataset.expression.index), 1)),
         columns=base.index,
         index=dataset.expression.index,
     )
-    lod_matrix = lod_matrix.add(adj.values, axis=0)
-    return lod_matrix
+    return lod_matrix.add(adjustment, axis=0)
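The tile-then-add pattern the fix preserves can be reproduced on toy data. The shapes and labels below are assumed for illustration: one base LOD per assay, tiled across samples, plus a per-sample adjustment added row-wise.

```python
# Toy reproduction of the LOD matrix build with concrete types throughout
import numpy as np
import pandas as pd

base = pd.Series([1.0, 2.0], index=["assayA", "assayB"])    # per-assay base LOD
adj = pd.Series([0.1, 0.2, 0.3], index=["s1", "s2", "s3"])  # per-sample offset

lod_matrix = pd.DataFrame(
    np.tile(base.to_numpy(), (len(adj), 1)),  # to_numpy() yields a plain ndarray
    columns=base.index,
    index=adj.index,
)
# Passing a Series with axis=0 aligns by row index, not position
lod_matrix = lod_matrix.add(adj, axis=0)
```

Passing the adjustment as a `Series` (rather than `adj.values`) also means pandas aligns by index instead of relying on positional order, which is the safer behavior when sample order and expression row order could ever diverge.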
🧰 Tools
🪛 GitHub Actions: CI

[error] 199-199: mypy: Argument 1 to "tile" has incompatible type "ndarray[tuple[int], dtype[Any]] | ExtensionArray"; expected "_SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]]" (arg-type)


[error] 204-204: mypy: Returning Any from function declared to return "DataFrame | Series[Any]". (no-any-return)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/lod.py` around lines 198 - 204, The DataFrame construction and
addition degrade type precision because base.values and adj.values are typed as
ndarray | ExtensionArray; update the lod_matrix creation to use base.to_numpy()
when tiling and convert adj to a pandas.Series (or ensure adj is a Series)
before adding so mypy can infer a concrete DataFrame return type: use
np.tile(base.to_numpy(), (len(dataset.samples), 1)) for the data and ensure the
addition uses a Series (e.g., pd.Series(adj) or adj if already a Series) with
axis=0 when calling lod_matrix.add; keep the same column=index and
index=dataset.expression.index and return lod_matrix.
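The type-narrowing point can be reproduced outside the library: `Series.to_numpy()` is annotated as returning a concrete `ndarray`, so `np.tile` and the subsequent `DataFrame.add` keep a precise `DataFrame` type. The sketch below uses stand-in data (`OID1`/`S1` names are hypothetical, not the reader's actual identifiers) rather than the real `dataset.expression`:

```python
import numpy as np
import pandas as pd

# Stand-in inputs: per-assay LOD baseline and per-sample adjustment.
base = pd.Series([0.1, 0.2, 0.3], index=["OID1", "OID2", "OID3"])
adj = pd.Series([0.0, 0.5], index=["S1", "S2"])
expression_index = pd.Index(["S1", "S2"])

# to_numpy() yields a concrete ndarray, so mypy no longer sees
# "ndarray | ExtensionArray" flowing into np.tile.
base_values = base.to_numpy(dtype=float)
adjustment = pd.Series(adj.to_numpy(dtype=float), index=expression_index)

# Tile the per-assay baseline across samples, then add the per-sample
# adjustment row-wise; the result stays a DataFrame.
lod_matrix = pd.DataFrame(
    np.tile(base_values, (len(expression_index), 1)),
    columns=base.index,
    index=expression_index,
)
result = lod_matrix.add(adjustment, axis=0)
print(result)
```

Because `adjustment` is an explicit `Series` aligned on the sample index, `add(..., axis=0)` is unambiguous and mypy infers `DataFrame` throughout.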

Comment on lines +31 to +42
_QC_LOD_COLORS = {
"PASS & NPX > LOD": "#2ecc71",
"PASS & NPX ≤ LOD": "#3498db",
"WARN & NPX > LOD": "#f39c12",
"WARN & NPX ≤ LOD": "#e74c3c",
"FAIL & NPX > LOD": "#95a5a6",
"FAIL & NPX ≤ LOD": "#7f8c8d",
"PASS": "#2ecc71",
"WARN": "#f39c12",
"FAIL": "#e74c3c",
"NA": "#95a5a6",
}


⚠️ Potential issue | 🟡 Minor

Verify color semantics for QC categories.

The color assignment may cause confusion: "WARN & NPX ≤ LOD" uses red (#e74c3c), the same color as the fallback "FAIL", while "FAIL & NPX > LOD" and "FAIL & NPX ≤ LOD" use gray tones. Visually, this makes WARN conditions look more severe than FAIL conditions.

If intentional (e.g., "below LOD" is the primary concern), consider documenting the rationale. Otherwise, consider adjusting colors so FAIL-related categories use more alarming colors than WARN-related ones.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/qc/render.py` around lines 31 - 42, The _QC_LOD_COLORS mapping
assigns alarming red ("#e74c3c") to "WARN & NPX ≤ LOD" while various "FAIL" keys
use gray, which flips visual severity; update the _QC_LOD_COLORS dict so
FAIL-related keys (e.g., "FAIL", "FAIL & NPX > LOD", "FAIL & NPX ≤ LOD") use a
more alarming color (e.g., red) and WARN-related keys (e.g., "WARN", "WARN & NPX
> LOD", "WARN & NPX ≤ LOD") use a less severe color (e.g., orange), or document
the intended semantics if the current mapping is deliberate; locate and change
the color hex values in the _QC_LOD_COLORS constant in render.py accordingly.

Comment on lines +110 to +123
cumulative = 0.0
for cat, cnt, color in zip(data.categories, data.counts, colors):
pct = cnt / total * 100 if total > 0 else 0
fig.add_trace(
go.Bar(
x=["Samples"],
y=[pct],
name=f"{cat} {cnt} ({pct:.1f}%)",
marker_color=color,
text=f"{pct:.1f}%",
textposition="inside",
)
)
cumulative += pct

⚠️ Potential issue | 🟡 Minor

Remove unused cumulative variable.

The variable cumulative is incremented on line 123 but never read. This appears to be dead code, possibly a leftover from development.

🧹 Proposed fix
     fig = go.Figure()
-    cumulative = 0.0
     for cat, cnt, color in zip(data.categories, data.counts, colors):
         pct = cnt / total * 100 if total > 0 else 0
         fig.add_trace(
             go.Bar(
                 x=["Samples"],
                 y=[pct],
                 name=f"{cat} {cnt} ({pct:.1f}%)",
                 marker_color=color,
                 text=f"{pct:.1f}%",
                 textposition="inside",
             )
         )
-        cumulative += pct
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/qc/render.py` around lines 110 - 123, Remove the dead variable
cumulative in the loop inside the render function (where cumulative = 0.0 is set
and cumulative += pct is applied); delete both the initialization and the
increment since cumulative is never read, leaving the loop that builds the
go.Bar traces using pct, data.categories, data.counts, and colors intact.

Comment on lines +47 to +56
# Build per-sample × per-assay LOD matrix if LOD column exists
if "LOD" in df.columns:
lod_matrix = df.pivot_table(
index="SampleID",
columns="OlinkID",
values="LOD",
aggfunc="first",
)
lod_matrix = lod_matrix.reindex(sample_order).reset_index(drop=True)
metadata["lod_matrix"] = lod_matrix

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the olink_csv.py file in full context around the flagged lines
cat -n src/pyprideap/readers/olink_csv.py | head -100

Repository: PRIDE-Archive/pyprideap

Length of output: 2789


🏁 Script executed:

# Check the file size to understand its structure
wc -l src/pyprideap/readers/olink_csv.py

Repository: PRIDE-Archive/pyprideap

Length of output: 104


🏁 Script executed:

# Look at olink_xlsx.py to verify the same pattern exists
cat -n src/pyprideap/readers/olink_xlsx.py | head -100

Repository: PRIDE-Archive/pyprideap

Length of output: 2484


🏁 Script executed:

# Check lod.py to understand how lod_matrix is consumed
cat -n src/pyprideap/lod.py

Repository: PRIDE-Archive/pyprideap

Length of output: 23379


🏁 Script executed:

python3 << 'EOF'
import pandas as pd
import numpy as np

# Simulate data: one assay (OlinkID_A) has NPX values but all-null LOD
df = pd.DataFrame({
    'SampleID': ['S1', 'S1', 'S2', 'S2'],
    'OlinkID': ['OlinkID_A', 'OlinkID_B', 'OlinkID_A', 'OlinkID_B'],
    'NPX': [1.0, 2.0, 3.0, 4.0],
    'LOD': [0.1, 0.2, 0.3, np.nan],  # OlinkID_B has one null LOD
})

print("Original DataFrame:")
print(df)
print("\n" + "="*60)

# Build expression (NPX pivot)
expression = df.pivot_table(
    index='SampleID',
    columns='OlinkID',
    values='NPX',
    aggfunc='first',
)
print("\nExpression (NPX pivot):")
print(expression)
print(f"Columns: {list(expression.columns)}")

print("\n" + "="*60)

# Build lod_matrix (LOD pivot) with some null values
lod_matrix = df.pivot_table(
    index='SampleID',
    columns='OlinkID',
    values='LOD',
    aggfunc='first',
)
print("\nLOD Matrix (LOD pivot):")
print(lod_matrix)
print(f"Columns: {list(lod_matrix.columns)}")

print("\n" + "="*60)
print("\nColumn alignment issue:")
print(f"expression has {len(expression.columns)} columns: {list(expression.columns)}")
print(f"lod_matrix has {len(lod_matrix.columns)} columns: {list(lod_matrix.columns)}")
print(f"Alignment problem: {set(expression.columns) != set(lod_matrix.columns)}")

# Test with completely null LOD for one OlinkID
print("\n" + "="*60)
print("\nTest 2: OlinkID with ALL-NULL LOD")
df2 = pd.DataFrame({
    'SampleID': ['S1', 'S1', 'S2', 'S2'],
    'OlinkID': ['OlinkID_A', 'OlinkID_B', 'OlinkID_A', 'OlinkID_B'],
    'NPX': [1.0, 2.0, 3.0, 4.0],
    'LOD': [np.nan, np.nan, np.nan, np.nan],  # ALL null LOD
})

expression2 = df2.pivot_table(
    index='SampleID',
    columns='OlinkID',
    values='NPX',
    aggfunc='first',
)
lod_matrix2 = df2.pivot_table(
    index='SampleID',
    columns='OlinkID',
    values='LOD',
    aggfunc='first',
)

print(f"expression2 columns: {list(expression2.columns)}")
print(f"lod_matrix2 columns: {list(lod_matrix2.columns)}")
print(f"Alignment problem: {set(expression2.columns) != set(lod_matrix2.columns)}")
EOF

Repository: PRIDE-Archive/pyprideap

Length of output: 185


🏁 Script executed:

# Also verify the suggested fix would work
python3 << 'EOF'
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'SampleID': ['S1', 'S1', 'S2', 'S2'],
    'OlinkID': ['OlinkID_A', 'OlinkID_B', 'OlinkID_A', 'OlinkID_B'],
    'NPX': [1.0, 2.0, 3.0, 4.0],
    'LOD': [np.nan, np.nan, np.nan, np.nan],
})

sample_order = df['SampleID'].unique()
expression = df.pivot_table(
    index='SampleID', columns='OlinkID', values='NPX', aggfunc='first',
).reindex(sample_order).reset_index(drop=True)

lod_matrix = df.pivot_table(
    index='SampleID', columns='OlinkID', values='LOD', aggfunc='first',
)

print("Before fix:")
print(f"expression columns: {list(expression.columns)}")
print(f"lod_matrix columns: {list(lod_matrix.columns)}")

# Apply suggested fix
lod_matrix_fixed = lod_matrix.reindex(
    index=sample_order,
    columns=expression.columns,
).reset_index(drop=True)

print("\nAfter fix:")
print(f"expression columns: {list(expression.columns)}")
print(f"lod_matrix_fixed columns: {list(lod_matrix_fixed.columns)}")
print(f"Aligned: {list(expression.columns) == list(lod_matrix_fixed.columns)}")
EOF

Repository: PRIDE-Archive/pyprideap

Length of output: 185


🏁 Script executed:

# Check if there are any tests in the codebase that verify lod_matrix alignment
find . -type f -name "*.py" -path "*/test*" | head -20

Repository: PRIDE-Archive/pyprideap

Length of output: 427


🏁 Script executed:

# Search for tests related to olink readers
find . -type f -name "*.py" | xargs grep -l "read_olink\|lod_matrix" | grep -i test

Repository: PRIDE-Archive/pyprideap

Length of output: 135


🏁 Script executed:

# Examine test_readers.py to see what's tested
cat -n tests/test_readers.py | head -200

Repository: PRIDE-Archive/pyprideap

Length of output: 6906


🏁 Script executed:

# Examine test_lod.py
cat -n tests/test_lod.py | head -150

Repository: PRIDE-Archive/pyprideap

Length of output: 5992


🏁 Script executed:

# Check conftest.py for fixtures
cat -n tests/conftest.py

Repository: PRIDE-Archive/pyprideap

Length of output: 792


🏁 Script executed:

# Look for fixture data that might test lod_matrix
find tests -type f \( -name "*.csv" -o -name "*.xlsx" \) | head -20

Repository: PRIDE-Archive/pyprideap

Length of output: 158


🏁 Script executed:

# Search specifically for any tests that verify lod_matrix has same columns as expression
rg "lod_matrix.*columns\|expression\.columns" tests/

Repository: PRIDE-Archive/pyprideap

Length of output: 49


🏁 Script executed:

# Check the olink sample data to see if it has LOD column and structure
head -20 tests/data/olink_sample.npx.csv

Repository: PRIDE-Archive/pyprideap

Length of output: 1012


🏁 Script executed:

# Examine the structure more carefully
head -10 tests/data/olink_sample.npx.csv | cat -A

Repository: PRIDE-Archive/pyprideap

Length of output: 780


🏁 Script executed:

# Count rows and columns
wc -l tests/data/olink_sample.npx.csv

Repository: PRIDE-Archive/pyprideap

Length of output: 101


🏁 Script executed:

# Get the CSV structure to understand the test data
cut -d',' -f1-10 tests/data/olink_sample.npx.csv | head -15

Repository: PRIDE-Archive/pyprideap

Length of output: 875


🏁 Script executed:

# Check if there are any other usages of lod_matrix in the codebase
rg "lod_matrix" --type py -A 3 -B 3

Repository: PRIDE-Archive/pyprideap

Length of output: 9550


Ensure lod_matrix columns match expression.columns at creation.

pivot_table() can drop assay columns where LOD values are entirely null, causing metadata["lod_matrix"] to be narrower than expression. While lod.py includes defensive alignment in _above_lod_matrix(), the mismatch should be prevented at creation time. This pattern is inconsistent with how features are explicitly aligned (line 43). The same issue exists in src/pyprideap/readers/olink_xlsx.py.

Suggested fix
     if "LOD" in df.columns:
         lod_matrix = df.pivot_table(
             index="SampleID",
             columns="OlinkID",
             values="LOD",
             aggfunc="first",
         )
-        lod_matrix = lod_matrix.reindex(sample_order).reset_index(drop=True)
+        lod_matrix = lod_matrix.reindex(
+            index=sample_order,
+            columns=expression.columns,
+        ).reset_index(drop=True)
         metadata["lod_matrix"] = lod_matrix
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/pyprideap/readers/olink_csv.py` around lines 47 - 56, The LOD pivot can
drop assay columns with all-NA values causing metadata["lod_matrix"] to not
match expression.columns; after creating lod_matrix with df.pivot_table (and
before resetting index), reindex its columns to the same assay order used for
expression/features (the same reference used at line where features are aligned)
and fill missing columns with NaN so lod_matrix.columns == expression.columns;
apply the identical fix in olink_xlsx.py where lod_matrix is built, ensuring you
reference the same sample_order and assay ordering variable (expression.columns
or features) when reindexing.
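The column-drop behaviour is easy to reproduce standalone: `pivot_table` defaults to `dropna=True`, so an assay whose LOD values are entirely NaN disappears from the pivot. The sketch below (with made-up `OID_A`/`OID_B` assay IDs, not the library's reader code) shows the mismatch and the two-axis `reindex` fix:

```python
import numpy as np
import pandas as pd

# One assay (OID_B) has NPX values but no LOD at all.
df = pd.DataFrame({
    "SampleID": ["S1", "S1", "S2", "S2"],
    "OlinkID": ["OID_A", "OID_B", "OID_A", "OID_B"],
    "NPX": [1.0, 2.0, 3.0, 4.0],
    "LOD": [0.1, np.nan, 0.3, np.nan],
})
sample_order = df["SampleID"].unique()

expression = df.pivot_table(index="SampleID", columns="OlinkID",
                            values="NPX", aggfunc="first")
lod = df.pivot_table(index="SampleID", columns="OlinkID",
                     values="LOD", aggfunc="first")
# pivot_table's default dropna=True drops the all-NaN OID_B column,
# so lod is narrower than expression.
print(list(lod.columns))

# The fix: reindex on both axes against expression, filling the
# missing assay column with NaN.
lod_fixed = lod.reindex(index=sample_order,
                        columns=expression.columns).reset_index(drop=True)
print(list(lod_fixed.columns))
```

After the reindex, `lod_fixed.columns` matches `expression.columns` exactly, which is the invariant the downstream `_above_lod_matrix()` alignment would otherwise have to repair.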
