[epic] collection optimization tools

# Collection Optimization Tools - Conversation Summary

This document summarizes the discussion and design process that led to the creation of two new fromager commands for optimizing collection organization.

## Context

Date: 2026-03-21
Participants: Doug Hellmann, Claude (Sonnet 4.5)

## Initial Request

**User Request:**
> I want to add a command to fromager to help me decide how to optimize the builds by combining sets of packages into collections. I already have some collections defined, and each produces a graph file describing the dependencies of the items in the collection. I also have a collection of packages being "onboarded" to the build that have not been added to the permanent collections.

**Goal:** Create tools to help organize packages into collections efficiently to minimize duplicated builds and optimize parallel build processes.

## Discovery Phase

### Sample Data Analysis

Located sample graph files in `~/tmp/sample-graphs/`:
- `3.4.2293+notebook-cuda12.9-ubi9-aarch64-graph.json` - 444 packages, 63 top-level
- `3.4.2386+onboarding-cuda12.9-ubi9-x86_64-graph.json` - 851 packages, 136 top-level
- `3.4.2440+rhai-innovation-cuda12.9-ubi9-x86_64-graph.json` - 418 packages, 16 top-level

### Codebase Exploration

**Existing Infrastructure:**
- `DependencyGraph` class (`src/fromager/dependency_graph.py`) - Loads and manipulates graph.json files
- Graph files use JSON format with `"package==version"` keys and special `""` ROOT node
- TOP_LEVEL packages identified via `req_type: 'toplevel'` in ROOT node edges
- Existing graph commands in `src/fromager/commands/graph.py` follow `@graph.command()` pattern
- Rich table output pattern in `stats.py`, JSON output pattern in `list_overrides.py`

## Command 1: `graph suggest-collection`

### Purpose
Help assign onboarding packages to existing collections by analyzing dependency overlap.

### Requirements Gathering

**Key Questions & Decisions:**

1. **Version Matching:** Name only (ignore versions) ✓
   - Rationale: Collections regularly build multiple versions of the same package

2. **Dependency Depth:** Full transitive closure ✓
   - Rationale: Need complete picture of what would be added

3. **Output Format:** Rich table (default) + JSON option ✓
   - User note: "Default to a rich table but provide an option for JSON output"

4. **Command Location:** Subcommand under `graph` group ✓
   - Fits with existing graph commands

### Algorithm

For each top-level package in onboarding graph:
1. Extract full dependency closure (all transitive dependencies by canonical name)
2. For each existing collection:
   - Calculate new packages = closure - collection packages
   - Calculate existing packages = closure ∩ collection packages
   - Calculate coverage % = existing / total * 100
3. Rank collections by:
   - Primary: fewest new packages (ascending)
   - Secondary: highest coverage % (descending)
4. Display best-fit collection with statistics
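
The scoring and ranking steps above can be sketched with plain sets. This is a minimal illustration, not the implementation: `Fit`, `score_collection()`, and `rank_collections()` are hypothetical names, and the package sets are made-up sample data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fit:
    """Overlap statistics for one (dependency closure, collection) pair."""

    collection: str
    new_packages: int
    existing_packages: int
    coverage: float  # percent of the closure already present in the collection


def score_collection(closure: set[str], name: str, members: set[str]) -> Fit:
    """Compare a closure of canonical names against one collection's names."""
    new = closure - members
    existing = closure & members
    coverage = 100.0 * len(existing) / len(closure) if closure else 0.0
    return Fit(name, len(new), len(existing), coverage)


def rank_collections(closure: set[str], collections: dict[str, set[str]]) -> list[Fit]:
    """Sort by fewest new packages first, then by highest coverage."""
    fits = [score_collection(closure, n, m) for n, m in collections.items()]
    return sorted(fits, key=lambda f: (f.new_packages, -f.coverage))


# Hypothetical sample data: one onboarding closure and two collections.
closure = {"fastapi", "starlette", "pydantic", "anyio"}
collections = {
    "notebook": {"starlette", "pydantic", "anyio", "numpy"},
    "rhai-innovation": {"pydantic", "torch"},
}
best = rank_collections(closure, collections)[0]
print(best.collection, best.new_packages, f"{best.coverage:.1f}%")
```

For a single closure, new + existing is constant, so two collections that tie on new packages also tie on coverage. A deterministic final tiebreak (for example, collection name) may be worth considering.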

### Output Example

**Table:**
```
Collection Suggestions for Onboarding Packages

┃ Package             ┃ Version ┃ Total Deps ┃ Best Fit        ┃ New Pkgs ┃ Existing ┃ Coverage ┃
┃━━━━━━━━━━━━━━━━━━━━━┃━━━━━━━━━┃━━━━━━━━━━━━┃━━━━━━━━━━━━━━━━━┃━━━━━━━━━━┃━━━━━━━━━━┃━━━━━━━━━━┃
┃ fastapi             ┃ 0.135.1 ┃ 45         ┃ notebook        ┃ 3        ┃ 42       ┃ 93.3%    ┃
┃ instructorembedding ┃ 1.0.1   ┃ 123        ┃ notebook        ┃ 8        ┃ 115      ┃ 93.5%    ┃
┃ torch               ┃ 2.9.1   ┃ 89         ┃ rhai-innovation ┃ 12       ┃ 77       ┃ 86.5%    ┃
```

**Command:**
```bash
fromager graph suggest-collection ONBOARDING-GRAPH COLLECTION-GRAPHS...
```

### Implementation Notes
- Location: `src/fromager/commands/graph.py`
- Key functions: `get_dependency_closure()`, `get_package_names()`, `extract_collection_name()`
- Reuse: `DependencyGraph.from_file()`, `.get_all_nodes()`, `.get_root_node()`
- Pattern references: `stats.py` (tables), `list_overrides.py` (--format option)

### GitHub Issue
**Created:** Issue #971 - https://github.com/python-wheel-build/fromager/issues/971

---

## Command 2: `graph suggest-base`

### Purpose
Identify packages shared across multiple collections that should be factored out into a base collection for efficiency.

### Requirements Gathering

**Key Questions & Decisions:**

1. **Package Scope:** All packages (including transitive deps) ✓
   - Rationale: Any shared dependency is a candidate for factoring out

2. **Minimum Threshold:** Configurable with default of 2 ✓
   - User note: "Make the threshold configurable but default to 2"
   - Allows flexibility for different use cases

3. **Output Grouping:** Individual packages ✓
   - Simpler to review and make decisions

4. **Output Format:** Rich table (default) + JSON option ✓
   - Consistent with suggest-collection command

### Use Case

**Problem:**
- Multiple collections share common dependencies (numpy, pandas, setuptools, etc.)
- Each collection builds these packages independently
- Wasted build time and resources

**Solution:**
- Build shared packages once in a base collection
- Build specialized collections in parallel, all depending on base
- Reduce total build time by building each shared package once instead of once per collection

### Algorithm

1. Load all collection graphs and extract all packages (by canonical name)
2. For each unique package, count how many collections contain it
3. Filter packages appearing in >= `--min-collections` collections (default: 2)
4. Rank by collection count (descending), then alphabetically
5. If `--base` provided, mark which packages are already in base vs. new candidates
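
Steps 1 through 4 can be sketched with a `Counter` over per-collection name sets. This is a rough illustration under assumptions: `shared_packages()` is a hypothetical helper, the input is assumed to be already reduced to canonical names, and the sample data is invented.

```python
from collections import Counter


def shared_packages(
    collections: dict[str, set[str]], min_collections: int = 2
) -> list[tuple[str, int, list[str]]]:
    """Return (package, count, appears_in) rows for packages that meet the threshold.

    Rows are ordered by collection count (descending), then by package name.
    """
    counts = Counter(name for members in collections.values() for name in members)
    selected = [name for name, count in counts.items() if count >= min_collections]
    selected.sort(key=lambda name: (-counts[name], name))
    return [
        (name, counts[name], sorted(c for c, members in collections.items() if name in members))
        for name in selected
    ]


# Hypothetical sample data.
collections = {
    "notebook": {"numpy", "pandas", "setuptools"},
    "rhai-innovation": {"numpy", "setuptools", "torch"},
    "data-science": {"numpy", "pandas", "setuptools"},
}
for name, count, appears_in in shared_packages(collections):
    print(f"{name}: {count}/{len(collections)} ({', '.join(appears_in)})")
```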

### Output Example

**Without existing base:**
```
Base Collection Candidates
Analyzing 3 collections: notebook, rhai-innovation, data-science
Threshold: packages appearing in >= 2 collections

┃ Package       ┃ Collections ┃ Coverage ┃ Appears In                          ┃
┃━━━━━━━━━━━━━━━┃━━━━━━━━━━━━━┃━━━━━━━━━━┃━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┃
┃ numpy         ┃ 3/3         ┃ 100.0%   ┃ notebook, rhai-innovation, data-sci ┃
┃ setuptools    ┃ 3/3         ┃ 100.0%   ┃ notebook, rhai-innovation, data-sci ┃
┃ wheel         ┃ 3/3         ┃ 100.0%   ┃ notebook, rhai-innovation, data-sci ┃
┃ pandas        ┃ 2/3         ┃ 66.7%    ┃ notebook, data-sci                  ┃

Summary:
  Shared packages found: 6
  Packages in all collections: 3 (50.0%)
```

**With existing base:**
```
Base Collection Enhancement Candidates
Current base: 45 packages

┃ Package       ┃ Collections ┃ Coverage ┃ In Base ┃ Appears In                          ┃
┃━━━━━━━━━━━━━━━┃━━━━━━━━━━━━━┃━━━━━━━━━━┃━━━━━━━━━┃━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┃
┃ numpy         ┃ 3/3         ┃ 100.0%   ┃ Yes     ┃ notebook, rhai-innovation, data-sci ┃
┃ pandas        ┃ 2/3         ┃ 66.7%    ┃ No      ┃ notebook, data-sci                  ┃

Summary:
  Already in base: 3 (50.0%)
  New candidates: 3 (50.0%)

Recommendation: Add pandas, requests to base collection
```

**Command:**
```bash
# Find shared packages
fromager graph suggest-base COLLECTION-GRAPHS... [--min-collections N]

# Enhance existing base
fromager graph suggest-base COLLECTION-GRAPHS... --base BASE-GRAPH [--min-collections N]
```

### Implementation Notes
- Location: `src/fromager/commands/graph.py`
- Key functions: `get_all_package_names()`, `find_shared_packages()`
- Options: `--base`, `--min-collections`, `--format`
- Reuse patterns from suggest-collection command

### GitHub Issue
**Created:** Issue #973 - https://github.com/python-wheel-build/fromager/issues/973

---

## Key Design Decisions

### 1. Package Matching by Name Only
Both commands ignore package versions when comparing dependencies.

**Rationale:**
- Collections regularly build multiple versions of the same package
- Version differences shouldn't affect collection assignment decisions
- Focus on package presence, not specific versions

**Implementation:**
- Use `node.canonicalized_name` instead of `node.key` (which includes version)
- Use `canonicalize_name()` from `packaging.utils` for normalization
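
To illustrate what name-only matching means for graph keys, the sketch below reimplements the PEP 503 normalization rule (the rule `canonicalize_name()` applies) with the standard library so that it runs without `packaging` installed. `normalize()` and `name_from_key()` are hypothetical helpers; real code would use the node attribute and library function named above.

```python
import re


def normalize(name: str) -> str:
    """PEP 503 normalization: collapse runs of '-', '_', '.' and lowercase."""
    return re.sub(r"[-_.]+", "-", name).lower()


def name_from_key(key: str) -> str:
    """Drop the version from a 'package==version' graph key and normalize the name."""
    name, _, _version = key.partition("==")
    return normalize(name)


# Two versions of the same package collapse to one name.
keys = ["Flask_SQLAlchemy==3.0.5", "flask.sqlalchemy==3.1.1", "numpy==2.0.0"]
print({name_from_key(k) for k in keys})
```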

### 2. Full Dependency Closure Analysis
Both commands analyze complete transitive dependency trees, not just top-level packages.

**Rationale:**
- Transitive dependencies are just as important for build efficiency
- A package with shared dependencies is a better fit even if top-level packages differ
- Complete picture enables better optimization decisions

**Implementation:**
- Traverse full dependency graph using depth-first search
- Use visited set to handle circular dependencies
- Include both install and build dependencies
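
A minimal sketch of that traversal, written against the raw graph JSON structure rather than the `DependencyGraph` API. `dependency_closure()` is a hypothetical name, the `req_type` values other than `toplevel` are illustrative, and including the starting package in its own closure is an assumption.

```python
def dependency_closure(graph: dict[str, dict], start_key: str) -> set[str]:
    """Return canonical names reachable from start_key, including start_key itself.

    Uses an iterative depth-first search. The visited set stops the walk from
    looping on circular dependencies. Every edge is followed regardless of its
    req_type, so install and build dependencies are both included.
    """
    visited: set[str] = set()
    stack = [start_key]
    while stack:
        key = stack.pop()
        if key in visited:
            continue
        visited.add(key)
        for edge in graph[key].get("edges", []):
            stack.append(edge["key"])
    return {graph[key]["canonicalized_name"] for key in visited}


# Hypothetical graph with a cycle (a -> b -> a) and one build dependency.
graph = {
    "a==1.0": {
        "canonicalized_name": "a",
        "edges": [
            {"key": "b==2.0", "req_type": "install"},
            {"key": "setuptools==80.0", "req_type": "build-system"},
        ],
    },
    "b==2.0": {
        "canonicalized_name": "b",
        "edges": [{"key": "a==1.0", "req_type": "install"}],
    },
    "setuptools==80.0": {"canonicalized_name": "setuptools", "edges": []},
}
print(sorted(dependency_closure(graph, "a==1.0")))
```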

### 3. Consistent Output Formats
Both commands default to rich tables with `--format json` option.

**Rationale:**
- Human-readable tables for interactive use
- Machine-readable JSON for automation/scripting
- Consistency across related commands improves UX

**Implementation:**
- Follow pattern from existing `stats.py` and `list_overrides.py`
- Use rich library for table formatting
- JSON output enables pipeline integration

### 4. Configurable Thresholds
`suggest-base` uses configurable `--min-collections` threshold.

**Rationale:**
- Different use cases require different thresholds
- Finding packages in ALL collections (min = N) identifies universal dependencies
- Finding packages in ANY 2+ collections (min = 2) maximizes candidates
- Default of 2 is the lowest valid value, so a default run lists every package shared by at least two collections; raise it to narrow the list

**Implementation:**
- Validation: threshold must be >= 2 and <= number of collections
- Clear error messages for invalid thresholds
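
The validation rule could look roughly like the following. This is a sketch with a hypothetical function name and error wording; the real command would presumably report the error through its CLI framework rather than a bare `ValueError`.

```python
def validate_min_collections(min_collections: int, num_collections: int) -> int:
    """Return the threshold unchanged if 2 <= threshold <= number of collections."""
    if num_collections < 2:
        raise ValueError(
            f"need at least 2 collection graphs to find shared packages, got {num_collections}"
        )
    if not 2 <= min_collections <= num_collections:
        raise ValueError(
            f"--min-collections must be between 2 and {num_collections}, got {min_collections}"
        )
    return min_collections


print(validate_min_collections(2, 3))
```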

### 5. Optional Base Enhancement
`suggest-base` supports enhancing existing base collections via `--base` option.

**Rationale:**
- Base collections evolve over time
- Need to add new shared packages as collections grow
- Need to validate that existing base packages are still useful

**Implementation:**
- Load existing base graph
- Compare against suggested candidates
- Mark packages as "already in base" vs. "new candidates"
- Identify orphaned base packages (in base but not used)
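
The comparison reduces to three set operations. In this sketch `classify_against_base()` is a hypothetical helper, and "orphaned" is assumed to mean a base package that appears in none of the analyzed collections.

```python
def classify_against_base(
    candidates: set[str], base: set[str], used: set[str]
) -> dict[str, list[str]]:
    """Split names into already-in-base, new candidates, and orphaned base packages.

    candidates: shared packages that met the --min-collections threshold
    base:       package names in the existing base graph
    used:       union of package names across all analyzed collections
    """
    return {
        "already_in_base": sorted(candidates & base),
        "new_candidates": sorted(candidates - base),
        "orphaned": sorted(base - used),
    }


# Hypothetical sample data.
result = classify_against_base(
    candidates={"numpy", "pandas", "setuptools"},
    base={"numpy", "setuptools", "legacy-tool"},
    used={"numpy", "pandas", "setuptools", "torch"},
)
print(result)
```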

---

## Technical Architecture

### Graph File Structure
```json
{
  "": {  // ROOT node
    "edges": [
      {"key": "package==1.0", "req_type": "toplevel", "req": "package>=1.0"}
    ]
  },
  "package==1.0": {
    "canonicalized_name": "package",
    "version": "1.0",
    "download_url": "...",
    "pre_built": false,
    "edges": [...]
  }
}
```
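
As a rough illustration of reading this structure directly with the standard library (the real commands would go through `DependencyGraph`), the sketch below pulls out the top-level keys and the set of package names. The helper names and the sample document are hypothetical.

```python
import json

# Hypothetical, minimal graph document following the structure shown above.
GRAPH_JSON = '''
{
  "": {"edges": [
    {"key": "app==1.0", "req_type": "toplevel", "req": "app>=1.0"}
  ]},
  "app==1.0": {"canonicalized_name": "app", "version": "1.0", "edges": [
    {"key": "lib==2.0", "req_type": "install", "req": "lib"}
  ]},
  "lib==2.0": {"canonicalized_name": "lib", "version": "2.0", "edges": []}
}
'''

ROOT = ""  # the ROOT node is stored under the empty-string key


def top_level_keys(graph: dict[str, dict]) -> list[str]:
    """Keys of the nodes that ROOT points to with req_type 'toplevel'."""
    return [e["key"] for e in graph[ROOT]["edges"] if e["req_type"] == "toplevel"]


def package_names(graph: dict[str, dict]) -> set[str]:
    """Canonical names of every node except ROOT (versions ignored)."""
    return {node["canonicalized_name"] for key, node in graph.items() if key != ROOT}


graph = json.loads(GRAPH_JSON)
print(top_level_keys(graph), sorted(package_names(graph)))
```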

### Code Organization
Both commands added to: `src/fromager/commands/graph.py`

**Shared utilities:**
- `get_package_names()` / `get_all_package_names()` - Extract package sets
- `extract_collection_name()` - Parse collection name from file path
- Output formatting functions for table and JSON

**Command-specific:**
- `suggest_collection`: `get_dependency_closure()` for transitive deps
- `suggest_base`: `find_shared_packages()` for overlap analysis
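
One possible shape for the collection-name helper is sketched below. It assumes graph files are named like the samples listed under Sample Data Analysis (`<version>+<collection>-cuda<version>-<os>-<arch>-graph.json`) and falls back to the file stem otherwise; both the pattern and the function name `collection_name_from_path()` are assumptions, not the agreed design.

```python
import re
from pathlib import Path

# Assumed naming convention, inferred only from the sample file names.
_SAMPLE_PATTERN = re.compile(r"^[^+]+\+(?P<name>.+?)-cuda[\d.]+-")


def collection_name_from_path(path: str) -> str:
    """Guess a collection name from a graph file path."""
    filename = Path(path).name
    match = _SAMPLE_PATTERN.match(filename)
    if match:
        return match.group("name")
    # Fallback: use the file stem, minus a trailing '-graph' if present.
    return Path(filename).stem.removesuffix("-graph")


print(collection_name_from_path("3.4.2440+rhai-innovation-cuda12.9-ubi9-x86_64-graph.json"))
print(collection_name_from_path("notebook.json"))
```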

### Testing Strategy
Both commands require:
- Unit tests for helper functions
- Integration tests with real/generated graph files
- Edge case coverage (empty graphs, circular deps, tied rankings, etc.)
- Type checking with mypy
- Code formatting with ruff

---

## Related Work

### Existing Fromager Commands
- `graph to-constraints` - Graph conversion
- `graph explain-duplicates` - Version conflict analysis
- `graph why` - Dependency chain explanation
- `stats` - Build statistics with rich tables

### Pattern References
- **Table output:** `src/fromager/commands/stats.py`
- **JSON output:** `src/fromager/commands/list_overrides.py`
- **Graph loading:** Standard pattern across all graph commands
- **Click decorators:** Consistent argument/option patterns

---

## Success Metrics

### For `suggest-collection` (#971)
- Correctly identifies best-fit collections based on dependency overlap
- Saves time compared with manually assigning 100+ onboarding packages
- Enables data-driven collection organization decisions

### For `suggest-base` (#973)
- Identifies high-value candidates for base collection
- Reduces build time by eliminating duplicate builds
- Enables efficient parallel builds of multiple collections

### Overall Impact
- Improved build efficiency through better collection organization
- Reduced manual effort in collection management
- Data-driven decisions replacing guesswork
- Clear visibility into collection relationships and overlap

---

## Next Steps

1. **Implementation** - Both commands ready for development
2. **Testing** - Use sample graphs in `~/tmp/sample-graphs/` for validation
3. **Integration** - Commands work together:
   - Use `suggest-base` to create/enhance base collection
   - Use `suggest-collection` to assign new packages to specialized collections
4. **Documentation** - Update user docs with new commands and workflows

---

## Issue References

- **Epic:** Issue #972 - Collection optimization tools
- **Command 1:** Issue #971 - Add `graph suggest-collection` command
- **Command 2:** Issue #973 - Add `graph suggest-base` command

---

## Appendix: Example Workflow

### Step 1: Identify Base Collection Candidates
```bash
fromager graph suggest-base \
  notebook.json \
  rhai-innovation.json \
  data-science.json \
  --min-collections 3
```
Result: Packages appearing in all 3 collections

### Step 2: Create Base Collection
(Manual process - build the identified shared packages)

### Step 3: Assign Onboarding Packages
```bash
fromager graph suggest-collection \
  onboarding.json \
  notebook.json \
  rhai-innovation.json \
  data-science.json
```
Result: Best-fit collection for each onboarding package

### Step 4: Enhance Base Collection (Later)
```bash
fromager graph suggest-base \
  notebook.json \
  rhai-innovation.json \
  data-science.json \
  --base current-base.json \
  --min-collections 2
```
Result: New candidates to add to existing base

