[epic] collection optimization tools #972

@dhellmann

Collection Optimization Tools - Conversation Summary

This document summarizes the discussion and design process that led to the creation of two new fromager commands for optimizing collection organization.

Context

Date: 2026-03-21
Participants: Doug Hellmann, Claude (Sonnet 4.5)

Initial Request

User Request:

I want to add a command to fromager to help me decide how to optimize the builds by combining sets of packages into collections. I already have some collections defined, and each produces a graph file describing the dependencies of the items in the collection. I also have a collection of packages being "onboarded" to the build that have not been added to the permanent collections.

Goal: Create tools to help organize packages into collections efficiently to minimize duplicated builds and optimize parallel build processes.

Discovery Phase

Sample Data Analysis

Located sample graph files in ~/tmp/sample-graphs/:

  • 3.4.2293+notebook-cuda12.9-ubi9-aarch64-graph.json - 444 packages, 63 top-level
  • 3.4.2386+onboarding-cuda12.9-ubi9-x86_64-graph.json - 851 packages, 136 top-level
  • 3.4.2440+rhai-innovation-cuda12.9-ubi9-x86_64-graph.json - 418 packages, 16 top-level

Codebase Exploration

Existing Infrastructure:

  • DependencyGraph class (src/fromager/dependency_graph.py) - Loads and manipulates graph.json files
  • Graph files use JSON format with "package==version" keys and a special "" (empty string) key for the ROOT node
  • TOP_LEVEL packages identified via req_type: 'toplevel' in ROOT node edges
  • Existing graph commands in src/fromager/commands/graph.py follow @graph.command() pattern
  • Rich table output pattern in stats.py, JSON output pattern in list_overrides.py

Command 1: graph suggest-collection

Purpose

Help assign onboarding packages to existing collections by analyzing dependency overlap.

Requirements Gathering

Key Questions & Decisions:

  1. Version Matching: Name only (ignore versions) ✓

    • Rationale: Collections regularly build multiple versions of the same package
  2. Dependency Depth: Full transitive closure ✓

    • Rationale: Need complete picture of what would be added
  3. Output Format: Rich table (default) + JSON option ✓

    • User note: "Default to a rich table but provide an option for JSON output"
  4. Command Location: Subcommand under graph group ✓

    • Fits with existing graph commands

Algorithm

For each top-level package in onboarding graph:

  1. Extract full dependency closure (all transitive dependencies by canonical name)
  2. For each existing collection:
    • Calculate new packages = closure - collection packages
    • Calculate existing packages = closure ∩ collection packages
    • Calculate coverage % = existing / closure size * 100
  3. Rank collections by:
    • Primary: fewest new packages (ascending)
    • Secondary: highest coverage % (descending)
  4. Display best-fit collection with statistics
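
The ranking steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: rank_collections() is a hypothetical helper name, collections are passed as plain sets of canonical names rather than DependencyGraph objects, and the final tie-break on collection name is an addition that is not part of the steps above.

```python
def rank_collections(
    closure: set[str], collections: dict[str, set[str]]
) -> list[dict]:
    """Rank collections for one onboarding package (hypothetical helper)."""
    rows = []
    for name, packages in collections.items():
        new = closure - packages          # packages the collection would gain
        existing = closure & packages     # packages it already builds
        coverage = 100.0 * len(existing) / len(closure) if closure else 0.0
        rows.append(
            {
                "collection": name,
                "new": len(new),
                "existing": len(existing),
                "coverage": coverage,
            }
        )
    # Primary: fewest new packages. Secondary: highest coverage.
    # Last key (collection name) is an assumed tie-break for stable output.
    rows.sort(key=lambda r: (r["new"], -r["coverage"], r["collection"]))
    return rows


# Made-up data, not taken from the sample graphs.
closure = {"fastapi", "starlette", "pydantic", "anyio"}
collections = {
    "notebook": {"starlette", "pydantic", "anyio", "numpy"},
    "rhai-innovation": {"pydantic", "torch"},
}
ranked = rank_collections(closure, collections)
best = ranked[0]
print(f"{best['collection']}: {best['new']} new, {best['coverage']:.1f}% coverage")
```

For a single closure, new and existing always sum to the closure size, so the primary and secondary keys order collections the same way. The sketch keeps both keys to mirror the steps as written.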

Output Example

Table:

Collection Suggestions for Onboarding Packages

┃ Package             ┃ Version ┃ Total Deps ┃ Best Fit        ┃ New Pkgs ┃ Existing ┃ Coverage ┃
┃━━━━━━━━━━━━━━━━━━━━━┃━━━━━━━━━┃━━━━━━━━━━━━┃━━━━━━━━━━━━━━━━━┃━━━━━━━━━━┃━━━━━━━━━━┃━━━━━━━━━━┃
┃ fastapi             ┃ 0.135.1 ┃ 45         ┃ notebook        ┃ 3        ┃ 42       ┃ 93.3%    ┃
┃ instructorembedding ┃ 1.0.1   ┃ 123        ┃ notebook        ┃ 8        ┃ 115      ┃ 93.5%    ┃
┃ torch               ┃ 2.9.1   ┃ 89         ┃ rhai-innovation ┃ 12       ┃ 77       ┃ 86.5%    ┃

Command:

fromager graph suggest-collection ONBOARDING-GRAPH COLLECTION-GRAPHS...

Implementation Notes

  • Location: src/fromager/commands/graph.py
  • Key functions: get_dependency_closure(), get_package_names(), extract_collection_name()
  • Reuse: DependencyGraph.from_file(), .get_all_nodes(), .get_root_node()
  • Pattern references: stats.py (tables), list_overrides.py (--format option)

GitHub Issue

Created: Issue #971


Command 2: graph suggest-base

Purpose

Identify packages shared across multiple collections that should be factored out into a base collection for efficiency.

Requirements Gathering

Key Questions & Decisions:

  1. Package Scope: All packages (including transitive deps) ✓

    • Rationale: Any shared dependency is a candidate for factoring out
  2. Minimum Threshold: Configurable with default of 2 ✓

    • User note: "Make the threshold configurable but default to 2"
    • Allows flexibility for different use cases
  3. Output Grouping: Individual packages ✓

    • Simpler to review and make decisions
  4. Output Format: Rich table (default) + JSON option ✓

    • Consistent with suggest-collection command

Use Case

Problem:

  • Multiple collections share common dependencies (numpy, pandas, setuptools, etc.)
  • Each collection builds these packages independently
  • Wasted build time and resources

Solution:

  • Build shared packages once in a base collection
  • Build specialized collections in parallel, all depending on base
  • Reduce total build time by building each shared package once instead of once per collection

Algorithm

  1. Load all collection graphs and extract all packages (by canonical name)
  2. For each unique package, count how many collections contain it
  3. Filter packages appearing in >= --min-collections collections (default: 2)
  4. Rank by collection count (descending), then alphabetically
  5. If --base provided, mark which packages are already in base vs. new candidates
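
A minimal sketch of steps 1 through 4, assuming each collection has already been reduced to a set of canonical package names. find_shared_packages() is the function name listed in the implementation notes, but the signature and return shape here are assumptions, and the package sets are made up for the example.

```python
from collections import defaultdict


def find_shared_packages(
    collections: dict[str, set[str]], min_collections: int = 2
) -> list[tuple[str, list[str]]]:
    """Return (package, collections containing it) pairs, ranked.

    Assumed signature: the real command may work on DependencyGraph objects.
    """
    appears_in: dict[str, list[str]] = defaultdict(list)
    for collection_name, packages in collections.items():
        for package in packages:
            appears_in[package].append(collection_name)

    shared = [
        (package, names)
        for package, names in appears_in.items()
        if len(names) >= min_collections
    ]
    # Most widely shared first, then alphabetical by package name.
    shared.sort(key=lambda item: (-len(item[1]), item[0]))
    return shared


# Illustrative package sets only.
collections = {
    "notebook": {"numpy", "setuptools", "wheel", "pandas", "jupyter"},
    "rhai-innovation": {"numpy", "setuptools", "wheel", "torch"},
    "data-science": {"numpy", "setuptools", "wheel", "pandas"},
}
for package, names in find_shared_packages(collections):
    print(f"{package}: {len(names)}/{len(collections)} ({', '.join(names)})")
```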

Output Example

Without existing base:

Base Collection Candidates
Analyzing 3 collections: notebook, rhai-innovation, data-science
Threshold: packages appearing in >= 2 collections

┃ Package       ┃ Collections ┃ Coverage ┃ Appears In                          ┃
┃━━━━━━━━━━━━━━━┃━━━━━━━━━━━━━┃━━━━━━━━━━┃━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┃
┃ numpy         ┃ 3/3         ┃ 100.0%   ┃ notebook, rhai-innovation, data-sci ┃
┃ setuptools    ┃ 3/3         ┃ 100.0%   ┃ notebook, rhai-innovation, data-sci ┃
┃ wheel         ┃ 3/3         ┃ 100.0%   ┃ notebook, rhai-innovation, data-sci ┃
┃ pandas        ┃ 2/3         ┃ 66.7%    ┃ notebook, data-sci                  ┃

Summary:
  Shared packages found: 6
  Packages in all collections: 3 (50.0%)

With existing base:

Base Collection Enhancement Candidates
Current base: 45 packages

┃ Package       ┃ Collections ┃ Coverage ┃ In Base ┃ Appears In                          ┃
┃━━━━━━━━━━━━━━━┃━━━━━━━━━━━━━┃━━━━━━━━━━┃━━━━━━━━━┃━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┃
┃ numpy         ┃ 3/3         ┃ 100.0%   ┃ Yes     ┃ notebook, rhai-innovation, data-sci ┃
┃ pandas        ┃ 2/3         ┃ 66.7%    ┃ No      ┃ notebook, data-sci                  ┃

Summary:
  Already in base: 3 (50.0%)
  New candidates: 3 (50.0%)

Recommendation: Add pandas, requests to base collection

Command:

# Find shared packages
fromager graph suggest-base COLLECTION-GRAPHS... [--min-collections N]

# Enhance existing base
fromager graph suggest-base COLLECTION-GRAPHS... --base BASE-GRAPH [--min-collections N]

Implementation Notes

  • Location: src/fromager/commands/graph.py
  • Key functions: get_all_package_names(), find_shared_packages()
  • Options: --base, --min-collections, --format
  • Reuse patterns from suggest-collection command

GitHub Issue

Created: Issue #973


Key Design Decisions

1. Package Matching by Name Only

Both commands ignore package versions when comparing dependencies.

Rationale:

  • Collections regularly build multiple versions of the same package
  • Version differences shouldn't affect collection assignment decisions
  • Focus on package presence, not specific versions

Implementation:

  • Use node.canonicalized_name instead of node.key (which includes version)
  • Use canonicalize_name() from packaging.utils for normalization
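
To illustrate name-only matching, the sketch below collapses two versions of one package into a single entry. It reads the canonicalized_name field from plain dictionaries shaped like the graph file. normalize() is a stdlib stand-in that applies the PEP 503 rule that canonicalize_name() implements, used here only so the example runs without the packaging dependency. package_names() is a hypothetical helper.

```python
import re


def normalize(name: str) -> str:
    """PEP 503 normalization (stand-in for packaging.utils.canonicalize_name)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def package_names(graph: dict) -> set[str]:
    """Canonical names of every node except ROOT (hypothetical helper)."""
    return {
        normalize(node["canonicalized_name"])
        for key, node in graph.items()
        if key  # the empty-string key is the ROOT node
    }


# Two versions of the same package count once; values are made up.
graph = {
    "": {"edges": []},
    "foo-bar==1.0": {"canonicalized_name": "foo-bar", "version": "1.0", "edges": []},
    "foo-bar==2.0": {"canonicalized_name": "foo-bar", "version": "2.0", "edges": []},
    "baz==0.3": {"canonicalized_name": "baz", "version": "0.3", "edges": []},
}
print(sorted(package_names(graph)))
```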

2. Full Dependency Closure Analysis

Both commands analyze complete transitive dependency trees, not just top-level packages.

Rationale:

  • Transitive dependencies matter as much as top-level packages for build efficiency
  • A package with shared dependencies is a better fit even if top-level packages differ
  • Complete picture enables better optimization decisions

Implementation:

  • Traverse full dependency graph using depth-first search
  • Use visited set to handle circular dependencies
  • Include both install and build dependencies
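
The traversal could look roughly like the sketch below, which walks a graph held as a plain dictionary in the graph-file shape. get_dependency_closure() is named in the implementation notes, but this signature is an assumption, as is the choice to include the starting package in its own closure. Every edge is followed regardless of req_type, so install and build dependencies are both covered. The "install" and "build-system" values in the sample are illustrative.

```python
def get_dependency_closure(graph: dict, start_key: str) -> set[str]:
    """Canonical names of start_key and everything reachable from it."""
    visited: set[str] = set()
    stack = [start_key]
    while stack:
        key = stack.pop()
        if key in visited:
            continue  # already handled; this also breaks cycles
        visited.add(key)
        for edge in graph[key].get("edges", []):
            stack.append(edge["key"])
    return {graph[key]["canonicalized_name"] for key in visited}


# Small made-up graph with a cycle between a and b.
graph = {
    "": {"edges": [{"key": "a==1", "req_type": "toplevel", "req": "a"}]},
    "a==1": {
        "canonicalized_name": "a",
        "edges": [
            {"key": "b==1", "req_type": "install", "req": "b"},
            {"key": "c==1", "req_type": "build-system", "req": "c"},
        ],
    },
    "b==1": {
        "canonicalized_name": "b",
        "edges": [{"key": "a==1", "req_type": "install", "req": "a"}],
    },
    "c==1": {"canonicalized_name": "c", "edges": []},
}
print(sorted(get_dependency_closure(graph, "a==1")))
```

An explicit stack is used instead of recursion so that deep graphs do not hit Python's recursion limit.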

3. Consistent Output Formats

Both commands default to rich tables with --format json option.

Rationale:

  • Human-readable tables for interactive use
  • Machine-readable JSON for automation/scripting
  • Consistency across related commands improves UX

Implementation:

  • Follow pattern from existing stats.py and list_overrides.py
  • Use rich library for table formatting
  • JSON output enables pipeline integration

4. Configurable Thresholds

suggest-base uses configurable --min-collections threshold.

Rationale:

  • Different use cases require different thresholds
  • Finding packages in ALL collections (min = N) identifies universal dependencies
  • Finding packages in ANY 2+ collections (min = 2) maximizes candidates
  • Default of 2 is the lowest meaningful value, since a package in only one collection is not shared

Implementation:

  • Validation: threshold must be >= 2 and <= number of collections
  • Clear error messages for invalid thresholds
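
One way the validation might be written is sketched below. validate_min_collections() is a hypothetical helper, and the message wording is only a suggestion. ValueError keeps the example free of dependencies; the actual command would presumably report the problem through its CLI framework instead.

```python
def validate_min_collections(min_collections: int, collection_count: int) -> int:
    """Return the threshold if valid, otherwise raise with a clear message."""
    if min_collections < 2:
        raise ValueError(
            f"--min-collections must be at least 2, got {min_collections}"
        )
    if min_collections > collection_count:
        raise ValueError(
            f"--min-collections ({min_collections}) cannot exceed the number "
            f"of collection graphs ({collection_count})"
        )
    return min_collections


# Three collection graphs on the command line: 2 and 3 are valid, 4 is not.
print(validate_min_collections(2, 3))
try:
    validate_min_collections(4, 3)
except ValueError as err:
    print(f"error: {err}")
```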

5. Optional Base Enhancement

suggest-base supports enhancing existing base collections via --base option.

Rationale:

  • Base collections evolve over time
  • Need to add new shared packages as collections grow
  • Want to validate existing base packages are still useful

Implementation:

  • Load existing base graph
  • Compare against suggested candidates
  • Mark packages as "already in base" vs. "new candidates"
  • Identify orphaned base packages (in base but not used)
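
The comparison reduces to set arithmetic, sketched here. compare_with_base() is a hypothetical helper, and "orphaned" is read as "in the base graph but absent from every analyzed collection", which is one interpretation of "not used". All package names are made up.

```python
def compare_with_base(
    candidates: set[str], base: set[str], all_used: set[str]
) -> dict[str, list[str]]:
    """Split candidates by base membership and find unused base packages.

    candidates: packages meeting the --min-collections threshold
    base:       packages in the existing base graph
    all_used:   union of packages across all analyzed collections
    """
    return {
        "already_in_base": sorted(candidates & base),
        "new_candidates": sorted(candidates - base),
        # Assumed meaning of "orphaned": no analyzed collection contains it.
        "orphaned": sorted(base - all_used),
    }


result = compare_with_base(
    candidates={"numpy", "pandas", "requests"},
    base={"numpy", "legacy-lib"},
    all_used={"numpy", "pandas", "requests", "torch"},
)
for label, names in result.items():
    print(f"{label}: {', '.join(names) or '(none)'}")
```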

Technical Architecture

Graph File Structure

{
  "": {  // ROOT node
    "edges": [
      {"key": "package==1.0", "req_type": "toplevel", "req": "package>=1.0"}
    ]
  },
  "package==1.0": {
    "canonicalized_name": "package",
    "version": "1.0",
    "download_url": "...",
    "pre_built": false,
    "edges": [...]
  }
}
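
As a rough illustration of reading this structure without DependencyGraph, the sketch below finds the top-level packages by following the toplevel edges of the ROOT node. top_level_keys() is a hypothetical helper, and the sample data is made up. The "// ROOT node" note above is an annotation, so the sample omits it to stay valid JSON.

```python
import json


def top_level_keys(graph: dict) -> list[str]:
    """Keys of nodes that ROOT reaches through a 'toplevel' edge."""
    root = graph.get("", {})
    return [
        edge["key"]
        for edge in root.get("edges", [])
        if edge.get("req_type") == "toplevel"
    ]


# Tiny graph in the shape shown above; the "install" edge is illustrative.
sample = json.loads(
    """
{
  "": {
    "edges": [
      {"key": "package==1.0", "req_type": "toplevel", "req": "package>=1.0"}
    ]
  },
  "package==1.0": {
    "canonicalized_name": "package",
    "version": "1.0",
    "edges": [{"key": "dep==2.0", "req_type": "install", "req": "dep"}]
  },
  "dep==2.0": {"canonicalized_name": "dep", "version": "2.0", "edges": []}
}
"""
)
print(top_level_keys(sample))
```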

Code Organization

Both commands added to: src/fromager/commands/graph.py

Shared utilities:

  • get_package_names() / get_all_package_names() - Extract package sets
  • extract_collection_name() - Parse collection name from file path
  • Output formatting functions for table and JSON

Command-specific:

  • suggest_collection: get_dependency_closure() for transitive deps
  • suggest_base: find_shared_packages() for overlap analysis

Testing Strategy

Both commands require:

  • Unit tests for helper functions
  • Integration tests with real/generated graph files
  • Edge case coverage (empty graphs, circular deps, tied rankings, etc.)
  • Type checking with mypy
  • Code formatting with ruff

Related Work

Existing Fromager Commands

  • graph to-constraints - Graph conversion
  • graph explain-duplicates - Version conflict analysis
  • graph why - Dependency chain explanation
  • stats - Build statistics with rich tables

Pattern References

  • Table output: src/fromager/commands/stats.py
  • JSON output: src/fromager/commands/list_overrides.py
  • Graph loading: Standard pattern across all graph commands
  • Click decorators: Consistent argument/option patterns

Success Metrics

For suggest-collection (#971)

  • Correctly identifies best-fit collections based on dependency overlap
  • Saves time in manually assigning 100+ onboarding packages
  • Enables data-driven collection organization decisions

For suggest-base (#973)

  • Identifies high-value candidates for base collection
  • Reduces build time by eliminating duplicate builds
  • Enables efficient parallel builds of multiple collections

Overall Impact

  • Improved build efficiency through better collection organization
  • Reduced manual effort in collection management
  • Data-driven decisions replacing guesswork
  • Clear visibility into collection relationships and overlap

Next Steps

  1. Implementation - Both commands ready for development
  2. Testing - Use sample graphs in ~/tmp/sample-graphs/ for validation
  3. Integration - Commands work together:
    • Use suggest-base to create/enhance base collection
    • Use suggest-collection to assign new packages to specialized collections
  4. Documentation - Update user docs with new commands and workflows

Issue References

  • #971 - graph suggest-collection
  • #973 - graph suggest-base

Appendix: Example Workflow

Step 1: Identify Base Collection Candidates

fromager graph suggest-base \
  notebook.json \
  rhai-innovation.json \
  data-science.json \
  --min-collections 3

Result: Packages appearing in all 3 collections

Step 2: Create Base Collection

(Manual process - build the identified shared packages)

Step 3: Assign Onboarding Packages

fromager graph suggest-collection \
  onboarding.json \
  notebook.json \
  rhai-innovation.json \
  data-science.json

Result: Best-fit collection for each onboarding package

Step 4: Enhance Base Collection (Later)

fromager graph suggest-base \
  notebook.json \
  rhai-innovation.json \
  data-science.json \
  --base current-base.json \
  --min-collections 2

Result: New candidates to add to existing base
