Skip to content

Add graph suggest-collection command to optimize collection organization #971

@dhellmann

Description

@dhellmann

Add graph suggest-collection command to optimize collection organization

Summary

Add a new subcommand fromager graph suggest-collection that analyzes dependency overlap between onboarding packages and existing collections to recommend optimal collection assignments. This helps maintainers decide which collection each package should join based on minimizing the number of new packages that would need to be added.

Motivation

When onboarding new packages to fromager builds, maintainers must decide which existing collection each package should join. This decision should be based on dependency overlap - packages that share many dependencies with an existing collection are good candidates for that collection.

Currently, this analysis is done manually, which is:

  • Time-consuming: With 100+ onboarding packages and their transitive dependencies, manual analysis is slow
  • Error-prone: It's easy to miss shared dependencies or assign packages suboptimally
  • Opaque: No visibility into how well different collections would fit

The result is suboptimal collection organization with duplicated packages across collections, leading to inefficient builds.

Proposed Solution

Create a new command that automates dependency analysis:

fromager graph suggest-collection ONBOARDING-GRAPH COLLECTION-GRAPHS...

For each top-level package in the onboarding graph:

  1. Calculate its full dependency closure (all transitive dependencies)
  2. Compare this closure against each existing collection
  3. Rank collections by "fit" - how many new packages would need to be added
  4. Display recommendations with statistics

Command Interface

fromager graph suggest-collection [OPTIONS] ONBOARDING-GRAPH COLLECTION-GRAPHS...

Arguments:

  • ONBOARDING-GRAPH: Graph file (graph.json) containing packages to assign
  • COLLECTION-GRAPHS: One or more graph files for existing collections

Options:

  • --format [table|json]: Output format (default: table)
    • table: Human-readable rich table showing best-fit collection for each package
    • json: Machine-readable JSON with all collections ranked for each package
  • --help: Show help message

Example:

fromager graph suggest-collection \
  onboarding-graph.json \
  notebook-graph.json \
  rhai-innovation-graph.json

Functional Requirements

Core Functionality

  1. Load graph files: Parse multiple graph.json files using DependencyGraph.from_file()

  2. Identify top-level packages: Extract packages marked with req_type: 'toplevel' from the ROOT node in the onboarding graph

  3. Calculate dependency closures: For each top-level package, traverse the full dependency tree to get all transitive dependencies (both install and build dependencies)

  4. Compare by package name only: When comparing dependencies, match packages by canonical name ignoring version numbers (e.g., numpy==1.26.0 and numpy==1.25.0 are considered the same package)

  5. Rank collections by fit: For each top-level package and collection pair:

    • Calculate set difference: packages in dependency closure NOT in collection
    • Calculate set intersection: packages in dependency closure AND in collection
    • Primary sort: fewest new packages (ascending)
    • Secondary sort: highest coverage percentage (descending)
  6. Display results: Show each package with its best-fit collection and statistics

Output Requirements

Table format (default --format table):

  • One row per onboarding package
  • Columns: Package name, Version, Total dependencies, Best fit collection, New packages, Existing packages, Coverage %
  • Sorted alphabetically by package name
  • Clear, scannable output for human review

JSON format (--format json):

  • Array of objects, one per onboarding package
  • Each object includes all collections ranked by fit
  • Valid, parseable JSON for automation/scripting
  • Structure enables downstream processing

Algorithm

For each top-level package in the onboarding graph:

  1. Extract dependency closure:

    closure = set()  # Set of canonical package names
    visited = set()  # Set of node keys to prevent cycles
    stack = [top_level_node]
    
    while stack:
        node = stack.pop()
        if node.key in visited:
            continue
        visited.add(node.key)
        closure.add(node.canonicalized_name)
    
        # Add all children (both install and build deps)
        for edge in node.children:
            stack.append(edge.destination_node)
  2. Calculate fit with each collection:

    for collection in collections:
        collection_packages = {node.canonicalized_name for node in collection.get_all_nodes()}
        new_packages = closure - collection_packages
        existing_packages = closure & collection_packages
    
        fit_score = {
            "new_packages": len(new_packages),
            "existing_packages": len(existing_packages),
            "coverage_pct": len(existing_packages) / len(closure) * 100
        }
  3. Rank collections:

    • Sort by new_packages (ascending) - fewer is better
    • Then by coverage_pct (descending) - higher is better

Example Output

Table Format (Default)

Collection Suggestions for Onboarding Packages

┃ Package             ┃ Version ┃ Total Deps ┃ Best Fit        ┃ New Pkgs ┃ Existing ┃ Coverage ┃
┃━━━━━━━━━━━━━━━━━━━━━┃━━━━━━━━━┃━━━━━━━━━━━━┃━━━━━━━━━━━━━━━━━┃━━━━━━━━━━┃━━━━━━━━━━┃━━━━━━━━━━┃
┃ fastapi             ┃ 0.135.1 ┃ 45         ┃ notebook        ┃ 3        ┃ 42       ┃ 93.3%    ┃
┃ instructorembedding ┃ 1.0.1   ┃ 123        ┃ notebook        ┃ 8        ┃ 115      ┃ 93.5%    ┃
┃ torch               ┃ 2.9.1   ┃ 89         ┃ rhai-innovation ┃ 12       ┃ 77       ┃ 86.5%    ┃

Interpretation:

  • fastapi should join the notebook collection - only 3 new packages needed (93.3% overlap)
  • torch fits better in rhai-innovation despite needing 12 new packages (still 86.5% overlap)

JSON Format (--format json)

[
  {
    "package": "fastapi",
    "version": "0.135.1",
    "total_dependencies": 45,
    "collections": [
      {
        "collection": "notebook",
        "new_packages": 3,
        "existing_packages": 42,
        "total_dependencies": 45,
        "coverage_pct": 93.3
      },
      {
        "collection": "rhai-innovation",
        "new_packages": 15,
        "existing_packages": 30,
        "total_dependencies": 45,
        "coverage_pct": 66.7
      }
    ]
  },
  {
    "package": "torch",
    "version": "2.9.1",
    "total_dependencies": 89,
    "collections": [
      {
        "collection": "rhai-innovation",
        "new_packages": 12,
        "existing_packages": 77,
        "total_dependencies": 89,
        "coverage_pct": 86.5
      },
      {
        "collection": "notebook",
        "new_packages": 34,
        "existing_packages": 55,
        "total_dependencies": 89,
        "coverage_pct": 61.8
      }
    ]
  }
]

Implementation Guidance

File Location

Add to /src/fromager/commands/graph.py as a new subcommand:

@graph.command()
@click.option("--format", "output_format", type=click.Choice(["table", "json"]), default="table")
@click.argument("onboarding-graph", type=str)
@click.argument("collection-graphs", type=str, nargs=-1, required=True)
@click.pass_obj
def suggest_collection(
    wkctx: context.WorkContext,
    onboarding_graph: str,
    collection_graphs: tuple[str, ...],
    output_format: str,
) -> None:
    """Suggest which collection each onboarding package should join."""

Utilities to Reuse

From existing codebase:

  • DependencyGraph.from_file(path) - Load graph files
  • graph.get_root_node() - Get ROOT node to find top-level packages
  • graph.get_all_nodes() - Get all nodes for building package sets
  • node.canonicalized_name - Get normalized package name (ignore version)
  • node.children - Traverse dependency edges
  • canonicalize_name() from packaging.utils - Normalize names
  • RequirementType.TOP_LEVEL - Filter for top-level packages

Reference Implementations

For patterns and style:

  • Table output: See src/fromager/commands/stats.py for rich table construction
  • JSON output: See src/fromager/commands/list_overrides.py for --format option pattern
  • Graph traversal: See src/fromager/commands/graph.py for existing graph walking patterns

Helper Functions Needed

def get_dependency_closure(graph: DependencyGraph, node: DependencyNode) -> set[NormalizedName]:
    """Get all transitive dependencies of a node by canonicalized name."""

def get_package_names(graph: DependencyGraph) -> set[NormalizedName]:
    """Extract all unique package names from a graph."""

def extract_collection_name(graph_path: str) -> str:
    """Parse collection name from graph file path."""

def output_table(results: list[dict]) -> None:
    """Display results as a rich table."""

def output_json(results: list[dict]) -> None:
    """Display results as JSON."""

Testing Requirements

Unit Tests

Add to /tests/test_graph.py:

def test_get_dependency_closure() -> None:
    """Test extracting full dependency closure by name."""

def test_get_package_names() -> None:
    """Test extracting unique package names from graph."""

def test_extract_collection_name() -> None:
    """Test parsing collection names from file paths."""

def test_suggest_collection_basic(tmp_path: pathlib.Path) -> None:
    """Test basic collection suggestion logic."""

Integration Tests

def test_suggest_collection_table_output(tmp_path: pathlib.Path) -> None:
    """Test command with table output format."""

def test_suggest_collection_json_output(tmp_path: pathlib.Path) -> None:
    """Test command with JSON output format."""

def test_suggest_collection_multiple_collections(tmp_path: pathlib.Path) -> None:
    """Test with multiple collection graphs."""

Test Data

Use real graph files for integration testing or create minimal test graphs with known dependency structures.

Edge Cases

The implementation should handle:

  1. Empty onboarding graph (no top-level packages):

    • Output message: "No top-level packages found in onboarding graph"
    • Exit gracefully
  2. Package with no dependencies:

    • Dependency closure contains only the package itself
    • All collections have same fit (0 new packages)
  3. Multiple versions of same package:

    • Already handled by using canonicalized_name instead of versioned key
    • Version differences should not affect fit calculation
  4. Circular dependencies:

    • Use visited set in traversal to prevent infinite loops
    • Each node visited only once
  5. Tied fit scores:

    • Multiple collections with same new_packages count
    • Use coverage_pct as tiebreaker (higher is better)
  6. Collection name extraction:

    • Handle various file naming patterns: 3.4.2293+notebook-cuda12.9-ubi9-aarch64-graph.json
    • Fallback to filename without extensions if pattern doesn't match
  7. Large graphs:

    • Algorithm is O(n*m) where n=top-level packages, m=collections
    • Should be acceptable for typical sizes (100s of packages, handful of collections)

Success Criteria

  • Command loads multiple graph files correctly
  • Identifies all top-level packages from onboarding graph
  • Calculates complete dependency closures
  • Compares packages by canonical name (ignoring versions)
  • Ranks collections correctly (fewest new packages first)
  • Table output is clear and easy to read
  • JSON output is valid and machine-parseable
  • All unit tests pass
  • All integration tests pass
  • Edge cases handled gracefully
  • Code follows fromager standards (type hints, docstrings, formatting)
  • Documentation updated (if applicable)

Related Work

Existing graph commands that may provide useful patterns:

  • graph to-constraints - Loads graphs, processes nodes
  • graph explain-duplicates - Analyzes multiple package versions
  • graph why - Traverses dependency chains
  • stats - Displays statistics with rich tables

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions