-
Notifications
You must be signed in to change notification settings - Fork 45
Description
Add graph suggest-collection command to optimize collection organization
Summary
Add a new subcommand fromager graph suggest-collection that analyzes dependency overlap between onboarding packages and existing collections to recommend optimal collection assignments. This helps maintainers decide which collection each package should join based on minimizing the number of new packages that would need to be added.
Motivation
When onboarding new packages to fromager builds, maintainers must decide which existing collection each package should join. This decision should be based on dependency overlap - packages that share many dependencies with an existing collection are good candidates for that collection.
Currently, this analysis is done manually, which is:
- Time-consuming: With 100+ onboarding packages and their transitive dependencies, manual analysis is slow
- Error-prone: It's easy to miss shared dependencies or assign packages suboptimally
- Opaque: No visibility into how well different collections would fit
The result is suboptimal collection organization with duplicated packages across collections, leading to inefficient builds.
Proposed Solution
Create a new command that automates dependency analysis:
fromager graph suggest-collection ONBOARDING-GRAPH COLLECTION-GRAPHS...For each top-level package in the onboarding graph:
- Calculate its full dependency closure (all transitive dependencies)
- Compare this closure against each existing collection
- Rank collections by "fit" - how many new packages would need to be added
- Display recommendations with statistics
Command Interface
fromager graph suggest-collection [OPTIONS] ONBOARDING-GRAPH COLLECTION-GRAPHS...Arguments:
ONBOARDING-GRAPH: Graph file (graph.json) containing packages to assignCOLLECTION-GRAPHS: One or more graph files for existing collections
Options:
--format [table|json]: Output format (default: table)table: Human-readable rich table showing best-fit collection for each packagejson: Machine-readable JSON with all collections ranked for each package
--help: Show help message
Example:
fromager graph suggest-collection \
onboarding-graph.json \
notebook-graph.json \
rhai-innovation-graph.jsonFunctional Requirements
Core Functionality
-
Load graph files: Parse multiple graph.json files using
DependencyGraph.from_file() -
Identify top-level packages: Extract packages marked with
req_type: 'toplevel'from the ROOT node in the onboarding graph -
Calculate dependency closures: For each top-level package, traverse the full dependency tree to get all transitive dependencies (both install and build dependencies)
-
Compare by package name only: When comparing dependencies, match packages by canonical name ignoring version numbers (e.g.,
numpy==1.26.0andnumpy==1.25.0are considered the same package) -
Rank collections by fit: For each top-level package and collection pair:
- Calculate set difference: packages in dependency closure NOT in collection
- Calculate set intersection: packages in dependency closure AND in collection
- Primary sort: fewest new packages (ascending)
- Secondary sort: highest coverage percentage (descending)
-
Display results: Show each package with its best-fit collection and statistics
Output Requirements
Table format (default --format table):
- One row per onboarding package
- Columns: Package name, Version, Total dependencies, Best fit collection, New packages, Existing packages, Coverage %
- Sorted alphabetically by package name
- Clear, scannable output for human review
JSON format (--format json):
- Array of objects, one per onboarding package
- Each object includes all collections ranked by fit
- Valid, parseable JSON for automation/scripting
- Structure enables downstream processing
Algorithm
For each top-level package in the onboarding graph:
-
Extract dependency closure:
closure = set() # Set of canonical package names visited = set() # Set of node keys to prevent cycles stack = [top_level_node] while stack: node = stack.pop() if node.key in visited: continue visited.add(node.key) closure.add(node.canonicalized_name) # Add all children (both install and build deps) for edge in node.children: stack.append(edge.destination_node)
-
Calculate fit with each collection:
for collection in collections: collection_packages = {node.canonicalized_name for node in collection.get_all_nodes()} new_packages = closure - collection_packages existing_packages = closure & collection_packages fit_score = { "new_packages": len(new_packages), "existing_packages": len(existing_packages), "coverage_pct": len(existing_packages) / len(closure) * 100 }
-
Rank collections:
- Sort by
new_packages(ascending) - fewer is better - Then by
coverage_pct(descending) - higher is better
- Sort by
Example Output
Table Format (Default)
Collection Suggestions for Onboarding Packages
┃ Package ┃ Version ┃ Total Deps ┃ Best Fit ┃ New Pkgs ┃ Existing ┃ Coverage ┃
┃━━━━━━━━━━━━━━━━━━━━━┃━━━━━━━━━┃━━━━━━━━━━━━┃━━━━━━━━━━━━━━━━━┃━━━━━━━━━━┃━━━━━━━━━━┃━━━━━━━━━━┃
┃ fastapi ┃ 0.135.1 ┃ 45 ┃ notebook ┃ 3 ┃ 42 ┃ 93.3% ┃
┃ instructorembedding ┃ 1.0.1 ┃ 123 ┃ notebook ┃ 8 ┃ 115 ┃ 93.5% ┃
┃ torch ┃ 2.9.1 ┃ 89 ┃ rhai-innovation ┃ 12 ┃ 77 ┃ 86.5% ┃
Interpretation:
fastapishould join thenotebookcollection - only 3 new packages needed (93.3% overlap)torchfits better inrhai-innovationdespite needing 12 new packages (still 86.5% overlap)
JSON Format (--format json)
[
{
"package": "fastapi",
"version": "0.135.1",
"total_dependencies": 45,
"collections": [
{
"collection": "notebook",
"new_packages": 3,
"existing_packages": 42,
"total_dependencies": 45,
"coverage_pct": 93.3
},
{
"collection": "rhai-innovation",
"new_packages": 15,
"existing_packages": 30,
"total_dependencies": 45,
"coverage_pct": 66.7
}
]
},
{
"package": "torch",
"version": "2.9.1",
"total_dependencies": 89,
"collections": [
{
"collection": "rhai-innovation",
"new_packages": 12,
"existing_packages": 77,
"total_dependencies": 89,
"coverage_pct": 86.5
},
{
"collection": "notebook",
"new_packages": 34,
"existing_packages": 55,
"total_dependencies": 89,
"coverage_pct": 61.8
}
]
}
]Implementation Guidance
File Location
Add to /src/fromager/commands/graph.py as a new subcommand:
@graph.command()
@click.option("--format", "output_format", type=click.Choice(["table", "json"]), default="table")
@click.argument("onboarding-graph", type=str)
@click.argument("collection-graphs", type=str, nargs=-1, required=True)
@click.pass_obj
def suggest_collection(
wkctx: context.WorkContext,
onboarding_graph: str,
collection_graphs: tuple[str, ...],
output_format: str,
) -> None:
"""Suggest which collection each onboarding package should join."""Utilities to Reuse
From existing codebase:
DependencyGraph.from_file(path)- Load graph filesgraph.get_root_node()- Get ROOT node to find top-level packagesgraph.get_all_nodes()- Get all nodes for building package setsnode.canonicalized_name- Get normalized package name (ignore version)node.children- Traverse dependency edgescanonicalize_name()frompackaging.utils- Normalize namesRequirementType.TOP_LEVEL- Filter for top-level packages
Reference Implementations
For patterns and style:
- Table output: See
src/fromager/commands/stats.pyfor rich table construction - JSON output: See
src/fromager/commands/list_overrides.pyfor--formatoption pattern - Graph traversal: See
src/fromager/commands/graph.pyfor existing graph walking patterns
Helper Functions Needed
def get_dependency_closure(graph: DependencyGraph, node: DependencyNode) -> set[NormalizedName]:
"""Get all transitive dependencies of a node by canonicalized name."""
def get_package_names(graph: DependencyGraph) -> set[NormalizedName]:
"""Extract all unique package names from a graph."""
def extract_collection_name(graph_path: str) -> str:
"""Parse collection name from graph file path."""
def output_table(results: list[dict]) -> None:
"""Display results as a rich table."""
def output_json(results: list[dict]) -> None:
"""Display results as JSON."""Testing Requirements
Unit Tests
Add to /tests/test_graph.py:
def test_get_dependency_closure() -> None:
"""Test extracting full dependency closure by name."""
def test_get_package_names() -> None:
"""Test extracting unique package names from graph."""
def test_extract_collection_name() -> None:
"""Test parsing collection names from file paths."""
def test_suggest_collection_basic(tmp_path: pathlib.Path) -> None:
"""Test basic collection suggestion logic."""Integration Tests
def test_suggest_collection_table_output(tmp_path: pathlib.Path) -> None:
"""Test command with table output format."""
def test_suggest_collection_json_output(tmp_path: pathlib.Path) -> None:
"""Test command with JSON output format."""
def test_suggest_collection_multiple_collections(tmp_path: pathlib.Path) -> None:
"""Test with multiple collection graphs."""Test Data
Use real graph files for integration testing or create minimal test graphs with known dependency structures.
Edge Cases
The implementation should handle:
-
Empty onboarding graph (no top-level packages):
- Output message: "No top-level packages found in onboarding graph"
- Exit gracefully
-
Package with no dependencies:
- Dependency closure contains only the package itself
- All collections have same fit (0 new packages)
-
Multiple versions of same package:
- Already handled by using
canonicalized_nameinstead of versionedkey - Version differences should not affect fit calculation
- Already handled by using
-
Circular dependencies:
- Use
visitedset in traversal to prevent infinite loops - Each node visited only once
- Use
-
Tied fit scores:
- Multiple collections with same
new_packagescount - Use
coverage_pctas tiebreaker (higher is better)
- Multiple collections with same
-
Collection name extraction:
- Handle various file naming patterns:
3.4.2293+notebook-cuda12.9-ubi9-aarch64-graph.json - Fallback to filename without extensions if pattern doesn't match
- Handle various file naming patterns:
-
Large graphs:
- Algorithm is O(n*m) where n=top-level packages, m=collections
- Should be acceptable for typical sizes (100s of packages, handful of collections)
Success Criteria
- Command loads multiple graph files correctly
- Identifies all top-level packages from onboarding graph
- Calculates complete dependency closures
- Compares packages by canonical name (ignoring versions)
- Ranks collections correctly (fewest new packages first)
- Table output is clear and easy to read
- JSON output is valid and machine-parseable
- All unit tests pass
- All integration tests pass
- Edge cases handled gracefully
- Code follows fromager standards (type hints, docstrings, formatting)
- Documentation updated (if applicable)
Related Work
Existing graph commands that may provide useful patterns:
graph to-constraints- Loads graphs, processes nodesgraph explain-duplicates- Analyzes multiple package versionsgraph why- Traverses dependency chainsstats- Displays statistics with rich tables