Collection Optimization Tools - Conversation Summary
This document summarizes the discussion and design process that led to the creation of two new fromager commands for optimizing collection organization.
Context
Date: 2026-03-21
Participants: Doug Hellmann, Claude (Sonnet 4.5)
Initial Request
User Request:
I want to add a command to fromager to help me decide how to optimize the builds by combining sets of packages into collections. I already have some collections defined, and each produces a graph file describing the dependencies of the items in the collection. I also have a collection of packages being "onboarded" to the build that have not been added to the permanent collections.
Goal: Create tools to help organize packages into collections efficiently to minimize duplicated builds and optimize parallel build processes.
Discovery Phase
Sample Data Analysis
Located sample graph files in ~/tmp/sample-graphs/:
- 3.4.2293+notebook-cuda12.9-ubi9-aarch64-graph.json - 444 packages, 63 top-level
- 3.4.2386+onboarding-cuda12.9-ubi9-x86_64-graph.json - 851 packages, 136 top-level
- 3.4.2440+rhai-innovation-cuda12.9-ubi9-x86_64-graph.json - 418 packages, 16 top-level
Codebase Exploration
Existing Infrastructure:
- DependencyGraph class (src/fromager/dependency_graph.py) - Loads and manipulates graph.json files
- Graph files use JSON format with "package==version" keys and special "" ROOT node
- TOP_LEVEL packages identified via req_type: 'toplevel' in ROOT node edges
- Existing graph commands in src/fromager/commands/graph.py follow @graph.command() pattern
- Rich table output pattern in stats.py, JSON output pattern in list_overrides.py
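The graph file conventions listed above can be illustrated with a small sketch that reads the top-level packages straight from the JSON structure. It deliberately avoids the real DependencyGraph class; the helper name top_level_keys, the sample data, and the "install" edge type string are assumptions made for illustration only.

```python
import json

# Invented sample data shaped like a graph.json file (not taken from a real build).
SAMPLE_GRAPH = json.loads("""
{
  "": {"edges": [
    {"key": "fastapi==0.135.1", "req_type": "toplevel", "req": "fastapi"}
  ]},
  "fastapi==0.135.1": {
    "canonicalized_name": "fastapi",
    "version": "0.135.1",
    "edges": [
      {"key": "example-dep==1.0", "req_type": "install", "req": "example-dep"}
    ]
  },
  "example-dep==1.0": {
    "canonicalized_name": "example-dep",
    "version": "1.0",
    "edges": []
  }
}
""")

ROOT = ""  # the special ROOT node uses the empty string as its key


def top_level_keys(graph: dict) -> list[str]:
    """Return the keys of nodes that ROOT reaches through 'toplevel' edges."""
    return [
        edge["key"]
        for edge in graph[ROOT]["edges"]
        if edge["req_type"] == "toplevel"
    ]


print(top_level_keys(SAMPLE_GRAPH))
```

In the actual commands, DependencyGraph.from_file() would presumably replace the raw json handling shown here.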
Command 1: graph suggest-collection
Purpose
Help assign onboarding packages to existing collections by analyzing dependency overlap.
Requirements Gathering
Key Questions & Decisions:
- Version Matching: Name only (ignore versions) ✓
  - Rationale: Collections regularly build multiple versions of the same package
- Dependency Depth: Full transitive closure ✓
  - Rationale: Need complete picture of what would be added
- Output Format: Rich table (default) + JSON option ✓
  - User note: "Default to a rich table but provide an option for JSON output"
- Command Location: Subcommand under graph group ✓
  - Fits with existing graph commands
Algorithm
For each top-level package in onboarding graph:
- Extract full dependency closure (all transitive dependencies by canonical name)
- For each existing collection:
  - Calculate new packages = closure - collection packages
  - Calculate existing packages = closure ∩ collection packages
  - Calculate coverage % = existing / total * 100
- Rank collections by:
  - Primary: fewest new packages (ascending)
  - Secondary: highest coverage % (descending)
- Display best-fit collection with statistics
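The ranking steps above can be sketched with plain set operations. This is a minimal illustration, not the planned implementation: the Fit dataclass and rank_collections function are hypothetical names, and the inputs are assumed to be sets of canonical package names that have already been extracted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fit:
    """Overlap statistics for one candidate collection (hypothetical type)."""

    collection: str
    new_packages: int
    existing_packages: int
    coverage: float  # percent of the closure already in the collection


def rank_collections(
    closure: set[str], collections: dict[str, set[str]]
) -> list[Fit]:
    """Rank collections for one onboarding package, best fit first."""
    fits = []
    for name, packages in collections.items():
        new = closure - packages        # would have to be added
        existing = closure & packages   # already built by the collection
        coverage = 100.0 * len(existing) / len(closure) if closure else 0.0
        fits.append(Fit(name, len(new), len(existing), coverage))
    # Fewest new packages first, then highest coverage, then name.
    return sorted(fits, key=lambda f: (f.new_packages, -f.coverage, f.collection))


closure = {"fastapi", "starlette", "pydantic", "anyio"}
collections = {
    "notebook": {"starlette", "pydantic", "anyio", "numpy"},
    "rhai-innovation": {"pydantic", "torch"},
}
for fit in rank_collections(closure, collections):
    print(fit)
```

Because the new and existing packages partition the closure, the two ranking keys agree for any single onboarding package; the sketch adds the collection name as a final tie-breaker so that the order is deterministic, which is an assumption rather than something the design specifies.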
Output Example
Table:
Collection Suggestions for Onboarding Packages
┃ Package ┃ Version ┃ Total Deps ┃ Best Fit ┃ New Pkgs ┃ Existing ┃ Coverage ┃
┃━━━━━━━━━━━━━━━━━━━━━┃━━━━━━━━━┃━━━━━━━━━━━━┃━━━━━━━━━━━━━━━━━┃━━━━━━━━━━┃━━━━━━━━━━┃━━━━━━━━━━┃
┃ fastapi ┃ 0.135.1 ┃ 45 ┃ notebook ┃ 3 ┃ 42 ┃ 93.3% ┃
┃ instructorembedding ┃ 1.0.1 ┃ 123 ┃ notebook ┃ 8 ┃ 115 ┃ 93.5% ┃
┃ torch ┃ 2.9.1 ┃ 89 ┃ rhai-innovation ┃ 12 ┃ 77 ┃ 86.5% ┃
Command:
fromager graph suggest-collection ONBOARDING-GRAPH COLLECTION-GRAPHS...
Implementation Notes
- Location: src/fromager/commands/graph.py
- Key functions: get_dependency_closure(), get_package_names(), extract_collection_name()
- Reuse: DependencyGraph.from_file(), .get_all_nodes(), .get_root_node()
- Pattern references: stats.py (tables), list_overrides.py (--format option)
GitHub Issue
Command 2: graph suggest-base
Purpose
Identify packages shared across multiple collections that should be factored out into a base collection for efficiency.
Requirements Gathering
Key Questions & Decisions:
-
Package Scope: All packages (including transitive deps) ✓
- Rationale: Any shared dependency is a candidate for factoring out
-
Minimum Threshold: Configurable with default of 2 ✓
- User note: "Make the threshold configurable but default to 2"
- Allows flexibility for different use cases
-
Output Grouping: Individual packages ✓
- Simpler to review and make decisions
-
Output Format: Rich table (default) + JSON option ✓
- Consistent with suggest-collection command
Use Case
Problem:
- Multiple collections share common dependencies (numpy, pandas, setuptools, etc.)
- Each collection builds these packages independently
- Wasted build time and resources
Solution:
- Build shared packages once in a base collection
- Build specialized collections in parallel, all depending on base
- Dramatically reduce total build time
Algorithm
- Load all collection graphs and extract all packages (by canonical name)
- For each unique package, count how many collections contain it
- Filter packages appearing in >= --min-collections collections (default: 2)
- Rank by collection count (descending), then alphabetically
- If --base provided, mark which packages are already in base vs. new candidates
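The counting and ranking steps above might look roughly like the following. The function name find_shared_packages comes from the implementation notes, but its signature and return shape here are assumptions, and the inputs are assumed to be sets of canonical names per collection.

```python
def find_shared_packages(
    collections: dict[str, set[str]], min_collections: int = 2
) -> list[tuple[str, list[str]]]:
    """Return (package, collections containing it), most widely shared first."""
    membership: dict[str, list[str]] = {}
    for collection_name, packages in collections.items():
        for package in packages:
            membership.setdefault(package, []).append(collection_name)

    shared = [
        (package, sorted(names))
        for package, names in membership.items()
        if len(names) >= min_collections
    ]
    # Collection count descending, then package name alphabetically.
    shared.sort(key=lambda item: (-len(item[1]), item[0]))
    return shared


collections = {
    "notebook": {"numpy", "pandas", "jupyterlab"},
    "rhai-innovation": {"numpy", "torch"},
    "data-science": {"numpy", "pandas"},
}
for package, names in find_shared_packages(collections):
    print(f"{package}: {len(names)}/{len(collections)} ({', '.join(names)})")
```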
Output Example
Without existing base:
Base Collection Candidates
Analyzing 3 collections: notebook, rhai-innovation, data-science
Threshold: packages appearing in >= 2 collections
┃ Package ┃ Collections ┃ Coverage ┃ Appears In ┃
┃━━━━━━━━━━━━━━━┃━━━━━━━━━━━━━┃━━━━━━━━━━┃━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┃
┃ numpy ┃ 3/3 ┃ 100.0% ┃ notebook, rhai-innovation, data-sci ┃
┃ setuptools ┃ 3/3 ┃ 100.0% ┃ notebook, rhai-innovation, data-sci ┃
┃ wheel ┃ 3/3 ┃ 100.0% ┃ notebook, rhai-innovation, data-sci ┃
┃ pandas ┃ 2/3 ┃ 66.7% ┃ notebook, data-sci ┃
Summary:
Shared packages found: 6
Packages in all collections: 3 (50.0%)
With existing base:
Base Collection Enhancement Candidates
Current base: 45 packages
┃ Package ┃ Collections ┃ Coverage ┃ In Base ┃ Appears In ┃
┃━━━━━━━━━━━━━━━┃━━━━━━━━━━━━━┃━━━━━━━━━━┃━━━━━━━━━┃━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┃
┃ numpy ┃ 3/3 ┃ 100.0% ┃ Yes ┃ notebook, rhai-innovation, data-sci ┃
┃ pandas ┃ 2/3 ┃ 66.7% ┃ No ┃ notebook, data-sci ┃
Summary:
Already in base: 3 (50.0%)
New candidates: 3 (50.0%)
Recommendation: Add pandas, requests to base collection
Command:
# Find shared packages
fromager graph suggest-base COLLECTION-GRAPHS... [--min-collections N]
# Enhance existing base
fromager graph suggest-base COLLECTION-GRAPHS... --base BASE-GRAPH [--min-collections N]
Implementation Notes
- Location: src/fromager/commands/graph.py
- Key functions: get_all_package_names(), find_shared_packages()
- Options: --base, --min-collections, --format
- Reuse patterns from suggest-collection command
GitHub Issue
Key Design Decisions
1. Package Matching by Name Only
Both commands ignore package versions when comparing dependencies.
Rationale:
- Collections regularly build multiple versions of the same package
- Version differences shouldn't affect collection assignment decisions
- Focus on package presence, not specific versions
Implementation:
- Use node.canonicalized_name instead of node.key (which includes version)
- Use canonicalize_name() from packaging.utils for normalization
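To show what name-only matching means in practice, the sketch below normalizes names with the PEP 503 rule, which is the rule that packaging.utils.canonicalize_name() applies. It uses only the standard library so that it runs without packaging installed; the helper names canonicalize and name_from_key are hypothetical, and the real commands would read node.canonicalized_name instead of parsing keys.

```python
import re


def canonicalize(name: str) -> str:
    """PEP 503 normalization: runs of '-', '_' and '.' become '-', lowercased."""
    return re.sub(r"[-_.]+", "-", name).lower()


def name_from_key(key: str) -> str:
    """Drop the '==version' suffix of a graph key and normalize the name."""
    return canonicalize(key.partition("==")[0])


# Two versions of the same package collapse to a single name.
keys = ["Typing_Extensions==4.12.2", "typing-extensions==4.15.0"]
print({name_from_key(key) for key in keys})
```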
2. Full Dependency Closure Analysis
Both commands analyze complete transitive dependency trees, not just top-level packages.
Rationale:
- Transitive dependencies are just as important for build efficiency
- A package with shared dependencies is a better fit even if top-level packages differ
- Complete picture enables better optimization decisions
Implementation:
- Traverse full dependency graph using depth-first search
- Use visited set to handle circular dependencies
- Include both install and build dependencies
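A depth-first traversal with a visited set, as described above, can be sketched directly over the graph JSON. The function name get_dependency_closure comes from the implementation notes; the signature, the decision to include the starting package in its own closure, and the edge type strings in the sample data are assumptions.

```python
def get_dependency_closure(graph: dict, start_key: str) -> set[str]:
    """Collect canonical names of every node reachable from start_key."""
    names: set[str] = set()
    visited: set[str] = set()
    stack = [start_key]
    while stack:
        key = stack.pop()
        if key in visited:
            continue  # already handled; this also breaks dependency cycles
        visited.add(key)
        node = graph[key]
        names.add(node["canonicalized_name"])
        # Follow every edge regardless of type, so build and install
        # dependencies are both included.
        stack.extend(edge["key"] for edge in node.get("edges", []))
    return names


# Invented graph with a cycle (a -> b -> a) and a build dependency (a -> c).
GRAPH = {
    "": {"edges": [{"key": "a==1.0", "req_type": "toplevel"}]},
    "a==1.0": {
        "canonicalized_name": "a",
        "edges": [
            {"key": "b==2.0", "req_type": "install"},
            {"key": "c==3.0", "req_type": "build-system"},
        ],
    },
    "b==2.0": {
        "canonicalized_name": "b",
        "edges": [{"key": "a==1.0", "req_type": "install"}],
    },
    "c==3.0": {"canonicalized_name": "c", "edges": []},
}
print(sorted(get_dependency_closure(GRAPH, "a==1.0")))
```

An explicit stack is used instead of recursion so that deep dependency chains cannot hit Python's recursion limit.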
3. Consistent Output Formats
Both commands default to rich tables with --format json option.
Rationale:
- Human-readable tables for interactive use
- Machine-readable JSON for automation/scripting
- Consistency across related commands improves UX
Implementation:
- Follow pattern from existing stats.py and list_overrides.py
- Use rich library for table formatting
- JSON output enables pipeline integration
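One way to keep the two output formats in step is to build the result rows once and render them at the very end. The sketch below is a standard-library stand-in: format_results is a hypothetical helper, the row keys are invented, and the plain-text branch only approximates what a rich table would show.

```python
import json


def format_results(rows: list[dict], output_format: str = "table") -> str:
    """Render the same rows either as JSON or as a simple aligned table."""
    if output_format == "json":
        return json.dumps(rows, indent=2, sort_keys=True)
    if output_format != "table":
        raise ValueError(f"unsupported format: {output_format!r}")
    if not rows:
        return ""
    headers = list(rows[0])
    # Each column is as wide as its longest cell or its header.
    widths = {
        h: max(len(h), *(len(str(row[h])) for row in rows)) for h in headers
    }
    lines = ["  ".join(h.ljust(widths[h]) for h in headers)]
    for row in rows:
        lines.append("  ".join(str(row[h]).ljust(widths[h]) for h in headers))
    return "\n".join(lines)


rows = [
    {"package": "numpy", "collections": 3},
    {"package": "pandas", "collections": 2},
]
print(format_results(rows))
print(format_results(rows, "json"))
```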
4. Configurable Thresholds
suggest-base uses configurable --min-collections threshold.
Rationale:
- Different use cases require different thresholds
- Finding packages in ALL collections (min = N) identifies universal dependencies
- Finding packages in ANY 2+ collections (min = 2) maximizes candidates
- Default of 2 balances utility and specificity
Implementation:
- Validation: threshold must be >= 2 and <= number of collections
- Clear error messages for invalid thresholds
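The validation rule above is small enough to show in full. This is a sketch under assumptions: validate_min_collections is a hypothetical helper, the error wording is invented, and the explicit check for fewer than two graphs is an addition implied by the stated bounds rather than something the design spells out.

```python
def validate_min_collections(min_collections: int, num_collections: int) -> int:
    """Return the threshold if it is usable, otherwise raise ValueError."""
    if num_collections < 2:
        raise ValueError(
            f"need at least 2 collection graphs to compare, got {num_collections}"
        )
    if not 2 <= min_collections <= num_collections:
        raise ValueError(
            f"--min-collections must be between 2 and {num_collections}, "
            f"got {min_collections}"
        )
    return min_collections


print(validate_min_collections(2, 3))
```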
5. Optional Base Enhancement
suggest-base supports enhancing existing base collections via --base option.
Rationale:
- Base collections evolve over time
- Need to add new shared packages as collections grow
- Want to validate existing base packages are still useful
Implementation:
- Load existing base graph
- Compare against suggested candidates
- Mark packages as "already in base" vs. "new candidates"
- Identify orphaned base packages (in base but not used)
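The comparison against an existing base reduces to three set operations. In this sketch, compare_with_base and its result keys are hypothetical, and "orphaned" is assumed to mean a base package that no analyzed collection contains.

```python
def compare_with_base(
    candidates: set[str], base: set[str], all_collection_packages: set[str]
) -> dict[str, list[str]]:
    """Split candidates by base membership and find unused base packages."""
    return {
        "already_in_base": sorted(candidates & base),
        "new_candidates": sorted(candidates - base),
        "orphaned": sorted(base - all_collection_packages),
    }


result = compare_with_base(
    candidates={"numpy", "pandas"},
    base={"numpy", "six"},
    all_collection_packages={"numpy", "pandas", "torch"},
)
print(result)
```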
Technical Architecture
Graph File Structure
{
"": { // ROOT node
"edges": [
{"key": "package==1.0", "req_type": "toplevel", "req": "package>=1.0"}
]
},
"package==1.0": {
"canonicalized_name": "package",
"version": "1.0",
"download_url": "...",
"pre_built": false,
"edges": [...]
}
}
Code Organization
Both commands added to: src/fromager/commands/graph.py
Shared utilities:
- get_package_names() / get_all_package_names() - Extract package sets
- extract_collection_name() - Parse collection name from file path
- Output formatting functions for table and JSON
Command-specific:
- suggest_collection: get_dependency_closure() for transitive deps
- suggest_base: find_shared_packages() for overlap analysis
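The discussion does not say how extract_collection_name() parses a path, so the following is only a guess based on the sample file names and the collection names shown in the example tables. It assumes the name is the text after the "+" version separator, up to the first dash-separated segment containing a digit (such as cuda12.9), and it falls back to the file stem for simple names like notebook.json.

```python
from pathlib import Path


def extract_collection_name(path: str) -> str:
    """Heuristically derive a collection name from a graph file path."""
    stem = Path(path).name
    for suffix in ("-graph.json", ".json"):
        if stem.endswith(suffix):
            stem = stem[: -len(suffix)]
            break
    # Drop a leading "<version>+" prefix when present.
    _, plus, rest = stem.partition("+")
    if plus:
        stem = rest
    # Keep leading segments until one looks like a variant tag (has a digit).
    parts = []
    for segment in stem.split("-"):
        if any(ch.isdigit() for ch in segment):
            break
        parts.append(segment)
    return "-".join(parts) or stem


for sample in (
    "3.4.2293+notebook-cuda12.9-ubi9-aarch64-graph.json",
    "3.4.2440+rhai-innovation-cuda12.9-ubi9-x86_64-graph.json",
    "data-science.json",
):
    print(extract_collection_name(sample))
```

A collection whose own name contains a digit would be truncated by this heuristic, so a real implementation would likely need a more explicit naming rule.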
Testing Strategy
Both commands require:
- Unit tests for helper functions
- Integration tests with real/generated graph files
- Edge case coverage (empty graphs, circular deps, tied rankings, etc.)
- Type checking with mypy
- Code formatting with ruff
Related Work
Existing Fromager Commands
- graph to-constraints - Graph conversion
- graph explain-duplicates - Version conflict analysis
- graph why - Dependency chain explanation
- stats - Build statistics with rich tables
Pattern References
- Table output: src/fromager/commands/stats.py
- JSON output: src/fromager/commands/list_overrides.py
- Graph loading: Standard pattern across all graph commands
- Click decorators: Consistent argument/option patterns
Success Metrics
For suggest-collection (#971)
- Correctly identifies best-fit collections based on dependency overlap
- Saves time in manually assigning 100+ onboarding packages
- Enables data-driven collection organization decisions
For suggest-base (#973)
- Identifies high-value candidates for base collection
- Reduces build time by eliminating duplicate builds
- Enables efficient parallel builds of multiple collections
Overall Impact
- Improved build efficiency through better collection organization
- Reduced manual effort in collection management
- Data-driven decisions replacing guesswork
- Clear visibility into collection relationships and overlap
Next Steps
- Implementation - Both commands ready for development
- Testing - Use sample graphs in ~/tmp/sample-graphs/ for validation
- Integration - Commands work together:
  - Use suggest-base to create/enhance base collection
  - Use suggest-collection to assign new packages to specialized collections
- Documentation - Update user docs with new commands and workflows
Issue References
- Epic: Issue #972 - Collection optimization tools
- Command 1: Issue #971 - Add graph suggest-collection command
- Command 2: Issue #973 - Add graph suggest-base command
Appendix: Example Workflow
Step 1: Identify Base Collection Candidates
fromager graph suggest-base \
notebook.json \
rhai-innovation.json \
data-science.json \
--min-collections 3
Result: Packages appearing in all 3 collections
Step 2: Create Base Collection
(Manual process - build the identified shared packages)
Step 3: Assign Onboarding Packages
fromager graph suggest-collection \
onboarding.json \
notebook.json \
rhai-innovation.json \
data-science.json
Result: Best-fit collection for each onboarding package
Step 4: Enhance Base Collection (Later)
fromager graph suggest-base \
notebook.json \
rhai-innovation.json \
data-science.json \
--base current-base.json \
--min-collections 2
Result: New candidates to add to existing base