OCPCRT-450: Improve ClusterBot's cloud platform dispursment#604
OCPCRT-450: Improve ClusterBot's cloud platform dispursment#604thiagoalessio wants to merge 4 commits intoopenshift:mainfrom
Conversation
Renaming this function to better reflect its purpose, especially if at some point we have more than two accounts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Introduce the data structures that will replace the hardcoded switch block for quota-slice selection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extract the quota-slice selection logic into a standalone function that loops over platformQuotaSlices entries, queries Boskos metrics for each, and returns the profile with the most free resources. The old switch block still exists for now. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ction Got rid of the switch block in favor of the newly introduced `selectCloudAccountProfile`. Also replaced the other switch block with those wrapper functions in `prow.go` with a single nil check that calls applyClusterProfile directly from the profile struct fields. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@thiagoalessio: This pull request references OCPCRT-450 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
WalkthroughThis PR introduces a quota-slice mechanism for cloud platform account management. It adds a CloudAccountProfile structure and selection logic based on Boskos metrics, replaces per-platform boolean flags with structured account profiles, and consolidates platform-specific conversion logic into a generic profile application function. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment Tip CodeRabbit can generate a title for your PR based on the changes with custom instructions.Set the |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: thiagoalessio The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
@thiagoalessio: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/manager/prow.go (1)
1732-1734:⚠️ Potential issue | 🟠 MajorInitialize
Environmentbefore settingBASE_DOMAIN.Many
launchconfigs have noenvironment:block. When that happens, Lines 1732-1734 silently drop the alternate-domain override, soaws-2/azure-2jobs still use the default base domain.Suggested fix
- if accountDomain != "" && matchedTarget.MultiStageTestConfiguration != nil && matchedTarget.MultiStageTestConfiguration.Environment != nil { - matchedTarget.MultiStageTestConfiguration.Environment["BASE_DOMAIN"] = accountDomain - } + if accountDomain != "" { + if matchedTarget.MultiStageTestConfiguration.Environment == nil { + matchedTarget.MultiStageTestConfiguration.Environment = make(citools.TestEnvironment) + } + matchedTarget.MultiStageTestConfiguration.Environment["BASE_DOMAIN"] = accountDomain + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/manager/prow.go` around lines 1732 - 1734, The code sets matchedTarget.MultiStageTestConfiguration.Environment["BASE_DOMAIN"] without ensuring Environment is initialized; modify the block that checks accountDomain and matchedTarget.MultiStageTestConfiguration (the lines manipulating matchedTarget.MultiStageTestConfiguration.Environment) to allocate a new map if Environment is nil (e.g., set matchedTarget.MultiStageTestConfiguration.Environment = make(map[string]string)) before assigning the "BASE_DOMAIN" key so alternate-domain overrides are applied even when no environment: block exists in the launch config.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/manager/manager.go`:
- Around line 154-156: The current code returns an error when
lClient.Metrics(accounts[i].QuotaSlice) fails, which aborts scheduling; change
this to log the error and fall back to using the primary/default account instead
of returning nil. Specifically, in the block that calls lClient.Metrics for
accounts[i].QuotaSlice, catch the error, emit a warning (including the error)
and set metrics to the primary account's metrics (or mark this candidate as
using the primary quota) so the caller can continue; do this change around the
lClient.Metrics call and its error handling to avoid returning fmt.Errorf and
instead continue with a fallback metrics value.
---
Outside diff comments:
In `@pkg/manager/prow.go`:
- Around line 1732-1734: The code sets
matchedTarget.MultiStageTestConfiguration.Environment["BASE_DOMAIN"] without
ensuring Environment is initialized; modify the block that checks accountDomain
and matchedTarget.MultiStageTestConfiguration (the lines manipulating
matchedTarget.MultiStageTestConfiguration.Environment) to allocate a new map if
Environment is nil (e.g., set
matchedTarget.MultiStageTestConfiguration.Environment = make(map[string]string))
before assigning the "BASE_DOMAIN" key so alternate-domain overrides are applied
even when no environment: block exists in the launch config.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 0abbf168-63b8-4399-859d-1a757db3749a
📒 Files selected for processing (4)
pkg/manager/manager.gopkg/manager/manager_test.gopkg/manager/prow.gopkg/manager/types.go
| metrics, err := lClient.Metrics(accounts[i].QuotaSlice) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("failed to get metrics for %q leases: %v", accounts[i].QuotaSlice, err) |
There was a problem hiding this comment.
Fall back to the primary account when Boskos metrics fail.
This is a best-effort balancing check. Returning an error here means a transient Boskos issue blocks every amd64 aws / azure / gcp launch instead of just using the default account.
Suggested fix
metrics, err := lClient.Metrics(accounts[i].QuotaSlice)
if err != nil {
- return nil, fmt.Errorf("failed to get metrics for %q leases: %v", accounts[i].QuotaSlice, err)
+ klog.Warningf("failed to get metrics for %q leases, using primary account: %v", accounts[i].QuotaSlice, err)
+ return nil, nil
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| metrics, err := lClient.Metrics(accounts[i].QuotaSlice) | |
| if err != nil { | |
| return nil, fmt.Errorf("failed to get metrics for %q leases: %v", accounts[i].QuotaSlice, err) | |
| metrics, err := lClient.Metrics(accounts[i].QuotaSlice) | |
| if err != nil { | |
| klog.Warningf("failed to get metrics for %q leases, using primary account: %v", accounts[i].QuotaSlice, err) | |
| return nil, nil | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/manager/manager.go` around lines 154 - 156, The current code returns an
error when lClient.Metrics(accounts[i].QuotaSlice) fails, which aborts
scheduling; change this to log the error and fall back to using the
primary/default account instead of returning nil. Specifically, in the block
that calls lClient.Metrics for accounts[i].QuotaSlice, catch the error, emit a
warning (including the error) and set metrics to the primary account's metrics
(or mark this candidate as using the primary quota) so the caller can continue;
do this change around the lClient.Metrics call and its error handling to avoid
returning fmt.Errorf and instead continue with a fallback metrics value.
Previously, selecting which cloud account to use for a platform (e.g. aws vs aws-2) required maintaining two nearly identical switch blocks: one in
manager.goto query Boskos metrics and pick the best account, and another inprow.goto apply the corresponding cluster profile to the ProwJob. Adding or removing a cloud account meant editing both switches and adding/removing wrapper functions.Now, all account information lives in a single
platformQuotaSlicesmap. A genericselectCloudAccountProfilefunction loops over entries, queries Boskos, and picks the one with the most free resources. The result is passed directly toapplyClusterProfile.Why the quota-slice names are still listed explicitly
Ideally, I wish we could dynamically discover available quota-slice resource types from Boskos, but both Claude and I couldn't find something that would allow us to fetch a list of resources in https://github.com/kubernetes-sigs/boskos/blob/master/client/client.go.
This could have been refactored even further, extracting these values to a config file or something, but it is already better than before, so after these changes, adding a third AWS account for example can be done by simply by appending a new entry in
platformQuotaSlices.Suggestion on how to review
I'd recommend to check the individual commits, I tried to make each one small and self-contained, each advancing one refactoring step towards the end goal.