fix: extend pod hugepages based on dpdk_base_memory_mb#2399
fix: extend pod hugepages based on dpdk_base_memory_mb#2399
Conversation
How to use the Graphite Merge QueueAdd the label main-merge-queue to this PR to add it to the merge queue. You must have a Graphite account in order to use the merge queue. Sign up using this link. An organization admin has required the Graphite Merge Queue in this repository. Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue. This stack of pull requests is managed by Graphite. Learn more about stacking. |
Dependency Review✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.Snapshot WarningsEnsure that dependencies are being submitted on PR branches and consider enabling retry-on-snapshot-warnings. See the documentation for more information and troubleshooting advice. Scanned FilesNone |
Graphite Automations"Add anton/matt/sergey/kristina as reviwers on operator PRs" took an action on this PR • (03/25/26)2 reviewers were added to this PR based on Anton Bykov's automation. |
There was a problem hiding this comment.
Pull request overview
Adjusts the hugepages “offset” used to compute the MEMORY env var for Weka pods, aiming to better account for DPDK base memory needs on frontend containers.
Changes:
- Update
GetHugePagesOffsetdefault logic to scale frontend offset withNumCores(64 MiB per core) instead of a fixed 200 MiB. - Keep the previous 200 MiB default offset for non-frontend containers in the default mode branch.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| offset = 64 * container.Spec.NumCores // 64 is same as the value we put int dpdk_base_memory_mb in resources.json | ||
| } else { |
There was a problem hiding this comment.
The inline comment has a typo (“put int”). Also, the reference to dpdk_base_memory_mb in resources.json is hard to validate/trace from this codebase, and the raw 64 is an unexplained magic number. Please fix the typo and consider replacing 64 with a named constant (or config-driven value) and update the comment to point to the actual source of truth for this offset.
| default: | ||
| offset = 200 | ||
| if container.HasFrontend() { | ||
| offset = 64 * container.Spec.NumCores // 64 is same as the value we put int dpdk_base_memory_mb in resources.json |
There was a problem hiding this comment.
Typo in comment: “put int” should be “put in”.
| offset = 64 * container.Spec.NumCores // 64 is same as the value we put int dpdk_base_memory_mb in resources.json | |
| offset = 64 * container.Spec.NumCores // 64 is same as the value we put in dpdk_base_memory_mb in resources.json |
| if container.HasFrontend() { | ||
| offset = 64 * container.Spec.NumCores // 64 is same as the value we put int dpdk_base_memory_mb in resources.json | ||
| } else { | ||
| offset = 200 | ||
| } |
There was a problem hiding this comment.
This new frontend-specific offset logic only runs when container.Spec.HugepagesOffset == 0. In the common WekaCluster/WekaClient flows, HugepagesOffset is always set (e.g., via template defaults), so this change may not actually affect the pods you’re trying to fix. Consider updating the default hugepages offset where it’s populated (templates/factory) or intentionally leaving HugepagesOffset unset so this function’s defaults take effect.
2e1f2f0 to
b63609b
Compare
b63609b to
429633e
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| numCores = template.Cores.Drive | ||
| dpdkBaseMemoryMbOverride = cluster.Spec.GetOverrides().DpdkBaseMemoryMb.Drive | ||
| case "compute": |
There was a problem hiding this comment.
dpdkBaseMemoryMbOverride is initialized to 64, but each role branch overwrites it with cluster.Spec.GetOverrides().DpdkBaseMemoryMb.<Role> even when that value is 0 (unset). That effectively disables the intended default and can under-allocate hugepages for DPDK. Keep the 64 default unless the override is >0 (e.g., use util.GetNonZeroOrDefault(..., 64) or an explicit if > 0 check) in every role branch.
| // Add DPDK memory to hugepages based on role and cores | ||
| hp.Hugepages += dpdkBaseMemoryMbOverride * numCores | ||
|
|
There was a problem hiding this comment.
GetContainerHugepages now adds DPDK memory (hp.Hugepages += ...) but there are no unit tests asserting the default (+64 MiB per core) behavior or that per-role DpdkBaseMemoryMb overrides are applied. Since templates_test.go already has extensive coverage for hugepages calculations, add cases that validate the DPDK increment for at least one role and for both default and overridden values.
| { | ||
| Name: "DPDK_BASE_MEMORY_MB", | ||
| Value: strconv.Itoa(GetDpdkBaseMemoryMb(f.container)), | ||
| }, |
There was a problem hiding this comment.
DPDK_BASE_MEMORY_MB is sourced from GetDpdkBaseMemoryMb(f.container), but containers created via factory.NewWekaContainerForWekaCluster never populate container.Spec.Overrides.DpdkBaseMemoryMb (no assignments found), while hugepages allocation uses cluster.Spec.GetOverrides().DpdkBaseMemoryMb.<role>. As a result, a cluster-level override will change requested hugepages but the runtime will still see the default 64 MiB, causing config/hugepages mismatch. Propagate the selected per-role DPDK base memory from the cluster spec into each container spec (or otherwise derive it here) so the env var matches the allocator logic.
| hugePagesK8sSuffix = "2Mi" | ||
| offset := GetHugePagesOffset(container) | ||
| wekaMemoryString = fmt.Sprintf("%dMiB", container.Spec.Hugepages-offset) | ||
| wekaMemoryString = fmt.Sprintf("%dMiB", container.Spec.Hugepages-baseOffset-dpdkMemory) |
There was a problem hiding this comment.
wekaMemoryString is now computed as Hugepages - baseOffset - dpdkMemory and can become negative if the hugepages spec/overrides are misconfigured (or if allocator and runtime DPDK values diverge). Consider guarding against negative results (e.g., clamp to 0 or return a validation error earlier) to avoid emitting an invalid MEMORY env var like -64MiB.
| wekaMemoryString = fmt.Sprintf("%dMiB", container.Spec.Hugepages-baseOffset-dpdkMemory) | |
| availableMemoryMiB := container.Spec.Hugepages - baseOffset - dpdkMemory | |
| if availableMemoryMiB < 0 { | |
| availableMemoryMiB = 0 | |
| } | |
| wekaMemoryString = fmt.Sprintf("%dMiB", availableMemoryMiB) |
| if c.wekaClient.Spec.Overrides.DpdkBaseMemoryMb != 0 { | ||
| dpdkBaseMemoryMb = c.wekaClient.Spec.Overrides.DpdkBaseMemoryMb |
There was a problem hiding this comment.
This code reads c.wekaClient.Spec.Overrides.DpdkBaseMemoryMb directly, while other override accesses in this repo consistently go through Spec.GetOverrides() (e.g., wekaClient.Spec.GetOverrides().DriversBuildId) to apply defaults / avoid nil override structs. Consider using c.wekaClient.Spec.GetOverrides().DpdkBaseMemoryMb here for consistency and safety.
| if c.wekaClient.Spec.Overrides.DpdkBaseMemoryMb != 0 { | |
| dpdkBaseMemoryMb = c.wekaClient.Spec.Overrides.DpdkBaseMemoryMb | |
| if c.wekaClient.Spec.GetOverrides().DpdkBaseMemoryMb != 0 { | |
| dpdkBaseMemoryMb = c.wekaClient.Spec.GetOverrides().DpdkBaseMemoryMb |
429633e to
4910260
Compare
4910260 to
0c2a623
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| numCores = template.Cores.Drive | ||
| dpdkBaseMemoryMb = cluster.Spec.GetOverrides().DpdkBaseMemoryMb.Drive | ||
| case "compute": |
There was a problem hiding this comment.
This overwrites the dpdkBaseMemoryMb default (64) with the override value even when the override is unset (0), which effectively disables the intended default. Only override the default when the per-role override is non-zero (e.g., use a non-zero-or-default helper or an explicit if != 0 check).
| numCores = template.Cores.Compute | ||
| dpdkBaseMemoryMb = cluster.Spec.GetOverrides().DpdkBaseMemoryMb.Compute | ||
| case "s3": |
There was a problem hiding this comment.
This overwrites the dpdkBaseMemoryMb default (64) with the override value even when the override is unset (0), which effectively disables the intended default. Only override the default when the per-role override is non-zero (e.g., use a non-zero-or-default helper or an explicit if != 0 check).
| numCores = template.Cores.S3 | ||
| dpdkBaseMemoryMb = cluster.Spec.GetOverrides().DpdkBaseMemoryMb.S3 | ||
| case "nfs": |
There was a problem hiding this comment.
This overwrites the dpdkBaseMemoryMb default (64) with the override value even when the override is unset (0), which effectively disables the intended default. Only override the default when the per-role override is non-zero (e.g., use a non-zero-or-default helper or an explicit if != 0 check).
| numCores = template.Cores.Nfs | ||
| dpdkBaseMemoryMb = cluster.Spec.GetOverrides().DpdkBaseMemoryMb.Nfs | ||
| case "smbw": |
There was a problem hiding this comment.
This overwrites the dpdkBaseMemoryMb default (64) with the override value even when the override is unset (0), which effectively disables the intended default. Only override the default when the per-role override is non-zero (e.g., use a non-zero-or-default helper or an explicit if != 0 check).
| dpdkBaseMemoryMb := 64 | ||
| if c.wekaClient.Spec.Overrides.DpdkBaseMemoryMb != 0 { | ||
| dpdkBaseMemoryMb = c.wekaClient.Spec.Overrides.DpdkBaseMemoryMb | ||
| } | ||
| return c.getClientCores() * (1500 + dpdkBaseMemoryMb) |
There was a problem hiding this comment.
This function increases hugepages by dpdkBaseMemoryMb per core, but the client WekaContainer spec created in this reconciler does not persist the effective DPDK base memory anywhere (e.g., WekaContainerSpec.DpdkBaseMemoryMb). Downstream code (pod env vars and MEMORY calculation) currently reads container.Spec.DpdkBaseMemoryMb, which will remain 0 and diverge from the hugepages sizing here. Persist the chosen DPDK base memory into the container spec so the runtime and MEMORY computation stay consistent with the hugepages reservation.
| numCores = template.Cores.Smbw | ||
| dpdkBaseMemoryMb = cluster.Spec.GetOverrides().DpdkBaseMemoryMb.Smbw | ||
| case "data-services": |
There was a problem hiding this comment.
This overwrites the dpdkBaseMemoryMb default (64) with the override value even when the override is unset (0), which effectively disables the intended default. Only override the default when the per-role override is non-zero (e.g., use a non-zero-or-default helper or an explicit if != 0 check).
| numCores = template.Cores.DataServices | ||
| dpdkBaseMemoryMb = cluster.Spec.GetOverrides().DpdkBaseMemoryMb.DataServices | ||
| } |
There was a problem hiding this comment.
This overwrites the dpdkBaseMemoryMb default (64) with the override value even when the override is unset (0), which effectively disables the intended default. Only override the default when the per-role override is non-zero (e.g., use a non-zero-or-default helper or an explicit if != 0 check).
0c2a623 to
33cc9cb
Compare
33cc9cb to
677ac07
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } | ||
| if container.Spec.DpdkBaseMemoryMb != dpdkBaseMemoryMb { | ||
| container.Spec.DpdkBaseMemoryMb = dpdkBaseMemoryMb | ||
| changed = true |
There was a problem hiding this comment.
When DpdkBaseMemoryMb changes, the client container's Hugepages may also need to be updated to keep total hugepages in sync with the computed formula (since DPDK memory is part of the default hugepages sizing). Right now only container.Spec.DpdkBaseMemoryMb is updated, which can leave an existing container under-provisioned for hugepages. Consider recomputing/updating container.Spec.Hugepages as well when DpdkBaseMemoryMb changes (at least when wekaClient.Spec.HugePages == 0 so hugepages are operator-computed).
| changed = true | |
| changed = true | |
| // When hugepages are operator-computed (wekaClient.Spec.HugePages == 0), | |
| // a change in DPDK base memory should also be reflected in the container's | |
| // hugepages allocation, since the sizing formula includes DPDK memory. | |
| if c.wekaClient.Spec.HugePages == 0 { | |
| if container.Spec.Hugepages != newClientSpec.Hugepages { | |
| container.Spec.Hugepages = newClientSpec.Hugepages | |
| changed = true | |
| } | |
| } |
677ac07 to
8587d07
Compare
8587d07 to
ec50516
Compare
ec50516 to
b781661
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
internal/controllers/resources/pod.go:1065
- The comment says this function includes DPDK base memory, but
GetHugePagesOffsetcurrently only returnscontainer.Spec.HugepagesOffset(or a mode-based default) and does not incorporatecontainer.Spec.DpdkBaseMemoryMb. Either update the implementation to add the DPDK portion (e.g.,DpdkBaseMemoryMb * NumCoreswhere applicable) or adjust the comment to match the actual behavior.
// GetHugePagesOffset returns the hugepages offset for a container (for Weka allocation), including DPDK base memory.
func GetHugePagesOffset(container *weka.WekaContainer) int {
offset := container.Spec.HugepagesOffset
// get default if not set
if offset == 0 {
switch container.Spec.Mode {
case weka.WekaContainerModeDrive:
if container.UsesDriveSharing() {
offset = 200 * container.Spec.NumCores
} else {
offset = 200 * container.Spec.NumDrives
}
case weka.WekaContainerModeSSDProxy:
offset = config.Config.DriveSharing.SsdProxyHugepagesOffsetMiB
default:
offset = 200
}
}
return offset
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Add DPDK base memory to both hugepages and offset | ||
| dpdkTotalMemory := dpdkBaseMemoryMb * numCores | ||
| hp.Hugepages += dpdkTotalMemory | ||
| hp.HugepagesOffset += dpdkTotalMemory |
There was a problem hiding this comment.
DPDK base memory is now added to both Hugepages and HugepagesOffset, but the unit tests in templates_test.go only assert Hugepages. Add assertions for HugepagesOffset to ensure the offset adjustment remains correct and to prevent regressions in MEMORY sizing (Hugepages - offset).
| // Add DPDK base memory to both hugepages and offset | |
| dpdkTotalMemory := dpdkBaseMemoryMb * numCores | |
| hp.Hugepages += dpdkTotalMemory | |
| hp.HugepagesOffset += dpdkTotalMemory | |
| // Add DPDK base memory to hugepages so that MEMORY (Hugepages - HugepagesOffset) includes it | |
| dpdkTotalMemory := dpdkBaseMemoryMb * numCores | |
| hp.Hugepages += dpdkTotalMemory |
b781661 to
ebec559
Compare
ebec559 to
3b81136
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (1)
internal/controllers/resources/pod.go:1056
- The doc comment says
GetHugePagesOffsetincludes DPDK base memory, but the function currently only returnscontainer.Spec.HugepagesOffset(or the mode-based default) and does not referenceDpdkBaseMemoryMb. Either update the comment to reflect the actual behavior (offset is already expected to include any DPDK adjustment inSpec.HugepagesOffset), or incorporate DPDK into this helper to match the comment.
// GetHugePagesOffset returns the hugepages offset for a container (for Weka allocation), including DPDK base memory.
func GetHugePagesOffset(container *weka.WekaContainer) int {
offset := container.Spec.HugepagesOffset
// get default if not set
if offset == 0 {
switch container.Spec.Mode {
case weka.WekaContainerModeDrive:
if container.UsesDriveSharing() {
offset = 200 * container.Spec.NumCores
} else {
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

No description provided.