fix: extend pod hugepages based on dpdk_base_memory_mb #2399

Open
assafgi wants to merge 1 commit into main from 03-25-fix_extend_pod_hugepages_based_on_dpdk_base_memory_mb
Conversation

@assafgi assafgi commented Mar 25, 2026

No description provided.

@assafgi assafgi marked this pull request as ready for review March 25, 2026 17:10
@assafgi assafgi requested a review from a team as a code owner March 25, 2026 17:10
Copilot AI review requested due to automatic review settings March 25, 2026 17:10
github-actions bot commented Mar 25, 2026

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Snapshot Warnings

⚠️: No snapshots were found for the head SHA 3b81136.
Ensure that dependencies are being submitted on PR branches and consider enabling retry-on-snapshot-warnings. See the documentation for more information and troubleshooting advice.

Scanned Files

None

graphite-app bot commented Mar 25, 2026

Graphite Automations

"Add anton/matt/sergey/kristina as reviwers on operator PRs" took an action on this PR • (03/25/26)

2 reviewers were added to this PR based on Anton Bykov's automation.

Copilot AI left a comment

Pull request overview

Adjusts the hugepages “offset” used to compute the MEMORY env var for Weka pods, aiming to better account for DPDK base memory needs on frontend containers.

Changes:

  • Update GetHugePagesOffset default logic to scale frontend offset with NumCores (64 MiB per core) instead of a fixed 200 MiB.
  • Keep the previous 200 MiB default offset for non-frontend containers in the default mode branch.

Comment on lines +1041 to +1042
offset = 64 * container.Spec.NumCores // 64 is same as the value we put int dpdk_base_memory_mb in resources.json
} else {
Copilot AI Mar 25, 2026

The inline comment has a typo (“put int”). Also, the reference to dpdk_base_memory_mb in resources.json is hard to validate/trace from this codebase, and the raw 64 is an unexplained magic number. Please fix the typo and consider replacing 64 with a named constant (or config-driven value) and update the comment to point to the actual source of truth for this offset.
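
A minimal sketch of the named-constant suggestion. The constant and function names below are hypothetical (they do not appear in this PR); only the values 64 and 200 come from the diff.

```go
package main

import "fmt"

const (
	// defaultDpdkBaseMemoryMiB is a hypothetical name for the per-core value
	// that the code comment ties to dpdk_base_memory_mb in resources.json.
	defaultDpdkBaseMemoryMiB = 64
	// defaultHugepagesOffsetMiB is a hypothetical name for the fixed 200 MiB default.
	defaultHugepagesOffsetMiB = 200
)

// defaultOffsetMiB mirrors the default branch in the diff: frontend containers
// scale with the core count, everything else keeps the fixed offset.
func defaultOffsetMiB(hasFrontend bool, numCores int) int {
	if hasFrontend {
		return defaultDpdkBaseMemoryMiB * numCores
	}
	return defaultHugepagesOffsetMiB
}

func main() {
	fmt.Println(defaultOffsetMiB(true, 4))  // 256
	fmt.Println(defaultOffsetMiB(false, 4)) // 200
}
```

With the value named in one place, the comment can point at the constant instead of repeating the number.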

Copilot uses AI. Check for mistakes.
default:
offset = 200
if container.HasFrontend() {
offset = 64 * container.Spec.NumCores // 64 is same as the value we put int dpdk_base_memory_mb in resources.json
Copilot AI Mar 25, 2026

Typo in comment: “put int” should be “put in”.

Suggested change
- offset = 64 * container.Spec.NumCores // 64 is same as the value we put int dpdk_base_memory_mb in resources.json
+ offset = 64 * container.Spec.NumCores // 64 is same as the value we put in dpdk_base_memory_mb in resources.json

Comment on lines +1040 to +1044
if container.HasFrontend() {
offset = 64 * container.Spec.NumCores // 64 is same as the value we put int dpdk_base_memory_mb in resources.json
} else {
offset = 200
}
Copilot AI Mar 25, 2026

This new frontend-specific offset logic only runs when container.Spec.HugepagesOffset == 0. In the common WekaCluster/WekaClient flows, HugepagesOffset is always set (e.g., via template defaults), so this change may not actually affect the pods you’re trying to fix. Consider updating the default hugepages offset where it’s populated (templates/factory) or intentionally leaving HugepagesOffset unset so this function’s defaults take effect.
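
To illustrate the concern, here is a sketch with a stand-in `containerSpec` type (not the real container spec) and an assumed pre-populated offset of 200 MiB. Any non-zero `HugepagesOffset` short-circuits the new frontend branch.

```go
package main

import "fmt"

// containerSpec is a simplified stand-in for the fields the function reads.
type containerSpec struct {
	HugepagesOffset int
	NumCores        int
	HasFrontend     bool
}

// hugePagesOffset follows the shape of GetHugePagesOffset in the diff,
// reduced to the default-mode branch.
func hugePagesOffset(s containerSpec) int {
	if s.HugepagesOffset != 0 {
		return s.HugepagesOffset // explicit value wins; the defaults never run
	}
	if s.HasFrontend {
		return 64 * s.NumCores
	}
	return 200
}

func main() {
	// Offset pre-populated (assumed to be 200 here): per-core scaling is skipped.
	fmt.Println(hugePagesOffset(containerSpec{HugepagesOffset: 200, NumCores: 4, HasFrontend: true}))
	// Offset left unset: the frontend default applies.
	fmt.Println(hugePagesOffset(containerSpec{NumCores: 4, HasFrontend: true}))
}
```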

@assafgi assafgi force-pushed the 03-25-fix_extend_pod_hugepages_based_on_dpdk_base_memory_mb branch from 2e1f2f0 to b63609b Compare March 26, 2026 15:35
Copilot AI review requested due to automatic review settings March 29, 2026 16:53
@assafgi assafgi force-pushed the 03-25-fix_extend_pod_hugepages_based_on_dpdk_base_memory_mb branch from b63609b to 429633e Compare March 29, 2026 16:53
Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.


Comment on lines +154 to 156
numCores = template.Cores.Drive
dpdkBaseMemoryMbOverride = cluster.Spec.GetOverrides().DpdkBaseMemoryMb.Drive
case "compute":
Copilot AI Mar 29, 2026

dpdkBaseMemoryMbOverride is initialized to 64, but each role branch overwrites it with cluster.Spec.GetOverrides().DpdkBaseMemoryMb.<Role> even when that value is 0 (unset). That effectively disables the intended default and can under-allocate hugepages for DPDK. Keep the 64 default unless the override is >0 (e.g., use util.GetNonZeroOrDefault(..., 64) or an explicit if > 0 check) in every role branch.
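
A sketch of the non-zero-or-default pattern. `nonZeroOrDefault` is a local stand-in written for this example (the repository's `util.GetNonZeroOrDefault` may have a different signature), and `roleOverrides` is a simplified placeholder for the per-role override struct.

```go
package main

import "fmt"

// nonZeroOrDefault returns value unless it is the zero value of its type,
// in which case it returns def. Requires Go 1.18+ for generics.
func nonZeroOrDefault[T comparable](value, def T) T {
	var zero T
	if value == zero {
		return def
	}
	return value
}

// roleOverrides is a simplified stand-in for the per-role override struct.
type roleOverrides struct {
	Drive, Compute int
}

func main() {
	overrides := roleOverrides{Compute: 128} // Drive left unset (0)
	fmt.Println(nonZeroOrDefault(overrides.Drive, 64))   // 64
	fmt.Println(nonZeroOrDefault(overrides.Compute, 64)) // 128
}
```

Applying the same call in every role branch keeps the 64 MiB default from being overwritten by an unset override.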

Comment on lines +218 to +220
// Add DPDK memory to hugepages based on role and cores
hp.Hugepages += dpdkBaseMemoryMbOverride * numCores

Copilot AI Mar 29, 2026

GetContainerHugepages now adds DPDK memory (hp.Hugepages += ...) but there are no unit tests asserting the default (+64 MiB per core) behavior or that per-role DpdkBaseMemoryMb overrides are applied. Since templates_test.go already has extensive coverage for hugepages calculations, add cases that validate the DPDK increment for at least one role and for both default and overridden values.
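
The requested cases could be laid out table-style, roughly as below. `dpdkHugepagesMiB` is a hypothetical stand-in for the increment that `GetContainerHugepages` adds. In templates_test.go the loop would presumably live in a `func Test...(t *testing.T)` rather than in a plain helper.

```go
package main

import "fmt"

// dpdkHugepagesMiB is a stand-in for the DPDK increment: per-core base memory
// (the override if positive, else 64 MiB) multiplied by the core count.
func dpdkHugepagesMiB(overrideMiB, numCores int) int {
	base := 64
	if overrideMiB > 0 {
		base = overrideMiB
	}
	return base * numCores
}

// checkCases runs the table and returns the first mismatch, if any.
func checkCases() error {
	cases := []struct {
		name               string
		override, numCores int
		want               int
	}{
		{"default, 2 cores", 0, 2, 128},
		{"override 128, 2 cores", 128, 2, 256},
		{"default, 0 cores", 0, 0, 0},
	}
	for _, c := range cases {
		if got := dpdkHugepagesMiB(c.override, c.numCores); got != c.want {
			return fmt.Errorf("%s: got %d, want %d", c.name, got, c.want)
		}
	}
	return nil
}

func main() {
	if err := checkCases(); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("ok")
}
```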

Comment on lines +311 to +314
{
Name: "DPDK_BASE_MEMORY_MB",
Value: strconv.Itoa(GetDpdkBaseMemoryMb(f.container)),
},
Copilot AI Mar 29, 2026

DPDK_BASE_MEMORY_MB is sourced from GetDpdkBaseMemoryMb(f.container), but containers created via factory.NewWekaContainerForWekaCluster never populate container.Spec.Overrides.DpdkBaseMemoryMb (no assignments found), while hugepages allocation uses cluster.Spec.GetOverrides().DpdkBaseMemoryMb.<role>. As a result, a cluster-level override will change requested hugepages but the runtime will still see the default 64 MiB, causing config/hugepages mismatch. Propagate the selected per-role DPDK base memory from the cluster spec into each container spec (or otherwise derive it here) so the env var matches the allocator logic.
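
One way to keep the two in sync is to resolve the per-role value once and store it on the container spec. This is only a sketch under assumed types: `containerSpec` and `newContainerSpec` are hypothetical and stand in for the factory code.

```go
package main

import (
	"fmt"
	"strconv"
)

const defaultDpdkBaseMemoryMb = 64 // default named in the review

// containerSpec is a simplified stand-in for the container spec.
type containerSpec struct {
	NumCores         int
	Hugepages        int
	DpdkBaseMemoryMb int
}

// newContainerSpec resolves the per-role value once and stores it, so the
// hugepages request and the env var are derived from the same number.
func newContainerSpec(baseHugepages, numCores, roleOverrideMb int) containerSpec {
	dpdk := defaultDpdkBaseMemoryMb
	if roleOverrideMb > 0 {
		dpdk = roleOverrideMb
	}
	return containerSpec{
		NumCores:         numCores,
		Hugepages:        baseHugepages + dpdk*numCores,
		DpdkBaseMemoryMb: dpdk,
	}
}

// dpdkEnvValue is the string that would be emitted as DPDK_BASE_MEMORY_MB.
func dpdkEnvValue(s containerSpec) string {
	return strconv.Itoa(s.DpdkBaseMemoryMb)
}

func main() {
	s := newContainerSpec(3000, 2, 128)
	fmt.Println(s.Hugepages, dpdkEnvValue(s)) // 3256 128
}
```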

hugePagesK8sSuffix = "2Mi"
offset := GetHugePagesOffset(container)
wekaMemoryString = fmt.Sprintf("%dMiB", container.Spec.Hugepages-offset)
wekaMemoryString = fmt.Sprintf("%dMiB", container.Spec.Hugepages-baseOffset-dpdkMemory)
Copilot AI Mar 29, 2026

wekaMemoryString is now computed as Hugepages - baseOffset - dpdkMemory and can become negative if the hugepages spec/overrides are misconfigured (or if allocator and runtime DPDK values diverge). Consider guarding against negative results (e.g., clamp to 0 or return a validation error earlier) to avoid emitting an invalid MEMORY env var like -64MiB.

Suggested change
- wekaMemoryString = fmt.Sprintf("%dMiB", container.Spec.Hugepages-baseOffset-dpdkMemory)
+ availableMemoryMiB := container.Spec.Hugepages - baseOffset - dpdkMemory
+ if availableMemoryMiB < 0 {
+ 	availableMemoryMiB = 0
+ }
+ wekaMemoryString = fmt.Sprintf("%dMiB", availableMemoryMiB)
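
The other option mentioned, returning a validation error instead of clamping, might look like the sketch below. The function signature is hypothetical; the PR computes the string inline.

```go
package main

import "fmt"

// wekaMemoryString returns the MEMORY value, or an error when the offsets
// exceed the hugepages budget. All arguments are in MiB.
func wekaMemoryString(hugepages, baseOffset, dpdkMemory int) (string, error) {
	available := hugepages - baseOffset - dpdkMemory
	if available < 0 {
		return "", fmt.Errorf(
			"hugepages %d MiB is smaller than offset %d MiB + DPDK %d MiB",
			hugepages, baseOffset, dpdkMemory)
	}
	return fmt.Sprintf("%dMiB", available), nil
}

func main() {
	if s, err := wekaMemoryString(3000, 200, 128); err == nil {
		fmt.Println(s) // 2672MiB
	}
	if _, err := wekaMemoryString(100, 200, 128); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

An error surfaces the misconfiguration early, whereas clamping to 0 would let the pod start with a MEMORY value that may be equally unusable.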

Comment on lines +1046 to +1047
if c.wekaClient.Spec.Overrides.DpdkBaseMemoryMb != 0 {
dpdkBaseMemoryMb = c.wekaClient.Spec.Overrides.DpdkBaseMemoryMb
Copilot AI Mar 29, 2026

This code reads c.wekaClient.Spec.Overrides.DpdkBaseMemoryMb directly, while other override accesses in this repo consistently go through Spec.GetOverrides() (e.g., wekaClient.Spec.GetOverrides().DriversBuildId) to apply defaults / avoid nil override structs. Consider using c.wekaClient.Spec.GetOverrides().DpdkBaseMemoryMb here for consistency and safety.

Suggested change
- if c.wekaClient.Spec.Overrides.DpdkBaseMemoryMb != 0 {
- 	dpdkBaseMemoryMb = c.wekaClient.Spec.Overrides.DpdkBaseMemoryMb
+ if c.wekaClient.Spec.GetOverrides().DpdkBaseMemoryMb != 0 {
+ 	dpdkBaseMemoryMb = c.wekaClient.Spec.GetOverrides().DpdkBaseMemoryMb
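
For context, a nil-safe accessor of this kind is often written as below. The types are simplified stand-ins, and treating `Overrides` as a pointer field is an assumption made for the illustration, not something this PR confirms.

```go
package main

import "fmt"

// overrides and clientSpec are simplified stand-ins for the real types.
type overrides struct {
	DpdkBaseMemoryMb int
}

type clientSpec struct {
	Overrides *overrides // assumed to be optional (nil when unset)
}

// GetOverrides never returns nil, so callers can read fields without a guard.
func (s *clientSpec) GetOverrides() *overrides {
	if s.Overrides == nil {
		return &overrides{}
	}
	return s.Overrides
}

// dpdkBaseMemoryMb applies the 64 MiB default when no override is set.
func dpdkBaseMemoryMb(s *clientSpec) int {
	if v := s.GetOverrides().DpdkBaseMemoryMb; v != 0 {
		return v
	}
	return 64
}

func main() {
	unset := &clientSpec{}
	set := &clientSpec{Overrides: &overrides{DpdkBaseMemoryMb: 96}}
	fmt.Println(dpdkBaseMemoryMb(unset)) // 64
	fmt.Println(dpdkBaseMemoryMb(set))   // 96
}
```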

@assafgi assafgi force-pushed the 03-25-fix_extend_pod_hugepages_based_on_dpdk_base_memory_mb branch from 429633e to 4910260 Compare March 29, 2026 18:05
Copilot AI review requested due to automatic review settings March 30, 2026 07:57
@assafgi assafgi force-pushed the 03-25-fix_extend_pod_hugepages_based_on_dpdk_base_memory_mb branch from 4910260 to 0c2a623 Compare March 30, 2026 07:57
Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.


Comment on lines +154 to 156
numCores = template.Cores.Drive
dpdkBaseMemoryMb = cluster.Spec.GetOverrides().DpdkBaseMemoryMb.Drive
case "compute":
Copilot AI Mar 30, 2026

This overwrites the dpdkBaseMemoryMb default (64) with the override value even when the override is unset (0), which effectively disables the intended default. Only override the default when the per-role override is non-zero (e.g., use a non-zero-or-default helper or an explicit if != 0 check).

Comment on lines +170 to 172
numCores = template.Cores.Compute
dpdkBaseMemoryMb = cluster.Spec.GetOverrides().DpdkBaseMemoryMb.Compute
case "s3":
Copilot AI Mar 30, 2026

This overwrites the dpdkBaseMemoryMb default (64) with the override value even when the override is unset (0), which effectively disables the intended default. Only override the default when the per-role override is non-zero (e.g., use a non-zero-or-default helper or an explicit if != 0 check).

Comment on lines +181 to 183
numCores = template.Cores.S3
dpdkBaseMemoryMb = cluster.Spec.GetOverrides().DpdkBaseMemoryMb.S3
case "nfs":
Copilot AI Mar 30, 2026

This overwrites the dpdkBaseMemoryMb default (64) with the override value even when the override is unset (0), which effectively disables the intended default. Only override the default when the per-role override is non-zero (e.g., use a non-zero-or-default helper or an explicit if != 0 check).

Comment on lines +192 to 194
numCores = template.Cores.Nfs
dpdkBaseMemoryMb = cluster.Spec.GetOverrides().DpdkBaseMemoryMb.Nfs
case "smbw":
Copilot AI Mar 30, 2026

This overwrites the dpdkBaseMemoryMb default (64) with the override value even when the override is unset (0), which effectively disables the intended default. Only override the default when the per-role override is non-zero (e.g., use a non-zero-or-default helper or an explicit if != 0 check).

Comment on lines +1045 to +1049
dpdkBaseMemoryMb := 64
if c.wekaClient.Spec.Overrides.DpdkBaseMemoryMb != 0 {
dpdkBaseMemoryMb = c.wekaClient.Spec.Overrides.DpdkBaseMemoryMb
}
return c.getClientCores() * (1500 + dpdkBaseMemoryMb)
Copilot AI Mar 30, 2026

This function increases hugepages by dpdkBaseMemoryMb per core, but the client WekaContainer spec created in this reconciler does not persist the effective DPDK base memory anywhere (e.g., WekaContainerSpec.DpdkBaseMemoryMb). Downstream code (pod env vars and MEMORY calculation) currently reads container.Spec.DpdkBaseMemoryMb, which will remain 0 and diverge from the hugepages sizing here. Persist the chosen DPDK base memory into the container spec so the runtime and MEMORY computation stay consistent with the hugepages reservation.
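
As a worked example of the formula in the diff, `cores * (1500 + dpdkBaseMemoryMb)`, the sketch below returns the hugepages total together with the effective DPDK value, so that both could be written to the container spec. `clientSizing` and `sizeClient` are hypothetical names.

```go
package main

import "fmt"

// clientSizing pairs the hugepages total with the DPDK value that produced it.
type clientSizing struct {
	HugepagesMiB     int
	DpdkBaseMemoryMb int
}

// sizeClient applies the formula from the diff with a 64 MiB default.
func sizeClient(cores, overrideMb int) clientSizing {
	dpdk := 64
	if overrideMb != 0 {
		dpdk = overrideMb
	}
	return clientSizing{
		HugepagesMiB:     cores * (1500 + dpdk),
		DpdkBaseMemoryMb: dpdk,
	}
}

func main() {
	fmt.Println(sizeClient(2, 0))   // {3128 64}
	fmt.Println(sizeClient(2, 128)) // {3256 128}
}
```

If only `HugepagesMiB` were stored, downstream code reading a zero `DpdkBaseMemoryMb` would subtract nothing and disagree with the reservation by 128 MiB in the default two-core case.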

Comment on lines +203 to 205
numCores = template.Cores.Smbw
dpdkBaseMemoryMb = cluster.Spec.GetOverrides().DpdkBaseMemoryMb.Smbw
case "data-services":
Copilot AI Mar 30, 2026

This overwrites the dpdkBaseMemoryMb default (64) with the override value even when the override is unset (0), which effectively disables the intended default. Only override the default when the per-role override is non-zero (e.g., use a non-zero-or-default helper or an explicit if != 0 check).

Comment on lines +214 to 216
numCores = template.Cores.DataServices
dpdkBaseMemoryMb = cluster.Spec.GetOverrides().DpdkBaseMemoryMb.DataServices
}
Copilot AI Mar 30, 2026

This overwrites the dpdkBaseMemoryMb default (64) with the override value even when the override is unset (0), which effectively disables the intended default. Only override the default when the per-role override is non-zero (e.g., use a non-zero-or-default helper or an explicit if != 0 check).

@assafgi assafgi force-pushed the 03-25-fix_extend_pod_hugepages_based_on_dpdk_base_memory_mb branch from 0c2a623 to 33cc9cb Compare March 30, 2026 08:08
Copilot AI review requested due to automatic review settings March 30, 2026 08:52
@assafgi assafgi force-pushed the 03-25-fix_extend_pod_hugepages_based_on_dpdk_base_memory_mb branch from 33cc9cb to 677ac07 Compare March 30, 2026 08:52
Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.


}
if container.Spec.DpdkBaseMemoryMb != dpdkBaseMemoryMb {
container.Spec.DpdkBaseMemoryMb = dpdkBaseMemoryMb
changed = true
Copilot AI Mar 30, 2026

When DpdkBaseMemoryMb changes, the client container's Hugepages may also need to be updated to keep total hugepages in sync with the computed formula (since DPDK memory is part of the default hugepages sizing). Right now only container.Spec.DpdkBaseMemoryMb is updated, which can leave an existing container under-provisioned for hugepages. Consider recomputing/updating container.Spec.Hugepages as well when DpdkBaseMemoryMb changes (at least when wekaClient.Spec.HugePages == 0 so hugepages are operator-computed).

Suggested change
- changed = true
+ changed = true
+ // When hugepages are operator-computed (wekaClient.Spec.HugePages == 0),
+ // a change in DPDK base memory should also be reflected in the container's
+ // hugepages allocation, since the sizing formula includes DPDK memory.
+ if c.wekaClient.Spec.HugePages == 0 {
+ 	if container.Spec.Hugepages != newClientSpec.Hugepages {
+ 		container.Spec.Hugepages = newClientSpec.Hugepages
+ 		changed = true
+ 	}
+ }

@assafgi assafgi force-pushed the 03-25-fix_extend_pod_hugepages_based_on_dpdk_base_memory_mb branch from 677ac07 to 8587d07 Compare March 30, 2026 11:12
Copilot AI review requested due to automatic review settings March 30, 2026 12:50
@assafgi assafgi force-pushed the 03-25-fix_extend_pod_hugepages_based_on_dpdk_base_memory_mb branch from 8587d07 to ec50516 Compare March 30, 2026 12:50
@assafgi assafgi force-pushed the 03-25-fix_extend_pod_hugepages_based_on_dpdk_base_memory_mb branch from ec50516 to b781661 Compare March 30, 2026 12:55
Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

internal/controllers/resources/pod.go:1065

  • The comment says this function includes DPDK base memory, but GetHugePagesOffset currently only returns container.Spec.HugepagesOffset (or a mode-based default) and does not incorporate container.Spec.DpdkBaseMemoryMb. Either update the implementation to add the DPDK portion (e.g., DpdkBaseMemoryMb * NumCores where applicable) or adjust the comment to match the actual behavior.
// GetHugePagesOffset returns the hugepages offset for a container (for Weka allocation), including DPDK base memory.
func GetHugePagesOffset(container *weka.WekaContainer) int {
	offset := container.Spec.HugepagesOffset
	// get default if not set
	if offset == 0 {
		switch container.Spec.Mode {
		case weka.WekaContainerModeDrive:
			if container.UsesDriveSharing() {
				offset = 200 * container.Spec.NumCores
			} else {
				offset = 200 * container.Spec.NumDrives
			}
		case weka.WekaContainerModeSSDProxy:
			offset = config.Config.DriveSharing.SsdProxyHugepagesOffsetMiB
		default:
			offset = 200
		}
	}
	return offset

Comment on lines +232 to +235
// Add DPDK base memory to both hugepages and offset
dpdkTotalMemory := dpdkBaseMemoryMb * numCores
hp.Hugepages += dpdkTotalMemory
hp.HugepagesOffset += dpdkTotalMemory
Copilot AI Mar 30, 2026

DPDK base memory is now added to both Hugepages and HugepagesOffset, but the unit tests in templates_test.go only assert Hugepages. Add assertions for HugepagesOffset to ensure the offset adjustment remains correct and to prevent regressions in MEMORY sizing (Hugepages - offset).
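
The missing assertion could check both fields and the invariant they are meant to preserve. This is a sketch with a stand-in `hugepagesResult` type and a hypothetical `addDpdk` helper that mirrors the four diff lines quoted above; the starting values 3000 and 200 are illustrative.

```go
package main

import "fmt"

// hugepagesResult is a stand-in for the struct GetContainerHugepages fills in.
type hugepagesResult struct {
	Hugepages       int
	HugepagesOffset int
}

// addDpdk mirrors the diff: the DPDK total is added to both fields.
func addDpdk(hp hugepagesResult, dpdkBaseMemoryMb, numCores int) hugepagesResult {
	total := dpdkBaseMemoryMb * numCores
	hp.Hugepages += total
	hp.HugepagesOffset += total
	return hp
}

func main() {
	before := hugepagesResult{Hugepages: 3000, HugepagesOffset: 200}
	after := addDpdk(before, 64, 2)
	fmt.Println(after.Hugepages, after.HugepagesOffset) // 3128 328
	// MEMORY (Hugepages - HugepagesOffset) is unchanged by the adjustment.
	fmt.Println(before.Hugepages-before.HugepagesOffset ==
		after.Hugepages-after.HugepagesOffset) // true
}
```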

Suggested change
- // Add DPDK base memory to both hugepages and offset
- dpdkTotalMemory := dpdkBaseMemoryMb * numCores
- hp.Hugepages += dpdkTotalMemory
- hp.HugepagesOffset += dpdkTotalMemory
+ // Add DPDK base memory to hugepages so that MEMORY (Hugepages - HugepagesOffset) includes it
+ dpdkTotalMemory := dpdkBaseMemoryMb * numCores
+ hp.Hugepages += dpdkTotalMemory

@assafgi assafgi force-pushed the 03-25-fix_extend_pod_hugepages_based_on_dpdk_base_memory_mb branch from b781661 to ebec559 Compare March 30, 2026 16:31
@kristina-solovyova kristina-solovyova added the claude-review Force Claude review label Mar 30, 2026 — with Graphite App
Copilot AI review requested due to automatic review settings March 30, 2026 18:02
@assafgi assafgi force-pushed the 03-25-fix_extend_pod_hugepages_based_on_dpdk_base_memory_mb branch from ebec559 to 3b81136 Compare March 30, 2026 18:02
Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

internal/controllers/resources/pod.go:1056

  • The doc comment says GetHugePagesOffset includes DPDK base memory, but the function currently only returns container.Spec.HugepagesOffset (or the mode-based default) and does not reference DpdkBaseMemoryMb. Either update the comment to reflect the actual behavior (offset is already expected to include any DPDK adjustment in Spec.HugepagesOffset), or incorporate DPDK into this helper to match the comment.
// GetHugePagesOffset returns the hugepages offset for a container (for Weka allocation), including DPDK base memory.
func GetHugePagesOffset(container *weka.WekaContainer) int {
	offset := container.Spec.HugepagesOffset
	// get default if not set
	if offset == 0 {
		switch container.Spec.Mode {
		case weka.WekaContainerModeDrive:
			if container.UsesDriveSharing() {
				offset = 200 * container.Spec.NumCores
			} else {
