
Distribution of Multi-Architecture Requests in Serverledge #18

Open
FilippoMuschera wants to merge 179 commits into serverledge-faas:main from FilippoMuschera:main

@FilippoMuschera
Contributor

This PR introduces native support for heterogeneous clusters (x86 and ARM) in Serverledge, together with a new architecture-aware load-balancing system. The load balancer now integrates Multi-Armed Bandit (MAB) algorithms that dynamically optimize execution times based on each function's affinity for a given architecture.

Serverledge runtimes have been adapted to support ARM and extended to provide execution environments for Python functions with ML libraries, as well as for the Java and Go languages.

FilippoMuschera and others added 30 commits October 6, 2025 17:50
This commit refactors the node registration process to
include the architecture of the node. This is to prepare for function scheduling
based on architecture compatibility. Includes:

- Add Arch to NodeID and NodeRegistration
- Include architecture in ETCD registration payload
- Update runtime info with compatible architectures
Add image architecture discovery and caching in etcd to allow functions to run on ARM or x86 nodes. Also add architecture support for custom runtimes.
Update go.mod and go.sum to add new dependencies.
Refactor dependencies in go.mod and go.sum. The indirect Docker dependency had resolved to a newer version in which a function in use was deprecated.
Adds tests for detecting image architectures via Docker.
Also updates Makefile with unit test target.
Update etcd and test cache logic.
Improve test initialization and teardown in main_test.go.
Also update Makefile and script to reliably manage etcd
instance during test execution.
For the moment this will run only on the ArchitectureAware branch
Save the architectures supported by each function's runtime in the Function struct as well, so that this data is persisted in etcd and is also available for offloaded requests.
For now the scheduler does not choose the best architecture, but it does check compatibility: for example, it won't try to run an amd64-only container on an arm64 node.
…tedArchs field of Function.

Adds X86 and ARM constants for supported architectures. Normally this field is assigned by the node that first registers the function and is then pushed to etcd so that other nodes can retrieve it. In this test, function registration is skipped and done manually through etcd, so the field and its values had to be set by hand.
The API test was failing intermittently due to insufficient sleep time on less powerful hardware. This tries to mitigate this issue.
The scheduler now takes into account the node architecture during offloading to ensure function compatibility.
Requests are dropped if the current node's architecture is not supported by the function's runtime.
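The compatibility rule described above can be sketched as follows. The constant values, the `Archs` field name, the `Node` type, and both helper functions are assumptions made for illustration; the actual Serverledge types may differ.

```go
package main

import "fmt"

// Architecture identifiers. The constant names come from the commit log;
// the string values are an assumption.
const (
	X86 = "amd64"
	ARM = "arm64"
)

// Function is a hypothetical, trimmed-down function descriptor.
type Function struct {
	Name  string
	Archs []string // architectures supported by the function's runtime image
}

// Node is a hypothetical view of a registered node.
type Node struct {
	Key  string
	Arch string
}

// supportsArch reports whether f can run on the given architecture.
func supportsArch(f Function, arch string) bool {
	for _, a := range f.Archs {
		if a == arch {
			return true
		}
	}
	return false
}

// compatibleNodes keeps only the offloading candidates that can run f.
func compatibleNodes(f Function, nodes []Node) []Node {
	var out []Node
	for _, n := range nodes {
		if supportsArch(f, n.Arch) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	f := Function{Name: "resize", Archs: []string{X86}}
	nodes := []Node{{Key: "edge-a", Arch: ARM}, {Key: "edge-b", Arch: X86}}
	fmt.Println(supportsArch(f, ARM))      // local ARM node: request would be dropped
	fmt.Println(compatibleNodes(f, nodes)) // only edge-b remains as offloading target
}
```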
FilippoMuschera and others added 30 commits February 6, 2026 16:18
Explicitly setting ctx to nil. This helps with
garbage collection in MAB mode where ctx might
still be set but is not used.
experiments: update files to work with new deployment system
Introduce a memory penalty factor to the LinUCB reward calculation.
The new `MAB_LINUCB_LAMBDA` configuration key controls the weight of this
penalty. The reward is now `reward = -log(durationMs) - (lambda * memPenalty(memUsage))`,
where `memPenalty` is a function that grows non-linearly from 0
at 70% memory utilization to 1 at 100% utilization.

This change aims to improve resource awareness of the LinUCB policy by
penalizing architectures experiencing high memory pressure, promoting
better load distribution and preventing resource exhaustion.
Introduce a new `Random` load balancing mode to the ArchitectureAwareBalancer.

This change adds a `selectArchitectureRandom` function that randomly
chooses between ARM and x86 architectures. This mode can be enabled
via the `LB_MODE` configuration and serves as a testing and baseline
option for experiments, complementing the existing MABs and Round Robin
modes. It provides a simple, unbiased selection mechanism for
architectures.
Temporarily disable the read lock on node.LocalResources in GetServerStatus
to prevent a deadlock. The deadlock occurs when AcquireWarmContainer is
called concurrently with GetServerStatus, as AcquireWarmContainer attempts
to acquire a write lock while GetServerStatus holds a read lock, blocking
new readers as per Go's RLock documentation.

This is a temporary solution to unblock experiments. A more robust
refactor of the resource access in GetServerStatus is needed to ensure
proper synchronization without deadlocks.
This commit extends the node metric tracking to include CPU utilization.
Previously, the load balancer only considered free memory when making
scheduling decisions. This change introduces `FreeCPU` to the `NodeMetric`
struct and updates the `NodeMetricCache.Update` method to accept and
store this new metric.
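The extended metric cache might be shaped like the sketch below. `NodeMetric`, `FreeCPU`, and `NodeMetricCache.Update` are named in the commit message; every other field, the constructor, the `Get` method, and the parameter list of `Update` are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// NodeMetric holds the last known resource state of a node.
// Only FreeCPU is named in the commit message; the other fields are assumed.
type NodeMetric struct {
	FreeMemMB  int64
	FreeCPU    float64
	LastUpdate int64 // Unix timestamp of the report
}

// NodeMetricCache is a concurrency-safe map from node name to its metrics.
type NodeMetricCache struct {
	mu      sync.RWMutex
	metrics map[string]NodeMetric
}

func NewNodeMetricCache() *NodeMetricCache {
	return &NodeMetricCache{metrics: make(map[string]NodeMetric)}
}

// Update stores the latest memory and CPU figures reported by a node.
func (c *NodeMetricCache) Update(node string, freeMemMB int64, freeCPU float64, ts int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.metrics[node] = NodeMetric{FreeMemMB: freeMemMB, FreeCPU: freeCPU, LastUpdate: ts}
}

// Get returns the cached metrics for a node, if any.
func (c *NodeMetricCache) Get(node string) (NodeMetric, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	m, ok := c.metrics[node]
	return m, ok
}

func main() {
	cache := NewNodeMetricCache()
	cache.Update("arm-node-1", 2048, 3.5, 1700000000)
	m, ok := cache.Get("arm-node-1")
	fmt.Println(m.FreeMemMB, m.FreeCPU, ok)
}
```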
Introduce a configurable validity period for the offloading cache in
the scheduler. This allows tuning how long offloading
decisions are cached for the `EdgeOnly` policy.

The `OFFLOADING_CACHE_VALIDITY` constant is added to `internal/config/keys.go`
to define the configuration key. The `internal/scheduling/offloading.go`
file is updated to read this configuration value and apply it to the
`CacheValidity` duration, replacing the previous hardcoded value of
15 seconds. If the key is not explicitly set, the value defaults to 60 seconds,
matching the default janitor interval of Serverledge nodes.
Move the acquisition of LocalResources RLock in `GetServerStatus` after fetching
`WarmStatus` and Vivaldi coordinates to prevent a deadlock with
`AcquireWarmContainer`.

Update the `api.go` handler to include a `Serverledge-Timestamp` header in
responses, allowing the load balancer (`lb.go`) to use the actual node
timestamp for metric updates instead of `time.Now().Unix()`. This ensures more
accurate latency measurements and better distributed system consistency.