Distribution of Multi-Architecture Requests in Serverledge #18
Open
FilippoMuschera wants to merge 179 commits into serverledge-faas:main from
This reverts commit 876b23e.
This reverts commit 3e61d58.
…oud (tests might run slower)
This commit refactors the node registration process to include the architecture of the node. This is to prepare for function scheduling based on architecture compatibility. Includes:
- Add Arch to NodeID and NodeRegistration
- Include architecture in ETCD registration payload
- Update runtime info with compatible architectures
Add image architecture discovery and caching in etcd to allow functions to run on ARM or x86 nodes. Also add architecture support for custom runtimes. Update go.mod and go.sum to add new dependencies.
Refactor dependencies in go.mod and go.sum. The indirect Docker dependency had resolved to a newer version in which a function in use is now deprecated.
Adds tests for detecting image architectures via Docker. Also updates Makefile with unit test target.
Update etcd and test cache logic.
Improve test initialization and teardown in main_test.go. Also update Makefile and script to reliably manage etcd instance during test execution.
For the moment this will run only on the ArchitectureAware branch
Also store the architectures supported by each function's runtime in the Function struct, so that this information is saved to etcd together with the function and is available for offloaded requests as well.
Right now it does not choose the best architecture, but it does take architecture into account for compatibility: for example, it won't try to run an amd64-only container on an arm64 node.
…to a misplacing inside an inner if-block.
…tedArchs field of Function. Adds X86 and ARM constants for supported architectures. Normally this field is assigned by the node that first registers the function and is then pushed to etcd so that other nodes can retrieve it. In this test, function registration is skipped and done manually through etcd directly, so the field and its values had to be set manually.
The API test was failing intermittently due to insufficient sleep time on less powerful hardware. This tries to mitigate this issue.
The scheduler now takes into account the node architecture during offloading to ensure function compatibility. Requests are dropped if the current node's architecture is not supported by the function's runtime.
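A minimal sketch of the compatibility check described above, not the code from this PR. The constant values and the `node`, `supportsArch`, and `compatibleNodes` names are hypothetical and only illustrate how offloading targets could be filtered by architecture.

```go
package main

import "fmt"

// Architecture identifiers. The PR mentions X86 and ARM constants; the string
// values chosen here are assumptions.
const (
	X86 = "amd64"
	ARM = "arm64"
)

// node is a hypothetical, simplified view of a registered node.
type node struct {
	Name string
	Arch string
}

// supportsArch reports whether arch is listed among the architectures
// supported by a function's runtime.
func supportsArch(supported []string, arch string) bool {
	for _, a := range supported {
		if a == arch {
			return true
		}
	}
	return false
}

// compatibleNodes keeps only the offloading candidates whose architecture
// the function's runtime supports.
func compatibleNodes(candidates []node, supported []string) []node {
	var out []node
	for _, n := range candidates {
		if supportsArch(supported, n.Arch) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	candidates := []node{{"edge-1", ARM}, {"edge-2", X86}}
	// An amd64-only function can only be offloaded to edge-2.
	fmt.Println(compatibleNodes(candidates, []string{X86}))
}
```

If the local node fails `supportsArch` and `compatibleNodes` returns an empty slice, the request would be dropped, which matches the behavior the commit message describes.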
Explicitly setting ctx to nil. This helps with garbage collection in MAB mode where ctx might still be set but is not used.
MABs: UCB-1 and LinUCB
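For readers unfamiliar with UCB-1, the following is a sketch of the textbook algorithm, not necessarily the variant implemented in this PR. Each arm could stand for one architecture; the `ucb1` type and its methods are hypothetical names.

```go
package main

import (
	"fmt"
	"math"
)

// ucb1 holds per-arm statistics for the textbook UCB-1 policy.
type ucb1 struct {
	counts []int     // number of times each arm was played
	sums   []float64 // cumulative reward per arm
	total  int       // total number of plays
}

func newUCB1(arms int) *ucb1 {
	return &ucb1{counts: make([]int, arms), sums: make([]float64, arms)}
}

// selectArm plays every arm once, then picks the arm that maximizes
// mean reward + sqrt(2 ln(t) / n_i).
func (u *ucb1) selectArm() int {
	for i, c := range u.counts {
		if c == 0 {
			return i
		}
	}
	best, bestScore := 0, math.Inf(-1)
	for i, c := range u.counts {
		mean := u.sums[i] / float64(c)
		bonus := math.Sqrt(2 * math.Log(float64(u.total)) / float64(c))
		if score := mean + bonus; score > bestScore {
			best, bestScore = i, score
		}
	}
	return best
}

// update records the reward observed after playing arm.
func (u *ucb1) update(arm int, reward float64) {
	u.counts[arm]++
	u.sums[arm] += reward
	u.total++
}

func main() {
	bandit := newUCB1(2) // e.g. arm 0 = ARM, arm 1 = x86 (assumed mapping)
	for i := 0; i < 5; i++ {
		arm := bandit.selectArm()
		bandit.update(arm, float64(1-arm)) // toy reward: arm 0 always wins
		fmt.Println("selected arm", arm)
	}
}
```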
experiments: update files to work with new deployment system
Introduce a memory penalty factor to the LinUCB reward calculation. The new `MAB_LINUCB_LAMBDA` configuration key controls the weight of this penalty. The reward is now `reward = -log(durationMs) - (lambda * memPenalty(memUsage))`, where `memPenalty` is a function that grows non-linearly from 0 at 70% memory utilization to 1 at 100% utilization. This change aims to improve resource awareness of the LinUCB policy by penalizing architectures experiencing high memory pressure, promoting better load distribution and preventing resource exhaustion.
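The reward formula can be sketched as follows. The commit message only states that `memPenalty` grows non-linearly from 0 at 70% utilization to 1 at 100%; the quadratic ramp below is an assumption chosen for illustration, as is the use of the natural logarithm and of a utilization value in the range [0, 1].

```go
package main

import (
	"fmt"
	"math"
)

// memPenalty maps memory utilization in [0, 1] to a penalty in [0, 1].
// It is 0 up to 70% utilization and reaches 1 at 100%. The quadratic shape
// is an assumption; the PR only says the growth is non-linear.
func memPenalty(memUsage float64) float64 {
	const threshold = 0.70
	if memUsage <= threshold {
		return 0
	}
	if memUsage >= 1 {
		return 1
	}
	x := (memUsage - threshold) / (1 - threshold)
	return x * x
}

// reward implements reward = -log(durationMs) - lambda * memPenalty(memUsage),
// where lambda would come from the MAB_LINUCB_LAMBDA configuration key.
func reward(durationMs, memUsage, lambda float64) float64 {
	return -math.Log(durationMs) - lambda*memPenalty(memUsage)
}

func main() {
	// Same duration, increasing memory pressure: the reward decreases.
	for _, mem := range []float64{0.50, 0.85, 1.00} {
		fmt.Printf("mem=%.2f reward=%.4f\n", mem, reward(120, mem, 0.5))
	}
}
```

With `lambda` set to 0 the penalty term vanishes and the reward depends on duration alone.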
Introduce a new `Random` load balancing mode to the ArchitectureAwareBalancer. This change adds a `selectArchitectureRandom` function that randomly chooses between ARM and x86 architectures. This mode can be enabled via the `LB_MODE` configuration and serves as a testing and baseline option for experiments, complementing the existing MABs and Round Robin modes. It provides a simple, unbiased selection mechanism for architectures.
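A possible shape for such a function is sketched below. The function name comes from the commit message, but its signature, the injected random source, and the constant values are assumptions made for this example.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Assumed string values for the two architectures.
const (
	X86 = "amd64"
	ARM = "arm64"
)

// newRNG returns a seeded random source, so that experiments using the
// random baseline can be reproduced (hypothetical helper).
func newRNG(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// selectArchitectureRandom picks ARM or x86 with equal probability.
func selectArchitectureRandom(rng *rand.Rand) string {
	if rng.Intn(2) == 0 {
		return ARM
	}
	return X86
}

func main() {
	rng := newRNG(42)
	counts := map[string]int{}
	for i := 0; i < 1000; i++ {
		counts[selectArchitectureRandom(rng)]++
	}
	fmt.Println(counts) // roughly an even split between the two keys
}
```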
Temporarily disable the read lock on node.LocalResources in GetServerStatus to prevent a deadlock. The deadlock occurs when AcquireWarmContainer is called concurrently with GetServerStatus, as AcquireWarmContainer attempts to acquire a write lock while GetServerStatus holds a read lock, blocking new readers as per Go's RLock documentation. This is a temporary solution to unblock experiments. A more robust refactor of the resource access in GetServerStatus is needed to ensure proper synchronization without deadlocks.
This commit extends the node metric tracking to include CPU utilization. Previously, the load balancer only considered free memory when making scheduling decisions. This change introduces `FreeCPU` to the `NodeMetric` struct and updates the `NodeMetricCache.Update` method to accept and store this new metric.
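As an illustration only, a metric cache with the extra field might look like the sketch below. `NodeMetric`, `FreeCPU`, and `NodeMetricCache.Update` are named in the commit message; the other fields, the `Update` signature, the `Get` method, and the rule that older timestamps are ignored are assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// NodeMetric is a simplified view of the metrics tracked per node.
type NodeMetric struct {
	FreeMemoryMB int64   // assumed field name and unit
	FreeCPU      float64 // new metric introduced by this commit
	LastUpdate   int64   // Unix timestamp of the sample (assumed field)
}

// NodeMetricCache stores the latest metric per node, guarded by a mutex.
type NodeMetricCache struct {
	mu      sync.RWMutex
	metrics map[string]NodeMetric
}

func NewNodeMetricCache() *NodeMetricCache {
	return &NodeMetricCache{metrics: make(map[string]NodeMetric)}
}

// Update stores a sample unless a newer one is already cached
// (the staleness rule is an assumption of this sketch).
func (c *NodeMetricCache) Update(node string, freeMemMB int64, freeCPU float64, ts int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cur, ok := c.metrics[node]; ok && cur.LastUpdate > ts {
		return
	}
	c.metrics[node] = NodeMetric{FreeMemoryMB: freeMemMB, FreeCPU: freeCPU, LastUpdate: ts}
}

// Get returns the cached metric for node, if any.
func (c *NodeMetricCache) Get(node string) (NodeMetric, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	m, ok := c.metrics[node]
	return m, ok
}

func main() {
	cache := NewNodeMetricCache()
	cache.Update("edge-1", 2048, 3.5, 100)
	m, _ := cache.Get("edge-1")
	fmt.Printf("%+v\n", m)
}
```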
Introduce a configurable validity period for the offloading cache in the scheduler. This makes it possible to tune how long offloading decisions are cached for the `EdgeOnly` policy. The `OFFLOADING_CACHE_VALIDITY` constant is added to `internal/config/keys.go` to define the configuration key. The `internal/scheduling/offloading.go` file is updated to read this configuration value and apply it to the `CacheValidity` duration, replacing the previous hardcoded value of 15 seconds. If not explicitly set, the value defaults to 60 seconds, to match the default janitor interval of Serverledge nodes.
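The following sketch shows how a configured validity with a 60-second default could be applied. It does not use Serverledge's configuration package; `cacheValidity`, `offloadingCache`, and `isValid` are hypothetical names, and treating non-positive values as "not set" is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// defaultCacheValidity mirrors the 60-second default named in the commit.
const defaultCacheValidity = 60 * time.Second

// cacheValidity converts a configured number of seconds into a duration,
// falling back to the default when the value is missing or not positive.
func cacheValidity(configuredSeconds int) time.Duration {
	if configuredSeconds <= 0 {
		return defaultCacheValidity
	}
	return time.Duration(configuredSeconds) * time.Second
}

// offloadingCache is a hypothetical stand-in for the cached offloading data.
type offloadingCache struct {
	updatedAt time.Time
	validity  time.Duration
}

func newOffloadingCache(validity time.Duration) offloadingCache {
	return offloadingCache{updatedAt: time.Now(), validity: validity}
}

// isValid reports whether the cached data can still be used at time now.
func (c offloadingCache) isValid(now time.Time) bool {
	return now.Sub(c.updatedAt) < c.validity
}

func main() {
	cache := newOffloadingCache(cacheValidity(0)) // nothing configured
	fmt.Println(cache.validity)                   // 1m0s
	fmt.Println(cache.isValid(cache.updatedAt.Add(90 * time.Second)))
}
```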
Move the acquisition of LocalResources RLock in `GetServerStatus` after fetching `WarmStatus` and Vivaldi coordinates to prevent a deadlock with `AcquireWarmContainer`. Update the `api.go` handler to include a `Serverledge-Timestamp` header in responses, allowing the load balancer (`lb.go`) to use the actual node timestamp for metric updates instead of `time.Now().Unix()`. This ensures more accurate latency measurements and better distributed system consistency.
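The header exchange could look roughly like the sketch below. Only the `Serverledge-Timestamp` header name comes from the commit message; `statusHandler`, `parseTimestamp`, the Unix-seconds encoding, and the fallback to the local clock are assumptions.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
	"time"
)

const timestampHeader = "Serverledge-Timestamp"

// statusHandler is a hypothetical node-side handler that stamps each
// response with the node's own clock, in Unix seconds.
func statusHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set(timestampHeader, strconv.FormatInt(time.Now().Unix(), 10))
	w.WriteHeader(http.StatusOK)
}

// parseTimestamp reads the header value on the load balancer side and
// returns fallback when the header is missing or malformed.
func parseTimestamp(value string, fallback int64) int64 {
	ts, err := strconv.ParseInt(value, 10, 64)
	if err != nil {
		return fallback
	}
	return ts
}

func main() {
	// Simulate one request/response round trip without opening a socket.
	rec := httptest.NewRecorder()
	statusHandler(rec, httptest.NewRequest(http.MethodGet, "/status", nil))

	raw := rec.Result().Header.Get(timestampHeader)
	fmt.Println("node timestamp:", parseTimestamp(raw, time.Now().Unix()))
}
```

Falling back to the local clock keeps metric updates working against nodes that do not yet send the header.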
This PR introduces native support for heterogeneous clusters (x86 and ARM) in Serverledge, together with a new load-balancing system. The load balancer integrates Multi-Armed Bandit (MAB) algorithms to dynamically optimize execution times based on the architectural affinities of the functions.
Serverledge runtimes have been adapted to support ARM and extended to provide execution environments for Python functions with ML libraries, as well as for the Java and Go languages.