[SIMD] Model weights updating using AVX Instructions #11
Closed
octaviansima wants to merge 17 commits into mc2-project:master
Member
I ran some performance tests using the perf-testing branch and got the following results:
| Code version | Num weights | Num Threads | Time (s) |
|---|---|---|---|
| Non SIMD | 6000 | 1 | 132.79 |
| SIMD | 6000 | 1 | 134.97 |
| Non SIMD | 15000 | 1 | 843.7 |
| SIMD | 15000 | 1 | 858.7 |
I'll continue investigating why the SIMD version shows no performance improvement.
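For anyone reproducing a comparison like the one in the table, a minimal timing sketch follows. It is not the code from the perf-testing branch; the helper name `time_seconds` is hypothetical, and it assumes a single wall-clock measurement around one call is good enough for runs that take minutes.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper (not from the PR): runs `fn` once and returns the
// elapsed wall-clock time in seconds, measured with a monotonic clock.
template <typename Fn>
double time_seconds(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Fn>(fn)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

The SIMD and non-SIMD builds would each be wrapped in a call such as `time_seconds([&] { /* run aggregation */ })`, ideally with the same inputs and thread count.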
    if (k == g_accumulator.size() - 1 && iters_sum > 0) {
        const float iters_sum_arr[8] = {iters_sum, iters_sum, iters_sum, iters_sum,
                                        iters_sum, iters_sum, iters_sum, iters_sum};
        iters_sum_slice = _mm256_loadu_ps(iters_sum_arr);
Member
I think you can use `_mm256_broadcast_ss(&iters_sum)` instead of `_mm256_loadu_ps`. That way you don't have to create an array first. I tested this and it builds and runs, but you may want to check correctness.
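To make the suggestion concrete, here is a minimal sketch (not code from the PR) that puts the array-then-load pattern next to a direct broadcast. The helper names `splat_via_array` and `splat_via_broadcast` are hypothetical. The scalar fallback is there only so the sketch still compiles when AVX is not enabled at build time.

```cpp
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical helpers, not from the PR: each fills out[0..7] with `value`.

// Pattern from the diff: build a temporary 8-float array, then load it.
inline void splat_via_array(float value, float out[8]) {
#if defined(__AVX__)
    const float tmp[8] = {value, value, value, value,
                          value, value, value, value};
    _mm256_storeu_ps(out, _mm256_loadu_ps(tmp));
#else
    for (std::size_t j = 0; j < 8; ++j) out[j] = value;  // scalar fallback
#endif
}

// Suggested pattern: broadcast one float from memory into all eight lanes.
inline void splat_via_broadcast(float value, float out[8]) {
#if defined(__AVX__)
    _mm256_storeu_ps(out, _mm256_broadcast_ss(&value));
#else
    for (std::size_t j = 0; j < 8; ++j) out[j] = value;  // scalar fallback
#endif
}
```

Both helpers should produce the same eight lanes; `_mm256_set1_ps(value)` is another standard intrinsic that gives the same result without taking an address.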
    }
    const float n_iter_arr[8] = {n_iter, n_iter, n_iter, n_iter,
                                 n_iter, n_iter, n_iter, n_iter};
    __m256 n_iter_slice = _mm256_loadu_ps(n_iter_arr);

        continue; // Didn't receive this variable from any clients
    }
    // Multiple the weights by local iterations and update g_old_params[v_name].
    for (int i = 0; i < acc_params[v_name].size() / 8 * 8; i += 8) {
Member
The variable `i` is already used by the outer for loop, so this inner declaration shadows it.
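One way to sidestep the shadowing is to give the vectorized loop and its tail loop a single, distinctly named index. The sketch below is an illustration under assumptions, not the PR's code: `scale_in_place` is a hypothetical helper that only multiplies a vector by a scalar, standing in for the real weight update. It uses `std::size_t` so the index matches the type returned by `size()`.

```cpp
#include <cstddef>
#include <vector>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical helper (not from the PR): multiplies every element of
// `values` by `factor`, eight floats at a time when AVX is enabled.
inline void scale_in_place(std::vector<float>& values, float factor) {
    const std::size_t n = values.size();
    std::size_t j = 0;  // distinct name, so it cannot shadow an outer `i`
#if defined(__AVX__)
    const __m256 factor_slice = _mm256_set1_ps(factor);
    const std::size_t vec_end = n / 8 * 8;  // largest multiple of 8 <= n
    for (; j < vec_end; j += 8) {
        const __m256 slice = _mm256_loadu_ps(values.data() + j);
        _mm256_storeu_ps(values.data() + j, _mm256_mul_ps(slice, factor_slice));
    }
#endif
    // Tail case: remaining elements (or all of them when AVX is unavailable).
    for (; j < n; ++j) {
        values[j] *= factor;
    }
}
```

Because the tail loop continues from wherever the vectorized loop stopped, the `size() / 8 * 8` expression is computed once instead of being repeated in both loop headers.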
        _mm256_storeu_ps(g_old_params[v_name].data() + i, updated_old_params_v_name_slice);
    }
    // Tail case.
    for (int i = acc_params[v_name].size() / 8 * 8; i < acc_params[v_name].size(); i++) {
Member
This PR further optimizes the aggregation code by performing arithmetic operations with Intel's 256-bit AVX instructions. It also includes smaller optimizations: removing unnecessary copying of data, a logic change to avoid redundant loads and stores, and the conversion of double types to floats. The code used to test performance is here: https://github.com/octaviansima/secure-aggregation/blob/perf-testing/server/tests/host_test.cpp
Note that the PR is huge only due to the inclusion of Intel Intrinsics header files required for compilation.