Conversation
| // Sort a list and return the parity of the number of swaps.
| template <typename Iterator> static double parity_sort(Iterator begin, Iterator end) {
|   std::size_t n_swaps = bubble_sort(begin, end);
|   return 2.0 * int{n_swaps % 2 == 0} - 1.0;
@Wentzell Why did you change this branch-less statement into one that has a branch? I don't see any good reason for doing so and I don't think that this counts as premature optimization.
For gcc both versions actually generate the same assembly, cf. https://godbolt.org/z/svq84af6j, and for clang (n_swaps % 2 == 0) ? 1.0 : -1.0 is arguably faster, see https://claude.ai/share/8bfaabd2-cb34-4eb3-96c7-79f6ceb5b88a
Did you see a difference in any microbenchmark?
For GCC 14.3.0 on my laptop: at -O3 and -O2 there is no difference between the two; at -O0 and -Og the branchless solution is clearly faster (a factor of 1.2 on average). Remember that people also run debug builds of TRIQS for development, and they deserve somewhat decent performance too. For some reason that I haven't investigated further, at -O1 the branching version shows some weird behavior where it is a factor of 1.5 faster for odd numbers than for even numbers (branch misprediction, maybe?).
Regarding Clang, there is a 10 year old bug report for the bad code: https://bugs.llvm.org/show_bug.cgi?id=25277
Microbenchmark
#include <benchmark/benchmark.h>
#include <cstdint>
#include <random>
static double parity_one(std::size_t n_swaps) {
return 2.0 * int{n_swaps % 2 == 0} - 1.0;
}
static double parity_two(std::size_t n_swaps) {
return (n_swaps % 2 == 0) ? 1.0 : -1.0;
}
static void args(benchmark::internal::Benchmark* b) {
std::default_random_engine rand(0xbeefcafe);
for (std::size_t i = 0; i < 10; ++i) {
b->Arg(static_cast<std::int64_t>(rand()));
}
}
static void ParityOne(benchmark::State& state) {
for (auto _ : state) {
double d = parity_one(static_cast<std::size_t>(state.range(0)));
benchmark::DoNotOptimize(d);
}
}
BENCHMARK(ParityOne)->Apply(args);
static void ParityTwo(benchmark::State& state) {
for (auto _ : state) {
double d = parity_two(static_cast<std::size_t>(state.range(0)));
benchmark::DoNotOptimize(d);
}
}
BENCHMARK(ParityTwo)->Apply(args);
BENCHMARK_MAIN();
I think the fastest in this case would be:
static double parity_bitcast(std::size_t n_swaps) {
static constexpr auto one = std::bit_cast<std::uint64_t>(1.0);
const auto u = one ^ (n_swaps & 1ull) << 63;
return std::bit_cast<double>(u);
}
With -O3 this is 30% faster than parity_one.
With -O0 I don't see meaningful differences with gcc 15.
microbenchmark:
#include <benchmark/benchmark.h>
#include <cstdint>
#include <random>
#include <cmath>
#include <bit>
// Original versions
static double parity_one(std::size_t n_swaps) {
return 2.0 * int{n_swaps % 2 == 0} - 1.0;
}
static double parity_two(std::size_t n_swaps) {
return (n_swaps % 2 == 0) ? 1.0 : -1.0;
}
// Bitwise AND + double
static double parity_bitand(std::size_t n_swaps) {
// +1.0 if even, -1.0 if odd
return 1.0 - 2.0 * static_cast<double>(n_swaps & 1u);
}
// Integer sign then cast
static double parity_int(std::size_t n_swaps) {
int sign = 1 - 2 * static_cast<int>(n_swaps & 1u); // +1 or -1
return static_cast<double>(sign);
}
// FMA variant: 2 * parity - 1
static double parity_fma(std::size_t n_swaps) {
int parity = static_cast<int>(n_swaps & 1u); // 0 or 1
return std::fma(static_cast<double>(parity), 2.0, -1.0);
}
// copysign variant
static double parity_copysign(std::size_t n_swaps) {
double x = 0.5 - static_cast<double>(n_swaps & 1u);
return std::copysign(1.0, x);
}
// bit_cast sign-bit flip (C++20)
#if __cpp_lib_bit_cast >= 201806L
static double parity_bitcast(std::size_t n_swaps) {
auto u = std::bit_cast<std::uint64_t>(1.0);
u ^= (n_swaps & 1ull) << 63;
return std::bit_cast<double>(u);
}
#endif
// Same args generator
static void args(benchmark::internal::Benchmark* b) {
std::default_random_engine rand(0xbeefcafe);
for (std::size_t i = 0; i < 10; ++i) {
b->Arg(static_cast<std::int64_t>(rand()));
}
}
// Benchmarks
static void ParityOne(benchmark::State& state) {
for (auto _ : state) {
double d = parity_one(static_cast<std::size_t>(state.range(0)));
benchmark::DoNotOptimize(d);
}
}
BENCHMARK(ParityOne)->Apply(args);
static void ParityTwo(benchmark::State& state) {
for (auto _ : state) {
double d = parity_two(static_cast<std::size_t>(state.range(0)));
benchmark::DoNotOptimize(d);
}
}
BENCHMARK(ParityTwo)->Apply(args);
static void ParityBitAnd(benchmark::State& state) {
for (auto _ : state) {
double d = parity_bitand(static_cast<std::size_t>(state.range(0)));
benchmark::DoNotOptimize(d);
}
}
BENCHMARK(ParityBitAnd)->Apply(args);
static void ParityInt(benchmark::State& state) {
for (auto _ : state) {
double d = parity_int(static_cast<std::size_t>(state.range(0)));
benchmark::DoNotOptimize(d);
}
}
BENCHMARK(ParityInt)->Apply(args);
static void ParityFma(benchmark::State& state) {
for (auto _ : state) {
double d = parity_fma(static_cast<std::size_t>(state.range(0)));
benchmark::DoNotOptimize(d);
}
}
BENCHMARK(ParityFma)->Apply(args);
static void ParityCopysign(benchmark::State& state) {
for (auto _ : state) {
double d = parity_copysign(static_cast<std::size_t>(state.range(0)));
benchmark::DoNotOptimize(d);
}
}
BENCHMARK(ParityCopysign)->Apply(args);
#if __cpp_lib_bit_cast >= 201806L
static void ParityBitcast(benchmark::State& state) {
for (auto _ : state) {
double d = parity_bitcast(static_cast<std::size_t>(state.range(0)));
benchmark::DoNotOptimize(d);
}
}
BENCHMARK(ParityBitcast)->Apply(args);
#endif
BENCHMARK_MAIN();
Thank you both @hmenke and @DiamonDinoia for your input.
Indeed, it seems ParityBitcast dominates:
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
ParityOne/1090913575 0.428 ns 0.427 ns 1624181049
ParityTwo/1090913575 0.360 ns 0.360 ns 2039897825
ParityBitAnd/1090913575 0.447 ns 0.447 ns 1558235034
ParityInt/1090913575 0.445 ns 0.445 ns 1558966762
ParityFma/1090913575 0.317 ns 0.316 ns 2208092805
ParityCopysign/1090913575 0.476 ns 0.476 ns 1473895002
ParityBitcast/1090913575 0.240 ns 0.239 ns 2933001395
I don't see how optimizing this return statement is at all necessary though, as it follows the almost always dominant call to bubble_sort.
Happy to discuss more if this actually turns out to be relevant in any profiles.
|
Sure, do as you see fit. The bit_cast solution is sort of standard in the
realm of random number generators. It was common to do this trick when FP
multiply was expensive in the past.
|
|
Any particular reason it should be a bubble sort and not another sort? |
The sorting has to respect the commutation relations of quantum mechanical operators, which is why we use a sorting algorithm that only ever swaps adjacent elements (bubble sort is one of them) and keep track of the number of swaps, because each swap in a quantum expectation value incurs a minus sign. Also of note, there are usually just a few elements in those ranges, most commonly two or four. |
Dear @hmenke, We actually do not require swapped elements to be adjacent in our sort. The single swap of two distant elements, say at positions i and j, has parity -1, since it is equivalent to 2(j - i) - 1 adjacent transpositions, which is always an odd number.
|
How big of a bottleneck is this? If distant swaps are allowed, we can use better sorts like insertion sort. If the number of elements to sort is small, there are in-cache-only implementations that are heavily optimized. |
|
This part should be very subdominant to the actual matrix updates / determinant ratio calculations.