Conversation
| // Sort a list and return the parity of the number of swaps.
| template <typename Iterator> static double parity_sort(Iterator begin, Iterator end) {
|   std::size_t n_swaps = bubble_sort(begin, end);
|   return 2.0 * int{n_swaps % 2 == 0} - 1.0;
@Wentzell Why did you change this branch-less statement into one that has a branch? I don't see any good reason for doing so and I don't think that this counts as premature optimization.
For gcc both versions actually generate the same assembly, cf. https://godbolt.org/z/svq84af6j, and for clang (n_swaps % 2 == 0) ? 1.0 : -1.0 is arguably faster, see https://claude.ai/share/8bfaabd2-cb34-4eb3-96c7-79f6ceb5b88a
Did you see a difference in any microbenchmark?
For GCC 14.3.0 on my laptop: at -O3 and -O2 there is no difference between the two; at -O0 and -Og the branchless solution is clearly faster (a factor of 1.2 on average). Remember that people also run debug builds of TRIQS for development, and they deserve somewhat decent performance too. For some reason that I haven't investigated further, at -O1 the branching version shows some weird behavior where it is a factor of 1.5 faster for odd numbers than for even numbers (branch misprediction, maybe?).
Regarding Clang, there is a 10 year old bug report for the bad code: https://bugs.llvm.org/show_bug.cgi?id=25277
Microbenchmark
#include <benchmark/benchmark.h>
#include <cstdint>
#include <random>
static double parity_one(std::size_t n_swaps) {
return 2.0 * int{n_swaps % 2 == 0} - 1.0;
}
static double parity_two(std::size_t n_swaps) {
return (n_swaps % 2 == 0) ? 1.0 : -1.0;
}
static void args(benchmark::internal::Benchmark* b) {
std::default_random_engine rand(0xbeefcafe);
for (std::size_t i = 0; i < 10; ++i) {
b->Arg(static_cast<std::int64_t>(rand()));
}
}
static void ParityOne(benchmark::State& state) {
for (auto _ : state) {
double d = parity_one(static_cast<std::size_t>(state.range(0)));
benchmark::DoNotOptimize(d);
}
}
BENCHMARK(ParityOne)->Apply(args);
static void ParityTwo(benchmark::State& state) {
for (auto _ : state) {
double d = parity_two(static_cast<std::size_t>(state.range(0)));
benchmark::DoNotOptimize(d);
}
}
BENCHMARK(ParityTwo)->Apply(args);
BENCHMARK_MAIN();
I think the fastest in this case would be:
static double parity_bitcast(std::size_t n_swaps) {
static constexpr auto one = std::bit_cast<std::uint64_t>(1.0);
const auto u = one ^ (n_swaps & 1ull) << 63;
return std::bit_cast<double>(u);
}
With -O3 this is 30% faster than parity_one.
With -O0 I don't see meaningful differences with gcc 15.
microbenchmark:
#include <benchmark/benchmark.h>
#include <cstdint>
#include <random>
#include <cmath>
#include <bit>
// Original versions
static double parity_one(std::size_t n_swaps) {
return 2.0 * int{n_swaps % 2 == 0} - 1.0;
}
static double parity_two(std::size_t n_swaps) {
return (n_swaps % 2 == 0) ? 1.0 : -1.0;
}
// Bitwise AND + double
static double parity_bitand(std::size_t n_swaps) {
// +1.0 if even, -1.0 if odd
return 1.0 - 2.0 * static_cast<double>(n_swaps & 1u);
}
// Integer sign then cast
static double parity_int(std::size_t n_swaps) {
int sign = 1 - 2 * static_cast<int>(n_swaps & 1u); // +1 or -1
return static_cast<double>(sign);
}
// FMA variant: 2 * parity - 1
static double parity_fma(std::size_t n_swaps) {
int parity = static_cast<int>(n_swaps & 1u); // 0 or 1
return std::fma(static_cast<double>(parity), 2.0, -1.0);
}
// copysign variant
static double parity_copysign(std::size_t n_swaps) {
double x = 0.5 - static_cast<double>(n_swaps & 1u);
return std::copysign(1.0, x);
}
// bit_cast sign-bit flip (C++20)
#if __cpp_lib_bit_cast >= 201806L
static double parity_bitcast(std::size_t n_swaps) {
auto u = std::bit_cast<std::uint64_t>(1.0);
u ^= (n_swaps & 1ull) << 63;
return std::bit_cast<double>(u);
}
#endif
// Same args generator
static void args(benchmark::internal::Benchmark* b) {
std::default_random_engine rand(0xbeefcafe);
for (std::size_t i = 0; i < 10; ++i) {
b->Arg(static_cast<std::int64_t>(rand()));
}
}
// Benchmarks
static void ParityOne(benchmark::State& state) {
for (auto _ : state) {
double d = parity_one(static_cast<std::size_t>(state.range(0)));
benchmark::DoNotOptimize(d);
}
}
BENCHMARK(ParityOne)->Apply(args);
static void ParityTwo(benchmark::State& state) {
for (auto _ : state) {
double d = parity_two(static_cast<std::size_t>(state.range(0)));
benchmark::DoNotOptimize(d);
}
}
BENCHMARK(ParityTwo)->Apply(args);
static void ParityBitAnd(benchmark::State& state) {
for (auto _ : state) {
double d = parity_bitand(static_cast<std::size_t>(state.range(0)));
benchmark::DoNotOptimize(d);
}
}
BENCHMARK(ParityBitAnd)->Apply(args);
static void ParityInt(benchmark::State& state) {
for (auto _ : state) {
double d = parity_int(static_cast<std::size_t>(state.range(0)));
benchmark::DoNotOptimize(d);
}
}
BENCHMARK(ParityInt)->Apply(args);
static void ParityFma(benchmark::State& state) {
for (auto _ : state) {
double d = parity_fma(static_cast<std::size_t>(state.range(0)));
benchmark::DoNotOptimize(d);
}
}
BENCHMARK(ParityFma)->Apply(args);
static void ParityCopysign(benchmark::State& state) {
for (auto _ : state) {
double d = parity_copysign(static_cast<std::size_t>(state.range(0)));
benchmark::DoNotOptimize(d);
}
}
BENCHMARK(ParityCopysign)->Apply(args);
#if __cpp_lib_bit_cast >= 201806L
static void ParityBitcast(benchmark::State& state) {
for (auto _ : state) {
double d = parity_bitcast(static_cast<std::size_t>(state.range(0)));
benchmark::DoNotOptimize(d);
}
}
BENCHMARK(ParityBitcast)->Apply(args);
#endif
BENCHMARK_MAIN();
Thank you both @hmenke and @DiamonDinoia for your input.
Indeed, it seems ParityBitcast dominates:
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
ParityOne/1090913575 0.428 ns 0.427 ns 1624181049
ParityTwo/1090913575 0.360 ns 0.360 ns 2039897825
ParityBitAnd/1090913575 0.447 ns 0.447 ns 1558235034
ParityInt/1090913575 0.445 ns 0.445 ns 1558966762
ParityFma/1090913575 0.317 ns 0.316 ns 2208092805
ParityCopysign/1090913575 0.476 ns 0.476 ns 1473895002
ParityBitcast/1090913575 0.240 ns 0.239 ns 2933001395
I don't see how optimizing this return statement is at all necessary though, as it follows the almost always dominant call to bubble_sort.
Happy to discuss more if this actually turns out to be relevant in any profiles.
|
Sure, do as you see fit. The bit_cast solution is sort of standard in the
realm of random number generators. It was common to do this trick when FP
multiply was expensive in the past.
|
|
Any particular reason it should be a bubble sort and not another sort? |
The sorting has to respect the commutation relations of quantum mechanical operators, which is why we use a sorting algorithm that only ever swaps adjacent elements (bubble sort is one of them) and keep track of the number of swaps, because each swap in a quantum expectation value incurs a minus sign. Also of note, there are usually just a few elements in those ranges, most commonly two or four. |
Dear @hmenke, We actually do not require swapped elements to be adjacent in our sort. The single swap of two distant elements, say at positions i and j, has parity -1, since it is equivalent to 2(j - i) - 1 adjacent transpositions, which is always an odd number.
|
How big of a bottleneck is this? If distant swaps are allowed, we can use better sorts like insertion sort. If the number of elements to sort is small, there are in-cache-only implementations that are heavily optimized. |
|
This part should be very subdominant to the actual matrix updates / determinant ratio calculations.