
Parity sort #32

Closed
hmenke wants to merge 2 commits into TRIQS:unstable from hmenke:parity_sort

Conversation

@hmenke (Member) commented Dec 4, 2025

No description provided.

@Wentzell (Member) commented Dec 4, 2025

Thank you @hmenke, merged with minor edits in 71ae143 and cefd274

@Wentzell Wentzell closed this Dec 4, 2025
// Sort a list and return the parity of the number of swaps.
template <typename Iterator> static double parity_sort(Iterator begin, Iterator end) {
  std::size_t n_swaps = bubble_sort(begin, end);
  return 2.0 * int{n_swaps % 2 == 0} - 1.0;
}
@hmenke (Member Author) Dec 5, 2025

@Wentzell Why did you change this branchless statement into one that has a branch? I don't see any good reason for doing so, and I don't think that this counts as premature optimization.

return (n_swaps % 2 == 0) ? 1.0 : -1.0;

Member

For gcc both versions actually generate the same assembly, cf. https://godbolt.org/z/svq84af6j, and for clang (n_swaps % 2 == 0) ? 1.0 : -1.0 is arguably faster, see https://claude.ai/share/8bfaabd2-cb34-4eb3-96c7-79f6ceb5b88a

Member

Did you see a difference in any microbenchmark?

Member Author

For GCC 14.3.0 on my laptop: at -O3 and -O2 there is no difference between the two; at -O0 and -Og the branchless solution is clearly faster (factor 1.2 on average). Remember that people also run debug builds of TRIQS for development, and they also deserve somewhat decent performance. For some reason that I haven't investigated further, at -O1 the branching version shows some weird behavior where it is a factor 1.5 faster for odd numbers than for even numbers (branch misprediction maybe?).

Regarding Clang, there is a 10 year old bug report for the bad code: https://bugs.llvm.org/show_bug.cgi?id=25277

Microbenchmark
#include <benchmark/benchmark.h>
#include <cstdint>
#include <random>

static double parity_one(std::size_t n_swaps) {
  return 2.0 * int{n_swaps % 2 == 0} - 1.0;
}

static double parity_two(std::size_t n_swaps) {
  return (n_swaps % 2 == 0) ? 1.0 : -1.0;
}

static void args(benchmark::internal::Benchmark* b) {
  std::default_random_engine rand(0xbeefcafe);
  for (std::size_t i = 0; i < 10; ++i) {
    b->Arg(static_cast<std::int64_t>(rand()));
  }
}

static void ParityOne(benchmark::State& state) {
  for (auto _ : state) {
    double d = parity_one(state.range(0));
    benchmark::DoNotOptimize(d);
  }
}
BENCHMARK(ParityOne)->Apply(args);

static void ParityTwo(benchmark::State& state) {
  for (auto _ : state) {
    double d = parity_two(state.range(0));
    benchmark::DoNotOptimize(d);
  }
}
BENCHMARK(ParityTwo)->Apply(args);
BENCHMARK_MAIN();

@DiamonDinoia (Collaborator) Dec 5, 2025

I think the fastest in this case would be:

static double parity_bitcast(std::size_t n_swaps) {
    static constexpr auto one = std::bit_cast<std::uint64_t>(1.0);
    const auto u = one ^ (n_swaps & 1ull) << 63;
    return std::bit_cast<double>(u);
}

With -O3 this is 30% faster than parity_one.
With -O0 I don't see meaningful differences with gcc 15.

microbenchmark:

#include <benchmark/benchmark.h>
#include <cstdint>
#include <random>
#include <cmath>
#include <bit>

// Original versions
static double parity_one(std::size_t n_swaps) {
  return 2.0 * int{n_swaps % 2 == 0} - 1.0;
}

static double parity_two(std::size_t n_swaps) {
  return (n_swaps % 2 == 0) ? 1.0 : -1.0;
}

// Bitwise AND + double
static double parity_bitand(std::size_t n_swaps) {
  // +1.0 if even, -1.0 if odd
  return 1.0 - 2.0 * static_cast<double>(n_swaps & 1u);
}

// Integer sign then cast
static double parity_int(std::size_t n_swaps) {
  int sign = 1 - 2 * static_cast<int>(n_swaps & 1u); // +1 or -1
  return static_cast<double>(sign);
}

// FMA variant: 2 * parity - 1
static double parity_fma(std::size_t n_swaps) {
  int parity = static_cast<int>(n_swaps & 1u); // 0 or 1
  return std::fma(static_cast<double>(parity), 2.0, -1.0);
}

// copysign variant
static double parity_copysign(std::size_t n_swaps) {
  double x = 0.5 - static_cast<double>(n_swaps & 1u);
  return std::copysign(1.0, x);
}

// bit_cast sign-bit flip (C++20)
#if __cpp_lib_bit_cast >= 201806L
static double parity_bitcast(std::size_t n_swaps) {
  auto u = std::bit_cast<std::uint64_t>(1.0);
  u ^= (n_swaps & 1ull) << 63;
  return std::bit_cast<double>(u);
}
#endif

// Same args generator
static void args(benchmark::internal::Benchmark* b) {
  std::default_random_engine rand(0xbeefcafe);
  for (std::size_t i = 0; i < 10; ++i) {
    b->Arg(static_cast<std::int64_t>(rand()));
  }
}

// Benchmarks

static void ParityOne(benchmark::State& state) {
  for (auto _ : state) {
    double d = parity_one(static_cast<std::size_t>(state.range(0)));
    benchmark::DoNotOptimize(d);
  }
}
BENCHMARK(ParityOne)->Apply(args);

static void ParityTwo(benchmark::State& state) {
  for (auto _ : state) {
    double d = parity_two(static_cast<std::size_t>(state.range(0)));
    benchmark::DoNotOptimize(d);
  }
}
BENCHMARK(ParityTwo)->Apply(args);

static void ParityBitAnd(benchmark::State& state) {
  for (auto _ : state) {
    double d = parity_bitand(static_cast<std::size_t>(state.range(0)));
    benchmark::DoNotOptimize(d);
  }
}
BENCHMARK(ParityBitAnd)->Apply(args);

static void ParityInt(benchmark::State& state) {
  for (auto _ : state) {
    double d = parity_int(static_cast<std::size_t>(state.range(0)));
    benchmark::DoNotOptimize(d);
  }
}
BENCHMARK(ParityInt)->Apply(args);

static void ParityFma(benchmark::State& state) {
  for (auto _ : state) {
    double d = parity_fma(static_cast<std::size_t>(state.range(0)));
    benchmark::DoNotOptimize(d);
  }
}
BENCHMARK(ParityFma)->Apply(args);

static void ParityCopysign(benchmark::State& state) {
  for (auto _ : state) {
    double d = parity_copysign(static_cast<std::size_t>(state.range(0)));
    benchmark::DoNotOptimize(d);
  }
}
BENCHMARK(ParityCopysign)->Apply(args);

#if __cpp_lib_bit_cast >= 201806L
static void ParityBitcast(benchmark::State& state) {
  for (auto _ : state) {
    double d = parity_bitcast(static_cast<std::size_t>(state.range(0)));
    benchmark::DoNotOptimize(d);
  }
}
BENCHMARK(ParityBitcast)->Apply(args);
#endif

BENCHMARK_MAIN();

Member

Thank you both @hmenke and @DiamonDinoia for your input.
Indeed it seems the ParityBitcast dominates

--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
ParityOne/1090913575           0.428 ns        0.427 ns   1624181049
ParityTwo/1090913575           0.360 ns        0.360 ns   2039897825
ParityBitAnd/1090913575        0.447 ns        0.447 ns   1558235034
ParityInt/1090913575           0.445 ns        0.445 ns   1558966762
ParityFma/1090913575           0.317 ns        0.316 ns   2208092805
ParityCopysign/1090913575      0.476 ns        0.476 ns   1473895002
ParityBitcast/1090913575       0.240 ns        0.239 ns   2933001395

I don't see how optimizing this return statement is at all necessary though, as it follows the almost always dominant call to bubble_sort.

Happy to discuss more if this actually turns out to be relevant in any profiles.

@DiamonDinoia (Collaborator) commented Dec 5, 2025 via email

@DiamonDinoia (Collaborator)

Any particular reason it should be a bubble sort and not another sort?

@hmenke (Member Author) commented Dec 8, 2025

> Any particular reason it should be a bubble sort and not another sort?

The sorting has to respect the commutation relations of quantum mechanical operators, which is why we use a sorting algorithm that only ever swaps adjacent elements (bubble sort is one of them) and keep track of the number of swaps, because each swap in a quantum expectation value incurs a minus sign.

Also of note, there are usually just a few elements in those ranges, most commonly two or four.
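As a concrete illustration of the sign bookkeeping described above (my sketch, not TRIQS code): each adjacent transposition of fermionic operators contributes a factor of -1, so the sign of the sorted expression is +1 for an even number of swaps and -1 for an odd number:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Illustration only: sort a small index sequence using adjacent swaps and
// return the resulting fermionic sign, (-1)^(number of swaps).
double fermionic_sign(std::vector<int> idx) {
  std::size_t n_swaps = 0;
  for (bool swapped = true; swapped;) {
    swapped = false;
    for (std::size_t i = 0; i + 1 < idx.size(); ++i) {
      if (idx[i + 1] < idx[i]) {
        std::swap(idx[i], idx[i + 1]); // adjacent transposition: factor -1
        ++n_swaps;
        swapped = true;
      }
    }
  }
  return (n_swaps % 2 == 0) ? 1.0 : -1.0;
}
```

For the common two-element case, {2, 1} needs one swap (sign -1) while {1, 2} needs none (sign +1).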

@hmenke hmenke deleted the parity_sort branch December 8, 2025 05:45
@Wentzell (Member) commented Dec 8, 2025

> The sorting has to respect the commutation relations of quantum mechanical operators, which is why we use a sorting algorithm that only ever swaps adjacent elements (bubble sort is one of them) and keep track of the number of swaps, because each swap in a quantum expectation value incurs a minus sign.

Dear @hmenke, we actually do not require swapped elements to be adjacent in our sort. A single swap of two distant elements, say at positions i and j, has parity -1, and so does the equivalent sequence of adjacent swaps (moving the element at i up to j and then the displaced element at j-1 back down to i), which always consists of an odd number of swaps.
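To make that count explicit (my sketch, not from the thread): realizing the distant swap of positions i < j with adjacent swaps takes j - i swaps to bubble the element at i up to j, plus j - i - 1 more to bring the displaced element back down to i, i.e. 2(j - i) - 1 swaps in total, which is always odd:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Illustration: emulate a single distant swap of positions i < j using
// only adjacent swaps, returning how many adjacent swaps were used.
// The count is (j - i) + (j - i - 1) = 2*(j - i) - 1, always odd, so the
// parity matches that of the single distant transposition.
std::size_t distant_swap_via_adjacent(std::vector<int>& v, std::size_t i, std::size_t j) {
  std::size_t n = 0;
  for (std::size_t k = i; k < j; ++k) { std::swap(v[k], v[k + 1]); ++n; }     // bubble i up to j
  for (std::size_t k = j - 1; k > i; --k) { std::swap(v[k - 1], v[k]); ++n; } // bring it back to i
  return n;
}
```

Afterwards only the elements at positions i and j have traded places; everything in between is back where it started.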

@DiamonDinoia (Collaborator)
How big of a bottleneck is this? If distant swaps are allowed, we can use better sorts like insertion sort. If the number of elements to sort is small, there are in-cache implementations that are heavily optimized.
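For reference (my sketch, not a benchmarked proposal): an insertion sort can track parity just as well, since every move of an element past a neighbour is one adjacent transposition:

```cpp
#include <cstddef>
#include <vector>

// Sketch: insertion sort that counts element moves. Each move shifts an
// element past one neighbour, i.e. one adjacent transposition, so the
// move count has the same parity as bubble sort's swap count.
std::size_t insertion_sort_count(std::vector<int>& a) {
  std::size_t n_moves = 0;
  for (std::size_t i = 1; i < a.size(); ++i) {
    int key = a[i];
    std::size_t j = i;
    while (j > 0 && key < a[j - 1]) {
      a[j] = a[j - 1]; // shift up: one adjacent transposition
      --j;
      ++n_moves;
    }
    a[j] = key;
  }
  return n_moves;
}
```

The returned count equals the number of inversions in the input, which determines the sign the same way the bubble-sort swap count does.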

@Wentzell (Member)
This part should be very subdominant compared to the actual matrix updates / determinant-ratio calculations.
