LST CPU Speedups by GNiendorf · Pull Request #245 · SegmentLinking/cmssw

GNiendorf · 2026-03-18T16:43:40Z

This PR Timing (CPU) - commit 2 (pre-checks, exact trig simplifications, and additional early exits)

This PR Timing (CPU) - commit 1 (reducing redundant memory loads)

Master Timing (CPU)

GNiendorf · 2026-03-18T17:31:20Z

run-ci: all

github-actions · 2026-03-18T17:58:10Z

The PR was built and ran successfully in standalone mode running on CPU. Here are some of the comparison plots.

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     29.0    323.1    245.5    138.1     48.6    695.7     10.9    116.6    119.7    208.9      0.1    1936.1    1211.5+/- 290.1     602.5   explicit[s=4] (target branch)
   avg     28.1    218.8    178.6    127.2     49.7    700.7     10.6    109.5     83.0    202.6      0.1    1708.9     980.1+/- 239.1     545.5   explicit[s=4] (this PR)

github-actions · 2026-03-18T19:22:27Z

The PR was built and ran successfully with CMSSW running on CPU. Here are some plots.

OOTB All Tracks

The full set of validation and comparison plots can be found here.

GNiendorf · 2026-03-19T10:57:53Z

run-ci: all
modifiers: gpu

github-actions · 2026-03-19T11:19:17Z

The PR was built and ran successfully in standalone mode running on GPU. Here are some of the comparison plots.

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     31.1      0.2      0.4      0.6      0.9      0.3      0.6      0.5      0.3      1.4      0.0      36.2       4.8+/-  2.7      36.2   explicit[s=1]
   avg      1.1      0.3      0.5      0.8      1.0      0.3      0.8      0.7      0.4      1.8      0.0       7.7       6.3+/-  2.8       4.0   explicit[s=2]
   avg      2.0      0.6      0.8      1.2      1.5      0.4      1.2      1.0      0.6      2.8      0.0      12.1       9.7+/-  3.5       3.2   explicit[s=4]
   avg      3.2      0.9      1.2      1.7      2.0      0.5      1.7      1.3      0.8      3.9      0.0      17.2      13.5+/-  4.3       3.0   explicit[s=6]
   avg      3.7      1.3      1.7      2.4      2.6      0.7      2.3      1.6      1.0      4.9      0.0      22.3      17.9+/-  4.6       2.9   explicit[s=8] (target branch)
   avg     31.1      0.2      0.4      0.6      0.9      0.3      0.6      0.5      0.3      1.4      0.0      36.2       4.8+/-  2.6      36.3   explicit[s=1]
   avg      1.3      0.3      0.5      0.7      1.0      0.3      0.8      0.7      0.4      1.8      0.0       7.9       6.4+/-  2.8       4.1   explicit[s=2]
   avg      2.2      0.6      0.8      1.2      1.5      0.4      1.2      1.0      0.6      2.8      0.0      12.2       9.6+/-  3.3       3.2   explicit[s=4]
   avg      3.0      0.9      1.2      1.7      2.1      0.5      1.7      1.3      0.8      3.8      0.0      17.0      13.5+/-  4.1       3.0   explicit[s=6]
   avg      3.6      1.3      1.7      2.3      2.6      0.7      2.2      1.7      1.0      5.0      0.0      22.2      18.0+/-  4.5       2.9   explicit[s=8] (this PR)

GNiendorf · 2026-03-19T11:22:05Z

@slava77 I think this PR is good to go. It represents most of the boilerplate changes of the CPU speedups PR.

github-actions · 2026-03-19T12:35:39Z

The PR was built and ran successfully with CMSSW running on GPU. Here are some plots.

OOTB All Tracks

The full set of validation and comparison plots can be found here.

slava77 · 2026-03-19T13:29:14Z

The GPU variant should have one more significant digit in the component columns (the total can still be shown with .1).
I don't have a particular preference for doing that in this PR or in a separate one.

slava77

nice updates.
I think the comment cleanup in the MiniDoublet code is a bit too aggressive. While removing some tautological docs is fine, quite a bit of the code is going to lose clarity. Please recover those comments.

RecoTracker/LSTCore/src/alpaka/MiniDoublet.h

RecoTracker/LSTCore/src/alpaka/PixelTriplet.h

RecoTracker/LSTCore/src/alpaka/Segment.h

RecoTracker/LSTCore/src/alpaka/Triplet.h

GNiendorf · 2026-03-19T20:52:44Z

run-ci: all

GNiendorf · 2026-03-19T21:02:54Z

run-ci: all

GNiendorf · 2026-03-19T21:16:17Z

run-ci: all
modifiers: gpu

github-actions · 2026-03-19T21:23:16Z

The PR was built and ran successfully in standalone mode running on CPU. Here are some of the comparison plots.

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     28.2    324.2    243.0    136.6     47.8    698.4     10.9    114.7    118.8    208.7      0.1    1931.4    1204.8+/- 289.9     596.7   explicit[s=4] (target branch)
   avg     31.1    219.2    182.6    133.5     47.6    698.9     10.8    110.7     83.1    185.9      0.1    1703.5     973.4+/- 227.2     541.9   explicit[s=4] (this PR)

github-actions · 2026-03-19T22:18:47Z

The PR was built and ran successfully with CMSSW running on GPU. Here are some plots.

OOTB All Tracks

The full set of validation and comparison plots can be found here.

github-actions · 2026-03-19T22:23:37Z

The PR was built and ran successfully with CMSSW running on CPU. Here are some plots.

OOTB All Tracks

The full set of validation and comparison plots can be found here.

github-actions · 2026-03-19T22:39:10Z

The PR was built and ran successfully in standalone mode running on GPU. Here are some of the comparison plots.

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     32.5      0.2      0.4      0.6      0.9      0.3      0.6      0.5      0.3      1.4      0.0      37.6       4.8+/-  2.6      37.6   explicit[s=1]
   avg      1.1      0.4      0.5      0.8      1.0      0.3      0.8      0.7      0.4      1.8      0.0       7.8       6.4+/-  2.9       4.0   explicit[s=2]
   avg      1.8      0.6      0.8      1.1      1.5      0.4      1.2      1.0      0.6      2.8      0.0      11.9       9.7+/-  3.5       3.1   explicit[s=4]
   avg      2.6      0.9      1.3      1.7      2.0      0.5      1.7      1.2      0.8      3.9      0.0      16.6      13.5+/-  4.1       2.9   explicit[s=6]
   avg      3.4      1.3      1.7      2.3      2.6      0.7      2.3      1.6      1.0      5.0      0.0      21.9      17.8+/-  4.5       2.8   explicit[s=8] (target branch)
   avg     32.6      0.2      0.4      0.6      0.8      0.3      0.6      0.5      0.3      1.4      0.0      37.7       4.8+/-  2.5      37.7   explicit[s=1]
   avg      1.2      0.4      0.5      0.8      1.0      0.3      0.8      0.8      0.4      1.9      0.0       7.9       6.5+/-  2.8       4.1   explicit[s=2]
   avg      1.8      0.6      0.8      1.1      1.5      0.4      1.2      1.0      0.6      2.8      0.0      11.8       9.6+/-  3.5       3.1   explicit[s=4]
   avg      2.6      1.0      1.2      1.7      2.0      0.5      1.7      1.3      0.8      4.0      0.0      16.7      13.6+/-  4.0       2.9   explicit[s=6]
   avg      3.4      1.3      1.7      2.2      2.6      0.7      2.2      1.7      1.0      5.0      0.0      21.9      17.8+/-  4.5       2.8   explicit[s=8] (this PR)

GNiendorf · 2026-03-19T23:04:47Z

run-ci: all

GNiendorf · 2026-03-24T21:13:19Z

@slava77 Can you check the changes in the segments file? I think I addressed your concerns.

GNiendorf · 2026-03-24T21:17:32Z

run-ci: all

github-actions · 2026-03-24T21:37:41Z

The PR was built and ran successfully in standalone mode running on CPU. Here are some of the comparison plots.

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     27.9    324.5    242.0    133.4     46.1    697.3     10.8    114.6    117.2    208.7      0.1    1922.7    1197.5+/- 284.0     595.6   explicit[s=4] (target branch)
   avg     31.2    105.0    121.5    127.1     49.7    703.0     10.8     39.8     68.6    206.5      0.8    1463.9     729.7+/- 184.0     481.5   explicit[s=4] (this PR)

github-actions · 2026-03-24T22:52:57Z

The PR was built and ran successfully with CMSSW running on CPU. Here are some plots.

OOTB All Tracks

The full set of validation and comparison plots can be found here.

slava77

I suggest setting up a reviewing agent that ensures that similar changes are implemented consistently in different places, including consistent naming.

RecoTracker/LSTCore/src/alpaka/Segment.h

RecoTracker/LSTCore/src/alpaka/MiniDoublet.h

RecoTracker/LSTCore/src/alpaka/Segment.h

slava77 · 2026-03-25T14:32:07Z

RecoTracker/LSTCore/src/alpaka/Segment.h

@@ -799,21 +852,81 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE::lst {

  template <alpaka::concepts::Acc TAcc>
  ALPAKA_FN_ACC ALPAKA_FN_INLINE bool passDeltaPhiCutsSelector(TAcc const& acc,


I still don't understand the changes in this method.
Please don't just generate more code; clarify in plain English.

This method is used in CountMiniDoubletConnections and is logically meant to have either loose or full cuts.
The fact that dz cuts are added is probably OK, but on one hand it's feature creep, and on the other the method name is bad now.
Connected to the memory toggle PR, it's probably better to template this method on a true/false flag for loose/full cuts.
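One possible reading of the templating suggestion, as a minimal sketch: a `bool` template parameter picks loose or full cuts at compile time through `if constexpr`, so both variants share one function body. The name `passSegmentCuts`, the parameter list, and the factor of 2 on the loose window are all hypothetical and are not taken from the LST code; alpaka is left out to keep the example self-contained.

```cpp
#include <cmath>

// Hypothetical stand-in for a segment selector; not the LST implementation.
// kLoose = true  -> cheap, widened cuts (e.g. for a counting pass)
// kLoose = false -> full selection
template <bool kLoose>
inline bool passSegmentCuts(float dPhi, float dz, float dPhiCut, float dzCut) {
  // Cheap check shared by both variants.
  if (std::fabs(dz) >= dzCut)
    return false;
  if constexpr (kLoose) {
    // Loose variant: accept a wider dPhi window (factor 2 is an arbitrary placeholder).
    return std::fabs(dPhi) < 2.f * dPhiCut;
  } else {
    // Full variant: apply the nominal dPhi window.
    return std::fabs(dPhi) < dPhiCut;
  }
}
```

A counting kernel would then call `passSegmentCuts<true>(...)` and the object-building kernel `passSegmentCuts<false>(...)`, which keeps the naming of the two paths consistent.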

RecoTracker/LSTCore/src/alpaka/Segment.h

GNiendorf · 2026-03-26T18:25:30Z

@slava77 - Proof: the pre-checks in segments.h decrease the LS timing from 83.1 to 75.6 (single stream).

This PR Timing

This PR Timing if I remove the two pre-checks in passDeltaPhiCutsBarrel and passDeltaPhiCutsEndcap

GNiendorf · 2026-03-27T10:57:51Z

I forgot a dphichange precheck in the MD endcap path, added it and the timing decreases from 62.6->47.7

GNiendorf · 2026-03-27T11:06:01Z

run-ci: all

github-actions · 2026-03-27T11:27:45Z

The PR was built and ran successfully in standalone mode running on CPU. Here are some of the comparison plots.

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     28.5    323.6    247.1    139.1     48.4    701.6     10.9    114.5    117.5    208.6      0.1    1939.9    1209.8+/- 292.7     598.3   explicit[s=4] (target branch)
   avg     29.0     94.1    125.9    126.2     53.5    688.2     11.8     40.3     71.3    189.0      0.1    1429.5     712.3+/- 179.7     478.1   explicit[s=4] (this PR)

github-actions · 2026-03-27T12:34:08Z

The PR was built and ran successfully with CMSSW running on CPU. Here are some plots.

OOTB All Tracks

The full set of validation and comparison plots can be found here.

slava77 · 2026-03-27T13:07:26Z

> I forgot a dphichange precheck in the MD endcap path, added it and the timing decreases from 62.6->47.7

is there a way to automate extraction of the improvements by category?
It looks like you need to do some of these manually.

slava77 · 2026-03-27T13:10:56Z

At least some of the comments (especially if not implemented) implied a response; is it traceable?
Just silently making new changes is not the best.

GNiendorf · 2026-03-27T13:41:56Z

> I forgot a dphichange precheck in the MD endcap path, added it and the timing decreases from 62.6->47.7
>
> is there a way to automate extraction of the improvements by category? It looks like you need to do some of these manually.

Can you clarify what you mean? I've shown the timing improvement from the two segments pre-checks, I feel like most of the others are obvious (e.g. using algebraic pre-checks to avoid tan or sin calls in hot paths, using exact trig identities to reduce the number of trig calls while keeping the results exact, etc.)

slava77 · 2026-03-27T14:08:10Z

> Can you clarify what you mean?

I'm asking to quantify the gains by category; or, rather to automate it (mentioned already in the earlier review, was it Monday or so).
On one hand, this is going to be informative for this or other (future) code updates.
On the other, if the agentic setup is used, it's better to have somewhat incremental analysis of changes to see the progress/justification instead of just the final result.

This is somewhat different from a request for proof that the gain is positive at all.

slava77 · 2026-03-27T14:45:31Z

> Can you clarify what you mean?

From Mar 20:

> Ideally, before squashing everything to a single commit it would've been better to trace a sequence of improvements in finer chunks.

Is this kind of workflow possible?
I can manage through this PR without it, but is it possible for the future agent-driven updates?

GNiendorf · 2026-03-27T15:23:55Z

> Is this kind of workflow possible? I can manage through this PR without it, but is it possible for the future agent-driven updates?

I'm not sure off the top of my head. We could try to draft a few paragraphs of instructions for agents working on LST about our preferences for smaller commits, naming consistency with the rest of the codebase, etc. and see how well they follow the instructions.

slava77

The main comment is related to passLooseSegmentCuts, but perhaps I'm just arguing about a part that has parallels in other precompute methods for other objects. So, changing it here would only make things partially better, and only here.
In that case it's better to review and reuse this full/loose selection logic in a follow-up PR for other objects.

slava77 · 2026-03-27T23:20:39Z

RecoTracker/LSTCore/src/alpaka/MiniDoublet.h

+    // Algebraic: sin(atan(slope)) = |slope|/sqrt(1+slope^2), cos(atan(slope)) = 1/sqrt(1+slope^2)
+    float drprime_x, drprime_y;


Suggested change

-    // Algebraic: sin(atan(slope)) = |slope|/sqrt(1+slope^2), cos(atan(slope)) = 1/sqrt(1+slope^2)
-    float drprime_x, drprime_y;
+    float drprime_x, drprime_y; // drprime * {sin,cos} (atan(slope))
+    // Algebraic: sin(atan(slope)) = |slope|/sqrt(1+slope^2), cos(atan(slope)) = 1/sqrt(1+slope^2)

otherwise too much context is removed to follow what this code does
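For context on what the commented identity buys, here is a minimal sketch that computes the two shift components once with trig calls and once algebraically. The struct `Vec2` and both function names are hypothetical; the use of `|slope|` follows the code comment above, and a vertical module (infinite slope) is not handled here and would need its own branch.

```cpp
#include <cmath>

struct Vec2 {
  float x;
  float y;
};

// Reference form: atan followed by sin and cos.
inline Vec2 shiftTrig(float drprime, float slope) {
  const float angle = std::atan(std::fabs(slope));
  return {drprime * std::sin(angle), drprime * std::cos(angle)};
}

// Algebraic form: sin(atan(s)) = s/sqrt(1+s^2) and cos(atan(s)) = 1/sqrt(1+s^2) for s >= 0,
// so one sqrt and one division replace three trig calls.
inline Vec2 shiftAlgebraic(float drprime, float slope) {
  const float absSlope = std::fabs(slope);
  const float invNorm = 1.f / std::sqrt(1.f + slope * slope);
  return {drprime * absSlope * invNorm, drprime * invNorm};
}
```

The two functions should agree to within float rounding for any finite slope, which is what makes the replacement "exact" in the sense used in this thread.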

slava77 · 2026-03-27T23:27:10Z

RecoTracker/LSTCore/src/alpaka/MiniDoublet.h

+    // Algebraic dPhi pre-check: reject if sin^2(dPhi) >= looseCutDPhi^2, using Lagrange identity.
+    // looseCutDPhi = sdSlopeSin + sqrt(mulsAndPVoff + miniLum^2) >= sin(exact_cut)
+    // via sin(A+B) <= sin(A) + B, with A = asin(sdSlopeSin), B = sqrt(...).


the comment seems misplaced: the 10 lines below seem unrelated; move it to L563 or so

What I see below is a |dPhi| < π/2 followed by |dPhi| < π/2
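The bound quoted in the code comment can be illustrated in isolation. The sketch below assumes the exact cut angle is `asin(slopeSin) + extra`, where `extra` is a hypothetical stand-in for the `sqrt(mulsAndPVoff + miniLum^2)` term; the function names are illustrative only. Because sin(A+B) <= sin(A) + B for A in [0, π/2] and B >= 0, the trig-free sum is never smaller than the sine of the exact cut, so rejecting on it cannot remove a candidate the exact cut would keep (as long as |dPhi| and the exact cut both stay below π/2).

```cpp
#include <cmath>

// Exact cut angle, assumed here to be asin(slopeSin) + extra.
inline float exactCutAngle(float slopeSin, float extra) { return std::asin(slopeSin) + extra; }

// Trig-free loose bound on sin(dPhi): slopeSin + extra >= sin(asin(slopeSin) + extra).
inline float looseCutSin(float slopeSin, float extra) { return slopeSin + extra; }

// True when |sin(dPhi)| already reaches the loose bound, so the exact cut must fail too.
// Only valid for |dPhi| < pi/2 and an exact cut angle below pi/2.
inline bool rejectEarly(float sinDPhi, float slopeSin, float extra) {
  return std::fabs(sinDPhi) >= looseCutSin(slopeSin, extra);
}
```

Candidates that survive `rejectEarly` still need the exact cut; the pre-check only saves the trig calls for the ones that clearly fail.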

slava77 · 2026-03-27T23:32:10Z

RecoTracker/LSTCore/src/alpaka/MiniDoublet.h

+    const float crossSq = crossDPhi * crossDPhi;
+    const float dotSq = dotDPhi * dotDPhi;
+
+    if (crossSq >= dotSq)


is taking squares and comparing them really cheaper than comparing abs values?

slava77 · 2026-03-27T23:34:46Z

RecoTracker/LSTCore/src/alpaka/MiniDoublet.h

+    const float r2sq = crossSq + dotSq;
+
+    if (crossSq >= looseCutDPhi * looseCutDPhi * r2sq)


I think I asked in another context: this looks like if (abs(crossSq) > looseCutDPhi * rtLower * rtUpper)
so the squares are not needed.
Is this really faster?
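To make the comparison concrete, the two forms can be written side by side. For two transverse hit positions, cross² + dot² equals (rtLower · rtUpper)², so the squared test and the absolute-value test select the same candidates for a non-negative cut, up to rounding at the boundary. The sketch below uses hypothetical names (`Hit2D`, `rejectSquared`, `rejectAbs`) and makes no timing claim; which form is cheaper likely depends on whether `rtLower` and `rtUpper` are already available at that point.

```cpp
#include <cmath>

struct Hit2D {
  float x;
  float y;
};

// Squared form, as in the hunk above: needs neither a sqrt nor the radii.
inline bool rejectSquared(Hit2D lower, Hit2D upper, float cut) {
  const float cross = lower.x * upper.y - lower.y * upper.x;  // rtLower * rtUpper * sin(dPhi)
  const float dot = lower.x * upper.x + lower.y * upper.y;    // rtLower * rtUpper * cos(dPhi)
  const float crossSq = cross * cross;
  return crossSq >= cut * cut * (crossSq + dot * dot);
}

// Abs form: assumes the transverse radii were already computed elsewhere.
inline bool rejectAbs(Hit2D lower, Hit2D upper, float rtLower, float rtUpper, float cut) {
  const float cross = lower.x * upper.y - lower.y * upper.x;
  return std::fabs(cross) >= cut * rtLower * rtUpper;
}
```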

slava77 · 2026-03-27T23:46:34Z

RecoTracker/LSTCore/src/alpaka/Segment.h

+
+    const float cosSlope = alpaka::math::sqrt(acc, 1.f - sdSlopeSin * sdSlopeSin);
+
+    if (innerMod.subdet == Barrel && outerMod.subdet == Barrel) {


So, this is a loose implementation of runSegmentDefaultAlgoBarrel and runSegmentDefaultAlgoEndcap. Would it be practical to template those and if/else the approximate parts introduced here into those more general templated functions?
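The `cosSlope` line in the hunk above is an instance of the exact trig simplifications mentioned in this PR: cos(asin(x)) = sqrt(1 - x²) for |x| <= 1. A minimal sketch, with hypothetical function names, compares the two forms; the clamp to zero is an added safeguard against a slightly negative argument from rounding and is not part of the quoted hunk.

```cpp
#include <algorithm>
#include <cmath>

// Reference form: asin followed by cos.
inline float cosSlopeTrig(float slopeSin) { return std::cos(std::asin(slopeSin)); }

// Algebraic form: a single sqrt, no trig calls.
// Whether the clamp is needed depends on how the input is bounded upstream (not shown here).
inline float cosSlopeAlgebraic(float slopeSin) {
  return std::sqrt(std::max(0.f, 1.f - slopeSin * slopeSin));
}
```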

GNiendorf force-pushed the cpu_speedups_hoist branch from f5fbe61 to 83f2297 Compare March 18, 2026 17:30

GNiendorf marked this pull request as ready for review March 18, 2026 17:37

slava77 reviewed Mar 19, 2026


GNiendorf force-pushed the cpu_speedups_hoist branch 2 times, most recently from 9aad224 to 727bac8 Compare March 19, 2026 20:50

remove redundant memory loads

2375562

GNiendorf force-pushed the cpu_speedups_hoist branch from 727bac8 to 2375562 Compare March 19, 2026 20:59

GNiendorf force-pushed the cpu_speedups_hoist branch from a2305fb to df0990a Compare March 19, 2026 23:24

GNiendorf changed the title ~~Remove Redundant Memory Loads~~ CPU Optimizations Mar 19, 2026

GNiendorf changed the title ~~CPU Optimizations~~ LST CPU Speedups Mar 19, 2026

GNiendorf force-pushed the cpu_speedups_hoist branch 2 times, most recently from 6ba26d8 to 912b6d9 Compare March 20, 2026 09:45

SegmentLinking deleted a comment from github-actions bot Mar 20, 2026

GNiendorf force-pushed the cpu_speedups_hoist branch 3 times, most recently from 03772d0 to df8220c Compare March 24, 2026 20:36

slava77 reviewed Mar 25, 2026


GNiendorf force-pushed the cpu_speedups_hoist branch 8 times, most recently from 7a81541 to 3c415b4 Compare March 26, 2026 17:46

pre-checks, exact trig simplifications, and additional early exits

117c197

GNiendorf force-pushed the cpu_speedups_hoist branch from 3c415b4 to 117c197 Compare March 27, 2026 10:57

slava77 approved these changes Mar 27, 2026

