LST CPU Speedups #245

Open
GNiendorf wants to merge 2 commits into master from cpu_speedups_hoist

@GNiendorf
Member

@GNiendorf GNiendorf commented Mar 18, 2026

This PR Timing (CPU) - commit 2 (pre-checks, exact trig simplifications, and additional early exits)
Screenshot 2026-03-27 at 11 01 56 AM
This PR Timing (CPU) - commit 1 (reducing redundant memory loads)
Screenshot 2026-03-19 at 9 13 29 PM
Master Timing (CPU)
Screenshot 2026-03-09 at 11 07 11 PM

@GNiendorf GNiendorf force-pushed the cpu_speedups_hoist branch from f5fbe61 to 83f2297 Compare March 18, 2026 17:30
@GNiendorf
Member Author

run-ci: all

@GNiendorf GNiendorf marked this pull request as ready for review March 18, 2026 17:37
@github-actions

The PR was built and ran successfully in standalone mode running on CPU. Here are some of the comparison plots.

Efficiency vs pT comparison Efficiency vs eta comparison
Fake rate vs pT comparison Fake rate vs eta comparison
Duplicate rate vs pT comparison Duplicate rate vs eta comparison

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     29.0    323.1    245.5    138.1     48.6    695.7     10.9    116.6    119.7    208.9      0.1    1936.1    1211.5+/- 290.1     602.5   explicit[s=4] (target branch)
   avg     28.1    218.8    178.6    127.2     49.7    700.7     10.6    109.5     83.0    202.6      0.1    1708.9     980.1+/- 239.1     545.5   explicit[s=4] (this PR)

@github-actions

The PR was built and ran successfully with CMSSW running on CPU. Here are some plots.

OOTB All Tracks
Efficiency and fake rate vs pT, eta, and phi

The full set of validation and comparison plots can be found here.

@GNiendorf
Member Author

run-ci: all
modifiers: gpu

@github-actions

The PR was built and ran successfully in standalone mode running on GPU. Here are some of the comparison plots.

Efficiency vs pT comparison Efficiency vs eta comparison
Fake rate vs pT comparison Fake rate vs eta comparison
Duplicate rate vs pT comparison Duplicate rate vs eta comparison

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     31.1      0.2      0.4      0.6      0.9      0.3      0.6      0.5      0.3      1.4      0.0      36.2       4.8+/-  2.7      36.2   explicit[s=1]
   avg      1.1      0.3      0.5      0.8      1.0      0.3      0.8      0.7      0.4      1.8      0.0       7.7       6.3+/-  2.8       4.0   explicit[s=2]
   avg      2.0      0.6      0.8      1.2      1.5      0.4      1.2      1.0      0.6      2.8      0.0      12.1       9.7+/-  3.5       3.2   explicit[s=4]
   avg      3.2      0.9      1.2      1.7      2.0      0.5      1.7      1.3      0.8      3.9      0.0      17.2      13.5+/-  4.3       3.0   explicit[s=6]
   avg      3.7      1.3      1.7      2.4      2.6      0.7      2.3      1.6      1.0      4.9      0.0      22.3      17.9+/-  4.6       2.9   explicit[s=8] (target branch)
   avg     31.1      0.2      0.4      0.6      0.9      0.3      0.6      0.5      0.3      1.4      0.0      36.2       4.8+/-  2.6      36.3   explicit[s=1]
   avg      1.3      0.3      0.5      0.7      1.0      0.3      0.8      0.7      0.4      1.8      0.0       7.9       6.4+/-  2.8       4.1   explicit[s=2]
   avg      2.2      0.6      0.8      1.2      1.5      0.4      1.2      1.0      0.6      2.8      0.0      12.2       9.6+/-  3.3       3.2   explicit[s=4]
   avg      3.0      0.9      1.2      1.7      2.1      0.5      1.7      1.3      0.8      3.8      0.0      17.0      13.5+/-  4.1       3.0   explicit[s=6]
   avg      3.6      1.3      1.7      2.3      2.6      0.7      2.2      1.7      1.0      5.0      0.0      22.2      18.0+/-  4.5       2.9   explicit[s=8] (this PR)

@GNiendorf
Member Author

@slava77 I think this PR is good to go. Represents most of the boiler-plate changes of the CPU speedups PR.

@github-actions

The PR was built and ran successfully with CMSSW running on GPU. Here are some plots.

OOTB All Tracks
Efficiency and fake rate vs pT, eta, and phi

The full set of validation and comparison plots can be found here.

@slava77

slava77 commented Mar 19, 2026

image

the GPU variant should have one more significant digit in the component columns (the total can still be with .1).
I don't have a particular preference for doing it in this PR or a separate one.

@slava77 slava77 left a comment

nice updates.
I think the comment cleanup in the MiniDoublet code is a bit too aggressive. While removing some tautological comments is fine, quite a bit of clarity is going to be lost. Please restore them.

@GNiendorf GNiendorf force-pushed the cpu_speedups_hoist branch 2 times, most recently from 9aad224 to 727bac8 Compare March 19, 2026 20:50
@GNiendorf
Member Author

run-ci: all

@GNiendorf GNiendorf force-pushed the cpu_speedups_hoist branch from 727bac8 to 2375562 Compare March 19, 2026 20:59
@GNiendorf
Member Author

run-ci: all

@GNiendorf
Member Author

run-ci: all
modifiers: gpu

@github-actions

The PR was built and ran successfully in standalone mode running on CPU. Here are some of the comparison plots.

Efficiency vs pT comparison Efficiency vs eta comparison
Fake rate vs pT comparison Fake rate vs eta comparison
Duplicate rate vs pT comparison Duplicate rate vs eta comparison

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     28.2    324.2    243.0    136.6     47.8    698.4     10.9    114.7    118.8    208.7      0.1    1931.4    1204.8+/- 289.9     596.7   explicit[s=4] (target branch)
   avg     31.1    219.2    182.6    133.5     47.6    698.9     10.8    110.7     83.1    185.9      0.1    1703.5     973.4+/- 227.2     541.9   explicit[s=4] (this PR)

@github-actions

The PR was built and ran successfully with CMSSW running on GPU. Here are some plots.

OOTB All Tracks
Efficiency and fake rate vs pT, eta, and phi

The full set of validation and comparison plots can be found here.

@github-actions

The PR was built and ran successfully with CMSSW running on CPU. Here are some plots.

OOTB All Tracks
Efficiency and fake rate vs pT, eta, and phi

The full set of validation and comparison plots can be found here.

@github-actions

The PR was built and ran successfully in standalone mode running on GPU. Here are some of the comparison plots.

Efficiency vs pT comparison Efficiency vs eta comparison
Fake rate vs pT comparison Fake rate vs eta comparison
Duplicate rate vs pT comparison Duplicate rate vs eta comparison

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     32.5      0.2      0.4      0.6      0.9      0.3      0.6      0.5      0.3      1.4      0.0      37.6       4.8+/-  2.6      37.6   explicit[s=1]
   avg      1.1      0.4      0.5      0.8      1.0      0.3      0.8      0.7      0.4      1.8      0.0       7.8       6.4+/-  2.9       4.0   explicit[s=2]
   avg      1.8      0.6      0.8      1.1      1.5      0.4      1.2      1.0      0.6      2.8      0.0      11.9       9.7+/-  3.5       3.1   explicit[s=4]
   avg      2.6      0.9      1.3      1.7      2.0      0.5      1.7      1.2      0.8      3.9      0.0      16.6      13.5+/-  4.1       2.9   explicit[s=6]
   avg      3.4      1.3      1.7      2.3      2.6      0.7      2.3      1.6      1.0      5.0      0.0      21.9      17.8+/-  4.5       2.8   explicit[s=8] (target branch)
   avg     32.6      0.2      0.4      0.6      0.8      0.3      0.6      0.5      0.3      1.4      0.0      37.7       4.8+/-  2.5      37.7   explicit[s=1]
   avg      1.2      0.4      0.5      0.8      1.0      0.3      0.8      0.8      0.4      1.9      0.0       7.9       6.5+/-  2.8       4.1   explicit[s=2]
   avg      1.8      0.6      0.8      1.1      1.5      0.4      1.2      1.0      0.6      2.8      0.0      11.8       9.6+/-  3.5       3.1   explicit[s=4]
   avg      2.6      1.0      1.2      1.7      2.0      0.5      1.7      1.3      0.8      4.0      0.0      16.7      13.6+/-  4.0       2.9   explicit[s=6]
   avg      3.4      1.3      1.7      2.2      2.6      0.7      2.2      1.7      1.0      5.0      0.0      21.9      17.8+/-  4.5       2.8   explicit[s=8] (this PR)

@GNiendorf
Member Author

run-ci: all

@GNiendorf GNiendorf force-pushed the cpu_speedups_hoist branch from a2305fb to df0990a Compare March 19, 2026 23:24
@GNiendorf GNiendorf changed the title Remove Redundant Memory Loads CPU Optimizations Mar 19, 2026
@GNiendorf GNiendorf changed the title CPU Optimizations LST CPU Speedups Mar 19, 2026
@GNiendorf GNiendorf force-pushed the cpu_speedups_hoist branch 2 times, most recently from 6ba26d8 to 912b6d9 Compare March 20, 2026 09:45
@SegmentLinking SegmentLinking deleted a comment from github-actions bot Mar 20, 2026
@SegmentLinking SegmentLinking deleted a comment from github-actions bot Mar 20, 2026
@GNiendorf GNiendorf force-pushed the cpu_speedups_hoist branch 3 times, most recently from 03772d0 to df8220c Compare March 24, 2026 20:36
@GNiendorf
Member Author

@slava77 Can you check the changes in the segments file? I think I addressed your concerns.

@GNiendorf
Member Author

run-ci: all

@github-actions

The PR was built and ran successfully in standalone mode running on CPU. Here are some of the comparison plots.

Efficiency vs pT comparison Efficiency vs eta comparison
Fake rate vs pT comparison Fake rate vs eta comparison
Duplicate rate vs pT comparison Duplicate rate vs eta comparison

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     27.9    324.5    242.0    133.4     46.1    697.3     10.8    114.6    117.2    208.7      0.1    1922.7    1197.5+/- 284.0     595.6   explicit[s=4] (target branch)
   avg     31.2    105.0    121.5    127.1     49.7    703.0     10.8     39.8     68.6    206.5      0.8    1463.9     729.7+/- 184.0     481.5   explicit[s=4] (this PR)

@github-actions

The PR was built and ran successfully with CMSSW running on CPU. Here are some plots.

OOTB All Tracks
Efficiency and fake rate vs pT, eta, and phi

The full set of validation and comparison plots can be found here.

@slava77 slava77 left a comment

I suggest setting up a reviewing agent that ensures that similar changes remain consistently implemented in different places, including consistent naming.

@@ -799,21 +852,81 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE::lst {

template <alpaka::concepts::Acc TAcc>
ALPAKA_FN_ACC ALPAKA_FN_INLINE bool passDeltaPhiCutsSelector(TAcc const& acc,

I still don't understand the changes in this method.
Please don't just generate more code; clarify in plain English.

This method is used in CountMiniDoubletConnections and is logically meant to have either loose or full cuts.
Adding the dz cuts is probably OK, but on one hand it's feature creep, and on the other the method name is bad now.
Connected to the memory toggle PR, it's probably better to template this method on a true/false flag that selects loose or full cuts.
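
A minimal sketch of what such a templated selector could look like, written as plain C++17 without alpaka. `HitPair`, `passCuts`, the cut values, and the choice that "loose" means dPhi only while "full" adds dz are all assumptions for illustration, not the actual LST code:

```cpp
#include <cmath>

// Hypothetical stand-in for the selector inputs; not an LST type.
struct HitPair {
  float dPhi;  // azimuthal difference between the two hits
  float dz;    // longitudinal difference between the two hits
};

// kFullCuts = false: loose selection (dPhi only), e.g. for a counting pass.
// kFullCuts = true:  full selection (dPhi and dz).
template <bool kFullCuts>
inline bool passCuts(HitPair const& pair, float dPhiCut, float dzCut) {
  if (std::fabs(pair.dPhi) >= dPhiCut)
    return false;
  if constexpr (kFullCuts) {
    // This branch is compiled only into the full-cut instantiation.
    if (std::fabs(pair.dz) >= dzCut)
      return false;
  }
  return true;
}
```

With this shape a counting kernel would call `passCuts<false>(...)` and the building kernel `passCuts<true>(...)`, so the name no longer has to describe which cuts are applied.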

@GNiendorf GNiendorf force-pushed the cpu_speedups_hoist branch 8 times, most recently from 7a81541 to 3c415b4 Compare March 26, 2026 17:46
@GNiendorf
Member Author

@slava77 - Proof: the pre-checks in segments.h decrease the LS timing from 83.1 to 75.6 (single stream).

This PR Timing
Screenshot 2026-03-26 at 6 55 40 PM

This PR Timing if I remove the two pre-checks in passDeltaPhiCutsBarrel and passDeltaPhiCutsEndcap
Screenshot 2026-03-26 at 7 24 44 PM

@GNiendorf GNiendorf force-pushed the cpu_speedups_hoist branch from 3c415b4 to 117c197 Compare March 27, 2026 10:57
@GNiendorf
Member Author

I forgot a dphichange precheck in the MD endcap path, added it and the timing decreases from 62.6->47.7
Screenshot 2026-03-27 at 11 01 56 AM

@GNiendorf
Member Author

run-ci: all

@github-actions

The PR was built and ran successfully in standalone mode running on CPU. Here are some of the comparison plots.

Efficiency vs pT comparison Efficiency vs eta comparison
Fake rate vs pT comparison Fake rate vs eta comparison
Duplicate rate vs pT comparison Duplicate rate vs eta comparison

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     28.5    323.6    247.1    139.1     48.4    701.6     10.9    114.5    117.5    208.6      0.1    1939.9    1209.8+/- 292.7     598.3   explicit[s=4] (target branch)
   avg     29.0     94.1    125.9    126.2     53.5    688.2     11.8     40.3     71.3    189.0      0.1    1429.5     712.3+/- 179.7     478.1   explicit[s=4] (this PR)

@github-actions

The PR was built and ran successfully with CMSSW running on CPU. Here are some plots.

OOTB All Tracks
Efficiency and fake rate vs pT, eta, and phi

The full set of validation and comparison plots can be found here.

@slava77

slava77 commented Mar 27, 2026

I forgot a dphichange precheck in the MD endcap path, added it and the timing decreases from 62.6->47.7

is there a way to automate extraction of the improvements by category?
It looks like you need to do some of these manually.

@slava77

slava77 commented Mar 27, 2026

at least some of the comments (especially those not implemented) implied a response; is it traceable?
just silently making new changes is not the best

@GNiendorf
Member Author

I forgot a dphichange precheck in the MD endcap path, added it and the timing decreases from 62.6->47.7

is there a way to automate extraction of the improvements by category? It looks like you need to do some of these manually.

Can you clarify what you mean? I've shown the timing improvement from the two segments pre-checks, I feel like most of the others are obvious (e.g. using algebraic pre-checks to avoid tan or sin calls in hot paths, using exact trig identities to reduce the number of trig calls while keeping the results exact, etc.)

@slava77

slava77 commented Mar 27, 2026

Can you clarify what you mean?

I'm asking to quantify the gains by category; or, rather to automate it (mentioned already in the earlier review, was it Monday or so).
On one hand, this is going to be informative for this or other (future) code updates.
On the other, if the agentic setup is used, it's better to have somewhat incremental analysis of changes to see the progress/justification instead of just the final result.

This is somewhat different from a request of proof that the gain is positive at all.
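
As a rough illustration of per-category extraction, the sketch below computes the relative change per stage from two rows of the CI timing table. It uses the explicit[s=4] CPU rows from the latest standalone comparison in this thread. The rows appear to carry one more value than the header has labels between pLS and the event total, so only the columns up to pLS plus the total (the value just before the `+/-` column, which matches the sum of the per-stage values) are used. All names in the code are hypothetical:

```cpp
#include <array>
#include <cstddef>
#include <cstdio>

// Stage labels from the timing table header, restricted to columns whose
// label-to-value mapping is unambiguous in the posted tables.
constexpr std::size_t kNumStages = 7;
constexpr std::array<const char*, kNumStages> kStages = {
    "Hits", "MD", "LS", "T3", "T5", "pLS", "Event"};

using Row = std::array<double, kNumStages>;

// Signed relative change in percent; negative means the PR is faster.
inline double relativeChangePercent(double target, double pr) {
  return 100.0 * (pr - target) / target;
}

// Prints one line per stage: target time, PR time, relative change.
inline void printGains(Row const& target, Row const& pr) {
  for (std::size_t i = 0; i < kNumStages; ++i) {
    std::printf("%-6s %8.1f -> %8.1f  (%+6.1f%%)\n", kStages[i], target[i], pr[i],
                relativeChangePercent(target[i], pr[i]));
  }
}

// Values copied from the explicit[s=4] rows of the CPU timing comparison.
constexpr Row kTargetBranch = {28.5, 323.6, 247.1, 139.1, 48.4, 701.6, 1939.9};
constexpr Row kThisPR = {29.0, 94.1, 125.9, 126.2, 53.5, 688.2, 1429.5};
```

Calling `printGains(kTargetBranch, kThisPR)` gives one line per stage. Running the same comparison after each individual commit, rather than only on the squashed result, would give the incremental view requested here.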

@slava77

slava77 commented Mar 27, 2026

Can you clarify what you mean?

From Mar 20:

Ideally, before squashing everything to a single commit it would've been better to trace a sequence of improvements in finer chunks.

Is this kind of workflow possible?
I can manage through this PR without it, but is it possible for the future agent-driven updates?

@GNiendorf
Member Author

Is this kind of workflow possible? I can manage through this PR without it, but is it possible for the future agent-driven updates?

I'm not sure off the top of my head. We could try to draft a few paragraphs of instructions for agents working on LST about our preferences for smaller commits, naming consistency with the rest of the codebase, etc. and see how well they follow the instructions.

@slava77 slava77 left a comment

the main comment is related to passLooseSegmentCuts, but perhaps I'm just arguing about a part that has parallels in other precompute methods for other objects. So, changing here would only make things partially better just here.
In that case it's better to review and reuse this full/Loose selection logic in a follow up PR for other objects.

Comment on lines +307 to +308
// Algebraic: sin(atan(slope)) = |slope|/sqrt(1+slope^2), cos(atan(slope)) = 1/sqrt(1+slope^2)
float drprime_x, drprime_y;

Suggested change
- // Algebraic: sin(atan(slope)) = |slope|/sqrt(1+slope^2), cos(atan(slope)) = 1/sqrt(1+slope^2)
- float drprime_x, drprime_y;
+ float drprime_x, drprime_y; // drprime * {sin,cos} (atan(slope))
+ // Algebraic: sin(atan(slope)) = |slope|/sqrt(1+slope^2), cos(atan(slope)) = 1/sqrt(1+slope^2)

otherwise too much context is removed to follow what this code does
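
For context, the identity in that comment can be checked in isolation. The sketch below is plain C++ with hypothetical names (`SinCos`, `sinCosViaTrig`, `sinCosViaAlgebra`) and uses the signed form sin(atan(s)) = s/sqrt(1+s^2); taking the absolute value gives the |slope| form quoted in the comment:

```cpp
#include <cmath>

struct SinCos {
  float sinAngle;
  float cosAngle;
};

// Reference version: one atan followed by a sin and a cos call.
inline SinCos sinCosViaTrig(float slope) {
  const float angle = std::atan(slope);
  return {std::sin(angle), std::cos(angle)};
}

// Algebraic version: one sqrt and one division, no trig calls.
// cos(atan(s)) = 1/sqrt(1+s^2) is always positive because atan(s) lies in (-pi/2, pi/2).
inline SinCos sinCosViaAlgebra(float slope) {
  const float invNorm = 1.f / std::sqrt(1.f + slope * slope);
  return {slope * invNorm, invNorm};
}
```

The two versions agree up to floating-point rounding, which is why a comment tying `drprime_x` and `drprime_y` back to `drprime * {sin,cos}(atan(slope))` helps a reader see what the algebra replaces.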

Comment on lines +540 to +542
// Algebraic dPhi pre-check: reject if sin^2(dPhi) >= looseCutDPhi^2, using Lagrange identity.
// looseCutDPhi = sdSlopeSin + sqrt(mulsAndPVoff + miniLum^2) >= sin(exact_cut)
// via sin(A+B) <= sin(A) + B, with A = asin(sdSlopeSin), B = sqrt(...).

the comment seems misplaced: the 10 lines below seem unrelated; move it to L563 or so

What I see below is a |dPhi| < π/2 followed by |dPhi| < π/2
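
For reference, the bound stated in the quoted code comment can be derived in two steps. This is a sketch assuming, as the comment's definitions suggest, that A = asin(sdSlopeSin) lies in [0, π/2], that B = sqrt(mulsAndPVoff + miniLum^2) is non-negative, and that the exact cut is A + B:

```latex
\begin{aligned}
\sin(A+B) &= \sin A\,\cos B + \cos A\,\sin B \\
          &\le \sin A + \lvert \sin B \rvert
             && (\sin A \ge 0,\ \cos B \le 1,\ 0 \le \cos A \le 1) \\
          &\le \sin A + B
             && (\lvert \sin B \rvert \le B \ \text{for}\ B \ge 0)
\end{aligned}
```

With sin A = sdSlopeSin this gives looseCutDPhi >= sin(exact_cut). Using it to reject on sin^2(dPhi) additionally relies on sine being monotonic over the range considered, that is, on |dPhi| having already been restricted to below π/2.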

Comment on lines +546 to +549
const float crossSq = crossDPhi * crossDPhi;
const float dotSq = dotDPhi * dotDPhi;

if (crossSq >= dotSq)

is taking squares and comparing them really cheaper than comparing abs values?
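
The two forms of the comparison are logically equivalent apart from overflow or underflow in the squares, as the hypothetical helpers below illustrate. Which one is cheaper is not something this sketch can settle; it would have to be measured. One thing visible in the diff is that `crossSq` and `dotSq` are reused a few lines later to form `r2sq`, so the squares are not computed only for this comparison:

```cpp
#include <cmath>

// Rejects when |tan(dPhi)| >= 1, written with squares as in the PR.
inline bool rejectBySquares(float cross, float dot) {
  return cross * cross >= dot * dot;
}

// Same condition written with absolute values, as suggested in review.
inline bool rejectByAbs(float cross, float dot) {
  return std::fabs(cross) >= std::fabs(dot);
}
```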

Comment on lines +566 to +568
const float r2sq = crossSq + dotSq;

if (crossSq >= looseCutDPhi * looseCutDPhi * r2sq)

I think I asked in another context: this looks like if (abs(crossDPhi) > looseCutDPhi * rtLower * rtUpper)
so the squares are not needed.
Is this really faster?
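
The equivalence behind this question follows from Lagrange's identity in two dimensions: cross^2 + dot^2 = |a|^2 |b|^2. The sketch below compares the two forms using hypothetical names (`Vec2`, `rejectSquared`, `rejectWithRadii`). It assumes `rtLower` and `rtUpper` are the transverse radii of the two hits; here they are recomputed with `std::hypot`, whereas in the real code they may already be available:

```cpp
#include <cmath>

struct Vec2 {
  float x;
  float y;
};

// Squared form, as in the PR: no sqrt, uses cross^2 + dot^2 = |a|^2 |b|^2.
inline bool rejectSquared(Vec2 a, Vec2 b, float cut) {
  const float cross = a.x * b.y - a.y * b.x;
  const float dot = a.x * b.x + a.y * b.y;
  const float crossSq = cross * cross;
  const float r2sq = crossSq + dot * dot;
  return crossSq >= cut * cut * r2sq;
}

// Form suggested in review: compare |cross| against cut * rtLower * rtUpper.
inline bool rejectWithRadii(Vec2 a, Vec2 b, float cut) {
  const float cross = a.x * b.y - a.y * b.x;
  const float rtLower = std::hypot(a.x, a.y);
  const float rtUpper = std::hypot(b.x, b.y);
  return std::fabs(cross) >= cut * rtLower * rtUpper;
}
```

Both functions implement |sin(dPhi)| >= cut for a non-negative cut. If the radii are already in registers, the second form avoids the squares; if not, the first form avoids the square roots.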


const float cosSlope = alpaka::math::sqrt(acc, 1.f - sdSlopeSin * sdSlopeSin);

if (innerMod.subdet == Barrel && outerMod.subdet == Barrel) {

So, this is a loose implementation of runSegmentDefaultAlgoBarrel and runSegmentDefaultAlgoEndcap. Would it be practical to template those and if/else the approximate parts introduced here into those more general templated functions?
