Remove j and block Kmkhz 9c #325

Open
Benjamin Went (MetBenjaminWent) wants to merge 7 commits into MetOffice:main from MetBenjaminWent:Kmkhz-9c-by-hand-2

@MetBenjaminWent Benjamin Went (MetBenjaminWent) commented Mar 4, 2026

PR Summary

Sci/Tech Reviewer: Hacka Fett (@christophermaynard)
Code Reviewer: Harry Shepherd (@harry-shepherd)

Remove the j loop which is adversely affecting performance in the boundary layer, and improve blocking loops.

Performance data can be found in the umbrella issue, #106
closes #217
Initial KGO issues were fixed by adding denom to the private list at L2546.

Taking over from #221.

Introducing further blocking loops has introduced new KGO changes, which appear to be driven by fast-debug O2 optimisation.
I've also noted that, with loop ranges differing over threads, reducing the number of nowaits is required. This opens the option of removing the barrier to aid further PSyclone work in the future (the barriers will remain for now).
These occurrences have likely contributed to the KGO changes at 1T. Full-debug remains unaffected, which still points to optimisation as the likely root cause.
As the KGO updates are holding (for all of the tests involved in the OMP dev group) for 1, 2 or 4T, I'm reasonably confident that it's not a newly introduced race condition or similar.

The segmentation variable will be updated in #382 to help this ticket avoid further conflicts with HoT.

Code Quality Checklist

  • I have performed a self-review of my own code
  • My code follows the project's style guidelines
  • Comments have been included that aid understanding and enhance the readability of the code
  • My changes generate no new warnings
  • All automated checks in the CI pipeline have completed successfully

Testing

  • I have tested this change locally, using the LFRic Apps rose-stem suite
  • If any tests fail (rose-stem or CI) the reason is understood and acceptable (e.g. kgo changes)
  • I have added tests to cover new functionality as appropriate (e.g. system tests, unit tests, etc.)
  • Any new tests have been assigned an appropriate amount of compute resource and have been allocated to an appropriate testing group (i.e. the developer tests are for jobs which use a small amount of compute resource and complete in a matter of minutes)

trac.log

Test Suite Results - lfric_apps - Kmkhz-9c-by-hand-2/run7

Suite Information

| Item | Value |
| --- | --- |
| Suite Name | Kmkhz-9c-by-hand-2/run7 |
| Suite User | benjamin.went |
| Workflow Start | 2026-03-05T11:07:23 |
| Groups Run | ex1a_omp_developer |

| Dependency | Reference | Main Like |
| --- | --- | --- |
| casim | MetOffice/casim@2026.03.1 | True |
| jules | MetOffice/jules@2026.03.1 | True |
| lfric_apps | MetBenjaminWent/lfric_apps@Kmkhz-9c-by-hand-2 | False |
| lfric_core | MetOffice/lfric_core@2026.03.1 | True |
| moci | MetOffice/moci@2026.03.1 | True |
| SimSys_Scripts | MetOffice/SimSys_Scripts@2026.03.1 | True |
| socrates | MetOffice/socrates@2026.03.1 | True |
| socrates-spectral | MetOffice/socrates-spectral@2026.03.1 | True |
| ukca | MetOffice/ukca@2026.03.1 | True |

Task Information

✅ succeeded tasks - 51

Test Suite Results - lfric_apps - Kmkhz-9c-by-hand-2/run20

Suite Information

| Item | Value |
| --- | --- |
| Suite Name | Kmkhz-9c-by-hand-2/run20 |
| Suite User | benjamin.went |
| Workflow Start | 2026-04-08T08:45:58 |
| Groups Run | all |

| Dependency | Reference | Main Like |
| --- | --- | --- |
| casim | MetOffice/casim@2026.03.2 | True |
| jules | MetOffice/jules@2026.03.2 | True |
| lfric_apps | MetBenjaminWent/lfric_apps@Kmkhz-9c-by-hand-2 | False |
| lfric_core | MetOffice/lfric_core@018e40c | True |
| moci | MetOffice/moci@2026.03.2 | True |
| SimSys_Scripts | MetOffice/SimSys_Scripts@4387949 | True |
| socrates | MetOffice/socrates@2026.03.2 | True |
| socrates-spectral | MetOffice/socrates-spectral@2026.03.2 | True |
| ukca | MetOffice/ukca@2026.03.2 | True |

Task Information

❌ failed tasks - 3
| Task | State |
| --- | --- |
| run_gungho_model_baroclinic-alt2-C24_MG_op_azspice_gnu_fast-debug-64bit | failed |
| run_gungho_model_rk-dcmip301-C24_azspice_gnu_fast-debug-64bit | failed |
| run_shallow_water_galewsky_vi-C48_azspice_gnu_fast-debug-64bit | failed |

These failures are timeouts and are unrelated to the changes.
✅ succeeded tasks - 1502

Security Considerations

  • I have reviewed my changes for potential security issues
  • Sensitive data is properly handled (if applicable)
  • Authentication and authorisation are properly implemented (if applicable)

Performance Impact

  • Performance of the code has been considered and, if applicable, suitable performance measurements have been conducted

AI Assistance and Attribution

  • Some of the content of this change has been produced with the assistance of Generative AI tool name (e.g., Met Office Github Copilot Enterprise, Github Copilot Personal, ChatGPT GPT-4, etc) and I have followed the Simulation Systems AI policy (including attribution labels)

Documentation

  • Where appropriate I have updated documentation related to this change and confirmed that it builds correctly

PSyclone Approval

  • If you have edited any PSyclone-related code (e.g. PSyKAl-lite, Kernel interface, optimisation scripts, LFRic data structure code) then please contact the TCD Team

Sci/Tech Review

  • I understand this area of code and the changes being added
  • The proposed changes correspond to the pull request description
  • Documentation is sufficient (do documentation papers need updating)
  • Sufficient testing has been completed

(Please alert the code reviewer via a tag when you have approved the SR)

Code Review

  • All dependencies have been resolved
  • Related Issues have been properly linked and addressed
  • CLA compliance has been confirmed
  • Code quality standards have been met
  • Tests are adequate and have passed
  • Documentation is complete and accurate
  • Security considerations have been addressed
  • Performance impact is acceptable

@MetBenjaminWent Benjamin Went (MetBenjaminWent) changed the title from "post changes from existing PR to new PR" to "Remove j and block Kmkhz 9c" Mar 4, 2026
@MetBenjaminWent
Contributor Author

kmkhz_9c_after.txt
kmkhz_9c_before.txt

Listing files before and after.

@MetBenjaminWent
Contributor Author

From #221:

Running into some KGO issues at fast debug.
Fixing the denom value, the only new LHS variable in the history that was not in the OMP sections, means that fast-debug should be okay.
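
A minimal sketch of the kind of fix described here, under stated assumptions: only the name denom comes from this thread, while the arrays a and b, the size n and the arithmetic are hypothetical and are not taken from kmkhz_9c. It shows a scratch variable that is assigned inside a worksharing loop and therefore needs to appear in the private list.

```fortran
program private_scratch
  implicit none
  integer, parameter :: n = 1000   ! hypothetical problem size
  integer :: i
  real :: a(n), b(n)               ! hypothetical arrays
  real :: denom                    ! scratch value written on every iteration

  do i = 1, n
    a(i) = real(i)
  end do

  ! denom appears on the left-hand side inside the loop, so every thread
  ! needs its own copy. With default(shared) and no private clause, one
  ! thread could overwrite denom between another thread's assignment and use.
  !$OMP parallel do default(shared) private(i, denom)
  do i = 1, n
    denom = 1.0 + a(i)
    b(i) = a(i) / denom
  end do
  !$OMP end parallel do

  print *, b(1), b(n)
end program private_scratch
```

Dropping denom from the private clause in a sketch like this would typically give results that vary from run to run on more than one thread, which is consistent with the earlier KGOs not holding between runs.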

The testing below at full-debug indicates that the KGO is good: whilst 1T and 4T are technically newly added, 2T is consistent with trunk, and the results hold between runs, which they did not before the denom fix above.

The most recent changes here increase the number of ii blocking loops present with the dynamic schedule.
I'll see if an update holds, otherwise my current leading theory is that, like lsppn, the blocking dynamic loops at fast debug are running into some trouble and being over-optimised. Using the UM flags resolved it.

Test Suite Results - lfric_apps - kmkhz_9c_pysclone/run20

Suite Information

| Item | Value |
| --- | --- |
| Suite Name | kmkhz_9c_pysclone/run20 |
| Suite User | benjamin.went |
| Workflow Start | 2026-02-10T12:53:47 |
| Groups Run | ex1a_omp_C48_cce_full |

| Dependency | Reference | Main Like |
| --- | --- | --- |
| casim | MetOffice/casim@2025.12.1 | True |
| jules | MetOffice/jules@69aaf4d | True |
| lfric_apps | MetBenjaminWent/lfric_apps@kmkhz_9c_pysclone | False |
| lfric_core | MetOffice/lfric_core@bbb3d8a | True |
| moci | MetOffice/moci@2025.12.1 | True |
| SimSys_Scripts | MetOffice/SimSys_Scripts@2025.12.1 | True |
| socrates | MetOffice/socrates@2025.12.1 | True |
| socrates-spectral | MetOffice/socrates-spectral@2025.12.1 | True |
| ukca | MetOffice/ukca@2025.12.1 | True |

Task Information

✅ succeeded tasks - 12

@MetBenjaminWent
Contributor Author

From #221:
Given that full-debug does not change and holds between runs, is this likely an optimisation occurrence from CCE?

Updating the CCE KGOs shows that it holds.

Further testing: do they hold as seg-size changes?

Test Suite Results - lfric_apps - kmkhz_9c_pysclone/run21

Suite Information

| Item | Value |
| --- | --- |
| Suite Name | kmkhz_9c_pysclone/run21 |
| Suite User | benjamin.went |
| Workflow Start | 2026-02-11T14:48:18 |
| Groups Run | ex1a_omp_developer |

| Dependency | Reference | Main Like |
| --- | --- | --- |
| casim | MetOffice/casim@2025.12.1 | True |
| jules | MetOffice/jules@69aaf4d | True |
| lfric_apps | MetBenjaminWent/lfric_apps@kmkhz_9c_pysclone | False |
| lfric_core | MetOffice/lfric_core@bbb3d8a | True |
| moci | MetOffice/moci@2025.12.1 | True |
| SimSys_Scripts | MetOffice/SimSys_Scripts@2025.12.1 | True |
| socrates | MetOffice/socrates@2025.12.1 | True |
| socrates-spectral | MetOffice/socrates-spectral@2025.12.1 | True |
| ukca | MetOffice/ukca@2025.12.1 | True |

Task Information

✅ succeeded tasks - 51

@MetBenjaminWent
Contributor Author

From #221

Barriers have been left. They should be removable (after removing some of the nowaits), but I'll look at this further in the PSyclone version of this ticket.

The current leading thought is that the KGO changes are optimisation changes, but they are more widespread.

The initial KGO change was likely caused by the introduction of the seg-size loops: the bounds changed but the nowaits did not, so the threaded runs changed and then stabilised. Changing the segment size after updating the KGOs does not cause them to change.

Changing further synchronisation points may have allowed the compiler to change how it optimises, generating further KGO shifts.

Full-debug otherwise remains consistent as a reference point, whereas previously, with a genuine bug, it did not. This reinforces that the KGO change is optimisation driven.

@Adrian-Lock Adrian Lock (Adrian-Lock) left a comment

All looks fine to me

@christophermaynard
Collaborator

This is a very complicated code - 4000+ lines with non-trivial flow control. Understanding the OpenMP behaviour is not straightforward. denom as default(shared) and not private is a bug, which will result in a KGO change on its own. Some changes which will hopefully improve the performance at 4 threads:

  1. Restore !$OMP nowait where possible. If the iteration space and OpenMP schedule are the same as the preceding loop, then the nowait clause is benign. This may be complicated by the flow control.
  2. Line 1551. Block on i and hoist the block and OpenMP do outside the k-loop. Lots of small OpenMP loops (horizontal domain per layer) are probably imposing a synchronisation overhead.
  3. Lines 1726, 1796 and 3457. Remove the double loop, i.e. OpenMP do across blocked i. Can have multiple loops over each i-block or i-segment within.
  4. Line 2964 vectorises an inner loop. Need to add a blocked outer i-loop so vectorisation occurs over cells in the block or segment.
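
A minimal sketch of point 1, under stated assumptions: the arrays a and b and the size n are hypothetical and are not taken from kmkhz_9c. It shows the case where nowait is benign, because both loops share the same bounds and schedule(static) within one parallel region, so each thread is handed the same iterations in both loops.

```fortran
program nowait_static
  implicit none
  integer, parameter :: n = 1000   ! hypothetical problem size
  integer :: i
  real :: a(n), b(n)               ! hypothetical arrays

  !$OMP parallel default(shared) private(i)
  ! Same iteration count and schedule(static) in both loops: a thread only
  ! reads the a(i) values that it wrote itself, so skipping the implicit
  ! barrier after the first loop is safe.
  !$OMP do schedule(static)
  do i = 1, n
    a(i) = real(i)
  end do
  !$OMP end do nowait
  !$OMP do schedule(static)
  do i = 1, n
    b(i) = 2.0 * a(i)
  end do
  !$OMP end do
  !$OMP end parallel

  print *, b(1), b(n)
end program nowait_static
```

The same reasoning does not carry over once the second loop has different bounds (for example a blocked ii loop following an i loop) or uses schedule(dynamic), since a thread may then read elements that another thread has not yet written.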

rht_max(i,j) = zero
end do
!$OMP end do
do k = 1, bl_levels-1
Contributor Author

Note the existing comment: ! j-loop outermost to allow parallelisation (k-loop is sequential)

I don't think there is a better way to do this. I agree with the cost, but without j, we might have to pay it.

We could block it instead?

end do
end do
!$OMP end do NOWAIT
!$OMP end do
Contributor Author

The loop at L1125 is over the i range. All of the loops in the if block(s) starting at L1082 are over k.
Given the changing dimension ranges, nowaits are not appropriate here.

sls_inc(i,j,k) = zero
qls_inc(i,j,k) = zero
end do ! i
!$OMP end do
Contributor Author

@MetBenjaminWent Benjamin Went (MetBenjaminWent) Apr 7, 2026

I'm aiming to remove the barrier below at L1146 to help with PSyclone-ing this file, but I don't want to do that alongside these KGO changes, so that the barrier can be ruled out as their cause. I'll add this nowait back in, as it can be removed along with the barrier in the future.
As the barrier exists, the range change into L1149 is okay with a nowait; if the barrier goes, so will the nowait.

end do
!$OMP end do NOWAIT
!$OMP end do
Contributor Author

The range changes from the i loop to the i_wt loop, so a nowait is not appropriate.

end do ! k
end do ! i_wt
!$OMP end do NOWAIT
Contributor Author

The above loop is over i, and the below loop is over ii, a blocked version of i.
NOWAIT isn't appropriate here.

end do
end do ! jj
!$OMP end do NOWAIT
end do ! ii
Contributor Author

Nowait removed, as this moves from the ii range to the i range and threads could be on the same index.

end do
!$OMP end do NOWAIT
!$OMP end do
Contributor Author

Nowait removed, as this moves from the ii range to the i range and threads could be on the same index.

end do
!$OMP end do NOWAIT
!$OMP end do
Contributor Author

This nowait was removed because the plan is to remove the barriers to assist PSyclone, but it can be left in for now: we are keeping the barrier so that it can be ruled out of the discussion about KGO changes.

end do
end do
!$OMP end do NOWAIT
!$OMP end do
Contributor Author

The range changes from k here (and ii post the next set of changes), to i below.

end do
!$OMP end do NOWAIT
!$OMP end do
Contributor Author

The range changes from i to ii, so threads could possibly affect the same index ranges.

end do
end do
end do ! jj
!$OMP end do NOWAIT
end do ! ii
Contributor Author

@MetBenjaminWent Benjamin Went (MetBenjaminWent) Apr 7, 2026

ii into ii may be okay, but given the dynamic schedule, the thread may end up on the same chunk

@MetBenjaminWent
Contributor Author

This is a very complicated code - 4000+ lines with non-trivial flow control. Understanding the OpenMP behaviour is not straightforward. denom as default(shared) and not private is a bug, which will result in a KGO change on its own. Some changes which will hopefully improve the performance at 4 threads:

  1. Restore !$OMP nowait where possible. If the iteration space and OpenMP schedule are the same as the preceding loop, then the nowait clause is benign. This may be complicated by the flow control.
  2. Line 1551. Block on i and hoist the block and OpenMP do outside the k-loop. Lots of small OpenMP loops (horizontal domain per layer) are probably imposing a synchronisation overhead.
  3. Lines 1726, 1796 and 3457. Remove the double loop, i.e. OpenMP do across blocked i. Can have multiple loops over each i-block or i-segment within.
  4. Line 2964 vectorises an inner loop. Need to add a blocked outer i-loop so vectorisation occurs over cells in the block or segment.

Thanks for the SR, Chris. In response I've:

Restored the NOWAITs I thought were possible, and commented on the others where I felt otherwise. I expect PSyclone will now do something quite similar to where we are now. One further note: with the bounds change, some loops might not work on the same content as the loops before them, but if the nowaits cascade they could, however unlikely. With the existing KGO change, I think caution is warranted. KGO change held.

I've blocked the loop at L1551. KGO change held.

For Lines 1726, 1796 and 3457, where double loops exist, I've fused them where the KGOs held, and extended this slightly to cover a few more. One remaining double loop of this pattern did change KGOs even further, and whilst I expect these changes are still benign, I'm erring on the side of caution again given the main KGO change, so these have been reverted for now. With the successful changes, the KGO change otherwise held.

Line 2964 and another similar loop I've blocked as recommended. KGOs held.
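
For readers following the blocking changes, a minimal sketch of the pattern, under stated assumptions: the array x, the sizes n and nlev, the segment size seg and the arithmetic are hypothetical and are not taken from kmkhz_9c. It shows a single worksharing loop over i-blocks with the sequential k-loop hoisted inside it, so that threads synchronise once rather than once per level, and the innermost loop runs over the cells of one block.

```fortran
program blocked_hoist
  implicit none
  integer, parameter :: n = 1000    ! hypothetical number of columns
  integer, parameter :: nlev = 10   ! hypothetical number of levels
  integer, parameter :: seg = 64    ! hypothetical segment (block) size
  integer :: i, ii, k
  real :: x(n, nlev)                ! hypothetical field

  ! One OpenMP loop over blocks of i. Each block is independent, so the
  ! sequential k-loop can sit inside the block without extra barriers.
  !$OMP parallel do default(shared) private(i, ii, k) schedule(dynamic)
  do ii = 1, n, seg
    do i = ii, min(ii + seg - 1, n)
      x(i, 1) = real(i)
    end do
    do k = 2, nlev
      ! Innermost loop runs over the cells in this block, which gives the
      ! compiler a contiguous range to vectorise.
      do i = ii, min(ii + seg - 1, n)
        x(i, k) = x(i, k - 1) + 1.0
      end do
    end do
  end do
  !$OMP end parallel do

  print *, x(1, nlev), x(n, nlev)
end program blocked_hoist
```

Because each block only touches its own range of i, the result does not depend on which thread picks up which block, so the dynamic schedule is safe in this sketch.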

Development

Successfully merging this pull request may close these issues.

Remove j loop kmkhz_9c

5 participants