Remove j and block Kmkhz 9c #325
Remove j and block Kmkhz 9c #325Benjamin Went (MetBenjaminWent) wants to merge 7 commits intoMetOffice:mainfrom
Conversation
|
kmkhz_9c_after.txt Listing files before and after. |
|
From #221: Running into some KGO issues at fast debug. Below testing at full-debug indicates that the KGO is good, whilst 1 and 4 are technically added, 2T is consistent with trunk, and they hold between runs, which with the denom LHS above, they were not. The most recent changes here increase the number of ii blocking loops present with the dynamic schedule. Test Suite Results - lfric_apps - kmkhz_9c_pysclone/run20Suite Information
Task Information✅ succeeded tasks - 12 |
|
From #221: Updating the CCE KGOs show it holds. Further testing, do they hold as seg-size changes? Test Suite Results - lfric_apps - kmkhz_9c_pysclone/run21Suite Information
Task Information✅ succeeded tasks - 51 |
|
From #221 Barriers have been left. They should be removable (after removing some of the nowaits), but I'll look at this further in the PSyclone version of this ticket. Current leading thought is that KGO changes are Optimisation changes, but they are more widespread. Likely the intial KGO change was caused by the seg size loop intros, and as the bounds change, but the no waits didn't, the threaded runs changed, then stabalised. Changing the segment size after updating the KGOs does not cause them to change. Changing further syncronisation points may have allowed the compiler to change how it has optimised, generting further KGO shifts. Full debug otherwise remains consistent as a reference point, where previously with a genuine bug, they did not. This reinforces that the KGO change is optmisation driven. |
Adrian Lock (Adrian-Lock)
left a comment
There was a problem hiding this comment.
All looks fine to me
|
This is a very complicated code - 4000+ lines with non-trival flow control. Understanding the OpenMP behaviour is not straightforward. denom as default ( shared ) and not private is a bug, which will result in KGO change on its own. Some changes which will hopefully improve the performance at 4 threads.
|
| rht_max(i,j) = zero | ||
| end do | ||
| !$OMP end do | ||
| do k = 1, bl_levels-1 |
There was a problem hiding this comment.
To note the comment, ! j-loop outermost to allow parallelisation (k-loop is sequential)
I don't think there is a better way to do this. I agree with the cost, but without j, we might have to pay it.
We could block it instead?
| end do | ||
| end do | ||
| !$OMP end do NOWAIT | ||
| !$OMP end do |
There was a problem hiding this comment.
The loop at L1125 is over the i range. All of the loops in the if block(s) starting L1082 are over k.
Given the changing dimension ranges nowaits are not appropriate here.
| sls_inc(i,j,k) = zero | ||
| qls_inc(i,j,k) = zero | ||
| end do ! i | ||
| !$OMP end do |
There was a problem hiding this comment.
I'm aiming to remove the barrier below at L1146 to help PSyclon-ing this file, but I don't want to with these KGO changes to remove it as the cause. I'll add this one back in as it can be removed with the barrier in the future.
As the barrier exisits, the range change into L1149 is okay with a nowait, otherwise, if the barrier goes, so will the nowait.
| end do | ||
| !$OMP end do NOWAIT | ||
| !$OMP end do |
There was a problem hiding this comment.
Range change from i loop onto i_wt loop, no appropriate for nowaits
| end do ! k | ||
| end do ! i_wt | ||
| !$OMP end do NOWAIT |
There was a problem hiding this comment.
The above loop is over i, and the below loop is over ii, a blocked version of i.
NOWAIT isn't appropriate here.
| end do | ||
| end do ! jj | ||
| !$OMP end do NOWAIT | ||
| end do ! ii |
There was a problem hiding this comment.
Nowait removed as moving from ii range to i range, and thread could be on the same index
| end do | ||
| !$OMP end do NOWAIT | ||
| !$OMP end do |
There was a problem hiding this comment.
Nowait removed as moving from ii range to i range, and thread could be on the same index
| end do | ||
| !$OMP end do NOWAIT | ||
| !$OMP end do |
There was a problem hiding this comment.
Removed as plan to remove the barriers to assist PSyclone, but can be left in for now, as we are leaving the barrier to remove it from the conversation regarding KGO changes
| end do | ||
| end do | ||
| !$OMP end do NOWAIT | ||
| !$OMP end do |
There was a problem hiding this comment.
The range changes from k here (and ii post the next set of changes), to i below.
| end do | ||
| !$OMP end do NOWAIT | ||
| !$OMP end do |
There was a problem hiding this comment.
Changing range from i to ii, possibly threads could affect the same index ranges
| end do | ||
| end do | ||
| end do ! jj | ||
| !$OMP end do NOWAIT | ||
| end do ! ii |
There was a problem hiding this comment.
ii into ii may be okay, but given the dynamic schedule, the thread may end up on the same chunk
Thanks for the SR Chris. In response I've: Restored the NOWAITS I thought were possible, and commented the others where I felt otherwise. I expect PSyclone will now do something to quite similar to where we are at. There is a further note that some loops might no work on the same content as the loops before with the bounds change, but if there nowaits cascade, they could, though however unlikely. With the existing KGO change, I think caution is warranted. KGO change held. I've blocked the loop at L1551. KGO change held. Lines 1726, 1796 and 3457 where double loops exist, I've fused them, where the KGOs held, and extended it slightly to cover a few more. One remaining double loop of this pattern did change KGOs even further, and whilst I expect these are benign still, I'm earring with caution again given the main KGO change, so these have been reverted for now. With successful changes, KGO change held otherwise held. Line 2964 and another similar loop I've blocked as recommended, KGOs held. |
PR Summary
Sci/Tech Reviewer: Hacka Fett (@christophermaynard)
Code Reviewer: Harry Shepherd (@harry-shepherd)
Remove the j loop which is adversely affecting performance in the boundary layer, and improve blocking loops.
Performance data can be found in the umbrella issue, #106
closes #217
Initial KGO issues were fixed with adding denom to the private list L2546.
Taking over from #221.
Introducing further blocking loops have introduced new KGO changes, which appear to be fast debug O2 optimisation driven.
I've also noted that with differing loop ranges over threads, reducing nowaits is required, which opens the option of removing the barrier to aid further PSyclone work in the future (The barriers will remain for now).
With the above KGO occurrences, these have likely factored into the cause of KGO changes at 1T, which remain unaffected for full-debug, still indicating optimisation as the likely root cause.
As the KGO updates are holding (for the all of the tests involved in the OMP dev group) for 1, 2 or 4T, I'm reasonably confident that it's not a newly introduced race condition or similar.
Segmentaion variable will be updated in #382 to help this ticket avoid further conflicts with HoT.
Code Quality Checklist
Testing
trac.log
Test Suite Results - lfric_apps - Kmkhz-9c-by-hand-2/run7
Suite Information
Task Information
✅ succeeded tasks - 51
Test Suite Results - lfric_apps - Kmkhz-9c-by-hand-2/run20
Suite Information
Task Information
❌ failed tasks - 3
Security Considerations
Performance Impact
AI Assistance and Attribution
Documentation
PSyclone Approval
Sci/Tech Review
(Please alert the code reviewer via a tag when you have approved the SR)
Code Review