Conversation
Codecov Report
@@ Coverage Diff @@
## master #124 +/- ##
========================================
+ Coverage 57.5% 57.6% +0.1%
========================================
Files 38 38
Lines 2273 2282 +9
========================================
+ Hits 1307 1316 +9
Misses 966 966
Continue to review full report at Codecov.
|
dalg24
left a comment
There was a problem hiding this comment.
Did you actually profile the code before and after?
| for ( int n = 0; n < num_n; ++n ) | ||
| if ( topology(n) == element_export_ranks(i) ) | ||
| Kokkos::atomic_increment( &export_counts(n) ); | ||
| export_counts_data(n) += 1; |
There was a problem hiding this comment.
Does it support prefix increment?
|
I didn't profile but for past codes I have profiled I have never had a case where the duplication from |
|
Ok here are some speedups on Summit on communication plan construction where I expected them. The speedup listed is the fraction over 1 that was improved. So
So pretty good speedups here. Also saw the same speed ups for the halo construction and no slow downs elsewhere |
d5764e2 to
b49a406
Compare
FWIW I've seen cases where our hand written implementation is definitely faster than |
|
yeah so actually after investigating the duplication from scatter view was slower in the halo scatter - in that case there is often so few collisions that the extra duplication became an overhead. therefore I removed it. |
|
also on our MD neighbor list benchmark with parameters for water molecules I see a small speed up of a few percent which demonstrates the changes to the So, small improvement but something. |
rfbird
left a comment
There was a problem hiding this comment.
LGTM !
I think ScatterView is a really nice addition. I think this could also make a good candidate for a "performance test".
It's also worth noting that ScatterView does let you pass params to still use atomics on the host, so it's easy to map the performance space out. You can for example imagine atomics is faster in some small corner space where work is limited
| auto export_counts = Kokkos::create_mirror_view_and_copy( | ||
| memory_space(), num_export_host ); | ||
| auto export_counts_sv = | ||
| Kokkos::Experimental::create_scatter_view( export_counts ); |
There was a problem hiding this comment.
Are we worried about always having to create the views? Is it worth considering providing a fixed size allocation for us by experts with consistent/regular data patterns/sends?
There was a problem hiding this comment.
I guess then we have to be more careful about resetting the data too
There was a problem hiding this comment.
I guess I'm not sure what you mean here.
There was a problem hiding this comment.
Sorry, I should have been clearer. This makes a ScatterView per function invocation, right? And then throws it away at the end?
For regular/structured apps, does this imply memory allocation/deallocation of the same size every timestep? Or is this somehow only ever done once and I'm missing a detail?
There was a problem hiding this comment.
I believe it would imply allocation/deallocation of the replicated views if replication is the scatter method of choice. Because we don't have information about the slice the user wants to scatter a priori I think it would be difficult to save the state of a scatter view somewhere unless we developed some type of container approach where the user would register the slices they are scattering.
There was a problem hiding this comment.
Yea, that's what I'm thinking. A second advance API where they provide more information. Something for down the road perhaps.
|
Yes - if we find places where, for example, the Halo scatter operation has tons of collisions and could use duplication on the CPU then we could add back in |
|
@rfbird when you said that handwritten one faster than scatter_view, is it for cpu or gpu, or both? |
Only CPU (there basically no overhead to ScatterView on GPU) |
|
May the overhead of scatter_view be creating/destroying temporary arrays on the fly? |
|
Yes the overhead would be from that as well as the recombination at the end. |
I think it's nothing to be overly concerned about right now. Two things:
|
|
I've tested ScatterView in ExaMiniMD, see ECP-copa/ExaMiniMD#21. There was little difference between creating temporary ScatterViews vs using a persistent allocation. We also compared performance to a hand-coded version, and for 1D arrays it was close. There may be more optimizations needed for 2D arrays though. |
|
FYI, see Christian's comment at the end of kokkos/kokkos#1390. |
Good point. overall I'm happy to rely on this (as I also do for a LANL production app..), hopefully it will only get better! If we see it really causing an issue I'll do something similar to Christians' comments in #1390 and file a PR.. |
|
Agreed - and we have our own performance tests to track these changes (although not automated yet). I'm very happy with a 2x speedup for the communication stuff. |
There were several instances in the code where atomics were used. Almost all instances of atomics have been replaced with
Kokkos::ScatterViewwhere possible to enhance performance on non-GPU architectures. There are still several instances ofKokkos::atomic_fetch_addin the code but these are needed to implement the current algorithms and therefore they cannot be changed without some algorithmic research.Closes #82