Use ScatterView #124

Merged
rfbird merged 1 commit into ECP-copa:master from sslattery:use_scatter_view
May 8, 2019

Conversation

@sslattery
Collaborator

@sslattery sslattery commented May 7, 2019

There were several instances in the code where atomics were used. Nearly all of them have been replaced with Kokkos::ScatterView where possible to improve performance on non-GPU architectures. A few instances of Kokkos::atomic_fetch_add remain, but these are required by the current algorithms and cannot be removed without some algorithmic research.

Closes #82

@sslattery sslattery requested review from dalg24 and streeve May 7, 2019 19:53
@sslattery sslattery self-assigned this May 7, 2019
@codecov-io

codecov-io commented May 7, 2019

Codecov Report

Merging #124 into master will increase coverage by 0.1%.
The diff coverage is 100%.

@@           Coverage Diff            @@
##           master    #124     +/-   ##
========================================
+ Coverage    57.5%   57.6%   +0.1%     
========================================
  Files          38      38             
  Lines        2273    2282      +9     
========================================
+ Hits         1307    1316      +9     
  Misses        966     966
Flag       Coverage Δ
#clang     67.8% <100%> (+0.1%) ⬆️
#doxygen   18.3% <ø> (ø) ⬆️
#gcc       97.5% <100%> (ø) ⬆️

Impacted Files                          Coverage Δ
core/src/Cabana_CommunicationPlan.hpp   94.2% <100%> (+0.1%) ⬆️
core/src/Cabana_Halo.hpp                95.4% <100%> (ø) ⬆️
core/src/Cabana_LinkedCellList.hpp      89.1% <100%> (+0.2%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 70179e8...b49a406. Read the comment docs.

@sslattery sslattery requested a review from rfbird May 7, 2019 20:15
Collaborator

@dalg24 dalg24 left a comment

Did you actually profile the code before and after?

 for ( int n = 0; n < num_n; ++n )
     if ( topology(n) == element_export_ranks(i) )
-        Kokkos::atomic_increment( &export_counts(n) );
+        export_counts_data(n) += 1;
Collaborator

Does it support prefix increment?

Collaborator Author

yes

@sslattery
Collaborator Author

I didn't profile, but in past codes I have profiled, I've never seen a case where the duplication from ScatterView didn't perform better than atomics on a CPU. On a GPU it defaults to atomics, which is our preferred behavior.

@sslattery
Collaborator Author

sslattery commented May 8, 2019

Ok, here are some speedups on Summit for communication plan construction, where I expected them. The speedup listed is the fraction improved over the baseline, so 0.53 means a 53% speedup and -0.013 would mean a 1.3% slowdown:

cuda_host_distributor_fast_create (64 nodes)

num_rank   Version 0 ave   Version 1 ave   Speedup
384        90471.3         52303.6         0.42187633
384        89138.9         51588.1         0.421261649
384        88945.9         51603.2         0.419836103
384        88367.8         51286.1         0.419629096
384        84571.4         49329.4         0.416712979
384        80224.6         47086           0.413072798
384        52433.2         35578.3         0.321454727

cuda_host_distributor_slow_create (64 nodes)

num_rank   Version 0 ave   Version 1 ave   Speedup
384        94856           58933.8         0.378702454
384        94638.9         59018.6         0.376381171
384        94584.7         58586.8         0.380589038
384        93999.9         58849.6         0.37393976
384        91489.8         57068.2         0.376234291
384        90421.3         53945.7         0.403396102
384        71974.5         49011           0.319050497

So pretty good speedups here. I also saw the same speedups for the halo construction and no slowdowns elsewhere.

@sslattery sslattery force-pushed the use_scatter_view branch from d5764e2 to b49a406 on May 8, 2019 14:45
@rfbird
Collaborator

rfbird commented May 8, 2019

> I didn't profile, but in past codes I have profiled, I've never seen a case where the duplication from ScatterView didn't perform better than atomics on a CPU. On a GPU it defaults to atomics, which is our preferred behavior.

FWIW I've seen cases where our hand-written implementation is definitely faster than ScatterView, but I agree: I fully expect ScatterView to be faster than atomics on CPU for any sane use case.

@sslattery
Collaborator Author

Yeah, so after investigating, the duplication from ScatterView was actually slower in the halo scatter. In that case there are often so few collisions that the extra duplication became an overhead, so I removed it.

@sslattery
Collaborator Author

Also, on our MD neighbor list benchmark with parameters for water molecules, I see a small speedup of a few percent, which demonstrates the changes to the LinkedCellList. You can use these settings for water:

./NeighborListMDPerfTest 3.0 1000000 1.0

So, small improvement but something.

Collaborator

@rfbird rfbird left a comment

LGTM!

I think ScatterView is a really nice addition. I think this could also make a good candidate for a "performance test".

It's also worth noting that ScatterView lets you pass parameters to still use atomics on the host, so it's easy to map out the performance space. You can, for example, imagine atomics being faster in some small corner of the space where work is limited.

auto export_counts = Kokkos::create_mirror_view_and_copy(
    memory_space(), num_export_host );
auto export_counts_sv =
    Kokkos::Experimental::create_scatter_view( export_counts );
Collaborator

Are we worried about always having to create the views? Is it worth considering providing a fixed-size allocation for use by experts with consistent/regular data patterns/sends?

Collaborator

I guess then we'd have to be more careful about resetting the data too.

Collaborator Author

I guess I'm not sure what you mean here.

Collaborator

Sorry, I should have been clearer. This makes a ScatterView per function invocation, right? And then throws it away at the end?

For regular/structured apps, does this imply memory allocation/deallocation of the same size every timestep? Or is this somehow only ever done once and I'm missing a detail?

Collaborator Author

I believe it would imply allocation/deallocation of the replicated views if replication is the scatter method of choice. Because we don't know a priori which slice the user wants to scatter, I think it would be difficult to save the state of a ScatterView somewhere unless we developed some type of container approach where the user registers the slices they are scattering.

Collaborator

Yeah, that's what I'm thinking: a second, advanced API where they provide more information. Something for down the road, perhaps.

@sslattery
Collaborator Author

Yes. If we find places where, for example, the Halo scatter operation has tons of collisions and could use duplication on the CPU, then we could add ScatterView back in and develop some collision metric based on the communication plan (we know all collisions once the plan is generated) to pick the implementation at run time.

@guangyechen
Collaborator

@rfbird when you said the hand-written one was faster than scatter_view, was that for CPU, GPU, or both?

@rfbird
Collaborator

rfbird commented May 8, 2019

> @rfbird when you said the hand-written one was faster than scatter_view, was that for CPU, GPU, or both?

Only CPU (there's basically no overhead to ScatterView on GPU).

@guangyechen
Collaborator

Might the overhead of scatter_view be from creating/destroying temporary arrays on the fly?

@sslattery
Collaborator Author

Yes, the overhead would come from that as well as from the recombination at the end.

@rfbird
Collaborator

rfbird commented May 8, 2019

> Might the overhead of scatter_view be from creating/destroying temporary arrays on the fly?

I think it's nothing to be overly concerned about right now. Two things:

  1. My comparison wasn't apples to apples, so the data is only so-so.
  2. Kokkos actually has a performance test for this, which means they're aware of and happy with the status of the implementation: https://github.com/kokkos/kokkos/blob/master/containers/performance_tests/TestScatterView.hpp

@rfbird rfbird merged commit a444c90 into ECP-copa:master May 8, 2019
@sslattery sslattery deleted the use_scatter_view branch May 8, 2019 15:28
@stanmoore1
Collaborator

I've tested ScatterView in ExaMiniMD, see ECP-copa/ExaMiniMD#21. There was little difference between creating temporary ScatterViews vs using a persistent allocation. We also compared performance to a hand-coded version, and for 1D arrays it was close. There may be more optimizations needed for 2D arrays though.

@stanmoore1
Collaborator

FYI, see Christian's comment at the end of kokkos/kokkos#1390.

@rfbird
Collaborator

rfbird commented May 8, 2019

> FYI, see Christian's comment at the end of kokkos/kokkos#1390.

Good point. Overall I'm happy to rely on this (as I also do for a LANL production app), and hopefully it will only get better! If we see it really causing an issue, I'll do something similar to Christian's comments in #1390 and file a PR.

@sslattery
Collaborator Author

Agreed - and we have our own performance tests to track these changes (although not automated yet). I'm very happy with a 2x speedup for the communication stuff.


Development

Successfully merging this pull request may close these issues.

Use Kokkos::ScatterView where Possible

6 participants