Use ScatterView #124

Merged
rfbird merged 1 commit into ECP-copa:master from sslattery:use_scatter_view
May 8, 2019

Conversation

@sslattery
Collaborator

@sslattery sslattery commented May 7, 2019

There were several instances in the code where atomics were used. Nearly all of them have been replaced with Kokkos::ScatterView where possible to improve performance on non-GPU architectures. A few instances of Kokkos::atomic_fetch_add remain, but these are required by the current algorithms and cannot be removed without some algorithmic research.

Closes #82

@sslattery sslattery requested review from dalg24 and streeve May 7, 2019 19:53
@sslattery sslattery self-assigned this May 7, 2019
@codecov-io

codecov-io commented May 7, 2019

Codecov Report

Merging #124 into master will increase coverage by 0.1%.
The diff coverage is 100%.

@@           Coverage Diff            @@
##           master    #124     +/-   ##
========================================
+ Coverage    57.5%   57.6%   +0.1%     
========================================
  Files          38      38             
  Lines        2273    2282      +9     
========================================
+ Hits         1307    1316      +9     
  Misses        966     966
Flag       Coverage Δ
#clang     67.8% <100%> (+0.1%) ⬆️
#doxygen   18.3% <ø> (ø) ⬆️
#gcc       97.5% <100%> (ø) ⬆️

Impacted Files                          Coverage Δ
core/src/Cabana_CommunicationPlan.hpp   94.2% <100%> (+0.1%) ⬆️
core/src/Cabana_Halo.hpp                95.4% <100%> (ø) ⬆️
core/src/Cabana_LinkedCellList.hpp      89.1% <100%> (+0.2%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 70179e8...b49a406. Read the comment docs.

@sslattery sslattery requested a review from rfbird May 7, 2019 20:15
Collaborator

@dalg24 dalg24 left a comment

Did you actually profile the code before and after?

 for ( int n = 0; n < num_n; ++n )
     if ( topology(n) == element_export_ranks(i) )
-        Kokkos::atomic_increment( &export_counts(n) );
+        export_counts_data(n) += 1;
Collaborator

Does it support prefix increment?

Collaborator Author

yes

@sslattery
Collaborator Author

I didn't profile, but in past codes I have profiled, I've never seen a case where the duplication from ScatterView didn't perform better than atomics on a CPU. On a GPU it defaults to atomics, which is our preferred behavior.

@sslattery
Collaborator Author

sslattery commented May 8, 2019

Ok, here are some speedups on Summit for communication plan construction, where I expected them. The speedup listed is the fraction improved over the baseline, so 0.53 means a 53% speedup and -0.013 would mean a 1.3% slowdown:

cuda_host_distributor_fast_create (64 nodes)

num_rank   Version 0 ave   Version 1 ave   Speedup
384        90471.3         52303.6         0.42187633
384        89138.9         51588.1         0.421261649
384        88945.9         51603.2         0.419836103
384        88367.8         51286.1         0.419629096
384        84571.4         49329.4         0.416712979
384        80224.6         47086           0.413072798
384        52433.2         35578.3         0.321454727

cuda_host_distributor_slow_create (64 nodes)

num_rank   Version 0 ave   Version 1 ave   Speedup
384        94856           58933.8         0.378702454
384        94638.9         59018.6         0.376381171
384        94584.7         58586.8         0.380589038
384        93999.9         58849.6         0.37393976
384        91489.8         57068.2         0.376234291
384        90421.3         53945.7         0.403396102
384        71974.5         49011           0.319050497

So pretty good speedups here. I also saw the same speedups for the halo construction and no slowdowns elsewhere.

@sslattery sslattery force-pushed the use_scatter_view branch from d5764e2 to b49a406 on May 8, 2019 14:45
@rfbird
Collaborator

rfbird commented May 8, 2019

> I didn't profile, but in past codes I have profiled, I've never seen a case where the duplication from ScatterView didn't perform better than atomics on a CPU. On a GPU it defaults to atomics, which is our preferred behavior.

FWIW I've seen cases where our hand-written implementation is definitely faster than ScatterView, but I agree: I fully expect ScatterView to be faster than atomics on CPU for any sane use case.

@sslattery
Collaborator Author

Yeah, so after investigating, the duplication from ScatterView was actually slower in the halo scatter. In that case there are often so few collisions that the extra duplication became an overhead, so I removed it.

@sslattery
Collaborator Author

Also, on our MD neighbor list benchmark with parameters for water molecules, I see a small speedup of a few percent, which demonstrates the changes to the LinkedCellList. You can use these settings for water:

./NeighborListMDPerfTest 3.0 1000000 1.0

So, small improvement but something.

Collaborator

@rfbird rfbird left a comment

LGTM!

I think ScatterView is a really nice addition. I think this could also make a good candidate for a "performance test".

It's also worth noting that ScatterView lets you pass parameters to still use atomics on the host, so it's easy to map out the performance space. You can, for example, imagine atomics being faster in some small corner of the space where work is limited.

auto export_counts = Kokkos::create_mirror_view_and_copy(
    memory_space(), num_export_host );
auto export_counts_sv =
    Kokkos::Experimental::create_scatter_view( export_counts );
Collaborator

Are we worried about always having to create the views? Is it worth considering providing a fixed-size allocation for use by experts with consistent/regular data patterns/sends?

Collaborator

I guess then we'd have to be more careful about resetting the data too.

Collaborator Author

I guess I'm not sure what you mean here.

Collaborator

Sorry, I should have been clearer. This makes a ScatterView per function invocation, right? And then throws it away at the end?

For regular/structured apps, does this imply memory allocation/deallocation of the same size every timestep? Or is this somehow only ever done once and I'm missing a detail?

Collaborator Author

I believe it would imply allocation/deallocation of the replicated views if replication is the scatter method of choice. Because we don't know a priori which slice the user wants to scatter, I think it would be difficult to save the state of a ScatterView somewhere unless we developed some type of container approach where the user registers the slices they are scattering.

Collaborator

Yeah, that's what I'm thinking: a second, advanced API where they provide more information. Something for down the road, perhaps.

@sslattery
Collaborator Author

Yes. If we find places where, for example, the Halo scatter operation has tons of collisions and could use duplication on the CPU, then we could add ScatterView back in and develop some collision metric based on the communication plan (we know all collisions once the plan is generated) to pick the implementation at run time.

@guangyechen
Collaborator

@rfbird when you said the hand-written one was faster than scatter_view, was that for CPU, GPU, or both?

@rfbird
Collaborator

rfbird commented May 8, 2019

> @rfbird when you said the hand-written one was faster than scatter_view, was that for CPU, GPU, or both?

Only CPU (there's basically no overhead to ScatterView on GPU).

@guangyechen
Collaborator

Might the overhead of scatter_view be from creating/destroying temporary arrays on the fly?

@sslattery
Collaborator Author

Yes, the overhead would come from that as well as from the recombination at the end.

@rfbird
Collaborator

rfbird commented May 8, 2019

> Might the overhead of scatter_view be from creating/destroying temporary arrays on the fly?

I think it's nothing to be overly concerned about right now. Two things:

  1. My comparison wasn't apples to apples, so the data is only so-so.
  2. Kokkos actually has a performance test for this, which means they're aware of and happy with the status of the implementation: https://github.com/kokkos/kokkos/blob/master/containers/performance_tests/TestScatterView.hpp

@rfbird rfbird merged commit a444c90 into ECP-copa:master May 8, 2019
@sslattery sslattery deleted the use_scatter_view branch May 8, 2019 15:28
@stanmoore1
Collaborator

I've tested ScatterView in ExaMiniMD, see ECP-copa/ExaMiniMD#21. There was little difference between creating temporary ScatterViews vs using a persistent allocation. We also compared performance to a hand-coded version, and for 1D arrays it was close. There may be more optimizations needed for 2D arrays though.

@stanmoore1
Collaborator

FYI, see Christian's comment at the end of kokkos/kokkos#1390.

@rfbird
Collaborator

rfbird commented May 8, 2019

> FYI, see Christian's comment at the end of kokkos/kokkos#1390.

Good point. Overall I'm happy to rely on this (as I also do for a LANL production app), and hopefully it will only get better! If we see it really causing an issue, I'll do something similar to Christian's comments in #1390 and file a PR.

@sslattery
Collaborator Author

Agreed - and we have our own performance tests to track these changes (although not automated yet). I'm very happy with a 2x speedup for the communication stuff.


Development

Successfully merging this pull request may close these issues.

Use Kokkos::ScatterView where Possible

6 participants