Hi! I'm investigating a segfault in GOTCHA during Caliper initialization. Here's the original discussion thread: llnl/Caliper#678
Here's the gist:
- The code segfaults when Caliper initializes its MPI wrappers. The segfault seems to occur at some point during GOTCHA initialization while wrapping the dlsym/dlopen functions. It doesn't get to the point where it says "gotcha wrap completed successfully" for the dlsym/dlopen gotcha_wrap call.
- It happens in a larger MPI program, but the same setup does not segfault with a small test app.
- The system is a 384-core machine on Azure, so I don't know if we can get direct access to it.
The log shows many errors like:
ERROR [86136/86136][gotcha.c:324] - GOTCHA attempted to mark both GOT and PLT GOT tables as writable and was unable to do so, calls to wrapped functions may likely fail.
[86136/86136][gotcha.c:316] - Setting library /home/hpcuser/miket/software/x86_64/RHEL9/GEOS/1.1.0--victor-caliper/install-CPU-MI300C-Hypre-GCC-mpi-OMP-relwithdebinfo/lib/libmainInterface.so GOT and PLT table from 0x7ffff7fba000 to +16384 to writeable
ERROR [86136/86136][gotcha.c:324] - GOTCHA attempted to mark both GOT and PLT GOT tables as writable and was unable to do so, calls to wrapped functions may likely fail.
[86136/86136][gotcha.c:316] - Setting library /home/hpcuser/miket/software/x86_64/RHEL9/GEOS/1.1.0--victor-caliper/install-CPU-MI300C-Hypre-GCC-mpi-OMP-relwithdebinfo/lib/libphysicsSolvers.so GOT and PLT table from 0x7ffff7e6c000 to +8192 to writeable
However, they also appear in the small test example, which didn't segfault.
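To narrow down why the "unable to mark GOT as writable" step fails, a standalone check along these lines might help. This is only a sketch under the assumption that the step boils down to an `mprotect` call on the page-aligned GOT range; `make_range_writable` is a hypothetical helper, not a GOTCHA function. It reports the `errno` that the GOTCHA log message does not show.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper (not part of GOTCHA): widen [addr, addr + len) to page
 * boundaries and request read/write access for it.
 * Returns 0 on success, otherwise the errno reported by mprotect. */
static int make_range_writable(void *addr, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    uintptr_t mask  = (uintptr_t)page - 1;
    uintptr_t start = (uintptr_t)addr & ~mask;                 /* round down */
    uintptr_t end   = ((uintptr_t)addr + len + mask) & ~mask;  /* round up   */

    /* PROT_READ | PROT_WRITE is enough for data pages such as a GOT;
     * it would drop execute permission if applied to code pages. */
    if (mprotect((void *)start, end - start, PROT_READ | PROT_WRITE) != 0) {
        int err = errno;
        fprintf(stderr, "mprotect(%p, %zu) failed: %s\n",
                (void *)start, (size_t)(end - start), strerror(err));
        return err;
    }
    return 0;
}
```

Calling it on one of the address ranges from the log (for example from a debugger, or from a small constructor in the application) would show whether the failure is `EACCES`, `ENOMEM`, or something else.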
Here's a stack trace:
Received signal 11: Segmentation fault
** StackTrace of 16 frames **
Frame 0: /lib64/libc.so.6
Frame 1: /home/hpcuser/miket/software/x86_64/RHEL9/GEOSTPL/1.1.0--update-hypre/install-CPU-MI300C-Hypre-GCC-mpi-OMP-relwithdebinfo/caliper/lib64/libcaliper.so.2
Frame 2: update_all_library_gots
Frame 3: gotcha_wrap
Frame 4: gotcha_wrap
Frame 5: cali::mpiwrap_init(cali::Caliper*, cali::Channel*, cali::ConfigSet&)
Frame 6: /home/hpcuser/miket/software/x86_64/RHEL9/GEOSTPL/1.1.0--update-hypre/install-CPU-MI300C-Hypre-GCC-mpi-OMP-relwithdebinfo/caliper/lib64/libcaliper.so.2
Frame 7: cali::services::register_configured_services(cali::Caliper*, cali::Channel*)
Frame 8: cali::Caliper::create_channel(char const*, cali::RuntimeConfig const&)
Frame 9: cali::ChannelController::create()
Frame 10: cali::ChannelController::start()
Frame 11: cali::ConfigManager::start()
Frame 12: geos::GeosxState::GeosxState(std::unique_ptr<geos::CommandLineOptions, std::default_delete<geos::CommandLineOptions> >&&)
Frame 13: main
Frame 14: /lib64/libc.so.6
Frame 15: __libc_start_main
Frame 16: _start
=====
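Since the crash is below `update_all_library_gots`, I assume it happens while GOTCHA walks the loaded shared objects. One way to compare the large application with the small test app could be to dump what the dynamic loader reports just before Caliper starts. The sketch below uses glibc's `dl_iterate_phdr` for that; `print_object` and `list_loaded_objects` are hypothetical names, and this is not how GOTCHA itself enumerates libraries.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <link.h>
#include <stdio.h>

/* Callback for dl_iterate_phdr: print one line per loaded object. */
static int print_object(struct dl_phdr_info *info, size_t size, void *data)
{
    (void)size;
    int *count = data;
    /* The main program is reported with an empty name. */
    const char *name = (info->dlpi_name && info->dlpi_name[0])
                           ? info->dlpi_name
                           : "(main program or unnamed object)";
    fprintf(stderr, "%3d  base=%p  segments=%u  %s\n", *count,
            (void *)info->dlpi_addr, (unsigned)info->dlpi_phnum, name);
    (*count)++;
    return 0; /* returning 0 continues the iteration */
}

/* Hypothetical helper: list every loaded object and return how many were seen. */
static int list_loaded_objects(void)
{
    int count = 0;
    dl_iterate_phdr(print_object, &count);
    assert(count > 0); /* at least the main program is always reported */
    return count;
}
```

Calling `list_loaded_objects()` right before the Caliper channel is created in both programs would give two lists to diff, which might point at a library that only the large application loads.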
There are full level 3 debug logs in the discussion thread linked above. They don't point me to anything suspicious, though, other than those error messages.
Any ideas on what might be going on or any tips on how to debug this?