
GOTCHA segfaults in GEOS on MI300C #167

@daboehme

Hi! I'm investigating a segfault in GOTCHA during Caliper initialization. Here's the original discussion thread: llnl/Caliper#678

Here's the gist:

  • The code segfaults when Caliper initializes its MPI wrappers. The segfault seems to occur at some point during GOTCHA initialization while wrapping the dlsym/dlopen functions. It doesn't get to the point where it says "gotcha wrap completed successfully" for the dlsym/dlopen gotcha_wrap call.
  • It happens in a larger MPI program but does not segfault with a small test app.
  • The system is a 384-core machine on Azure, so I don't know whether we can get direct access to it.

The log shows many errors like these:

ERROR [86136/86136][gotcha.c:324] - GOTCHA attempted to mark both GOT and PLT GOT tables as writable and was unable to do so, calls to wrapped functions may likely fail.
[86136/86136][gotcha.c:316] - Setting library /home/hpcuser/miket/software/x86_64/RHEL9/GEOS/1.1.0--victor-caliper/install-CPU-MI300C-Hypre-GCC-mpi-OMP-relwithdebinfo/lib/libmainInterface.so GOT and PLT table from 0x7ffff7fba000 to +16384 to writeable
ERROR [86136/86136][gotcha.c:324] - GOTCHA attempted to mark both GOT and PLT GOT tables as writable and was unable to do so, calls to wrapped functions may likely fail.
[86136/86136][gotcha.c:316] - Setting library /home/hpcuser/miket/software/x86_64/RHEL9/GEOS/1.1.0--victor-caliper/install-CPU-MI300C-Hypre-GCC-mpi-OMP-relwithdebinfo/lib/libphysicsSolvers.so GOT and PLT table from 0x7ffff7e6c000 to +8192 to writeable

However, they also appear in the small test example, which didn't segfault.
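
Since the same errors show up in both runs, one way to compare them is to extract which libraries failed in each log and diff the two sets. Below is a minimal sketch of that, not part of GOTCHA or Caliper. It assumes that each ERROR line refers to the most recent "Setting library" line, which the gotcha.c:316 and gotcha.c:324 line numbers suggest but which I have not confirmed in the GOTCHA source. The function name `failed_libraries` is made up for illustration.

```python
import re

# Matches the "Setting library ... to writeable" debug line shown in the excerpt.
SETTING_RE = re.compile(
    r"Setting library (?P<lib>\S+) GOT and PLT table "
    r"from (?P<addr>0x[0-9a-fA-F]+) to \+(?P<size>\d+) to writeable"
)
# Substring of the ERROR message shown in the excerpt.
ERROR_MARKER = "was unable to do so"


def failed_libraries(log_lines):
    """Return (library, address, size) for each range followed by an ERROR line.

    Assumption: an ERROR line belongs to the closest preceding
    "Setting library" line. ERROR lines with no preceding entry are skipped.
    """
    failures = []
    current = None  # most recent "Setting library" entry not yet matched
    for line in log_lines:
        match = SETTING_RE.search(line)
        if match:
            current = (match["lib"], int(match["addr"], 16), int(match["size"]))
        elif ERROR_MARKER in line and current is not None:
            failures.append(current)
            current = None
    return failures
```

Running this over the GEOS log and the small test app's log, then comparing the library names from each result, would show whether any library fails only in the segfaulting run.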

Here's a stack trace:

Received signal 11: Segmentation fault

** StackTrace of 16 frames **
Frame 0: /lib64/libc.so.6 
Frame 1: /home/hpcuser/miket/software/x86_64/RHEL9/GEOSTPL/1.1.0--update-hypre/install-CPU-MI300C-Hypre-GCC-mpi-OMP-relwithdebinfo/caliper/lib64/libcaliper.so.2 
Frame 2: update_all_library_gots 
Frame 3: gotcha_wrap 
Frame 4: gotcha_wrap 
Frame 5: cali::mpiwrap_init(cali::Caliper*, cali::Channel*, cali::ConfigSet&) 
Frame 6: /home/hpcuser/miket/software/x86_64/RHEL9/GEOSTPL/1.1.0--update-hypre/install-CPU-MI300C-Hypre-GCC-mpi-OMP-relwithdebinfo/caliper/lib64/libcaliper.so.2 
Frame 7: cali::services::register_configured_services(cali::Caliper*, cali::Channel*) 
Frame 8: cali::Caliper::create_channel(char const*, cali::RuntimeConfig const&) 
Frame 9: cali::ChannelController::create() 
Frame 10: cali::ChannelController::start() 
Frame 11: cali::ConfigManager::start() 
Frame 12: geos::GeosxState::GeosxState(std::unique_ptr<geos::CommandLineOptions, std::default_delete<geos::CommandLineOptions> >&&) 
Frame 13: main 
Frame 14: /lib64/libc.so.6 
Frame 15: __libc_start_main 
Frame 16: _start 
=====

There are full level 3 debug logs in the discussion thread linked above. They don't really point me to anything suspicious, though, other than those error messages.
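
One possible way to dig into those error messages is to check what is actually mapped at the reported addresses. Below is a minimal sketch, assuming the errors come from a failed `mprotect` call (the log does not say so). `mprotect` returns ENOMEM when part of the requested range is not mapped, so a gap in the range would be one candidate explanation. The helper names `parse_maps` and `describe_range` are made up for illustration. The input is the text of `/proc/<pid>/maps`, captured from the affected process while it is still alive, for example when it is stopped in a debugger.

```python
def parse_maps(maps_text):
    """Parse /proc/<pid>/maps text into (start, end, perms, path) tuples."""
    regions = []
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        start_s, end_s = fields[0].split("-")
        # Anonymous mappings have no path field.
        path = fields[5].strip() if len(fields) > 5 else ""
        regions.append((int(start_s, 16), int(end_s, 16), fields[1], path))
    return regions


def describe_range(regions, addr, size):
    """Return the mappings overlapping [addr, addr + size) and whether
    they cover that range without gaps."""
    end = addr + size
    overlapping = sorted(r for r in regions if r[0] < end and r[1] > addr)
    cursor = addr  # first address not yet known to be mapped
    for start, stop, _perms, _path in overlapping:
        if start > cursor:
            break  # gap: part of the range is not mapped
        cursor = max(cursor, stop)
    return overlapping, cursor >= end
```

For the first range in the excerpt, the call would be `describe_range(regions, 0x7ffff7fba000, 16384)`. The permission strings in the returned tuples show how each overlapping segment is currently protected.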

Any ideas on what might be going on or any tips on how to debug this?
