ChaNGa

Building

This machine is really nuts: 192 cores across 2 sockets per node.
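
Before picking pemap/commap values it’s worth double-checking the core-to-socket numbering (assuming the usual tools are installed on the compute nodes):

lscpu | grep -iE 'socket|core|numa'
numactl --hardware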

Load this module set: openmpi/4.1.5rc2/hpcx

Build with: ./buildold ChaNGa verbs-linux-x86_64 smp -j8 --with-production
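
For future reference, the whole build goes roughly like this (a sketch, assuming the usual side-by-side charm/ and changa/ checkouts so ChaNGa’s configure can find the Charm++ build):

module load openmpi/4.1.5rc2/hpcx
cd charm
./buildold ChaNGa verbs-linux-x86_64 smp -j8 --with-production
cd ../changa
./configure
make -j8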

./charmrun.smp +p 190 ++mpiexec ./ChaNGa.smp ++ppn 95 +setcpuaffinity +commap 0,96 +pemap 1-95,97-191 agora.param

Works on one node.
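
My reading of the flags in that launch line (assuming cores 0-95 are socket 0 and 96-191 are socket 1, which is what the commap/pemap split implies):

  • +p 190: 190 worker PEs in total, i.e. 192 cores minus the 2 communication threads
  • ++ppn 95: 95 worker threads per SMP process, so 2 processes on the node
  • ++mpiexec: have charmrun launch the binary through mpiexec from the loaded OpenMPI
  • +setcpuaffinity: pin every thread to a specific core
  • +commap 0,96: put each process’s communication thread on the first core of its socket
  • +pemap 1-95,97-191: spread the worker threads over the remaining cores of each socket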

Running 3 steps with agora low:

  • 192 cores, SMP (2 processes, full commap): 167 seconds!
  • 192 cores, SMP (2 processes, no commap): > 600 seconds!
  • 96 cores, SMP (1 process): 320 seconds
  • 192 cores without cpuaffinity/commap: doesn’t run!
  • 192 cores without SMP (192 processes): absurdly slow

Fastest choice for agora:

./charmrun.smp +p 95 ++mpiexec ./ChaNGa.smp ++ppn 95 +setcpuaffinity +commap 0 +pemap 1-95 agora.param

  • Weirdly this requires 192 MPI tasks???
    • It seems to still be just as fast if I use 48 SMP tasks as well??
    • 24 is slower by ~50%.
  • I can’t get it to be fast without 192 MPI tasks!
    • It’s slower if I spread the pemap around, but only by ~40%
    • It’s WAY slower without commap or pemap (10x slower)

What if I just try compiling with the multicore options? Hoo boy that’s fast and easy…
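
For the record, the multicore route is roughly this (a sketch; the multicore build is shared-memory only, so it’s single-node and runs without charmrun):

cd charm
./buildold ChaNGa multicore-linux-x86_64 -j8 --with-production
cd ../changa
./configure
make -j8
./ChaNGa +p 192 agora.param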

pkdgrav3

Compiling

  • Load these modules:
    • python/3.12.9/gcc.8.5.0
    • fftw/3.3.10/gcc-8.5.0/openmpi-4.1.6
    • boost/1.84.0/gcc.8.5.0
    • gsl/2.7.1/gcc-8.5.0
    • openmpi/4.1.6/gcc.8.5.0/mt
    • hdf5/1.14.6/gcc-8.5.0/openmpi-4.1.6
  • FFTW needs to be explicitly pointed at: cmake -DFFTW_ROOT=/opt/ohpc/pub/apps/uofm/fftw/3.3.10-gcc-mpi/ -S . -B build
  • Needs to be launched with mpirun; srun doesn’t work, for reasons (see the sketch after this list)
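
Putting it together, the configure/build/run cycle looks roughly like this (a sketch; run.par is a placeholder parameter file and I’m assuming the binary lands at build/pkdgrav3):

module load python/3.12.9/gcc.8.5.0 fftw/3.3.10/gcc-8.5.0/openmpi-4.1.6 boost/1.84.0/gcc.8.5.0 gsl/2.7.1/gcc-8.5.0 openmpi/4.1.6/gcc.8.5.0/mt hdf5/1.14.6/gcc-8.5.0/openmpi-4.1.6
cmake -DFFTW_ROOT=/opt/ohpc/pub/apps/uofm/fftw/3.3.10-gcc-mpi/ -S . -B build
cmake --build build -j 8
mpirun -np 8 ./build/pkdgrav3 run.par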

Performance

  • Using 4 or 8 MPI tasks per node gives good performance (example below).
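
For example, 2 nodes at 8 tasks per node would look something like this (a sketch; --map-by ppr:N:node is OpenMPI’s way of fixing the ranks-per-node count, and run.par is again a placeholder):

mpirun -np 16 --map-by ppr:8:node ./build/pkdgrav3 run.par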