ChaNGa
Building
This machine is really nuts: 192 cores across 2 sockets per node.
This module set:
openmpi/4.1.5rc2/hpcx
Build with:
./buildold ChaNGa verbs-linux-x86_64 smp -j8 --with-production
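For reference, annotated (verbs = the InfiniBand verbs network layer, smp = multithreaded processes with a dedicated comm thread each), with the module above loaded first:
module load openmpi/4.1.5rc2/hpcx
./buildold ChaNGa verbs-linux-x86_64 smp -j8 --with-production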
./charmrun.smp +p 190 ++mpiexec ./ChaNGa.smp ++ppn 95 +setcpuaffinity +commap 0,96 +pemap 1-95,97-191 agora.param
works on one node
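To spell out what that mapping is doing (my reading of the Charm++ SMP flags, so treat it as a sketch):
+p 190 ++ppn 95        # 190 worker PEs at 95 per process = 2 SMP processes, one per socket
+commap 0,96           # each process also gets a comm thread, pinned to core 0 (socket 0) and core 96 (socket 1)
+pemap 1-95,97-191     # the 190 worker PEs fill the remaining cores
# total: 190 workers + 2 comm threads = 192 threads on 192 cores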
Running 3 steps with agora low:
- 192 cores, SMP (2 processes, full commap): 167 seconds!
- 192 cores, SMP (2 processes, no commap): > 600 seconds!
- 96 cores, SMP (1 process): 320 seconds
- 192 cores without cpuaffinity/commap: doesn’t run!
- 192 cores without SMP (192 processes): absurdly slow
Fastest choice for agora:
./charmrun.smp +p 95 ++mpiexec ./ChaNGa.smp ++ppn 95 +setcpuaffinity +commap 0 +pemap 1-95 agora.param
- Weirdly this requires 192 MPI tasks???
- It seems to still be just as fast if I use 48 SMP tasks as well??
- 24 is slower by ~50%.
- I can’t get it to be fast without 192 MPI tasks!
- It’s slower if I spread the pemap around, but only by ~40%
- It’s WAY slower without commap or pemap (10x slower)
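For the record, a job script for that fastest configuration would look roughly like this; the Slurm directives are my guess at what satisfies the 192-task requirement, and only the charmrun line is taken verbatim from above:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=192          # guess: this seems to be what charmrun ++mpiexec wants to see
module load openmpi/4.1.5rc2/hpcx
./charmrun.smp +p 95 ++mpiexec ./ChaNGa.smp ++ppn 95 +setcpuaffinity +commap 0 +pemap 1-95 agora.param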
What if I just try compiling with the multicore options? Hoo boy that’s fast and easy…
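For reference, the multicore route is roughly this (multicore-linux-x86_64 is the standard Charm++ single-node target; I'm assuming the binary comes out as plain ./ChaNGa and runs standalone, no charmrun/MPI):
./buildold ChaNGa multicore-linux-x86_64 -j8 --with-production
./ChaNGa +p 192 agora.param    # one process, 192 threads, single node only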
pkdgrav3
Compiling
- Load these modules:
python/3.12.9/gcc.8.5.0 fftw/3.3.10/gcc-8.5.0/openmpi-4.1.6 boost/1.84.0/gcc.8.5.0 gsl/2.7.1/gcc-8.5.0 openmpi/4.1.6/gcc.8.5.0/mt hdf5/1.14.6/gcc-8.5.0/openmpi-4.1.6
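i.e. as a single command (assuming Lmod-style module loading):
module load python/3.12.9/gcc.8.5.0 fftw/3.3.10/gcc-8.5.0/openmpi-4.1.6 \
            boost/1.84.0/gcc.8.5.0 gsl/2.7.1/gcc-8.5.0 \
            openmpi/4.1.6/gcc.8.5.0/mt hdf5/1.14.6/gcc-8.5.0/openmpi-4.1.6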
- FFTW needs to be explicitly pointed at:
cmake -DFFTW_ROOT=/opt/ohpc/pub/apps/uofm/fftw/3.3.10-gcc-mpi/ -S . -B build
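- Then the actual build is the standard CMake step (the -j count is just my choice):
cmake --build build -j8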
- Needs to use mpirun to execute; srun doesn’t work for reasons (see the job sketch under Performance below)
Performance
- Using 4 or 8 tasks per node works well.
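A rough job sketch putting this together with the mpirun note above (the Slurm directives, the binary path, and the parameter-file name are placeholders, not tested):
#!/bin/bash
#SBATCH --nodes=2                 # placeholder node count
#SBATCH --ntasks-per-node=8       # 4 or 8 per node both work well
#SBATCH --cpus-per-task=24        # guess: 192 cores / 8 tasks, letting pkdgrav3 thread within each rank
module load python/3.12.9/gcc.8.5.0 fftw/3.3.10/gcc-8.5.0/openmpi-4.1.6 boost/1.84.0/gcc.8.5.0 gsl/2.7.1/gcc-8.5.0 openmpi/4.1.6/gcc.8.5.0/mt hdf5/1.14.6/gcc-8.5.0/openmpi-4.1.6
mpirun ./build/pkdgrav3 run.par   # placeholder paths; mpirun, not srun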