Using Simultaneous Multi-Threading on Davinci

Simultaneous Multi-Threading (SMT) is an AIX feature available on Power5- and Power6-based systems. Under SMT, the Power6 (P6) doubles the number of active threads on a processor core by presenting a second "virtual" processor, which is implemented in the chip architecture. The basic idea behind SMT is that no single process can use all of a processor's execution units at the same time, so a second thread should be able to use otherwise idle cycles on the same physical core.

Davinci has 32 physical cores per compute node. With SMT enabled, each node appears to have 64 virtual CPUs. Running 64 tasks per node should generally provide good efficiency for pure MPI codes. However, we recommend testing your own application at 32 to 64 tasks per node, provided per-task memory usage allows that many tasks to fit in local on-node memory (currently about 52 GBytes). If your program is a hybrid MPI + OpenMP code, experiment with various combinations of task counts and thread counts that total 64 threads per node, rather than 32.
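As a starting point for the hybrid experiments suggested above, the following minimal sketch lists the task and thread combinations that fill all 64 virtual CPUs on one node. The function name list_layouts is hypothetical, and the output labels simply reuse the mpiprocs and ompthreads keywords from the PBS examples in this guide. Which layout performs best depends on your application.

```shell
#!/bin/bash
# List MPI-task x OpenMP-thread layouts whose product is 64 threads per node.
# Assumption: list_layouts is an illustrative helper, not a Davinci tool.
VIRTUAL_CPUS_PER_NODE=64

list_layouts() {
    local total=$1 tasks=1
    # Every divisor of 64 is a power of two, so doubling covers all layouts.
    while [ "$tasks" -le "$total" ]; do
        echo "mpiprocs=${tasks} ompthreads=$(( total / tasks ))"
        tasks=$(( tasks * 2 ))
    done
}

list_layouts "$VIRTUAL_CPUS_PER_NODE"
```

Each printed pair corresponds to one candidate layout per node; the per-node thread total stays at 64 in every case.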

No source code changes are required in Fortran/C/C++ codes to use SMT, but we recommend several changes to your PBS scripts, as described in this guide. With these changes, you may be able to boost performance by 20% or more on some codes. It is often possible to get at least some speedup from SMT, and trying it on Davinci takes little extra effort.

MPI Jobs

An MPI-only PBS job running 128 tasks over four 32-way nodes can be modified to use SMT on two 32-way nodes by changing the "-l select" PBS parameter (see below). Alternatively, you can continue to use four nodes with SMT by having the job start 256 tasks (64 per node), assuming your MPI application scales that far. The latter method may be preferable if total wallclock time is the primary consideration.

  1. Run a 128-task MPI executable on four 32-way SMT-enabled nodes (at 32 tasks per node) with task-affinity layout controlled via the local "launch" tool:
    #!/bin/bash -l
    ### Set the shell path with the "-S" option
    #PBS -S /bin/bash
    ### Set the name of the job with the "-N" option
    #PBS -N SMT128-32tasks
    ### Set the Project ID with the "-A" option
    #PBS -A NAVOS12345LMA
    ### Set the maximum wall-clock time with the "-l walltime" option
    #PBS -l walltime=02:00:00
    ### Set the number of nodes and number of cores with "-l select" option
    #PBS -l select=4:ncpus=32:mpiprocs=32
    ### Set separate nodes exclusively with "-l place" option
    #PBS -l place=scatter:excl
    ### Set queue name with "-q" option
    #PBS -q standard
    #
    ### Environment variables for using SMT for a pure MPI job
    export MEMORY_AFFINITY=MCM
    export UL_MODE=PURE_MPI
    export UL_TARGET_CPU_LIST=AUTO_SELECT
    #
    cd $WORKDIR
    /usr/bin/poe /site/bin/launch ./myMPI.exe
    #
    # End of example SMT128-32tasks script
  2. Run the same 128-task MPI executable on two 32-way SMT-enabled nodes (at 64 tasks per node) with task-affinity layout controlled via the local "launch" tool:
    #!/bin/bash -l
    ### Set the shell path with the "-S" option
    #PBS -S /bin/bash
    ### Set the name of the job with the "-N" option
    #PBS -N SMT128-64tasks
    ### Set the Project ID with the "-A" option
    #PBS -A NAVOS12345LMA
    ### Set the maximum wall-clock time with the "-l walltime" option
    #PBS -l walltime=02:00:00
    ### Set the number of nodes and number of cores with "-l select" option
    #PBS -l select=2:ncpus=64:mpiprocs=64
    ### Set separate nodes exclusively with "-l place" option
    #PBS -l place=scatter:excl
    ### Set queue name with "-q" option
    #PBS -q standard
    #
    ### Environment variables for using SMT for a pure MPI job
    export MEMORY_AFFINITY=MCM
    export UL_MODE=PURE_MPI
    export UL_TARGET_CPU_LIST=AUTO_SELECT
    #
    cd $WORKDIR
    /usr/bin/poe /site/bin/launch ./myMPI.exe
    #
    # End of example SMT128-64tasks script
  3. Compare results and turnaround times. If your code runs correctly under SMT at 64 tasks per node on half as many nodes, and its walltime is less than double that of the 32-tasks-per-node run on twice as many nodes, you will use fewer allocation hours. This is because jobs on Davinci are charged "walltime_used*32*nodes_used" whether you run 64 tasks or one task on each node assigned to your PBS job. You may also be able to run jobs more often and have more of them become backfill candidates, since fewer nodes are needed for each run.
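The comparison in step 3 can be worked through with a few lines of shell arithmetic. This is a minimal sketch that applies the charge formula quoted above. The function name charge_hours and the walltimes used (2 hours on four nodes, 3 hours on two nodes) are hypothetical values chosen for illustration, not measured results.

```shell
#!/bin/bash
# Charge formula from this guide: walltime_used * 32 * nodes_used
# (32 = physical cores per node, charged whether or not SMT is used).
PHYSICAL_CORES_PER_NODE=32

# Assumption: charge_hours is an illustrative helper, not a Davinci tool.
charge_hours() {
    local walltime_hours=$1 nodes=$2
    echo $(( walltime_hours * PHYSICAL_CORES_PER_NODE * nodes ))
}

# Hypothetical timings, for illustration only:
#   4 nodes at 32 tasks per node finish in 2 hours
#   2 nodes at 64 tasks per node finish in 3 hours (50% longer)
echo "4 nodes, 32 tasks/node: $(charge_hours 2 4) hours charged"
echo "2 nodes, 64 tasks/node: $(charge_hours 3 2) hours charged"
```

With these assumed timings the four-node run is charged 256 hours and the two-node SMT run 192 hours, so the SMT run costs less even though it takes longer. At exactly double the walltime (4 hours on two nodes) the two charges would be equal.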

Hybrid Jobs

An SMT-aware MPI+OpenMP hybrid job that runs four MPI tasks on four nodes, with each MPI task spawning 64 OpenMP threads on a node, is listed below. Task affinity for this code is controlled by the local "launch" tool.

Note: Under AIX 5.3, a known defect causes performance problems in hybrid applications when the application reads stdin redirected from a file, e.g.: poe myHYBRID.exe < namelist. The workaround is to set MP_STDINMODE=0 in the job's environment. This is important for best performance under SMT.

#!/bin/bash -l
### Set the shell path with the "-S" option
#PBS -S /bin/bash
### Set the name of the job with the "-N" option
#PBS -N SMT4tasks-64threads
### Set the Project ID with the "-A" option
#PBS -A NAVOS12345LMA
### Set the maximum wall-clock time with the "-l walltime" option
#PBS -l walltime=02:00:00
### Set the number of nodes and number of cores with "-l select" option
#PBS -l select=4:mpiprocs=1:ncpus=64:ompthreads=64
### Set separate nodes exclusively with "-l place" option
#PBS -l place=scatter:excl
### Set queue name with "-q" option
#PBS -q standard
#
### Environment variables for using SMT for a HYBRID job
export OMP_NUM_THREADS=64
export UL_MODE=HYBRID
export UL_TARGET_CPU_LIST=AUTO_SELECT
#
cd $WORKDIR
/usr/bin/poe /site/bin/launch ./myHYBRID.exe
#
# End of example SMT4tasks-64threads script
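
The example script does not read stdin from a file, so it does not set the MP_STDINMODE workaround described in the note above. If your hybrid job does redirect stdin, a minimal sketch of the extra line is shown here. Its placement alongside the other environment variables and the combined launch line in the comment are assumptions pieced together from the note and the example script; adjust them to match your own job.

```shell
### Workaround for the AIX 5.3 stdin defect in hybrid jobs
export MP_STDINMODE=0
#
# Assumed launch line when stdin is redirected from a file:
#   /usr/bin/poe /site/bin/launch ./myHYBRID.exe < namelist
```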

Summary

To summarize, SMT should help parallel PBS jobs achieve better throughput and deliver more work for the walltime charged to a job.

Each Davinci compute node has about 52 GBytes of local user memory. If your code cannot fit more than 32 tasks in this memory on a node, you won't be able to run more than 32 tasks per node, but you can still take advantage of SMT and task affinity by using the launch tool.
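
To judge how many tasks will fit on a node, divide the node memory by your per-task footprint. The following minimal sketch does that, treating "about 52 GBytes" as exactly 52*1024 MBytes, which is an approximation. The function name max_tasks_per_node and the per-task sizes are hypothetical; measure the actual footprint of your own code.

```shell
#!/bin/bash
# Assumption: "about 52 GBytes" is treated as exactly 52*1024 MBytes.
NODE_MEMORY_MB=$(( 52 * 1024 ))
MAX_TASKS_PER_NODE=64   # virtual CPUs per node with SMT enabled

# Assumption: max_tasks_per_node is an illustrative helper, not a Davinci tool.
max_tasks_per_node() {
    local per_task_mb=$1
    local fit=$(( NODE_MEMORY_MB / per_task_mb ))
    # Never report more tasks than there are virtual CPUs.
    if [ "$fit" -gt "$MAX_TASKS_PER_NODE" ]; then
        fit=$MAX_TASKS_PER_NODE
    fi
    echo "$fit"
}

# Hypothetical per-task footprints, in MBytes
for mb in 800 1200 2000; do
    echo "${mb} MB per task: up to $(max_tasks_per_node "$mb") tasks per node"
done
```

Under these assumptions, a code using 800 MBytes per task can fill all 64 virtual CPUs, one using 1200 MBytes per task fits 44 tasks, and one using 2000 MBytes per task fits only 26, which is below the 32 physical cores.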

The example PBS scripts and compilation examples are available in the $SAMPLES_HOME (/site/HPC_Examples) directory on Davinci.