NUMA
uniform memory architecture (UMA): all CPUs share the same memory, and memory I/O is the bottleneck; non-uniform memory access (NUMA): each CPU socket has its own local memory, and inter-socket memory access is remote.
numactl --cpunodebind
MPI binding APIs
memory allocation strategy in NUMA
interleave: place memory on alternating nodes; the first memory portion goes to node0, the next portion to node1, and so on
membind: force memory to be allocated on a specific node
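As a rough illustration, a minimal libnuma sketch of the two policies (assuming libnuma is installed and linked with -lnuma; node 0 and the 64 MB size are arbitrary choices):

    /* NUMA allocation sketch using libnuma (link with -lnuma). */
    #include <numa.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {              /* kernel/libnuma support check */
            fprintf(stderr, "NUMA not available\n");
            return 1;
        }
        size_t sz = 64 * 1024 * 1024;

        /* interleave: pages spread round-robin across all nodes */
        void *a = numa_alloc_interleaved(sz);

        /* membind: force all pages of this buffer onto node 0 */
        void *b = numa_alloc_onnode(sz, 0);

        /* ... use a and b ... */
        numa_free(a, sz);
        numa_free(b, sz);
        return 0;
    }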
CPU affinity
the benefit of CPU affinity is reduced context-switch cost: since a given process is bound to a specific CPU, the data that process needs (and no other process's data) can stay resident in that CPU's cache. this is especially useful on NUMA architectures, where it keeps data access local, and especially true when the application has a large cache footprint, e.g. in scientific computing.
for general applications, CPU affinity may reduce performance, since it prevents the CPU scheduler from balancing load across cores.
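A minimal Linux sketch of pinning the current process to one CPU with sched_setaffinity (CPU index 2 here is an arbitrary choice):

    /* pin the calling process to CPU 2 (Linux-specific API) */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(2, &set);                                       /* allow only CPU 2 */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {     /* pid 0 = current process */
            perror("sched_setaffinity");
            return 1;
        }
        /* from here on, the scheduler keeps this process on CPU 2 */
        return 0;
    }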
CPU info
/proc/cpuinfo
physical id: socket index
siblings: number of logical CPUs (hardware threads) in each socket
core id: current core index
e.g. Ford HPC CPU architecture:
2 or 4 CPU sockets are grouped into one computing node
each socket has 10 or 12 CPU cores
each socket has an on-board memory shared by its cores
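A small sketch that summarizes these /proc/cpuinfo fields (Linux only; it assumes core ids are contiguous, which is not guaranteed on every CPU):

    /* Sketch: summarize the /proc/cpuinfo fields above (Linux only). */
    #include <stdio.h>

    int main(void) {
        FILE *f = fopen("/proc/cpuinfo", "r");
        if (!f) { perror("fopen"); return 1; }

        char line[256];
        int v, max_phys = -1, siblings = 0, max_core = -1;
        while (fgets(line, sizeof(line), f)) {
            if (sscanf(line, "physical id : %d", &v) == 1 && v > max_phys)      max_phys = v;
            else if (sscanf(line, "siblings : %d", &v) == 1)                    siblings = v;
            else if (sscanf(line, "core id : %d", &v) == 1 && v > max_core)     max_core = v;
        }
        fclose(f);

        printf("sockets             : %d\n", max_phys + 1);
        printf("siblings per socket : %d\n", siblings);
        printf("cores per socket    : %d\n", max_core + 1);   /* assumes contiguous core ids */
        return 0;
    }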
atomic operation
CPU physical mechanism:
the CPU has a bus #LOCK pin. if the LOCK prefix is added before an assembly instruction, the #LOCK pin is asserted (pulled low) for the duration of that instruction, so the bus is locked only for that single instruction
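A tiny C11 example of an operation that typically compiles to a LOCK-prefixed instruction on x86-64 (e.g. lock xadd):

    /* Sketch: a LOCK-prefixed atomic increment via C11 atomics. */
    #include <stdatomic.h>
    #include <stdio.h>

    atomic_int counter = 0;

    int main(void) {
        atomic_fetch_add(&counter, 1);    /* whole read-modify-write is one locked instruction */
        printf("counter = %d\n", atomic_load(&counter));
        return 0;
    }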
cache coherence:
the unit transferred between CPU and main memory is the cache line. within one socket, siblings share the L3 cache. one scenario: CPU1 modifies a variable but has not yet written it back to main memory, while CPU2 reads the old value from main memory and modifies it as well; the copies held by CPU1 and CPU2 are then not coherent.
volatile in C forces the variable to be re-read from memory on every access (it may not be cached in a register), to avoid using a stale copy
another scenario: to maintain cache coherence, when variables that sit on the same cache line are read and written repeatedly by multiple cores, the line keeps bouncing between caches and performance gets worse; this is "false sharing"
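A sketch of the usual fix: pad each thread's counter onto its own cache line (the 64-byte line size is an assumption; the real size is architecture-dependent):

    /* Sketch of avoiding false sharing: each thread's counter gets its own cache line. */
    #include <pthread.h>
    #include <stdio.h>

    #define CACHE_LINE 64

    struct padded_counter {
        volatile long value;
        char pad[CACHE_LINE - sizeof(long)];   /* keep neighbours off this line */
    };

    static struct padded_counter counters[2];

    static void *worker(void *arg) {
        struct padded_counter *c = arg;
        for (long i = 0; i < 10000000; i++)
            c->value++;                        /* each thread touches only its own line */
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, &counters[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        printf("%ld %ld\n", counters[0].value, counters[1].value);
        return 0;
    }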
lock:
semaphore, also called a "sleeping lock": used when the lock needs to be held for a longer time; waiting threads sleep instead of spinning
spin lock: held by at most one thread, blocks other threads from entering the critical section by busy-waiting; used when the lock is held only for a short time
read/write lock: multiple readers may hold it concurrently, but a writer requires exclusive access
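A usage sketch of the POSIX spin lock and read/write lock (pthreads; error checking omitted for brevity):

    /* Sketch: POSIX spin lock vs read/write lock usage. */
    #include <pthread.h>

    pthread_spinlock_t spin;
    pthread_rwlock_t   rwlock = PTHREAD_RWLOCK_INITIALIZER;
    int shared_value;

    void short_update(void) {
        pthread_spin_lock(&spin);        /* busy-wait: only for very short critical sections */
        shared_value++;
        pthread_spin_unlock(&spin);
    }

    int concurrent_read(void) {
        int v;
        pthread_rwlock_rdlock(&rwlock);  /* many readers may hold this at once */
        v = shared_value;
        pthread_rwlock_unlock(&rwlock);
        return v;
    }

    void exclusive_write(int v) {
        pthread_rwlock_wrlock(&rwlock);  /* writer needs exclusive access */
        shared_value = v;
        pthread_rwlock_unlock(&rwlock);
    }

    int main(void) {
        pthread_spin_init(&spin, PTHREAD_PROCESS_PRIVATE);
        short_update();
        exclusive_write(42);
        (void)concurrent_read();
        pthread_spin_destroy(&spin);
        return 0;
    }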
system performance
resident memory (RES): the portion of the virtual memory space that is actually in RAM; swapped memory: when physical memory is full and the system needs more, inactive pages are moved out to swap space, freeing usable memory
virtual memory = res memory + swapped memory
mmap: map a file (or remote data block) into the address space so it can be accessed like local RAM
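A minimal mmap sketch that maps a file and reads it through a plain pointer ("data.bin" is a placeholder path):

    /* Sketch: map a file into the address space and read it like RAM. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.bin", O_RDONLY);          /* placeholder file name */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);

        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        printf("first byte: %d\n", p[0]);             /* plain pointer access, no read() call */

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }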
voluntary context switches(vcs):
when a thread makes a system call that blocks. vcs measures how often the process blocks on system I/O
involuntary context switches(ivcs):
when a thread has been running too long without making a blocking system call, and other processes are waiting for the CPU, the OS switches it out in favor of another process. ivcs measures CPU contention: a process is switched off before it has finished
in general, with more threads the context-switch cost increases: the total number of switches grows, and each individual switch also becomes more expensive, since the CPU cache is limited and each process keeps less of its data in cache.
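Both counters can be read per process with getrusage; a minimal sketch:

    /* Sketch: read voluntary/involuntary context-switch counters with getrusage. */
    #include <stdio.h>
    #include <sys/resource.h>

    int main(void) {
        struct rusage ru;
        if (getrusage(RUSAGE_SELF, &ru) != 0) { perror("getrusage"); return 1; }
        printf("voluntary context switches  : %ld\n", ru.ru_nvcsw);
        printf("involuntary context switches: %ld\n", ru.ru_nivcsw);
        return 0;
    }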
cpu_time (clock_t):
the total amount of CPU time that a process has actually used
user CPU time:
the amount of time spent running in user space
system CPU time:
the amount of time spent running in kernel space
wall-clock time:
the elapsed real time from process start to finish
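A sketch that reports all three with times() (the values are clock_t ticks; divide by _SC_CLK_TCK to get seconds):

    /* Sketch: user CPU time, system CPU time and wall-clock time via times(). */
    #include <stdio.h>
    #include <sys/times.h>
    #include <unistd.h>

    int main(void) {
        struct tms t0, t1;
        clock_t w0 = times(&t0);

        /* ... do some work here ... */

        clock_t w1 = times(&t1);
        long hz = sysconf(_SC_CLK_TCK);               /* clock ticks per second */

        printf("user CPU time  : %.2f s\n", (double)(t1.tms_utime - t0.tms_utime) / hz);
        printf("system CPU time: %.2f s\n", (double)(t1.tms_stime - t0.tms_stime) / hz);
        printf("wall-clock time: %.2f s\n", (double)(w1 - w0) / hz);
        return 0;
    }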
Linux IPC
inter-process communication (IPC) shared memory is used in our application to work together with MPI: IPC shared memory balances the memory distributed across the computing nodes, while the MPI threads are the workhorses that consume the shared data. there are other IPC libraries, e.g. Boost.Interprocess.
ftok() -> generate an IPC key from a given file path (and project id)
shmget() -> create (or look up) a shared memory segment and return a shm_id
shmat() -> attach to the shm_id and return a pointer to that shared memory
shmctl() -> control the segment, e.g. mark it for removal (IPC_RMID)
shmdt() -> detach the shared memory from the current address space
in the design, all threads call shmget()/shmat() to obtain a pointer to the shared memory section, but only the master thread actually creates the segment; the others just attach to and read it. since all threads can access the shared memory section, it is important to keep the region behind the pointer first returned to the master thread clean/unpolluted.
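A sketch of that flow with the System V calls above (the key-file path, project id and segment size are placeholders; the key file must already exist; "is_master" stands in for a rank-0 check):

    /* Sketch of the System V shared-memory flow described above. */
    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    #define SHM_SIZE (64 * 1024 * 1024)

    int main(int argc, char **argv) {
        int is_master = (argc > 1);                       /* placeholder for "rank == 0" */

        key_t key = ftok("/tmp/shm_key_file", 42);        /* same existing path + id everywhere */
        if (key == (key_t)-1) { perror("ftok"); return 1; }

        /* master creates the segment; the others just look it up */
        int flags = is_master ? (IPC_CREAT | 0600) : 0600;
        int shm_id = shmget(key, SHM_SIZE, flags);
        if (shm_id < 0) { perror("shmget"); return 1; }

        char *p = shmat(shm_id, NULL, 0);                 /* pointer into the shared section */
        if (p == (void *)-1) { perror("shmat"); return 1; }

        if (is_master)
            p[0] = 1;                                     /* master fills, others only read */

        shmdt(p);                                         /* detach from this process */
        if (is_master)
            shmctl(shm_id, IPC_RMID, NULL);               /* mark segment for removal */
        return 0;
    }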
MPI
there are a few basic MPI APIs.
in our project, to benefit both CPU and memory performance, we actually need to: 1) subgroup the MPI communicator, 2) bind MPI threads to sockets (see the sketch after the API list below).
MPI_Comm_group()
MPI_Group_incl()
MPI_Comm_create()
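A minimal sketch of sub-grouping the communicator with these three calls (splitting off the first half of the ranks is just an example choice; compile with mpicc):

    /* Sketch: build a sub-communicator from the first half of the ranks. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        MPI_Group world_group, sub_group;
        MPI_Comm_group(MPI_COMM_WORLD, &world_group);

        /* pick the first size/2 ranks as the subgroup (arbitrary example split) */
        int n = size / 2 > 0 ? size / 2 : 1;
        int ranks[n];                                 /* C99 VLA for brevity */
        for (int i = 0; i < n; i++) ranks[i] = i;
        MPI_Group_incl(world_group, n, ranks, &sub_group);

        MPI_Comm sub_comm;
        MPI_Comm_create(MPI_COMM_WORLD, sub_group, &sub_comm);   /* collective call */

        if (sub_comm != MPI_COMM_NULL) {              /* ranks outside the group get NULL */
            int sub_rank;
            MPI_Comm_rank(sub_comm, &sub_rank);
            printf("world rank %d -> sub rank %d\n", rank, sub_rank);
            MPI_Comm_free(&sub_comm);
        }
        MPI_Group_free(&sub_group);
        MPI_Group_free(&world_group);
        MPI_Finalize();
        return 0;
    }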
APIs and numactl at run time:
mpirun -bycore -bind-to-socket -> map processes core by core and bind each process to its socket
mpirun -bysocket -bind-to-socket -> map processes round-robin across sockets and bind each to its socket
mpirun -np 32 numactl --membind=0 ./run