NUMA
uniform memory architecture (UMA): all CPUs share the same memory, and memory I/O is the bottleneck; non-uniform memory access (NUMA): each CPU socket has its own local memory, and inter-socket memory access is remote.
numactl --cpunodebind
MPI binding APIs
memory allocation strategy in NUMA
interleave: place memory on alternating nodes; the first memory portion goes to node0, the next portion to node1, and so on
membind: force memory to be allocated on a specific node
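As a rough illustration, a minimal libnuma sketch of the two policies (assuming libnuma is installed and linked with -lnuma; node 0 and the 64 MB size are arbitrary choices):

    /* NUMA allocation sketch using libnuma (link with -lnuma). */
    #include <numa.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {              /* kernel/libnuma support check */
            fprintf(stderr, "NUMA not available\n");
            return 1;
        }
        size_t sz = 64 * 1024 * 1024;

        /* interleave: pages spread round-robin across all nodes */
        void *a = numa_alloc_interleaved(sz);

        /* membind: force all pages of this buffer onto node 0 */
        void *b = numa_alloc_onnode(sz, 0);

        /* ... use a and b ... */
        numa_free(a, sz);
        numa_free(b, sz);
        return 0;
    }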
CPU affinity
the benefit of CPU affinity is reduced context-switch cost: since a given process is bound to a specific CPU, the data that process needs (and no other process's data) can stay resident in that CPU's cache. this is especially useful on NUMA architectures, where it keeps data access local, and especially true when the application has a large cache footprint, e.g. in scientific computing.
for general applications, CPU affinity may reduce performance, since it prevents the CPU scheduler from balancing load across cores.
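A minimal Linux sketch of pinning the current process to one CPU with sched_setaffinity (CPU index 2 here is an arbitrary choice):

    /* pin the calling process to CPU 2 (Linux-specific API) */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(2, &set);                                       /* allow only CPU 2 */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {     /* pid 0 = current process */
            perror("sched_setaffinity");
            return 1;
        }
        /* from here on, the scheduler keeps this process on CPU 2 */
        return 0;
    }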
CPU info
/proc/cpuinfo
physical id: socket index
siblings: number of logical CPUs (hardware threads) in each socket
core id: current core index
e.g. Ford HPC CPU architecture:
2 or 4 CPU sockets are grouped into one computing node
each socket has 10 or 12 CPU cores
each socket has an on-board memory shared by its cores
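A small sketch that summarizes these /proc/cpuinfo fields (Linux only; it assumes core ids are contiguous, which is not guaranteed on every CPU):

    /* Sketch: summarize the /proc/cpuinfo fields above (Linux only). */
    #include <stdio.h>

    int main(void) {
        FILE *f = fopen("/proc/cpuinfo", "r");
        if (!f) { perror("fopen"); return 1; }

        char line[256];
        int v, max_phys = -1, siblings = 0, max_core = -1;
        while (fgets(line, sizeof(line), f)) {
            if (sscanf(line, "physical id : %d", &v) == 1 && v > max_phys)      max_phys = v;
            else if (sscanf(line, "siblings : %d", &v) == 1)                    siblings = v;
            else if (sscanf(line, "core id : %d", &v) == 1 && v > max_core)     max_core = v;
        }
        fclose(f);

        printf("sockets             : %d\n", max_phys + 1);
        printf("siblings per socket : %d\n", siblings);
        printf("cores per socket    : %d\n", max_core + 1);   /* assumes contiguous core ids */
        return 0;
    }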
atomic operation
CPU physical mechanism:
the CPU has a bus #LOCK pin. if the LOCK prefix is added before an assembly instruction, the #LOCK pin is asserted (pulled low) for the duration of that instruction, so the bus is locked only for that single instruction
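A tiny C11 example of an operation that typically compiles to a LOCK-prefixed instruction on x86-64 (e.g. lock xadd):

    /* Sketch: a LOCK-prefixed atomic increment via C11 atomics. */
    #include <stdatomic.h>
    #include <stdio.h>

    atomic_int counter = 0;

    int main(void) {
        atomic_fetch_add(&counter, 1);    /* whole read-modify-write is one locked instruction */
        printf("counter = %d\n", atomic_load(&counter));
        return 0;
    }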
cache coherence:
the unit transferred between CPU and main memory is the cache line. within one socket, siblings share the L3 cache. one scenario: CPU1 modifies a variable but has not yet written it back to main memory, while CPU2 reads the old value from main memory and modifies it as well; the copies held by CPU1 and CPU2 are then not coherent.
volatile in C forces the variable to be re-read from memory on every access (it may not be cached in a register), to avoid using a stale copy
another scenario: to maintain cache coherence, when variables that sit on the same cache line are read and written repeatedly by multiple cores, the line keeps bouncing between caches and performance gets worse; this is "false sharing"
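A sketch of the usual fix: pad each thread's counter onto its own cache line (the 64-byte line size is an assumption; the real size is architecture-dependent):

    /* Sketch of avoiding false sharing: each thread's counter gets its own cache line. */
    #include <pthread.h>
    #include <stdio.h>

    #define CACHE_LINE 64

    struct padded_counter {
        volatile long value;
        char pad[CACHE_LINE - sizeof(long)];   /* keep neighbours off this line */
    };

    static struct padded_counter counters[2];

    static void *worker(void *arg) {
        struct padded_counter *c = arg;
        for (long i = 0; i < 10000000; i++)
            c->value++;                        /* each thread touches only its own line */
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, &counters[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        printf("%ld %ld\n", counters[0].value, counters[1].value);
        return 0;
    }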
lock:
semaphore, also called a "sleeping lock": used when the lock needs to be held for a longer time; waiting threads sleep instead of spinning
spin lock: held by at most one thread, blocks other threads from entering the critical section by busy-waiting; used when the lock is held only for a short time
read/write lock: multiple readers may hold it concurrently, but a writer requires exclusive access
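A usage sketch of the POSIX spin lock and read/write lock (pthreads; error checking omitted for brevity):

    /* Sketch: POSIX spin lock vs read/write lock usage. */
    #include <pthread.h>

    pthread_spinlock_t spin;
    pthread_rwlock_t   rwlock = PTHREAD_RWLOCK_INITIALIZER;
    int shared_value;

    void short_update(void) {
        pthread_spin_lock(&spin);        /* busy-wait: only for very short critical sections */
        shared_value++;
        pthread_spin_unlock(&spin);
    }

    int concurrent_read(void) {
        int v;
        pthread_rwlock_rdlock(&rwlock);  /* many readers may hold this at once */
        v = shared_value;
        pthread_rwlock_unlock(&rwlock);
        return v;
    }

    void exclusive_write(int v) {
        pthread_rwlock_wrlock(&rwlock);  /* writer needs exclusive access */
        shared_value = v;
        pthread_rwlock_unlock(&rwlock);
    }

    int main(void) {
        pthread_spin_init(&spin, PTHREAD_PROCESS_PRIVATE);
        short_update();
        exclusive_write(42);
        (void)concurrent_read();
        pthread_spin_destroy(&spin);
        return 0;
    }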
system performance
resident memory (RES): the portion of the virtual memory space that is actually in RAM; swapped memory: when physical memory is full and the system needs more, inactive pages are moved out to swap space, freeing usable memory
virtual memory = res memory + swapped memory
mmap: map a file (or remote data block) into the address space so it can be accessed like local RAM
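A minimal mmap sketch that maps a file and reads it through a plain pointer ("data.bin" is a placeholder path):

    /* Sketch: map a file into the address space and read it like RAM. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.bin", O_RDONLY);          /* placeholder file name */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);

        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        printf("first byte: %d\n", p[0]);             /* plain pointer access, no read() call */

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }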
voluntary context switches(vcs):
when a thread makes a system call that blocks. vcs measures how often the process blocks on system I/O
involuntary context switches(ivcs):
when a thread has been running too long without making a blocking system call, and other processes are waiting for the CPU, the OS switches it out in favor of another process. ivcs measures CPU contention: a process is switched off before it has finished
in general, with more threads the context-switch cost increases: the total number of switches grows, and each individual switch also becomes more expensive, since the CPU cache is limited and each process keeps less of its data in cache.
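Both counters can be read per process with getrusage; a minimal sketch:

    /* Sketch: read voluntary/involuntary context-switch counters with getrusage. */
    #include <stdio.h>
    #include <sys/resource.h>

    int main(void) {
        struct rusage ru;
        if (getrusage(RUSAGE_SELF, &ru) != 0) { perror("getrusage"); return 1; }
        printf("voluntary context switches  : %ld\n", ru.ru_nvcsw);
        printf("involuntary context switches: %ld\n", ru.ru_nivcsw);
        return 0;
    }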
cpu_time (clock_t):
the total amount of CPU time that a process has actually used
user CPU time:
the amount of time spent running in user space
system CPU time:
the amount of time spent running in kernel space
wall-clock time:
the elapsed real time from process start to finish
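A sketch that reports all three with times() (the values are clock_t ticks; divide by _SC_CLK_TCK to get seconds):

    /* Sketch: user CPU time, system CPU time and wall-clock time via times(). */
    #include <stdio.h>
    #include <sys/times.h>
    #include <unistd.h>

    int main(void) {
        struct tms t0, t1;
        clock_t w0 = times(&t0);

        /* ... do some work here ... */

        clock_t w1 = times(&t1);
        long hz = sysconf(_SC_CLK_TCK);               /* clock ticks per second */

        printf("user CPU time  : %.2f s\n", (double)(t1.tms_utime - t0.tms_utime) / hz);
        printf("system CPU time: %.2f s\n", (double)(t1.tms_stime - t0.tms_stime) / hz);
        printf("wall-clock time: %.2f s\n", (double)(w1 - w0) / hz);
        return 0;
    }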
Linux IPC
inter-process communication (IPC) shared memory is used in our application to work together with MPI: IPC shared memory balances the memory distributed across the computing nodes, while the MPI threads are the workhorses that consume the shared data. there are other IPC libraries, e.g. Boost.Interprocess.
ftok() -> generate an IPC key from a given file path (and project id)
shmget() -> create (or look up) a shared memory segment and return a shm_id
shmat() -> attach to the shm_id and return a pointer to that shared memory
shmctl() -> control the segment, e.g. mark it for removal (IPC_RMID)
shmdt() -> detach the shared memory from the current address space
in the design, all threads call shmget()/shmat() to obtain a pointer to the shared memory section, but only the master thread actually creates the segment; the others just attach to and read it. since all threads can access the shared memory section, it is important to keep the region behind the pointer first returned to the master thread clean/unpolluted.
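A sketch of that flow with the System V calls above (the key-file path, project id and segment size are placeholders; the key file must already exist; "is_master" stands in for a rank-0 check):

    /* Sketch of the System V shared-memory flow described above. */
    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    #define SHM_SIZE (64 * 1024 * 1024)

    int main(int argc, char **argv) {
        int is_master = (argc > 1);                       /* placeholder for "rank == 0" */

        key_t key = ftok("/tmp/shm_key_file", 42);        /* same existing path + id everywhere */
        if (key == (key_t)-1) { perror("ftok"); return 1; }

        /* master creates the segment; the others just look it up */
        int flags = is_master ? (IPC_CREAT | 0600) : 0600;
        int shm_id = shmget(key, SHM_SIZE, flags);
        if (shm_id < 0) { perror("shmget"); return 1; }

        char *p = shmat(shm_id, NULL, 0);                 /* pointer into the shared section */
        if (p == (void *)-1) { perror("shmat"); return 1; }

        if (is_master)
            p[0] = 1;                                     /* master fills, others only read */

        shmdt(p);                                         /* detach from this process */
        if (is_master)
            shmctl(shm_id, IPC_RMID, NULL);               /* mark segment for removal */
        return 0;
    }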
MPI
there are a few basic MPI APIs.
in our project, to benefit both CPU and memory performance, we actually need to: 1) subgroup the MPI communicator, 2) bind MPI threads to sockets (see the sketch after the API list below).
MPI_Comm_group()
MPI_Group_incl()
MPI_Comm_create()
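A minimal sketch of sub-grouping the communicator with these three calls (splitting off the first half of the ranks is just an example choice; compile with mpicc):

    /* Sketch: build a sub-communicator from the first half of the ranks. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        MPI_Group world_group, sub_group;
        MPI_Comm_group(MPI_COMM_WORLD, &world_group);

        /* pick the first size/2 ranks as the subgroup (arbitrary example split) */
        int n = size / 2 > 0 ? size / 2 : 1;
        int ranks[n];                                 /* C99 VLA for brevity */
        for (int i = 0; i < n; i++) ranks[i] = i;
        MPI_Group_incl(world_group, n, ranks, &sub_group);

        MPI_Comm sub_comm;
        MPI_Comm_create(MPI_COMM_WORLD, sub_group, &sub_comm);   /* collective call */

        if (sub_comm != MPI_COMM_NULL) {              /* ranks outside the group get NULL */
            int sub_rank;
            MPI_Comm_rank(sub_comm, &sub_rank);
            printf("world rank %d -> sub rank %d\n", rank, sub_rank);
            MPI_Comm_free(&sub_comm);
        }
        MPI_Group_free(&sub_group);
        MPI_Group_free(&world_group);
        MPI_Finalize();
        return 0;
    }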
APIs and numactl at run time:
mpirun -bycore -bind-to-socket -> map processes core by core and bind each process to its socket
mpirun -bysocket -bind-to-socket -> map processes round-robin across sockets and bind each to its socket
mpirun -np 32 numactl --membind=0 ./run