1 Deep Reinforcement Learning link
apps: motion planning
2 Convolutional Neural Networks link
apps: end-to-end driving task (pedestrian detection)
3 Recurrent Neural Networks link
apps: steering control through time
I have used C++ for a couple of years already, but never took a close look at it. Looking back at a few C++ demo projects from school, I had a strong feeling that, at the time, making the code run was my only goal; how it worked, or how to improve it, was always ignored. I am afraid of the language details, as the wisdom says: the devil is in the details.
after working for three years, I feel it necessary and urgent to go back to the language itself (e.g. Linux, C/C++) as the first step to move forward in application development. frameworks and engineering-oriented APIs are closer to the final product, which makes them more attractive than the details of how things are implemented; like the mechanical-engineering undergraduate who runs ABAQUS for the first time and feels so good about the beautiful visualized results.
anyway, I have to delay that short-lived satisfaction. C is clean and C applications have a clear structure; C++ is more metaphysical, and I don't even know where to start. even though I thought I knew C++ well, there are actually many details behind it, e.g. the allocator in the STL.
templates are used for generic programming, e.g. vector<int> and vector<double> share one implementation.
the actual meaning of TYPE is deduced by the compiler from the argument passed to the function; "class" and "typename" are interchangeable here.
templates support default parameters; once the i-th parameter is given a default value, all parameters after it must have default values too.
you can feel the power of templates in the STL source code.
references:
<<The Annotated STL Sources (using SGI STL)>> by jjHou
a visitor's guide to C++ allocators
the annotated STL source
it took me several days to get started, because the first section, on allocators, already blocked me.
the logic of a dynamic container doesn't depend on the specifics of how memory is obtained; namely, the design of container classes should be decoupled from the memory allocation policy.
memory allocation and object construction are separated; an allocator has four operations: allocate(), deallocate(), construct(), destroy(). there is a standard interface for allocators.
when we say A is an allocator for T, where T is a type, e.g. AllocatorTraits::value_type, we mean that A knows how to obtain and release memory to store objects of type T. git:allocator
iterators are used to build algorithms over containers, though I don't really get the traits yet.
std::vector is dynamic and contiguous.
std::list is a circular doubly-linked list.
std::deque can operate on elements at both ends; its memory is multi-sectional, with each section linearly contiguous, combining the advantages of vector and list.
std::stack operates only on the top element (first in, last out) and can't be iterated; the default underlying container for stack is deque.
std::queue supports only popping elements from the front and pushing elements at the back; the default underlying container is deque.
std::priority_queue is a queue whose elements are ordered by priority.
actually this post should be named "pure C multithreads", but it's better to write some real code: a simple threadpool in pure C.
in the simplest case, what should a threadpool object do?
thpool_init(num_threads);
thpool_destroy(threadpool_obj);
thpool_add_work(callback, args);
// so a user-defined event callback() can be passed in
the threadpool assigns jobs (from a job list) to threads, so the job list should be an internal class; it's better to package thread and job as separate classes too. to add work, what kinds of functions does the job list need? at least one to add a new job (from outside) to the job list.
will multiple threads call add_task() simultaneously? if so, the jobqueue should have a mutex object;
during threadpool initialization, will all threads be created simultaneously and immediately? if not, there should be thread status flags (creating, existing-but-idle, working, dead), and updating these flags needs to be protected by a mutex object;
this is interesting: what's in my mind when designing a lib. hopefully I'll implement it by the end of the week.
this is a review of the GNU C library. a socket is a two-way communication channel: both read and write can be performed at either end.
how to debug server/client code? chatRoom
this is a review of GNU C programming.
there are two basic mechanisms for representing the connection between your program and a file: streams & file descriptors. a file descriptor is represented as an object of type int; a stream is represented as a FILE * object.
a stream is an abstract concept representing a communication channel to a file or a device; think of it as a sequence of characters, with functions to take characters out of one end and put characters into the other end, like a pipe.
the file position is an integer representing the number of bytes from the beginning of the file; each time a character is read or written, the file position is incremented. namely, access to a file is sequential.
usually a "block" stands for fixed-size binary data or text, rather than characters or lines.
a stream and a file do not communicate character-by-character; there is a buffer in between for I/O.
async I/O, event I/O, interrupt driven I/O, GPIO …
/etc/profile -> system-wide shell environment & startup settings
~/.bashrc -> per-user definitions for shell funcs & aliases
~/.bash_profile -> per-user login environment
usually during application configuration on Mac or Linux, ~/.bashrc is the place to add project-related environment variables.
MPI gdb (each MPI process gets an independent terminal window):
mpirun -np #cpus xterm -e gdb ./exe
set breakpoint in different src:
b sth.cxx:20
print the first `length` elements of an array:
p *a@length
add input file:
set args -j input_file
load the executable (and its sources) after starting GDB:
gdb
file exe
set args -j input_file
bash guide for beginners
tao of regular expression
Linux programmer’s manual
Linux system administrators guide
expert C programming
what every programmer should know about CPU caches
understand CPU utilization & optimization
Intel: optimizing applications for NUMA
Intel guide for developing multithreaded applications
MPI parallel programming in Python
considerations in software design for multi-core, multiprocessor architectures
how to optimize GEMM
optimize for Intel AVX using MKL with DGEMM
GEMM: from pure C to SSE optimized micro kernel
a practical guide to SSE SIMD with C++
multi-core design
Unix and pthread programming
introduction to post processing finite element results with AVS
CAE Linux: FEA inter-operability
open sourcing a Python project the right way
PPSS
pyNastran
hdfviewer
valgrind
starting from 2016, I went through every hot topic of the day and even did some study in DL, AV, CV, etc. every time the passion burst out and I promised to, e.g., study a framework or contribute to an open project; in reality the passion soon died away. it's like a new and very attractive concept popping up in the market: no business model can carry it, so it dies out.
the downside of this learning pattern is that the fundamentals are ignored; e.g. I can't succeed in coding interviews and have little experience with basic algorithms. this kind of person always envisions the big picture but doesn't work out how to get there. it ends up as waste; right now I just want to stay focused.
typedef defines "gid" as an alias for "struct gid_". typedef can also alias a function handle/pointer, which is often used in asynchronous/event-callback programming. other data encapsulations are:
enum: maps a list of constant names to integral constants
union: stores multiple data types at one memory address
in C++, structure initialization can also be done in a constructor.
for better memory access on CPU architectures, memory alignment inside a structure matters; namely:
chars can start on any byte address
2-byte shorts must start on an even address
4-byte ints/floats must start on an address divisible by 4
8-byte doubles/longs must start on an address divisible by 8
the size of the whole structure is aligned to an integral multiple of the size of its largest member; e.g. sizeof(gid) = 24, not 8+8+4 = 20.
putting the member variables in ascending/descending order of size is good practice.
incrementing a gid* pointer steps forward by sizeof(gid); a structure also supports self-reference:
another common pattern is an array of structures:
in general, a structure can be passed to a function by value or by pointer, but not by reference in pure C; likewise, a structure returned from a function can be a value or a pointer.
in C++, a structure supports member functions and is the same as a class whose members are public by default. initialization can be done either in a constructor or directly during definition. see the differences of struct between C and C++.
in pure C, a string is a char array terminated with NUL ('\0'). an array of char (a string) is initialized with a string literal (a double-quoted string).
in general, a string literal is used as the initializer for an array of char; anywhere else, it is a static array of chars, which shouldn't be modified through a pointer.
also be aware of the downsides of C strings.
when a derived object calls a base-class member function, which in turn calls a virtual member function that is overridden in the derived class, which virtual member function is actually called?
output is:
ACM2::getDataIndex
so the derived object always calls the override closest to its own class, even when the call is made from inside a base-class member function.
uniform memory architecture (UMA): all CPUs in the same socket share the memory, and memory I/O is the bottleneck. non-uniform memory access (NUMA): each CPU has its own local memory, and inter-CPU memory access is remote.
numactl --cpunodebind
MPI binding APIs
memory allocation strategy in NUMA
interleave: place memory on alternating nodes;
the first memory portion on node0, the next
portion on node1
membind: force memory allocation on a specific node
the benefit of CPU affinity is reduced context switching: a specific process is bound to a specific CPU, so the data that process needs (with no other process's data switched in) can stay in that CPU's cache. this is especially useful on NUMA architectures for accessing data locally, and especially true when the application generates a large cache footprint, e.g. in scientific computing.
for general applications, CPU affinity may reduce performance, since it prevents the CPU scheduler from working properly.
/proc/cpuinfo
physical id: socket index
siblings: number of cores in each socket
core id: current core index
e.g. Ford HPC CPU architecture:
2 or 4 CPU sockets group into one computing node
each socket has 10 or 12 CPU cores
each socket has an on-board shared memory
CPU physical mechanism:
physically there is a bus #LOCK pin.
if a LOCK prefix is added before an assembly
instruction, the #LOCK pin is pulled down
for the duration of that instruction,
so the bus is locked only for the current
instruction
cache coherence:
the unit of cache transfer between CPU and main memory is the cache line. within one socket, since sibling cores share the L3 cache, there is a scenario where CPU1 modifies a variable but has not yet written it back to main memory, while CPU2 reads the old value from main memory and modifies it again; then the variable seen by CPU1 and CPU2 is not coherent.
volatile in C forces the variable to be re-read
from memory on every access, to avoid using a
stale cached copy
in another scenario, cache coherence must be maintained while variables sharing one cache line are read and written repeatedly by multiple cores; performance degrades badly: "false sharing".
lock:
semaphore, also called a "sleeping lock": used when
the lock needs to be held for a longer time
spin lock: held by at most one thread, blocking
other threads from the critical section; used
when the lock is held only for a short time
write/read lock:
resident memory (RES) is the portion of the virtual memory space that is actually in RAM; swapped memory: when physical memory is full and the system needs more, inactive pages are moved out to swap space, and usable memory is swapped back in.
virtual memory = resident memory + swapped memory
mmap: access a remote data block as if it were local RAM.
voluntary context switches (vcs):
occur when a thread makes a system call that blocks;
vcs measures the frequency of blocking system I/O
involuntary context switches (ivcs):
occur when a thread has been running too long
without making a blocking system call, and other
processes are waiting for the CPU; the OS then
switches it out in favor of others. ivcs measures
CPU competition: an unfinished process is switched off
in general, with more threads the context-switch cost increases: the total number of switches grows, and each switch is also more expensive, since CPU cache is limited and each process can hold less data in cache.
cpu_time (clock_t):
the total amount of CPU time that a process has actually used
user CPU time:
the amount of time spent running in user space
system CPU time:
the amount of time spent running in kernel space
wall-clock time:
the elapsed time from process start to finish
inter-process communication (IPC) shared memory is used in our application to work together with MPI: IPC helps balance memory distributed across multiple computing nodes, while the MPI processes are the workhorses consuming the shared data. there are other IPC libs, e.g. Boost.Interprocess.
ftok() -> generate an IPC key based on a given file path
shmget() -> create a shared memory segment and return a shm_id
shmat() -> attach to the shm_id and return a pointer to the shared memory
shmctl() -> control the segment (e.g. mark it for removal with IPC_RMID)
shmdt() -> detach the pointer from the segment
in our design, all processes call shmget() to get a handle to the shared memory section, but only the master actually creates it; the others attach and read. since every process can access the shared section, it's important to keep the pointer first returned to the master clean/unpolluted.
there are a few basic MPI APIs.
in our project, to benefit both CPU and memory performance, we actually need to: 1) subgroup the MPI communicator, 2) bind MPI processes to sockets.
MPI_Comm_group()
MPI_Group_incl()
MPI_Comm_create()
APIs and numactl at run time:
mpirun -bycore -bind-to-socket -> bind processes to each core on the socket
mpirun -bysocket -bind-to-socket
mpirun numactl --membind=0 -np 32 ./run