gpu rendering in k8s cloud

background

to deploy simulation in docker swarm, there is a hurting that GPU rendering in docker requires a physical monitor(as $DISPLAY env variable). which is due to GeForece(Quadro 2000) GPUs, for Tesla GPUs there is no depends on monitor. really intertesting to know.

there were two thoughts how to deploy simulation rendering in cloud, either a gpu rendering cluster, or a headless mode in container cluster.

unity rendering

unity rendering pipeline is as following ：

CPU calculate what need to draw and how to draw
CPU send commands to GPU
GPU plot

direct rendering

for remote direct rendering for GLX/openGL, Direct rendering will be done on the X client (the remote machine), then rendering results will be transferred to X server for display. Indirect rendering will transmit GL commands to the X server, where those commands will be rendered using server’s hardware.

1 2	direct rendering: Yes OpenGL vendor string: NVIDIA

Direct Rendering Infrasturcture(DRI) means the 3D graphics operations are hardware acclerated, Indirect rendering is used to sya the graphics operations are all done in software.

DRI enable to communicate directly with gpus, instead of sending graphics data through X server, resulting in 10x faster than going through X server

for performance, mostly the games/video use direct rendering.

headless mode

to disable rendering, either with dummy display. there is a blog how to run opengl in xdummy in AWS cloud; eitherwith Unity headless mode, When running in batch mode, do not initialize graphics device at all. This makes it possible to run your automated workflows on machines that don’t even have a GPU. to make a game server running on linux, and use the argument “-batchmode” to make it run in headless mode. Linux -nographics support -batchmode is normally the headless flag that works without gpu and audio hw acceleration

GPU container

a gpu containeris needs to map NVIDIA driver from host to the container, which is now done by nvidia-docker; the other issue is gpu-based app usually has its own (runtime) libs, which needs to install in the containers. e.g. CUDA, deep learning libs, OpengGL/vulkan libs.

k8s device plugin

k8s has default device plugin model to support gpu, fpga e.t.c devices, but it’s in progress, nvidia has a branch to support gpu as device plugin in k8s.

nvidia docker mechanism

when application call cuda/vulkan API/libs, it further goes to cuda/vulkan runtime, which further ask operations on GPU devices.

cloud DevOps

simulation in cloud

so far, there are plenty of cloud simulation online, e.g. ali, tencent, huawei/octopus, Cognata. so how these products work? all of them has containerized. as our test said, only Tesla GPU is a good one for containers, especially for rendering needs; GeForce GPU requires DISPLAY or physical monitor to be used for rendering.

cloud vendors for carmakers

as cloud simulation is heavily infrastructure based, OEMs and cloud suppliers are coming together:

FAW -> Ali cloud
GAC -> Tencent Cloud
SAC -> Ali cloud / fin-shine
Geely -> Ali cloud
Changan -> Ali cloud
GWM -> Ali cloud
Xpeng, Nio (?)
PonyAI/WeRide (?)

obviously, Ali cloud has a major benefit among carmakers, which actually give Ali cloud new industry to deep in. traditionally, carmakers doesn’t have strong DevOps team, but now there is a change.

we can see the changing, as connected vehicles service, sharing cars, MaaS, autonomous vehicles are more and more requring a strong DevOps team. but not the leaders in carmakers realize yet.