background
I previously tried to deploy lgsvl with docker compose v3. It sounded promising at first, but it does not work because compose v3 lacks runtime support. docker service create --generic-resource is another option.
docker service options
docker service supports a few common options:
--workdir is the working directory inside the container
--args (available on docker service update) updates the command the service runs
--publish <Published-Port>:<Service-Port>
--network
--mount
--mode
--env
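As a rough illustration of how these options combine, the sketch below assembles a docker service create command and prints it for review. The service name demo, the image nginx:alpine, the overlay network demo-net, and the volume demo-data are placeholders chosen for illustration, not values from this setup.

```shell
#!/bin/sh
# Placeholder names (assumptions for illustration only).
SERVICE_NAME="demo"
IMAGE="nginx:alpine"

# Build the command as a string so it can be reviewed before running.
# --network expects an overlay network that already exists on the swarm.
CMD="docker service create \
--name ${SERVICE_NAME} \
--mode replicated \
--workdir /usr/share/nginx/html \
--env MODE=test \
--publish 8080:80 \
--network demo-net \
--mount type=volume,src=demo-data,dst=/data \
${IMAGE}"

echo "${CMD}"
# On a swarm manager, run it with: eval "${CMD}"
```

Printing the command first is only a convenience for reviewing the flags; on a real manager node the command can be typed directly.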
docker service create with generic-resource
generic-resource
Creating services that request generic resources is well supported:
|
|
Tip: the keyword NVIDIA-GPU is not actually the real tag. generic_resources is also supported in docker compose v3.5:
|
|
--generic-resource makes it possible to access GPUs from a service. A few blog posts on the topic:
first try
Follow accessing GPUs from swarm service: install nvidia-container-runtime and docker-compose, then run the following script:
|
|
To see which dockerd options are supported, check here. Then run the test as:
docker service create --name vkcc --generic-resource "gpu=0" --constraint 'node.role==manager' nvidia/cudagl:9.0-base-ubuntu16.04
docker service create --name vkcc --generic-resource "gpu=0" --env DISPLAY=unix:$DISPLAY --mount src="X11-unix",dst="/tmp/.X11-unix" --constraint 'node.role==manager' vkcube
which gives the errors:
1/1: no suitable node (1 node not available for new tasks; insufficient resourc…
1/1: no suitable node (insufficient resources on 2 nodes)
If run as follows, where GPU-9b5113ed is the physical GPU ID on the node:
docker service create --name vkcc --generic-resource "gpu=GPU-9b5113ed" nvidia/cudagl:9.0-base-ubuntu16.04
which gives the error:
invalid generic-resource request `gpu=GPU-9b5113ed`, Named Generic Resources is not supported for service create or update
These errors occur because the swarm cluster can't recognize the GPU resource, which is configured in /etc/nvidia-container-runtime/config.toml
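For reference, a short ID such as GPU-9b5113ed can be derived from the UUID that nvidia-smi reports. The sketch below assumes the "GPU UUID : GPU-..." line layout printed by nvidia-smi -a. The function name short_gpu_id is a hypothetical helper, and the 12-character cut simply matches the length of the ID used above.

```shell
#!/bin/sh
# Hypothetical helper: read `nvidia-smi -a` output on stdin and print the
# first 12 characters of the first GPU UUID (e.g. GPU-9b5113ed).
short_gpu_id() {
  # Fields on the matching line: "GPU", "UUID", ":", "<uuid>".
  awk '/GPU UUID/ { print substr($4, 1, 12); exit }'
}

# Typical use on a GPU node:
#   GPU_ID=$(nvidia-smi -a | short_gpu_id)
```

Only the first GPU is reported; a node with several GPUs would need the exit removed and one ID handled per line.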
second try
As mentioned in GPU orchestration using Docker, another change can be made:
ExecStart=/usr/bin/dockerd -H unix:///var/run/docker.sock --default-runtime=nvidia --node-generic-resource gpu=${GPU_ID}
which fixes the no suitable node issue, but the container fails to start: OCI..
|
|
Check the daemon log with sudo journalctl -fu docker.service, which gives:
|
|
third try
following issue #141
|
|
and run:
docker service create --name vkcc --generic-resource "gpu=1" --env DISPLAY --constraint 'node.role==manager' nvidia/cudagl:9.0-base-ubuntu16.04
It works, with the output verify: Service converged. However, testing with the vkcube or lgsvl image fails with:
|
|
To debug the non-zero exit (1):
docker service ls #get the dead service-ID
docker [service] inspect r14a68p6v1gu # check
docker ps -a # find the dead container-ID
docker logs ff9a1b5ca0de # check the log of the failure container
It gives: Cannot find a compatible Vulkan installable client driver (ICD)
See the issue at gitlab/nvidia-images.
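The debugging steps above can be wrapped in a small helper. This is a sketch only: debug_service is a hypothetical function name, and the DOCKER variable is an assumption added so the commands can be previewed (DOCKER=echo) without touching the swarm. docker service ps --no-trunc is used because it prints the full task error message.

```shell
#!/bin/sh
# Hypothetical helper: show why the tasks of a service failed.
# Usage: debug_service <service-name>
debug_service() {
  svc="$1"
  # Task history with the full (untruncated) error column.
  ${DOCKER:-docker} service ps --no-trunc "$svc"
  # Aggregated stdout/stderr of the service's tasks.
  ${DOCKER:-docker} service logs "$svc"
}

# Example (on a manager node): debug_service vkcc
```

Inspecting the dead container with docker ps -a and docker logs, as done above, remains useful when the task never produced service-level logs.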
fourth try
docker service create --name glx --generic-resource "gpu=1" --constraint 'node.role==manager' --env DISPLAY --mount src="X11-unix",dst="/tmp/.X11-unix" --mount src="tmp",dst="/root/.Xauthority" --network host 192.168.0.10:5000/glxgears
BINGO! It does serve OpenGL/glxgears in service mode. However, there are a few issues:
the service is constrained to the manager node
it requires the host network
the X11-unix and Xauthority mounts come from the X11 configuration, which needs more study; the network parameter also needs to be extended to the ingress overlay
Most importantly, the Vulkan image still can't run, with the same error: Cannot find a compatible Vulkan installable client driver (ICD)
generic-resource support discussion
moby issue 33439: add support for swarmkit generic resources
- how to advertise Generic Resources(republish generic resources)
- how to request Generic Resources
nvidia-docker issue 141: support for swarm mode in Docker 1.12
docker issue 5416: Add Generic Resources
Generic resources
Generic resources are a way to select the kind of nodes your task can land on.
In a swarm cluster, nodes can advertise Generic resources as discrete values or as named values such as SSD=3 or GPU=UID1, GPU=UID2.
The Generic resources setting on a service allows you to request a number of these Generic resources advertised by swarm nodes and have your tasks land on nodes with enough available resources to satisfy your request.
If you request Named Generic resource(s), the resources selected are exposed in your container through the use of environment variables. E.g: DOCKER_RESOURCE_GPU=UID1,UID2
You can only set the generic_resources resources’ reservations field.
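To illustrate the environment-variable mechanism, the sketch below shows how an entrypoint script might split DOCKER_RESOURCE_GPU into individual IDs. The helper name list_assigned_gpus is hypothetical, and the variable name assumes the resource kind is GPU, as in the example above.

```shell
#!/bin/sh
# Hypothetical helper: print one assigned GPU ID per line, based on the
# comma-separated DOCKER_RESOURCE_GPU variable (e.g. "UID1,UID2").
list_assigned_gpus() {
  # Nothing assigned: print nothing.
  [ -n "${DOCKER_RESOURCE_GPU:-}" ] || return 0
  printf '%s\n' "$DOCKER_RESOURCE_GPU" | tr ',' '\n'
}

# Example inside a container:
#   for id in $(list_assigned_gpus); do echo "assigned: $id"; done
```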
Stack Overflow: schedule a container with swarm using GPU memory as a constraint
$ docker node update --label-add <key>=<value> <node-id>
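A sketch of the label-based approach: label the node, then constrain the service to that label. The label key gpu_mem and value 8g are placeholders, node-1 is a hypothetical node name, and the DOCKER variable is an assumption that allows previewing the commands with DOCKER=echo. Placement constraints only support == and !=, so the label acts as an exact-match tag rather than a numeric threshold.

```shell
#!/bin/sh
# Hypothetical helpers; gpu_mem/8g and node-1 are placeholder values.

# Attach a label to a node. Usage: label_node <node> <key> <value>
label_node() {
  ${DOCKER:-docker} node update --label-add "$2=$3" "$1"
}

# Create a service restricted to nodes carrying that label.
# Usage: create_on_label <name> <key> <value> <image>
create_on_label() {
  ${DOCKER:-docker} service create --name "$1" \
    --constraint "node.labels.$2==$3" "$4"
}

# Example (on a manager node):
#   label_node node-1 gpu_mem 8g
#   create_on_label vkcc gpu_mem 8g nvidia/cudagl:9.0-base-ubuntu16.04
```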
SwarmKit
swarmkit also supports GenericResource; see the design doc
|
|
./bin/swarmctl service create --device /dev/nvidia-uvm --device /dev/nvidiactl --device /dev/nvidia0 --bind /var/lib/nvidia-docker/volumes/nvidia_driver/367.35:/usr/local/nvidia --image nvidia/digits:4.0 --name digits
swarmkit: add support for devices option
refer
manage swarm service with config
UpCloud: how to configure Docker swarm
Docker compose v3 to swarm cluster