Serious Autonomous Vehicles



where are you in the next 4 years (2)

Posted on 2019-12-08 |

background

joined the Zhongguancun self-driving car workshop today, a different level than PlusAI's tech show last weekend. there were a few government speakers, e.g. a Zhongguancun tech officer; sensor suppliers, e.g. SueStar, Zhongke Huiyan, HoloMatic; self-driving solution suppliers, e.g. Pony.ai, XianTong tech etc.; and media press and investors. interestingly, the investors didn't look impressed.

as mentioned last time, the champion among the ADS solution suppliers and most sensor startups will emerge in about 2 years, and governments are involved with policy-friendly support. it's just that the capital market and investors don't buy it at this moment.

XianTong

XianTong focuses on city road cleaning, which is a special niche, rather than passenger vehicles, package trucks, or small delivery robots.

They have some data on cleaning services in large cities, e.g. Beijing, Shanghai, Hangzhou, and in Europe.

  • current cleaning labor's harsh and dangerous working environment
  • current cleaning labor's limited working hours and benefit requirements

they mentioned the city cleaning market is about 300 billion in China, which looks promising, but what percentage of this market goes to cleaning vehicles was not discussed.

it's maybe about 20% ~ 60%, as a lot goes to human resources, city greenery needs etc., which eat a lot of the money; and the cleaning vehicle products that support ADS are probably an even smaller part of all the vehicles used in city cleaning services.

so the whole city cleaning service market sounds promising, but narrowed down to cleaning vehicles, and especially without a mature, in-market cleaning vehicle product, it's really difficult to dig gold from this market.

I have a feeling most startups have a similar gap: their vision is big, e.g. to help cities, companies, businesses, and end customers run/live more efficiently/enjoyably/profitably.

but the reality is not that friendly to them: they spend investors' money, which is expected to yield a profitable return in a short time. this may push the startups to draw big pictures far beyond their ability, or even far beyond the whole industry's ability.

and while they draw big pictures, they have very limited capacity to go deep into the market, to understand the customers, and to design products with original creativity.

creativity or application

for investors, these special industry-application-based startups, I think, may return at most 1 ~ 4 times the investment.

maybe it's a good idea to study the successful investment cases of the last 5 years. and I am afraid that's also a self-centered view, since most of those cases happened in high-tech, Internet-based startups.

because the current self-driving market, especially the startups in China that focus on ADS full-stack solutions, sensors, and services, has no game-changing player.

in history, the first companies to successfully produce PCs, smartphones, Internet search services, social network services, taxi-hailing services, restaurant booking services, food delivery services: they were game changers. and somehow they are the most talked about in public and among investors, and the most understandable to most knowledge consumers, e.g. engineers.

but are they the whole picture of the national economy? what about the local seafood restaurant, the local auto repair shop; or the company that produces pig feed, or customers' bikes; or the sales company that sells steel to Africa?

the economy is more plentiful and complex than a straight-minded knowledge consumer can imagine. for a long time, I didn't notice the local restaurant or the local gym, but they do have a higher money payback, and definitely higher social status, than a fixed-income engineer.

so don't try to find out the secret of each component of the whole economy and then derive the most optimized way to live a life. there is none; or rather, every way to live a life is the most optimized way.

so the CEOs of these startups are not crazy enough to imagine themselves as game changers like Steve Jobs; they know their limits.

surviving first

since they know their limitations, they are not nuts; they have just enough energy to find a way to survive, even if not in good shape.

that's the other part: as a naive young man, I always forgot the fact that surviving is not beautiful most of the time.

nature has taught the lesson: a young bird may kill its brothers and sisters so it can survive to grow into an adult; for the deer on the African plains, every minute is either run to survive or be eaten.

companies, and humans as individuals, are in the same situation. even though the government tries to make it a little less harsh, most of the time survival is difficult.

market share

to a young man, a 1000 billion market sounds like a piece of cake; to a small company working so hard to reach million-level sales, it sounds like a piece of shit.

and that's another naive idea about reality. like how I couldn't see the money coming back from a local seafood restaurant, and when I found out it makes more than a million every year, I was totally shocked.

so there is no big or small piece of cake when it comes to survival. most CEOs are not crazy, and they know clearly in their hearts that their company needs to survive and make a profitable income, and that's enough; changing the world is not most people's responsibility, but surviving is. however, in public these CEOs talk about their company or product as the game-changer, because that's what the investors want them to say.

so don't think a market is too small any more, as long as it can support a family to live a life.

dreams are not an option for most people, that's the final summary. but survival is a have-to.

life is evergreen

life is not only about surviving; otherwise, human society would be an animal world. thinking rationally always gives a doomed impression, and life feels blue; with a more perceptual mind, the day is sunny and joyful.

“the theory is grey, but the tree of life is evergreen”

development team in an OEM

as mentioned, there are currently plenty of specialized system-simulation verification suppliers, e.g. 51VR, Baidu, Tencent, Alibaba etc., and their software is definitely more mature than our team's. I am afraid that at this moment the leading teams in the OEM don't realize that building a simulation tool in-house, especially one that supports L3, is mission impossible. otherwise, the requirements for simulation should go to suppliers, rather than building the OEM's own simulation development team.

I still remember my first day joining this team; it sounded like we were really creating something amazing, and nobody else had more advantages than us. then gradually I realized we can't customize the Unity engine, we can't provide mature cloud support, we can't even implement the data pipeline for road test logging and analysis. most current team members' work stays at the requirements level, in a shadow mode. and actually, for most of these needs, a few external companies/teams do have better solutions.

there are a lot of existing issues, from software architecture to ADS algorithm restructuring, but this work is mostly not done by the OEM development team.

a second-tier development team can't make a top-class ADS product. as the new company spins off, this team is less likely to survive in the market, or the leaders will have to set up a new development team.

if AI is the direction, that's another huge blank for this team. I think going either to ADS software architecture or to AI is the better choice now.

running Vulkan in virtual display

Posted on 2019-12-04 |

install xserver-xorg-video-dummy

apt-cache search xserver-xorg-video-dummy 
apt-get update 
sudo apt-get install xserver-xorg-video-dummy

it depends on xorg-video-abi-20 and xserver-xorg-core, so xserver-xorg-core needs to be installed first. after updating xorg.conf to run the Xserver with the xserver-xorg-video-dummy driver and rebooting the machine, both keyboard and mouse stopped responding.

understand xorg.conf

usually, xorg.conf is no longer present in the system; in the most common case, Xorg configures the system devices by default. if an additional device needs configuring, run X -configure as root, which generates an xorg.conf.new file at /root.

there are two xorg.conf files: one generated by running X -configure, located at /root/xorg.conf.new; the other generated by nvidia-xconfig, which can be found at /etc/X11/xorg.conf.

the following list is from xorg.conf doc

  • ServerLayout section

it is at the highest level; it binds together the input and output devices that will be used in a session.

input devices are described in InputDevice sections; output devices usually consist of multiple independent components (GPU, monitor), which are defined in Screen sections. each Screen section binds together a graphics board (GPU) and a monitor.

GPUs are described in Device sections and monitors in Monitor sections.

  • FILES section

used to specify some path names required by the server.

e.g. ModulePath, FontPath ..

  • SERVERFLAGS section

used to specify global Xorg server options. all entries should be Options.

"AutoAddDevices", enabled by default.

  • MODULE section

used to specify which Xorg server (extension) modules should be loaded.

  • INPUTDEVICE section

Recent X servers employ HAL or udev backends for input device enumeration and input hotplugging. It is usually not necessary to provide InputDevice sections in the xorg.conf if hotplugging is in use (i.e. AutoAddDevices is enabled). If hotplugging is enabled, InputDevice sections using the mouse, kbd and vmmouse driver will be ignored.

Identifier and Driver are required in all InputDevice sections. Identifier specifies the unique name for this input device; Driver specifies the name of the driver.

An InputDevice section is considered active if it is referenced by an active ServerLayout section, if it is referenced by the −keyboard or −pointer command line options, or if it is selected implicitly as the core pointer or keyboard device in the absence of such explicit references. The most commonly used input drivers are evdev(4) on Linux systems, and kbd(4) and mousedrv(4) on other platforms.

a few driver-independent Options in InputDevice:

CorePointer and CoreKeyboard are the inverse of the option Floating; when Floating is enabled, the input device does not report events through any master device or control a cursor. the device is only available to clients using the X Input Extension API.

  • Device section

there must be at least one, for the video card (GPU) being used. Identifier and Driver are required in all Device sections.

  • Monitor Section
    there must be at least one, for the monitor being used. a default configuration will be created when one isn't specified. Identifier is the only mandatory entry.
  • Screen Section
    there must be at least one, for the "screen" being used; it represents the binding of a graphics device (Device section) and a monitor (Monitor section). A Screen section is considered "active" if it is referenced by an active ServerLayout section or by the −screen command line option. The Identifier and Device entries are mandatory.

debug keyboard/mouse not responding after X upgrade

  • log in to Ubuntu safe mode: F12 –> Esc (to display the GRUB2 menu), then enable networking –> root shell

  • run X -configure

one line says:

List of video drivers:  dummy, nvidia,  modesetting. 

uninstall xserver-xorg-video-dummy

I thought the dummy video driver was the key reason, so I uninstalled it, then reran the lines above and checked /var/log/Xorg.0.log:

[ 386.768] List of video drivers:
[ 386.768] nvidia
[ 386.768] modesetting
[ 386.860] (++) Using config file: "/root/xorg.conf.new"
[ 386.860] (==) Using system config directory "/usr/share/X11/xorg.conf.d"
[ 386.860] (==) ServerLayout "X.org Configured"
[ 386.860] (**) |-->Screen "Screen0" (0)
[ 386.860] (**) | |-->Monitor "Monitor0"
[ 386.861] (**) | |-->Device "Card0"
[ 386.861] (**) | |-->GPUDevice "Card0"
[ 386.861] (**) |-->Input Device "Mouse0"
[ 386.861] (**) |-->Input Device "Keyboard0"
[ 386.861] (==) Automatically adding devices
[ 386.861] (==) Automatically enabling devices
[ 386.861] (==) Automatically adding GPU devices
[ 386.861] (**) ModulePath set to "/usr/lib/xorg/modules"
[ 386.861] (WW) Hotplugging is on, devices using drivers 'kbd', 'mouse' or 'vmmouse' will be disabled.
[ 386.861] (WW) Disabling Mouse0
[ 386.861] (WW) Disabling Keyboard0
Xorg detected your mouse at device /dev/input/mice.
Please check your config if the mouse is still not
operational, as by default Xorg tries to autodetect
the protocol.

there is a warning: (WW) Hotplugging is on, devices using drivers 'kbd', 'mouse' or 'vmmouse' will be disabled

disable Hotplugging

first generate /root/xorg.conf.new with X -configure and copy it to /etc/X11/xorg.conf. then add the following section in /etc/X11/xorg.conf, which disables hotplugging:

Section "ServerFlags"
Option "AllowEmptyInput" "True"
Option "AutoAddDevices" "False"
EndSection

however, it reports:

(EE) Failed to load module "evdev" (module does not exist, 0)
(EE) NVIDIA(0): Failed to initialize the GLX module; please check in your X
(EE) NVIDIA(0): log file that the GLX module has been loaded in your X
(EE) NVIDIA(0): server, and that the module is the NVIDIA GLX module. If
(EE) NVIDIA(0): you continue to encounter problems, Please try
(EE) NVIDIA(0): reinstalling the NVIDIA driver.
(EE) Failed to load module "mouse" (module does not exist, 0)
(EE) No input driver matching `mouse'

switch to nvidia xorg.conf

which reports:

(EE) Failed to load module "mouse" (module does not exist, 0)
(EE) No input driver matching `mouse'
(EE) Failed to load module "evdev" (module does not exist, 0)
(EE) No input driver matching `evdev'

this fixes the Nvidia issue, but still can't fix the input device and driver issue.

switch to evdev driver

as mentioned previously, evdev is the default driver on Linux and is loaded by the Xserver by default. so try setting both the Keyboard and Mouse drivers to evdev,

which reports:

(EE) No input driver matching `kbd'
(EE) Failed to load module "kbd" (module does not exist, 0)
(EE) No input driver matching `mouse'
(EE) Failed to load module "mouse" (module does not exist, 0)

looks like it's a driver problem; even the default driver is missing. I tried copying the master node's /usr/lib/xorg/modules/input/ to the worker node; then it reports:

(EE) module ABI major version (24) doesn't match the server's version (22)
(EE) Failed to load module "evdev" (module requirement mismatch, 0)

which can be fixed by adding Option "IgnoreABI" "True" in the ServerFlags section.

delete customized keyboard and mouse

if hotplugging is enabled, X will auto-detect the devices. I'd try:

Section "ServerLayout"
Identifier "Layout0"
Screen 0 "Screen0" 0 0
EndSection
Section "Monitor"
Identifier "Monitor0"
VendorName "Unknown"
ModelName "Unknown"
HorizSync 28.0 - 33.0
VertRefresh 43.0 - 72.0
Option "DPMS"
EndSection
Section "Device"
Identifier "Device0"
Driver "nvidia"
VendorName "NVIDIA Corporation"
EndSection
Section "Screen"
Identifier "Screen0"
Device "Device0"
Monitor "Monitor0"
DefaultDepth 24
SubSection "Display"
Depth 24
EndSubSection
EndSection
Section "ServerFlags"
Option "AllowEmptyInput" "True"
Option "IgnoreABI" "True"
EndSection

which reports:

(II) No input driver specified, ignoring this device.
(II) This device may have been added with another device file.
(II) config/udev: Adding input device Lenovo Precision USB Mouse (/dev/input/mouse0

there are no ERRORs any more, but it looks like the default input driver (evdev?) can't be found …

reinstall xorg

Mouse and keyboard can be driven by evdev or by the mouse/keyboard drivers respectively. Xorg will load only evdev automatically; to use the mouse and/or keyboard driver instead of evdev, they must be loaded in xorg.conf. There is no need to generate xorg.conf unless you want to fine-tune your setup or need to customize keyboard layout or mouse/touchpad functionality.

  • first configure a new network interface for the worker node:

configure DHCP network connection

settings in /etc/network/interfaces:

auto enp0s25
iface enp0s25 inet dhcp

ifconfig enp0s25 down
ifconfig enp0s25 up

  • then reinstall xorg:
sudo apt-get update 
sudo apt-get upgrade
sudo apt-get install xserver-xorg-core  xserver-xorg  xorg

this installs xserver-xorg-input-all, xserver-xorg-input-evdev, xserver-xorg-input-wacom, xserver-xorg-input-vmmouse, xserver-xorg-input-synaptics; these are exactly the missing parts (input devices and drivers). it looks like when uninstalling video-dummy, these modules were deleted by accident.

  • reboot, and both keyboard and mouse work!

  • “sudo startx” through ssh

now the user password doesn't work at the normal login, but when logging in via ssh from another machine, the password verifies fine. this can be fixed by logging in via ssh from a remote host first, then running sudo startx, which brings the user-password verification back.

virtual display

xdummy

xdummy: xorg.conf
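for reference, an Xdummy-style xorg.conf looks roughly like the sketch below; the VideoRam value and the Modeline (a standard CVT 1920x1080@60 mode) are typical example values, not requirements:

```
Section "Device"
    Identifier "DummyDevice"
    Driver "dummy"
    VideoRam 256000
EndSection

Section "Monitor"
    Identifier "DummyMonitor"
    HorizSync 28.0-80.0
    VertRefresh 48.0-75.0
    Modeline "1920x1080" 173.00 1920 2048 2248 2576 1080 1083 1088 1120 -hsync +vsync
EndSection

Section "Screen"
    Identifier "DummyScreen"
    Device "DummyDevice"
    Monitor "DummyMonitor"
    DefaultDepth 24
    SubSection "Display"
        Depth 24
        Modes "1920x1080"
    EndSubSection
EndSection
```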

run: Xorg -noreset +extension GLX +extension RANDR +extension RENDER -logfile ./10.log -config ./xorg.conf :10

  • test with glxgears/OpenGL: works

1) DISPLAY=localhost:10.0 works

2) DISPLAY=:0 works, but you can't see it, because the worker host is on a virtual display

  • test with vkcube/Vulkan: failed

in summary, the virtual display can support OpenGL, but doesn't support Vulkan yet. the Unity Simulation cloud SDK is the vendor's solution, but it's licensed.

refer

sample xorg.conf for dummy device

Keyboard and mouse not responding at reboot after xorg.conf update

how to Xconfigure

no input drivers loading in X

recent thoughts in ADS

Posted on 2019-11-29 |

the following are some pieces of ideas from discussions and online resources.

system engineering in practice

the following idea comes from an expert in system engineering.

originally, system engineering, or model-based design, came from aerospace and defense departments. the features of these products:

1) they are the most complex systems

2) they are sponsored by the government

3) they are unique, with no competitors

which means they don't need to worry about money and time, so to guarantee the product finally works, they can design from top to bottom over a long time.

the descending order of requirements level is:

aerospace, defense product >> vehicle-level product >> industry-level product >> consumer-level product

usually the technologies used at the top level gradually trickle down into the next lower level over the years, e.g. GPS, the Internet, autonomy etc. at the same time, the methodologies from the top level flow to the lower levels as well.

I suppose that's why system engineering design came to the vehicle industry. however, does it really work in this competitive industry? I got my first experience when running scenario tests in SIL simulation. the system engineering team defined the test cases/scenarios, e.g. 400 test scenarios; on the other hand, the vehicle test team does road tests a few times every week. the result: most of the time the 400 test scenarios never catch a system failure, but most road-test failure scenarios can be reproduced in the simulation env.

system-engineering-based design doesn't fit well, for a lot of reasons. first, traditionally the design lifetime of a new vehicle model is about 3~5 years, and startup EV companies recently have even shorter design life cycles, about 1~2 years. so a top-down design at the early stage, covering every aspect of a new model, makes almost no sense. in the V development model, most fresh engineers think the top-down design is done once and for all; the reality is that most early-stage system engineering design needs to be restructured.

secondly, system engineering design is usually abstract and far removed from engineering considerations, as the system engineers mostly don't have engineering experience in most sections of the system. this results in system-engineering-based requirements that are not testable and can't be measured during implementation.

there are a few suggestions for setting up a workable system engineering process:

the system engineering team should sit with the development and test teams; they should communicate a lot and balance the high-level requirements against testable, measurable, implementable requirements. basically, system engineering design should take product/component developers as input.

both the system engineers and developers should understand that the whole V model, including system requirements and component requirements, is iterable.

focus on the special requirement, and don't always start from the top; each special requirement is like a point, and all these existing points (already-finished requirements) will merge into the whole picture eventually.

for example, during road tests a new requirement may come up for an HMI visualization; then focus on this HMI requirement, because it may not exist in the top-down design, but it is the actual need.

system test and verification CI/CD

as most OEMs have said they will mass-produce L3 ADS around 2022, it is time to jump into ADS system test and verification. I just learned that Nvidia has a full-stack hardware line: the in-car AI chips (e.g. Xavier), the AI training workstations (e.g. DGX), and the ADS system verification platform (e.g. the Constellation box).

data needs

the ADS development depends a lot on data infrastructure:

data collect --> data storage --> data analysis

there are many small pieces as well, e.g. data cleaning, labeling, training, mining, visualization etc.

different dev stages and teams have different focuses.

  • road test/sensor team: needs a lot of online vehicle status/sensor data checks, data logging, visualization (dev HMI), as well as offline data analysis and storage

  • perception team: needs a lot of raw image/radar data for training and mining, as well as for querying and storage.

  • planning/control team: needs high-quality data to test algorithms, as well as a well-structured in-car computer.

  • HMI team: focuses on friendly data display

  • fleet operation team: needs to think about how to transfer data between the cloud, vehicles, OEM data centers etc.

sooner or later, building the data pipeline is a have-to choice.

data collection vendors

road test data collection equipment used in ADS development is actually not a very big market compared to in-car computers, but there are already a few vendors:

  • the top chip OEMs, e.g. Nvidia, Intel, have these products.
  • chip proxies, e.g. Inspire
  • traditional vehicle test vendors, e.g. dSPACE, Vector, PreScan
  • startups, e.g. Horizon

Nvidia Constellation

ADS system testing usually includes simulation tests and road tests. road testing is also called vehicle-in-loop, which is highly expensive and not easy to repeat; then there is hardware-in-loop (HIL) testing, basically including only the domain controller/ECU in the test loop; finally there is software-in-loop (SIL) testing, which is the most controllable but also not that reliable.

in practice, it's not easy to build a closed-loop verification (CI/CD) process from SIL to HIL to road tests. but once CI/CD is set up, the whole team can become data/simulation/test driven.

the difficult and hidden part is the supporting toolchain development. most vehicle test vendors have their own full-stack toolchain solutions, but most of them are too eco-customized; it's really difficult for an ADS team, especially an OEM, to follow a single vendor's solution.

another reason: test vehicles include components from different vendors, e.g. cameras from Sony, radar from Bosch, Lidar from a Chinese startup, logging equipment from dSPACE, and ECUs from Conti, which makes it difficult to fit this mixed system into a Vector verification platform.

Nvidia Constellation is trying to close the gap from SIL and HIL to road tests, as it can support most customized ECUs.

  • from road test to HIL, it uses exactly the same chip.

  • for road test resimulation, Nvidia offers a sim env, and road test logs can be fed in directly

the ability to resimulate road tests is a big step; the input is directly scanned images/point clouds. even lgsvl and Carla have no such direct support, but resimulation is really useful for CI/CD. Nvidia Constellation, as they say, is the solution from captured data to KPIs.

another big thing is their high-level scenario description language (HLSDL), which I think is more abstract than OpenSCENARIO. the HLSDL engine uses hyper-parameters, SOTIF-embedded scenario ideas, and an optimized scenario generator, which should be massive, random, and KPI-significant; it should be a good scenario engine if it has these features.

Bosch VMS

the vehicle management system (VMS) is a cloud-native framework from Bosch, used to meet similar requirements to Nvidia's solution: bringing a closed loop (CI/CD) from road test data collection and data analysis to fleet management. they have a few applications based on VMS:

  • fleet battery management (FBM)

for a single vehicle's diagnostics and prediction; and in the EV market, FBM can be used as certification for second-hand EV dealers

  • road coefficient system (RCS)

Bosch has both an in-vehicle data collection box and a cloud server; RCS will be treated as an additional sensor for ADS in production

  • VMS itself

Bosch would like to think of VMS as the PLM for ADS, from design and test to deployment. and it should be easy to integrate many dev tools, e.g. labeling, simulation etc.

what about safety

as mentioned previously, 80% of the Tesla FSD chip handles AI computing; Nvidia Xavier is about 50% GPU; Mobileye has very limited support for AI. so Tesla is the most AI-aggressive, then Nvidia, and Mobileye is the most conservative, which makes OEMs see Mobileye's solution as safer. but AI performs better in perception, so how to balance these two ways?

I realized the greatness of Mobileye's new concept: responsibility-sensitive safety (RSS). RSS can be used as the ADS safety boundary, while inside it either AI or CV provides the horsepower. there is a lot of research on mixing traditional algorithms with AI algorithms; RSS sounds like a good solution. it would be nice to build a general RSS-mixing-AI (RMA) framework.
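the RSS safety boundary is concrete enough to sketch: its minimum safe longitudinal distance formula (from Mobileye's RSS paper) fits in a few lines of Python. the parameter defaults below are illustrative assumptions, not calibrated values:

```python
def rss_min_gap(v_rear, v_front, rho=0.5, a_max=3.0, b_min=4.0, b_max=8.0):
    """RSS minimum safe longitudinal gap [m] for same-direction traffic.

    rho:   response time of the rear vehicle [s]
    a_max: rear vehicle's worst-case acceleration during the response time [m/s^2]
    b_min: rear vehicle's guaranteed (minimum) braking [m/s^2]
    b_max: front vehicle's maximum braking [m/s^2]
    """
    v_resp = v_rear + rho * a_max  # rear speed at the end of the response time
    gap = (v_rear * rho
           + 0.5 * a_max * rho ** 2
           + v_resp ** 2 / (2 * b_min)
           - v_front ** 2 / (2 * b_max))
    return max(0.0, gap)

# two vehicles at 20 m/s (72 km/h): the rear one must keep roughly a 43 m gap
print(round(rss_min_gap(20.0, 20.0), 2))
```

as long as the actual gap stays above this bound, the rear vehicle is never the responsible party in a rear-end collision, whatever algorithm (AI or CV) drives inside the boundary.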

X11 GUI in docker

Posted on 2019-11-29 |

Xorg

Xorg is a client-server architecture, including the X protocol, Xclient, and Xserver. Linux itself has no graphical interface; all GUI apps on Linux are based on the X protocol.

the Xserver manages the display device, e.g. the monitor; the Xserver is responsible for displaying, and sends device input (e.g. keyboard clicks) to the Xclient.

the Xclient, or X app, includes the graphics libs, e.g. OpenGL, Vulkan etc.

xauthority

The Xauthority file can be found in each user's home directory and is used to store credentials in cookies used by xauth for authentication of X sessions. Once an X session is started, the cookie is used to authenticate connections to that specific display. You can find more info on X authentication and X authority in the xauth man pages (type man xauth in a terminal). if you are not the owner of this file you can't log in, since you can't store your credentials there.

when Xorg starts, the .Xauthority file is sent to Xorg; review this file with xauth -f ~/.Xauthority

ubuntu@ubuntu:~$ xauth -f ~/.Xauthority
Using authority file /home/wubantu/.Xauthority
xauth> list
ubuntu/unix:1 MIT-MAGIC-COOKIE-1 ee227cb9465ac073a072b9d263b4954e
ubuntu/unix:0 MIT-MAGIC-COOKIE-1 71cdd2303de2ef9cf7abc91714bbb417
ubuntu/unix:10 MIT-MAGIC-COOKIE-1 7541848bd4e0ce920277cb0bb2842828
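each MIT-MAGIC-COOKIE-1 entry above is just a random 128-bit value printed as 32 hex characters; generating one (purely for illustration) could look like:

```python
import secrets

# a fresh 128-bit cookie, the same shape as the values shown by `xauth list`
cookie = secrets.token_hex(16)
print(len(cookie))  # 32 hex characters
```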

the Xserver is the host that will display/render the graphics, and the other host is the Xclient. if the Xclient is on a remote host, then $DISPLAY needs to be configured there. to display X11 on a remote Xserver, copy the .Xauthority from the Xserver to the Xclient machine, and export $DISPLAY and $XAUTHORITY:

export DISPLAY={Display number stored in the Xauthority file}
export XAUTHORITY={the file path of .Xauthority}

xhost

xhost is used to grant access to the Xserver (on your local host). by default, the local client can access the local Xserver, but any remote client needs to be granted access first through xhost. for example, when ssh-ing from hostA to hostB and running glxgears in that ssh shell, for graphics/GPU resources hostA is used to display, so hostA is the Xserver.

x11 forwarding

when the Xserver and Xclient are on the same host machine, it's no big deal. but the Xserver and Xclient can be on different machines, with X protocol communication between them. this is how ssh -X helps to run the app on the Xclient and display on the Xserver, which needs X11 forwarding.


test benchmark

ssh 192.16.0.13
xeyes

/tmp/.X11-unix

the X11 (Xorg) server communicates with clients via some kind of reliable byte stream.

A Unix-domain socket is like the more familiar TCP ones, except that instead of connecting to an address and port, you connect to a path. You use an actual file (a socket file) to connect.

srwxrwxrwx 1 root root 0 Nov 26 08:49 X0

note the s in front of the permissions, which means it's a socket. If you have multiple X servers running, you'll have more than one file there.

/tmp/.X11-unix is where the X server puts its listening AF_UNIX sockets.
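to make the "connect to a path" idea concrete, here is a minimal Unix-domain socket round trip in Python (the socket path is a temporary file, not a real X11 socket):

```python
import os, socket, tempfile, threading

# Like TCP, except the address is a filesystem path
# (the X server listens on paths like /tmp/.X11-unix/X0).
path = os.path.join(tempfile.mkdtemp(), "demo.sock")

server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(path)        # creates the socket file at `path`
server.listen(1)

def serve_once():
    conn, _ = server.accept()
    conn.sendall(b"hello")
    conn.close()

t = threading.Thread(target=serve_once)
t.start()

client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
client.connect(path)     # "dial" the path instead of host:port
reply = client.recv(5)
t.join()
print(reply)             # b'hello'
```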

DISPLAY device

DISPLAY format: hostname:displaynumber.screennumber

hostname is the hostname or IP address of the Xserver

displaynumber starts from 0
screennumber starts from 0

when using TCP (the x11-unix protocol only works when the Xclient and Xserver are on the same machine), displaynumber is the connection port number minus 6000; so if displaynumber is 0, the port is 6000. DISPLAY refers to a display device, and all graphics will be displayed on this device.
by default, the Xserver on localhost doesn't listen on a TCP port. run sudo netstat -lnp | grep "6010": nothing is returned. how to configure the Xserver to listen on TCP:
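the displaynumber-to-port rule above can be sketched as a small helper (a hypothetical function, just to illustrate the arithmetic):

```python
def display_port(display: str) -> int:
    """TCP port for an X DISPLAY string, per the rule: port = 6000 + displaynumber.
    e.g. ':0' -> 6000, 'localhost:10.0' -> 6010."""
    after_host = display.split(":", 1)[1]   # drop the hostname part
    number = after_host.split(".", 1)[0]    # drop the screen number part
    return 6000 + int(number)

print(display_port(":0"))              # 6000
print(display_port("localhost:10.0"))  # 6010
```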

Add DisallowTCP=false under directive [security] in /etc/gdm3/custom.conf file. Now open file /etc/X11/xinit/xserverrc and change exec /usr/bin/X -nolisten tcp to exec /usr/bin/X11/X -listen tcp. Then restart GDM with command sudo systemctl restart gdm3. To verify the status of listen at port 6000, issue command ss -ta | grep -F 6000. Assume that $DISPLAY value is :0.

virtual DISPLAY device

creating a virtual display/monitor
add fake display when no Monitor is plugged in

Xserver broadcast

the idea is to stand at one manager (Xserver) machine and send commands to a bunch of worker (Xclient) machines. the default way is that all Xclients talk to the Xserver, which eats too many GPU and network bandwidth resources on the manager node. so it's better that each worker node does the display on its own; and if there is no monitor on these worker nodes, they can survive with a virtual display.

xvfb

xvfb is the virtual Xserver solution, but it doesn't run well (need to check more)

nvidia-xconfig

configure the X server to work headless, with or without a monitor connected

unity headless

env setup

to test with docker, vulkan, and ssh, the following packages are usually needed:

vulkan dev env

sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt upgrade
apt-get install libvulkan1 vulkan vulkan-utils 
sudo apt install vulkan-sdk 

nvidia env

install nvidia-driver, nvidia-container-runtime
install mesa-utils  #glxgears

docker env

install docker 

run glxgears/vkcube/lgsvl in docker through an ssh tunnel

there is a very nice blog, Docker x11 client via SSH, which discusses the arguments passed to the following samples

run glxgears

glxgears is an OpenGL benchmark test.

ssh -X -v abc@192.168.0.13
sudo docker run --runtime=nvidia -ti --rm -e DISPLAY -v /tmp/.X11-unix:/tmp/.X11-unix -v "$HOME/.Xauthority:/root/.Xauthority" --net=host 192.168.0.10:5000/glxgears

if setting $DISPLAY=localhost:10.0, then the gears will display on the manager node (ubuntu)

if setting $DISPLAY=:0, then the gears will display on the worker node (worker)

and it works without mounting /tmp/.X11-unix as well.

run vkcube

vkcube is a Vulkan benchmark test.

ssh -X -v abc@192.168.0.13
export DISPLAY=:0
sudo docker run --runtime=nvidia -ti --rm -e DISPLAY -v /tmp/.X11-unix:/tmp/.X11-unix -v "$HOME/.Xauthority:/root/.Xauthority" --net=host 192.168.0.10:5000/vkcube

this way, vkcube is displayed on the worker node (namely, using the worker’s GPU resources), and the manager node carries no burden at all.

if $DISPLAY=localhost:10.0, running vkcube gives errors:

No protocol specified
Cannot find a compatible Vulkan installable client driver (ICD).
Exiting ...

it looks like Vulkan has a limitation with X forwarding.
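a quick way to check whether the Vulkan ICD is reachable in the environment where vkcube runs (paths as shipped by the NVIDIA driver packages; vulkan-utils assumed installed):

```shell
# the NVIDIA ICD json must exist and point at a loadable driver library
ls /usr/share/vulkan/icd.d/
# vulkaninfo fails with the same ICD error when the driver is not reachable
vulkaninfo | head -n 20
```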

run lgsvl

export DISPLAY=:0
sudo docker run --runtime=nvidia -ti --rm -p 8080:8080 -e DISPLAY -v /tmp/.X11-unix:/tmp/.X11-unix -v "$HOME/.Xauthority:/root/.Xauthority" --net=host 192.168.0.10:5000/lgsvl /bin/bash
./simulator

works well! the good news: if we take the manager node as the end user’s monitor, and all worker nodes sit in the cloud without displays, then these parameters can be used in docker service create to host in the swarm. so the next step is a virtual display for each worker node.

refer

wiki: Xorg/Xserver
IBM study: Xwindows
cnblogs: run GUI in remote server
xorg.conf in ubuntu
configure Xauthority
X11 forwarding of a GUI app running in docker
cnblogs: Linux DISPLAY skills
nvidia-runtime-container feature: Vulkan support

what about Tesla

Posted on 2019-11-26 |

Tesla is ROCKING the industry. OTA, camera only, fleet learning, shadow mode, Autopilot, Giga Factory, Cybertruck etc. There is a saying: “看不到,看不懂,追不上(can’t see it; can’t understand it; can’t catch up with it)”. I have to say most Tesla news is more exciting than news from traditional OEMs. and my best wishes to Tesla to grow greater.

Tesla timeline

2019

  • release Cybertruck
    image

  • Tesla software V10.0 OTA:

    • Smart Summon(enable vehicle to navigate a parking lot and come to them or their destination of choice, as long as their car is within their line of sight)
    • Driving Visualization(HMI)
      • Automatic lane change
      • lane departure avoidance: Autopilot will warn the driver and slow down the vehicle
    • emergency lane departure avoidance: Autopilot will steer the vehicle back into the driving lane if the lane departure may lead to a collision
  • Model 3 safety reward from IIHS and Euro NCAP
  • Tesla Insurance
  • Megapack: battery storage for massive usage
  • V3 super charging station: more powerful station and pre-heating battery
  • Powerpack: energy storage system in South Australia
  • Model 3 release (March) and the most customer satisfied vehicles in China
  • cut 7% of employees globally (Jan)

2018

  • race mode Model 3
  • Model 3 got the lowest probability of injury by NHTSA: what makes Model 3 safe
  • Tesla V9.0 OTA:
    • road status info
    • climate control
    • Navigate on Autopilot
    • Autosteer and Auto lane change combination
    • blindspot warning (when the turn signal is engaged but a vehicle or obstacle is detected in the target lane)
    • use high occupancy vehicle(HOV) lane
    • obstacle aware acceleration(if obstacle detected, acc is automatically limited)
    • Dashcam (record and store video footage)
  • Tesla privatization (Aug)

2017

  • Supercharger stations reach 10,000 globally
  • collaboration with Panasonic to produce batteries at Buffalo, 1Mpw

2016

  • purchase SolarCity
  • purchase Grohmann Engineering (Germany): highly automated manufacturing
  • mass production of Tesla vehicles with hardware to support full self driving (Oct)
    image

    • 8 cameras to support a 360° view, with environment detection up to 250 meters in front
    • 12 Ultrasonic
    • front Radar
    • ADS HAD (40x more powerful than previous)
    • ADS algorithm: deep learning network combined with vision, radar and ultrasonic (AEB, collision warning, lane keeping, ACC not included yet)
  • Autopilot 8.0 OTA, or namely “see” the world through Radar

image

  • Tesla’s Master Plan, Part 2

  • deadly accident crossing a truck (2016.7)

  • HEPA defense “biological weapon”
  • accept reservation for Model 3(march)
  • Tesla 7.1.1 OTA: remote summon
  • Tesla 7.1 OTA:
    • vertical parking
    • gentle speed in residential areas
    • highway ACC, traffic jam following
    • more road info in HMI, e.g. truck, bus, motorbike

image

2015

  • Autopilot 7.0 update:
    • Autopark(requires driver in the car and only parallel parking)
    • Autosteer
    • Auto lane changing
    • UI refresh
    • Automatic emergency steering
    • side collision warning

Autopilot evolution

in a nutshell, Autopilot is dynamic cruise control (ACC) + Autosteer + auto lane change.

Autopilot 7.0 relied primarily on the front-facing camera. radar wasn’t used as a primary sensor in 7.0 due to false positives (wrong detections). but in 8.0, with fleet learning, radar became the main sensor input.

almost entirely eliminate the false positive – the false braking events – and enable the car to initiate braking no matter what the object is as long as it is not large and fluffy

but anything large, metallic or dense, the radar system is able to detect and initiate a braking event, both when Autopilot is active and when it is not (then AEB).

even if the vision system doesn’t recognize the object, it actually doesn’t matter what the object is (while vision does need to know what the thing is); radar just knows there is something dense.

fleet learning will mark the geolocation of where all the false alarms occur, and the shape of the object there. so the Tesla system knows: at a particular position on a particular street or highway, if you see a radar object of the following shape, don’t worry, it’s just a road sign or bridge or a Christmas decoration. basically marking these locations as a list of exceptions.

the radar system can track 2 cars/obstacles ahead and improve the cut-in, cut-out response, e.g. in case the car in front suddenly swerves out of the way of an obstacle.

the limit of the hardware is being reached, but there is still quite some improvement to come as the software and data improve.

but still, perfect safety is really an impossible goal; it’s really about improving the probability of safety.

in Autopilot 9.0, Navigate on Autopilot(Beta) intelligently suggests lane changes to keep you on your route in addition to making adjustments so you don’t get stuck behind slow cars or trucks. Navigate on Autopilot will also automatically steer toward and take the correct highway interchanges and exits based on your destination.

Autopilot keeps evolving with more exciting features:

  • traffic light and stop signs detection
  • enhanced summon
  • navigate multi-story parking lots
  • automatically send off the vehicle to park
  • Autopilot on city streets
  • Robotaxi service

Tesla Hardware

Hardware 1.0

or Autopilot 1 or AP1, it was a joint development between Mobileye and Tesla. It featured a single front-facing camera and radar to sense the environment, plus Mobileye’s hardware and software to control the driving experience. AP1 was so good that when Tesla decided to build their own system, it took them years to catch up to the baseline Autopilot functionality in AP1. Mobileye EyeQ3 is good at marking/labeling free space, intuitive routing, obstacle avoidance, and traffic signal recognition etc. but it has a few limitations with ambient light, and reconstructing the 3D world from 2D images doesn’t work as expected all the time. also, EyeQ3 detects objects with traditional algorithms, not cool!

  • AP1 Hardware Suite:

    • Front camera (single monochrome)
    • Front radar with range of 525 feet / 160 meters
    • 12 ultrasonic sensors with 16 ft range / 5 meters
    • Rear camera for driver only (not used in Autopilot)
    • Mobileye EyeQ3 computing platform
  • AP1 Core features:

    • Traffic-Aware Cruise Control (TACC), start & stop
    • Autosteer (closed-access roads, like freeways)
    • Auto Lane Change (driver initiated)
    • Auto Park
    • Summon

Hardware 2.0

AP2 highlights machine learning/neural networks with camera inputs, together with more sensors and a more powerful computing platform.

image

  • AP2 Hardware Suite:

    • Front cameras (3 cameras, medium, narrow and wide angle)
    • Side cameras (4 total, 2 forward and 2 rear-facing, on each side)
    • Rear camera (1 rear-facing)
    • Front radar with range of 525 feet / 160 meters
    • 12 ultrasonic sensors with 26 ft range / 8 meters
    • NVIDIA DRIVE PX 2 AI computing platform
  • AP2 Core features:

    • Traffic-Aware Cruise Control (TACC), start & stop
    • Autosteer (closed-access roads, like freeways)
    • Auto Lane Change (driver initiated)
    • Navigate on Autopilot (on-ramp to off-ramp)
    • Auto Park
    • Summon

    there was an AP2.5 update, with a redundant NVIDIA DRIVE PX2 and a forward radar with longer range (170m)

Hardware 3.0

or Full Self Driving(FSD) Computer,

image

Tesla guys

Sterling Anderson from 2015 - 2016, director of Autopilot program.

Chris Lattner in early 2017, VP for Autopilot software

Jim Keller, from 2016 to 2017, VP for Autopilot hardware

David Nister, from 2015 to 2017, VP for Autopilot

Stuart Bowers from 2018 -2019, VP for Autopilot

Pete Bannon, from 2016 to now, Director for Autopilot hardware

Andrej Karpathy, from 2017 to now, Director of AI

Tesla in media

TeslaRati

Tesla official

Telsa motor club

Autopilot review

zhihu: Tesla Autopilot history

2017 Mercedes-Benz E vs 2017 Tesla Model S

Tesla’s Autopilot 8.0: why Elon Musk says perfect safety is still impossible

Transcript: Elon Musk’s press conference about Tesla Autopilot under v8.0 update

Tesla reveals all the details of its autopilot and its software v7.0

Software update 2018.39

Tesla V10: first look at release notes and features

Tesla Autopilot’s stop sign, traffic light recognition and response is operating in shadow mode

Tesla’s full self-driving suite with enhanced summon

Tesla’s Robotaxi service will be an inevitable player in the AV taxi race

Tesla Autopilot AP1 vs AP2 vs AP3

Tesla Hardware 3 Detailed

Future Tesla Autopilot update coming soon

Autopilot and full self driving capability features

multi view Tesla FSD chips

zhihu: EyeQ5 vs Xavier vs FSD

deploy lgsvl in docker swarm-2

Posted on 2019-11-21 |

background

previously I tried to deploy lgsvl with docker compose v3, which at first sounded promising, but due to the lack of runtime support it doesn’t work anyway. docker service create --generic-resource is another choice.

docker service options

docker service support a few common options

--workdir is the working directory inside the container 

--args is used to update the command the service runs 

--publish <Published-Port>:<Service-Port>

--network

--mount 

--mode 

--env 

--config
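putting a few of these options together in one command (service name, image and values here are made up for illustration):

```shell
docker service create \
  --name demo \
  --mode replicated --replicas 2 \
  --publish 8080:80 \
  --env MODE=test \
  --mount type=bind,src=/tmp,dst=/data \
  --workdir /data \
  nginx:alpine
```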

docker service create with generic-resource

generic-resource

creating services that request generic resources is well supported:

$ docker service create --name cuda \
--generic-resource "NVIDIA-GPU=2" \
--generic-resource "SSD=1" \
nvidia/cuda

tips: actually the keyword NVIDIA-GPU is not the real tag. generic_resources is also supported in docker compose v3.5:

generic_resources:
  - discrete_resource_spec:
      kind: 'gpu'
      value: 2

--generic-resource has the ability to access GPU in service, a few blog topics:

  • GPU Orchestration Using Docker

  • access gpus from swarm service

first try

follow accessing GPUs from swarm service: install nvidia-container-runtime and docker-compose, and run the following script:

export GPU_ID=`nvidia-smi -a | grep UUID | awk '{print substr($4,0,12)}'`
sudo mkdir -p /etc/systemd/system/docker.service.d
cat <<EOF | sudo tee --append /etc/systemd/system/docker.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// --default-runtime=nvidia --node-generic-resource gpu=${GPU_ID}
EOF
# uncomment the swarm-resource line in the nvidia runtime config
sudo sed -i 's/^#swarm-resource/swarm-resource/' /etc/nvidia-container-runtime/config.toml
sudo systemctl daemon-reload
sudo systemctl restart docker

to understand the supported dockerd options, check here; then run the test as:

docker service create --name vkcc --generic-resource "gpu=0" --constraint 'node.role==manager' nvidia/cudagl:9.0-base-ubuntu16.04

docker service create --name vkcc --generic-resource "gpu=0" --env DISPLAY=unix:$DISPLAY --mount src="X11-unix",dst="/tmp/.X11-unix" --constraint 'node.role==manager' vkcube 

which gives the errors:

1/1: no suitable node (1 node not available for new tasks; insufficient resourc… 
1/1: no suitable node (insufficient resources on 2 nodes) 

if run as below, where GPU-9b5113ed is the physical GPU ID on the node:

docker service create --name vkcc --generic-resource "gpu=GPU-9b5113ed" nvidia/cudagl:9.0-base-ubuntu16.04

which gives the error:

invalid generic-resource request `gpu=GPU-9b5113ed`, Named Generic Resources is not supported for service create or update

these errors are because the swarm cluster can’t recognize this GPU resource, which is configured in /etc/nvidia-container-runtime/config.toml

second try

as mentioned in GPU orchestration using Docker, another change can be made:

ExecStart=/usr/bin/dockerd -H unix:///var/run/docker.sock --default-runtime=nvidia --node-generic-resource gpu=${GPU_ID}

which fixes the no suitable node issue, but starting the container then fails with an OCI error:

root@ubuntu:~# docker service ps vkcc
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
orhcaxyujece vkcc.1 nvidia/cudagl:9.0-base-ubuntu16.04 ubuntu Ready Ready 3 seconds ago
e001nd557ka6 \_ vkcc.1 nvidia/cudagl:9.0-base-ubuntu16.04 ubuntu Shutdown Failed 3 seconds ago "starting container failed: OC…"

check daemon log with sudo journalctl -fu docker.service, which gives:

Nov 21 13:07:12 ubuntu dockerd[1372]: time="2019-11-21T13:07:12.089005034+08:00" level=error msg="fatal task error" error="starting container failed: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v1.linux/moby/9eee7ac30a376ee8f59704f7687455bfb163e5ea3dd6d09d24fbd69ca2dfaa4e/log.json: no such file or directory): nvidia-container-runtime did not terminate sucessfully: unknown" module=node/agent/taskmanager node.id=emzw1f9293rwdk97ki7gfqq1q service.id=qdma7vr1g519lz9hx2y1fen9o task.id=ex1l4wy61kvughns5uzo6qgxy

third try

following issue #141

nvidia-smi -a | grep UUID | awk '{print "--node-generic-resource gpu="substr($4,0,12)}' | paste -d' ' -s
sudo systemctl edit docker
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// --default-runtime=nvidia <resource output from the above>

and run:

docker service create --name vkcc --generic-resource "gpu=1" --env DISPLAY --constraint 'node.role==manager' nvidia/cudagl:9.0-base-ubuntu16.04

it works, with the output verifying: Service converged. However, when testing the image with vkcube or lgsvl it gives errors:

Nov 21 19:33:20 ubuntu dockerd[52334]: time="2019-11-21T19:33:20.467968047+08:00" level=error msg="fatal task error" error="task: non-zero exit (1)" module=node/agent/taskmanager node.id=emzw1f9293rwdk97ki7gfqq1q service.id=spahe4h24fecq11ja3sp8t2cn task.id=uo7nk4a3ud201bo9ymmlpxzr3

to debug the non-zero exit (1) :

docker service  ls    #get the dead service-ID

docker [service] inspect  r14a68p6v1gu  # check 

docker ps -a  # find the dead container-ID 

docker logs  ff9a1b5ca0de   # check the log of the failure container

it gives: Cannot find a compatible Vulkan installable client driver (ICD)
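since the task error is truncated in docker service ps, the same digging can be shortened (service name vkcc from above; docker service logs requires a reasonably recent Docker):

```shell
# full, untruncated error message for every task of the service
docker service ps --no-trunc vkcc
# aggregate the containers' stdout/stderr across all nodes
docker service logs vkcc
```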

check the issue at gitlab/nvidia-images

fourth try

docker service create --name glx --generic-resource "gpu=1" --constraint 'node.role==manager'  --env DISPLAY --mount src="X11-unix",dst="/tmp/.X11-unix" --mount src="tmp",dst="/root/.Xauthority"  --network host  192.168.0.10:5000/glxgears 

BINGO !!!!! it does serve OpenGL/glxgears in service mode. However, there are a few issues:

  • constraint to manager node

  • require host network

the X11-unix and Xauthority mounts come from the X11 configuration, which needs more study. also the network parameter needs to be expanded to the ingress overlay

mostly, the vulkan image still can’t run, failing with the same error: Cannot find a compatible Vulkan installable client driver (ICD)

generic-resource support discussion

moby issue 33439: add support for swarmkit generic resources

  • how to advertise Generic Resources(republish generic resources)
  • how to request Generic Resources

nvidia-docker issue 141: support for swarm mode in Docker 1.12

docker issue 5416: Add Generic Resources

Generic resources

Generic resources are a way to select the kind of nodes your task can land on.

In a swarm cluster, nodes can advertise Generic resources as discrete values or as named values such as SSD=3 or GPU=UID1, GPU=UID2.

The Generic resources on a service allow you to request a number of these Generic resources advertised by swarm nodes and have your tasks land on nodes with enough available resources to satisfy your request.

If you request Named Generic resource(s), the resources selected are exposed in your container through the use of environment variables. E.g: DOCKER_RESOURCE_GPU=UID1,UID2

You can only set the generic_resources resources’ reservations field.

overstack: schedule a container with swarm using GPU memory as a constraint

label swarm nodes

$ docker node update --label-add <key>=<value> <node-id>
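as an alternative to generic resources, a plain node label can steer GPU work to the right nodes (the label key gpu=true and the node name worker-1 are made up for illustration):

```shell
# mark the node that physically has the GPU
docker node update --label-add gpu=true worker-1
# pin the service to labeled nodes
docker service create --name gputask \
  --constraint 'node.labels.gpu == true' \
  nvidia/cudagl:9.0-base-ubuntu16.04
```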

compose issue #6691

docker-nvidia issue #141

SwarmKit

swarmkit also support GenericResource, please check design doc

$ # Single resource
$ swarmctl service create --name nginx --image nginx:latest --generic-resources "banana=2"
$ # Multiple resources
$ swarmctl service create --name nginx --image nginx:latest --generic-resources "banana=2,apple=3"
./bin/swarmctl service create --device /dev/nvidia-uvm --device /dev/nvidiactl --device /dev/nvidia0 --bind /var/lib/nvidia-docker/volumes/nvidia_driver/367.35:/usr/local/nvidia --image nvidia/digits:4.0 --name digits

swarmkit add support devices option

refer

manage swarm service with config

UpCloud: how to configure Docker swarm

Docker compose v3 to swarm cluster

deploy docker compose services to swarm

docker deploy doc

alexei-led github

Docker ARG, ENV, .env – a complete guide

deploy lgsvl in docker swarm

Posted on 2019-11-19 |

Background

previously, vulkan in docker gave the way to run vulkan-based apps in Docker; this post is about how to deploy a GPU-based app in docker swarm. Docker swarm has the ability to deploy apps (services) with scalability.

Docker registry

Docker Registry acts as a local Docker Hub, so the nodes in the LAN can share images.

update docker daemon with insecure-registries

  • modify /etc/docker/daemon.json in worker node:

    "insecure-registries": ["192.168.0.10:5000"] 
    
  • systemctl restart docker

  • start registry service in manager node

    docker service create --name registry --publish published=5000,target=5000 registry:2

access docker registry on both manager node and worker node :

$ curl http://192.168.0.10:5000/v2/   #on manager node 
$ curl http://192.168.0.10:5000/v2/   #on worker node 

an insecure registry is only for testing; for production, it has to use a secure connection; check the official doc about deploying a registry server

upload images to this local registry hub

docker tag  stackdemo  192.168.0.10:5000/stackdemo
docker push  192.168.0.10:5000/stackdemo:latest
curl  http://192.168.0.10:5000/v2/_catalog

on worker run:

docker pull 192.168.0.10:5000/stackdemo 
docker run -p 8000:8000  192.168.0.10:5000/stackdemo  

the purpose of the local registry is to build a local docker image file server, shared within the cluster.

Deploy compose

docker-compose build

docker-compose build is used to build the images. docker-compose up runs the image and, if it doesn’t exist yet, builds it first. for the lgsvl app, running requires a few parameters, so directly running docker-compose up reports a no protocol error.

run vkcube in docker-compose

docker-compose v2 does support runtime: nvidia, by appending the following to /etc/docker/daemon.json:

"runtimes": {
    "nvidia": {
        "path": "nvidia-container-runtime",
        "runtimeArgs": []
    }
}

to run vkcube in compose by:

xhost +si:localuser:root
docker-compose up

the docker-compose.yml is :

version: '2.3'
services:
  vkcube-test:
    runtime: nvidia
    volumes:
      - /tmp/.X11-unix:/tmp/.X11-unix
    environment:
      - NVIDIA_VISIBLE_DEVICES=0
      - DISPLAY
    # image: nvidia/cuda:9.0-base
    image: vkcube
    # build: .

however, compose v3 currently doesn’t support the NVIDIA runtime, which is required to run stack deploy.

support v3 compose with nvidia runtime

as discussed at #issue: support for NVIDIA GPUs under docker compose:

services:
  my_app:
    deploy:
      resources:
        reservations:
          generic_resources:
            - discrete_resource_spec:
                kind: 'gpu'
                value: 2

update daemon.json with node-generic-resources; an official sample of compose resources can be reviewed. but so far, it only reports an error:

ERROR: The Compose file './docker-compose.yml' is invalid because:
services.nvidia-smi-test.deploy.resources.reservations value Additional properties are not allowed ('generic_resources' was unexpected`

deploy compose_V3 to swarm

docker compose v3 has two run options: if triggered by docker-compose up, it runs in standalone mode, where all services in the stack are hosted on the current node; if triggered through docker stack deploy and the current node is the manager of the swarm cluster, the services are hosted in the swarm. btw, docker compose v2 only supports standalone mode.

take an example from the official doc: deploy a stack to swarm:

docker service create --name registry --publish published=5000,target=5000 registry:2
docker-compose up -d
docker-compose ps
docker-compose down --volumes
docker-compose push #push to local registry
docker stack deploy
docker stack services stackdemo
docker stack rm stackdemo
docker service rm registry

after deploy stackdemo in swarm, check on both manager node and worker node:

curl http://192.168.0.13:8000
curl http://192.168.0.10:8000

docker service runtime

docker run can pass runtime env through -e in the CLI or an env-file, but docker service actually doesn’t have runtime support. docker compose v3 gives the possibility to configure the runtime env and deploy the service to clusters, but so far v3 compose doesn’t support runtime: nvidia, so it’s not helpful.

I tried to run vkcube, lgsvl with docker service:

docker service create --name vkcc --env NVIDIA_VISIBLE_DEVICES=0 --env DISPLAY=unix:$DISPLAY --mount src="/.X11-unix",dst="/tmp/.X11-unix"  vkcube
docker service create --name lgsvl  -p 8080:8080 --env NVIDIA_VISIBLE_DEVICES=0 --env DISPLAY=unix$DISPLAY --mount src="X11-unix",dst="/tmp/.X11-unix"  lgsvl

for vkcube, the service converged, but no GUI display; for lgsvl, the service failed.

Docker deploy

docker deploy is used to deploy a complete application stack to the swarm; it accepts the stack application as a compose file. docker deploy is experimental; it can be enabled in /etc/docker/daemon.json, check how to enable experimental features

a sample from jianshu docker-compose.yml:

version: "3"
services:
  nginx:
    image: nginx:alpine
    ports:
      - 80:80
    deploy:
      mode: replicated
      replicas: 4
  visualizer:
    image: dockersamples/visualizer
    ports:
      - "9001:8080"
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock"
    deploy:
      replicas: 1
      placement:
        constraints: [node.role == manager]
  portainer:
    image: portainer/portainer
    ports:
      - "9000:9000"
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock"
    deploy:
      replicas: 1
      placement:
        constraints: [node.role == manager]

a few commands to look into swarm services:

docker stack deploy -c docker-compose.yml stack-demo
docker stack services stack-demo
docker service inspect --pretty stack-demo # inspect service in the swarm
docker service ps <service-id> # check which nodes are running the service
docker ps #on the special node where the task is running, to see details about the container

summary

at this moment, it’s not possible to use a v3 compose.yml to support runtime: nvidia, so using a v3 compose.yml to deploy a gpu-based service in swarm is blocked. the native swarm way may be the right solution.

refer

run as an insecure registry

https configure for docker registry in LAN

a docker proxy for your LAN

alex: deploy compose(v3) to swarm

monitor docker swarm

docker swarm visulizer

swarm mode with docker service

inspect a service on the swarm

voting example

enable compose for nvidia-docker

nvidia-docker-compose

compose issue: to support nvidia under Docker compose

potential solution for composev3 with runtime

swarmkit: generic_resources

Docker ARG, ENV, .env – a complete guide

vulkan in docker to support new lgsvl

Posted on 2019-11-13 |

background

Docker is a great idea for packaging apps; this is the first time I play with docker swarm. lg-sim has updated to HDRP rendering, which has higher quality but also requires more GPU features, namely Vulkan. currently Vulkan is not supported by standard docker nor by nvidia-docker, which is deprecated since docker engine > 19.03.

there are nvidia images; the special one we are interested in is the vulkan docker image, and there is a related personal project based on cudagl:10.1, which is not supported by non-Tesla GPUs. our platform has only Quadro P2000 GPUs, where the supported CUDA is 9.0, so we need to rebuild the vulkan docker image based on CUDA 9.0. check the vulkan dockerfile; instead of using cudagl:10.1, change the base image to: FROM nvidia/cudagl:9.0-base-ubuntu16.04

after building the image, we can build the vulkan test samples. if there is no issue, load lg-sim into this vulkan-docker.

a few lines may help:

/usr/lib/nvidia-384/libGLX_nvidia.so.0
/usr/share/vulkan/icd.d
/proc/driver/nvidia/version

new lgsvl in docker

the previous lg-sim(2019.04) can be easily run in docker, as mentioned here.

the above vulkan-docker image is the base to host lgsvl (2019.09); additionally, add vulkan_pso_cache.bin to the docker. the benefit of hosting the lgsvl server in docker is accessing the webUI from the host or remotely, so the network should be configured to run as --net=host. if configured as a swarm overlay network, it should support the swarm cluster.
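putting this together, the run command looks roughly like the earlier glxgears/vkcube ones (registry address and image name taken from the previous posts; X11 mounts assumed):

```shell
docker run --runtime=nvidia -ti --rm --net=host \
  -e DISPLAY -v /tmp/.X11-unix:/tmp/.X11-unix \
  -v "$HOME/.Xauthority:/root/.Xauthority" \
  192.168.0.10:5000/lgsvl ./simulator
```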

a few related issue can be checked at lgsvl issues hub.

VOLUME in dockerfile

the following sample is from understand VOLUME instruction in Dockerfile

create a Dockerfile as:

FROM openjdk
VOLUME vol1 /vol2
CMD ["/bin/bash"]

docker build -t vol_test .
docker run --rm -it vol_test

check inside the container: vol1 and vol2 both exist in the running container.

bash-4.2# ls
bin dev home lib64 mnt proc run srv tmp var vol2
boot etc lib media opt root sbin sys usr vol1

also check in host terminal:

root@ubuntu:~# docker volume ls
DRIVER VOLUME NAME
local 0ffca0474fe0d2bf8911fba9cd6b5875e51abe172f6a4b3eb5fd8b784e59ee76
local 7c03d43aaa018a8fb031ef8ed809d30f025478ef6a64aa87b87b224b83901445

and check further:

root@ubuntu:/var/lib/docker/volumes# ls
0ffca0474fe0d2bf8911fba9cd6b5875e51abe172f6a4b3eb5fd8b784e59ee76 metadata.db
7c03d43aaa018a8fb031ef8ed809d30f025478ef6a64aa87b87b224b83901445

once we touch ass_file under /vol1 in the container, we find it immediately on the host machine under /var/lib/docker/volumes:

root@ubuntu:/var/lib/docker/volumes/0ffca0474fe0d2bf8911fba9cd6b5875e51abe172f6a4b3eb5fd8b784e59ee76/_data# ls -lt
total 0
-rw-r--r-- 1 root root 0 Nov 7 11:40 css_file
-rw-r--r-- 1 root root 0 Nov 7 11:40 ass_file

likewise, if a file is deleted from the host machine, it is equally deleted from the running container. The _data folder is also referred to as a mount point. Exit from the container and list the volumes on the host: they are gone. We used the --rm flag when running the container, and this option effectively wipes out not just the container on exit, but also the volumes.

sync localhost folder to container

by default, a Dockerfile cannot map to a host path when trying to bring files in from the host to the container during runtime. namely, the Dockerfile can only specify the destination of the volume. for example, we may expect to sync a localhost folder, e.g. attach_me, to the container by cd /path/to/dockerfile && docker run -v /attach_me -it vol_test. a new data volume named attach_me is created, located in the container just like /vol1 and /vol2, but it has nothing to do with the localhost folder.

while a trick can do the sync:

docker run -it -v $(pwd)/attach_me:/attach_me vol_test

Both sides of the : character expect an absolute path: the left side is an absolute path on the host machine, the right side an absolute path inside the container.

volumes in compose

this only works during compose build, and has nothing to do with the docker container.

copy folder from host to container

  • COPY in dockerfile

ERROR: Service ‘lg-run’ failed to build: COPY failed: stat /var/lib/docker/tmp/docker-builder322528355/home/wubantu/zj/simulator201909/build: no such file or directory

the solution is to keep the folder inside the Dockerfile’s current pwd (the build context); otherwise, Docker engine will look under /var/lib/docker/tmp.
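a minimal sketch (the folder name build/ and destination path are assumptions for illustration): with the source inside the build context, COPY resolves relative to it:

```dockerfile
FROM ubuntu:16.04
# COPY sources are resolved relative to the build context
# (the directory passed to `docker build`), never to absolute host paths
COPY build/ /opt/simulator/build/
```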

VOLUME summary

If you do not provide a volume in your run command, or compose file, the only option for docker is to create an anonymous volume. This is a local named volume with a long unique id for the name and no other indication for why it was created or what data it contains. If you override the volume, pointing to a named or host volume, your data will go there instead.

when VOLUME is used in a Dockerfile, it actually has nothing to do with any current host path; it generates something on the host machine under /var/lib/docker/volumes/, which is unreadable and managed by the Docker Engine. also don’t forget to use --rm, which deletes the attached volumes on the host when the container exits.

warning: VOLUME breaks things

understand docker-compose.yml

Understand and manage Docker container volumes

what is vulkan SDK

Graham blog

build cluster on Docker swarm

Posted on 2019-11-06 |

Docker micro services

key concepts in Docker

when deploying an application (lg-sim) to a swarm cluster as a service, the service is defined on a manager node, and the manager node dispatches units of work as tasks to worker nodes.

when creating a service, you specify which container image to use and which commands to execute inside running containers. for replicated services, the swarm manager distributes a specific number of replica tasks among the nodes based on the scale you set in the desired state. for global services, the swarm runs one task for the service on every available node in the cluster.
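the two service modes side by side (image and service names chosen arbitrarily for illustration):

```shell
# replicated: the scheduler places exactly 3 tasks somewhere in the cluster
docker service create --name web --mode replicated --replicas 3 httpd
# global: one task on every available node
docker service create --name agent --mode global httpd
```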

docker node

Docker swarm CLI commands

docker swarm init
docker swarm join
docker service create --name --env --workdir --user
docker service inspect
docker service ls
docker service rm
docker service scale
docker service ps
docker service update --args
docker node inspect
docker node update --label-add
docker node promote/demote
# run in worker
docker swarm leave
# run in manager
docker node rm worker-node
docker ps #get running container ID
docker exec -it containerID /bin/bash
docker run -it
docker-compose up build/run

delete unused Docker network

as the Docker networks may confuse external access to the local network interfaces, sometimes the docker networks need to be removed.

docker network ls
docker network disconnect -f {network} {endpoint-name}
docker network rm
docker stop $(docker ps -a -q)
docker rm $(docker ps -a -q)
docker volume prune
docker network prune

the above scripts delete the unused (non-active) docker networks; there may still be active docker-related networks, which can be deleted through:

1
2
3
4
5
6
7
8
9
10
11
12
13
sudo ip link del docker0
```
### access Docker service
Docker container has its own virutal IP(172.17.0.1) and port(2347), which allowed to access in the host machine; for externaly access, need to map the hostmachine IP to docker container, by `--publish-add`. the internal communication among docker nodes are configured by `advertise_addr` and `listen-addr`.
#### through IP externally
To publish a service's ports externally, use the `--publish <PUBLISHED-PORT>:<SERVICE-PORT>` flag. When a user or process connects to a service, any worker node running a service task may respond.
the examples below are taken from the [5mins series](https://www.cnblogs.com/CloudMan6/tag/Swarm/):

```shell
docker service create --name web_server --replicas=2 httpd
docker service ps web_server
```

access the service on the host machine only, through the Docker IP:

```shell
curl 172.17.0.1
docker service update --publish-add 8080:80 web_server
```

access the service externally:

```shell
curl http://hostmachineIP:8080
```
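The difference between a port reachable only on the host and a published one boils down to which address the listener binds. A minimal Python illustration (this is just the bind semantics, not Docker's routing mesh):

```python
import socket

def open_listener(host, port=0):
    """Bind a TCP listener; port 0 lets the OS pick a free port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((host, port))
    s.listen(1)
    return s

# like a container port that is NOT published: reachable only from this machine
local_only = open_listener("127.0.0.1")
# like a --publish'ed port: bound on all interfaces, reachable externally
published = open_listener("0.0.0.0")
print(local_only.getsockname()[0], published.getsockname()[0])
local_only.close()
published.close()
```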

#### configure websocket protocol
the lg-sim server and its pythonAPI client communicate through `websocket`, so it would be better if the service could be configured to publish over websocket.
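As background on the handshake behind `websocket`: per RFC 6455, the server answers the client's `Sec-WebSocket-Key` with a SHA-1/base64 digest over a fixed GUID. A stdlib-only sketch:

```python
import base64, hashlib

WS_MAGIC = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"  # fixed GUID from RFC 6455

def websocket_accept(sec_websocket_key: str) -> str:
    """Compute the Sec-WebSocket-Accept header value for a handshake response."""
    digest = hashlib.sha1((sec_websocket_key + WS_MAGIC).encode()).digest()
    return base64.b64encode(digest).decode()

# example key taken from RFC 6455 section 1.3
print(websocket_accept("dGhlIHNhbXBsZSBub25jZQ=="))  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```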
#### publish httpd server to swarm service

```shell
docker service create --name web_server --publish 880:80 --replicas=2 httpd
```

the container IP is the IP of the `docker0` network interface (e.g. 172.17.0.1), which can be checked through `ifconfig`. `80` is the default port used by httpd, which is mapped to port `880` on the host machine, so any of the following will work:

```shell
curl 172.17.0.1:880
curl localhost:880
curl 192.168.0.1:880
curl 192.168.0.13:880   # the other docker node
```

#### publish lg-sim into swarm service
the previous version (2019.04) of lg-sim didn't have a built-in HTTP server; since 2019.07 it ships with the `Nancy` HTTP server, which is a great step toward dockerizing the lg-sim server.
### manage data in Docker
`Volumes` are stored in a part of the host filesystem (`/var/lib/docker/volumes`) that is managed by Docker rather than by the host machine.
Volumes are the preferred way to [persist data in Docker](https://docs.docker.com/v17.09/engine/admin/volumes/#more-details-about-mount-types) containers and services. some use cases of volumes:
* once created, the volume still exists even after the container stops or is removed;
* multiple containers can mount the same volume simultaneously;
* storing data on a remote host;
* backing up, restoring, or migrating data from one Docker host to another.
#### RexRay
an even more robust way is to separate the volume manager from the storage-provider manager; [Rex-Ray](https://rexray.readthedocs.io/en/v0.3.3/) does exactly this.

```shell
docker service create --name web_s \
  --publish 8080:80 \
  --mount "type=volume,volume-driver=rexray,source=web_data,target=/usr/local/apache2/htdocs" \
  httpd

docker exec -it containerID /bin/bash
ls -ld /usr/local/apache2/htdocs
chown www-data:www-data /usr/local/apache2/htdocs
```

test visit:

```shell
curl http://192.168.0.1:8080
docker inspect containerID
```

`source` represents the name of the data volume; if empty, a new volume is created.
`target` represents where the data volume is mounted inside each container (`/usr/local/apache2/htdocs`).

with RexRay, data-volume update, scaling, and failover (when a node crashes, the data volume is not lost) are also taken care of.

refer

* 5mins in Docker
* Docker swarm in and out
* what is swarm advertise-addr
* can you run exec in a swarm
* execute a command within a docker swarm service

Linux network tool

Posted on 2019-11-05 |

Linux network commands

ip

ip command is used to edit and display the configuration of network interfaces, routing, and tunnels. On many Linux systems, it replaces the deprecated ifconfig command.

```shell
ip link del docker0                                 # delete a virtual network interface
ip addr add 192.168.0.1 dev eth1                    # assign an IP to a specific interface (eth1)
ip addr show                                        # check network interfaces
ip addr del 192.168.0.1 dev eth1
ip link set eth1 up                                 # or: down
ip route show
ip route add 192.168.0.1 via 10.10.20.0 dev eth0    # add a static route to 192.168.0.1
ip route del 192.168.0.1                            # remove a static route
ip route add default via 192.168.0.1                # add the default gateway
```
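For scripting, the oneline output of `ip -o -4 addr show` is easy to parse; a hedged sketch (the sample lines are simplified from iproute2's oneline format, and the parser is a hypothetical helper):

```python
import re

# simplified sample output of `ip -o -4 addr show` (one line per address)
SAMPLE = (
    "1: lo    inet 127.0.0.1/8 scope host lo\n"
    "2: eth0    inet 192.168.0.13/24 brd 192.168.0.255 scope global eth0\n"
)

def parse_ip_oneline(text):
    """Map interface name -> CIDR address from oneline `ip` output."""
    addrs = {}
    for line in text.splitlines():
        m = re.match(r"\d+:\s+(\S+)\s+inet\s+(\S+)", line)
        if m:
            addrs[m.group(1)] = m.group(2)
    return addrs

print(parse_ip_oneline(SAMPLE))  # {'lo': '127.0.0.1/8', 'eth0': '192.168.0.13/24'}
```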

netstat

netstat is used to display active sockets/ports for each protocol (TCP/IP)

```shell
netstat -lat   # listening and established TCP sockets
netstat -us    # UDP statistics
```

nmcli

nmcli is a command-line tool to control NetworkManager and report network status:

```shell
nmcli device status
nmcli connection show
```

route

```shell
route                                           # print the routing table (modern equivalent: ip route)
route add -net sample-net gw 192.168.0.1
route del -net link-local netmask 255.255.0.0   # delete a route
ip route flush cache                            # flush the routing cache
```

tracepath

tracepath traces the path to a network host, discovering the MTU along the path; a similar, more full-featured tool is traceroute.

```shell
tracepath 192.168.0.1
```
### networking service

```shell
systemctl restart networking
/etc/init.d/networking restart
# or stop NetworkManager
service NetworkManager stop
```

## network interface
located at `/etc/network/interfaces`.
`eno1` is the onboard Ethernet (wired) adapter. if the machine already has `eth1` in its config file, the second adapter will be named `eno1` rather than `eth2`.
[ifconfig](https://www.ibm.com/support/knowledgecenter/ssw_aix_71/i_commands/ifconfig.html) is used to set up network interfaces such as loopback and Ethernet: a software interface to networking hardware, either physical or virtual. a physical interface, such as `eth0`, is an Ethernet network card; virtual interfaces include `loopback`, `bridges`, `VLANs`, e.t.c.

```shell
ifconfig -a
ifconfig eth0                       # check a specific network interface
ifconfig eth0 192.168.0.1           # assign a static IP address to the interface
ifconfig eth0 netmask 255.255.0.0   # assign a netmask
ifconfig docker0 down               # or: up
```

`ifconfig` has largely been replaced by the `ip` command on modern systems.
### why enp4s0f2 instead of eth0
[change back to eth0](https://www.itzgeek.com/how-tos/mini-howtos/change-default-network-name-ens33-to-old-eth0-on-ubuntu-16-04.html)
```shell
lspci | grep -i "net"
dmesg | grep -i eth0
ip a
sudo vi /etc/default/grub
# set: GRUB_CMDLINE_LINUX="net.ifnames=0 biosdevname=0"
sudo update-grub
# then update /etc/network/interfaces:
#   auto eth0
#   iface eth0 inet static
sudo reboot
```

Unknown interface enp4s0f2

this happens because `/etc/network/interfaces` has an `auto enp4s0f2` line, which recreates this network interface whenever the networking service restarts.

ping hostname with docker0

there are usually multiple network interfaces (eth0, docker0, bridges, ...) on a host. when pinging a remote host while a Docker network (docker0) exists, the traffic may go through docker0 by default, which may not be the desired interface; `ping -I eth0 <host>` forces a specific interface.

build LAN cluster with office PCs

  • setup PC IPs
```
master node:
  IP address: 192.168.0.1
  netmask: 24
  Gateway: null
  DNS server: 10.3.101.101
worker node:
  IP address: 192.168.0.12
  netmask: 24
  Gateway: 192.168.0.1
  DNS: 10.255.18.3
```
* `ufw disable`
* update the `/etc/hosts` file:

```
192.168.0.1 master
192.168.0.12 worker
```

if the default hostname needs to be changed to `worker`, modify the `/etc/hostname` file and reboot.

* ping test:

```shell
ping -c 3 master
ping -c 3 worker
```
  • set up ssh access (optional)

```shell
sudo apt-get install openssh-server
ssh-keygen -t rsa
# authorize the customized public key
cat rsa_pub.key >> ~/.ssh/authorized_keys
```
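The `/etc/hosts` entries above can be generated from a single name-to-IP map; a tiny sketch (`hosts_entries` is a hypothetical helper, not part of any tool used here):

```python
def hosts_entries(cluster):
    """Render /etc/hosts lines for a name -> IP mapping."""
    return "\n".join(f"{ip} {name}" for name, ip in cluster.items())

cluster = {"master": "192.168.0.1", "worker": "192.168.0.12"}
print(hosts_entries(cluster))
# 192.168.0.1 master
# 192.168.0.12 worker
```

The rendered block can then be appended to `/etc/hosts` on every node of the LAN cluster.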

refer

* Ubuntu add static route
* 10 useful IP commands
© 2020 David Z.J. Lee